Biomarkers are widely used in pharmaceutical industry for drug discovery and development at various stages, from preclinical animal study to phase I- III and post market clinical trials, and can be used for target identification, diseased diagnostics, patient stratification, treatment prediction and etc. High-throughput assay technology enables the collection of various types of biomarkers, such as gene expression and various omics biomarkers. In high-throughput assays, a large number of biomarkers are measured in a single experiment, but subject to substantial known or unknown variability. Hence biomarkers from high-throughput assays yield two major characteristics, including high dimensionality of the data and relative large data variability. Due to the two characteristics, it is critical and challenging to visualize the biomarker data and corresponding statistical results to ensure the data quality and the reliability of downstream statistical analysis results. Therefore we developed this R package as a visualization tool for generating the commonly used plots in analyzing biomarkers from high-throughput assays, including data quality control, individual biomarker analysis and multivariate analysis. The tool also included an analysis pipeline for analyzing biomarkers in the setting of two groups comparison with the flexibility to extend to customized or project specific analysis.
The R package statVisual provide novel solutions to the users by utilizing many powerful R base functions and R packages. For example, the function hist in the R package graphics can draw the histogram for a set of observations. However, to visualize histograms for two or more groups of observations in one figure, the users need to write their own code. The R package statVisual provides the function Hist to help the users to obtain such figure in one command. This vignette illustrates the usages of the functions provided by the R package statVisual.
R (>=3.5.0) is required to run the R package statVisual properly.
The following R packages are required to be installed before installing and running statVisual:
# packages in Bioconductor
library(Biobase) # base package for Bioconductor
library(limma) # linear models for continuous omics data
library(pvca) # principal variance component analysis
# packages in CRAN
library(dplyr) # data manipulation and pipe operation
library(factoextra) # extract and visualize results of multivariate data analysis
library(forestplot) # forest plot
library(gbm) # generalized boosted regression models
library(GGally) # extension to 'ggplot2'
library(ggdendro) # dendrogram for data clustering
library(ggfortify) # data visualization tools for statistical analysis results
library(ggplot2) # create graphics based on "The Grammer of Graphics"
library(ggrepel) # tidy text display in ggplot
library(glmnet) # cross validation plot for glmnet
library(grDevices) # R graphics devices and support for colors and fonts
library(gridExtra) # Grid graphics
library(knitr) # dynamic report generation
library(methods) # formal methods and classes
library(pROC) # display and analyze ROC curves
library(randomForest) # Random forest variable importance
library(reshape2) # flexibly reshape data
library(rmarkdown) # dynamic documents for R
library(rpart.plot) # plots for recursive partitioning for classification, regression and survival trees
library(tibble) # simple data frames
library(stats) # basic statistical functions
To load statVisual package, please type the following R statement:
library(statVisual)
To check the information about the statVisual package, please type the following R statement:
library(help = statVisual)
To find the usage of a function (e.g., Hist) in statVisual, please use the help function or use ?. For example,
help(Hist)
?Hist
Below is a list of currently available functions in statVisual for plotting.
The simulated data set diffCorDat contains expression levels of 2 gene probes for 50 cases and 50 controls. The expression levels of probe1 are generated from \(N(0, 1)\). The expression levels of probe2 for controls are also generated from \(N(0, 1)\). The expression levels of probe 2 for cases are generated from the formula \[\begin{equation} probe2_{i} = -probe1_{i} + e_i, i=1, \ldots, nCases, \end{equation}\] where \(e_i\sim N(0, 0.3^2)\).
That is, the expression levels of probe 1 and probe 2 are negatively correlated in cases, but not correlated in controls.
To load diffCorDat, we can use the following R code:
data(diffCorDat)
print(dim(diffCorDat))
#> [1] 100 3
print(diffCorDat[1:2,])
#> probe1 probe2 grp
#> 1 0.1567038 0.2438042 cases
#> 2 1.3738112 -1.5249113 cases
The dataset esSim was generated based on the R code in the manual of the function of the R Bioconductor package . There are 100 probes and 20 samples (10 controls and 10 cases). The first 3 probes are over-expressed in cases. The 4-th and 5-th probes are under-expressed in cases. The remaining 95 probes are non-differentially expressed between cases and controls. Expression levels for 100 probes were first generated from normal distribution with mean 0 and standard deviation varying between probes (\(sd=0.3\sqrt{4/\chi^2_4}\)). For the 3 OE probes, we add 2 to the expression levels of the 10 cases. For the 2 UE probes, we subtract 2 from the expression levels of the 10 cases.
To load esSim, we can use the following R code:
data(esSim)
print(dim(esSim))
#> Features Samples
#> 100 20
print(esSim)
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 100 features, 20 samples
#> element names: exprs
#> protocolData: none
#> phenoData
#> sampleNames: subj1 subj2 ... subj20 (20 total)
#> varLabels: sid grp
#> varMetadata: labelDescription
#> featureData
#> featureNames: probe1 probe2 ... probe100 (100 total)
#> fvarLabels: probeid gene chr
#> fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
genoSim: A simulated genotype dataset
genoSim is an ExpressionSet object containing genotype data of 10 SNPs for 100 subjects (50 cases and 50 controls). Eight of SNPs have same minor allele frequency (\(MAF=0.2\)) between cases and controls. The other 2 SNPs have the different MAFs between cases and controls (\(MAF_{cases}=0.4\) and \(MAF_{controls}=0.2\)).
We assume Hardy-Weinberg Equilibrium. That is, the genotype for wild-type homozygote is \((1-MAF)^2\); the genotype for heterozygote is \(2*MAF*(1-MAF)\); and the genotype for mutant homozygote is \(MAF^2\).
The phenotype of the ExpressionSet object contains two variables: subject id (\(sid\)) and case-control status (\(grp\)).
The feature data contains two variables: snp id (\(snp\)) and SNP significance status (\(memSNPs\)).
The dataset longDat is generated from the following mixed effects model for repeated measures: \[\begin{equation} y_{ij}=\beta_{0i}+\beta_1 t_{j} + \beta_2 grp_{2i} + \beta_3 grp_{3i} + \beta_4 \times\left(t_{j}\times grp_{2i}\right) + \beta_5 \times\left(t_{j}\times grp_{3i}\right) +\epsilon_{ij}, \end{equation}\] where \(y_{ij}\) is the outcome value for the \(i\)-th subject measured at \(j\)-th time point \(t_{j}\), \(grp_{2i}\) is a dummy variable indicating if the \(i\)-th subject is from group 2, \(grp_{3i}\) is a dummy variable indicating if the \(i\)-th subject is from group 3, \(\beta_{0i}\sim N\left(\beta_0, \sigma_b^2\right)\), \(\epsilon_{ij}\sim N\left(0, \sigma_e^2\right)\), \(i=1,\ldots, n, j=1, \ldots, m\), \(n\) is the number of subjects, and \(m\) is the number of time points.
When \(t_j=0\), the expected outcome value is \[\begin{equation} E\left(y_{ij}\right)=\beta_0+\beta_2 dose_{2i} + \beta_3 dose_{3i}. \end{equation}\] Hence, we have at baseline \[\begin{equation} E\left(y_{ij}\right)=\beta_0,\; \mbox{for dose 1 group},\\ E\left(y_{ij}\right)=\beta_0 + \beta_2,\; \mbox{for dose 2 group},\\ E\left(y_{ij}\right)=\beta_0 + \beta_3,\; \mbox{for dose 3 group}. \end{equation}\]
For dose 1 group, the expected outcome values across time is \[\begin{equation} E\left(y_{ij}\right)=\beta_0+\beta_1 t_{j}. \end{equation}\]
We also can get the expected difference of outcome values between dose 2 group and dose 1 group, between dose 3 group and dose 1 group, and between dose 3 group and dose 2 group: \[\begin{equation} E\left(y_{ij} - y_{i'j}\right) =\beta_2+\beta_4 t_{j},\;\mbox{for subject $i$ in dose 2 group and subject $i'$ in dose 1 group}, \end{equation}\]
\[\begin{equation} E\left(y_{kj} - y_{i'j}\right) =\beta_3+\beta_5 t_{j},\;\mbox{for subject $k$ in dose 3 group and subject $i'$ in dose 1 group}, \end{equation}\]
\[\begin{equation} E\left(y_{kj} - y_{ij}\right) =\left(\beta_3 -\beta_2\right)+\left(\beta_5-\beta_4\right) t_{j},\;\mbox{for subject $k$ in dose 3 group and subject $i$ in dose 2 group}. \end{equation}\]
We set \(n=90\), \(m=6\), \(\beta_0=5\), \(\beta_1=0\), \(\beta_2=0\), \(\beta_3=0\), \(\beta_4=2\), \(\beta_5=-2\), \(\sigma_e=1\), \(\sigma_b=0.5\), and \(t_{j}=j\), \(j=0, \ldots, m-1\).
That is, the trajectories for dose 1 group are horizontal with mean intercept at \(5\), the trajectories for dose 2 group are linearly increasing with slope \(2\) and mean intercept \(5\), and the trajectories for dose 3 group are linearly decreasing with slope \(-2\) and mean intercept \(5\).
To load longDat, we can use the following R code:
data(longDat)
print(dim(longDat))
#> [1] 540 4
print(longDat[1:2,])
#> sid time y grp
#> 1 1 time0 4.539033 grp1
#> 2 1 time1 5.738715 grp1
A common task in statistical comparison is to compare the mean values among groups. The reason to comparing the summary statistics (means) is to simplify the problem of comparing two distributions since it is hard to numerically compare two distributions. However, we can easily compare two distributions by visualizing the empirical distributions (e.g., histograms).
To compare histograms across groups, we can use the function Hist:
# expression data
dat = exprs(esSim)
print(dim(dat))
#> [1] 100 20
print(dat[1:2,])
#> subj1 subj2 subj3 subj4 subj5 subj6 subj7 subj8
#> probe1 1.462929 1.596203 2.252267 1.974616 1.842606 1.677204 2.242665 1.705923
#> probe2 1.574342 2.232537 1.846260 1.980999 2.483777 2.048599 1.786081 1.934728
#> subj9 subj10 subj11 subj12 subj13 subj14 subj15
#> probe1 1.758028 1.919597 0.07477602 0.3054811 0.12356526 0.3221119 -0.1346480
#> probe2 1.740329 1.942189 0.08308643 -0.1012774 -0.06669481 0.5413783 0.0245575
#> subj16 subj17 subj18 subj19 subj20
#> probe1 -0.03845011 0.3504460 0.4173905 -0.1438801 0.3335083
#> probe2 -0.03849619 -0.2536194 0.1420066 -0.1072697 -0.1750033
# phenotype data
pDat = pData(esSim)
print(dim(pDat))
#> [1] 20 2
print(pDat[1:2,])
#> sid grp
#> subj1 subj1 1
#> subj2 subj2 1
# feature data
fDat = fData(esSim)
print(dim(fDat))
#> [1] 100 3
print(fDat[1:2,])
#> probeid gene chr
#> probe1 probe1 fakeGene1 1
#> probe2 probe2 fakeGene2 1
# choose the first probe which is over-expressed in cases
pDat$probe1 = dat[1,]
# check histograms of probe 1 expression in cases and controls
pDat$grp=factor(pDat$grp)
print(table(pDat$grp, useNA = "ifany"))
#>
#> 0 1
#> 10 10
Hist(
data = pDat,
y = 'probe1',
group = 'grp')
We also can use the wrapper function statVisual:
statVisual(type = 'Hist',
data = pDat,
y = 'probe1',
group = 'grp')
We can compare two distribution by comparing the estimated density functions.
The function Den is used to visualize the differences of densities across groups.
Den(
data = pDat,
y = 'probe1',
group = 'grp')
We can also use the wrapper function statVisual:
statVisual(type = 'Den',
data = pDat,
y = 'probe1',
group = 'grp')
The correlation is an important statistic to evaluate the linear relationship between two continuous variables. To check the linearity and to visualize the strength of the linear relationship, we can draw scatter plot. Some time, it is of interest to compare the correlations among groups. The function XYscatter can help the comparison by display the scatter plots across groups in one figure:
For example, to check if the relationship between Sepal length vs. Sepal width is the same across different species, we can use the R code:
XYscatter(
data = diffCorDat,
x = 'probe1',
y = 'probe2',
group = 'grp',
title = 'Scatter Plot: probe1 vs probe2')
We can also use the wrapper function statVisual:
statVisual(type = 'XYscatter',
data = diffCorDat,
x = 'probe1',
y = 'probe2',
group = 'grp',
title = 'Scatter Plot: probe1 vs probe2')
For two categorical variables, we can use the function stackedBarPlot to show their association.
For example, in the ExpressionSet object genoSim, there are simulated genotypes of 10 SNPs for 50 cases and 50 control. If we would like to know if the pattern of the genotypes of the SNP 1 in cases is the same as that in controls, we can draw bar plots.
data(genoSim)
pDat = pData(genoSim)
geno = exprs(genoSim)
pDat$snp1 = geno[1,]
print(table(pDat$snp1, pDat$grp, useNA="ifany"))
#>
#> 0 1
#> 0 37 19
#> 1 11 22
#> 2 2 9
stackedBarPlot(dat = pDat,
catVar = "snp1",
group = "grp",
xlab = "snp1",
ylab = "Count",
group.lab = "grp",
title = "Stacked barplots of counts for SNP1",
catVarLevel = NULL)
We can also use the wrapper function statVisual:
statVisual(type = 'stackedBarPlot',
dat = pDat,
catVar = "snp1",
group = "grp",
xlab = "snp1",
ylab = "Count",
group.lab = "grp",
title = "Stacked barplots of counts for SNP1",
catVarLevel = NULL)
Note that the input parameter catVarLevel can be used to change the order of the values of catVar shown in x-axis. For example,
statVisual(type = 'stackedBarPlot',
dat = pDat,
catVar = "snp1",
group = "grp",
xlab = "snp1",
ylab = "Count",
group.lab = "grp",
title = "Stacked barplots of counts for SNP1",
catVarLevel = c(2, 0, 1))
Some time, we would like to compare two outcomes with different scales across groups in one figure using error bar plots.
The function BiAxisErrBar can do this task. Each bar plot displays mean standard error.
library(tidyverse)
library(ggplot2)
print(head(mtcars))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
print(table(mtcars$gear, useNA="ifany"))
#>
#> 3 4 5
#> 15 12 5
BiAxisErrBar(
dat = mtcars,
group = "gear",
y.left = "mpg",
y.right = "wt")
We can also use the wrapper function statVisual:
statVisual(type = "BiAxisErrBar",
dat= mtcars,
group = "gear",
y.left = "mpg",
y.right = "wt")
In clinical trial, it is common to collect data at multiple time points. Therefore, it is natural to compare groups based on the trajectories of individual subjects across time. The function LinePlot can do this task:
LinePlot(
data = longDat,
x = 'time',
y = 'y',
sid = 'sid',
group = 'grp')
We also can use the wrapper function statVisual:
statVisual(type = "LinePlot",
data = longDat,
x = 'time',
y = 'y',
sid = 'sid',
group = 'grp')
If there are many individuals in a longitudinal dataset, the trajectory plot might look messy. In this case, we develop the function Box to compare groups using boxplots at each time point. In addition, line segments are used to connect the mean/median of each boxplot of the same group across time to show the differences between the mean trajectories. Note that this function is suitable for the scenarios where observations of all subjects are measured at a few fixed time points.
library(dplyr)
Box(
data = longDat,
x = 'time',
y = 'y',
group = 'grp',
title = "Boxplots across time")
Or we can use the wrapper function statVisual:
statVisual(type = 'Box',
data = longDat,
x = 'time',
y = 'y',
group = 'grp',
title = "Boxplots across time")