To use DiPALM, you need to first install the package. In R, type install.packages("DiPALM_1.0.tar.gz", repos=NULL, type="source")
. This is a one time process. For each new R session, you must first load the DiPALM package using the library
function.
library(DiPALM)
This vignette will also use additional R packages:
- edgeR (https://bioconductor.org/packages/release/bioc/html/edgeR.html)
- pheatmap (https://CRAN.R-project.org/package=pheatmap)
We will use data from Greenham et al., eLife 2017;6:e29655. The raw data is provided with the DiPALM package. Once the package is loaded, the example data can be loaded into your workspace.
data("exampleData")
cnts<-exampleData$rawCounts
The data object cnts
contains the raw count data for the drought and well watered samples that was aligned to the B. rapa R500 genome using Tophat2 and RSubread. Columns represent independent time points when leaf tissue was collected during a 2-day time-course starting 1h after dawn and every 4h following for 12 time points. Biological replicates are indicated by S1
and S2
, drought samples are D1-D12
and well watered are W1-W12
. Row names are the gene identifiers. We will first rename the columns with better descriptors.
colnms<-colnames(cnts)
colnms<-gsub("S1","R1_",colnms)
colnms<-gsub("S2","R2_",colnms)
colnms<-gsub("D","Drought_",colnms)
colnms<-gsub("W","Watered_",colnms)
colnames(cnts)<-colnms
We first normalize the data using the edgeR package with length normalization. This requires an annotation file that defines feature lengths for FPKM. Included in the DiPALM data is the geneAnnotations
object that is a simplified version of a gff file in a format called SAF. This format is used by the subread program to count reads. We will first rename this object to safMat for simplicity. This object gives us feature lengths. Additional information can be found in the subread, edgeR and limma documentation.
require(edgeR)
safMat<-exampleData$geneAnnotations
geneLen<-setNames(abs(safMat$End-safMat$Start),nm = safMat$GeneID)
cntsDge<- DGEList(counts = cnts)
cntsDge<- calcNormFactors(cntsDge)
DroughtLog<- rpkm(cntsDge, log=T, gene.length = geneLen, prior.count=0)
To filter out genes that are not expressed we calculate an average expression value across all observed samples. This yields a bimodal distribution that is typical of gene expression data, with expressed genes in the higher-abundance population.
geneMeans<-apply(DroughtLog,1,function(x) mean(x,na.rm = T))
hist(geneMeans,col="skyblue", breaks = seq(-12,16,0.25))
abline(v=0,col="red",lwd=3,lty=2)
We plot the distribution to determine the cutoff and remove genes that do not have at least 1 sample with log2(fpkm) > 0.
Droughtfiltered<-DroughtLog[which(geneMeans>0),]
We still have ‘NaN’ values where we had zero counts in the original matrix. For the sake of future computations, we substitute these with (minimal non-zero value of the matrix) - 1.
minVal<-min(Droughtfiltered[!is.na(Droughtfiltered)])-1
Droughtfiltered[is.na(Droughtfiltered)]<-minVal
Next we separate the data into the 4 “samples” in order to run our comparisons: (Drought.R1, Drought.R2, Watered.R1, Watered.R2). Each of these “samples” is actually comprised of a full time-course. After this, our data will be in a comparable state to data in any differential expression analysis, except instead of single expression values for each sample/replicate, we have a full time-course for each. From this step on, a “sample” refers to a whole time-course taken for one specific condition or replicate. This function assumes that your sample names are structured the same as the example dataset (‘R1_Drought_10’), if not, you can manually insert the sample names for the tnms
object. The tnms
object converts the sample names to ‘Drought.R1’, ‘Drought.R2’, ‘Watered.R1’ and ‘Watered.R2’ which becomes the INDEX terms for the tapply function to separate the time-courses.
tmp<-Droughtfiltered
spNms<-strsplit(x = colnames(Droughtfiltered), split = "_")
tnms<-sapply(spNms,function(x) paste(x[c(2,1)],collapse = "."))
TCsDrought<-tapply(1:length(tnms),INDEX = tnms, function(x) tmp[,x])
Any linear model or correlation calculation relies on the presence of variance in all the variables involved. If any of your variables are constant (for example, all zeros), it is impossible to determine a linear association between it and any other variable. In our data, we have some rare instances where the gene is detected in one condition but not in the other. We are unable to carry out the analysis for these genes and they will result in errors later on so we remove them in this step by requiring each gene to have > 0 variation in each of our 4 different samples (time-courses). We end up removing 4 genes.
varFiltered<-lapply(TCsDrought,function(x) apply(x,1,function(y) var(y)>0))
varFiltered<-do.call(cbind,varFiltered)
varFiltered<-apply(varFiltered,1,function(x) all(x))
varFiltered<-names(varFiltered)[which(varFiltered)]
TCsDrought<-lapply(TCsDrought,function(x) x[varFiltered,])
Before merging the datasets for the network analysis we order the data points of each time-course by time point (they get re-arranged during raw data processing).
TCsDrought<-lapply(TCsDrought,function(x) x[,order(as.numeric(sapply(strsplit(colnames(x),split = "_"),function(y) y[3])))])
It is a good check to print the order of the data points before continuing. From here, each of the 4 time-courses must match up.
sapply(TCsDrought,colnames)
Each row in the above matrix should contain 4 names with the same time point number (i.e. row 1 names should all end with “_1" etc.)
Next we merge the data into a large matrix.
TCsAll<-do.call(rbind,TCsDrought)
We are now ready to build a co-expression network.
To characterize the expression pattern for each gene we run the co-expression network analysis in WGCNA to cluster the genes with similar patterns. This step can be computationally intensive depending on the size of the matrix. The parameters for this step should be optimized for each dataset; in particular, choosing the soft-thresholding power
and minModuleSize
will influence the quality of the network modules. We recommend first running the pickSoftThreshold
function in WGCNA to select the appropriate power for each dataset. We also recommend users read the help material provided with WGCNA to select the appropriate values for the other parameters included with blockwiseModules
. For the drought data, run the following.
require(WGCNA)
BlockModsDrought<- blockwiseModules(datExpr = t(TCsAll), power = 10, networkType = "signed", corType="bicor", TOMType="signed", minModuleSize=100, mergeCutHeight=0.2, deepSplit=1, pamRespectsDendro = F, nThreads = 4, verbose=3)
We recommend saving the network at this stage.
save(BlockModsDrought, file="BlockModsDrought.RData")
We next extract the module eigengenes for each module.
MEs<-BlockModsDrought[[3]]
To obtain the number of modules in the network use dim(MEs)
. MEs is a matrix where each row is a time point and each column is the eigengene for that module. To associate the expression pattern of each gene with the module eigengenes we calculate the module membership or eigengene-based connectivity (kME) as defined by WGCNA. Briefly, the kME is the correlation between the gene expression profile and eigengene.
kMEsList<-BuildModMembership(MeMat = MEs, TCsLst = TCsDrought)
kMEsList is a list containing one element for every module eigengene. The elements are matrices with genes as rows and columns for datasets, in this example there are 4 (2 reps of Well-Watered and 2 reps of Drought). The values in the matrix are the kMEs for each gene against that module eigengene. We have now reduced the data to one linear model test per module while maintaining the expression pattern information by representing the entire time-course as a relationship to the module eigengene.
The kME is a good measure for expression pattern and will be used in subsequent steps to test for changes in pattern shape as it pertains to phase. To identify changes in overall expression level throughout the time-course we calculate the median expression across time points and compare these values in our limma-based analysis.
Med<-sapply(TCsDrought,function(x) apply(x,1,function(y) median(y,na.rm = T)))
Before setting up the linear model tests, we generate a permuted data matrix that will be used to estimate the null distribution of the test statistic and set the significance threshold for calling differential patterns. To do this, we go back to the pre-merged data and randomly scramble the row order in each sample resulting in comparisons between samples of 4 random genes.
TCsDroughtPerm<-lapply(TCsDrought,function(x) x[sample(1:nrow(x),nrow(x),replace = T),])
Calculate kME and kMed scores from the permuted data.
kMEsPerm<-BuildModMembership(MeMat = MEs, TCsLst = TCsDroughtPerm)
MedPerm<-sapply(TCsDroughtPerm,function(x) apply(x,1,function(y) median(y,na.rm = T)))
We now set up our linear model design matrix for use with the limma package. The main difference from the limma method is that we have replaced expression values for each factor with an expression time-course using our module eigengene values. The contrasts and model parameters are the same.
Treat<-as.factor(c("Drought","Drought","Watered","Watered"))
design<-model.matrix(~0+Treat)
colnames(design)<-levels(Treat)
contr<-"Drought-Watered"
Run the limma models by looping through each module eigengene resulting in a t-score for every gene and eigengene.
LimmaModskMEs<-lapply(kMEsList, function(x) BuildLimmaLM(dataMat = x, designMat = design, contrastStr = contr))
LimmaModsMed<-BuildLimmaLM(dataMat = Med, designMat = design, contrastStr = contr)
If the median is the same for a given contrast there will be no measured variation and a warning will be produced during the LimmaModsMed call saying “Zero sample variances detected, have been offset away from zero”. Limma will randomly inject a tiny amount of variance into these values that will allow the program to run but will never result in any significant contrast. If you see this warning, it’s OK to ignore it.
For subsequent steps the t-scores from the limma output will be used. To reduce the ram space the t-scores are pulled out and the memory is cleared using gc
.
LimmaModskMEs<-do.call(cbind,lapply(LimmaModskMEs,function(x) x$t))
LimmaModsMed<-LimmaModsMed$t
gc()
Repeat the limma tests and t-score extraction on the permuted data.
LimmaModskMEsPerm<-lapply(kMEsPerm, function(x) BuildLimmaLM(dataMat = x, designMat = design, contrastStr = contr))
LimmaModsMedPerm<-BuildLimmaLM(dataMat = MedPerm, designMat = design, contrastStr = contr)
LimmaModskMEsPerm<-do.call(cbind,lapply(LimmaModskMEsPerm,function(x) x$t))
LimmaModsMedPerm<-LimmaModsMedPerm$t
gc()
Depending on the direction of the contrasts and the eigengene profile, the t-scores will have positive and negative values that are arbitrary. The sum of the absolute value is used to remove the sign. For the kMed (expression difference) test, we only have one median value so no summation is required.
TestSumskMEs<-apply(LimmaModskMEs,1, function(x) sum(abs(x),na.rm = T))
TestSumsMed<-abs(LimmaModsMed[,1])
Repeat for the permuted results.
PermSumskMEs<-apply(LimmaModskMEsPerm,1, function(x) sum(abs(x),na.rm = T))
PermSumsMed<-abs(LimmaModsMedPerm[,1])
To determine an appropriate cutoff we plot the test vs. permuted distributions.
ggPlotMultiDensities(denslist = list(Test=TestSumskMEs,Permuted=PermSumskMEs), main = "Pattern Change Scores", xlab = "Differential Pattern Score",lwidth = 1)
ggPlotMultiDensities(denslist = list(Test=TestSumsMed,Permuted=PermSumsMed), main = "Expression Change Scores", xlab = "Differential Expression Score",lwidth = 1)
As you can see from these distributions, the major source of transcript level variation in this dataset is a pattern change rather than overall expression level.
To determine a significance cutoff, we calculate a pValue using the individual values from each test sum and permuted test sum with an FDR correction.
AdjkMEs<-sapply(TestSumskMEs,function(x) AdjustPvalue(tVal = x, tVec = TestSumskMEs, pVec = PermSumskMEs))
AdjMed<-sapply(TestSumsMed,function(x) AdjustPvalue(tVal = x, tVec = TestSumsMed, pVec = PermSumsMed))
Pull out the genes at a significance cutoff of 0.01.
SigkMEs<-AdjkMEs[which(AdjkMEs<0.01)]
SigMed<-AdjMed[which(AdjMed<0.01)]
Plot all 4 time-courses for a single gene.
topgene<-"BraA07g20790R"
PlotTCs(TClst = TCsDrought,tgene = topgene, scale = T, xlab="ZT Time (hours)", xAxsLabs = c(seq(1,23,4),seq(1,23,4)), ledgeX="topleft",tcols = c("red","red","blue","blue"), tltys = c(1,2,1,2))
The pattern analysis has thousands of genes. We could look at them one-by-one but that would probably not be the most efficient use of time. Alternatively, we can cluster them based on the pattern of each time-course relative to all of our eigengene patterns. We have already calculated this and used it to call differential patterning. For this step, we will go back to our limma results matrix that stores the t-statistics for each comparison to the eigenenes. This is conveniently already a matrix in the exact form we need. We pull out the genes with differential patterns, cluster using a pearson correlation as the distance and then plot a heatmap of the expression values ordered by this clustering.
LimmaModskMEsSig<-LimmaModskMEs[names(SigkMEs),]
patternCor<-cor(t(LimmaModskMEsSig))
patternTree<-hclust(as.dist(1-patternCor),method = "complete")
We will also use the mean expression value of the replicates for each time point to simplify the plot. First, combine all the separate time-courses into one matrix.
expressionMat<-do.call(cbind,TCsDrought)
To average the replicates, we first define an index for the columns that need to be averaged together.
eMatCols<-colnames(expressionMat)
eMatCols<-gsub("^R[[:digit:]]\\_","",eMatCols)
Calculate the averages using tapply
(this returns a list).
expressionAvg<-tapply(colnames(expressionMat),INDEX = eMatCols, function(x) rowSums(as.data.frame(expressionMat[,x]),na.rm = T)/length(x))
This next line looks very complicated but its simply combining the list into a matrix. Unfortunately, tapply
returns the samples (columns) out of order so they must be re-ordered using the treatment label and then the number at the end of the name.
orderVec<-strsplit(names(expressionAvg),split = "_")
orderVec<-lapply(1:2,function(x) sapply(orderVec,function(y) y[x]))
expressionAvg<-do.call(cbind,expressionAvg[order(orderVec[[1]],as.numeric(orderVec[[2]]))])
For this heatmap, we will define our own color palette.
colFunc<-colorRampPalette(colors = c("darkblue","blue","lightblue","white","orange"))
Finally, plot it.
require(pheatmap)
pheatmap(mat = expressionAvg[names(SigkMEs),], cluster_rows = patternTree, cluster_cols = F,scale = "row", color = colFunc(25), gaps_col = 12, show_rownames = F)
This heatmap is pretty and you can tell that the genes are grouping into decent clusters based on pattern changes in drought. Also, you can see that the drought side is different from the watered side, but its difficult to make out anything beyond that. It would be more informative if we could define clusters and then plot all 4 time-courses on top of each other (as we did with the single gene above).
We can use the dynamicTreeCut library (that comes with WGCNA) to extract clusters from our patternTree that we just made. In this case, the minClusterSize
is the variable with the most control over the number of clusters.
patternClusters<-cutreeDynamic(dendro = patternTree, minClusterSize = 100, distM = 1-patternCor, deepSplit = 1)
names(patternClusters)<-patternTree$labels
table(patternClusters)
We now have 20 or so clusters. We can make a combined plot of all the genes in any cluster. First we make it a little easier by converting the clusters object into a list of gene groups.
patternClusters<-tapply(X = names(patternClusters), INDEX = patternClusters, function(x) x)
Now we can plot one of our clusters using a DiPALM plotting function.
clustScores<-sapply(patternClusters,function(x) mean(TestSumskMEs[x]))
topClust<-which.max(clustScores)
PlotTCsRibbon(TClst = TCsDrought, tgenes = patternClusters[[topClust]], main="Pattern-Change Cluster",xAxsLabs = c(seq(1,23,4),seq(1,23,4)), xlab="ZT Time (hours)", scale = T, tcols = c("red","red","blue","blue"), tltys = c(1,2,1,2))
Now you can really see the drought responsive effect for this cluster of genes. We see an increase in amplitude and an earlier phase around ZT21.
As shown in the permuted distributions, the differential expression analysis has a lot less significant genes than the pattern analysis. This may not be the case for every data set. We can do a similar clustering of these genes using the expression data matrix which gives us a limma-based t-score for every gene at every time point for the comparison in question.
ExpMatrixMedSig<-expressionMat[names(SigMed),]
expCor<-cor(t(ExpMatrixMedSig))
expTree<-hclust(as.dist(1-expCor),method = "complete")
Similar plots can be generated using the functions below.
colFunc<-colorRampPalette(colors = c("darkblue","white","orange"))
pheatmap(mat = expressionAvg[names(SigMed),], cluster_rows = expTree, cluster_cols = F,scale = "row", color = colFunc(25), gaps_col = 12, show_rownames = F)
expClusters<-cutreeDynamic(dendro = expTree, minClusterSize = 5, distM = 1-expCor, deepSplit = 1)
names(expClusters)<-expTree$labels
table(expClusters)
expClusters<-tapply(X = names(expClusters), INDEX = expClusters, function(x) x)
clustScores<-sapply(expClusters,function(x) mean(TestSumsMed[x]))
topClust<-which.max(clustScores)
PlotTCsRibbon(TClst = TCsDrought, tgenes = expClusters[[topClust]], main="Expression-Change Cluster",xAxsLabs = c(seq(1,23,4),seq(1,23,4)), xlab="ZT Time (hours)", scale = T, tcols = c("red","red","blue","blue"), tltys = c(1,2,1,2))