Dimension reduction in gjam

James S. Clark, Daniel Taylor-Rodriguez, Duke University

2016-10-04

citations:

Clark, J.S., D. Nemergut, B. Seyednasrollah, P. Turner, and S. Zhang. 2016. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data, Ecological Monographs, in press.

Taylor-Rodriguez, D., K. Kaufeld, E. Schliep, J. S. Clark, and A. Gelfand, 2016. Joint Species distribution modeling: dimension reduction using Dirichlet processes, Bayesian Analysis, in press.

files are found here

Overview

Microbiome, genetic, and hyperspectral satellite data are examples of observations characterized by a large number of response variables \(S\) (e.g., species); we refer to such data sets as ‘big-S’. To see why big-S represents a modeling challenge, recall the first-stage model in gjam,

\[\mathbf{w}_{i} \sim MVN(\mathbf{B}'\mathbf{x}_{i},\boldsymbol{\Sigma})\]

Covariance \(\boldsymbol{\Sigma}\) has dimension \(S \times S\). It must be inverted, an order\((S^{3})\) operation. Even in cases where \(\boldsymbol{\Sigma}\) can be inverted, the number of observations may not be sufficient to accurately estimate the large number of parameters in the model. In gjam, big-S is handled by generating a lower-order approximation of \(\boldsymbol{\Sigma}\) (Taylor-Rodriguez et al. 2016).

The total number of estimates in the full model is

\[\frac{S(S + 1)}{2} + QS\]

The two terms come from \(\boldsymbol{\Sigma}\) and \(\mathbf{B}\), respectively. The number of observed responses is \(Sn\). Thus, to prescribe a maximum ratio of \(a = \frac{no. estimates}{no. observations}\), we might choose to fit the model with no more than

\[S_{max} = 2(an - Q) - 1\]

responses. If we think about replacing \(\boldsymbol{\Sigma}\) with a new \(r \times N\) matrix, described below, the maximum size might be prescribed as

\[r \times N_{max} = S(an - Q)\]
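The arithmetic behind these two bounds can be checked directly. The sketch below is illustrative only: the function names (`nEstimatesFull`, `maxResponses`, `maxReducedSize`) and the values chosen for `n`, `Q`, and `a` are assumptions made for this example and are not part of gjam.

```r
# Number of estimates in the full model: S(S + 1)/2 from Sigma plus Q*S from B
nEstimatesFull <- function(S, Q) S * (S + 1) / 2 + Q * S

# Largest S for which (no. estimates)/(no. observations) does not exceed a,
# obtained from ((S + 1)/2 + Q)/n <= a
maxResponses <- function(a, n, Q) 2 * (a * n - Q) - 1

# Largest r x N for the reduced model, obtained from (r*N + Q*S)/(S*n) <= a
maxReducedSize <- function(S, a, n, Q) S * (a * n - Q)

n <- 100   # observations (assumed value)
Q <- 5     # predictors (assumed value)
a <- 0.5   # target ratio of estimates to observations (assumed value)

Smax  <- maxResponses(a, n, Q)                   # 89
ratio <- nEstimatesFull(Smax, Q) / (Smax * n)    # 0.5, the bound is met exactly
rNmax <- maxReducedSize(S = 200, a, n, Q)        # 9000
```

With these assumed values a full covariance matrix could be supported for at most 89 responses, whereas a reduced model for 200 responses could use up to 9000 covariance estimates.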

The interpretation of a reduced model warrants a few words. If we replace \(\boldsymbol{\Sigma}\) with a much smaller number of estimates, we cannot insist that we can know the covariance between every pair of species. If \(\boldsymbol{\Sigma}\) does not contain structure that can be adequately summarized with fewer estimates, then we have at best a version of the model that soaks up some of the dependence structure that is important for estimating \(\mathbf{B}\). On the other hand, \(\boldsymbol{\Sigma}\) may contain substantial structure that can be captured by a small number of estimates. We may require far fewer than order\((S^{2})\) parameters to describe dependence.

In this vignette we summarize aspects of dimension reduction in gjam that can be tuned to specific applications. First we point out that an analysis of a big-S data set need not include every species recorded in it, and we show how gjam functions can be used to trim large data sets.

How many responses?

A species s that bears no relationship to any of the predictors in \(\mathbf{X}\) (all \(\mathbf{B}_{s}\) small) or to other species s’ (all \(\boldsymbol{\Sigma}_{s',s}\) small) will not be ‘explained’ by the model. Such species will contribute little to the model fit, while degrading performance. Consider either of two options for reducing the number of species in the model, trimming and aggregation.

Trim species that are not of interest, that will not affect the fit, or both.

Aggregation can be based on a number of criteria, such as phylogenetic similarity (e.g., members of the same genus), functional similarity (e.g., a feeding guild, C3 vs C4 plants), and so forth. Rare species can be aggregated into a single group. For example, Clark et al. (2014) include 96 tree species that occur on a minimum of 50 forest inventory plots in eastern North America. The remaining species can be gathered into a single class. When this option is used, the name ‘other’ is assigned to this class in the plots-by-species matrix. Including this class is important where species compete, such as forest trees. It can also be used as a reference category for composition data, summarized below.
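The idea of gathering rare species into one class can be sketched in base R. This is a minimal illustration on simulated counts, not the implementation used by gjam; the helper `aggregateRare` and the toy matrix are hypothetical.

```r
# Toy plots-by-species count matrix (simulated for illustration)
set.seed(2)
lambda <- rep(c(3, 3, 0.1, 0.1, 0.05, 2), each = 10)  # one mean per species
y <- matrix(rpois(60, lambda), nrow = 10)
colnames(y) <- paste0('sp', 1:6)

# Hypothetical helper: keep species present in at least minObs observations,
# sum all remaining species into a final column named 'other'
aggregateRare <- function(y, minObs) {
  nobs  <- colSums(y > 0)                 # observations in which each species occurs
  keep  <- nobs >= minObs
  other <- rowSums(y[, !keep, drop = FALSE])
  cbind(y[, keep, drop = FALSE], other = other)
}

yTrim <- aggregateRare(y, minObs = 5)
colnames(yTrim)                           # retained species, then 'other'
all(rowSums(yTrim) == rowSums(y))         # TRUE: totals per plot are preserved
```

Because the rare species are summed rather than dropped, the total count in each plot is unchanged, which matters when species compete or when counts are compositional.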

Dimension reduction in gjam

As mentioned above, covariance matrix \(\boldsymbol{\Sigma}\) has \(S(S + 1)/2\) unique elements, the S diagonal elements plus \(1/2\) of the non-diagonal elements. For example, a data set with \(S = 100\) has 5050 unique elements in \(\boldsymbol{\Sigma}\). The rank of \(\boldsymbol{\Sigma}\) can be reduced by finding structure, essentially groups of responses that might respond similarly.
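The count of unique elements is easy to confirm. The short check below uses an arbitrary placeholder matrix, since only its dimension matters.

```r
S <- 100
Sigma   <- diag(S)                              # placeholder S x S symmetric matrix
nUnique <- sum(upper.tri(Sigma, diag = TRUE))   # diagonal plus upper triangle
nUnique                                         # 5050
nUnique == S * (S + 1) / 2                      # TRUE
```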

In gjam the total number of covariance parameter estimates is reduced to \(N \times r\), where \(r < N << S\). The integer \(N\) represents the potential number of response groups. The integer \(r\) is the dimensionality of each group. In other words, large N means more groups, and large r increases the flexibility of those N groups.
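One way to picture how \(N \times r\) values can stand in for a full covariance matrix is sketched below. It assumes a structure of the form \(\boldsymbol{\Sigma} \approx \mathbf{A}\mathbf{A}' + \sigma^{2}\mathbf{I}\), where the \(S \times r\) matrix \(\mathbf{A}\) has rows taken from at most \(N\) distinct \(r\)-vectors. This is our loose reading of the approach in Taylor-Rodriguez et al. (2016), stated here as an assumption; the code is not gjam's sampler, all object names are hypothetical, and all values are simulated.

```r
set.seed(1)
S <- 200; N <- 20; r <- 5

Z      <- matrix(rnorm(N * r), N, r)     # the N x r values that define dependence
group  <- sample(N, S, replace = TRUE)   # assumed assignment of each response to a group
A      <- Z[group, , drop = FALSE]       # S x r matrix; responses in a group share a row
sigma2 <- 0.5                            # assumed residual variance

Sigma <- tcrossprod(A) + diag(sigma2, S) # A %*% t(A) + sigma2 * I, an S x S matrix

c(full = S * (S + 1) / 2, reduced = N * r)   # 20100 versus 100
nrow(unique(A)) <= N                         # TRUE: at most N distinct rows
```

Under this assumed structure, responses assigned to the same group have identical covariances with all other responses, which is the sense in which large \(N\) allows more groups and large \(r\) allows more flexible ones.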

Dimension reduction is invoked in one of two ways. The first way is automatic, when i) a data set includes more species than can be fitted given sample size n or when ii) S is too large irrespective of n.

A second way to invoke dimension reduction is to specify it in modelList, through the list reductList. Here is an example using simulated data, where the number of species is twice the number of observations.

library(gjam)
S   <- 200
f   <- gjamSimData(n = 100, S = S, typeNames='CA')
rl  <- list(r = 5, N = 20)
ml  <- list(ng = 2000, burnin = 500, typeNames = f$typeNames, 
            reductList = rl, PREDICTX = F)  
out <- gjamGibbs(f$formula, f$xdata, f$ydata, modelList = ml)
pl  <- list(trueValues = f$trueValues, SMALLPLOTS = F, 
            GRIDPLOTS=T, specLabs = F)
gjamPlot(output = out, plotPars = pl)

The full matrix is not stored, so gjam needs time to construct versions of it as needed. The setting PREDICTX = F can be included in modelList to speed up computation, when prediction of inputs is not of interest.

The massive reduction in rank of the covariance matrix means that we cannot estimate the ‘true’ version of \(\boldsymbol{\Sigma}\), particularly given that the simulator does not generate a structured \(\boldsymbol{\Sigma}\). The reduction shows up as highly structured GRIDPLOTS for the posterior mean estimates of the correlation matrix \(\mathbf{R}\). However, we can still obtain estimates of \(\mathbf{B}\) and predictions of \(\mathbf{Y}\) that are close to true values.

Big-S composition data

Microbiome data are often big-S, small-n, with thousands of response variables (e.g., OTUs) as columns in \(\mathbf{Y}\). They are also composition count ('CC') data: discrete counts that are not related to absolute abundance and are meaningful only in a relative sense. Because the data inform only about relative abundance, there is information for only \(S - 1\) species. If there are thousands of species, most of which are rare and thus not explained by the model, consider aggregating the many rare types into a single other class.
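A small example may help show why composition counts carry information on only \(S - 1\) responses. The counts and OTU names below are invented for illustration.

```r
# Two observations with the same composition but a tenfold difference in reads
counts <- rbind(obs1 = c(otuA = 10,  otuB = 30,  otuC = 60),
                obs2 = c(otuA = 100, otuB = 300, otuC = 600))

# Relative abundance: divide each row by its total
comp <- sweep(counts, 1, rowSums(counts), '/')
comp

# Sequencing depth differs, composition does not
all.equal(comp['obs1', ], comp['obs2', ])                           # TRUE

# Rows sum to one, so the last column is determined by the other S - 1
all.equal(comp[, 'otuC'], 1 - rowSums(comp[, c('otuA', 'otuB')]))   # TRUE
```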

Fungal endophyte example

Fungal endophytes were sequenced on host tree seedlings (Hersh et al. 2016). In the data set fungEnd there is a compressed version of responses yDeZero containing OTU counts, a data.frame xdata containing predictors, and status, a vector of host responses, 0 for morbid, 1 for no signs of morbidity. Several histograms show the overwhelming numbers of zeros. Here we extract the data, stored in de-zeroed format, and generate some plots:

library(gjam)
library(repmis)
source_data("https://github.com/jimclarkatduke/gjam/blob/master/fungEnd.RData?raw=True")

xdata  <- fungEnd$xdata
otu    <- gjamReZero(fungEnd$yDeZero)
status <- fungEnd$status

par(mfrow=c(1,3), bty='n', oma = c(0,0,0,0), 
    mar = c(3,2,2,1), tcl = -0.5, mgp = c(3,1,0), family='')
hist(status, main='Host condition (morbid = 0)', ylab = 'Host obs')
hist(otu, nclass=100, ylab = 'Reads', main='each observation')
nobs <- gjamTrimY(otu, minObs = 1, OTHER = F)$nobs
hist(nobs, nclass=100, ylab = 'Total reads per OTU', main='Full sample')

The model will provide no information on the rarest taxa. Here we trim otu to include only OTUs that occur in > 100 observations. The rarest OTUs are aggregated into the last column of y with the column name other:

tmp <- gjamTrimY(otu, minObs = 100)
y   <- tmp$y
dim(otu)                     # all OTUs
dim(y)                       # trimmed data
tail(colnames(y))            # 'other' class added

The full response matrix includes the OTU composition counts and the host status in column 1:

ydata <- cbind(status, y) # host status is also a response
S     <- ncol(ydata)
typeNames    <- rep('CC',S)   # composition count data
typeNames[1] <- 'PA'          # binary host status 

The interactions in the model involve two factors, poly (two levels, polyculture vs monoculture) and host (eight levels, one for each host species). We assign acerRubr as the reference class for host:

xdata$host <- relevel(xdata$host,'acerRubr')

The gjam vignette on traits discusses multilevel factors in more detail. We discuss multilevel factors in the context of interactions below.

For this example we specify up to \(N = 20\) clusters with \(r = 5\) columns each. Here is an analysis of host seedling and polyculture effects on combined host morbidity status and microbiome composition:

rl <- list(r = 5, N = 20)
ml <- list(ng = 2000, burnin = 500, typeNames = typeNames, reductList = rl)
output <- gjamGibbs(~ host*poly, xdata, ydata, modelList = ml)

Here is output:

S <- ncol(ydata)
specColor     <- rep('black',S)
specColor[1]  <- 'red' # highlight host status
plotPars      <- list(corLines=F, specColor = specColor, GRIDPLOTS=T,
                      specLabs = F, sdScaleY = T, SMALLPLOTS = F) 
fit <- gjamPlot(output, plotPars)
fit$eComs[1:5,]

Check the chains for convergence.
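Trace plots are the usual first check. As a rough numerical complement, the sketch below compares the means of the first and second halves of the post-burn-in draws. It is a crude diagnostic on a simulated stand-in for a chain (a matrix of iterations by parameters), not a gjam function, and all object names are hypothetical.

```r
set.seed(3)
ng <- 2000; burnin <- 500

# Simulated stand-in for MCMC output: three parameters with an early transient
chain <- sapply(c(-1, 0, 2), function(mu) {
  mu + rnorm(ng, sd = 0.2) + 3 * exp(-(1:ng) / 50)
})
colnames(chain) <- paste0('beta', 1:3)

# Discard burn-in, then split the retained draws in half
kept   <- chain[(burnin + 1):ng, , drop = FALSE]
half   <- nrow(kept) %/% 2
first  <- kept[1:half, , drop = FALSE]
second <- kept[(half + 1):nrow(kept), , drop = FALSE]

# Shift between half-means, in units of the posterior standard deviation;
# values far from zero suggest the chain is still drifting
drift <- (colMeans(second) - colMeans(first)) / apply(kept, 2, sd)
round(drift, 2)
```

If the drift is large for some parameters, a longer run (larger ng and burnin) is the simplest remedy.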

Again, the low dimensional version of covariance \(\boldsymbol{\Sigma}\) is expected to perform best when there is structure in the data. The responses in matrix \(\mathbf{E}\), returned in fit$ematrix, classify OTUs in three main groups, contained in fit$eComs.

Interactions and indirect effects

A plot of main effects, interactions, and indirect effects is used in this example to show contributions to host status, the response variable status. The effects of host species on host status are available as a table with standard errors and credible intervals:

beta <- fit$summaryCoeffs$betaCoeff
ws   <- grep('status_',rownames(beta))  # find coefficients for status
beta[ws,]

Following the intercept are rows showing main effects. These are followed by interaction terms. A quick visual of coefficients having credible intervals that exclude zero is here:

fit$summaryCoeffs$betaSig['status',]

with - and + indicating negative and positive values.

Here we use the function gjamIIE to create the object fit1, a list of these main, interaction, and indirect effects. We specify not to include the response variable other as an indirect effect on status, because we want to focus on the effects of microbes that have been assigned to known taxonomic groups.

We specify the values for main effects that are involved in the interactions between poly and host. Each factor has one less column in the design matrix x than factor levels. poly has two classes in xdata, one each for monoculture and polyculture, so there is one poly column in x. host has eight species in xdata, so there are seven columns in x. Recall that we assigned acerRubr to be the reference class, so there is no column for it in x.
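The column counts described here can be reproduced with a toy design. In the sketch below only acerRubr and fraxAmer are host names taken from this example; the remaining host labels and the monoculture label 'mono' are placeholders assumed for illustration.

```r
# Eight host levels and two culture levels, fully crossed (toy data)
host <- factor(rep(c('acerRubr', 'fraxAmer', paste0('host', 3:8)), each = 2))
host <- relevel(host, 'acerRubr')            # reference class gets no column
poly <- factor(rep(c('mono', 'poly'), times = 8))

x <- model.matrix(~ host * poly)
ncol(x)       # 16: intercept + 7 host + 1 poly + 7 interactions
colnames(x)   # includes 'hostfraxAmer', 'polypoly', 'hostfraxAmer:polypoly'
```

The reference levels (acerRubr and, here, 'mono') are absorbed into the intercept, which is why they do not appear among the column names.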

The vector of predictor values xvector passed to gjamIIE has the same elements and names as columns in x. For this reason it is easiest to simply assign it a row in x, then change the values. The only values that influence interactions are those that are involved in interaction terms, as specified in formula.

For the following plots we copy the first row of x and reset its values. The first plot shows main effects of all predictors on status. The second plot shows interactions with poly set to 1. The third plot shows indirect effects of the microbes:

xvector <- output$x[1,]*0
xnames  <- colnames(output$x)
names(xvector)  <- xnames

xvector['hostfraxAmer'] <- 1
xvector['polypoly'] <- 1
fit1 <- gjamIIE(output, xvector, omitY = 'other')

par(mfrow=c(1,3), bty='n', oma = c(0,0,0,0), 
    mar = c(3,2,2,1), tcl = -0.5, mgp = c(3,1,0))
gjamIIEplot(fit1, response = 'status', effectMu = 'direct', 
            effectSd = 'direct',
            legLoc = 'bottomright', ylim=c(-.5,.5))
title('Direct effect by host')

gjamIIEplot(fit1, response = 'status', effectMu = 'int', effectSd = 'int',
            legLoc = 'topright', ylim=c(-.5,.5))
title('Interactions with polyculture')

gjamIIEplot(fit1, response = 'status', effectMu = 'ind', effectSd = 'ind',
            legLoc = 'topright', ylim=c(-.5,.5))
title('Indirect effect of microbiome')

The plot at left is the direct effect, which includes both main effects and interactions, plotted relative to the mean over all hosts. The interaction contribution at center is the effect of each host when grown in polyculture (poly = ref1) and of polyculture when host = 'fraxAmer'.

The indirect effects bring with them the main effects and interaction effects on each microbial taxon. In this example the indirect effects are noisy, showing large 95% intervals.

We might also wish to explore the taxa that are more abundant in healthy versus morbid hosts. This can be done with gjamPredict. Here are conditional predictions for responses with status first set to 0 (morbid) and then set to 1 (healthy):

y0 <- ydata[,1,drop=F]*0       #unhealthy host

newdata   <- list(ydataCond = y0, nsim=50)
morbid    <- gjamPredict(output, newdata = newdata) 

newdata   <- list(ydataCond = y0 + 1, nsim = 50 )
healthy   <- gjamPredict(output, newdata = newdata)

# compare predictions
par(mfrow=c(1,2), bty='n')
plot(healthy$sdList$yMu[,-1],morbid$sdList$yMu[,-1], cex=.4,
     xlab='healthy',ylab='morbid')
abline(0, 1, lty=2,col='grey')
plot(output$y[,2:20],healthy$sdList$yMu[,2:20], cex=.4,col='orange',
     xlab='Observed',ylab='Predicted', pch=16)
points(output$y[,2:20],morbid$sdList$yMu[,2:20], cex=.4,col='blue', pch=16)
abline(0, 1, lty=2,col='grey')

In the first plot, dots above the 1:1 line are microbial taxa predicted to be more abundant in morbid hosts, and vice versa. In the second plot, predictions for healthy hosts are in orange and those for morbid hosts are in blue, shown for a subset of types to limit clutter.
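To move from the scatter plot to a list of taxa, the two sets of predictions can be compared column by column. The sketch below uses simulated matrices as stand-ins for the predicted means under each condition (observations by taxa); the object names and values are hypothetical.

```r
set.seed(4)
nObs <- 30
taxa <- paste0('otu', 1:6)

# Simulated stand-ins for predicted means under the two host conditions
healthyMu <- matrix(rgamma(nObs * 6, shape = 2), nObs, 6,
                    dimnames = list(NULL, taxa))
morbidMu <- healthyMu
morbidMu[, 'otu2'] <- morbidMu[, 'otu2'] + 5     # made more abundant in morbid hosts
morbidMu[, 'otu5'] <- morbidMu[, 'otu5'] * 0.1   # made less abundant in morbid hosts

# Mean shift per taxon: positive values lie above the 1:1 line
shift <- colMeans(morbidMu) - colMeans(healthyMu)
sort(shift, decreasing = TRUE)

names(which.max(shift))   # 'otu2'
names(which.min(shift))   # 'otu5'
```

With real predictions the same comparison should account for predictive uncertainty before any taxon is singled out.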

For additional information see this link

The model is described in Clark et al. (2016).

Acknowledgements

For valuable feedback on the model and computation we thank Bene Bachelot, Alan Gelfand, Erin Schliep, Daniel Taylor-Rodriguez, and Bradley Tomasek.

References

Clark, J.S., D. Nemergut, B. Seyednasrollah, P. Turner, and S. Zhang. 2016. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data, Ecological Monographs, in press.

Hersh, H., S. Benetiz, R. Vilgalys, J.S. Clark, in review.

Taylor-Rodriguez, D., K. Kaufeld, E. Schliep, J. S. Clark, and A. Gelfand, 2016. Joint Species distribution modeling: dimension reduction using Dirichlet processes. Bayesian Analysis, in press.