Trait modeling in gjam

James S. Clark, Duke University

2016-10-04

citation:

Clark, J.S. 2016. Why species tell us more about traits than traits tell us about species, Ecology, 97, 1979-1993.

Clark, J.S., D. Nemergut, B. Seyednasrollah, P. Turner, and S. Zhang. 2016. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data. Ecological Monographs, in press.

files are found here

gjam vignettes:

  1. Generalized joint attribute modeling - gjam: overview
  2. Dimension reduction in gjam: application to many response variables (‘Big-S’)
  3. Trait modeling in gjam: ecological trait analysis

Overview

Because it accommodates different data types, gjam can be used to model ecological traits by either of two approaches (Clark 2016). The first approach uses community weighted mean/mode (CWMM) trait values for a plot \(i\) as a response vector \(\mathbf{u}_{i}\), where each trait has a corresponding data type designation in typeNames. I discuss this approach first. I then summarize the second approach, predictive trait modeling.

Trait response model (TRM)

There are \(n\) observations of \(M\) traits to be explained by \(Q - 1\) predictors in design matrix \(\mathbf{X}\). The Trait Response Model (TRM) in Clark (2016) is

\[\mathbf{w}_{i} \sim MVN(\mathbf{u}_i,\Omega)\]

\[\mathbf{u}_i = \mathbf{A}'\mathbf{x}_{i}\]

where \(\mathbf{u}_{i}\) is a length-\(M\) vector of CWMM values, corresponding to \(\mathbf{w}_{i}\) on the latent scale, \(\mathbf{A}\) is the \(Q \times M\) matrix of coefficients, and \(\Omega\) is the \(M \times M\) residual covariance (Fig. 1). After describing the setup and model fitting I show how gjam summarizes the estimates and predictions.

Figure 1. Trait response model showing the sizes of matrices for a sample containing n observations, M traits, and Q predictors.
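
To make the matrix sizes in the TRM concrete, here is a minimal simulation sketch in base R. It is not gjam code: the sizes, coefficient values, and object names are made up for illustration, and all traits are treated as continuous, so the latent and observed scales coincide.

```r
set.seed(1)
n <- 200; Q <- 3; M <- 4      # observations, design columns, traits (illustrative)

# design matrix: an intercept plus Q - 1 predictors
X <- cbind(intercept = 1, matrix(rnorm(n * (Q - 1)), n, Q - 1))

A     <- matrix(rnorm(Q * M, sd = 0.5), Q, M)                      # Q x M coefficients
Omega <- crossprod(matrix(rnorm(M * M), M, M)) / M + diag(0.2, M)  # M x M covariance

U <- X %*% A                                          # row i holds u_i = A'x_i
W <- U + matrix(rnorm(n * M), n, M) %*% chol(Omega)   # w_i ~ MVN(u_i, Omega)

# for continuous traits, least squares recovers A up to sampling error
Ahat <- solve(crossprod(X), crossprod(X, W))
round(cbind(true = A[, 1], estimate = Ahat[, 1]), 2)
```

The multiplication by chol(Omega) gives each row of W the residual covariance \(\Omega\). Non-continuous trait types would need the censoring and partition steps that gjam handles internally.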

Input data

Data contained in forestTraits include predictors in xdata, a character vector of data types in traitTypes, species traits in specByTrait, and treesDeZero, which contains tree biomass in de-zeroed format. Here the data are loaded and the biomass matrix is re-zeroed with gjamReZero:

library(gjam)
library(repmis)
source_data("https://github.com/jimclarkatduke/gjam/blob/master/forestTraits.RData?raw=True")
## Downloading data from: https://github.com/jimclarkatduke/gjam/blob/master/forestTraits.RData?raw=True
## SHA-1 hash of the downloaded data file is:
## 1ea78837a63fc59c0e39f8cd665048827251681a
## [1] "forestTraits"
xdata <- forestTraits$xdata                    # n X Q
types <- forestTraits$traitTypes               # 12 trait types 
sbyt  <- forestTraits$specByTrait              # S X 12
pbys  <- gjamReZero(forestTraits$treesDeZero)  # n X S
head(sbyt)
##           gmPerSeed      maxHt      leafN      leafP         SLA
## abieBals -0.6749613 -0.7215657 -1.3581941 -1.0169734 -1.53743111
## acerBarb -0.2325344 -0.7215657  0.4023159  0.2311101  0.90381913
## acerNegu -0.1398369 -0.7215657  1.5275903  0.3119445  1.34368404
## acerPens -0.1649578 -2.4505299 -0.3962660  0.6352821  2.76591391
## acerRubr -0.3474630 -0.3510734 -0.5777619 -0.4640659  0.06807581
## acerSac2 -0.2077359  0.8839010  1.2008977  0.2311101 -0.21050530
##               woodSG shade drought flood            leaf   xylem
## abieBals -1.71516623     5       1     2 needleevergreen diffuse
## acerBarb -0.29558420     4       2     1  broaddeciduous diffuse
## acerNegu -0.80257778     3       3     3  broaddeciduous diffuse
## acerPens -0.59978035     4       2     1  broaddeciduous diffuse
## acerRubr -0.09278677     3       2     3  broaddeciduous diffuse
## acerSac2 -0.59978035     4       3     3  broaddeciduous diffuse
##               repro
## abieBals monoecious
## acerBarb monoecious
## acerNegu  dioecious
## acerPens  dioecious
## acerRubr  dioecious
## acerSac2  dioecious

The matrix pbys holds biomass values for species, rounded off to reduce storage. The first six columns of sbyt are centered and standardized. The three ordinal classes are integer values, but do not represent an absolute scale (see below). The three groups of categorical variables in data.frame sbyt have different numbers of levels shown here:

table(sbyt$leaf)      # four levels
## 
##  broaddeciduous  broadevergreen needledeciduous needleevergreen 
##              76               3               3              16
table(sbyt$xylem)     # diffuse/tracheid vs ring-porous
## 
## diffuse    ring 
##      63      35
table(sbyt$repro)     # two levels
## 
##  dioecious monoecious 
##         20         78

These species traits are translated into community-weighted means and modes (CWMM) by the function gjamSpec2Trait:

tmp         <- gjamSpec2Trait(pbys, sbyt, types)
tTypes      <- tmp$traitTypes                  # M = 15 values
u           <- tmp$plotByCWM                   # n X M
censor      <- tmp$censor                      # (0, 1) censoring, two-level CAT's
specByTrait <- tmp$specByTrait                 # S X M
M           <- ncol(u)
n           <- nrow(u)
types                                          # 12 individual trait types
##  [1] "CON" "CON" "CON" "CON" "CON" "CON" "OC"  "OC"  "OC"  "CAT" "CAT"
## [12] "CAT"
cbind(colnames(u),tTypes)                      # M trait names and types
##                             tTypes
##  [1,] "gmPerSeed"           "CON" 
##  [2,] "maxHt"               "CON" 
##  [3,] "leafN"               "CON" 
##  [4,] "leafP"               "CON" 
##  [5,] "SLA"                 "CON" 
##  [6,] "woodSG"              "CON" 
##  [7,] "shade"               "OC"  
##  [8,] "drought"             "OC"  
##  [9,] "flood"               "OC"  
## [10,] "leafother"           "FC"  
## [11,] "leafbroaddeciduous"  "FC"  
## [12,] "leafneedledeciduous" "FC"  
## [13,] "leafneedleevergreen" "FC"  
## [14,] "ring"                "CA"  
## [15,] "dioecious"           "CA"
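
The idea behind community weighting can be sketched in base R. This is only an illustration with made-up abundances and traits, not the implementation in gjamSpec2Trait. The helper wmode is a hypothetical name.

```r
# toy abundance (plots x species) and species traits; all values are made up
y <- matrix(c(10, 0, 5,
               0, 8, 2), nrow = 2, byrow = TRUE,
            dimnames = list(c("plot1", "plot2"), c("sp1", "sp2", "sp3")))
traits <- data.frame(woodSG = c(0.4, 0.6, 0.8),
                     shade  = c(5, 3, 3),
                     leaf   = factor(c("needle", "broad", "broad")),
                     row.names = colnames(y))

w <- y / rowSums(y)                       # species weights, rows sum to 1

# continuous trait: community weighted mean
cwmWood <- drop(w %*% traits$woodSG)

# ordinal trait: community weighted mode, the class holding the most weight
wmode <- function(wi, cls) {
  tot <- tapply(wi, cls, sum)
  as.numeric(names(tot)[which.max(tot)])
}
cwmShade <- apply(w, 1, wmode, cls = traits$shade)

# categorical trait: one fractional column per level, rows sum to 1
leafFC <- w %*% model.matrix(~ leaf - 1, traits)
```

A categorical trait with several levels becomes a set of fractions per plot, which is why a single 'CAT' column for species can expand to several 'FC' columns for plots.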

Traits by species

Note the change in data types by comparing types for individuals of a species with tTypes for CWMM values at the plot scale. At the plot scale tTypes has \(M = 15\) values, because the leaf 'CAT' group in types includes four levels, which are expanded to four 'FC' columns in u. The two-level groups 'xylem' and 'repro' are transformed to censored continuous values on (0, 1) and thus each occupy a single column in u.

As discussed in Clark (2016), the interpretation of CWMM values in u is not the same as the interpretation of species-level traits assigned in forestTraits$specByTrait. Let \(\mathbf{T}\) be the \(S \times M\) species-by-trait matrix specByTrait returned by the function gjamSpec2Trait. The row names of specByTrait match the column names of the \(n \times S\) species abundance matrix pbys. The matrix specByTrait is referenced to individuals of a species.

The plot-by-trait matrix u is referenced to a location, i.e., one row in matrix u. It is a CWMM, with values derived from measurements on individual trees, but combined to produce a weighted value for each location. Ordinal traits (shade, drought, flood) are community weighted modes, because ordinal scores cannot be averaged. The CWMM value for a plot may not be the same data type as the trait measured on an individual tree (in sbyt). Here is a table of the 15 columns in u:

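| trait | typeName | partition | comment |
|---|---|---|---|
| gmPerSeed | CON | \((-\infty, \infty)\) | centered, standardized |
| maxHt | CON | \((-\infty, \infty)\) | centered, standardized |
| leafN | CON | \((-\infty, \infty)\) | centered, standardized |
| leafP | CON | \((-\infty, \infty)\) | centered, standardized |
| SLA | CON | \((-\infty, \infty)\) | centered, standardized |
| woodSG | CON | \((-\infty, \infty)\) | centered, standardized |
| shade | OC | \((-\infty, 0, p_{s1}, p_{s2}, p_{s3}, p_{s4}, \infty)\) | five tolerance bins |
| drought | OC | \((-\infty, 0, p_{s1}, p_{s2}, p_{s3}, p_{s4}, \infty)\) | five tolerance bins |
| flood | OC | \((-\infty, 0, p_{s1}, p_{s2}, p_{s3}, p_{s4}, \infty)\) | five tolerance bins |
| leafother | FC | \((-\infty, 0, 1, \infty)\) | categorical traits become FC data as CWMs |
| leafbroaddeciduous | FC | \((-\infty, 0, 1, \infty)\) | categorical traits become FC data as CWMs |
| leafneedledeciduous | FC | \((-\infty, 0, 1, \infty)\) | categorical traits become FC data as CWMs |
| leafneedleevergreen | FC | \((-\infty, 0, 1, \infty)\) | categorical traits become FC data as CWMs |
| ring | CA | \((-\infty, 0, 1, \infty)\) | two categories become continuous (censored) |
| dioecious | CA | \((-\infty, 0, 1, \infty)\) | two categories become continuous (censored) |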

The first six CON variables are continuous, centered, and standardized, as is often done in trait studies. In gjam CON is the only type that is not assumed to be censored at zero.

The three OC variables are ordinal classes, lacking an absolute scale, so the partition must be estimated.

The four fractional composition FC columns are the levels of the single CAT variable leaf, expanded by the function gjamSpec2Trait.

The last two traits in u are fractions with two classes, only one of which is included here. They are censored at both 0 and 1, the intervals \((-\infty, 0)\) and \((1, \infty)\). This censoring can be generated using gjamCensorY:

censorList    <- gjamCensorY(values = c(0,1), intervals = cbind( c(-Inf,0),c(1,Inf) ), 
                             y = u, whichcol = c(14:15))$censor  # columns 14, 15 are the 'CA' traits

This censoring was already done with gjamSpec2Trait, which knows to treat 'CAT' data with only two levels as censored 'CA' data. Here values = c(0,1) specifies that zeros and ones in the data are censored, and the intervals matrix gives the ranges they represent.
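
What censoring at 0 and 1 means can be sketched in base R with simulated values. This illustrates the idea only and does not reproduce how gjam samples the latent variable: a latent value below 0 is observed as 0, and a latent value above 1 is observed as 1.

```r
set.seed(4)
w <- rnorm(10, mean = 0.5, sd = 0.6)   # made-up latent values
y <- pmin(pmax(w, 0), 1)               # observed fraction, censored at 0 and 1

lower <- which(y == 0)                 # latent value lies somewhere in (-Inf, 0)
upper <- which(y == 1)                 # latent value lies somewhere in (1, Inf)
cbind(latent = round(w, 2), observed = round(y, 2))
```

An observed 0 or 1 therefore carries only the information that the latent value falls in the corresponding interval.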

Factors in this example

Multilevel factors in xdata require some interpretation. If you have not worked with multilevel factors, refer to the R help page for factor. The interpretation of coefficients for multilevel factors depends on the reference level used to construct a contrasts matrix. Standard models in R assign contrasts that may not assume the reference level that is desired. Moreover, results may depend on the order of observations and variables in the data.

In xdata the variable soil is a multilevel factor, which includes soil types that are both common and have potentially strong effects. Here are the first few rows of xdata:

head(xdata)
##         temp     deficit   moisture         u1          u2          u3
## 1  1.2165433  0.03637914  0.6870299 0.01182693 0.003898011  0.04438465
## 2  0.1825447  0.20708706  1.6655992 0.02679904 0.000000000  0.87181296
## 3 -0.9409308  0.20345146 -0.1892591 0.14917175 0.860971854  0.41368482
## 4  0.6435989  0.81532968  0.3900552 0.04019072 0.061030872 -0.04561379
## 5  0.8238641 -0.17556592  0.8500289 0.20903325 0.571141912  0.28128988
## 6  0.2201759  0.76258277 -0.8765680 0.27511333 0.647060523 -0.63773372
##        stdage      soil
## 1 -0.16697961 reference
## 2 -0.02907271 reference
## 3  0.48147624  SpodHist
## 4 -0.07895393 reference
## 5  0.36007415 reference
## 6 -0.58986965 reference
table(xdata$soil)
## 
##   EntVert       Mol reference  SpodHist    UltKan 
##       107        39      1062       354        55

I used the name reference for a soil type to aggregate types that are rare. Factor levels that rarely occur cannot be estimated in the model.

The R function relevel allows definition of a reference level. In this case I want to compare levels to the reference soil type reference:

xdata$soil <- relevel(xdata$soil,'reference')

To avoid confusion, contrasts can be inspected after fitting in out$modelSummary$contrasts. If the reference class is all zeros and other classes are zeros and ones, then the intercept is the reference class.
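
The effect of relevel on contrasts can be checked in base R before fitting. This sketch uses a short made-up factor with the same level names as soil, not the actual xdata.

```r
soil <- factor(c("Mol", "reference", "SpodHist", "reference", "UltKan", "EntVert"))
levels(soil)[1]                    # default reference is the first level in sort order

soil <- relevel(soil, ref = "reference")
levels(soil)[1]                    # now "reference"

contrasts(soil)                    # the row for 'reference' is all zeros
colnames(model.matrix(~ soil))     # intercept plus one column per non-reference level
```

With these treatment contrasts the intercept represents the reference soil, and each remaining column is a difference from it.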

TRM analysis

Here is an analysis of the data, with 20 holdout plots. Predictors in xdata are winter temperature (temp), slope (u1), aspect (u2, u3), local moisture, climatic moisture deficit, and soil. Slope and aspect enter as three variables (Clark 1990),

\[[u_{i,1}, u_{i,2}, u_{i,3}]' = [\sin(slope_{i}), \sin(slope_{i})\sin(aspect_{i}), \sin(slope_{i})\cos(aspect_{i})]'\]

As discussed above, the variable soil is a multi-level factor. Because the slope and aspect variables are products (interactions), I do not standardize them, and I list them in notStandard:

ml  <- list(ng = 3000, burnin = 500, typeNames = tTypes, holdoutN = 20,
            censor=censor, notStandard = c('u1','u2','u3'))
out <- gjamGibbs(~ temp + stdage + moisture*deficit + deficit*soil, 
                 xdata = xdata, ydata = u, modelList = ml)
tnames    <- colnames(u)
specColor <- rep('black', M)                           # highlight types
wo <- which(tnames %in% c("leafN","leafP","SLA") )     # foliar traits
wf <- grep("leaf",tnames)                              # leaf habit
wc <- which(tnames %in% c("woodSG","diffuse","ring") ) # wood anatomy

specColor[wc] <- 'brown'
specColor[wf] <- 'darkblue'
specColor[wo] <- 'darkgreen'

pl  <- list(GRIDPLOTS = TRUE, plotAllY = TRUE, specColor = specColor, 
            SMALLPLOTS = FALSE, sigOnly = FALSE, ncluster = 3)
fit <- gjamPlot(output = out, plotPars = pl)
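
The slope and aspect covariates u1, u2, and u3 follow the transformation given before the model code. Here is a minimal sketch, assuming slope and aspect are supplied in radians. The function names slopeAspect and deg2rad are hypothetical and not part of gjam, and the angles are made up.

```r
# slope and aspect in radians
slopeAspect <- function(slope, aspect) {
  cbind(u1 = sin(slope),
        u2 = sin(slope) * sin(aspect),
        u3 = sin(slope) * cos(aspect))
}
deg2rad <- function(d) d * pi / 180

uu <- slopeAspect(slope  = deg2rad(c(0, 10, 30)),
                  aspect = deg2rad(c(0, 90, 180)))
round(uu, 3)
```

On flat ground all three variables are zero, so aspect has no effect where there is no slope.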

The model fit is interpreted in the same way as other gjam analyses. Note that specColor is used to highlight different types of traits in the posterior plots for values in coefficient matrix \(\boldsymbol{\alpha}\). Parameter estimates are contained in modelSummary,

out$modelSummary$betaMu      # Q by M coefficient matrix alpha
out$modelSummary$betaSe      # Q by M coefficient std errors
out$modelSummary$sigMu       # M by M covariance matrix omega
out$modelSummary$sigSe       # M by M covariance std errors

The list out contains a large number of diagnostics. These, along with the objects in out$modelSummary, are described in the help pages.

The object fit generated by gjamPlot holds coefficients that are summarized in a table:

fit$betaEstimates[1:5,]      # Q by M coefficient matrix alpha

The objects in out that contain the word traits are empty, because gjam does not know that responses are traits. These objects are used when traits are modeled as predictive distributions, discussed next.

Interactions and indirect effects

Consider the interactions and indirect effects for this model. If there are no interactions in the formula passed to gjamGibbs, then there will be no interactions to estimate with the function gjamIIE (there will still be indirect effects, discussed below). If there are interactions in the formula, I must specify the values for main effects that are involved in these interactions to be used for estimating their effects on predictions. For example, consider a model containing the interaction between predictors \(q\) and \(q'\),

\[E[y_{s}] = \cdots + \beta_{q,s}x_{q} + \beta_{q',s}x_{q'} + \beta_{qq',s}x_{q}x_{q'} + \cdots\]

The ‘effect’ of predictor \(x_{q}\) on \(y_{s}\) is the derivative

\[\frac{\partial E[y_{s}]}{\partial x_{q}} = \beta_{q,s} + \beta_{qq',s}x_{q'}\]

which depends not on \(x_{q}\), but rather on \(x_{q'}\). So if I want to know how interactions affect the response I have to decide on values for all of the predictors that are involved in interactions. These values are passed to gjamIIE in xvector. The default has sdScaleX = F, which means that effects can be compared on the basis of variation in \(\mathbf{X}\).
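
A small numerical sketch shows why the effect of \(x_{q}\) depends only on \(x_{q'}\). The coefficient values and function names here are hypothetical and are not taken from the fitted model.

```r
# hypothetical coefficients for one response s
bq      <- 0.5     # main effect of x_q
bqPrime <- 0.2     # main effect of x_q'
bqq     <- -0.8    # interaction between x_q and x_q'

# effect of x_q on E[y_s], evaluated at chosen values of x_q'
effectOfXq <- function(xqPrime) bq + bqq * xqPrime
effectOfXq(c(-1, 0, 1))

# finite-difference check at an arbitrary x_q: the slope does not depend on x_q
Ey <- function(xq, xqPrime) bq * xq + bqPrime * xqPrime + bqq * xq * xqPrime
h  <- 1e-6
numDeriv <- (Ey(2 + h, -1) - Ey(2, -1)) / h
```

Here the effect of \(x_{q}\) changes sign between \(x_{q'} = -1\) and \(x_{q'} = 1\), which is why values for the interacting predictors must be chosen before effects can be reported.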

In this example interactions involve moisture, deficit, and the multi-level factor soil, as specified in the formula passed to gjamGibbs. The first row of the design matrix is used with moisture and deficit set to -1 or +1 standard deviation to compare dry and wet sites in a dry climate:

xdrydry <- xwetdry  <- out$x[1,]
xdrydry['moisture'] <- xdrydry['deficit'] <- -1
xwetdry['moisture'] <- 1
xwetdry['deficit']  <- -1

The first observation is from the reference soil level reference, so all other soil classes are zero. Here is a plot of main effects and interactions for deciduous and evergreen traits:

par(mfrow=c(2,2), bty='n', oma = c(0,0,0,0), 
    mar = c(3,2,2,1), tcl = -0.5, mgp = c(3,1,0), family='')

fit1 <- gjamIIE(output = out, xvector = xdrydry)
fit2 <- gjamIIE(output = out, xvector = xwetdry)

gjamIIEplot(fit1, response = 'leafbroaddeciduous', 
            effectMu = c('main','int'), 
            effectSd = c('main','int'), legLoc = 'bottomleft',
            ylim=c(-.31,.3))
title('deciduous')
gjamIIEplot(fit1, response = 'leafneedleevergreen', 
            effectMu = c('main','int'), 
            effectSd = c('main','int'), legLoc = 'bottomleft',
            ylim=c(-.3,.3))
title('evergreen')

gjamIIEplot(fit2, response = 'leafbroaddeciduous', 
            effectMu = c('main','int'), 
            effectSd = c('main','int'), legLoc = 'bottomleft',
            ylim=c(-.3,.3))
gjamIIEplot(fit2, response = 'leafneedleevergreen', 
            effectMu = c('main','int'), 
            effectSd = c('main','int'), legLoc = 'bottomleft',
            ylim=c(-.3,.3))

The main effects plotted in the graphs do not depend on the values in xvector. Although this observation is taken from the reference soil, the plot shows the main effects that would be obtained if it were on the different soils included in the model. The interactions show how the effect of each predictor is modified by interactions with other variables. Again, the interactions from each predictor do not depend on values for the predictor itself, but rather on the other variables with which it interacts. For example, the interaction effect of soilUltKan on the broaddeciduous trait is positive on dry sites in dry climates (top left). Combined with a negative main effect, this means that deciduous trees tend to be more abundant on moist sites in this soil type. Its main effect on leafneedleevergreen is positive, but less so on moist sites in dry climates (bottom right).

The indirect effects on a response come through other responses. This example shows indirect effects for foliar N and P that come through broaddeciduous leaf habit:

xvector <- out$x[1,]
par(mfrow=c(2,1), bty='n', oma = c(0,0,0,0), 
    mar = c(3,2,2,1), tcl = -0.5, mgp = c(3,1,0), family='')

omitY <- colnames(u)[colnames(u) != 'leafbroaddeciduous'] # omit all but deciduous

fit <- gjamIIE(out, xvector, omitY = omitY)   # pass omitY so only deciduous contributes
gjamIIEplot(fit, response = 'leafP', effectMu = c('main','ind'), 
            effectSd = c('main','ind'), legLoc = 'topright',
            ylim=c(-.6,.6))
title('foliar P')
gjamIIEplot(fit, response = 'leafN', effectMu = c('main','ind'), 
            effectSd = c('main','ind'), legLoc = 'bottomright',
            ylim=c(-.6,.6))
title('foliar N')

There will always be indirect effects, because they come through the covariance matrix.

Predictive Trait Model (PTM)

The PTM models species abundance data, then predicts traits. This approach has a number of advantages over TRM discussed in Clark (2016). The response is the \(n \times S\) matrix \(\mathbf{Y}\), which could be counts, biomass, and so forth. On the latent scale the observation is represented by a composition vector,

\[E[\mathbf{y}_{i}] = \boldsymbol{\beta'}\mathbf{x}_{i}\]

\[\mathbf{w}_{i} \sim MVN(\boldsymbol{\beta'}\mathbf{x}_{i},\boldsymbol{\Sigma})\]

where \(\boldsymbol{\beta}\) is the \(Q \times S\) matrix of coefficients, and \(\boldsymbol{\Sigma}\) is the \(S \times S\) residual covariance. A predictive distribution on the trait scale is obtained as a variable change,

\[\boldsymbol{\alpha} = \boldsymbol{\beta}\mathbf{T}\]

\[\boldsymbol{\Omega} = \mathbf{T'}\boldsymbol{\Sigma}\mathbf{T}\]

\[\mathbf{u}_{i} = \mathbf{T'}\mathbf{w}_{i}\]

where \(\mathbf{T}\) is an \(S \times M\) matrix of trait values for each species, \(\boldsymbol{\alpha}\) is the \(Q \times M\) matrix of coefficients, and \(\boldsymbol{\Omega}\) is the \(M \times M\) residual covariance (Fig. 2).
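
The variable change can be checked with a few lines of base R. This is a sketch with random stand-in matrices and made-up sizes, not output from a fitted model. Tmat is used as the object name because T is shorthand for TRUE in R.

```r
set.seed(2)
Q <- 3; S <- 5; M <- 2                           # illustrative sizes
beta  <- matrix(rnorm(Q * S), Q, S)              # Q x S species coefficients
Sigma <- crossprod(matrix(rnorm(S * S), S, S))   # S x S residual covariance
Tmat  <- matrix(rnorm(S * M), S, M)              # S x M species-by-trait matrix

alpha <- beta %*% Tmat                           # Q x M trait coefficients
Omega <- t(Tmat) %*% Sigma %*% Tmat              # M x M trait covariance

x <- c(1, 0.3, -1.2)                             # one row of the design matrix
w <- drop(t(beta) %*% x)                         # latent mean for species, length S
u <- drop(t(Tmat) %*% w)                         # trait-scale mean, length M
all.equal(u, drop(t(alpha) %*% x))               # both routes give the same value
```

Predicting species first and then transforming to traits gives the same mean as applying the trait-scale coefficients directly.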

Figure 2. The predictive trait model fits species data and predicts traits using the species-by-trait matrix T, contained in the object specByTrait. The white boxes are fitted, with trait matrix U, and coefficient matrix \(\boldsymbol{\alpha'}\) obtained by variable change.

The PTM begins by fitting pbys, followed by predicting plotByTrait. This requires a traitList, which defines the objects needed for prediction. The species are weights, so they should be modeled as composition data, either 'FC' (rows sum to 1) or 'CC'. Here the model is fitted with dimension reduction:

tl  <- list(plotByTrait = u, traitTypes = tTypes, specByTrait = specByTrait)
rl  <- list(r = 8, N = 20)
ml  <- list(ng = 1000, burnin = 200, typeNames = 'CC', holdoutN = 20,
                  traitList = tl, reductList = rl)
out <- gjamGibbs(~ temp + stdage + deficit*soil, xdata = xdata, 
                     ydata = pbys, modelList = ml)
S <- nrow(specByTrait)
specColor <- rep('black',S)

wr <- which(specByTrait[,'ring'] == 1)                  # ring porous
wb <- which(specByTrait[,'leafneedleevergreen'] == 1)   # evergreen
ws <- which(specByTrait[,'shade'] >= 4)                 # shade tolerant
specColor[wr] <- 'brown'
specColor[ws] <- 'black'
specColor[wb] <- 'darkgreen'
         
par(family = '')
pl  <- list(width = 4, height = 4, corLines = FALSE, SMALLPLOTS = FALSE, 
                  GRIDPLOTS = TRUE, specColor = specColor, ncluster = 8) 
fit <- gjamPlot(output = out, plotPars = pl)

Output is interpreted as previously, now with coefficients \(\boldsymbol{\beta}\) and covariance \(\boldsymbol{\Sigma}\). gjamPlot generates an additional plot with trait predictions. Parameter values are here:

out$modelSummary$betaTraitMu   # Q by M coefficient matrix alpha
out$modelSummary$betaTraitSe   # Q by M coefficient std errors
out$modelSummary$sigmaTraitMu  # M by M covariance matrix omega
out$modelSummary$sigmaTraitSe  # M by M covariance std errors

Trait predictive distributions are summarized here:

out$modelSummary$tMu[1:5,]     # n by M predictive means
out$modelSummary$tSd[1:5,]     # n by M predictive std errors

The groupings of species in terms of their similar responses to the environment (the ematrix) are here, showing only the 4 most frequent species in each of the ncluster = 8 groups:

fit$eComs[,1:4]

Additional quantities can be predicted from the output using the MCMC output in the list out$chains.
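
As a sketch of how a derived quantity might be summarized from MCMC draws, the example below uses a simulated matrix in place of a real chain. The layout of the objects in out$chains is not shown in this vignette, so the column names b1 and b2 are hypothetical. Consult the help pages for the actual structure.

```r
set.seed(3)
# stand-in for MCMC draws: rows are iterations, columns are parameters
chain  <- cbind(b1 = rnorm(1000, 0.5, 0.1), b2 = rnorm(1000, -0.2, 0.3))
burnin <- 200
keep   <- chain[-(1:burnin), ]            # discard burn-in iterations

# compute the derived quantity per iteration, then summarize
diff12 <- keep[, "b1"] - keep[, "b2"]
summ   <- c(mean = mean(diff12), quantile(diff12, c(0.025, 0.975)))
summ
```

Computing the quantity at each iteration before summarizing carries parameter uncertainty through to the result.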

Acknowledgements

I thank Benedict Bachelot for review of the code.

References

Clark, J.S. 2016. Why species tell us more about traits than traits tell us about species: Predictive models. Ecology 97, 1979-1993.

Clark, J.S., D. Nemergut, B. Seyednasrollah, P. Turner, and S. Zhang. 2016. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data. Ecological Monographs, in press.