The occumb() function implements a hierarchical model that represents sequence read count data obtained from spatially replicated environmental DNA metabarcoding as a consequence of sequential stochastic processes (i.e., ecological and observational processes). The model allows us to account for false-negative species detection errors that occur at different stages of the metabarcoding workflow under several assumptions, including that there are no false-positive errors. To identify the model parameters, replicates at different levels are required; that is, you should have multiple sites and multiple within-site replicates. However, a balanced design with the same number of replicates at all sites is not necessary. A quick and less formal description of the model follows; readers are encouraged to refer to the original paper for a more formal and complete explanation.
We assume that the occurrence of I focal species was monitored at J sites sampled from an area of interest. At site j, K[j] replicates of environmental samples are collected. For each replicate, a library for DNA sequencing is prepared to obtain separate sequence reads for each replicate. We denote the resulting sequence read counts of species i for replicate k at site j, obtained using high-throughput sequencing and subsequent bioinformatic processing, as y[i, j, k].
Adapted from Figure 1 in Fukaya et al. 2022 (CC BY 4.0)
The figure above shows a diagram of the minimal model that can be fitted by occumb() with default settings. The process of generating y is represented by a series of latent variables z, u, and r, and parameters psi, theta, and phi that govern the variation of the latent variables. Although psi, theta, and phi are assumed here to have species-specific values, modeling these parameters as functions of covariates allows for further variation in them (see the following section).
First, look at the right-pointing arrow on the left, representing the ecological process of species distribution. For the DNA sequence of a species to be detected at a site, the site must be occupied by the species (i.e., the species’ environmental DNA must be present at the site). In the model, the site occupancy of species is indicated by the latent variable z[i, j]. If site j is occupied by species i, then z[i, j] = 1; otherwise, z[i, j] = 0. psi[i] represents the probability that species i occupies a site selected from the region of interest. Thus, species with high psi[i] values should occur at many sites in the region, while species with low psi[i] values should occur at a limited number of sites.
Next, focus on the second right-pointing arrow from the left, which represents one of the two observation processes: the capture of the species DNA sequence. For the DNA sequence of a species to be detected, the environmental sample collected at an occupied site (i.e., z[i, j] = 1) must contain the species' eDNA, and the amplicon derived from it must be present in the prepared sequencing library. In the model, the inclusion of the DNA sequence of a species in a sequencing library is indicated by the latent variable u[i, j, k]. If the library of replicate k at site j contains the DNA sequence of species i, then u[i, j, k] = 1; otherwise, u[i, j, k] = 0. theta[i] represents the probability that the DNA sequence of species i is captured per replicate collected at a site occupied by the species. DNA sequences of species with high theta[i] values are more reliably captured at occupied sites, while DNA sequences of species with low theta[i] values are more difficult to capture. Note that u[i, j, k] is always zero for sites not occupied by the species (i.e., z[i, j] = 0), under the assumption that false positives do not occur.
Finally, look at the right-pointing arrow on the right side. This part of the model represents the other observation process: the allocation of species sequence reads in high-throughput sequencing. Here, the sequence read count vector y[1:I, j, k] is assumed to follow a multinomial distribution with the total number of sequence reads in replicate k of site j as the number of trials. Its multinomial cell probability, pi[1:I, j, k] (not shown in the figure), is modeled as a function of the latent variable u[1:I, j, k] described above and r[1:I, j, k], which is proportional to the relative frequency of species sequence reads. The variation of r[i, j, k] is governed by the parameter phi[i], which represents the relative dominance of the species sequence. Species with higher phi[i] values tend to have relatively more reads when the species sequence is included in the library (i.e., u[i, j, k] = 1), while species with lower phi[i] values tend to have relatively fewer reads. No false positives are assumed to occur at this stage either; that is, pi[i, j, k] is always zero for replicates that do not include the species DNA sequence (i.e., u[i, j, k] = 0).
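The three stages described above can be sketched as a small base R simulation. This is an illustration only, not code from the occumb package: all parameter values and the read totals N are made up, and the gamma distribution used for r is an assumption about its form (the text above states only that phi governs the variation of r).

```r
set.seed(1)

I <- 3; J <- 4; K <- 2          # species, sites, replicates per site (balanced here)
psi   <- c(0.9, 0.5, 0.2)       # hypothetical site occupancy probabilities
theta <- c(0.8, 0.6, 0.4)       # hypothetical sequence capture probabilities
phi   <- c(5, 1, 0.5)           # hypothetical sequence relative dominance
N     <- matrix(1000L, J, K)    # hypothetical total reads per site-replicate

# Ecological process: z[i, j] ~ Bernoulli(psi[i])
z <- matrix(rbinom(I * J, 1, rep(psi, times = J)), I, J)

u <- array(0L, dim = c(I, J, K))
r <- array(0,  dim = c(I, J, K))
y <- array(0L, dim = c(I, J, K))
for (j in seq_len(J)) {
  for (k in seq_len(K)) {
    # Sequence capture: u[i, j, k] ~ Bernoulli(z[i, j] * theta[i])
    u[, j, k] <- rbinom(I, 1, z[, j] * theta)
    # Assumption: r[i, j, k] ~ Gamma(shape = phi[i], rate = 1)
    r[, j, k] <- rgamma(I, shape = phi, rate = 1)
    w <- u[, j, k] * r[, j, k]
    if (sum(w) > 0) {
      # Sequence allocation: y[1:I, j, k] ~ Multinomial(N[j, k], pi[1:I, j, k])
      pi_jk <- w / sum(w)
      y[, j, k] <- rmultinom(1, N[j, k], pi_jk)[, 1]
    }
  }
}
```

In this sketch a species can have reads only where u[i, j, k] = 1, and u[i, j, k] can be 1 only where z[i, j] = 1, which mirrors the no-false-positive assumption at both observation stages.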
In the figure, the arrows directed at psi[i], theta[i], and phi[i] indicate that the variation of these parameters is governed by a community-level multivariate normal prior distribution with mean Mu and covariance matrix Sigma. Details are discussed below.
psi, theta, and phi
Variation in psi, theta, and phi can be modeled as a function of covariates in a manner similar to generalized linear models (GLMs) and generalized linear mixed models (GLMMs). That is, the covariates are incorporated into linear predictors on the appropriate link scales for the parameters (logit for psi and theta and log for phi). The occumb() function allows for covariate modeling using standard R formula syntax.
There are three types of related covariates: species covariates that can take on different values for each species (e.g., traits), site covariates that can take on different values for each site (e.g., environment), and replicate covariates that can take on different values for each combination of site and replicate (e.g., the amount of water filtered). These covariates can be included in the data object via the spec_cov, site_cov, and repl_cov arguments of the occumbData() function and used to specify models in the occumb() function.
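A minimal sketch of how the three covariate types might be prepared is shown here. The covariate names and values are hypothetical. The call also assumes that occumbData() takes the read count array as an argument named y and takes covariates as named lists; consult the function's help page to confirm. The call is skipped when the package is not installed.

```r
set.seed(1)
I <- 3; J <- 4; K <- 2

# Hypothetical read counts: a species x site x replicate array
y <- array(rpois(I * J * K, lambda = 50), dim = c(I, J, K))

# Covariates as named lists (names are made up for illustration)
spec_cov <- list(speccov1 = rnorm(I))                        # one value per species
site_cov <- list(sitecov1 = rnorm(J), sitecov2 = rnorm(J))   # one value per site
repl_cov <- list(replcov1 = matrix(rnorm(J * K), J, K))      # one value per site x replicate

# Assumed call signature; only run when occumb is available
if (requireNamespace("occumb", quietly = TRUE)) {
  data <- occumb::occumbData(y = y,
                             spec_cov = spec_cov,
                             site_cov = site_cov,
                             repl_cov = repl_cov)
}
```

The shapes follow the definitions above: species covariates are indexed by i, site covariates by j, and replicate covariates by both j and k.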
The occumb() function specifies covariates for each parameter using the formula_<parameter name> and formula_<parameter name>_shared arguments. The formula_<parameter name> and formula_<parameter name>_shared arguments are used to specify species-specific effects and effects shared by all species, respectively. The following table shows examples of modeling psi using the formula_psi and formula_psi_shared arguments where i is the species index, j is the site index, speccov1 is a continuous species covariate, and sitecov1 and sitecov2 are continuous site covariates. psi can be modeled as a function of species and site covariates.

| formula_psi | formula_psi_shared | Linear predictor specified |
|---|---|---|
| ~ 1 | ~ 1 | logit(psi[i]) = gamma[i, 1] |
| ~ sitecov1 | ~ 1 | logit(psi[i, j]) = gamma[i, 1] + gamma[i, 2] * sitecov1[j] |
| ~ sitecov1 + sitecov2 | ~ 1 | logit(psi[i, j]) = gamma[i, 1] + gamma[i, 2] * sitecov1[j] + gamma[i, 3] * sitecov2[j] |
| ~ sitecov1 * sitecov2 | ~ 1 | logit(psi[i, j]) = gamma[i, 1] + gamma[i, 2] * sitecov1[j] + gamma[i, 3] * sitecov2[j] + gamma[i, 4] * sitecov1[j] * sitecov2[j] |
| ~ 1 | ~ speccov1 | logit(psi[i]) = gamma[i, 1] + gamma_shared[1] * speccov1[i] |
| ~ 1 | ~ sitecov1 | logit(psi[i, j]) = gamma[i, 1] + gamma_shared[1] * sitecov1[j] |

In occumb(), species-specific effects on psi are denoted as gamma, and shared effects on psi are denoted as gamma_shared. The first case specifies a default, intercept-only model. logit(psi[i]) is therefore determined only by the intercept term gamma[i, 1]. Note that occumb() always estimates the species-specific intercept gamma[i, 1], as in this simplest case. In the second case, species-specific effects gamma[i, 2] of the site covariate sitecov1 are incorporated. Note here that the site subscript j is added to psi on the left-hand side of the equation because the value of psi now varies from site to site depending on the value of sitecov1[j]. In the third and fourth cases, another site covariate, sitecov2, is specified in addition to sitecov1. As shown in the fourth case, interaction terms can be specified using the * operator.
In the fifth case, the formula_psi_shared argument specifies the shared effect of the species covariate speccov1. Note that the effect gamma_shared[1] of speccov1[i] in the linear predictor does not have subscript i. Because species-specific effects cannot be estimated for species covariates, occumb() accepts species covariates and their interactions only in its formula_<parameter name>_shared argument. Introducing species covariates does not change the dimension of psi (note that it has only the subscript i) but may help reveal variations in site occupancy probability associated with species characteristics.
In the sixth case, the site covariate sitecov1 is specified in the formula_psi_shared argument. In contrast to the second case, sitecov1[j] has a shared effect gamma_shared[1] in the linear predictor. Since species are often expected to respond differently to site characteristics, site covariates are likely to be introduced using the formula_psi argument. Nevertheless, the formula_psi_shared argument can be used when assuming consistent covariate effects across species is reasonable or when the data support doing so.
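To make the correspondence between a formula and its linear predictor concrete, the following base R sketch expands ~ sitecov1 * sitecov2 with model.matrix() and evaluates logit(psi[i, j]) for made-up coefficient values. It illustrates the linear predictor only and makes no claim about how occumb() builds its design matrices internally; the covariate and gamma values are hypothetical.

```r
set.seed(1)
I <- 3; J <- 5
site_cov <- data.frame(sitecov1 = rnorm(J), sitecov2 = rnorm(J))

# Design matrix for a formula such as formula_psi = ~ sitecov1 * sitecov2
X <- model.matrix(~ sitecov1 * sitecov2, data = site_cov)
colnames(X)   # intercept, two main effects, and their interaction

# Hypothetical species-specific effects gamma[i, 1:4], one row per species
gamma <- matrix(rnorm(I * ncol(X)), I, ncol(X))

# logit(psi[i, j]) = sum over columns of gamma[i, ] * X[j, ]
psi <- plogis(gamma %*% t(X))   # I x J matrix of occupancy probabilities
dim(psi)
```

The four columns of X line up with gamma[i, 1] to gamma[i, 4] in the fourth row of the table, and psi gains the site dimension because the covariates vary by site.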
A similar approach can be applied to theta and phi, which can be modeled as a function of species, site, and replicate covariates. The following table shows examples of modeling theta where i is the species index, j is the site index, k is the replicate index, speccov1 is a continuous species covariate, sitecov1 is a continuous site covariate, and replcov1 is a continuous replicate covariate.

| formula_theta | formula_theta_shared | Linear predictor specified |
|---|---|---|
| ~ 1 | ~ 1 | logit(theta[i]) = beta[i, 1] |
| ~ sitecov1 | ~ 1 | logit(theta[i, j]) = beta[i, 1] + beta[i, 2] * sitecov1[j] |
| ~ replcov1 | ~ 1 | logit(theta[i, j, k]) = beta[i, 1] + beta[i, 2] * replcov1[j, k] |
| ~ 1 | ~ speccov1 | logit(theta[i]) = beta[i, 1] + beta_shared[1] * speccov1[i] |
| ~ 1 | ~ sitecov1 | logit(theta[i, j]) = beta[i, 1] + beta_shared[1] * sitecov1[j] |
| ~ 1 | ~ replcov1 | logit(theta[i, j, k]) = beta[i, 1] + beta_shared[1] * replcov1[j, k] |

In occumb(), species-specific effects on theta are denoted as beta, and shared effects on theta are denoted as beta_shared. The first case specifies an intercept-only model. As in the case of psi, occumb() always estimates the species-specific intercept beta[i, 1]. The second and third cases can be contrasted with the second case of the psi example, in which a single covariate is specified in the formula_psi argument, and the remaining cases with the fifth and sixth cases of the psi example, in which a single covariate is specified in the formula_psi_shared argument. Because the replicate covariate replcov1 has both the site index j and the replicate index k, specifying it adds these two indexes to theta.
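As a small illustration of how a replicate covariate adds the j and k indexes, the following base R sketch evaluates the third row of the table, logit(theta[i, j, k]) = beta[i, 1] + beta[i, 2] * replcov1[j, k]. The covariate and beta values are hypothetical.

```r
set.seed(1)
I <- 3; J <- 4; K <- 2
replcov1 <- matrix(rnorm(J * K), J, K)   # replicate covariate, indexed [j, k]
beta <- cbind(rnorm(I), rnorm(I))        # hypothetical beta[i, 1] and beta[i, 2]

# logit(theta[i, j, k]) = beta[i, 1] + beta[i, 2] * replcov1[j, k]
theta <- array(NA_real_, dim = c(I, J, K))
for (i in seq_len(I)) {
  theta[i, , ] <- plogis(beta[i, 1] + beta[i, 2] * replcov1)
}
dim(theta)   # species x site x replicate
```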
The same rule applies to phi. Here is an example of more complex cases involving interactions between different types of covariates.

| formula_phi | formula_phi_shared | Linear predictor specified |
|---|---|---|
| ~ 1 | ~ 1 | log(phi[i]) = alpha[i, 1] |
| ~ sitecov1 * replcov1 | ~ 1 | log(phi[i, j, k]) = alpha[i, 1] + alpha[i, 2] * sitecov1[j] + alpha[i, 3] * replcov1[j, k] + alpha[i, 4] * sitecov1[j] * replcov1[j, k] |
| ~ replcov1 | ~ speccov1 * sitecov1 | log(phi[i, j, k]) = alpha[i, 1] + alpha[i, 2] * replcov1[j, k] + alpha_shared[1] * speccov1[i] + alpha_shared[2] * sitecov1[j] + alpha_shared[3] * speccov1[i] * sitecov1[j] |

In occumb(), species-specific effects on phi are denoted as alpha, and shared effects on phi are denoted as alpha_shared. As with the other two parameters, occumb() always estimates the species-specific intercept alpha[i, 1].
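The number of coefficients implied by a pair of formulas can be checked with base R's terms(). The sketch below uses the formulas from the third row of the phi table and counts terms under the assumption that all covariates are continuous, as in the examples above; a factor covariate would expand into more than one coefficient per term.

```r
formula_phi        <- ~ replcov1
formula_phi_shared <- ~ speccov1 * sitecov1

# Terms that receive species-specific coefficients (in addition to alpha[i, 1])
phi_terms <- attr(terms(formula_phi), "term.labels")
# Terms that receive shared coefficients alpha_shared[1], alpha_shared[2], ...
shared_terms <- attr(terms(formula_phi_shared), "term.labels")

n_alpha        <- 1 + length(phi_terms)   # intercept plus species-specific slopes
n_alpha_shared <- length(shared_terms)    # shared effects only, no shared intercept

phi_terms
shared_terms
```

Here n_alpha is 2 and n_alpha_shared is 3, matching alpha[i, 1:2] and alpha_shared[1:3] in the table.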
The following table summarizes the covariate types accepted by each formula argument.

| Argument | spec_cov | site_cov | repl_cov |
|---|---|---|---|
| formula_phi | | ✓ | ✓ |
| formula_theta | | ✓ | ✓ |
| formula_psi | | ✓ | |
| formula_phi_shared | ✓ | ✓ | ✓ |
| formula_theta_shared | ✓ | ✓ | ✓ |
| formula_psi_shared | ✓ | ✓ | |

A hierarchical prior distribution is specified for the species-specific effects alpha, beta, and gamma. Specifically, the vector of these effects is assumed to follow a multivariate normal distribution, and a prior distribution is specified for the elements of its mean vector Mu and covariance matrix Sigma, whose values are estimated from the data. Because these hyperparameters summarize the variation in species-specific effects at the community level, their estimates may be of interest.
A normal prior distribution with mean 0 and precision prior_prec is specified for each element of Mu. The prior_prec value is determined by the prior_prec argument of the occumb() function, which by default is set to a small value, 1e-4, to specify vague priors. Note that precision is the inverse of variance.
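Because precision is the inverse of variance, the default prior_prec can be converted to a standard deviation with a line of arithmetic. The short R sketch below does this conversion; the variable names other than prior_prec are introduced here for illustration.

```r
prior_prec <- 1e-4               # default precision of the normal prior
prior_var  <- 1 / prior_prec     # variance = 1 / precision
prior_sd   <- sqrt(prior_var)    # standard deviation, about 100

# Central 95% interval of a normal prior with mean 0 and this standard deviation
qnorm(c(0.025, 0.975), mean = 0, sd = prior_sd)
```

A standard deviation of about 100 on the logit or log scale is very wide, which is why this setting acts as a vague prior.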
Sigma is decomposed into elements of standard deviation sigma and correlation coefficient rho, each specified with a different vague prior. Specifically, a uniform prior distribution with a lower limit of 0 and an upper limit of prior_ulim is specified for sigma, and a uniform prior with a lower limit of −1 and an upper limit of 1 is set for rho. The value of prior_ulim is determined by the prior_ulim argument of the occumb() function and is set to 1e4 by default.
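The decomposition of a covariance matrix into standard deviations and correlation coefficients can be illustrated in base R. The sketch below rebuilds a 3 x 3 Sigma (one row each for alpha, beta, and gamma in an intercept-only model) from made-up sigma and rho values; how occumb() stores and indexes rho is not shown here.

```r
sigma <- c(1.0, 0.5, 2.0)        # hypothetical standard deviations
R <- diag(3)                     # correlation matrix with ones on the diagonal
R[1, 2] <- R[2, 1] <- 0.3        # hypothetical rho for the pair (1, 2)
R[1, 3] <- R[3, 1] <- -0.2       # hypothetical rho for the pair (1, 3)
R[2, 3] <- R[3, 2] <- 0.1        # hypothetical rho for the pair (2, 3)

# Sigma[a, b] = rho[a, b] * sigma[a] * sigma[b]
Sigma <- diag(sigma) %*% R %*% diag(sigma)

# The decomposition is reversible
sqrt(diag(Sigma))    # recovers sigma
cov2cor(Sigma)       # recovers the correlation matrix
```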
For each of the shared effects alpha_shared, beta_shared, and gamma_shared, a normal prior distribution with mean 0 and precision prior_prec is specified.
The latent variables and parameters of the model that are estimated and saved by the occumb() function are as follows (note that occumb() does not save u and r, but saves pi, which is a function of them). Posterior samples of these latent variables and parameters can be accessed using the get_post_samples() or get_post_summary() functions.
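- z: Site occupancy status of species.
- pi: Multinomial cell probabilities of species sequence read counts.
- phi: Relative dominance of species sequences.
- theta: Sequence capture probabilities of species.
- psi: Site occupancy probabilities of species.
- alpha: Species-specific effects on sequence relative dominance (phi).
- beta: Species-specific effects on sequence capture probabilities (theta).
- gamma: Species-specific effects on site occupancy probabilities (psi).
- alpha_shared: Effects on sequence relative dominance (phi) that are common across species.
- beta_shared: Effects on sequence capture probabilities (theta) that are common across species.
- gamma_shared: Effects on site occupancy probabilities (psi) that are common across species.
- Mu: Community-level means of the species-specific effects (alpha, beta, gamma).
- sigma: Standard deviations of the species-specific effects (alpha, beta, gamma).
- rho: Correlation coefficients of the species-specific effects (alpha, beta, gamma).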