---
title: "3 Extensions of the basic model structure"
author: "Jan-Ole Fischer"
output: rmarkdown::html_vignette
# output: pdf_document
vignette: >
  %\VignetteIndexEntry{Model extensions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: refs.bib
link-citations: yes
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  # fig.path = "img/",
  fig.align = "center",
  fig.dim = c(8, 6),
  out.width = "85%"
)
```

> Before diving into this vignette, we recommend reading the vignettes [**Introduction to LaMa**](https://janolefi.github.io/LaMa/articles/Intro_to_LaMa.html) and [**Automatic differentiation via RTMB**](https://janolefi.github.io/LaMa/articles/LaMa_and_RTMB.html).

In this vignette, we discuss several extensions of the basic model structure, assuming you know how to fit basic models using automatic differentiation.



# Multiple data streams

It is often the case that we observe more than one time series of interest, e.g. in animal movement we often have both step lengths and turning angles. If we assume *contemporaneous conditional independence*, i.e. that the observations at each time $t$ are independent of each other given the underlying state $S_t$, modelling such data is straightforward. 
Inside the likelihood, this simply means calculating the state-dependent densities for each data stream and multiplying them, which is justified by the conditional independence assumption. Let's see this in the `trex` example:

```{r setup and data}
library(LaMa)

head(trex, 5)
```

We now include the variable `angle` to incorporate additional information about the hidden state process. For this variable, we assume a state-dependent *von Mises* distribution, which can be thought of as an analogue of the normal distribution on the unit circle. It is parameterised by a mean direction $\mu \in [-\pi, \pi]$ and a concentration parameter $\kappa > 0$. For movement analyses, the state-dependent angle means are usually assumed to be zero, hence we fix them at zero.
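
To build some intuition, the von Mises density used by `dvm()` can be written down in a few lines of base R. This is only an illustrative sketch; `dvm_manual()` is a hypothetical helper and not part of `LaMa`:

```{r von mises sketch}
# von Mises density: f(x) = exp(kappa * cos(x - mu)) / (2 * pi * I_0(kappa)),
# where I_0 is the modified Bessel function of the first kind of order zero
dvm_manual = function(x, mu = 0, kappa = 1) {
  exp(kappa * cos(x - mu)) / (2 * pi * besselI(kappa, nu = 0))
}
# the density integrates to one over (-pi, pi]
integrate(function(x) dvm_manual(x, 0, 2), -pi, pi)$value
```
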
The likelihood function is easily modified to incorporate the turning angle:

```{r nll multiple data streams}
nll_multi = function(par) {
  getAll(par, dat)
  Gamma = tpm(eta)
  delta = stationary(Gamma)
  # Parameter transformations for strictly positive parameters
  mu = exp(log_mu); REPORT(mu)
  sigma = exp(log_sigma); REPORT(sigma)
  # additional state-dependent turning angle concentration parameter
  kappa = exp(log_kappa); REPORT(kappa)
  # Calculating all state-dependent densities
  allprobs = matrix(1, nrow = length(step), ncol = N)
  ind = which(!is.na(step) & !is.na(angle)) # only for non-NA obs.
  for(j in 1:N){
    allprobs[ind,j] = dgamma2(step[ind], mu[j], sigma[j]) * 
      dvm(angle[ind], 0, kappa[j]) # simply multiplying the state-dep. densities
  }
  -forward(delta, Gamma, allprobs) # simple forward algorithm
}
```

We now simply have to include initial values for the turning angle concentration parameters and fit the model:

```{r multiple streams fit}
N = 2 # number of hidden states

par = list(
  eta = rep(-2, N*(N-1)),                   # (logit) initial t.p.m. parameters
  log_mu = log(seq(0.3, 1, length=N)),      # (log) initial means for step length
  log_sigma = log(seq(0.2, 0.7, length=N)), # (log) initial sds for step length 
  log_kappa = log(seq(0.2, 0.4, length=N))  # (log) initial concentration for angle
)    

dat = list(
  step = trex$step,   # hourly step lengths
  angle = trex$angle, # hourly turning angles
  N = N
)

# Fitting the model
obj_multi = MakeADFun(nll_multi, par, silent = TRUE)
opt_multi = nlminb(obj_multi$par, obj_multi$fn, obj_multi$gr)
mod_multi = report(obj_multi) # reporting from fitted model
```

We can now plot the estimated state-dependent densities separately for each state:

```{r state dist multi, fig.width = 8, fig.height = 4}
# extracting estimated parameters
delta = mod_multi$delta
mu = mod_multi$mu
sigma = mod_multi$sigma
kappa = mod_multi$kappa

# defining color vector
color = c("orange", "deepskyblue")

oldpar = par(mfrow = c(1,2))

# step length
hist(trex$step, prob = TRUE, breaks = 40, 
     bor = "white", main = "", xlab = "step length")
for(j in 1:N) curve(delta[j] * dgamma2(x, mu[j], sigma[j]), lwd = 2, add = TRUE, col = color[j])
curve(delta[1]*dgamma2(x, mu[1], sigma[1]) + delta[2]*dgamma2(x, mu[2], sigma[2]), 
      lwd = 2, lty = 2, add = TRUE)
legend("top", lwd = 2, col = c(color, 1), lty = c(1,1,2), 
       legend = c("state 1", "state 2", "marginal"), bty = "n")

# turning angle
hist(trex$angle, prob = TRUE, breaks = seq(-pi, pi, length = 20), 
     bor = "white", main = "", xlab = "turning angle")
for(j in 1:N) curve(delta[j] * dvm(x, 0, kappa[j]), lwd = 2, add = TRUE, col = color[j])
curve(delta[1] * dvm(x, 0, kappa[1]) + delta[2] * dvm(x, 0, kappa[2]), 
      lwd = 2, lty = 2, add = TRUE)
par(oldpar)
```


# Covariate effects

In many empirical applications, either the state-process model or the state-dependent-process model depends on covariates. For example, in animal movement, the transition probabilities between states might depend on environmental covariates such as temperature or vegetation cover. We first look at covariate effects in the state process and then at covariate effects in the state-dependent process; the latter is typically called Markov-switching regression.

## Covariate effects in the state process

Covariate effects in the state process are usually modelled by expressing the transition probability matrix $\boldsymbol{\Gamma}$ as a function of covariates.
Let $\boldsymbol{z}_t$ denote a vector of covariates of length $p+1$ for time points $t = 1, \dots, T$, where the first entry is always equal to $1$ to include an intercept. Moreover, let $\boldsymbol{\beta}_{ij}$ be a vector of regression coefficients, also of length $p+1$ for each off-diagonal element ($i \neq j$) of the t.p.m. First, consider forming linear predictors
$$
\eta_{ij}^{(t)} = \boldsymbol{\beta}_{ij}^\top \boldsymbol{z}_t = \beta_{ij0} + \beta_{ij1} z_{t1} + \dots + \beta_{ijp} z_{tp},
$$
for $t = 1, \dots, T$. In order to link transition probabilities to these predictors, we need to ensure that the former are in the interval $(0,1)$, and that each row of the transition matrix sums to one. This is achieved by using the inverse multinomial logistic link function (softmax), which gives us
$$
\gamma_{ij}^{(t)} = \Pr(S_t = j \mid S_{t-1} = i) = \frac{\exp(\eta_{ij}^{(t)})}{\sum_{k=1}^N \exp(\eta_{ik}^{(t)})},
$$
where $\eta_{ii}$ is set to zero for $i = 1, \dots, N$ for identifiability and $N$ is the number of hidden states. We can do this computation with `LaMa` by providing a $T \times p$ covariate design matrix `Z` and an $N(N-1) \times (p+1)$ coefficient matrix `beta` to `tpm()`. The function then computes the associated transition matrix for each time point, which is returned as an array of dimension $N \times N \times T$. Note that `tpm()` handles the intercept automatically: if `Z` does not already contain a leading column of ones, it is added internally, so `Z` only needs to contain the actual covariates.
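
To make the link concrete, the computation for a single time point can be sketched in a few lines of base R (assuming the column-wise fill that `tpm()` uses by default; `tpm_manual()` is a hypothetical helper, not part of `LaMa`):

```{r softmax sketch}
# inverse multinomial logit (softmax) link by hand:
# diagonal predictors are fixed at zero, off-diagonals hold the eta_ij
tpm_manual = function(eta, N) {
  Gamma = diag(N)             # exp(0) = 1 on the diagonal
  Gamma[!diag(N)] = exp(eta)  # off-diagonal elements, filled by column
  Gamma / rowSums(Gamma)      # normalise so that each row sums to one
}
tpm_manual(rep(-2, 2), N = 2)
```
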

<!-- At this point we want to point out that the definition of the transition probabilities is not necessarily unique. Indeed for data points at times $1, \dots, T$ we only need $T-1$ transition probability matrices. The definition above means that the transition probability between $t-1$ and $t$ depends on the covariate values at time point $t$, but we could also have defined -->
<!-- $$ -->
<!-- \gamma_{ij}^{(t)} = \Pr(S_{t+1} = j \mid S_t = i). -->
<!-- $$ -->
<!-- We want to point out that these two specifications are **not** equivalent. For HMMs there is no established convention, so this choice needs to be made by users and can be important when the exact timing of the covariate effect is relevant. In `LaMa` this comes down to either passing the design matrix excluding its first or last row to `tpm_g()`, where we use the first option in this vignette. If you forget to exclude the first or the last row of the design matrix when calculating all transition matrices, and pass an array of dimension `c(N,N,T)` to `forward_g()` for likelihood evaluation, the function will revert to the first option by just ignoring the first slice of the array. -->

### Simulation example

Let's start by simulating data from the above specified model, assuming $N = 2$ states and Gaussian state-dependent distributions.
The covariate effects for the state process are fully specified by a parameter matrix `beta` of dimension `c(N*(N-1), p+1)`. By default the function `tpm()` will fill the off-diagonal elements of each transition matrix by column, which can be changed by setting `byrow = TRUE`. The latter is useful, as popular HMM packages like `moveHMM` or `momentuHMM` return the parameter matrix such that the t.p.m. needs to be filled by row.

```{r parameters}
# parameters
mu = c(5, 20)   # state-dependent means
sigma = c(4, 5) # state-dependent standard deviations

# state process regression parameters
beta = matrix(c(-2, -2,       # intercepts
                -1, 1.5,      # linear effects
                0.25, -0.5),  # quadratic effects
              nrow = 2)

n = 1000 # number of observations
set.seed(123)
z = rnorm(n)
Z = cbind(z, z^2) # quadratic effect of z
Gamma = tpm(beta, Z) # of dimension c(2, 2, n)
delta = c(0.5, 0.5) # non-stationary initial distribution


# Let's plot the covariate effects that we simulated:
z_p = seq(-2,2, length = 100) # prediction grid for covariate plot
Z_p = cbind(z_p, z_p^2) # prediction matrix
Gamma_p = tpm(beta, Z_p)
plot(z_p, Gamma_p[1,2,], type = "l", lwd = 3, bty = "n", ylim = c(0,1), 
     xlab = "z", ylab = expression(gamma[ij]), col = color[1])
lines(z_p, Gamma_p[2,1,], lwd = 3, col = color[2])
```

Let's now simulate synthetic data from the above specified model.

```{r data}
s = rep(NA, n)
s[1] = sample(1:2, 1, prob = delta) # sampling first state from initial distribution
for(t in 2:n){
  # sampling next state conditional on previous one with tpm at that time point
  s[t] = sample(1:2, 1, prob = Gamma[s[t-1],,t])
}
# sampling observations conditional on the states
x = rnorm(n, mu[s], sigma[s])

plot(x[1:200], bty = "n", pch = 20, ylab = "x", 
     col = c(color[1], color[2])[s[1:200]])
```


### Fitting an HMM to the data

In order to fit the model to the simulated data, we specify the likelihood function, pretending to know the polynomial degree of the effect of $z$ on the transition probabilities, which in reality we would have to determine based on model selection criteria.

Note that because the Markov chain is now inhomogeneous — its transition probabilities change over time — it does not have a single stationary distribution. We therefore cannot use `stationary()` to fix the initial distribution and instead have to estimate it. We do this by adding a single unconstrained parameter `logit_delta` to `par` and constructing the initial distribution from it inside the likelihood.
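
This construction generalises directly to $N$ states using $N-1$ unconstrained parameters. A minimal sketch (`to_delta()` is a hypothetical helper, not part of `LaMa`):

```{r delta sketch}
# map N-1 unconstrained parameters to an N-state initial distribution;
# the first entry is fixed at exp(0) = 1 for identifiability
to_delta = function(logit_delta) {
  delta = c(1, exp(logit_delta))
  delta / sum(delta)
}
to_delta(c(0, 0)) # three states, all probabilities equal to 1/3
```
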

```{r mllk}
nll_cov = function(par) {
  getAll(par, dat)
  Gamma = tpm(beta, Z)
  # unconstrained parameterisation of a 2-state initial distribution
  delta = c(1, exp(logit_delta)) # make > 0
  delta = delta / sum(delta) # distribution -> needs to sum to 1
  sigma = exp(log_sigma); REPORT(sigma); REPORT(mu)
  # calculate all state-dependent probabilities
  allprobs = matrix(1, length(x), N)
  for(j in 1:N) allprobs[,j] = dnorm(x, mu[j], sigma[j])
  # forward algorithm
  -forward(delta, Gamma, allprobs)
}
```

Using automatic differentiation with `RTMB`, we can very conveniently specify `beta` as a parameter matrix in the desired shape. Model fitting is also extremely quick:

```{r model, warning=FALSE}
par = list(
  beta = cbind(rep(-2, 2), matrix(0, 2, 2)), # initialising with slopes = 0
  logit_delta = 0,                           # starting value for initial distribution
  mu = c(4, 14),                             # initial state-dependent means
  log_sigma = c(log(3), log(5))              # initial state-dependent sds
)
dat <- list(
  x = x, 
  Z = Z,
  N = 2
)

obj_cov = MakeADFun(nll_cov, par, silent = TRUE)
system.time(
  opt <- nlminb(obj_cov$par, obj_cov$fn, obj_cov$gr)
)
mod_cov <- report(obj_cov) # reporting from fitted model
```

Let's plot the fitted model, starting with the estimated state-dependent densities. As the model is not stationary, we approximate the overall state occupancy probabilities by the relative frequencies of the decoded states.

```{r mod_cov_densities}
mod_cov$states = viterbi(mod = mod_cov)
delta = prop.table(table(mod_cov$states))

# get state-dependent parameters
mu = mod_cov$mu
sigma = mod_cov$sigma

hist(x, prob = TRUE, bor = "white", breaks = 20, main = "", ylim = c(0, 0.06))
curve(delta[1] * dnorm(x, mu[1], sigma[1]), add = TRUE, lwd = 3, col = color[1])
curve(delta[2] * dnorm(x, mu[2], sigma[2]), add = TRUE, lwd = 3, col = color[2])
curve(delta[1] * dnorm(x, mu[1], sigma[1]) +
        delta[2] * dnorm(x, mu[2], sigma[2]),
      add = TRUE, lwd = 3, lty = "dashed")
legend("topright", col = c(color[1], color[2], "black"), lwd = 3, bty = "n",
       lty = c(1,1,2), legend = c("state 1", "state 2", "marginal"))
```

In order to plot the estimated effect with 95 percent confidence intervals, we can use `MCreport()` to get samples of `beta` from the approximate normal distribution of the MLE and then use `tpm()` to get the associated transition probabilities on our prediction grid.

```{r cov_visualization}
samples = MCreport(obj_cov)

# compute samples
g12 = sapply(samples$beta, function(b) tpm(b, Z_p)[1,2,]) # gamma_12 
g21 = sapply(samples$beta, function(b) tpm(b, Z_p)[2,1,]) # gamma_21
# compute quantiles
g12.q = apply(g12, 1, quantile, probs = c(0.025, 0.975))
g21.q = apply(g21, 1, quantile, probs = c(0.025, 0.975))

# at MLE
Gamma_p = tpm(mod_cov$beta, Z_p) # predicted tpm

# plot fitted state-process regression
plot(NA, bty = "n", ylim = c(0,1), xlim = c(-2,2), 
     xlab = "z", ylab = "Transition probability")

# confidence interval
polygon(c(z_p, rev(z_p)), c(g12.q[1,], rev(g12.q[2,])), col = adjustcolor(color[1], 0.3), border = NA)
# MLE
lines(z_p, Gamma_p[1,2,], lwd = 3, col = color[1])

# confidence interval
polygon(c(z_p, rev(z_p)), c(g21.q[1,], rev(g21.q[2,])), col = adjustcolor(color[2], 0.3), border = NA)
# MLE
lines(z_p, Gamma_p[2,1,], lwd = 3, col = color[2])

legend("top", col = color, legend = c(expression(gamma[12]), expression(gamma[21])), lwd = 3, bty = "n")
```


## Covariate effects in the state-dependent process

We now look at a setting where covariates influence the mean of the state-dependent distribution, while the state switching is controlled by a homogeneous Markov chain. This is often called **Markov-switching regression**. Assuming the observation process to be conditionally normally distributed, this means

$$
X_t \mid S_t = j \sim N(\beta_j^\top z_t, \: \sigma_j^2), \quad j = 1, \dots, N.
$$

### Simulation example

First we specify parameters for the simulation. The important change here is that `beta` now contains the regression coefficients for the state-dependent regressions.

```{r parameters2}
sigma = c(1, 1) # state-dependent standard deviations (homoscedasticity)

# parameter matrix
# each row contains the parameter vector for the corresponding state
beta = matrix(c(8, 10,      # intercepts
                -2, 1,      # linear effects
                0.5, -0.5), # quadratic effects
              nrow = 2)

n = 1000 # number of observations
set.seed(123)
z = rnorm(n)
Z = cbind(z, z^2) # quadratic effect of z

Gamma = matrix(c(0.9, 0.1, 0.05, 0.95), 
               nrow = 2, byrow = TRUE) # homogeneous t.p.m.
delta = stationary(Gamma) # stationary Markov chain
```
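
As an aside, `stationary()` solves the system $\boldsymbol{\delta} = \boldsymbol{\delta} \boldsymbol{\Gamma}$ subject to the entries of $\boldsymbol{\delta}$ summing to one. A base-R sketch of one standard way to do this (replacing a redundant equation by the sum constraint) is:

```{r stationary sketch}
Gamma_demo = matrix(c(0.9, 0.1, 0.05, 0.95), nrow = 2, byrow = TRUE)
N_demo = nrow(Gamma_demo)
# delta %*% Gamma = delta and sum(delta) = 1 as one linear system:
# delta %*% (I - Gamma + U) = (1, ..., 1), with U a matrix of ones
delta_manual = solve(t(diag(N_demo) - Gamma_demo + 1), rep(1, N_demo))
delta_manual # (1/3, 2/3) for this t.p.m.
```
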

### Simulation

In the simulation code, the state-dependent mean is no longer fixed but changes with the covariate values in `Z`.

```{r data2}
s = x = rep(NA, n)
s[1] = sample(1:2, 1, prob = delta)
x[1] = rnorm(1, beta[s[1],]%*%c(1, Z[1,]), # state-dependent regression
                    sigma[s[1]])
for(t in 2:n){
  s[t] = sample(1:2, 1, prob = Gamma[s[t-1],])
  x[t] = rnorm(1, beta[s[t],] %*% c(1, Z[t,]), # state-dependent regression
                      sigma[s[t]])
}

oldpar = par(mfrow = c(1,2))
plot(x[1:400], bty = "n", pch = 20, ylab = "x", 
     col = c(color[1], color[2])[s[1:400]])

plot(z[which(s==1)], x[which(s==1)], pch = 16, col = color[1], bty = "n", 
     ylim = c(0,15), xlab = "z", ylab = "x")
points(z[which(s==2)], x[which(s==2)], pch = 16, col = color[2])
par(oldpar)
```

### Writing the negative log-likelihood function

In the likelihood function, we also add the state-dependent regression in the loop calculating the state-dependent probabilities. The code `cbind(1, Z) %*% t(beta)` computes the linear predictors for states 1 and 2 all at once and thus returns a matrix of dimension $T \times 2$.
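
As a quick standalone check of the dimensions (with simulated demo inputs, not the model objects used below):

```{r predictor dims sketch}
Z_demo = cbind(rnorm(5), rnorm(5)^2)                     # T = 5, p = 2
beta_demo = matrix(c(8, 10, -2, 1, 0.5, -0.5), nrow = 2) # one row per state
Mu_demo = cbind(1, Z_demo) %*% t(beta_demo)              # linear predictors
dim(Mu_demo) # T x N = 5 x 2
```
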

```{r mllk3}
nll_msr = function(par) {
  getAll(par, dat)
  Gamma = tpm(eta)
  delta = stationary(Gamma)
  sigma = exp(log_sigma); REPORT(sigma)
  # calculate all state-dependent probabilities
  allprobs = matrix(1, length(x), 2)
  # state-dependent regression
  REPORT(beta)
  Mu = cbind(1,Z) %*% t(beta) # Z %*% beta[j,] all at once
  for(j in 1:2) allprobs[,j] = dnorm(x, Mu[,j], sigma[j])
  # forward algorithm
  -forward(delta, Gamma, allprobs)
}
```

### Fitting a Markov-switching regression model
```{r model3, warning=FALSE}
par = list(
  eta = rep(-2, 2),                        # starting values state process
  beta = cbind(c(10, 10), matrix(0, 2, 2)), # starting values for regression
  log_sigma = c(log(1), log(1))            # starting values for sigma
)
dat = list(
  x = x,
  Z = Z
)

obj_msr = MakeADFun(nll_msr, par, silent = TRUE)
opt_msr = nlminb(obj_msr$par, obj_msr$fn, obj_msr$gr)

mod_msr = report(obj_msr)
```

### Visualising results

To visualise the results, we extract the estimated parameters from the fitted model and add the two estimated state-specific regression curves to the scatter plot.

```{r visualization3}
Gamma_hat_reg = mod_msr$Gamma
delta_hat_reg = mod_msr$delta
sigma_hat_reg = mod_msr$sigma
beta_hat_reg = mod_msr$beta

# we have some label switching
plot(z, x, pch = 16, bty = "n", xlab = "z", ylab = "x", col = color[s])
points(z, x, pch = 20)
curve(beta_hat_reg[1,1] + beta_hat_reg[1,2]*x + beta_hat_reg[1,3]*x^2, 
      add = TRUE, lwd = 4, col = color[2])
curve(beta_hat_reg[2,1] + beta_hat_reg[2,2]*x + beta_hat_reg[2,3]*x^2, 
      add = TRUE, lwd = 4, col = color[1])
```


# A special covariate: Seasonality

A notable special case of covariate effects is *seasonality*. For example, the transition probabilities might vary with the time of day or the day of year. Functions of such covariates need to fulfil the boundary condition that their values at the beginning and end of a cycle match. In practice, this is usually achieved with a *trigonometric basis expansion* which, in its simplest form, for a cycle length of e.g. $24$, looks like:

$$
\eta^{(t)}_{ij} = \beta_0^{(ij)} + \beta_{1}^{(ij)} \sin \bigl(\frac{2\pi \, \text{time}_t}{24} \bigr) + \beta_{2}^{(ij)} \cos\bigl(\frac{2 \pi \, \text{time}_t}{24}\bigr).
$$
While this is a non-linear function in the variable *time*, it is still linear in the model parameters. Hence, the sine and cosine expressions can also be precomputed and stored in a design matrix. In `LaMa`, this can be done using the `cosinor()` function, e.g.

```{r cosinor}
Z_tod = cosinor(1:24, period = 24)
head(Z_tod)
```
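
For intuition, an equivalent basis can also be built by hand. The sketch below should match the `cosinor()` output up to column names and ordering (an assumption, as these details may differ):

```{r trig basis sketch}
tod_demo = 1:24
Z_manual = cbind(sin(2 * pi * tod_demo / 24), cos(2 * pi * tod_demo / 24))
# periodicity: the basis at tod = 24 equals the basis at tod = 0,
# so the transition probabilities match at the cycle boundary
Z_manual[24, ]
```
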

If we now express the transition matrix as a function of the linear predictors $\eta^{(t)}_{ij}$ above, the Markov chain has a special property. Its transition probabilities are *periodically varying*, and hence for all $t$
$$
\boldsymbol{\Gamma}^{(t+L)} = \boldsymbol{\Gamma}^{(t)},
$$
where $L$ is the cycle length. As shown by @koslik2023inference, such chains exhibit a special periodic behaviour. In particular, they converge to a unique periodically stationary state distribution. This distribution, giving the probability of the process being in each state at a particular time point in the cycle, can be computed using the function `stationary_p()`. It expects a t.p.m. array whose third dimension equals the cycle length and returns a matrix in which each row gives the state distribution at the corresponding time point in the cycle.
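
To see what `stationary_p()` computes, here is a base-R sketch (`stationary_p_manual()` is a hypothetical helper, and the convention for where the cyclic product starts is an assumption): the distribution at cycle time $t$ is taken as the stationary distribution of the product $\boldsymbol{\Gamma}^{(t)} \boldsymbol{\Gamma}^{(t+1)} \cdots \boldsymbol{\Gamma}^{(t+L-1)}$.

```{r stationary p sketch}
stationary_p_manual = function(Gamma) {
  L = dim(Gamma)[3]; N = dim(Gamma)[1]
  Delta = matrix(NA, L, N)
  for (t in 1:L) {
    # cyclic product of all L t.p.m.s, starting at time t
    P = diag(N)
    for (k in c(t:L, seq_len(t - 1))) P = P %*% Gamma[,,k]
    # stationary distribution of P: delta %*% (I - P + U) = (1, ..., 1)
    Delta[t, ] = solve(t(diag(N) - P + 1), rep(1, N))
  }
  Delta
}
```
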

Let's try out everything we learned using the `trex` data set as an example. It has a time of day (`tod`) covariate:

```{r trex data set}
head(trex, 6)
```

As usual, we set initial parameter values. In the `dat` list, we include the trigonometric design matrix as well as the integer time-of-day variable.

```{r parameters seasonal}
par = list(
  log_mu = log(c(0.3, 1)),      # initial means for step length
  log_sigma = log(c(0.2, 0.7)), # initial sds for step length
  beta = cbind(c(-2,-2), matrix(0, 2, 2)) # initial t.p.m. parameters
)    

dat = list(
  step = trex$step,   # hourly step lengths
  N = 2,
  Z_tod = Z_tod,
  tod = trex$tod
)
```

In the likelihood function, we compute only the 24 unique transition matrices and then, based on these, the corresponding periodically stationary distributions using `stationary_p()`.
When running the forward algorithm in the last line, we then have to (i) pick the stationary distribution corresponding to the first time-of-day value (`Delta[tod[1],]`) and (ii) build an array of t.p.m.s for the entire time series by indexing the 24 matrices via `Gamma[,,tod]`.

```{r nll seasonal}
nll_seas = function(par) {
  getAll(par, dat)
  Gamma = tpm(beta, Z_tod)
  Delta = stationary_p(Gamma) # periodically stationary distribution
  ADREPORT(Delta)
  # Parameter transformations for strictly positive parameters
  mu = exp(log_mu); REPORT(mu)
  sigma = exp(log_sigma); REPORT(sigma)
  # Calculating all state-dependent densities
  allprobs = matrix(1, nrow = length(step), ncol = N)
  ind = which(!is.na(step)) # only for non-NA obs.
  for(j in 1:N){
    allprobs[ind,j] = dgamma2(step[ind], mu[j], sigma[j])
  }
  -forward(Delta[tod[1],], Gamma[,,tod], allprobs)
}
```

Let's fit the model:
```{r seasonal fit}
obj_seas = MakeADFun(nll_seas, par, silent = TRUE)
opt_seas = nlminb(obj_seas$par, obj_seas$fn, obj_seas$gr)
sdr_seas = sdreport(obj_seas)
mod_seas = report(obj_seas)
```

And look at the estimated periodically stationary distribution:
```{r p stationary}
Delta = as.list(sdr_seas, "Est", report = TRUE)$Delta
Delta_sd = as.list(sdr_seas, "Std", report = TRUE)$Delta

# Only plot for state 2 (state 1 is just 1-delta_2)
id = c(24, 1:24) # copy value at 24 to 0 for nicer plot
plot(0:24, Delta[id,2], type = "b", col = "deepskyblue", bty = "n", lwd = 2, 
     pch = 16, ylim = c(0,1), xlim = c(0, 24), 
     xlab = "time of day", ylab = "Pr(state 2)", xaxt = "n")
axis(1, at = c(0, 6, 12, 18, 24), labels = c(0, 6, 12, 18, 24))
polygon(c(0:24, 24:0), 
        c(Delta[id,2] + 2*Delta_sd[id,2], rev(Delta[id,2] - 2*Delta_sd[id,2])),
        col = adjustcolor("deepskyblue", 0.2), border = FALSE)
```



> Continue reading with [**Longitudinal data**](https://janolefi.github.io/LaMa/articles/Longitudinal_data.html) or [**Penalised splines**](https://janolefi.github.io/LaMa/articles/Penalised_splines.html).



# References
