Modeling Cases of Mosquito-Borne Disease Using Persistent Homology

Background

In 2015, the Zika virus (ZIKV) was first introduced and spread across Brazil due to a plethora of Aedes aegypti mosquitoes.¹ This species transmits ZIKV to humans, which can cause mild symptoms or even severe health conditions such as Guillain-Barré Syndrome and microcephaly.² ³ Lo and Park (2018) modeled the spread of ZIKV using linear regression, while comparing results between two models — one with predictors calculated from persistent homology (PH) and one without.⁴ They show that using the number of 0-dimensional and 1-dimensional homological features and the maximum lifetime (persistence) of any 1-dimensional feature can better predict the number of ZIKV cases in Brazil.

Here, we demonstrate how to use topological features to model disease transmission by A. aegypti, similar to the ZIKV model built by Lo and Park.

Methods

To imitate the model from Lo and Park with available data, we would predict the number of ZIKV cases confirmed in each state using linear regression. However, data points of A. aegypti mosquitoes in Brazil are only available for the year 2013, while documented ZIKV cases are available only since 2015. Instead of using the number of ZIKV cases as the response variable in our regression model, we obtained the number of Dengue cases in Brazil for the year 2013,⁵ since Dengue is also transmitted by A. aegypti mosquitoes. Predictions of number of cases in each state will be made using average temperatures, precipitation levels,⁶ and human population densities.⁷ ⁸ Notice that the source for temperature data provided by Lo and Park is no longer available. The accessible dataset for temperatures and precipitations are averaged throughout multiple years, thus it may not be completely accurate for 2013.

Besides these state-level attributes, we will also utilize spatial information on A. aegypti occurrences to model the disease.⁹ We include the number of 0-dimensional features (H0) as a predictor since a higher value of H0 indicates that a state harbors more mosquitoes. We also include the number of 1-dimensional features (H1) in our model, as we interpret a larger value of H1 to mean that A. aegypti mosquitoes are more spread out. In addition, extending the analysis of Lo and Park, we include the maximum lifetime (H1ML) and the median lifetime (H1MD) of the 1-dimensinoal features, which we expect give additional insight into the statewide connectivity of A. aegypti distributions. A high H1ML indicates mosquitoes have a wider spread and are less clustered at a small number of specific regions. We are also curious about whether the median of H1 lifetime can improve upon the effect of H1ML demonstrated by Lo and Park.

Datasets

First, we load the aegypti dataset containing geographic locations of A. aegypti mosquitoes. Also, we load the case_predictors dataset to obtain human population density, average temperature, precipitation level, and the number of cases in each state.

We will use state abbreviations to join elements from these two data sets and from the PH calculations. The vector stateAbbSort, de-duplicated from the A. aegypti dataset, provides a convenient tool for these tasks.

data(aegypti)
data(case_predictors)

stateAbbSort <- sort(aegypti$state_code[!duplicated(aegypti$state_code)])

We will incrementally add the predictive variables to the data frame caseModel.

caseModel <- data.frame()

Persistent homology

Before building the model, the following code provides an example of using Vietoris-Rips filtration on the mosquito occurrence patterns for a single state. First, we retrieve occurrence coordinates in the state of Acre (AC) from the A. aegypti dataset. Applying the vietoris_rips function to these coordinates, we receive a data frame that contains the birth and death information of every 0-dimensional (H0) and 1-dimensional (H1) feature. We plot these persistence data in the following graph, where the horizontal and vertical axes represent the birth and death of each feature, respectively.

AC_coord <- aegypti[aegypti$state_code == "AC", c("x", "y"), drop = FALSE]
AC_rips <- vietoris_rips(AC_coord) ##filtration

plot.new()
max_finite_death <- max(AC_rips$death[is.finite(AC_rips$death)])
plot.window(
  xlim = c(0, max_finite_death),
  ylim = c(0, max_finite_death),
  asp = 1
)
axis(1L)
axis(2L)
abline(a = 0, b = 1)
points(AC_rips[AC_rips$dimension == 0L, c("birth", "death")], pch = 16L)
points(AC_rips[AC_rips$dimension == 1L, c("birth", "death")], pch = 17L)

For reference, we can also plot the mosquito occurrences based on their latitudes and longitudes in the aegypti dataset. Note, however, that the plot is not exactly faithful to the geography, since the distances between longitude lines vary from location to location. This also means that our PH calculations are not exactly right. The same limitation applies to the analysis of Lo and Park.

plot(x = AC_coord$x, y = AC_coord$y, asp = 1, xlab = "X", ylab = "Y",
     main = "Aedes Aegypti Occurrences in AC")

The following function takes in a single state’s persistence data and returns a vector consisting of the values H0, H1, H1ML, and H1MD.

topologicalFeatures <- function(state_rips){
  numH0 <- length(which(state_rips$dimension == 0))
  numH1 <- length(which(state_rips$dimension == 1))
  if (numH1 != 0) {
    maxPerst <- max(state_rips$persistence[(numH0 + 1) : (numH0 + numH1)])
    medianPerst <- median(state_rips$persistence[(numH0 + 1) : (numH0 + numH1)])
    
  } else {
    maxPerst <- 0
    medianPerst <- 0
  }
  return(c(numH0, numH1, medianPerst, maxPerst))
}

Iterating over all the states in Brazil, we bind every state’s topological features to caseModel.

for(val in stateAbbSort) {
  state_coord <- aegypti[aegypti$state_code == val, c("x", "y"), drop = FALSE]
  state_rips <- vietoris_rips(state_coord)
  state_rips$persistence<-(state_rips$death - state_rips$birth)
  
  features <- topologicalFeatures(state_rips)
  caseModel <- rbind(
    caseModel,
    c(features[1], features[2], features[3], features[4])
  )
}
colnames(caseModel) <-  c("H0", "H1", "H1MD","H1ML")
rownames(caseModel) <- stateAbbSort

We merge caseModel with case_predictors, matched by state code. We have now loaded all potential predictors to our model. The following code prints out first few lines of the caseModel to demonstrate how states and their variables lie within this data frame.

caseModel <- merge(caseModel, case_predictors, by = "row.names")[, -1]
rownames(caseModel) <- stateAbbSort
head(caseModel)

##     H0  H1       H1MD       H1ML        POP     TEMP    PRECIP  CASE
## AC  14   2 0.06063393 0.07681638   4.729529 25.66136 169.20068  2568
## AL 102  29 0.03059556 0.12310635 118.607876 24.67370  96.40583 11296
## AM  30   3 1.24451733 1.24898100   2.442278 26.49226 198.98254 17832
## AP  13   1 0.28318882 0.28318882   5.158925 26.22867 224.78617  1708
## BA 413 145 0.04901767 0.31302834  26.638086 23.88865  80.96277  6111
## CE 180  63 0.04965019 0.27650332  58.958386 26.22163  78.03878 30219

Cases Visualization

The plots below visualize the number of cases within each state. To reproduce the model by Lo and Park, we will log-transform the number of cases to reduce data skew and the influence of outliers.

par(mfrow = c(2L, 1L))

hist(caseModel$CASE, main = "Distribution of Case Counts")

hist(log(caseModel$CASE), 
     main = "Distribution of Log-Transformed Case Counts")

We use the following plot to check if there is any correlation between predictors. We see that H0 and H1 are linearly related to each other, so H1 may not be a necessary predictor. During the model selection process, there will be a comparison testing inclusion of H1 as a predictor and decisions will be made at that point.

pairs(caseModel, pch = 16L)

First fit: Model without Topological Features

First, we fit a linear model to log-transformed case counts predicted by population, temperature, precipitation, and H0 (the number of aegypti occurrences). We reject the null hypothesis of no predictive value (\(p = 0.0199\)).

fit.1 <- lm(log(CASE) ~ POP + TEMP + PRECIP + H0, data = caseModel)
summary(fit.1)

## 
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + PRECIP + H0, data = caseModel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3891 -1.1654 -0.1291  0.7239  3.3595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1.404949   4.062529   0.346  0.73275   
## POP         0.007921   0.003407   2.325  0.02969 * 
## TEMP        0.238363   0.144454   1.650  0.11313   
## PRECIP      0.001877   0.008761   0.214  0.83234   
## H0          0.007377   0.002306   3.200  0.00413 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.634 on 22 degrees of freedom
## Multiple R-squared:  0.3992, Adjusted R-squared:   0.29 
## F-statistic: 3.655 on 4 and 22 DF,  p-value: 0.01984

To check error assumptions for linear regression, we plot residuals versus fitted values and theoretical quantiles versus standardized residuals. We observe that residuals for each fitted values are evenly distributed above and below the mean 0, and the relationship between residuals and fitted values seem to have a linear relation. In the quantile-quantile plot, standardized residuals closely line up with normal values, with a few outliers. Therefore, the normality assumption on the error term is satisfied.

par(mfrow = c(1L, 2L))
plot(fit.1, which = 1)
plot(fit.1, which = 2)

Next, we use the likelihood ratio test (LRT) to assess if we should drop the precipitation predictor PRECIP. In the LRT, \(p = 0.8135\), which fails to reject the nested model. This result suggests that the simpler model without PRECIP is better. We therefore replace the original model with this more parsimonious one. Even though TEMP is not statistically significant, we preserve it in the model as temperature is an effectual predictor in most epidemiology modeling, and one whose point estimate we may want to be able to report.

fit.nested <- lm(log(CASE) ~ POP + TEMP + H0, data = caseModel)
lmtest::lrtest(fit.nested, fit.1)

## Likelihood ratio test
## 
## Model 1: log(CASE) ~ POP + TEMP + H0
## Model 2: log(CASE) ~ POP + TEMP + PRECIP + H0
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   5 -48.831                     
## 2   6 -48.803  1 0.0563     0.8125

fit.1 <- lm(log(CASE) ~ POP + TEMP + H0, data = caseModel)
summary(fit.1)

## 
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0, data = caseModel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3756 -1.1705 -0.0936  0.6976  3.3166 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1.761653   3.627981   0.486  0.63186   
## POP         0.007694   0.003170   2.427  0.02346 * 
## TEMP        0.235612   0.140867   1.673  0.10796   
## H0          0.007162   0.002031   3.526  0.00181 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.6 on 23 degrees of freedom
## Multiple R-squared:  0.398,  Adjusted R-squared:  0.3194 
## F-statistic: 5.068 on 3 and 23 DF,  p-value: 0.007698

Second fit: Model with Topological Features

Let’s add H1, H1ML, and H1MD to our current regression model and name it fit.2. From the summary output of the second model, \(F = 3.268\) and \(p = 0.01878\). We know that coefficients of these predictors are not all 0. However, we might overfit the data as many predictors are not statistically significant. Thus, we prefer to exclude irrelevant variables.

fit.2 <- lm(log(CASE) ~ POP + TEMP + H0 + H1 + H1ML + H1MD,
            data = caseModel)
summary(fit.2)

## 
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1 + H1ML + H1MD, 
##     data = caseModel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9556 -1.1606  0.2781  0.5272  3.0658 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.231339   3.466734   0.932  0.36240   
## POP          0.009540   0.003137   3.042  0.00644 **
## TEMP         0.156633   0.137891   1.136  0.26942   
## H0          -0.022641   0.023373  -0.969  0.34428   
## H1           0.080924   0.063805   1.268  0.21926   
## H1ML         3.606118   1.752755   2.057  0.05292 . 
## H1MD        -1.345753   1.927491  -0.698  0.49310   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.505 on 20 degrees of freedom
## Multiple R-squared:  0.5368, Adjusted R-squared:  0.3979 
## F-statistic: 3.864 on 6 and 20 DF,  p-value: 0.01009

Similarly, the two plots below exhibit that error terms from the regression model are normally distributed and centered at zero. There are a few outliers from the Q-Q plot, but their effects are trivial in our regression model.

par(mfrow = c(1L, 2L))
plot(fit.2, which = 1)
plot(fit.2, which = 2)

Using LRT to see if H1MD is a useful predictor, we fail to reject the nested model when H1MD is absent (\(p = 0.4126\)). Therefore, we leave out H1MD in our model.

fit.test0 <- lm(log(CASE) ~ POP + TEMP + H0 + H1 + H1ML,
                data = caseModel)
summary(fit.test0)

## 
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1 + H1ML, data = caseModel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9543 -0.9471  0.2150  0.5538  3.0678 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.131348   3.421244   0.915  0.37044   
## POP          0.009410   0.003092   3.043  0.00618 **
## TEMP         0.161011   0.136057   1.183  0.24987   
## H0          -0.018619   0.022374  -0.832  0.41467   
## H1           0.070886   0.061401   1.154  0.26128   
## H1ML         2.731621   1.211053   2.256  0.03489 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.486 on 21 degrees of freedom
## Multiple R-squared:  0.5255, Adjusted R-squared:  0.4126 
## F-statistic: 4.652 on 5 and 21 DF,  p-value: 0.005153

lmtest::lrtest(fit.test0, fit.2)

## Likelihood ratio test
## 
## Model 1: log(CASE) ~ POP + TEMP + H0 + H1 + H1ML
## Model 2: log(CASE) ~ POP + TEMP + H0 + H1 + H1ML + H1MD
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   7 -45.616                     
## 2   8 -45.291  1 0.6502       0.42

We observe that coefficients of both H0 and H1 are not statistically significant. Also, it is concerning because the coefficient of H0 in the nested model above is negative, which means more mosquito occurrences tend to decrease the number of cases. We decide to test the model without H1, as it seems to correlate with H0 and cause confounding.

fit.test1 <- lm(log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)
summary(fit.test1)

## 
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2852 -1.0814 -0.1375  0.6900  3.1474 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 2.698803   3.426291   0.788  0.43929   
## POP         0.009618   0.003111   3.092  0.00532 **
## TEMP        0.158484   0.137065   1.156  0.25997   
## H0          0.007119   0.001902   3.744  0.00112 **
## H1ML        2.471396   1.198857   2.061  0.05127 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.497 on 22 degrees of freedom
## Multiple R-squared:  0.4954, Adjusted R-squared:  0.4037 
## F-statistic: 5.401 on 4 and 22 DF,  p-value: 0.00348

lmtest::lrtest(fit.test1, fit.test0)

## Likelihood ratio test
## 
## Model 1: log(CASE) ~ POP + TEMP + H0 + H1ML
## Model 2: log(CASE) ~ POP + TEMP + H0 + H1 + H1ML
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   6 -46.447                     
## 2   7 -45.616  1 1.6614     0.1974

The likelihood ratio test suggests that we need to choose the simpler model without H1 (\(p = 0.2161\)). In this case, we only need H1ML as our topological predictor.

fit.2 <- lm(log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)

Results

We compare regression summaries of our final first and second models and observe an improvement of the adjusted R-squared when adding H1ML as an additional predictor.

summary(fit.1)

## 
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0, data = caseModel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3756 -1.1705 -0.0936  0.6976  3.3166 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1.761653   3.627981   0.486  0.63186   
## POP         0.007694   0.003170   2.427  0.02346 * 
## TEMP        0.235612   0.140867   1.673  0.10796   
## H0          0.007162   0.002031   3.526  0.00181 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.6 on 23 degrees of freedom
## Multiple R-squared:  0.398,  Adjusted R-squared:  0.3194 
## F-statistic: 5.068 on 3 and 23 DF,  p-value: 0.007698

summary(fit.2)

## 
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2852 -1.0814 -0.1375  0.6900  3.1474 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 2.698803   3.426291   0.788  0.43929   
## POP         0.009618   0.003111   3.092  0.00532 **
## TEMP        0.158484   0.137065   1.156  0.25997   
## H0          0.007119   0.001902   3.744  0.00112 **
## H1ML        2.471396   1.198857   2.061  0.05127 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.497 on 22 degrees of freedom
## Multiple R-squared:  0.4954, Adjusted R-squared:  0.4037 
## F-statistic: 5.401 on 4 and 22 DF,  p-value: 0.00348

H0 and H1ML are both positively related to the number of cases in each state. A higher value of H0 indicates more A. aegypti occurrences within the given state, so it leads to more confirmed cases. Since we include both H0 and H1ML in the model, given a fixed number of A. aegypti occurrences, a more persistent cycle would predict more cases as disease outbreaks can involve more locations. To interpret this, we select three states with similar H0 but different H1ML and compare the number of cases in each.

We choose states Pernambuco, Ceará, and Goiás for this example and print out H0, H1ML, POP, and CASE for each state. We observe that Goiás has the largest H1ML and the most cases, while Pernambuco has the smallest H1ML and the least number of cases. To reduce the effect of H0 on case numbers and H1ML, H0 is fixed since H0 for three selected states all fall into the range around 200.

Example_States <- c("PE", "CE", "GO")
Coord_Example <- data.frame()
for(val in Example_States){
  targetRow <- caseModel[rownames(caseModel) == val,]
  Coord_Example <- rbind(Coord_Example,
                         c(targetRow$H0, targetRow$H1ML, targetRow$CASE))
}
rownames(Coord_Example) <- Example_States
colnames(Coord_Example) <- c("H0", "H1ML", "CASE");Coord_Example

##     H0      H1ML   CASE
## PE 185 0.1594328   7985
## CE 180 0.2765033  30219
## GO 246 0.3654354 139357

While controlling H0, we want to give a visualized interpretation of how H1ML positively relates to the number of cases. To achieve this, we plot A. aegypti occurrences on an xy-axis. We also limit the range of the x-axis to 10 in all three plots to fix the geographic dimension, in order to have a fair comparison between aegypti distributions.

par(mfrow = c(1L, 3L))

PE_coord <- aegypti[aegypti$state_code == "PE", c("x", "y"), drop = FALSE]
PE_rips <- vietoris_rips(PE_coord)
plot(x = PE_coord$x, y = PE_coord$y, asp = 1, col = "green",
     xlim = c(-42, -32), xlab = "X", ylab = "Y", main = "Pernambuco")

CE_coord <- aegypti[aegypti$state_code == "CE", c("x", "y"), drop = FALSE]
CE_rips <- vietoris_rips(CE_coord)
plot(x = CE_coord$x, y = CE_coord$y, asp = 1, col = "blue",
     xlim = c(-43, -33), xlab = "X", ylab = "Y", main = "Ceará")

GO_coord <- aegypti[aegypti$state_code == "GO", c("x", "y"), drop = FALSE]
GO_rips <- vietoris_rips(GO_coord) 
plot(x = GO_coord$x, y = GO_coord$y, asp = 1, col = "red",
     xlim = c(-54, - 44), xlab = "X", ylab = "Y", main = "Goiás")

par(mfrow = c(1L, 1L))

Distributions of aegypti populations serve as a guidance map for interpreting H1ML: a bigger H1ML comes from sparse and wider range of occurrences where infections can extend to more locations. In Pernambuco, which has the tiniest H1ML, most occurrences appear to form a single cluster. Hence, the spread of disease in Pernambuco may be easier to prevent and contain in the population close to this aegypti cluster, such as by applying pesticides and increasing predators. This may be part of the reason that Pernambuco, with the tiniest H1ML, has a small number of cases. In Ceará, occurrences are not clustered in one chunk but are separated into two clusters, one to the north and the other to the south. Based on the aegypti distribution in Goiás, we observe that multiple aegypti points are sparsely located instead of tightly attached to one another, which can lead to a higher value of H1ML. Comparing the shape of distributions among state Ceará and Goiás, occurrences in Goiás have a larger spread and are more uniformly distributed across the state, so the disease in Goiás can be transmitted across a wider range to reach more distant locations. More uniformed occurrences of aegypti species can increase the number of infections as it bridges more neighboring communities. Overall, predictor H1ML effectively detects patterns among clusters of occurrences and explains that the increase in cases may be due to a wider range of disease spread.

Discussion

We have shown that using topological features from A. aegypti occurrences in conjunction with standard epidemiology variables could better use the spatial information of existing data to benefit the prediction of ZIKV cases. Compared to using different years’ datasets by Lo and Park, we have kept all our datasets to 2013 so our model leads to a different result. Also, improvements in our regression model after using topological features are not extraordinarily significant. In addition to variables included in our model, it is possible to further involve interaction terms, quadratic predictors, and log transformation of topological features when they are contextually and statistically appropriate. We have tested our model while incorporating the interaction between H1 and H1ML as an additional predictor, in order to follow the approach by Lo and Park. However, this interaction term is not significant and does not contribute to the overall fitness of our model. Even though variable selections and results deviate somewhat from the regression model built by Lo and Park, our model successfully demonstrates how to take advantage of topological features generated by the Vietoris-Rips filtration.