In 2015, the Zika virus (ZIKV) was first introduced and spread across Brazil due to a plethora of Aedes aegypti mosquitoes.1 This species transmits ZIKV to humans, which can cause mild symptoms or even severe health conditions such as Guillain-Barré Syndrome and microcephaly.23 Lo and Park (2018) modeled the spread of ZIKV using linear regression, while comparing results between two models — one with predictors calculated from persistent homology (PH) and one without.4 They show that using the number of 0-dimensional and 1-dimensional homological features and the maximum lifetime (persistence) of any 1-dimensional feature can better predict the number of ZIKV cases in Brazil.
Here, we demonstrate how to use topological features to model disease transmission by A. aegypti, similar to the ZIKV model built by Lo and Park.
To imitate the model from Lo and Park with available data, we would predict the number of ZIKV cases confirmed in each state using linear regression. However, data points of A. aegypti mosquitoes in Brazil are only available for the year 2013, while documented ZIKV cases are available only since 2015. Instead of using the number of ZIKV cases as the response variable in our regression model, we obtained the number of Dengue cases in Brazil for the year 2013,5 since Dengue is also transmitted by A. aegypti mosquitoes. Predictions of number of cases in each state will be made using average temperatures, precipitation levels,6 and human population densities.78 Notice that the source for temperature data provided by Lo and Park is no longer available. The accessible dataset for temperatures and precipitations are averaged throughout multiple years, thus it may not be completely accurate for 2013.
Besides these state-level attributes, we will also utilize spatial information on A. aegypti occurrences to model the disease.9 We include the number of 0-dimensional features (H0) as a predictor since a higher value of H0 indicates that a state harbors more mosquitoes. We also include the number of 1-dimensional features (H1) in our model, as we interpret a larger value of H1 to mean that A. aegypti mosquitoes are more spread out. In addition, extending the analysis of Lo and Park, we include the maximum lifetime (H1ML) and the median lifetime (H1MD) of the 1-dimensinoal features, which we expect give additional insight into the statewide connectivity of A. aegypti distributions. A high H1ML indicates mosquitoes have a wider spread and are less clustered at a small number of specific regions. We are also curious about whether the median of H1 lifetime can improve upon the effect of H1ML demonstrated by Lo and Park.
First, we load the aegypti
dataset containing geographic locations of A. aegypti mosquitoes. Also, we load the case_predictors
dataset to obtain human population density, average temperature, precipitation level, and the number of cases in each state.
We will use state abbreviations to join elements from these two data sets and from the PH calculations. The vector stateAbbSort
, de-duplicated from the A. aegypti dataset, provides a convenient tool for these tasks.
data(aegypti)
data(case_predictors)
<- sort(aegypti$state_code[!duplicated(aegypti$state_code)]) stateAbbSort
We will incrementally add the predictive variables to the data frame caseModel
.
<- data.frame() caseModel
Before building the model, the following code provides an example of using Vietoris-Rips filtration on the mosquito occurrence patterns for a single state. First, we retrieve occurrence coordinates in the state of Acre (AC) from the A. aegypti dataset. Applying the vietoris_rips
function to these coordinates, we receive a data frame that contains the birth and death information of every 0-dimensional (H0) and 1-dimensional (H1) feature. We plot these persistence data in the following graph, where the horizontal and vertical axes represent the birth and death of each feature, respectively.
<- aegypti[aegypti$state_code == "AC", c("x", "y"), drop = FALSE]
AC_coord <- vietoris_rips(AC_coord) ##filtration
AC_rips
plot.new()
plot.window(
xlim = c(0, max(AC_rips$death)),
ylim = c(0, max(AC_rips$death)),
asp = 1
)axis(1L)
axis(2L)
abline(a = 0, b = 1)
points(AC_rips[AC_rips$dimension == 0L, c("birth", "death")], pch = 16L)
points(AC_rips[AC_rips$dimension == 1L, c("birth", "death")], pch = 17L)
For reference, we can also plot the mosquito occurrences based on their latitudes and longitudes in the aegypti dataset. Note, however, that the plot is not exactly faithful to the geography, since the distances between longitude lines vary from location to location. This also means that our PH calculations are not exactly right. The same limitation applies to the analysis of Lo and Park.
plot(x = AC_coord$x, y = AC_coord$y, asp = 1, xlab = "X", ylab = "Y",
main = "Aedes Aegypti Occurrences in AC")
The following function takes in a single state’s persistence data and returns a vector consisting of the values H0, H1, H1ML, and H1MD.
<- function(state_rips){
topologicalFeatures <- length(which(state_rips$dimension == 0))
numH0 <- length(which(state_rips$dimension == 1))
numH1 if (numH1 != 0) {
<- max(state_rips$persistence[(numH0 + 1) : (numH0 + numH1)])
maxPerst <- median(state_rips$persistence[(numH0 + 1) : (numH0 + numH1)])
medianPerst
else {
} <- 0
maxPerst <- 0
medianPerst
}return(c(numH0, numH1, medianPerst, maxPerst))
}
Iterating over all the states in Brazil, we bind every state’s topological features to caseModel
.
for(val in stateAbbSort) {
<- aegypti[aegypti$state_code == val, c("x", "y"), drop = FALSE]
state_coord <- vietoris_rips(state_coord)
state_rips $persistence<-(state_rips$death - state_rips$birth)
state_rips
<- topologicalFeatures(state_rips)
features <- rbind(
caseModel
caseModel,c(features[1], features[2], features[3], features[4])
)
}colnames(caseModel) <- c("H0", "H1", "H1MD","H1ML")
rownames(caseModel) <- stateAbbSort
We merge caseModel
with case_predictors
, matched by state code. We have now loaded all potential predictors to our model. The following code prints out first few lines of the caseModel
to demonstrate how states and their variables lie within this data frame.
<- merge(caseModel, case_predictors, by = "row.names")[, -1]
caseModel rownames(caseModel) <- stateAbbSort
head(caseModel)
## H0 H1 H1MD H1ML POP TEMP PRECIP CASE
## AC 13 2 0.06063392 0.07681639 4.729529 25.66136 169.20068 2568
## AL 101 29 0.03059555 0.12310635 118.607876 24.67370 96.40583 11296
## AM 29 3 1.24451720 1.24898089 2.442278 26.49226 198.98254 17832
## AP 12 1 0.28318879 0.28318879 5.158925 26.22867 224.78617 1708
## BA 412 147 0.04711213 0.31302836 26.638086 23.88865 80.96277 6111
## CE 179 63 0.04965020 0.27650333 58.958386 26.22163 78.03878 30219
The plots below visualize the number of cases within each state. To reproduce the model by Lo and Park, we will log-transform the number of cases to reduce data skew and the influence of outliers.
par(mfrow = c(2L, 1L))
hist(caseModel$CASE, main = "Distribution of Case Counts")
hist(log(caseModel$CASE),
main = "Distribution of Log-Transformed Case Counts")
We use the following plot to check if there is any correlation between predictors. We see that H0 and H1 are linearly related to each other, so H1 may not be a necessary predictor. During the model selection process, there will be a comparison testing inclusion of H1 as a predictor and decisions will be made at that point.
pairs(caseModel, pch = 16L)
First, we fit a linear model to log-transformed case counts predicted by population, temperature, precipitation, and H0 (the number of aegypti occurrences). We reject the null hypothesis of no predictive value (p=0.0199).
.1 <- lm(log(CASE) ~ POP + TEMP + PRECIP + H0, data = caseModel)
fitsummary(fit.1)
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + PRECIP + H0, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3891 -1.1671 -0.1294 0.7238 3.3588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.416851 4.061618 0.349 0.73053
## POP 0.007909 0.003406 2.322 0.02988 *
## TEMP 0.238266 0.144471 1.649 0.11331
## PRECIP 0.001867 0.008762 0.213 0.83324
## H0 0.007375 0.002306 3.198 0.00415 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.634 on 22 degrees of freedom
## Multiple R-squared: 0.399, Adjusted R-squared: 0.2898
## F-statistic: 3.652 on 4 and 22 DF, p-value: 0.0199
To check error assumptions for linear regression, we plot residuals versus fitted values and theoretical quantiles versus standardized residuals. We observe that residuals for each fitted values are evenly distributed above and below the mean 0, and the relationship between residuals and fitted values seem to have a linear relation. In the quantile-quantile plot, standardized residuals closely line up with normal values, with a few outliers. Therefore, the normality assumption on the error term is satisfied.
par(mfrow = c(1L, 2L))
plot(fit.1, which = 1)
plot(fit.1, which = 2)
Next, we use the likelihood ratio test (LRT) to assess if we should drop the precipitation predictor PRECIP
. In the LRT, p=0.8135, which fails to reject the nested model. This result suggests that the simpler model without PRECIP
is better. We therefore replace the original model with this more parsimonious one. Even though TEMP
is not statistically significant, we preserve it in the model as temperature is an effectual predictor in most epidemiology modeling, and one whose point estimate we may want to be able to report.
<- lm(log(CASE) ~ POP + TEMP + H0, data = caseModel)
fit.nested ::lrtest(fit.nested, fit.1) lmtest
## Likelihood ratio test
##
## Model 1: log(CASE) ~ POP + TEMP + H0
## Model 2: log(CASE) ~ POP + TEMP + PRECIP + H0
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 5 -48.834
## 2 6 -48.807 1 0.0557 0.8135
.1 <- lm(log(CASE) ~ POP + TEMP + H0, data = caseModel)
fitsummary(fit.1)
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3757 -1.1723 -0.0940 0.6976 3.3161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.771317 3.627567 0.488 0.62997
## POP 0.007684 0.003170 2.424 0.02362 *
## TEMP 0.235534 0.140883 1.672 0.10811
## H0 0.007161 0.002032 3.524 0.00182 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.6 on 23 degrees of freedom
## Multiple R-squared: 0.3978, Adjusted R-squared: 0.3193
## F-statistic: 5.065 on 3 and 23 DF, p-value: 0.007721
Let’s add H1
, H1ML
, and H1MD
to our current regression model and name it fit.2
. From the summary output of the second model, F=3.268 and p=0.01878. We know that coefficients of these predictors are not all 0. However, we might overfit the data as many predictors are not statistically significant. Thus, we prefer to exclude irrelevant variables.
.2 <- lm(log(CASE) ~ POP + TEMP + H0 + H1 + H1ML + H1MD,
fitdata = caseModel)
summary(fit.2)
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1 + H1ML + H1MD,
## data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.014 -1.151 0.246 0.577 3.068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.228176 3.473642 0.929 0.36380
## POP 0.009529 0.003143 3.032 0.00659 **
## TEMP 0.155718 0.138196 1.127 0.27317
## H0 -0.021864 0.023392 -0.935 0.36109
## H1 0.078002 0.063214 1.234 0.23153
## H1ML 3.632102 1.767727 2.055 0.05321 .
## H1MD -1.377716 1.941589 -0.710 0.48615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.508 on 20 degrees of freedom
## Multiple R-squared: 0.5348, Adjusted R-squared: 0.3952
## F-statistic: 3.832 on 6 and 20 DF, p-value: 0.01049
Similarly, the two plots below exhibit that error terms from the regression model are normally distributed and centered at zero. There are a few outliers from the Q-Q plot, but their effects are trivial in our regression model.
par(mfrow = c(1L, 2L))
plot(fit.2, which = 1)
plot(fit.2, which = 2)
Using LRT to see if H1MD
is a useful predictor, we fail to reject the nested model when H1MD is absent (p=0.4126). Therefore, we leave out H1MD
in our model.
<- lm(log(CASE) ~ POP + TEMP + H0 + H1 + H1ML,
fit.test0 data = caseModel)
summary(fit.test0)
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1 + H1ML, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9994 -0.9372 0.1673 0.6046 3.0713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.121003 3.429087 0.910 0.37307
## POP 0.009399 0.003101 3.032 0.00635 **
## TEMP 0.160272 0.136405 1.175 0.25316
## H0 -0.017450 0.022281 -0.783 0.44226
## H1 0.067017 0.060560 1.107 0.28097
## H1ML 2.731668 1.216049 2.246 0.03557 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.49 on 21 degrees of freedom
## Multiple R-squared: 0.5231, Adjusted R-squared: 0.4095
## F-statistic: 4.606 on 5 and 21 DF, p-value: 0.005412
::lrtest(fit.test0, fit.2) lmtest
## Likelihood ratio test
##
## Model 1: log(CASE) ~ POP + TEMP + H0 + H1 + H1ML
## Model 2: log(CASE) ~ POP + TEMP + H0 + H1 + H1ML + H1MD
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -45.687
## 2 8 -45.351 1 0.6713 0.4126
We observe that coefficients of both H0 and H1 are not statistically significant. Also, it is concerning because the coefficient of H0 in the nested model above is negative, which means more mosquito occurrences tend to decrease the number of cases. We decide to test the model without H1
, as it seems to correlate with H0 and cause confounding.
<- lm(log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)
fit.test1 summary(fit.test1)
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2853 -1.0814 -0.1371 0.6899 3.1475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.708512 3.426125 0.791 0.43765
## POP 0.009607 0.003111 3.089 0.00537 **
## TEMP 0.158410 0.137089 1.156 0.26027
## H0 0.007117 0.001902 3.742 0.00113 **
## H1ML 2.471004 1.199093 2.061 0.05135 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.498 on 22 degrees of freedom
## Multiple R-squared: 0.4952, Adjusted R-squared: 0.4035
## F-statistic: 5.396 on 4 and 22 DF, p-value: 0.003494
::lrtest(fit.test1, fit.test0) lmtest
## Likelihood ratio test
##
## Model 1: log(CASE) ~ POP + TEMP + H0 + H1ML
## Model 2: log(CASE) ~ POP + TEMP + H0 + H1 + H1ML
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 6 -46.452
## 2 7 -45.687 1 1.5303 0.2161
The likelihood ratio test suggests that we need to choose the simpler model without H1
(p=0.2161). In this case, we only need H1ML as our topological predictor.
.2 <- lm(log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel) fit
We compare regression summaries of our final first and second models and observe an improvement of the adjusted R-squared when adding H1ML
as an additional predictor.
summary(fit.1)
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3757 -1.1723 -0.0940 0.6976 3.3161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.771317 3.627567 0.488 0.62997
## POP 0.007684 0.003170 2.424 0.02362 *
## TEMP 0.235534 0.140883 1.672 0.10811
## H0 0.007161 0.002032 3.524 0.00182 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.6 on 23 degrees of freedom
## Multiple R-squared: 0.3978, Adjusted R-squared: 0.3193
## F-statistic: 5.065 on 3 and 23 DF, p-value: 0.007721
summary(fit.2)
##
## Call:
## lm(formula = log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2853 -1.0814 -0.1371 0.6899 3.1475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.708512 3.426125 0.791 0.43765
## POP 0.009607 0.003111 3.089 0.00537 **
## TEMP 0.158410 0.137089 1.156 0.26027
## H0 0.007117 0.001902 3.742 0.00113 **
## H1ML 2.471004 1.199093 2.061 0.05135 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.498 on 22 degrees of freedom
## Multiple R-squared: 0.4952, Adjusted R-squared: 0.4035
## F-statistic: 5.396 on 4 and 22 DF, p-value: 0.003494
H0
and H1ML
are both positively related to the number of cases in each state. A higher value of H0
indicates more A. aegypti occurrences within the given state, so it leads to more confirmed cases. Since we include both H0
and H1ML
in the model, given a fixed number of A. aegypti occurrences, a more persistent cycle would predict more cases as disease outbreaks can involve more locations. To interpret this, we select three states with similar H0
but different H1ML
and compare the number of cases in each.
We choose states Pernambuco, Ceará, and Goiás for this example and print out H0
, H1ML
, POP
, and CASE
for each state. We observe that Goiás has the largest H1ML
and the most cases, while Pernambuco has the smallest H1ML
and the least number of cases. To reduce the effect of H0
on case numbers and H1ML
, H0
is fixed since H0
for three selected states all fall into the range around 200.
<- c("PE", "CE", "GO")
Example_States <- data.frame()
Coord_Example for(val in Example_States){
<- caseModel[rownames(caseModel) == val,]
targetRow <- rbind(Coord_Example,
Coord_Example c(targetRow$H0, targetRow$H1ML, targetRow$CASE))
}rownames(Coord_Example) <- Example_States
colnames(Coord_Example) <- c("H0", "H1ML", "CASE");Coord_Example
## H0 H1ML CASE
## PE 184 0.1594328 7985
## CE 179 0.2765033 30219
## GO 245 0.3654354 139357
While controlling H0
, we want to give a visualized interpretation of how H1ML
positively relates to the number of cases. To achieve this, we plot A. aegypti occurrences on an xy-axis. We also limit the range of the x-axis to 10 in all three plots to fix the geographic dimension, in order to have a fair comparison between aegypti distributions.
par(mfrow = c(1L, 3L))
<- aegypti[aegypti$state_code == "PE", c("x", "y"), drop = FALSE]
PE_coord <- vietoris_rips(PE_coord)
PE_rips plot(x = PE_coord$x, y = PE_coord$y, asp = 1, col = "green",
xlim = c(-42, -32), xlab = "X", ylab = "Y", main = "Pernambuco")
<- aegypti[aegypti$state_code == "CE", c("x", "y"), drop = FALSE]
CE_coord <- vietoris_rips(CE_coord)
CE_rips plot(x = CE_coord$x, y = CE_coord$y, asp = 1, col = "blue",
xlim = c(-43, -33), xlab = "X", ylab = "Y", main = "Ceará")
<- aegypti[aegypti$state_code == "GO", c("x", "y"), drop = FALSE]
GO_coord <- vietoris_rips(GO_coord)
GO_rips plot(x = GO_coord$x, y = GO_coord$y, asp = 1, col = "red",
xlim = c(-54, - 44), xlab = "X", ylab = "Y", main = "Goiás")
par(mfrow = c(1L, 1L))
Distributions of aegypti populations serve as a guidance map for interpreting H1ML
: a bigger H1ML
comes from sparse and wider range of occurrences where infections can extend to more locations. In Pernambuco, which has the tiniest H1ML
, most occurrences appear to form a single cluster. Hence, the spread of disease in Pernambuco may be easier to prevent and contain in the population close to this aegypti cluster, such as by applying pesticides and increasing predators. This may be part of the reason that Pernambuco, with the tiniest H1ML
, has a small number of cases. In Ceará, occurrences are not clustered in one chunk but are separated into two clusters, one to the north and the other to the south. Based on the aegypti distribution in Goiás, we observe that multiple aegypti points are sparsely located instead of tightly attached to one another, which can lead to a higher value of H1ML
. Comparing the shape of distributions among state Ceará and Goiás, occurrences in Goiás have a larger spread and are more uniformly distributed across the state, so the disease in Goiás can be transmitted across a wider range to reach more distant locations. More uniformed occurrences of aegypti species can increase the number of infections as it bridges more neighboring communities. Overall, predictor H1ML
effectively detects patterns among clusters of occurrences and explains that the increase in cases may be due to a wider range of disease spread.
We have shown that using topological features from A. aegypti occurrences in conjunction with standard epidemiology variables could better use the spatial information of existing data to benefit the prediction of ZIKV cases. Compared to using different years’ datasets by Lo and Park, we have kept all our datasets to 2013 so our model leads to a different result. Also, improvements in our regression model after using topological features are not extraordinarily significant. In addition to variables included in our model, it is possible to further involve interaction terms, quadratic predictors, and log transformation of topological features when they are contextually and statistically appropriate. We have tested our model while incorporating the interaction between H1
and H1ML
as an additional predictor, in order to follow the approach by Lo and Park. However, this interaction term is not significant and does not contribute to the overall fitness of our model. Even though variable selections and results deviate somewhat from the regression model built by Lo and Park, our model successfully demonstrates how to take advantage of topological features generated by the Vietoris-Rips filtration.
Gatherer D and Kohl A. Zika virus, a previously slow pandemic spreads rapidly through the Americas. Journal of General Virology. 2016 97: 269–273. https://doi.org/10.1099/jgv.0.000381 PMID: 26684466↩︎
Winer J. An update in Guillain-Barre´ Syndrome. Autoimmune Diseases. 2014. https://doi.org/10.1155/2014/793024 PMID: 24511391↩︎
Johansson MA, Mier-y-Teran-Romero L, Reefhuis J, Gilboa S and Hills S. Zika and the risk of microcephaly. N Engl J Med. 2016 375: 1–4. https://doi.org/10.1056/NEJMp1605367 PMID: 27222919↩︎
Lo D, Park B (2018) Modeling the spread of the Zika virus using topological data analysis. PLoS ONE 13(2): e0192120. https://doi.org/10.1371/journal.pone.0192120↩︎
Brazil Ministry of Health. 2014. Boletim Epidemiológico. ISSN 2358-9450. https://web.archive.org/web/20210209122713/https://www.gov.br/saude/pt-br/assuntos/boletins-epidemiologicos-1/por-assunto↩︎
Brazilian Institute of Geography and Statistics. 2015. Retrieved from https://ftp.ibge.gov.br/Estimativas_de_Populacao/↩︎
Brazilian Institute of Geography and Statistics. 2020. https://www.ibge.gov.br/geociencias/organizacao-do-territorio/estrutura-territorial/15761-areas-dos-municipios.html?edicao=30133&t=acesso-ao-produto↩︎
Kraemer, Moritz U. G. et al. 2017. Data from: The global compendium of Aedes aegypti and Ae. albopictus occurrence, Dryad, Dataset, https://doi.org/10.5061/dryad.47v3c↩︎