In this vignette, we describe how prevalence mapping using DHS data can be carried out. Prevalence mapping is Small Area Estimation (SAE) for proportions of some indicator of interest. The vignette is organized as follows. In Section 2 we first describe the data preparation steps to obtain and process the relevant DHS survey data in R. We then describe three classes of models
Then in Section 6 we show several tools for visualization of different estimates, and in Section 7 we discuss how to aggregate the result for a given model to produce estimates at higher admin levels.
The following pieces of data are needed to produce SAE for DHS surveys:
The latest surveyPrev package can be installed from CRAN with
The development version of the surveyPrev package can be installed from GitHub with
After installation, the package can be loaded with
One of the key required packages, INLA
is not available on CRAN or GitHub, and needs to be installed with the following code (Rue, Martino, and Chopin 2009).
install.packages("INLA", repos=c(getOption("repos"),
INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
library(INLA)
The following packages are needed to run the example analysis described in this vignette.
Currently, the surveyPrev
package supports 22 indicators. The list of indicators and their IDs can be found in the surveyPrevIndicators
dataset. The ID column correspond to the indicator IDs in the DHS API and are more standardized. The alternative ID column provides convenient and more readable indicator names for only a subset of indicators. Both can be used to retrieve the indicator in the functions described later. The full list of indicators and their IDs are summarized in the Table below.
ID | alternative.ID | Description | Topic |
---|---|---|---|
AN_ANEM_W_ANY | womananemia | Percentage of women aged 15-49 classified as having any anemia | Nutrition |
AN_NUTS_W_THN | Underweight (Prevalence of underweight (BMI < 18.5) women of reproductive age) | Nutrition | |
CH_DIAT_C_ORT | Diarrhea treatment (Children under five with diarrhea treated with either ORS or RHF) | Child Health | |
CH_VACC_C_BAS | Children age 12-23 months with all 8 basic vaccinations | Child Health | |
CH_VACC_C_DP1 | Percentage of children 12-23 months who had received DPT 1 vaccination | Child Health | |
CH_VACC_C_DP3 | DPT3 | Percentage of children 12-23 months who had received DPT 3 vaccination | Child Health |
CH_VACC_C_MSL | Percentage of children 12-23 months who had received MCV 1 (Measles containing vaccine) | Child Health | |
CH_VACC_C_NON | Children 12-23 months with no vaccinations | Child Health | |
CM_ECMR_C_NNR | nmr | Probability of dying in the first month of life in the five or ten years preceding the survey | Infant and Child Mortality |
CN_ANMC_C_ANY | Children under five with any anemia | Nutrition | |
CN_BRFS_C_EXB | Children exclusively breastfed (Prevalence of exclusive breastfeeding of children under six months of age) | Nutrition | |
CN_NUTS_C_HA2 | stunting | Stunting rate (Prevalence of stunted (HAZ < -2) children under five (0-59 months)) | Nutrition |
CN_NUTS_C_WH2 | wasting | Wasting rate (Prevalence of wasted (HAZ < -2) children under five (0-59 months)) | Nutrition |
FP_CUSA_W_MOD | Modern contraceptive prevalence rate (Married women currently using any modern method of contraception) | Family Planning | |
FP_NADA_W_UNT | unmet_family | Percentage of women with an unmet need for family planning | Family Planning |
HA_HIVP_B_HIV | HIV prevalence | HIV prevalence | |
ML_NETP_H_IT2 | Households with at least one insecticide-treated mosquito net (ITN) for every two persons who stayed in the household the previous night | Malaria | |
RH_ANCN_W_N4P | ancvisit4+ | Antenatal visits for pregnancy: 4+ visits | Maternal Healthcare |
RH_DELA_C_SKP | Assistance during delivery from a skilled provide | Maternal Healthcare | |
WS_SRCE_P_BAS | Population using a basic water source | Water, Sanitation and Hygiene | |
WS_TLET_H_IMP | sanitation | Percentage of households using an improved sanitation facility | Water, Sanitation and Hygiene |
WS_TLET_P_BAS | Population with access to a basic sanitation service | Water, Sanitation and Hygiene |
In this vignette, we consider estimating the percentage of women who had a live birth in the five years before the survey who had four or more antenatal care visits. This indicator can be extracted with indicator = "ancvisit4+"
or indicator = "RH_ANCN_W_N4P"
.
Details on more common DHS indicators can be found in the Guide to DHS Statistics at https://dhsprogram.com/data/Guide-to-DHS-Statistics/. User-specified function can also be used to process raw DHS data into new indicators. Please refer to the vignette on Creating Customized Indicators for surveyPrev for more details.
Processing the binary indicator from the raw DHS data consist of two steps:
1. download the relevant DHS data recode using getDHSdata()
, or manually from the DHS website,
2. create a data frame with the binary indicator and relevant information using getDHSindicator()
.
The getDHSdata()
function downloads the relevant DHS data directly from the DHS website using the rdhs package. This step requires users to
rdhs::set_rdhs_config()
.After setting up the DHS account information, getDHSdata()
can be used to download the relevant survey files directly into R. If data download using the API fails, or if it is preferred to work with downloaded data files offline, you may also manually download the file from the DHS website and read into R. The getDHSdata()
function returns a message specifying which file is used (e.g., Individual Record file for the ANC visit example).
indicator <- "ancvisit4+"
year <- 2018
country <- "Zambia"
dhsData <- getDHSdata(country = country, indicator = indicator, year = year)
We then use getDHSindicator()
to process the raw survey data into a data frame, where the column value
is the indicator of interest. The data frame also contains information specifying survey designs in the survey package, including cluster ID, household ID, survey weight and strata information.
## cluster householdID v024 weight strata value
## 1 1 1 eastern 1892890 rural 1
## 2 1 2 eastern 1892890 rural 1
## 3 1 3 eastern 1892890 rural 1
## 4 1 4 eastern 1892890 rural 0
## 5 1 7 eastern 1892890 rural 1
## 6 1 9 eastern 1892890 rural NA
The getDHSgeo()
function downloads the GPS data for cluster locations, also through the rdhs package. This step loads in the GPS location as a SpatialPointsDataFrame
, but it can also be performed manually by downloading the GPS file from the DHS website. The GPS location of the clusters are used to match clusters to regions at different admin levels.
In the case of Zambia, the maps are already included in the package as example datasets (data(ZambiaAdm1)
and data(ZambiaAdm2)
) so we load them directly within R. For other countries, the Admin 1 and Admin 2 boundary shapefiles can be downloaded from the GADM site3 and read into R manually. Another alternative is to use the geodata and sf package to download the maps from the GADM site and load them in R directly, in the following steps:
poly.adm1 <- geodata::gadm(country="ZMB", level=1, path=tempdir())
poly.adm1 <- sf::st_as_sf(poly.adm1)
poly.adm2 <- geodata::gadm(country="ZMB", level=2, path=tempdir())
poly.adm2 <- sf::st_as_sf(poly.adm2)
The clusterInfo()
function extracts Admin 1 and Admin 2 area information for each cluster. To avoid confusion in countries that have duplicated region names, we create new labels for Admin 2 regions using the variable admin2.name.full
. The clusters with invalid GPS locations are saved in the wrong.points
object and not used in any models.
cluster.info <- clusterInfo(geo=geo, poly.adm1=poly.adm1, poly.adm2=poly.adm2,
by.adm1 = "NAME_1",by.adm2 = "NAME_2")
head(cluster.info$data)
## cluster LONGNUM LATNUM geometry admin1.name
## 1 1 32.06631 -13.895685 POINT (32.06631 -13.89568) Eastern
## 2 2 28.73625 -13.419209 POINT (28.73625 -13.41921) Copperbelt
## 3 3 31.13254 -8.754055 POINT (31.13254 -8.754055) Northern
## 4 4 28.44930 -14.419112 POINT (28.4493 -14.41911) Central
## 5 5 29.96905 -12.677275 POINT (29.96905 -12.67727) Central
## 6 6 32.74745 -9.319054 POINT (32.74745 -9.319054) Muchinga
## admin2.name admin2.name.full
## 1 Katete Eastern_Katete
## 2 Masaiti Copperbelt_Masaiti
## 3 Mpulungu Northern_Mpulungu
## 4 Kabwe Central_Kabwe
## 5 Chitambo Central_Chitambo
## 6 Nakonde Muchinga_Nakonde
## [1] 19 33 65 179 236 281 293 360 361 469
The adminInfo()
function computes the spatial the adjacency matrix, and if supplied with additional information on population, combines needed information for each administrative area, including the name, population, and urban population fractions of each administrative area. The admin.mat
object stores the adjacency matrix for spatial models. We will ignore the population information for now and focus on only information from the DHS data, and will return to the population information in Section 7.
admin.info1 <- adminInfo(poly.adm = poly.adm1, admin = 1,by.adm = "NAME_1")
admin.info2 <- adminInfo(poly.adm = poly.adm2, admin = 2,by.adm = "NAME_2",by.adm.upper = "NAME_1")
head(admin.info2$data)
## admin1.name admin2.name admin2.name.full population urban
## 1 Central Chibombo Central_Chibombo NA NA
## 2 Central Chisamba Central_Chisamba NA NA
## 3 Central Chitambo Central_Chitambo NA NA
## 4 Central Itezhi-tezhi Central_Itezhi-tezhi NA NA
## 5 Central Kabwe Central_Kabwe NA NA
## 6 Central Kapiri Mposhi Central_Kapiri Mposhi NA NA
For \(y_j\) be the binary outcome of interest for the \(j^{\text{th}}\) individual in the survey and \(w_j\) be the design weight associated with this individual. For a given area denoted as \(i\), we have the weighted estimator:
\[ \hat p^{W}_{i}=\frac {\sum_{j \in S_i} y_j w_j}{\sum_{j \in S_i} w_j}\]
where \(S_i\) is the set of individual index within the \(i\)-th region. The directEST()
function calculates direct estimates at different Admin levels using the SUMMER::smoothSurvey()
function in the SUMMER package internally (Li et al. 2020). The output of the function includes the direct estimates and their variance, standard error, and confidence intervals based on the specified confidence level. The confidence intervals are computed on the logit scale, i.e., if we use \(D_i\) to denote the design-based variance of \(\hat p^{W}_i\), then the asymptotic design-based variance of \(\text{logit}(\hat p^{W}_i)\) is
\[
V_i = \frac{D_i}{(\hat p^{W}_i)^2(1-\hat p^{W}_i)^2}
\]
and we compute the confidence interval on the probability scale to be
\[(\quad \text{expit}[\text{logit}(\hat p^{W}_i) - z_{\alpha/2}V_i^{1/2}], \;\;
\text{expit}[\text{logit}(\hat p^{W}_i) + z_{\alpha/2}V_i^{1/2}]\quad).
\]
Currently the package defaults to a two-stage stratified cluster sampling design, with the sampling clusters (enumeration areas) being stratified by Admin 1 areas and urban/rural strata, which is the most common sampling design in the DHS.
## admin1.name direct.est direct.var direct.logit.est direct.logit.var
## 1 Central 0.5845443 0.0007002450 0.3414564 0.011873143
## 2 Copperbelt 0.6023371 0.0006960677 0.4152124 0.012132270
## 3 Eastern 0.6505890 0.0003224625 0.6216294 0.006240117
## 4 Luapula 0.6453174 0.0007216008 0.5985190 0.013774331
## 5 Lusaka 0.5817949 0.0004319779 0.3301459 0.007296978
## 6 Muchinga 0.6884903 0.0004137610 0.7930706 0.008995197
## direct.logit.prec direct.se direct.lower direct.upper cv
## 1 84.22370 0.02646214 0.5319292 0.6352999 0.04526970
## 2 82.42480 0.02638310 0.5496679 0.6527379 0.04380122
## 3 160.25341 0.01795724 0.6146268 0.6849157 0.02760151
## 4 72.59881 0.02686263 0.5910940 0.6960479 0.04162700
## 5 137.04303 0.02078408 0.5405908 0.6218883 0.03572406
## 6 111.17044 0.02034112 0.6472976 0.7269017 0.02954452
For national level (admin = 0
), the function can additionally produce urban- and rural-specific direct estimates by specifying argument strata
.
res_ad0 <- directEST(data=data,
cluster.info= cluster.info,
admin=0,
strata="all")
res_ad0_urban <- directEST(data=data,
cluster.info= cluster.info,
admin=0,
strata="urban")
res_ad0_rural <- directEST(data=data,
cluster.info= cluster.info,
admin=0,
strata="rural")
list(all = res_ad0$res.admin0,
urban = res_ad0_urban$res.admin0,
rural = res_ad0_rural$res.admin0)
## $all
## direct.est direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1 0.6344011 7.614786e-05 0.5511447 0.001415533 706.4475
## direct.se direct.lower direct.upper
## 1 0.008726274 0.6171347 0.6513289
##
## $urban
## direct.est direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1 0.6066641 0.0002399573 0.4333115 0.004214152 237.2957
## direct.se direct.lower direct.upper
## 1 0.01549055 0.5759275 0.6365788
##
## $rural
## direct.est direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1 0.6519019 0.0001103517 0.6274099 0.002142946 466.6473
## direct.se direct.lower direct.upper
## 1 0.01050484 0.6310396 0.6721974
In general, direct estimates at Admin 2 are unstable. For countries with a large number of Admin 2 areas, some of the direct estimates can be undefined if there is no cluster in the area.
Fay-Herriot models provides smoothed estimates at the areal level using direct estimate \(\hat p^{W}_{i}\) as input. The direct estimates are modeled as a noisy observation of the true prevalence, with the variance of noise determined by the design-based variance. We consider the spatial Fay-Herriot model for the logit transformed direct estimates, which is defined as follows:
\[\text{logit}(\hat p^{W}_{i})|\lambda_{i} \sim\textrm{Normal}(\lambda_{i}, V_{i}^{HT}),\] \[\lambda_{i}= \alpha + e_i+S_i.\] Here \(\text{expit}(\lambda_{i})\) is the latent true prevalence, and \(e_i\) and \(S_i\) are unstructured and structured spatial random effects. Inference is carried out using Bayesian methods and so the model specific is completed by priors on \(\alpha\), \(e\) and \(S\), and their hyperpriors. More details of the Bayesian model setup can be found in Wakefield, Okonek, and Pedersen (2020). Area level Fay-Herriot model are viewed as the most reliable model choice, since they acknowledge the design through the sue of a weighted estimate and its associated variance. See chapters 4 to 6 of Rao and Molina (2015).
As of now the package allows only an overall intercept \(\alpha\), but future versions of the package will allow area level covariates to be included. The default prior for the intercept is \(N(0, 1000)\). The structured and non-structured random effects are implemented used Besag-York-MolliƩ (BYM) via BYM2 parameterization, with default PC priors such that the marginal standard deviation has a prior such that \(Prob(\sigma > 1) = 0.01\) and the proportion of variation explained by the spatial effect, \(\phi\) has a prior such that \(Prob(\phi > 0.5) = 2/3\) (Riebler et al. 2016; Simpson et al. 2017).
The fhModel()
calculates spatial Fay-Herriot estimates at different administrative levels using SUMMER::smoothSurvey()
. Users can specify either only i.i.d. random effects (i.e., no \(S_i\) term) or the BYM2 model by setting model = "iid"
or model = "bym2"
. The function returns model results at the specified admin level with aggregation = FALSE
. Aggregated to higher admin levels is possible with aggregation = TRUE
when additional population information is provided. The aggregation steps are described in more details in Section 7.
smth_res_ad1_bym2 <- fhModel(data,
cluster.info = cluster.info,
admin.info = admin.info1,
admin = 1,
model = "bym2",
aggregation =FALSE)
smth_res_ad1_iid <- fhModel(data,
cluster.info = cluster.info,
admin.info = admin.info1,
admin = 1,
model = "iid",
aggregation =FALSE)
head(smth_res_ad1_bym2$res.admin1)
## admin1.name mean var median lower upper logit.mean
## 1 Central 0.6028279 0.0005262850 0.6035275 0.5559112 0.6455631 0.4181968
## 2 Copperbelt 0.6130262 0.0005303788 0.6135925 0.5663871 0.6567400 0.4611124
## 3 Eastern 0.6479354 0.0002699419 0.6480306 0.6153815 0.6799110 0.6107456
## 4 Luapula 0.6468364 0.0005282296 0.6471917 0.6007430 0.6910090 0.6066520
## 5 Lusaka 0.5941311 0.0003830601 0.5944710 0.5549919 0.6312606 0.3816880
## 6 Muchinga 0.6768713 0.0003558819 0.6769527 0.6397106 0.7137660 0.7407531
## logit.var logit.median logit.lower logit.upper sd cv
## 1 0.009195237 0.4201849 0.2245841 0.2245841 0.02294090 0.03805548
## 2 0.009451857 0.4624387 0.2671258 0.2671258 0.02302995 0.03756765
## 3 0.005201276 0.6103938 0.4699907 0.4699907 0.01642991 0.02535732
## 4 0.010167730 0.6067175 0.4085620 0.4085620 0.02298325 0.03553177
## 5 0.006601776 0.3824795 0.2208608 0.2208608 0.01957192 0.03294209
## 6 0.007476746 0.7398026 0.5741084 0.5741084 0.01886483 0.02787064
Columns 2-6 represent the posterior mean, posterior variance, posterior median and 2.5%, and 97.5% posterior quantiles, respectively. The posterior variance can be viewed as an analogue of the mean squared error (MSE) summary.
A Fay-Herriot model at Admin 2 level can be fitted by treating areas without direct estimates as missing data, though it is usually not recommended due to data sparsity. In addition, numerical issue arise when design-based variance of direct estimate is close to zero and logit precision become too large. A potential fix for this issue is to identify regions with too small design variance(< 1e-30), and delete the clusters in these regions, before fitting the model. However, this step creates additional bias in the results if the number of clusters deleted is large.
bad_admin2 <- subset(res_ad2$res.admin2, direct.var < 1e-30)$admin2.name.full
# identify clusters in these regions
bad_clusters <- subset(cluster.info$data, admin2.name.full %in% bad_admin2)$cluster
# Run FH model without these clusters
smth_res_ad2 <- fhModel(subset(data, !cluster %in% bad_clusters),
cluster.info = cluster.info,
admin.info = admin.info2,
admin = 2,
model = "bym2",
aggregation = FALSE)
head(smth_res_ad2$res.admin2)
## admin2.name.full mean var median lower upper
## 1 Central_Chibombo 0.6535171 0.0026360727 0.6550209 0.5480457 0.7496094
## 2 Central_Chisamba 0.5579803 0.0006107532 0.5581483 0.5089874 0.6060565
## 3 Central_Chitambo 0.6217739 0.0101383288 0.6276832 0.4110805 0.8000109
## 4 Central_Itezhi-tezhi 0.6436590 0.0101398292 0.6505298 0.4284150 0.8197984
## 5 Central_Kabwe 0.5213792 0.0011020175 0.5213857 0.4560794 0.5861891
## 6 Central_Kapiri Mposhi 0.5106414 0.0015870676 0.5106247 0.4323244 0.5880890
## logit.mean logit.var logit.median logit.lower logit.upper sd
## 1 0.64252506 0.05257552 0.64118322 0.19277749 0.19277749 0.05134270
## 2 0.23355308 0.01008966 0.23365025 0.03595353 0.03595353 0.02471342
## 3 0.52003748 0.19864247 0.52229030 -0.35950027 -0.35950027 0.10068927
## 4 0.61982125 0.20921818 0.62136870 -0.28832074 -0.28832074 0.10069672
## 5 0.08594926 0.01785317 0.08559503 -0.17613618 -0.17613618 0.03319665
## 6 0.04284320 0.02574167 0.04250537 -0.27237407 -0.27237407 0.03983802
## cv
## 1 0.07856367
## 2 0.04429085
## 3 0.16193872
## 4 0.15644419
## 5 0.06367084
## 6 0.07801564
Cluster-level models assume smoothing models for counts of events in each cluster (Wakefield, Okonek, and Pedersen 2020; Li et al. 2020). In terms of traditional SAE literature, cluster-level models are a type of unit-level model. We start with describing an unstratified model without taking into account the urban/rural stratification in the sampling design.
Let \(Y_c\) be the number of events in cluster \(c\), and \(n_c\) be the number of individuals at risk, where \(c= 1,\dots,C\). The unstratified model assumes the hierarchical structure:
\[Y_c \mid p_c,d\sim \textrm{BetaBinomial}(n_c,p_c,d),\] \[p_c=\textrm{expit}(\alpha+e_{i[s_c]}+S_{i[s_c]}),\]
where \(\alpha\) is the intercept, and \(i[s_c]\) indexes the area within which the cluster \(s_c\) resides. Similar to the area-level model, \(e_i\) and \(S_i\) are unstructured and structured spatial random effects with the same prior as before. The Beta-binomial distribution arise from a hierarchical model where the probability follows a \(\text{Beta}(a, b)\) prior. The overdispersion parameter, \(d=\frac{1}{\alpha+\beta+1}\), is between 0 and 1 and represent the the intracluster correlation between Bernoulli draws within cluster. The default prior for \(d\) is \(\text{logit}(d) \sim \text{Normal}(0,0.4)\).
The clusterModel()
function fits the cluster-level model above. Unstratified model can be specified by setting stratification = FALSE
, and either i.i.d. or the BYM2 model by setting model = "bym2"
or model = "iid"
. For the the overdispersion parameter \(d\), users can change the prior mean and precision through overdisp.mean
and overdisp.prec
.
We fit the unstratified model in both Admin 1 and Admin 2 below. These models differ only in the admin level at which the spatial random effects enter, with the cluster-level model being the same in each case.
cl_res_ad1 <- clusterModel(data=data,
cluster.info=cluster.info,
admin.info = admin.info1,
stratification = FALSE,
model = "bym2",
admin = 1,
aggregation =FALSE,
CI = 0.95)
cl_res_ad2 <- clusterModel(data=data,
cluster.info= cluster.info,
admin.info = admin.info2,
model = "bym2",
stratification =FALSE,
admin = 2,
aggregation =FALSE,
CI = 0.95)
head(cl_res_ad2$res.admin2)
## admin2.name.full mean median sd var lower
## 1 Central_Chibombo 0.6448760 0.6447917 0.04581087 0.002098636 0.5518984
## 2 Central_Chisamba 0.5963255 0.5973862 0.05510070 0.003036088 0.4839398
## 3 Central_Chitambo 0.5798794 0.5841081 0.06852435 0.004695587 0.4385194
## 4 Central_Itezhi-tezhi 0.6458915 0.6466465 0.07320677 0.005359231 0.4920623
## 5 Central_Kabwe 0.5445200 0.5434148 0.04622923 0.002137142 0.4522349
## 6 Central_Kapiri Mposhi 0.5598836 0.5622164 0.04545403 0.002066069 0.4685788
## upper cv admin1.name admin2.name population urban
## 1 0.7313468 0.07103827 Central Chibombo NA NA
## 2 0.6996858 0.09240039 Central Chisamba NA NA
## 3 0.7063556 0.11817001 Central Chitambo NA NA
## 4 0.7798355 0.11334222 Central Itezhi-tezhi NA NA
## 5 0.6331086 0.08489905 Central Kabwe NA NA
## 6 0.6465233 0.08118479 Central Kapiri Mposhi NA NA
Cluster-level models will produce biased estimates if the specified model does not hold for all units in the population. In the DHS, urban clusters are often oversampled. If such oversampling occurs and the outcome associated with urban/rural, then bias will result when the model does not allow for differences between urban/rural. To emphasize, bias will only arise when oversampling of urban (or rural) areas occurs, and the indicator has an association with urban/rural. One may check whether one needs to include urban/rural terms in the model, and then aggregate by urban/rural.
A fully adjusted model would include interaction terms representing the crossing of Admin 1 areas and urban/rural status. However, for reasons of parsimony we include area-level random effects and a single urban/rural fixed effect. The sampling model becomes
\[Y_c|p_c,d\sim \textrm{BetaBinomial}(n_c,p_c,d),\] \[p_c=\textrm{expit}(\alpha+\gamma\times I(s_c \in \textrm{urban})+x_c\beta+e_{i[s_c]}+S_{i[s_c]}).\]
The area-level risk is defined as \[p_i=q_i\times \text{expit}(\alpha+\gamma+e_{i[s_c]}+S_{i[s_c]})+(1-q_i)\times \text{expit}(\alpha+e_{i[s_c]}+S_{i[s_c]}),\] where \(q_i\) is the proportion of urban population in area \(i\) where āurbanā is defined by the survey design sampling frame.
Stratified cluster-level model can be fit by setting stratification = TRUE
. The returned results then consist of urban-specific, rural-specific, and the overall area-level prevalences. In order to facilitate aggregation for cluster-level model with urban/rural effects, we need to know the fraction of urban population by each administrative region. This computation can be performed with the getUR()
function (see example for details), but it requires external information (i.e., the urban/rural definition by census), so we omit this step from this document for now and assume such information is already obtained from a difference source. he urban proportions for women of age 15 to 49 in Zambia in 2018 has been pre-computed and saved in ZambiaPopWomen$admin1_urban
and ZambiaPopWomen$admin2_urban
.
## admin1.name admin2.name urban
## 1 Eastern Chadiza 0.049373174
## 2 Muchinga Chama 0.054972234
## 3 Eastern Chasefu 0.001718483
## 4 North-Western Chavuma 0.325830536
## 5 Luapula Chembe 0.000000000
## 6 Central Chibombo 0.577152577
Given the urban/rural proportions, we first add the information into admin.info2
using adminInfo()
.T
admin.info2 <- adminInfo(poly.adm = poly.adm2,
admin = 2,
proportion = ZambiaPopWomen$admin2_urban,
by.adm="NAME_2",by.adm.upper="NAME_1")
head(admin.info2$data)
## admin1.name admin2.name admin2.name.full urban population
## 1 Central Chibombo Central_Chibombo 0.57715258 NA
## 2 Central Chisamba Central_Chisamba 0.03825222 NA
## 3 Central Chitambo Central_Chitambo 0.00000000 NA
## 4 Central Itezhi-tezhi Central_Itezhi-tezhi 0.00000000 NA
## 5 Central Kabwe Central_Kabwe 0.68604463 NA
## 6 Central Kapiri Mposhi Central_Kapiri Mposhi 0.13246326 NA
We can then fit the stratified cluster-level model by specifying stratification = TRUE
.
We can use mapPlot()
function from the SUMMER package to visualize estimates and standard errors.
# Arrange all estimates into a long-format data frame
out1 <- res_ad1$res.admin1[, c("admin1.name", "direct.est", "cv")]
colnames(out1)[2] <- "mean"
out1$model <- "Direct Estimates"
out2 <- smth_res_ad1_bym2$res.admin1[, c("admin1.name", "mean", "cv")]
out2$model <- "Fay-Herriot Model"
out3 <- cl_res_ad1$res.admin1[, c("admin1.name", "mean", "cv")]
out3$model <- "Unstratified Cluster-level Model"
g1 <- mapPlot(data = rbind(out1, out2, out3), geo = poly.adm1,
by.data = "admin1.name", by.geo = "NAME_1", is.long = TRUE,
variable = "model", value = "mean", legend.label = "Mean")
g2 <- mapPlot(data = rbind(out1, out2, out3), geo = poly.adm1,
by.data = "admin1.name", by.geo = "NAME_1", is.long = TRUE,
variable = "model", value = "cv", legend.label = "CV")
g1 / g2
We note that for Admin 2 estimates, surveyPrev uses a recreated region name in the form of [admin1_name]_[admin2_name] to avoid duplicated Admin 2 names. This new region identifier is assigned a column name of admin2.name.full
in the output. Thus when plotting, we need to manually create the admin2.name.full
in the spatial polygon object.
poly.adm2$admin2.name.full=paste0(poly.adm2$NAME_1,"_",poly.adm2$NAME_2)
# Arrange all estimates into a long-format data frame
out1<- res_ad2$res.admin2[, c("admin2.name.full", "direct.est", "cv")]
colnames(out1)[2] <- "mean"
out1$model <- "Direct Estimates"
out2 <- smth_res_ad2$res.admin2[, c("admin2.name.full", "mean", "cv")]
out2$model <- "Fay-Herriot Model"
out3 <- cl_res_ad2$res.admin2[, c("admin2.name.full", "mean", "cv")]
out3$model <- "Unstratified Cluster-level Model"
out4 <- cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full", c("admin2.name.full", "mean", "cv")]
out4$model <- "Stratified Cluster-level Model"
g1 <- mapPlot(data = rbind(out1, out2, out3, out4), geo = poly.adm2,
by.data = "admin2.name.full", by.geo = "admin2.name.full",
is.long = TRUE, variable = "model", value = "mean",
legend.label = "Mean", ncol = 4)
g2 <- mapPlot(data = rbind(out1, out2, out3, out4), geo = poly.adm2,
by.data = "admin2.name.full", by.geo = "admin2.name.full",
is.long = TRUE, variable = "model", value = "cv",
legend.label = "CV", ncol = 4)
g1 / g2
The scatterPlot()
function takes results from two model outputs, and produce a scatter plot of the selected variable. For example, we can compare the direct estimates with the Fay-Herriot model (top row) and Cluster-level model (bottom row) in terms of their point estimates and standard deviations. In the two plots on the left, we see the expected shrinkage (attenuation) of the spatial FayāHerriot posterior mean estimates as compared to the direct estimates. In the plots on the right we see the reduction in uncertainty which arises from using all of the data in a single Model.
s1 <- scatterPlot(
res1=res_ad1$res.admin1,
res2=smth_res_ad1_bym2$res.admin1,
value1="direct.est",
value2="mean",
by.res1="admin1.name",
by.res2="admin1.name",
title="Fay-Herriot vs Direct estimate",
label1="Direct Est",
label2="Spatial Fay-Herriot")
s2 <- scatterPlot(
res1=res_ad1$res.admin1,
res2=smth_res_ad1_bym2$res.admin1,
value1="direct.se",
value2="sd",
by.res1="admin1.name",
by.res2="admin1.name",
title="FayāHerriot vs Direct SE",
label1="Direct Est",
label2="Spatial FayāHerriot")
s3 <- scatterPlot(
res1=res_ad1$res.admin1,
res2=cl_res_ad1$res.admin1,
value1="direct.est",
value2="mean",
by.res1="admin1.name",
by.res2="admin1.name",
title="Cluster-level model vs Direct estimate",
label1="Direct Est",
label2="Cluster-level model")
s4 <- scatterPlot(
res1=res_ad1$res.admin1,
res2=cl_res_ad1$res.admin1,
value1="direct.se",
value2="sd",
by.res1="admin1.name",
by.res2="admin1.name",
title="Cluster-level model vs Direct SE",
label1="Direct Est",
label2="Cluster-level model")
(s1 + s2 ) / (s3 + s4)
Similar smoothing patterns can be observed when comparing estimates at Admin 2 level.
s1 <- scatterPlot(
res1=res_ad2$res.admin2,
res2=smth_res_ad2$res.admin2,
value1="direct.est",
value2="mean",
by.res1="admin2.name.full",
by.res2="admin2.name.full",
title="Fay-Herriot vs Direct estimate",
label1="Direct Est",
label2="Spatial Fay-Herriot")
s2 <- scatterPlot(
res1=res_ad2$res.admin2,
res2=smth_res_ad2$res.admin2,
value1="direct.se",
value2="sd",
by.res1="admin2.name.full",
by.res2="admin2.name.full",
title="FayāHerriot vs Direct SE",
label1="Direct Est",
label2="Spatial FayāHerriot")
s3 <- scatterPlot(
res1=res_ad2$res.admin2,
res2=cl_res_ad2$res.admin2,
value1="direct.est",
value2="mean",
by.res1="admin2.name.full",
by.res2="admin2.name.full",
title="Cluster-level model vs Direct estimate",
label1="Direct Est",
label2="Cluster-level model")
s4 <- scatterPlot(
res1=res_ad2$res.admin2,
res2=cl_res_ad2$res.admin2,
value1="direct.se",
value2="sd",
by.res1="admin2.name.full",
by.res2="admin2.name.full",
title="Cluster-level model vs Direct SE",
label1="Direct Est",
label2="Cluster-level model")
(s1 + s2 ) / (s3 + s4)
Finally, we compare the unstratified and stratified cluster-level model at Admin 2 level. We can observe mild differences but the estimates are mostly similar between the two models.
s1 <- scatterPlot(
res1=cl_res_ad2$res.admin2,
res2=cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full",],
value1="mean",
value2="mean",
by.res1="admin2.name.full",
by.res2="admin2.name.full",
title="Stratified vs Unstratified model estimate",
label1="Unstratified model estimates",
label2="Stratified model estimates")
s2 <- scatterPlot(
res1=cl_res_ad2$res.admin2,
res2=cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full",],
value1="sd",
value2="sd",
by.res1="admin2.name.full",
by.res2="admin2.name.full",
title="Stratified vs Unstratified model SD",
label1="Unstratified model estimates",
label2="Stratified model estimates")
(s1 + s2 )
Another visualization function intervalPlot()
can provide interval plots at different admin levels for model comparison with compare=T
. Input should be a list of model outputs by surveyPrev
with self defined model name, (e.g.model=list("name 1"= fit1,"name 2"= fit2)
).
intervalPlot(admin = 1, compare = TRUE, model = list(
"Direct estimate model"= res_ad1,
"Fay-Herriot model"= smth_res_ad1_bym2,
"Unstratified Cluster-level model"= cl_res_ad1))