---
title: "`srvyr` compared to the `survey` package"
author: "Greg Freedman"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{srvyr Compared to the survey Package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
The `srvyr` package adds `dplyr` like syntax to the `survey` package.
This vignette focuses on how `srvyr` compares to the `survey` package, for more
information about survey design and analysis, check out the vignettes in the `survey`
package, or Thomas Lumley's book, [*Complex Surveys: A Guide to Analysis
Using R*](http://r-survey.r-forge.r-project.org/svybook/). (Also see the bottom
of this document for some more resources).
Everything that `srvyr` can do, can also be done in `survey`. In fact, behind
the scenes the `survey` package is doing all of the hard work for `srvyr`.
`srvyr` strives to make your code simpler and more easily readable to you,
especially if you are already used to the `dplyr` package.
# Motivating example
The `dplyr` package has made it easy to write code to summarize data.
For example, if we wanted to check how the year-to-year change in academic
progress indicator score varied by school level and percent of parents were
high school graduates, we can do this:
```{r, message = FALSE, fig.width = 6}
library(survey)
library(ggplot2)
library(dplyr)
data(api)
out <- apistrat %>%
mutate(hs_grad_pct = cut(hsg, c(0, 20, 100), include.lowest = TRUE,
labels = c("<20%", "20+%"))) %>%
group_by(stype, hs_grad_pct) %>%
summarize(api_diff = weighted.mean(api00 - api99, pw),
n = n())
ggplot(data = out, aes(x = stype, y = api_diff, group = hs_grad_pct, fill = hs_grad_pct)) +
geom_col(stat = "identity", position = "dodge") +
geom_text(aes(y = 0, label = n), position = position_dodge(width = 0.9), vjust = -1)
```
However, if we wanted to add error bars to the graph to capture the uncertainty due to sampling
variation, we have to completely rewrite the `dplyr` code for the `survey` package.
`srvyr` allows a more direct translation.
# Preparing a survey dataset
`as_survey_design()`, `as_survey_rep()` and `as_survey_twophase()` are analogous
to `survey::svydesign()`, `survey::svrepdesign()` and `survey::twophase()`
respectively. Because they are designed to match `dplyr`'s style of non-standard
evaluation, they accept bare column names instead of formulas (~). They also
move the data argument first, so that it is easier to use `magrittr` pipes
(`%>%`).
```{r, message = FALSE}
library(srvyr)
# simple random sample
srs_design_srvyr <- apisrs %>% as_survey_design(ids = 1, fpc = fpc)
srs_design_survey <- svydesign(ids = ~1, fpc = ~fpc, data = apisrs)
```
The `srvyr` functions also accept `dplyr::select()`'s special selection
functions (such as `starts_with()`, `one_of()`, etc.), so these functions are
analogous:
```{r, message = FALSE}
# selecting variables to keep in the survey object (stratified example)
strat_design_srvyr <- apistrat %>%
as_survey_design(1, strata = stype, fpc = fpc, weight = pw,
variables = c(stype, starts_with("api")))
strat_design_survey <- svydesign(~1, strata = ~stype, fpc = ~fpc,
variables = ~stype + api99 + api00 + api.stu,
weight = ~pw, data = apistrat)
```
The function `as_survey()` will automatically choose between the three `as_survey_*`
functions based on the arguments, so you can save a few keystrokes.
```{r, message = FALSE}
# simple random sample (again)
srs_design_srvyr2 <- apisrs %>% as_survey(ids = 1, fpc = fpc)
```
# Data manipulation
Once you've set up your survey data, you can use `dplyr` verbs such as `mutate()`,
`select()`, `filter()` and `rename()`.
```{r, message = FALSE}
strat_design_srvyr <- strat_design_srvyr %>%
mutate(api_diff = api00 - api99) %>%
rename(api_students = api.stu)
strat_design_survey$variables$api_diff <- strat_design_survey$variables$api00 -
strat_design_survey$variables$api99
names(strat_design_survey$variables)[names(strat_design_survey$variables) == "api.stu"] <- "api_students"
```
Note that `arrange()` is not available, because the `srvyr` object expects to
stay in the same order. Nor are two-table verbs such as `full_join()`,
`bind_rows()`, etc. available to `srvyr` objects either because they may have
implications on the survey design. If you need to use these functions, you
should use them earlier in your analysis pipeline, when the objects are still
stored as `data.frame`s.
# Summary statistics
## Of the entire population
`srvyr` also provides `summarize()` and several survey-specific functions that
calculate summary statistics on numeric variables: `survey_mean()`, `survey_total()`,
`survey_quantile()` and `survey_ratio()`. These functions differ from their
counterparts in `survey` because they always return a data.frame in a consistent
format. As such, they do not return the variance-covariance matrix, and so are
not as flexible.
```{r, message = FALSE}
# Using srvyr
out <- strat_design_srvyr %>%
summarize(api_diff = survey_mean(api_diff, vartype = "ci"))
out
# Using survey
out <- svymean(~api_diff, strat_design_survey)
out
confint(out)
```
## By group
`srvyr` also allows you to calculate statistics on numeric variables by group,
using `group_by()`.
```{r, message = FALSE}
# Using srvyr
strat_design_srvyr %>%
group_by(stype) %>%
summarize(api_increase = survey_total(api_diff >= 0),
api_decrease = survey_total(api_diff < 0))
# Using survey
svyby(~api_diff >= 0, ~stype, strat_design_survey, svytotal)
```
## Proportions by group
You can also calculate the proportion or count in each group of a factor
or character variable by leaving x empty in `survey_mean()` or `survey_total()`.
```{r, message = FALSE}
# Using srvyr
srs_design_srvyr %>%
group_by(awards) %>%
summarize(proportion = survey_mean(),
total = survey_total())
# Using survey
svymean(~awards, srs_design_survey)
svytotal(~awards, srs_design_survey)
```
## Unweighted calculations
Finally, the `unweighted()` function can act as an escape hatch to calculate unweighted
calculations on the dataset.
```{r, message = FALSE}
# Using srvyr
strat_design_srvyr %>%
group_by(stype) %>%
summarize(n = unweighted(n()))
# Using survey
svyby(~api99, ~stype, strat_design_survey, unwtd.count)
```
# Back to the example
So now, we have all the tools needed to create the first graph and add error bounds.
Notice that the data manipulation code is nearly identical to the `dplyr` code, with a
little extra set up, and replacing `weighted.mean()` with `survey_mean`.
```{r, message = FALSE, fig.width = 6}
strat_design <- apistrat %>%
as_survey_design(strata = stype, fpc = fpc, weight = pw)
out <- strat_design %>%
mutate(hs_grad_pct = cut(hsg, c(0, 20, 100), include.lowest = TRUE,
labels = c("<20%", "20+%"))) %>%
group_by(stype, hs_grad_pct) %>%
summarize(api_diff = survey_mean(api00 - api99, vartype = "ci"),
n = unweighted(n()))
ggplot(data = out, aes(x = stype, y = api_diff, group = hs_grad_pct, fill = hs_grad_pct,
ymax = api_diff_upp, ymin = api_diff_low)) +
geom_col(stat = "identity", position = "dodge") +
geom_errorbar(position = position_dodge(width = 0.9), width = 0.1) +
geom_text(aes(y = 0, label = n), position = position_dodge(width = 0.9), vjust = -1)
```
# Comparison to the survey package (Degrees of freedom)
For the most part, `srvyr` tries to be a drop-in replacement for the survey package, only
changing the syntax that you wrote. However, the way that calculations of degrees of
freedom when calculating confidence intervals is different.
`srvyr` assumes that you want to use the true degrees of freedom by default, but the `survey`
package uses `Inf` as the default. You can use the argument `df` to get the same result
as the survey package.
```{r}
# Set pillar print methods so tibble has more decimal places
old_sigfig <- options("pillar.sigfig")
options("pillar.sigfig" = 5)
# survey default
estimate <- svymean(~api99, strat_design)
confint(estimate)
# srvyr default
strat_design %>%
summarize(x = survey_mean(api99, vartype = "ci"))
# setting the degrees of freedom so srvyr matches survey default
strat_design %>%
summarize(x = survey_mean(api99, vartype = "ci", df = Inf)) %>%
print()
# setting the degrees of freedom so survey matches survey default
confint(estimate, df = degf(strat_design))
# reset significant figures
options("pillar.sigfig" = old_sigfig)
```
# Grab Bag
## Using `survey` functions on `srvyr` objects
Because `srvyr` objects are just `survey` objects with some extra structure,
all of the functions from `survey` will still work with them. If you need to calculate
something beyond simple summary statistics, you can use `survey` functions.
```{r, message = FALSE}
glm <- svyglm(api00 ~ ell + meals + mobility, design = strat_design)
summary(glm)
```
## Using expressions to create variables on the fly
Like `dplyr`, `srvyr` allows you to use expressions in the arguments,
allowing you to create variables in a single step. For example, you can
use expressions:
1) as the arguments inside the survey statistic functions like `survey_mean`
```{r, message = FALSE}
strat_design %>%
summarize(prop_api99_over_700 = survey_mean(api99 > 700))
```
2) as an argument to `summarize`
```{r, message = FALSE}
strat_design %>%
group_by(awards) %>%
summarize(percentage = 100 * survey_mean())
```
3) and you can even create variables inside of `group_by`
```{r, message = FALSE}
strat_design %>%
group_by(api99_above_700 = api99 > 700) %>%
summarize(api00_mn = survey_mean(api00))
```
Though on-the-fly expressions are syntactically valid, it is possible to make statistically
invalid numbers from them. For example, though the standard error and confidence intervals
can be multiplied by a scalar (like 100), the variance does not scale the same way, so the
following is invalid:
```{r, message = FALSE, eval=FALSE}
# BAD DON'T DO THIS!
strat_design %>%
group_by(awards) %>%
summarize(percentage = 100 * survey_mean(vartype = "var"))
# VARIANCE IS WRONG
```
## Non-Standard evaluation
Srvyr supports the non-standard evaluation conventions that dplyr uses.
If you'd like to use a function programmatically, you can use the functions from
rlang like the `{{` operator (aka "curly curly") from `rlang`.
Here's a quick example, but please see the dplyr vignette
[`vignette("programming", package = "dplyr")`](https://dplyr.tidyverse.org/articles/programming.html)
for more details.
```{r, message = FALSE}
mean_with_ci <- function(.data, var) {
summarize(.data, mean = survey_mean({{var}}, vartype = "ci"))
}
srs_design_srvyr <- apisrs %>% as_survey_design(fpc = fpc)
mean_with_ci(srs_design_srvyr, api99)
```
Srvyr will also follow dplyr's lead on deprecating the old methods of NSE,
such as `rlang::quo`, and `!!`, in addition to the so-called "underscore functions"
(like `summarize_`). Currently, they have been soft-deprecated, they may be
removed altogether in some future version of srvyr.
## Working column-wise
As of version 1.0 of srvyr, it supports dplyr's across function, so when you want
to calculate a statistic on more than one variable, it is easy to do so.
See [`vignette("colwise", package = "dplyr")`](https://dplyr.tidyverse.org/articles/colwise.html) for more details, but here is another quick example:
```{r}
# Calculate survey mean for all variables that have names starting with "api"
strat_design %>%
summarize(across(starts_with("api"), survey_mean))
```
Srvyr also supports older methods of working column-wise, the "scoped variants", such as
`summarize_at`, `summarize_if`, `summarize_all` and `summarize_each`. Again, these are
maintained for backwards compatibility, matching what the tidyverse team has done, but
may be removed from a future version.
## Calculating proportions in groups
You can calculate the weighted proportion that falls into a group using the
`survey_prop()` function (or the `survey_mean()` function with no `x` argument).
The proportion is calculated by "unpeeling" the last variable used in `group_by()`
and then calculating the proportion within the other groups that fall into the last
group (so that the proportion within each group that was unpeeled sums to 100%).
```{r}
# Calculate the proportion that falls into each category of `awards` per `stype`
strat_design %>%
group_by(stype, awards) %>%
summarize(prop = survey_prop())
```
If you want to calculate the proportion for groups from multiple variables at the same
time that add up to 100%, the `interact` function can help. The `interact` function
creates a variable that is automatically split apart so that more than one variable
can be unpeeled.
```{r}
# Calculate the proportion that falls into each category of both `awards` and `stype`
strat_design %>%
group_by(interact(stype, awards)) %>%
summarize(prop = survey_prop())
```
# Learning More
Here are some free resources put together by the community about srvyr:
- **"How-to"s & examples of using srvyr**
- Stephanie Zimmer & Rebecca Powell's [2021 AAPOR Workshop "Tidy Survey Analysis in R using the srvyr Package"](https://github.com/szimmer/tidy-survey-aapor-2021)
- "The Epidemiologist R Handbook", by Neale Batra et al. has a [chapter on survey analysis](https://epirhandbook.com/en/) with srvyr and survey package examples
- Kieran Healy's book ["Data Visualization: A Practical Introduction"](https://socviz.co/modeling.html#plots-from-complex-surveys) has a section on using srvyr to visualize the ESS.
- The IPUMS PMA team's blog had a series showing examples of using the [PMA COVID survey panel with weights](https://tech.popdata.org/pma-data-hub/index.html)
- ["Open Case Studies: Vaping Behaviors in American Youth"](https://www.opencasestudies.org/ocs-bp-vaping-case-study/) by Carrie Wright, Michael Ontiveros, Leah Jager, Margaret Taub, and Stephanie Hicks is a detailed case study that includes using srvyr to analyze the National Youth Tobacco Survey.
- ["How to plot Likert scales with a weighted survey in a dplyr friendly way"](https://towardsdatascience.com/how-to-plot-likert-scales-with-a-weighted-survey-in-a-dplyr-friendly-way-68df600881a) by Francisco Suárez Salas
- The tidycensus package vignette ["Working with Census microdata"](https://walker-data.com/tidycensus/articles/pums-data.html) includes information about using the weights from the ACS retrieved from the census API.
- ["The Joy of Calculating the Direct Standard Error for PUMS Estimates"](https://ldaly.github.io/giveinandblogit/) by GitHub user @ldaly
- **About survey statistics**
- Thomas Lumley's book ["Complex Surveys: a guide to analysis using R"](http://r-survey.r-forge.r-project.org/svybook/)
- [Chris Skinner. Jon Wakefield. "Introduction to the Design and Analysis of Complex Survey Data." Statist. Sci. 32 (2) 165 - 175, May 2017. 10.1214/17-STS614](https://projecteuclid.org/accountAjax/Download?downloadType=journal%20article&urlId=10.1214%2F17-STS614&isResultClick=True)
- Sharon Lohr's textbook "Sampling: Design and Analysis". [Second ](https://www.sharonlohr.com/sampling-design-and-analysis-2e) or [Third ](https://www.sharonlohr.com/sampling-design-and-analysis-3e) Editions
- "Survey weighting is a mess" is the opening to Andrew Gelman's ["Struggles with Survey Weighting and Regression Modeling"](http://www.stat.columbia.edu/~gelman/research/published/STS226.pdf)
- Anthony Damico's website ["Analyze Survey Data for Free"](https://asdfree.com) has the weight specifications for a wide variety of public use survey datasets.
- **Working programmatically and/or on multiple columns at once (eg `dplyr::across` and `rlang`'s "curly curly" `{{}}`)**
- dplyr's included package vignettes ["Column-wise operations"](https://dplyr.tidyverse.org/articles/colwise.html) & ["Programming with dplyr"](https://dplyr.tidyverse.org/articles/programming.html)
- **Non-English resources**
- *Em português:* ["Análise de Dados Amostrais Complexos"](https://djalmapessoa.github.io/adac/) by Djalma Pessoa and Pedro Nascimento Silva
- *En español:* ["Usando R para jugar con los microdatos del INEGI"](https://medium.com/tacosdedatos/usando-r-para-sacar-información-de-los-microdatos-del-inegi-b21b6946cf4f) by Claudio Daniel Pacheco Castro
- *Tiếng Việt:* ["Dịch tễ học ứng dụng và y tế công cộng với R"](https://epirhandbook.com/vn/survey-analysis.html)
- **Other cool stuff that uses srvyr**
- A (free) graphical interface allowing exploratory data analysis of survey data without writing code: [iNZight](https://inzight.nz/) (and [survey data instructions](https://inzight.nz/docs/survey-specification.html))
- ["serosurvey: Serological Survey Analysis For Prevalence Estimation Under Misclassification"](https://avallecam.github.io/serosurvey/) by Andree Valle Campos
- Several packages on CRAN depend on srvyr, you can see them by looking at the [reverse Imports/Suggestions on CRAN](https://cran.r-project.org/package=srvyr).
**Still need help?**
I think the best way to get help is to form a specific question and ask it in some place like [posit's community website](https://forum.posit.co/) (known for it's friendly community) or [stackoverflow.com](https://stackoverflow.com) (maybe not known for being quite as friendly, but probably has more people). If you think you've found a bug in srvyr's code, please file an [issue on GitHub](https://github.com/gergness/srvyr/issues/new), but note that I'm not a great resource for helping specific issue, both because I have limited capacity but also because I do not consider myself an expert in the statistical methods behind survey analysis.
**Have something to add?**
These resources were mostly found via vanity searches on twitter & github. If you know of anything I missed, or have written something yourself, [please let me know in this GitHub issue]()!