Analyzing Census Data with Data Commons

Learning Objectives

By the end of this vignette, you will be able to:

Why Start with Census Data?

Census data provides an ideal introduction to Data Commons for several reasons:

  1. Familiar territory: Many analysts have worked with census data, making it easier to focus on learning the Data Commons approach rather than the data itself.

  2. Rich relationships: Census data showcases the power of the knowledge graph through natural hierarchies (country → state → county → city), demonstrating how Data Commons connects entities.

  3. Integration opportunities: Census demographics become even more valuable when combined with health, environmental, and economic data—showing the true power of Data Commons.

  4. Real-world relevance: The examples we’ll explore address actual policy questions that require integrated data to answer properly.

Why Data Commons?

The R ecosystem has excellent packages for specific data sources. For example, the tidycensus package provides fantastic functionality for working with U.S. Census data, with deep dataset-specific features and conveniences.

So why use Data Commons? The real value is in data integration.

Data Commons is part of Google’s philanthropic initiatives, designed to democratize access to public data by combining datasets from organizations like the UN, World Bank, and U.S. Census into a unified knowledge graph.

Imagine you’re a policy analyst studying the social determinants of health. You need to analyze relationships between:

With traditional approaches, you’d need to:

  1. Learn multiple different APIs
  2. Deal with different geographic coding systems
  3. Reconcile different time periods and update cycles
  4. Match entities across datasets (is “Los Angeles County” the same in all datasets?)

Data Commons solves this by providing a unified knowledge graph that links all these datasets together. One API, one set of geographic identifiers, one consistent way to access everything. The datacommons R package is your gateway to this integrated data ecosystem, enabling reproducible analysis pipelines that seamlessly combine diverse data sources.

Understanding the Knowledge Graph

Data Commons organizes information as a graph, similar to how web pages link to each other. Here’s the key terminology:

DCIDs (Data Commons IDs)

Every entity in Data Commons has a unique identifier called a DCID. Think of it like a social security number for data:
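A few concrete examples make this clearer. These DCIDs are the ones used later in this vignette (you can verify any DCID at datacommons.org); note that the geoId/ form wraps standard U.S. Census FIPS codes:

```r
# Example DCIDs; the geoId/ form embeds U.S. Census FIPS codes
dcids <- c(
  "United States"      = "country/USA",
  "California"         = "geoId/06",
  "Los Angeles County" = "geoId/06037"
)

# County FIPS codes begin with the two-digit state code, so the
# containing state is recoverable from the DCID itself
state_part <- substr(sub("geoId/", "", dcids[["Los Angeles County"]]), 1, 2)
state_part # "06", i.e. California
```

Because every dataset in the graph uses the same DCID for Los Angeles County, there is no entity matching to do when combining sources.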

Relationships

Entities are connected by relationships, following the Schema.org standard—a collaborative effort to create structured data vocabularies that help machines understand web content. For Data Commons, this means consistent, machine-readable relationships between places and data:

This structure lets us traverse the graph to find related information. Want all counties in California? Follow the containedInPlace relationships backward.

Statistical Variables

These are the things we can measure:

The power comes from being able to query any variable for any place using the same consistent approach through the datacommons R package.
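For reference, these are the statistical variable DCIDs used in the examples below; collecting them in a named vector keeps later queries readable:

```r
# Statistical variables used in this vignette, by originating source
vars <- c(
  population   = "Count_Person",            # Census
  median_age   = "Median_Age_Person",       # Census
  obesity_rate = "Percent_Person_Obesity",  # CDC
  unemployment = "UnemploymentRate_Person"  # BLS
)

# The same variable names work for any place; only the entity changes
unname(vars[c("population", "median_age")])
```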

Prerequisites

# Core packages
library(datacommons)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(scales)
library(knitr)

# Set a consistent theme for plots
theme_set(theme_minimal())

Setting Up API Access

You’ll need a free API key; see https://docs.datacommons.org/api/#obtain-an-api-key for instructions on obtaining one.

# Set your API key
dc_set_api_key("YOUR_API_KEY_HERE")

# Or manually set DATACOMMONS_API_KEY in your .Renviron file

If you add the key to your .Renviron file, restart your R session so the variable is loaded automatically.
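If you take the .Renviron route, a quick sanity check confirms that the key is visible to R (the DATACOMMONS_API_KEY variable name comes from the setup steps above):

```r
# Confirm the API key is available in the current session
key <- Sys.getenv("DATACOMMONS_API_KEY")
if (nzchar(key)) {
  message("API key found (", nchar(key), " characters)")
} else {
  warning("DATACOMMONS_API_KEY is not set; see the setup steps above")
}
```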

Finding What You Need

The datacommons R package requires three key pieces:

  1. Statistical Variables: What you want to measure
  2. Place DCIDs: Where you want to measure it
  3. Dates: When you want the measurement
    • Use "all" to retrieve the entire time series
    • Use "latest" for the most recent available data
    • Use specific years like "2020" or date ranges
    • See the API documentation for advanced date options
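A small helper (hypothetical, not part of the datacommons package) illustrates the accepted shapes of the date argument for a single string:

```r
# Hypothetical validator: "all", "latest", or an ISO-style date
# such as "2020", "2020-06", or "2020-06-15" (scalar input assumed)
is_valid_date_arg <- function(x) {
  x %in% c("all", "latest") ||
    grepl("^\\d{4}(-\\d{2}){0,2}$", x)
}

is_valid_date_arg("latest") # TRUE
is_valid_date_arg("2020")   # TRUE
is_valid_date_arg("june")   # FALSE
```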

Let’s see how to find these in practice.

Example 2: Using Relationships to Find States

Real-world motivation: State-level demographic analysis is crucial for understanding regional variations in aging, which impacts healthcare planning, workforce development, and social services allocation.

# Method 1: Use the parent/child relationship in observations
state_data <- dc_get_observations(
  variable_dcids = c("Count_Person", "Median_Age_Person"),
  date = "latest",
  parent_entity = "country/USA",
  entity_type = "State",
  return_type = "data.frame"
)

# The code automatically constructs the following query:
# Find all entities of type State contained in country/USA

glimpse(state_data)
#> Rows: 724
#> Columns: 8
#> $ entity_dcid   <chr> "geoId/04", "geoId/11", "geoId/17", "geoId/01", "geoId/1…
#> $ entity_name   <chr> "Arizona", "District of Columbia", "Illinois", "Alabama"…
#> $ variable_dcid <chr> "Count_Person", "Count_Person", "Count_Person", "Count_P…
#> $ variable_name <chr> "Total population", "Total population", "Total populatio…
#> $ date          <chr> "2023", "2023", "2020", "2024", "2024", "2023", "2023", …
#> $ value         <dbl> 7268175, 672079, 12812508, 5157699, 11180878, 21928881, …
#> $ facet_id      <chr> "1964317807", "1145703171", "1541763368", "2176550201", …
#> $ facet_name    <chr> "CensusACS5YearSurvey_SubjectTables_S0101", "CensusACS5Y…

# Process the data - reshape from long to wide
state_summary <- state_data |>
  filter(str_detect(facet_name, "Census")) |> # Use Census data
  select(
    entity = entity_dcid,
    state_name = entity_name,
    variable = variable_dcid,
    value
  ) |>
  group_by(entity, state_name, variable) |>
  # Take first if duplicates
  summarize(value = first(value), .groups = "drop") |>
  pivot_wider(names_from = variable, values_from = value) |>
  filter(
    !is.na(Count_Person),
    !is.na(Median_Age_Person),
    Count_Person > 500000 # Focus on states, not small territories
  )

# Visualize the relationship
ggplot(state_summary, aes(x = Median_Age_Person, y = Count_Person)) +
  geom_point(aes(size = Count_Person), alpha = 0.6, color = "darkblue") +
  geom_text(
    data = state_summary |> filter(Count_Person > 10e6),
    aes(label = state_name),
    vjust = -1, hjust = 0.5, size = 3
  ) +
  scale_y_log10(labels = label_comma()) +
  scale_size(range = c(3, 10), guide = "none") +
  labs(
    title = "State Demographics: Population vs. Median Age",
    x = "Median Age (years)",
    y = "Population (log scale)",
    caption = "Source: U.S. Census Bureau via Data Commons"
  )


# Find extremes
state_summary |>
  filter(
    Median_Age_Person == min(Median_Age_Person) |
      Median_Age_Person == max(Median_Age_Person)
  ) |>
  select(state_name, Median_Age_Person, Count_Person) |>
  mutate(Count_Person = label_comma()(Count_Person)) |>
  kable(caption = "States with extreme median ages")
States with extreme median ages

| state_name | Median_Age_Person | Count_Person |
|------------|-------------------|--------------|
| Maine      | 44.8              | 1,377,400    |
| Utah       | 31.7              | 3,331,187    |

Key insight: The 13-year gap in median age between Maine (44.8) and Utah (31.7) represents dramatically different demographic challenges. Maine faces an aging workforce and growing healthcare demands, while Utah’s younger population suggests different priorities around education and family services. These demographic differences drive fundamentally different policy needs across states.

Example 3: Cross-Dataset Integration

Real-world motivation: Public health researchers often need to understand how socioeconomic factors relate to health outcomes. Let’s explore potential connections between age demographics, obesity rates, and economic conditions at the county level—exactly the type of integrated analysis that would be extremely difficult without Data Commons.

# Get multiple variables for California counties
# Notice how we can mix variables from different sources in one query!
ca_integrated <- dc_get_observations(
  variable_dcids = c(
    "Count_Person", # Census
    "Median_Age_Person", # Census
    "Percent_Person_Obesity", # CDC
    "UnemploymentRate_Person" # BLS
  ),
  date = "latest",
  parent_entity = "geoId/06", # California
  entity_type = "County",
  return_type = "data.frame"
)

# Check which sources we're pulling from
ca_integrated |>
  group_by(variable_name, facet_name) |>
  summarize(n = n(), .groups = "drop") |>
  slice_head(n = 10) |>
  kable(caption = "Data sources by variable")
Data sources by variable

| variable_name | facet_name | n |
|---------------|------------|---|
| Median age of population | CensusACS5YearSurvey | 58 |
| Median age of population | CensusACS5YearSurvey_SubjectTables_S0101 | 58 |
| Percentage of Adult Population That Is Obese | CDC500 | 232 |
| Total population | CDC_Mortality_UnderlyingCause | 58 |
| Total population | CDC_Social_Vulnerability_Index | 58 |
| Total population | CensusACS1YearSurvey | 41 |
| Total population | CensusACS5YearSurvey | 58 |
| Total population | CensusACS5YearSurvey_SubjectTables_S0101 | 58 |
| Total population | CensusPEP | 56 |
| Total population | USCensusPEP_AgeSexRaceHispanicOrigin | 58 |

# Process the integrated data
ca_analysis <- ca_integrated |>
  # Pick one source per variable for consistency
  filter(
    (variable_dcid == "Count_Person" &
       str_detect(facet_name, "CensusACS5Year")) |
      (variable_dcid == "Median_Age_Person" &
         str_detect(facet_name, "CensusACS5Year")) |
      (variable_dcid == "Percent_Person_Obesity" &
         str_detect(facet_name, "CDC")) |
      (variable_dcid == "UnemploymentRate_Person" &
         str_detect(facet_name, "BLS"))
  ) |>
  select(
    entity = entity_dcid,
    county_name = entity_name,
    variable = variable_dcid,
    value
  ) |>
  group_by(entity, county_name, variable) |>
  summarize(value = first(value), .groups = "drop") |>
  pivot_wider(names_from = variable, values_from = value) |>
  drop_na() |>
  mutate(
    county_name = str_remove(county_name, " County$"),
    population_k = Count_Person / 1000
  )

# Explore relationships between variables
ggplot(ca_analysis, aes(x = Median_Age_Person, y = Percent_Person_Obesity)) +
  geom_point(aes(size = Count_Person, color = UnemploymentRate_Person),
    alpha = 0.7
  ) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  scale_size(range = c(2, 10), guide = "none") +
  scale_color_viridis_c(name = "Unemployment\nRate (%)") +
  labs(
    title = "California Counties: Age, Obesity, and Unemployment",
    subtitle = "Integrating Census, CDC, and BLS data through Data Commons",
    x = "Median Age (years)",
    y = "Obesity Rate (%)",
    caption = "Sources: Census ACS, CDC PLACES, BLS via Data Commons"
  ) +
  theme(legend.position = "right")


# Show correlations
ca_analysis |>
  select(Median_Age_Person, Percent_Person_Obesity, UnemploymentRate_Person) |>
  cor() |>
  round(2) |>
  kable(caption = "Correlations between demographic and health variables")
Correlations between demographic and health variables

|  | Median_Age_Person | Percent_Person_Obesity | UnemploymentRate_Person |
|--|-------------------|------------------------|-------------------------|
| Median_Age_Person | 1.00 | -0.36 | -0.43 |
| Percent_Person_Obesity | -0.36 | 1.00 | 0.62 |
| UnemploymentRate_Person | -0.43 | 0.62 | 1.00 |

Key insight: The negative correlation (-0.36) between median age and obesity rates is counterintuitive—we might expect older populations to have higher obesity rates. However, the strong positive correlation (0.62) between unemployment and obesity suggests economic factors may be more important than age. This type of finding, made possible by the datacommons package’s easy data integration, could inform targeted public health interventions that address economic barriers to healthy living.

Example 4: Time Series Across Cities

Real-world motivation: Urban planners and demographers track city growth patterns to inform infrastructure investments, housing policy, and resource allocation. Let’s examine how major U.S. cities have grown differently over the past two decades.

# Major city DCIDs (found using datacommons.org/place)
major_cities <- c(
  "geoId/3651000", # New York City
  "geoId/0644000", # Los Angeles
  "geoId/1714000", # Chicago
  "geoId/4835000", # Houston
  "geoId/0455000" # Phoenix
)

# Get historical population data
city_populations <- dc_get_observations(
  variable_dcids = "Count_Person",
  entity_dcids = major_cities,
  date = "all",
  return_type = "data.frame"
)

# Process - use Census PEP data for consistency
city_pop_clean <- city_populations |>
  filter(
    str_detect(facet_name, "USCensusPEP"),
    !is.na(value)
  ) |>
  mutate(
    year = as.integer(date),
    population_millions = value / 1e6,
    city = str_extract(entity_name, "^[^,]+")
  ) |>
  filter(year >= 2000) |>
  group_by(city, year) |>
  summarize(
    population_millions = mean(population_millions),
    .groups = "drop"
  )

# Visualize growth trends
ggplot(
  city_pop_clean,
  aes(x = year, y = population_millions, color = city)
) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = label_number(suffix = "M")) +
  labs(
    title = "Population Growth in Major U.S. Cities",
    x = "Year",
    y = "Population (millions)",
    color = "City",
    caption = "Source: Census Population Estimates Program via Data Commons"
  ) +
  theme(legend.position = "bottom")


# Calculate growth rates
city_pop_clean |>
  group_by(city) |>
  filter(year == min(year) | year == max(year)) |>
  arrange(city, year) |>
  summarize(
    years = paste(min(year), "-", max(year)),
    start_pop = first(population_millions),
    end_pop = last(population_millions),
    total_growth_pct = round((end_pop / start_pop - 1) * 100, 1),
    .groups = "drop"
  ) |>
  arrange(desc(total_growth_pct)) |>
  kable(
    caption = "City population growth rates",
    col.names = c("City", "Period", "Start (M)", "End (M)", "Growth %")
  )
City population growth rates

| City | Period | Start (M) | End (M) | Growth % |
|------|--------|-----------|---------|----------|
| Phoenix | 2000 - 2024 | 1.327196 | 1.673164 | 26.1 |
| Houston | 2000 - 2024 | 1.977408 | 2.390125 | 20.9 |
| New York City | 2000 - 2024 | 8.015209 | 8.478072 | 5.8 |
| Los Angeles | 2000 - 2024 | 3.702574 | 3.878704 | 4.8 |
| Chicago | 2000 - 2024 | 2.895723 | 2.721308 | -6.0 |

Key insight: The stark contrast between Sun Belt cities (Phoenix +26%, Houston +21%) and Rust Belt Chicago (-6%) reflects major economic and demographic shifts in America. These patterns have profound implications for everything from housing affordability to political representation. The datacommons package makes it trivial to extend this analysis—we could easily add variables like temperature, job growth, or housing costs to understand what drives these migration patterns.

Working with Data Sources (Facets)

Each data source in Data Commons is identified by a facet_id and facet_name. Understanding these helps you choose the right data:

# Example: What sources are available for U.S. population?
# (`us_population` holds Count_Person observations for country/USA,
# retrieved in an earlier example)
us_population |>
  filter(date == "2020") |>
  select(facet_name, value) |>
  distinct() |>
  arrange(desc(value)) |> # Sort numerically before formatting as text
  mutate(value = label_comma()(value)) |>
  kable(caption = "Different sources for 2020 U.S. population")
Different sources for 2020 U.S. population

| facet_name | value |
|------------|-------|
| WorldDevelopmentIndicators | 331,577,720 |
| USCensusPEP_AgeSexRaceHispanicOrigin | 331,577,720 |
| USCensusPEP_Annual_Population | 331,526,933 |
| OECDRegionalDemography_Population | 331,526,933 |
| USDecennialCensus_RedistrictingRelease | 331,449,281 |
| CensusACS5YearSurvey_AggCountry | 329,824,950 |
| CDC_Mortality_UnderlyingCause | 329,484,123 |
| CensusACS5YearSurvey_SubjectTables_S0101 | 326,569,308 |
| CensusACS5YearSurvey_SubjectTables_S2601A | 326,569,308 |
| CensusACS5YearSurvey_SubjectTables_S2602 | 326,569,308 |
| CensusACS5YearSurvey_SubjectTables_S2603 | 326,569,308 |

Common sources and when to use them:

Tips for Effective Use

  1. Start with the websites: Use Data Commons web tools to explore what’s available before writing code.

  2. Use relationships: The parent_entity and entity_type parameters are powerful for traversing the graph.

  3. Be specific about sources: Filter by facet_name when you need consistency across places or times.

  4. Expect messiness: Real-world data has gaps, multiple sources, and inconsistencies. Plan for it.

  5. Leverage integration: The real power is in combining datasets that would normally require multiple APIs.
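Tip 3 in miniature: with a toy observations table (values here are made up), pinning one facet per variable avoids silently mixing sources when you reshape or compare:

```r
library(dplyr)
library(stringr)

# Toy observations: the same variable reported by two different facets
obs <- data.frame(
  variable_dcid = c("Count_Person", "Count_Person"),
  facet_name    = c("CensusACS5YearSurvey", "CensusPEP"),
  value         = c(100, 101)
)

# Keep only the ACS 5-year figure for consistency across places
obs |> filter(str_detect(facet_name, "ACS5Year"))
```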

Going Further

The datacommons R package currently focuses on data retrieval, which is the foundation for analysis. While more advanced features may come in future versions, you can already:

The package integrates seamlessly with the tidyverse, making it easy to incorporate Data Commons into your existing R workflows and create reproducible analysis pipelines.

Explore more at:

Summary

In this vignette, we’ve explored how the datacommons R package provides unique value through data integration. While specialized packages like tidycensus excel at deep functionality for specific datasets, the datacommons package shines when you need to:

The power lies not in any single dataset, but in the connections between them. By using the datacommons package, you can focus on analysis rather than data wrangling, enabling insights that would be difficult or impossible to achieve through traditional approaches.

Resources