1.1 Introduction

The OMOP CDM is a person-centric model. The person table contains records that uniquely identify each individual along with some of their demographic information. Below we create a mock CDM reference which, as is standard, has a person table with fields indicating an individual’s date of birth, gender, race, and ethnicity. Gender, race, and ethnicity are each represented by a concept ID, and because the person table contains one record per person these fields are treated as time-invariant.

library(PatientProfiles)
library(duckdb)
library(dplyr)

cdm <- mockPatientProfiles(
  patient_size = 10000,
  drug_exposure_size = 10000
)

cdm$person %>% 
  dplyr::glimpse()
## Rows: ??
## Columns: 7
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ person_id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ gender_concept_id    <chr> "8507", "8507", "8507", "8532", "8507", "8532", "…
## $ year_of_birth        <dbl> 1972, 1999, 1922, 1975, 1995, 1926, 1943, 1958, 1…
## $ month_of_birth       <dbl> 9, 4, 6, 2, 12, 7, 8, 7, 6, 9, 3, 5, 4, 5, 1, 8, …
## $ day_of_birth         <dbl> 5, 6, 14, 19, 15, 12, 29, 7, 29, 5, 19, 29, 20, 2…
## $ race_concept_id      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

As well as the person table, every CDM reference will include an observation period table. This table contains spans of time during which an individual is considered to be under observation. Individuals can have multiple observation periods, but these cannot overlap.

cdm$observation_period %>% 
  dplyr::glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ observation_period_id         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ person_id                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ observation_period_start_date <date> 2006-11-16, 2006-03-22, 2007-10-10, 200…
## $ observation_period_end_date   <date> 2098-01-22, 2137-04-20, 2113-01-27, 204…
## $ period_type_concept_id        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
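
The non-overlap rule can be checked directly. The following is a minimal base R sketch on hypothetical toy data, not a PatientProfiles function: `has_overlap()` is a helper name introduced here for illustration, and we assume that both the start and end dates are inclusive, so two periods that share a day count as overlapping.

```r
# Hypothetical helper: TRUE if any person has two observation periods
# that overlap (start and end dates are assumed to be inclusive)
has_overlap <- function(op) {
  # Sort so that each person's periods are adjacent and in start order
  op <- op[order(op$person_id, op$observation_period_start_date), ]
  n <- nrow(op)
  if (n < 2) return(FALSE)
  # Compare each period with the one immediately before it
  same_person <- op$person_id[-1] == op$person_id[-n]
  starts_before_previous_end <-
    op$observation_period_start_date[-1] <= op$observation_period_end_date[-n]
  any(same_person & starts_before_previous_end)
}

# Toy data: person 1 has two separate periods, person 2 has one
valid_periods <- data.frame(
  person_id = c(1, 1, 2),
  observation_period_start_date = as.Date(c("2000-01-01", "2010-01-01", "2005-06-01")),
  observation_period_end_date = as.Date(c("2005-12-31", "2015-12-31", "2020-06-01"))
)

# Toy data: the second period for person 1 starts before the first one ends
invalid_periods <- data.frame(
  person_id = c(1, 1),
  observation_period_start_date = as.Date(c("2000-01-01", "2005-06-01")),
  observation_period_end_date = as.Date(c("2005-12-31", "2010-01-01"))
)

valid_result <- has_overlap(valid_periods)     # FALSE
invalid_result <- has_overlap(invalid_periods) # TRUE
```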

When performing analyses we will often be interested in working with the person and observation period tables to identify individuals’ characteristics on some date of interest. PatientProfiles provides a number of functions that can help us do this.

1.2 Adding characteristics to OMOP CDM tables

Let’s say we’re working with the condition occurrence table.

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 6
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ person_id                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 5, 4, 5, 3, 5, 5, 1, 2, 1, 5, 2, 5, 4, 3, 3,…
## $ condition_start_date      <date> 2010-03-04, 2009-04-11, 2006-12-02, 2012-06…
## $ condition_end_date        <date> 2010-12-09, 2011-11-29, 2007-07-24, 2014-03…
## $ condition_type_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

This table contains individuals’ diagnoses, and we might, for example, want to identify their age on the date of diagnosis. This involves linking back to the person table, which contains their date of birth (split across three columns: year, month, and day of birth). PatientProfiles provides a simple function for this: addAge() adds a new column to the table containing each patient’s age relative to the specified index date.

cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addAge(indexDate = "condition_start_date")

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 7
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ person_id                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 5, 4, 5, 3, 5, 5, 1, 2, 1, 5, 2, 5, 4, 3, 4,…
## $ condition_start_date      <date> 2010-03-04, 2009-04-11, 2006-12-02, 2012-06…
## $ condition_end_date        <date> 2010-12-09, 2011-11-29, 2007-07-24, 2014-03…
## $ condition_type_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <dbl> 37, 10, 84, 37, 19, 88, 65, 56, 71, 30, 64, …
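
To make the calculation concrete, here is a minimal base R sketch of the idea: build a date of birth from the three person columns, join it to the event table, and count completed years up to the index date. This is an illustration under our own assumptions rather than how addAge() is implemented; `age_at()` is a hypothetical helper, and the values are those shown above for person 1.

```r
# Hypothetical helper: completed years between date of birth and index date
age_at <- function(birth_date, index_date) {
  b <- as.POSIXlt(birth_date)
  i <- as.POSIXlt(index_date)
  years <- i$year - b$year
  # Subtract a year if the birthday has not yet been reached in the index year
  before_birthday <- (i$mon < b$mon) | (i$mon == b$mon & i$mday < b$mday)
  years - as.integer(before_birthday)
}

# Person 1 and their first condition record, as shown in the output above
person <- data.frame(
  person_id = 1,
  year_of_birth = 1972,
  month_of_birth = 9,
  day_of_birth = 5
)
condition <- data.frame(
  person_id = 1,
  condition_start_date = as.Date("2010-03-04")
)

# Combine the three columns into a single date of birth
person$date_of_birth <- as.Date(
  ISOdate(person$year_of_birth, person$month_of_birth, person$day_of_birth)
)

# Link back to the person table and compute age at the index date
x <- merge(condition, person[, c("person_id", "date_of_birth")], by = "person_id")
x$age <- age_at(x$date_of_birth, x$condition_start_date)
x$age
# 37
```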

As well as calculating age, we can also create age groups at the same time. Here we create three age groups: those aged 0 to 17, those 18 to 65, and those 66 or older.

cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addAge(
    indexDate = "condition_start_date",
    ageGroup = list(
        "0 to 17" = c(0, 17),
        "18 to 65" = c(18, 65),
        ">= 66" = c(66, Inf)))

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 8
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ person_id                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 5, 4, 5, 3, 5, 5, 1, 2, 1, 5, 2, 5, 4, 3, 4,…
## $ condition_start_date      <date> 2010-03-04, 2009-04-11, 2006-12-02, 2012-06…
## $ condition_end_date        <date> 2010-12-09, 2011-11-29, 2007-07-24, 2014-03…
## $ condition_type_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <dbl> 37, 10, 84, 37, 19, 88, 65, 56, 71, 30, 64, …
## $ age_group                 <chr> "18 to 65", "0 to 17", ">= 66", "18 to 65", …
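
The grouping logic itself is simple to sketch in base R. The helper below, `assign_age_group()`, is hypothetical and only illustrates how a named list of lower and upper bounds (both treated as inclusive) can be turned into labels. It is not the package’s implementation.

```r
age_groups <- list(
  "0 to 17" = c(0, 17),
  "18 to 65" = c(18, 65),
  ">= 66" = c(66, Inf)
)

# Hypothetical helper: label each age with the name of the range it falls in
assign_age_group <- function(age, groups) {
  out <- rep(NA_character_, length(age))
  for (nm in names(groups)) {
    lower <- groups[[nm]][1]
    upper <- groups[[nm]][2]
    # Both bounds are treated as inclusive
    out[!is.na(age) & age >= lower & age <= upper] <- nm
  }
  out
}

# The first three ages from the output above
labels <- assign_age_group(c(37, 10, 84), age_groups)
labels
# "18 to 65" "0 to 17"  ">= 66"
```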

By default, when adding age the new column is called “age” and is calculated using all available information on date of birth contained in the person table. We can, however, alter these defaults. Below we name the new column “age_from_year_of_birth” and impose a month of birth of January and a day of birth of the 1st for all individuals, so that only the year of birth is used.

cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addAge(indexDate = "condition_start_date", 
         ageName = "age_from_year_of_birth", 
         ageDefaultMonth = 1,
         ageDefaultDay = 1,
         ageImposeMonth = TRUE, 
         ageImposeDay = TRUE)

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 9
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ person_id                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 5, 4, 5, 3, 5, 5, 1, 2, 1, 5, 2, 5, 4, 3, 4,…
## $ condition_start_date      <date> 2010-03-04, 2009-04-11, 2006-12-02, 2012-06…
## $ condition_end_date        <date> 2010-12-09, 2011-11-29, 2007-07-24, 2014-03…
## $ condition_type_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <dbl> 37, 10, 84, 37, 19, 88, 65, 56, 71, 30, 64, …
## $ age_group                 <chr> "18 to 65", "0 to 17", ">= 66", "18 to 65", …
## $ age_from_year_of_birth    <dbl> 38, 10, 84, 37, 20, 89, 65, 56, 71, 31, 64, …
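
The effect of imposing the month and day can be reproduced by hand for person 1, who was born on 1972-09-05 and has a condition start date of 2010-03-04. The base R sketch below uses a hypothetical `completed_years()` helper and is only meant to show why the two columns differ by one year for this record.

```r
# Hypothetical helper: completed years between two dates
completed_years <- function(birth_date, index_date) {
  b <- as.POSIXlt(birth_date)
  i <- as.POSIXlt(index_date)
  before_birthday <- (i$mon < b$mon) | (i$mon == b$mon & i$mday < b$mday)
  (i$year - b$year) - as.integer(before_birthday)
}

index_date <- as.Date("2010-03-04")

# Using the full date of birth: the September birthday has not yet occurred
age_full <- completed_years(as.Date("1972-09-05"), index_date)

# Imposing 1 January: the birthday is treated as already passed
age_imposed <- completed_years(as.Date("1972-01-01"), index_date)

c(age = age_full, age_from_year_of_birth = age_imposed)
# 37 and 38, matching the first row of the output above
```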

As well as age at diagnosis, we might also want to identify patients’ sex. PatientProfiles provides the addSex() function that will add this for us. Because sex is treated as time-invariant, we do not have to specify an index date.

cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addSex()

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 10
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ person_id                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 5, 4, 5, 3, 5, 5, 1, 2, 1, 5, 2, 5, 4, 3, 4,…
## $ condition_start_date      <date> 2010-03-04, 2009-04-11, 2006-12-02, 2012-06…
## $ condition_end_date        <date> 2010-12-09, 2011-11-29, 2007-07-24, 2014-03…
## $ condition_type_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <dbl> 37, 10, 84, 37, 19, 88, 65, 56, 71, 30, 64, …
## $ age_group                 <chr> "18 to 65", "0 to 17", ">= 66", "18 to 65", …
## $ age_from_year_of_birth    <dbl> 38, 10, 84, 37, 20, 89, 65, 56, 71, 31, 64, …
## $ sex                       <chr> "Male", "Male", "Male", "Female", "Male", "F…
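
Conceptually, this is a lookup from gender concept ID to a label. In the OMOP vocabulary, concept 8507 is male and 8532 is female, which is consistent with the output above. The base R sketch below is an illustration only. It leaves any other concept ID as NA, which is our assumption and may differ from the label addSex() uses.

```r
# Gender concept IDs for a few hypothetical people (0 means no matching concept)
gender_concept_id <- c(8507, 8507, 8507, 8532, 0)

# Lookup from OMOP gender concept ID to a label
sex_lookup <- c("8507" = "Male", "8532" = "Female")

# Unmatched IDs become NA here (an assumption made for this sketch)
sex <- unname(sex_lookup[as.character(gender_concept_id)])
sex
# "Male" "Male" "Male" "Female" NA
```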

Similarly we could also identify whether an individual was in observation at the time of their diagnosis (i.e. had an observation period that overlaps with their diagnosis date), as well as identifying how much prior observation time they had on this date and how much they have following it.

cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addInObservation(indexDate = "condition_start_date") %>% 
  addPriorObservation(indexDate = "condition_start_date") %>% 
  addFutureObservation(indexDate = "condition_start_date")

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 13
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ condition_occurrence_id   <int> 1, 2, 5, 6, 8, 12, 13, 14, 17, 18, 23, 25, 2…
## $ person_id                 <int> 1, 2, 5, 6, 8, 12, 13, 14, 17, 18, 23, 25, 2…
## $ condition_concept_id      <int> 5, 4, 5, 5, 2, 5, 4, 3, 3, 3, 5, 5, 1, 5, 4,…
## $ condition_start_date      <date> 2010-03-04, 2009-04-11, 2015-01-10, 2015-02…
## $ condition_end_date        <date> 2010-12-09, 2011-11-29, 2015-02-10, 2015-04…
## $ condition_type_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <dbl> 37, 10, 19, 88, 56, 70, 56, 49, 10, 21, 12, …
## $ age_group                 <chr> "18 to 65", "0 to 17", "18 to 65", ">= 66", …
## $ age_from_year_of_birth    <dbl> 38, 10, 20, 89, 56, 71, 56, 49, 10, 21, 13, …
## $ sex                       <chr> "Male", "Male", "Male", "Female", "Male", "F…
## $ in_observation            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ prior_observation         <dbl> 1204, 1116, 2404, 3082, 2644, 2623, 5158, 16…
## $ future_observation        <dbl> 32101, 46760, 11870, 19230, 30583, 13946, 90…

For these latter functions, which work with information from the observation period table, it is important to note that the results are based on the observation period within which the index date falls. Moreover, if a patient is not under observation on the specified date, addPriorObservation() and addFutureObservation() will return NA.
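
The logic described above can be sketched in base R as follows. This is an illustration under simplifying assumptions (a single observation period per person, hypothetical column names for the toy event table) and not the package’s implementation. The first event uses the dates shown above for person 1. The second event is a made-up date before the observation period starts.

```r
# Observation period for person 1, as shown in the introduction
observation_period <- data.frame(
  person_id = 1,
  observation_period_start_date = as.Date("2006-11-16"),
  observation_period_end_date = as.Date("2098-01-22")
)

# Two index dates: one inside the period, one (hypothetical) before it starts
events <- data.frame(
  person_id = c(1, 1),
  index_date = as.Date(c("2010-03-04", "2005-01-01"))
)

x <- merge(events, observation_period, by = "person_id", all.x = TRUE)

# In observation if the index date falls within the period (inclusive)
inside <- !is.na(x$observation_period_start_date) &
  x$index_date >= x$observation_period_start_date &
  x$index_date <= x$observation_period_end_date

x$in_observation <- as.integer(inside)
# Days from period start to index date, NA when not in observation
x$prior_observation <- ifelse(
  inside, as.numeric(x$index_date - x$observation_period_start_date), NA
)
# Days from index date to period end, NA when not in observation
x$future_observation <- ifelse(
  inside, as.numeric(x$observation_period_end_date - x$index_date), NA
)

x[, c("index_date", "in_observation", "prior_observation", "future_observation")]
```

For the 2010-03-04 record this gives 1204 days of prior observation and 32101 days of future observation, the same values as in the first row of the output above, while the 2005-01-01 record is flagged as not in observation and has NA for both.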

1.3 Adding characteristics to cohort tables

The above functions can be used on both standard OMOP CDM tables and cohort tables. Note that, as the default index date in these functions is “cohort_start_date”, we can now omit the indexDate argument.

cdm$cohort1 %>%
  glimpse()
## Rows: ??
## Columns: 4
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ cohort_definition_id <dbl> 1, 1, 1, 2
## $ subject_id           <dbl> 1, 1, 2, 3
## $ cohort_start_date    <date> 2020-01-01, 2020-06-01, 2020-01-02, 2020-01-01
## $ cohort_end_date      <date> 2020-04-01, 2020-08-01, 2020-02-02, 2020-03-01

cdm$cohort1 <- cdm$cohort1 %>%
  addAge(ageGroup = list(
        "0 to 17" = c(0, 17),
        "18 to 65" = c(18, 65),
        ">= 66" = c(66, Inf))) %>% 
  addSex() %>% 
  addInObservation() %>%
  addPriorObservation() %>%
  addFutureObservation()

cdm$cohort1 %>%
  glimpse()
## Rows: ??
## Columns: 10
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ cohort_definition_id <dbl> 1, 1, 2, 1
## $ subject_id           <dbl> 1, 2, 3, 1
## $ cohort_start_date    <date> 2020-06-01, 2020-01-02, 2020-01-01, 2020-01-01
## $ cohort_end_date      <date> 2020-08-01, 2020-02-02, 2020-03-01, 2020-04-01
## $ age                  <dbl> 47, 20, 97, 47
## $ age_group            <chr> "18 to 65", "18 to 65", ">= 66", "18 to 65"
## $ sex                  <chr> "Male", "Male", "Male", "Male"
## $ in_observation       <dbl> 1, 1, 1, 1
## $ prior_observation    <dbl> 4946, 5034, 4466, 4794
## $ future_observation   <dbl> 28359, 42842, 33994, 28511

1.4 Getting multiple characteristics at once

The above functions each fetch the related information one by one. When we are interested in adding multiple characteristics, we can add them all at the same time using addDemographics(). This is more efficient as it requires fewer joins between our table of interest and the person and observation period tables.

cdm$cohort2 %>%
  glimpse()
## Rows: ??
## Columns: 4
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ cohort_definition_id <dbl> 1, 1, 2, 3, 1
## $ subject_id           <dbl> 1, 3, 1, 2, 1
## $ cohort_start_date    <date> 2019-12-30, 2020-01-01, 2020-05-25, 2020-01-01, 2…
## $ cohort_end_date      <date> 2019-12-30, 2020-01-01, 2020-05-25, 2020-01-01, 2…

tictoc::tic()
cdm$cohort2 %>%
  addAge(ageGroup = list(
        "0 to 17" = c(0, 17),
        "18 to 65" = c(18, 65),
        ">= 66" = c(66, Inf))) %>% 
  addSex() %>% 
  addInObservation() %>%
  addPriorObservation() %>%
  addFutureObservation()
## # Source:   table<og_023_1711380333> [5 x 10]
## # Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
##   cohort_definition_id subject_id cohort_start_date cohort_end_date   age
##                  <dbl>      <dbl> <date>            <date>          <dbl>
## 1                    1          1 2020-05-25        2020-05-25         47
## 2                    3          2 2020-01-01        2020-01-01         20
## 3                    1          3 2020-01-01        2020-01-01         97
## 4                    2          1 2020-05-25        2020-05-25         47
## 5                    1          1 2019-12-30        2019-12-30         47
## # ℹ 5 more variables: age_group <chr>, sex <chr>, in_observation <dbl>,
## #   prior_observation <dbl>, future_observation <dbl>
tictoc::toc()
## 2.11 sec elapsed

tictoc::tic()
cdm$cohort2 %>%
  addDemographics(
    age = TRUE,
    ageName = "age",
    ageGroup = list(
        "0 to 17" = c(0, 17),
        "18 to 65" = c(18, 65),
        ">= 66" = c(66, Inf)),
    sex = TRUE,
    sexName = "sex",
    priorObservation = TRUE,
    priorObservationName = "prior_observation",
    futureObservation = FALSE
  ) %>%
  glimpse()
## Rows: ??
## Columns: 8
## Database: DuckDB v0.10.0 [martics@Windows 10 x64:R 4.2.1/:memory:]
## $ cohort_definition_id <dbl> 1, 3, 1, 2, 1
## $ subject_id           <dbl> 1, 2, 3, 1, 1
## $ cohort_start_date    <date> 2020-05-25, 2020-01-01, 2020-01-01, 2020-05-25, 2…
## $ cohort_end_date      <date> 2020-05-25, 2020-01-01, 2020-01-01, 2020-05-25, 2…
## $ age                  <dbl> 47, 20, 97, 47, 47
## $ sex                  <chr> "Male", "Male", "Male", "Male", "Male"
## $ prior_observation    <dbl> 4939, 5033, 4466, 4939, 4792
## $ age_group            <chr> "18 to 65", "18 to 65", ">= 66", "18 to 65", "18…
tictoc::toc()
## 0.97 sec elapsed

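In our small mock dataset the difference is already visible (0.97 seconds compared with 2.11 seconds), although the comparison is only indicative because the addDemographics() call above does not add the in_observation and future_observation columns. The difference will become much more noticeable when working with real data, which will typically be far larger.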
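
The reason fewer joins are needed can be illustrated with a small base R sketch: the person and observation period information is joined to the cohort once, and several characteristics are then derived from the joined columns. This only illustrates the general idea on toy data (dates taken from the output above for persons 1 and 2) and is not how addDemographics() is written.

```r
# Toy versions of the three tables, with one observation period per person
person <- data.frame(
  person_id = c(1, 2),
  gender_concept_id = c(8507, 8532)
)
observation_period <- data.frame(
  person_id = c(1, 2),
  observation_period_start_date = as.Date(c("2006-11-16", "2006-03-22"))
)
cohort <- data.frame(
  subject_id = c(1, 2),
  cohort_start_date = as.Date(c("2020-01-01", "2020-01-02"))
)

# Combine the lookup information, then join it to the cohort a single time
lookup <- merge(person, observation_period, by = "person_id")
x <- merge(cohort, lookup, by.x = "subject_id", by.y = "person_id")

# Derive several characteristics from the one joined table
x$sex <- ifelse(
  x$gender_concept_id == 8507, "Male",
  ifelse(x$gender_concept_id == 8532, "Female", NA)
)
x$prior_observation <- as.numeric(
  x$cohort_start_date - x$observation_period_start_date
)

x[, c("subject_id", "cohort_start_date", "sex", "prior_observation")]
```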