Working with the eventreport package

Welcome to the eventreport package. eventreport includes a set of functions to diagnose, visualize, and aggregate event report level data to the event level. The package is intended for working with event report level data, meaning that the data contains multiple report observations for each event, for instance, multiple news reports covering the same electoral violence incident.

This vignette explains what event report level data is and how to work with such data using the functions contained in this package.

Before starting, we load the tidyverse package for writing tidy code, the tinytable package for drawing easy-to-read tables, and a small subset of the maverick_event_report dataset to illustrate the package functions. Users who are interested in working with the MAVERICK dataset contained in this package should consult the MAVERICK documentation.

What is event report level data?

Event report level data refers to data where each observation is an event that takes place on a single day and in a particular location as reported in a single source. The report level means that multiple reports about the same event constitute separate observations. For example, if both BBC and Reuters report on a violent post-election demonstration, the demonstration is the event, whereas the BBC and Reuters reports constitute the event reports. The table below provides an example of event report level data from the MAVERICK dataset, and lists six unique reports about a single electoral violence event.

event_id	city	location	actor1	deaths_best	source
CIV-0003	Abidjan	Abobo	Unknown security force (Côte d'Ivoire)	2	AFP (2011-01-11) Côte d'Ivoire: au moins deux civils tués après des tirs à Abidjan
CIV-0003	Abidjan	Abobo	Forces de Défense et de Sécurité (FDS)	4	AFP (2011-01-11) Côte d'Ivoire: quatre morts dans des violences à Abidjan
CIV-0003	Abidjan	Abobo	Forces de Défense et de Sécurité (FDS)	5	LEJD (2011-01-11) Au moins cinq morts à Abidjan
CIV-0003	Abidjan	Abobo	Unknown security force (Côte d'Ivoire)	2	AP (2011-01-11) Affrontements dans le quartier d'Abobo à Abidjan
CIV-0003	Abidjan	Abobo	Unknown security force (Côte d'Ivoire)	3	Le Point (2011-01-11) Nouveaux affrontements à Abidjan
CIV-0003	Abidjan	Abobo	Unknown security force (Côte d'Ivoire)	4	AFP (2011-01-11) Four killed in clashes as Ivory Coast tensions worsen

Before using the eventreport package, make sure that your data is recorded at the event report level and not the event level. In addition, you need a column that allows you to group event reports concerning the same event together. For example, in the MAVERICK dataset, the event_id column identifies which reports are about the same event.
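
As a quick sanity check before aggregating, one can count the reports per event identifier. The sketch below uses base R and a hypothetical toy data frame (the values and the name `reports` are illustrative, not taken from MAVERICK). Note that the check is only suggestive: a report level dataset that happens to contain only single-source events would not pass it.

```r
# Toy event report data (hypothetical values, not taken from MAVERICK)
reports <- data.frame(
  event_id = c("E1", "E1", "E1", "E2", "E3", "E3"),
  source   = c("BBC", "Reuters", "AFP", "BBC", "AP", "AFP"),
  stringsAsFactors = FALSE
)

# Count the number of reports for each event using the grouping column
reports_per_event <- table(reports$event_id)

# Event report level data typically has more rows than unique events
is_report_level <- nrow(reports) > length(reports_per_event)

print(reports_per_event)
print(is_report_level)
```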

Why work with event report data?

Coding event reports rather than events has several benefits (see e.g. Cook and Weidmann 2019; Weidmann and Geelmuyden Rød 2015). First, this coding procedure makes the information extraction step more transparent and helps preserve the raw data contained in the source material. Second, as the aggregation of multiple event reports into single events implies making decisions about report credibility and contradictory information, this coding procedure makes the aggregation process more transparent, flexible, and reproducible. Third, by automating the aggregation process, the coding procedure allows users to replicate their analyses under different aggregation models, and to override default aggregation rules with procedures of their own. Fourth, by preserving the raw event reports, this data structure also allows users to investigate reporting biases and different approaches to improving data quality.

Given the particularities of event report data, we recommend that all package users also consult the associated methods paper, which provides a detailed overview of the strengths and limitations of our suggested approach, as well as the underlying reasoning behind the different aggregation functions (van Baalen and Höglund 2025). For other in-depth analyses of the benefits and limitations of working with event report level data and automatic aggregation procedures, we recommend Cook and Weidmann (2019) and Weidmann and Geelmuyden Rød (2015).

Why the `eventreport` package?

Standard statistical software, such as R, already contains some functionality that can be used to aggregate event report level data to the event level. Cook and Weidmann (2019), for instance, use base R functions such as max, min, and mean to aggregate the Mass Mobilization in Autocracies Database, an event report level dataset on protest events. However, as we have detailed elsewhere (van Baalen and Höglund 2025), the aggregation of event reports often demands additional functionality, such as tie-break rules or the use of information contained in meta variables.

The eventreport package adds several functionalities not contained in existing software. Among those benefits, the package:

Handles different variable classes: eventreport handles a range of different variables, including character, date, numeric, and binary numeric variables. This feature makes the package ideal for working with event report datasets that include different variable classes.
Enables tie-breaking rules: many vectors are multi-modal, meaning that simple functions for identifying the most frequent value will yield multiple results. eventreport therefore enables users to specify up to two tie-breaking rules that help adjudicate between multiple modes.
Integrates precision scores: sometimes researchers are interested in recording the most precise value, such as more precise location estimates or more precise actor names. eventreport allows users to specify precision score variables that help prioritize what values to select when the values cannot be ranked.
Provides simple functions: aggregating event report level data is a complex coding project. eventreport makes this procedure more straightforward by providing simple functions that carry out complex tasks. All functions were developed in the context of a concrete event report level data collection effort, and are therefore both needs-based and well-tested.
Allows easy customization: the combination of simple functions and several convenience functions allows users to stipulate a range of complex aggregation rule sets with minimal coding. Moreover, because eventreport is tidyverse compatible, users can integrate the package functions in a tidy workflow.

Installation

Before we begin, let’s install the eventreport package. Install from CRAN:

#install.packages("eventreport")

Once we have installed the package, we can load it:

library(eventreport)

Aggregation diagnostics

Event report level data can come in many forms, where some datasets only include events recorded by at least two sources, whereas other datasets include both single- and multi-source events. Moreover, some variables may harbor more divergences in their values for the same event than other variables. These differences mean that not all datasets and variables are equally sensitive to aggregation choices (van Baalen and Höglund 2025).

Calculate unique values using `dscore`

eventreport includes several functions that allow users to diagnose their event report level data. dscore calculates the number of unique values for each event, minus 1 so that it only captures divergences. This divergence score allows users to assess how sensitive particular events and variables are to the way the event report level data is aggregated:

dscore(
  df,
  group_var = "event_id",
  variables = c("country", "actor1", "deaths_best")
  ) %>% 
  head(10)
## # A tibble: 10 × 4
##    event_id dscore_country dscore_actor1 dscore_deaths_best
##    <chr>             <dbl>         <dbl>              <dbl>
##  1 CIV-0001              0             1                  4
##  2 CIV-0002              0             0                  0
##  3 CIV-0003              0             3                  4
##  4 CIV-0004              0             2                  3
##  5 CIV-0008              0             0                  0
##  6 CIV-0009              0             0                  0
##  7 CIV-0010              0             0                  0
##  8 CIV-0011              0             0                  0
##  9 CIV-0012              0             0                  0
## 10 CIV-0013              0             1                  1

From the above output, we can see that event CIV-0003 stands out as particularly sensitive to aggregation choices, as the variable actor1 can take a total of 3 additional values beyond the one chosen by a particular aggregation choice. The variable deaths_best can take 4 additional values. In contrast, aggregation choices will not matter for the events CIV-0002, CIV-0008, CIV-0009, CIV-0010, CIV-0011, and CIV-0012, as there are no additional values that the chosen variables can take.
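
To make the arithmetic behind the divergence score concrete, the following base R sketch computes the number of unique values per event minus one. It illustrates the definition given above on hypothetical toy data; it is not the package's implementation, and the names `reports` and `toy_dscore` are assumptions made for this example.

```r
# Toy event report data (hypothetical values)
reports <- data.frame(
  event_id    = c("E1", "E1", "E1", "E2", "E2"),
  deaths_best = c(2, 4, 4, 1, 1)
)

# Divergence score: number of unique values per event, minus one
toy_dscore <- vapply(
  split(reports$deaths_best, reports$event_id),
  function(x) length(unique(x)) - 1,
  numeric(1)
)

print(toy_dscore)
```

Here event E1 has two distinct death estimates and therefore one divergent value, whereas the reports for E2 agree and yield a score of zero.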

Calculate the average number of unique values using `mean_dscore`

We can also calculate mean divergence scores for each variable to get a better sense of what variables are most sensitive to aggregation choices. The mean divergence score is calculated as the average number of divergent values per event and variable, and returns a dataframe containing the variable names and the mean divergence scores:

mean_dscore(
  df,
  group_var = "event_id",
  variables = c("country", "actor1", "deaths_best", "injuries_best")
  )
## # A tibble: 4 × 2
##   variable      dscore
##   <chr>          <dbl>
## 1 country        0    
## 2 actor1         0.571
## 3 deaths_best    0.571
## 4 injuries_best  0.514

From the table above, we learn that some variables are more sensitive to aggregation choices than others. For example, while the country variable is not at all sensitive to aggregation choices, the actor1 and deaths_best variables are comparatively more sensitive to how we decide to aggregate the data.

The raw divergence score can sometimes be misleading, as variables differ in the number of possible values. Hence, users can also calculate normalized mean divergence scores for each variable with the normalize = TRUE argument, which returns the mean number of divergences divided by the total number of unique values in each variable:

mean_dscore(
  df,
  group_var = "event_id",
  variables = c("country", "actor1", "deaths_best", "injuries_best"),
  normalize = TRUE
  )
## # A tibble: 4 × 2
##   variable      dscore
##   <chr>          <dbl>
## 1 country       0     
## 2 actor1        0.0336
## 3 deaths_best   0.0440
## 4 injuries_best 0.0321
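
The relationship between the raw and normalized scores can be sketched in base R as follows. This is an illustration of the definitions described above under the assumption that normalization divides by the number of unique values observed in the whole variable; the toy data and object names are hypothetical, and the package may handle details such as missing values differently.

```r
# Toy event report data (hypothetical values)
reports <- data.frame(
  event_id = c("E1", "E1", "E1", "E2", "E2", "E3"),
  actor1   = c("Police", "Army", "Police", "Militia", "Militia", "Police"),
  stringsAsFactors = FALSE
)

# Divergence per event: unique values minus one
per_event <- vapply(
  split(reports$actor1, reports$event_id),
  function(x) length(unique(x)) - 1,
  numeric(1)
)

# Raw mean divergence: average over events
raw_mean <- mean(per_event)

# Normalized mean divergence: raw mean divided by the number of
# unique values the variable takes in the whole dataset
n_unique <- length(unique(reports$actor1))
normalized_mean <- raw_mean / n_unique

print(c(raw = raw_mean, normalized = normalized_mean))
```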

Finally, users can take a visual look at the mean and normalized divergence scores by using the plot = TRUE argument to return a ggplot object:

mean_dscore(
  df,
  group_var = "event_id",
  variables = c("country", "actor1", "deaths_best"),
  normalize = TRUE,
  plot = TRUE
  )

Calculate multiple aggregation diagnostics using `aggregation_diagnostics`

eventreport provides convenience functions for calculating six different aggregation diagnostics. The six diagnostics help evaluate how much disagreement exists between different event reports describing the same event.

Mean divergence (mean_dscore) shows how often values differ across reports by counting how many additional unique values are reported per event and variable.
Normalized divergence (mean_dscore(normalize = TRUE)) puts this into perspective by dividing the divergence by the total number of possible unique values, making it easier to compare across variables.
Mean standard deviation (mean_sd) measures how much reported numbers (like deaths or injuries) vary around their average for each event.
Mean range (mean_range) captures the distance between the lowest and highest reported values, highlighting extreme differences.
Share of events with disagreement (event_level_disagreement), the most easily interpretable metric, tells us how often at least two reports disagree on a particular variable value.
Modal confidence (modal_confidence) shows how dominant the most commonly reported value is: high scores mean most sources agree on the modal value, while lower scores suggest disagreement.
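
For intuition, three of the diagnostics listed above can be approximated in base R. The sketch below rests on assumptions about the exact definitions (in particular, that modal confidence is the share of reports carrying the modal value, averaged over events); the package functions may compute these quantities differently, and the toy data is hypothetical.

```r
# Toy event report data (hypothetical values)
reports <- data.frame(
  event_id    = c("E1", "E1", "E1", "E1", "E2", "E2"),
  deaths_best = c(2, 4, 4, 4, 1, 1)
)
by_event <- split(reports$deaths_best, reports$event_id)

# Share of events in which at least two reports disagree
disagreement_share <- mean(
  vapply(by_event, function(x) length(unique(x)) > 1, logical(1))
)

# Mean range: average distance between lowest and highest reported value
mean_range <- mean(
  vapply(by_event, function(x) max(x) - min(x), numeric(1))
)

# Modal share (assumed definition): proportion of reports that carry the
# most frequent value, averaged over events
modal_share <- mean(
  vapply(by_event, function(x) max(table(x)) / length(x), numeric(1))
)

print(c(
  disagreement = disagreement_share,
  range = mean_range,
  modal = modal_share
))
```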

To easily compare different aggregation diagnostics, users can run all diagnostics for a set of variables with one command:

diagnostics <- aggregation_diagnostics(
  df,
  group_var = "event_id",
  variables = c("city", "deaths_best", "actor1")
)

tt(diagnostics)

Variable	Mean divergence	Normalized divergence	Mean standard deviation	Mean range	Share of events with disagreement (%)	Modal confidence (%)
city	0.11	0.01			0.11	0.97
deaths_best	0.57	0.04	1.41	1.66	0.29	0.88
actor1	0.57	0.03			0.37	0.87

Use the aggregation functions

The eventreport package consists of a number of different functions that help the user aggregate event report level data into event level data. All functions and their use are outlined in the package documentation.

calc_mode

Find the mode value of a character vector:

calc_mode(c("Sweden", "Sweden", "Denmark", "Sweden"))
## [1] "Sweden"

Given that some vectors may have multiple mode values, the calc_mode function allows the user to specify up to two tie-breaking rules that help arbitrate multi-modal results. These tie-breaking rules must be numerical vectors where higher values give priority if it comes down to a tie-break:

calc_mode(
  c("Sweden", "Sweden", "Denmark", "Denmark"),
  tie_break = c(1, 1, 1, 1),
  second_tie_break = c(1, 4, 1, 1)
)
## [1] "Sweden"

In cases where no mode value can be found after two tie-breaks, the calc_mode function returns the value "Indeterminate", thereby forcing users to explicitly make a decision on how to handle multi-modal vectors.

calc_mode(
  c("Sweden", "Sweden", "Denmark", "Denmark")
)
## [1] "Indeterminate"
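
The tie-breaking logic described above can be sketched in a few lines of base R. The function `toy_mode` below is a hypothetical illustration written for this vignette, not the source of calc_mode: it assumes that a tie is resolved by keeping the tied value(s) reported alongside the highest tie-break score, and it does not attempt to reproduce how the package treats NA values.

```r
# Hypothetical sketch of a mode function with up to two tie-breaks
toy_mode <- function(x, tie_break = NULL, second_tie_break = NULL) {
  counts <- table(x)
  # All values that share the highest frequency
  candidates <- names(counts)[as.vector(counts) == max(counts)]

  # Apply each tie-break in turn while more than one candidate remains
  for (scores in list(tie_break, second_tie_break)) {
    if (length(candidates) == 1 || is.null(scores)) next
    in_play <- x %in% candidates
    best <- max(scores[in_play])
    candidates <- unique(x[in_play & scores == best])
  }

  # Force an explicit decision when the tie cannot be resolved
  if (length(candidates) == 1) candidates else "Indeterminate"
}

countries <- c("Sweden", "Sweden", "Denmark", "Denmark")
resolved <- toy_mode(
  countries,
  tie_break = c(1, 1, 1, 1),
  second_tie_break = c(1, 4, 1, 1)
)
unresolved <- toy_mode(countries)

print(resolved)
print(unresolved)
```

In this sketch the first tie-break leaves both countries in play because all scores are equal, and the second tie-break selects the value from the report with the highest score.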

The calc_mode function treats both NA values and empty strings as real values, and hence returns NA or empty strings whenever those are the most common values:

calc_mode(
  c("Sweden", "", "", "Denmark")
)
## [1] ""

calc_mode_na_ignore

Find the mode value of a character vector while ignoring NA values and empty strings:

calc_mode_na_ignore(
  c("Sweden", "", "", "Denmark"),
  tie_break = c(1, 1, 1, 1),
  second_tie_break = c(4, 1, 1, 1)
)
## [1] "Sweden"

calc_mode_binary

Find the mode value from a binary numeric vector:

calc_mode_binary(
  c(0, 1, 1, 1, 0, 0)
)
## [1] 1

calc_mode_numeric

Find the mode value in a numeric vector:

calc_mode_numeric(
  c(1, 1, 1, 2, 3, 5)
)
## [1] 1

calc_mode_date

Find the mode date from a character vector written in the format YYYY-MM-DD:

calc_mode_date(
  c("2024-01-01", "2024-01-01", "2024-01-02")
)
## [1] "2024-01-01"

calc_max_precision

Find the most specific value in a character vector by using an auxiliary precision score.

calc_max_precision(
  x = c("Tranas", "Smaland", "Sweden"),
  precision_var = c(3, 2, 1)
)
## [1] "Tranas"

calc_min_precision

Find the least specific value in a character vector by using an auxiliary precision score.

calc_min_precision(
  x = c("Tranas", "Smaland", "Sweden"),
  precision_var = c(3, 2, 1)
)
## [1] "Sweden"
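
Conceptually, selecting by precision amounts to indexing the value vector by the position of the highest or lowest precision score. The base R sketch below illustrates this idea only; it assumes that the first value wins when several reports share the same score, which may differ from how the package resolves such ties.

```r
# Values and their auxiliary precision scores (higher = more precise)
places <- c("Tranas", "Smaland", "Sweden")
precision <- c(3, 2, 1)

# which.max / which.min return the position of the first extreme score
most_precise <- places[which.max(precision)]
least_precise <- places[which.min(precision)]

print(most_precise)
print(least_precise)
```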

combine_strings

Users can also decide to concatenate strings instead of selecting a specific value by using the aggregate_strings function:

aggregate_strings(
  c("Sweden", "Sweden", "Denmark", "", "Finland")
)
## [1] "Sweden; Denmark; Finland"
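
The concatenation shown above can be approximated in base R by dropping empty and missing entries, removing duplicates, and pasting the remainder together. This is a sketch of the behavior visible in the output, assuming "; " as the separator; it is not the package's own code.

```r
countries <- c("Sweden", "Sweden", "Denmark", "", "Finland")

# Drop NA values and empty strings, then keep each value once
kept <- unique(countries[!is.na(countries) & countries != ""])

# Join the remaining values with a semicolon separator
combined <- paste(kept, collapse = "; ")

print(combined)
```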

Aggregate multiple variables at once with aggregateData

The main purpose of the eventreport package is to allow users to aggregate entire datasets from the event report level to the event level. This task is best achieved with the aggregateData function, which enables users to specify multiple aggregation rules at once and store the output as a dataframe. To illustrate its use, we first load the MAVERICK event report data stored in the eventreport package (using only 100 observations for faster computing):

df <- maverick_event_report %>% dplyr::arrange(event_id) %>% utils::head(n = 100)

A basic aggregateData call must include the data argument, the group_var argument, and specify at least one aggregation rule for one variable. Because aggregateData builds on the dplyr package, we can call the function using the pipe operator:

df %>% 
  aggregateData(
    group_var = "event_id",
    find_mode = "city"
  ) %>% 
  utils::head(10)
## # A tibble: 10 × 4
##    event_id city           number_of_sources unit_of_analysis
##    <chr>    <chr>                      <int> <chr>           
##  1 CIV-0001 "Duékoué"                      5 Event           
##  2 CIV-0002 ""                             2 Event           
##  3 CIV-0003 "Abidjan"                     12 Event           
##  4 CIV-0004 "Abidjan"                      6 Event           
##  5 CIV-0008 "Man"                          1 Event           
##  6 CIV-0009 "Vavoua"                       2 Event           
##  7 CIV-0010 "Abidjan"                      1 Event           
##  8 CIV-0011 "Yamoussoukro"                 1 Event           
##  9 CIV-0012 "Gagnoa"                       4 Event           
## 10 CIV-0013 "Daloa"                        4 Event

The aggregateData call returns a tibble containing the grouping variable (event_id) and the specified variable (city), which now holds the mode value for each group defined by group_var. In addition, aggregateData automatically returns two additional variables: the number_of_sources variable, which counts the number of reports per group; and the unit_of_analysis variable, which indicates that the data is aggregated at the event level.
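
To clarify what this aggregation step does conceptually, the base R sketch below collapses a hypothetical toy data frame to one row per event, taking the mode of city and counting the reports. It is a simplified illustration with no tie-breaking and is not how aggregateData is implemented.

```r
# Toy event report data (hypothetical values)
reports <- data.frame(
  event_id = c("E1", "E1", "E1", "E2"),
  city     = c("Abidjan", "Abidjan", "Bouaké", "Man"),
  stringsAsFactors = FALSE
)

# Collapse each group of reports into a single event row
events <- do.call(rbind, lapply(split(reports, reports$event_id), function(g) {
  counts <- table(g$city)
  modes <- names(counts)[as.vector(counts) == max(counts)]
  data.frame(
    event_id = g$event_id[1],
    # Without tie-breaks, multi-modal groups remain undecided
    city = if (length(modes) == 1) modes else "Indeterminate",
    number_of_sources = nrow(g),
    stringsAsFactors = FALSE
  )
}))
rownames(events) <- NULL

print(events)
```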

Most event report datasets consist of multiple variables of different classes and hence demand more complex aggregation rule sets than defined in our minimal example. To include additional variables, users need only provide a vector of variable names for each rule:

df %>% 
  aggregateData(
    group_var = "event_id",
    find_mode = c("city", "location", "actor1")
  ) %>% 
  utils::head(10)
## # A tibble: 10 × 6
##    event_id city           location    actor1 number_of_sources unit_of_analysis
##    <chr>    <chr>          <chr>       <chr>              <int> <chr>           
##  1 CIV-0001 "Duékoué"      ""          Membe…                 5 Event           
##  2 CIV-0002 ""             "Indetermi… Unkno…                 2 Event           
##  3 CIV-0003 "Abidjan"      "Abobo"     Unkno…                12 Event           
##  4 CIV-0004 "Abidjan"      "Abobo"     Unkno…                 6 Event           
##  5 CIV-0008 "Man"          ""          Youth…                 1 Event           
##  6 CIV-0009 "Vavoua"       ""          Unkno…                 2 Event           
##  7 CIV-0010 "Abidjan"      "Marcory"   Polic…                 1 Event           
##  8 CIV-0011 "Yamoussoukro" ""          Polic…                 1 Event           
##  9 CIV-0012 "Gagnoa"       ""          Unkno…                 4 Event           
## 10 CIV-0013 "Daloa"        ""          Unkno…                 4 Event

Moreover, users can specify different rules for different sets of variables. In the example below, we aggregate the data using the mode value for the variables city and location, the mode value ignoring NA values and empty strings for the actor1 variable, and the maximum value for the deaths_best variable. In addition, we use the combine_strings argument to retain all sources used to code each event:

df %>% 
  aggregateData(
    group_var = "event_id",
    find_mode = c("city", "location"),
    find_mode_na_ignore = "actor1",
    find_max = "deaths_best",
    combine_strings = "source"
  ) %>% 
  dplyr::select(event_id:actor1, deaths_best:unit_of_analysis, source) %>% 
  dplyr::filter(event_id == "CIV-0002")
## # A tibble: 1 × 8
##   event_id city  location  actor1 deaths_best number_of_sources unit_of_analysis
##   <chr>    <chr> <chr>     <chr>        <int>             <int> <chr>           
## 1 CIV-0002 ""    Indeterm… Unkno…           1                 2 Event           
## # ℹ 1 more variable: source <chr>

So far, we have used the aggregateData function without any tie-breaking rules, meaning that efforts to find the mode value often return the value "Indeterminate". This occurs because several groups are multi-modal, meaning that there are two or more mode values. To limit the risk of indeterminate values, we can make use of the tie-breaking arguments to draw on additional information to determine which mode value to retain in our data.

In the specification below, for instance, we stipulate that in the case of multi-modal results, the function should first select the value from the report with the highest value in the source_classification variable (which ranks MAVERICK reports based on their reputation for trustworthiness), and thereafter select the value from the report with the highest value in the certain variable (which ranks MAVERICK reports based on how election-related the event was):

df %>% 
  aggregateData(
    group_var = "event_id",
    find_mode = c("city", "location"),
    find_mode_na_ignore = "actor1",
    find_max = "deaths_best",
    tie_break = "source_classification",
    second_tie_break = "certain"
  ) %>% 
  utils::head(10)
## # A tibble: 10 × 7
##    event_id city  location actor1 deaths_best number_of_sources unit_of_analysis
##    <chr>    <chr> <chr>    <chr>        <int>             <int> <chr>           
##  1 CIV-0001 "Dué… ""       Membe…          40                 5 Event           
##  2 CIV-0002 ""    ""       Unkno…           1                 2 Event           
##  3 CIV-0003 "Abi… "Abobo"  Unkno…           5                12 Event           
##  4 CIV-0004 "Abi… "Abobo"  Unkno…           7                 6 Event           
##  5 CIV-0008 "Man" ""       Youth…           0                 1 Event           
##  6 CIV-0009 "Vav… ""       Unkno…           0                 2 Event           
##  7 CIV-0010 "Abi… "Marcor… Polic…           0                 1 Event           
##  8 CIV-0011 "Yam… ""       Polic…           0                 1 Event           
##  9 CIV-0012 "Gag… ""       Unkno…           5                 4 Event           
## 10 CIV-0013 "Dal… ""       Unkno…           3                 4 Event

We can also use precision scores to rank variable values and prioritize the most or least precise values. For example, below we use the MAVERICK geographical precision scores to find the most precise city and location information:

df %>% 
  aggregateData(
    group_var = "event_id",
    find_most_precise = list(
      list(var = "city", precision_var = "geo_precision"),
      list(var = "location", precision_var = "geo_precision")
    ),
    find_mode_na_ignore = "actor1",
    find_max = "deaths_best",
    tie_break = "source_classification",
    second_tie_break = "certain"
  ) %>% 
  utils::head(10)
## # A tibble: 10 × 7
##    event_id actor1 deaths_best city  location number_of_sources unit_of_analysis
##    <chr>    <chr>        <int> <chr> <chr>                <int> <chr>           
##  1 CIV-0001 Membe…          40 "Dué… ""                       5 Event           
##  2 CIV-0002 Unkno…           1 ""    ""                       2 Event           
##  3 CIV-0003 Unkno…           5 "Abi… "Abobo"                 12 Event           
##  4 CIV-0004 Unkno…           7 "Abi… "Abobo"                  6 Event           
##  5 CIV-0008 Youth…           0 "Man" ""                       1 Event           
##  6 CIV-0009 Unkno…           0 "Vav… ""                       2 Event           
##  7 CIV-0010 Polic…           0 "Abi… "Marcor…                 1 Event           
##  8 CIV-0011 Polic…           0 "Yam… ""                       1 Event           
##  9 CIV-0012 Unkno…           5 "Gag… ""                       4 Event           
## 10 CIV-0013 Unkno…           3 "Dal… "Dioula…                 4 Event

Finally, because some users may want to compare aggregation results across different rule sets (one of the main strengths of working with event report level data), we can assign a name to our aggregation rule set using the aggregation_name argument. Doing so allows us to generate different aggregation sets and compare results across aggregations:

conservative <- df %>% 
  aggregateData(
    group_var = "event_id",
    find_mode = c("city", "location"),
    find_min = c("deaths_best", "injuries_best"),
    tie_break = "source_classification",
    second_tie_break = "certain",
    aggregation_name = "Most-conservative"
  ) %>% 
  utils::head(10)

maximalist <- df %>% 
  aggregateData(
    group_var = "event_id",
    find_mode_na_ignore = c("city", "location"),
    find_max = c("deaths_best", "injuries_best"),
    tie_break = "source_classification",
    second_tie_break = "certain",
    aggregation_name = "Most-informative"
  ) %>% 
  utils::head(10)

rbind(conservative, maximalist) %>% 
  dplyr::arrange(event_id)
## # A tibble: 20 × 8
##    event_id city           location  deaths_best injuries_best number_of_sources
##    <chr>    <chr>          <chr>           <int>         <int>             <int>
##  1 CIV-0001 "Duékoué"      ""                  8             0                 5
##  2 CIV-0001 "Duékoué"      ""                 40            91                 5
##  3 CIV-0002 ""             ""                  1             0                 2
##  4 CIV-0002 ""             "Maison …           1             0                 2
##  5 CIV-0003 "Abidjan"      "Abobo"             0             0                12
##  6 CIV-0003 "Abidjan"      "Abobo"             5             2                12
##  7 CIV-0004 "Abidjan"      "Abobo"             1             0                 6
##  8 CIV-0004 "Abidjan"      "Abobo"             7             3                 6
##  9 CIV-0008 "Man"          ""                  0             0                 1
## 10 CIV-0008 "Man"          ""                  0             0                 1
## 11 CIV-0009 "Vavoua"       ""                  0             0                 2
## 12 CIV-0009 "Vavoua"       ""                  0             0                 2
## 13 CIV-0010 "Abidjan"      "Marcory"           0             0                 1
## 14 CIV-0010 "Abidjan"      "Marcory"           0             0                 1
## 15 CIV-0011 "Yamoussoukro" ""                  0             0                 1
## 16 CIV-0011 "Yamoussoukro" ""                  0             0                 1
## 17 CIV-0012 "Gagnoa"       ""                  5             0                 4
## 18 CIV-0012 "Gagnoa"       ""                  5            12                 4
## 19 CIV-0013 "Daloa"        ""                  2             0                 4
## 20 CIV-0013 "Daloa"        "Dioulab…           3             0                 4
## # ℹ 2 more variables: unit_of_analysis <chr>, aggregation <chr>
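
One simple way to summarize how far two rule sets diverge is to place their estimates side by side and compute the gap per event. The base R sketch below does this for a minimum and a maximum rule on hypothetical toy data; the object names are illustrative and the approach is only one of several possible comparisons.

```r
# Toy event report data (hypothetical values)
reports <- data.frame(
  event_id    = c("E1", "E1", "E1", "E2", "E2"),
  deaths_best = c(8, 40, 12, 1, 1)
)

# Lowest and highest reported death estimate per event
low <- aggregate(deaths_best ~ event_id, data = reports, FUN = min)
high <- aggregate(deaths_best ~ event_id, data = reports, FUN = max)

# Put both estimates side by side and compute the gap between them
comparison <- merge(low, high, by = "event_id", suffixes = c("_min", "_max"))
comparison$gap <- comparison$deaths_best_max - comparison$deaths_best_min

print(comparison)
```

Events with a gap of zero are insensitive to the choice between the two rules, while large gaps flag events that deserve closer inspection.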

Empirical illustration

To demonstrate how eventreport enables users to account for aggregation sensitivity in their analyses based on event report level data, we end with a short empirical illustration. Let’s say that we want to explore the temporal dynamics of electoral violence severity during Côte d’Ivoire’s 2010-2011 election crisis. Using the full MAVERICK dataset, we can quickly see that such an analysis may be sensitive to aggregation choices:

# Calculate the average divergence score

mean_dscore(
  maverick_event_report,
  group_var = "event_id",
  variables = c("date_start", "deaths_best")
)
## # A tibble: 2 × 2
##   variable    dscore
##   <chr>        <dbl>
## 1 date_start  0.0941
## 2 deaths_best 0.138

We then proceed to create two different event datasets for the variables we are interested in: a representative aggregation set that uses the mode date and mode death estimate, and an informative aggregation set that uses the latest date and highest death estimate. We then combine these data frames into a single data frame.

# Create representative aggregation set

representative <- maverick_event_report %>%
  aggregateData(
    group_var = "event_id",
    find_mode = "country",
    find_mode_numeric = "deaths_best",
    find_mode_date = "date_start",
    tie_break = "source_classification",
    second_tie_break = "certain",
    aggregation_name = "Representative"
  )

# Create informative aggregation set

informative <- maverick_event_report %>%
  aggregateData(
    group_var = "event_id",
    find_mode = "country",
    find_max = c("deaths_best", "date_start"),
    tie_break = "source_classification",
    second_tie_break = "certain",
    aggregation_name = "Informative"
  )

# Combine dataframes

combined <- rbind(representative, informative)

Because aggregation sensitivity is only an issue for events recorded in at least two sources, we subset the dataset to only contain multi-source events. To explore electoral violence severity over time during the Ivorian election crisis, we then use the dplyr and lubridate packages to convert date_start into a week variable, and then calculate the number of estimated electoral violence deaths per week.

# Subset and calculate deaths per week

maverick_time_series_week <- combined %>%
  dplyr::filter(number_of_sources > 1) %>%
  dplyr::mutate(date_start = as.Date(as.character(date_start), format = "%Y-%m-%d")) %>%
  dplyr::mutate(week_start = lubridate::floor_date(date_start, unit = "week")) %>%
  tidyr::complete(
    week_start = seq(lubridate::ymd("1995-01-01"), lubridate::ymd("2023-12-31"), by = "1 week"),
    country, aggregation, fill = list(deaths_best = 0)
  ) %>%
  dplyr::group_by(week_start, country, aggregation) %>%
  dplyr::summarize(deaths_best = sum(deaths_best, na.rm = TRUE), .groups = "drop")
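
The weekly binning step can also be illustrated without any add-on packages. The base R sketch below floors each date to the preceding Sunday, which is assumed here to match the default week start used by lubridate::floor_date, and then sums deaths per week. The toy data is hypothetical, and unlike the pipeline above the sketch does not fill in weeks without events.

```r
# Toy event level data (hypothetical values)
events <- data.frame(
  date_start  = as.Date(c("2010-11-29", "2010-12-01", "2010-12-06")),
  deaths_best = c(2, 3, 5)
)

# "%w" gives the weekday as a number from 0 (Sunday) to 6 (Saturday),
# so subtracting it moves each date back to the Sunday of its week
events$week_start <- events$date_start -
  as.integer(format(events$date_start, "%w"))

# Sum the death estimates within each week
weekly <- aggregate(deaths_best ~ week_start, data = events, FUN = sum)

print(weekly)
```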

Finally, we filter the data to the relevant time period (October 2010 through May 2011) and plot the estimated number of deaths per week and aggregation approach using the ggplot2 package. As the figure clearly shows, the total number of estimated deaths per week (as reported by at least two sources) is highly sensitive to our aggregation choices:


maverick_time_series_week %>%
  dplyr::filter(
    week_start > "2010-09-30"
    & week_start < "2011-06-01"
    & country == "Ivory Coast"
  ) %>%
  ggplot2::ggplot() +
  ggplot2::geom_line(ggplot2::aes(y = deaths_best, x = week_start, color = aggregation), linewidth = 1) +
  ggplot2::scale_x_date(
    breaks = seq(as.Date("2010-10-01"), as.Date("2011-06-01"), by = "1 month"),
    date_labels = "%b %Y"
  ) +
  ggplot2::labs(
    x = NULL,
    y = "Best estimated number of weekly deaths"
  ) +
  ggplot2::theme_bw()

Sebastian van Baalen

2025-10-09
