Welcome to the eventreport package. eventreport includes a set of functions to diagnose, visualize, and aggregate event report level data to the event level. The package is intended for working with event report level data, meaning that the data contains multiple report observations for each event, for instance, multiple news reports covering the same electoral violence incident.
This vignette explains what event report level data is and how to work with such data using the functions contained in this package.
Before starting, we load the tidyverse package for writing tidy code, the tinytable package to draw easy-to-read tables, as well as a small subset of the maverick_event_report dataset to exemplify the package functions. Users interested in working with the MAVERICK dataset contained in this package should consult the MAVERICK documentation.
Event report level data refers to data where each observation is an event that takes place on a single day and in a particular location as reported in a single source. The report level means that multiple reports about the same event constitute separate observations. For example, if both the BBC and Reuters report on a violent post-election demonstration, the demonstration is the event, whereas the BBC and Reuters reports constitute the event reports. The table below provides an example of event report level data from the MAVERICK dataset, and lists six unique reports about a single electoral violence event.
event_id | city | location | actor1 | deaths_best | source |
---|---|---|---|---|---|
CIV-0003 | Abidjan | Abobo | Unknown security force (Côte d'Ivoire) | 2 | AFP (2011-01-11) Côte d'Ivoire: au moins deux civils tués après des tirs à Abidjan |
CIV-0003 | Abidjan | Abobo | Forces de Défense et de Sécurité (FDS) | 4 | AFP (2011-01-11) Côte d'Ivoire: quatre morts dans des violences à Abidjan |
CIV-0003 | Abidjan | Abobo | Forces de Défense et de Sécurité (FDS) | 5 | LEJD (2011-01-11) Au moins cinq morts à Abidjan |
CIV-0003 | Abidjan | Abobo | Unknown security force (Côte d'Ivoire) | 2 | AP (2011-01-11) Affrontements dans le quartier d'Abobo à Abidjan |
CIV-0003 | Abidjan | Abobo | Unknown security force (Côte d'Ivoire) | 3 | Le Point (2011-01-11) Nouveaux affrontements à Abidjan |
CIV-0003 | Abidjan | Abobo | Unknown security force (Côte d'Ivoire) | 4 | AFP (2011-01-11) Four killed in clashes as Ivory Coast tensions worsen |
Before using the eventreport package, make sure that your data is recorded at the event report level and not the event level. In addition, you need a column that allows you to group event reports concerning the same event together. For example, in the MAVERICK dataset, the event_id column identifies which reports are about the same event.
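As a minimal sketch of this structure, the toy data frame below has one row per report and an event_id column that groups reports by event. All values are invented for illustration and are not taken from MAVERICK; the code uses base R only.

```r
# Toy event report data: one row per report, several reports per event.
# All values are invented for illustration.
reports <- data.frame(
  event_id    = c("E1", "E1", "E1", "E2"),
  source      = c("BBC", "Reuters", "AFP", "BBC"),
  deaths_best = c(2, 4, 2, 0),
  stringsAsFactors = FALSE
)

# Count reports per event; counts above 1 indicate report level data.
reports_per_event <- table(reports$event_id)
print(reports_per_event)

# The grouping column must exist and must not contain missing values.
stopifnot("event_id" %in% names(reports), !anyNA(reports$event_id))
```

Here event E1 has three reports and event E2 has one, so E1 is the only event that would require aggregation choices.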
Coding event reports rather than events has several benefits (see e.g. Cook and Weidmann 2019; Weidmann and Geelmuyden Rød 2015). First, this coding procedure makes the information extraction step more transparent and helps preserve the raw data contained in the source material. Second, as the aggregation of multiple event reports into single events implies making decisions about report credibility and contradictory information, this coding procedure makes the aggregation process more transparent, flexible, and reproducible. Third, by automating the aggregation process, the coding procedure allows users to replicate their analyses using different aggregation models and to override default aggregation rules and instead develop their own procedures. Fourth, by preserving the raw event reports, this data structure allows users to also use the data to investigate reporting biases and different approaches to improving data quality.
Given the particularities of event report data, we recommend that all package users also consult the associated methods paper, which provides a detailed overview of the strengths and limitations of our suggested approach, as well as the reasoning behind the different aggregation functions (van Baalen and Höglund 2025). For other in-depth analyses of the benefits and limitations of working with event report level data and automatic aggregation procedures, we recommend Cook and Weidmann (2019) and Weidmann and Geelmuyden Rød (2015).
Why use the eventreport package?
Standard statistics software, such as R, already contains some functionality that can be used for aggregating event report level data to the event level. Cook and Weidmann (2019), for instance, use base R functions such as max, min, and mean to aggregate the Mass Mobilization in Autocracies Database, an event report level dataset on protest events. However, as we have detailed elsewhere (van Baalen and Höglund 2025), the aggregation of event reports often demands additional functionality, such as the use of tie-break rules or information contained in meta variables.
The eventreport package adds several functionalities not contained in existing software. Among those benefits, the package:
Handles different variable classes: eventreport handles a range of different variables, including character, date, numeric, and binary numeric variables. This feature makes the package ideal for working with event report datasets that include different variable classes.
Enables tie-breaking rules: many vectors are multi-modal, meaning that simple functions for identifying the most frequent values will yield multiple results. eventreport therefore enables users to specify up to two tie-breaking rules that help adjudicate between multiple modes.
Integrates precision scores: sometimes researchers are interested in recording the most precise value, such as more precise location estimates or more precise actor names. eventreport allows users to specify precision score variables that help prioritize which values to select when the values themselves cannot be ranked.
Provides simple functions: aggregating event report level data is a complex coding project. eventreport makes this procedure more straightforward by providing simple functions that carry out complex tasks. All functions were developed in the context of a concrete event report level data collection effort, and are therefore both needs-based and well-tested.
Allows easy customization: the combination of simple functions and several convenience functions allows users to stipulate a range of complex aggregation rule sets with minimal coding. Moreover, because eventreport is tidyverse compatible, users can integrate the package functions in a tidy workflow.
Before we begin, let’s install the eventreport package.
Install from CRAN:
Once we have installed the package, we can load it:
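Assuming the package is published on CRAN under the name eventreport, as the text above indicates, the two steps would typically look as follows. These are the standard R calls for installing and attaching a package, not commands copied from the package documentation.

```r
# Install once per R library (assumes the CRAN name is "eventreport").
install.packages("eventreport")

# Attach the package in each new R session.
library(eventreport)
```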
Event report level data can come in many forms, where some datasets only include events recorded by at least two sources, whereas other datasets include both single- and multi-source events. Moreover, some variables may harbor more divergences in their values for the same event than other variables. These differences mean that not all datasets and variables are equally sensitive to aggregation choices (van Baalen and Höglund 2025).
dscore
eventreport includes several functions that allow users to diagnose their event report level data. dscore calculates the number of unique values reported for each event, minus 1 so that it only captures divergences. This divergence score allows users to assess how sensitive particular events and variables are to how they aggregate the event report level data:
dscore(
df,
group_var = "event_id",
variables = c("country", "actor1", "deaths_best")
) %>%
head(10)
## # A tibble: 10 × 4
## event_id dscore_country dscore_actor1 dscore_deaths_best
## <chr> <dbl> <dbl> <dbl>
## 1 CIV-0001 0 1 4
## 2 CIV-0002 0 0 0
## 3 CIV-0003 0 3 4
## 4 CIV-0004 0 2 3
## 5 CIV-0008 0 0 0
## 6 CIV-0009 0 0 0
## 7 CIV-0010 0 0 0
## 8 CIV-0011 0 0 0
## 9 CIV-0012 0 0 0
## 10 CIV-0013 0 1 1
From the above output, we can see that event CIV-0003 stands out as particularly sensitive to aggregation choices, as the variable actor1 can take a total of 3 additional values beyond the one chosen by a particular aggregation choice. The variable deaths_best can take 4 additional values. In contrast, aggregation choices will not matter for the events CIV-0002, CIV-0008, CIV-0009, CIV-0010, CIV-0011, and CIV-0012, as there are no additional values that the chosen variables can take.
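To make the definition concrete, the divergence score can be reproduced by hand on toy data. The sketch below is a base R approximation of the idea described above (unique values per event minus 1), not the package's own implementation. The helper name divergence and all data values are hypothetical, and how dscore treats NA values is not covered here.

```r
# Toy report level data; all values are invented for illustration.
reports <- data.frame(
  event_id    = c("E1", "E1", "E1", "E2", "E2"),
  actor1      = c("Police", "Police", "Army", "Youth", "Youth"),
  deaths_best = c(2, 4, 5, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of unique values per group, minus 1.
divergence <- function(x, group) {
  tapply(x, group, function(v) length(unique(v)) - 1)
}

actor_div  <- divergence(reports$actor1, reports$event_id)
deaths_div <- divergence(reports$deaths_best, reports$event_id)

print(actor_div)   # E1 = 1, E2 = 0
print(deaths_div)  # E1 = 2, E2 = 0
```

Event E2 scores 0 on both variables because its two reports agree, so any aggregation rule would return the same value.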
mean_dscore
We can also calculate mean divergence scores for each variable to get a better sense of what variables are most sensitive to aggregation choices. The mean divergence score is calculated as the average number of divergent values per event and variable, and returns a dataframe containing the variable names and the mean divergence scores:
mean_dscore(
df,
group_var = "event_id",
variables = c("country", "actor1", "deaths_best", "injuries_best")
)
## # A tibble: 4 × 2
## variable dscore
## <chr> <dbl>
## 1 country 0
## 2 actor1 0.571
## 3 deaths_best 0.571
## 4 injuries_best 0.514
From the table above, we learn that some variables are more sensitive to aggregation choices than others. For example, while the country variable is not at all sensitive to aggregation choices, the actor1 and deaths_best variables are comparatively more sensitive to how we decide to aggregate the data.
The raw divergence score can sometimes be misleading, as variables differ in the number of possible values. Hence, users can also calculate normalized mean divergence scores for each variable with the normalize = TRUE argument, which returns the mean number of divergences divided by the total number of unique values in each variable:
mean_dscore(
df,
group_var = "event_id",
variables = c("country", "actor1", "deaths_best", "injuries_best"),
normalize = TRUE
)
## # A tibble: 4 × 2
## variable dscore
## <chr> <dbl>
## 1 country 0
## 2 actor1 0.0336
## 3 deaths_best 0.0440
## 4 injuries_best 0.0321
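The relationship between the raw and normalized scores can be sketched in base R. The function below follows the description above (mean divergence per event, optionally divided by the number of unique values in the whole variable). It is an illustrative approximation with a hypothetical name and invented data, not the code behind mean_dscore.

```r
# Toy report level data; all values are invented for illustration.
event_id    <- c("E1", "E1", "E1", "E2", "E2")
actor1      <- c("Police", "Police", "Army", "Youth", "Youth")
deaths_best <- c(2, 4, 5, 0, 0)

# Hypothetical helper mirroring the description of the mean divergence score.
mean_divergence <- function(x, group, normalize = FALSE) {
  per_event <- tapply(x, group, function(v) length(unique(v)) - 1)
  score <- mean(per_event)
  if (normalize) {
    # Divide by the total number of unique values in the variable.
    score <- score / length(unique(x))
  }
  score
}

mean_divergence(actor1, event_id)                       # 0.5
mean_divergence(actor1, event_id, normalize = TRUE)     # 0.5 / 3
mean_divergence(deaths_best, event_id)                  # 1
mean_divergence(deaths_best, event_id, normalize = TRUE) # 1 / 4
```

In this toy example deaths_best has the higher raw score, and it remains higher after normalization because it has only one more unique value than actor1.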
Finally, users can take a visual look at the mean and normalized divergence scores by using the plot = TRUE argument to return a ggplot object:
mean_dscore(
df,
group_var = "event_id",
variables = c("country", "actor1", "deaths_best"),
normalize = TRUE,
plot = TRUE
)
aggregation_diagnostics
eventreport provides convenience functions for calculating six different aggregation diagnostics. The six diagnostics help evaluate how much disagreement exists between different event reports describing the same event.
Mean divergence (mean_dscore) shows how often values differ across reports by counting how many additional unique values are reported per event and variable.
Normalized divergence (mean_dscore(normalize = TRUE)) puts this into perspective by dividing the divergence by the total number of possible unique values, making it easier to compare across variables.
Mean standard deviation (mean_sd) measures how much reported numbers (like deaths or injuries) vary around their average for each event.
Mean range (mean_range) captures the distance between the lowest and highest reported values, highlighting extreme differences.
Share of events with disagreement (event_level_disagreement), the most easily interpretable metric, tells us how often at least two reports disagree on a particular variable value.
Modal confidence (modal_confidence) shows how dominant the most commonly reported value is: high scores mean most sources agree on the modal value, while lower scores suggest disagreement.
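The four diagnostics that go beyond the divergence score can be approximated in a few lines of base R. The sketch below reflects one plausible reading of the descriptions above and is not the package's implementation. In particular, treating modal confidence as the share of reports carrying the most common value, and dropping single-report events from the standard deviation, are assumptions.

```r
# Toy report level data; all values are invented for illustration.
event_id <- c("E1", "E1", "E1", "E2", "E2")
deaths   <- c(2, 4, 5, 0, 0)

# Mean standard deviation per event (sd() is NA for single-report events).
mean_sd <- mean(tapply(deaths, event_id, sd), na.rm = TRUE)

# Mean range: highest minus lowest reported value per event.
mean_range <- mean(tapply(deaths, event_id, function(v) max(v) - min(v)))

# Share of events where at least two reports disagree.
share_disagree <- mean(
  tapply(deaths, event_id, function(v) length(unique(v)) > 1)
)

# Modal confidence (assumed definition): share of reports that carry
# the most common value, averaged across events.
modal_conf <- mean(
  tapply(deaths, event_id, function(v) max(table(v)) / length(v))
)

round(c(mean_sd, mean_range, share_disagree, modal_conf), 3)
```

In the toy data, E1 has three different death counts and E2 has two identical ones, so half of the events show disagreement and the average modal confidence is (1/3 + 1) / 2.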
To easily compare different aggregation diagnostics, users can run all diagnostics for a set of variables with one command:
diagnostics <- aggregation_diagnostics(
df,
group_var = "event_id",
variables = c("city", "deaths_best", "actor1")
)
tt(diagnostics)
Variable | Mean divergence | Normalized divergence | Mean standard deviation | Mean range | Share of events with disagreement (%) | Modal confidence (%) |
---|---|---|---|---|---|---|
city | 0.11 | 0.01 | | | 0.11 | 0.97 |
deaths_best | 0.57 | 0.04 | 1.41 | 1.66 | 0.29 | 0.88 |
actor1 | 0.57 | 0.03 | | | 0.37 | 0.87 |
The eventreport package consists of a number of different functions that help the user aggregate event report level data into event level data. All functions and their use are outlined in the package documentation.
Find the mode value of a character vector:
Given that some vectors may have multiple mode values, the calc_mode function allows the user to specify up to two tie-breaking rules that help arbitrate multi-modal results. These tie-breaking rules must be numeric vectors where higher values give priority if it comes down to a tie-break:
calc_mode(
c("Sweden", "Sweden", "Denmark", "Denmark"),
tie_break = c(1, 1, 1, 1),
second_tie_break = c(1, 4, 1, 1)
)
## [1] "Sweden"
In cases where no mode value can be found after two tie-breaks, the calc_mode function returns the value "Indeterminate", thereby forcing users to explicitly make a decision on how to handle multi-modal vectors.
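The tie-breaking logic can be illustrated with a small base R function. This is a sketch under stated assumptions rather than the calc_mode source: it assumes that, among tied modes, the value attached to the report with the highest tie-break score wins, and it does not handle NA values. The function name mode_with_ties is hypothetical.

```r
# Hypothetical re-implementation of a mode with up to two tie-breaks.
# Assumption: among tied values, the one backed by the highest
# tie-break score is kept. NA values are not handled in this sketch.
mode_with_ties <- function(x, tie_break = NULL, second_tie_break = NULL) {
  counts <- table(x)
  candidates <- names(counts)[as.vector(counts == max(counts))]

  for (scores in list(tie_break, second_tie_break)) {
    if (length(candidates) == 1 || is.null(scores)) next
    # Best score observed for each remaining candidate value.
    best <- vapply(candidates, function(v) max(scores[x == v]), numeric(1))
    candidates <- candidates[best == max(best)]
  }

  if (length(candidates) == 1) candidates else "Indeterminate"
}

countries <- c("Sweden", "Sweden", "Denmark", "Denmark")

# The first rule cannot separate the two modes; the second one can.
mode_with_ties(countries, c(1, 1, 1, 1), c(1, 4, 1, 1))  # "Sweden"

# Without usable tie-break information the result stays unresolved.
mode_with_ties(countries)  # "Indeterminate"
```

Returning a sentinel string instead of picking a mode at random keeps unresolved ties visible in the aggregated data.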
The calc_mode function treats both NA values and empty strings as real values, and hence returns NA or empty strings whenever those are the most common values:
Find the mode value of a character vector while ignoring NA values and empty strings:
Find the mode value from a binary numeric vector:
Find the mode value in a numeric vector:
Find the mode date from a character vector written in the format YYYY-MM-DD:
Find the most specific value in a character vector by using an auxiliary precision score.
Find the least specific value in a character vector by using an auxiliary precision score.
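The difference between treating missing entries as real values and ignoring them can be sketched in base R. The helper below is hypothetical and only illustrates the idea of dropping NA values and empty strings before taking the mode. Returning NA when nothing remains is a choice made for this sketch, not documented package behavior.

```r
# Hypothetical helper: mode of a character vector after dropping
# NA values and empty strings.
mode_ignore_missing <- function(x) {
  kept <- x[!is.na(x) & x != ""]
  if (length(kept) == 0) return(NA_character_)  # sketch-specific choice
  counts <- table(kept)
  top <- names(counts)[as.vector(counts == max(counts))]
  if (length(top) == 1) top else "Indeterminate"
}

locations <- c("", "", NA, "Abobo")

# Empty strings are the most frequent raw entry, but once missing
# entries are dropped only "Abobo" remains.
mode_ignore_missing(locations)  # "Abobo"
```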
The main purpose of the eventreport package is to allow users to aggregate entire datasets from the event report level to the event level. This task is best achieved with the aggregateData function, which enables users to specify multiple aggregation rules at once and store the output as a dataframe. To illustrate its use, we first load the MAVERICK event report data stored in the eventreport package (using only 100 observations for faster computing):
A basic aggregateData call must include the data argument, the group_var argument, and specify at least one aggregation rule for one variable. Because aggregateData builds on the dplyr package, we can call the function using the pipe operator:
df %>%
aggregateData(
group_var = "event_id",
find_mode = "city"
) %>%
utils::head(10)
## # A tibble: 10 × 4
## event_id city number_of_sources unit_of_analysis
## <chr> <chr> <int> <chr>
## 1 CIV-0001 "Duékoué" 5 Event
## 2 CIV-0002 "" 2 Event
## 3 CIV-0003 "Abidjan" 12 Event
## 4 CIV-0004 "Abidjan" 6 Event
## 5 CIV-0008 "Man" 1 Event
## 6 CIV-0009 "Vavoua" 2 Event
## 7 CIV-0010 "Abidjan" 1 Event
## 8 CIV-0011 "Yamoussoukro" 1 Event
## 9 CIV-0012 "Gagnoa" 4 Event
## 10 CIV-0013 "Daloa" 4 Event
The aggregateData call returns a tibble containing the specified variable (city), which now holds the mode value for each group defined by group_var. In addition, aggregateData automatically returns two further variables: number_of_sources, which counts the number of reports per group, and unit_of_analysis, which indicates that the data is aggregated at the event level.
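What this output represents can be mimicked on toy data with base R. The sketch below is not how aggregateData is implemented. It only shows the shape of the result: one row per event, the mode of city, a report count, and a unit label. The helper simple_mode and all data values are hypothetical.

```r
# Toy report level data; all values are invented for illustration.
reports <- data.frame(
  event_id = c("E1", "E1", "E1", "E2"),
  city     = c("Abidjan", "Abidjan", "Gagnoa", "Man"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mode without tie-breaks.
simple_mode <- function(x) {
  counts <- table(x)
  top <- names(counts)[as.vector(counts == max(counts))]
  if (length(top) == 1) top else "Indeterminate"
}

# Collapse each group of reports into a single event row.
events <- do.call(rbind, lapply(split(reports, reports$event_id), function(g) {
  data.frame(
    event_id          = g$event_id[1],
    city              = simple_mode(g$city),
    number_of_sources = nrow(g),
    unit_of_analysis  = "Event",
    stringsAsFactors  = FALSE
  )
}))
rownames(events) <- NULL

print(events)
```

The four report rows collapse into two event rows, with E1 resolved to "Abidjan" because two of its three reports agree.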
Most event report datasets consist of multiple variables of different classes and hence demand more complex aggregation rule sets than defined in our minimal example. To include additional variables, users need only provide a character vector of variable names for each rule:
df %>%
aggregateData(
group_var = "event_id",
find_mode = c("city", "location", "actor1")
) %>%
utils::head(10)
## # A tibble: 10 × 6
## event_id city location actor1 number_of_sources unit_of_analysis
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 CIV-0001 "Duékoué" "" Membe… 5 Event
## 2 CIV-0002 "" "Indetermi… Unkno… 2 Event
## 3 CIV-0003 "Abidjan" "Abobo" Unkno… 12 Event
## 4 CIV-0004 "Abidjan" "Abobo" Unkno… 6 Event
## 5 CIV-0008 "Man" "" Youth… 1 Event
## 6 CIV-0009 "Vavoua" "" Unkno… 2 Event
## 7 CIV-0010 "Abidjan" "Marcory" Polic… 1 Event
## 8 CIV-0011 "Yamoussoukro" "" Polic… 1 Event
## 9 CIV-0012 "Gagnoa" "" Unkno… 4 Event
## 10 CIV-0013 "Daloa" "" Unkno… 4 Event
Moreover, users can specify different rules for different sets of variables. In the example below, we aggregate the data using the mode value for the variables city and location, the mode value ignoring NA values and empty strings for the actor1 variable, and the maximum value for the deaths_best variable. In addition, we use the combine_strings argument to retain all sources used to code each event:
df %>%
aggregateData(
group_var = "event_id",
find_mode = c("city", "location"),
find_mode_na_ignore = "actor1",
find_max = "deaths_best",
combine_strings = "source"
) %>%
dplyr::select(event_id:actor1, deaths_best:unit_of_analysis, source) %>%
dplyr::filter(event_id == "CIV-0002")
## # A tibble: 1 × 8
## event_id city location actor1 deaths_best number_of_sources unit_of_analysis
## <chr> <chr> <chr> <chr> <int> <int> <chr>
## 1 CIV-0002 "" Indeterm… Unkno… 1 2 Event
## # ℹ 1 more variable: source <chr>
So far, we have used the aggregateData function without any tie-breaking rules, meaning that efforts to find the mode value often return the value "Indeterminate". This occurs because several groups are multi-modal, meaning that there are two or more mode values. To limit the risk of indeterminate values, we can make use of the tie-breaking arguments to draw on additional information to determine which mode value to retain in our data.
In the specification below, for instance, we stipulate that in the case of multi-modal results, the function should first select the value from the report with the highest value in the source_classification variable (which ranks MAVERICK reports based on their reputation for trustworthiness), and thereafter select the value from the report with the highest value in the certain variable (which ranks MAVERICK reports based on how election-related the event was):
df %>%
aggregateData(
group_var = "event_id",
find_mode = c("city", "location"),
find_mode_na_ignore = "actor1",
find_max = "deaths_best",
tie_break = "source_classification",
second_tie_break = "certain"
) %>%
utils::head(10)
## # A tibble: 10 × 7
## event_id city location actor1 deaths_best number_of_sources unit_of_analysis
## <chr> <chr> <chr> <chr> <int> <int> <chr>
## 1 CIV-0001 "Dué… "" Membe… 40 5 Event
## 2 CIV-0002 "" "" Unkno… 1 2 Event
## 3 CIV-0003 "Abi… "Abobo" Unkno… 5 12 Event
## 4 CIV-0004 "Abi… "Abobo" Unkno… 7 6 Event
## 5 CIV-0008 "Man" "" Youth… 0 1 Event
## 6 CIV-0009 "Vav… "" Unkno… 0 2 Event
## 7 CIV-0010 "Abi… "Marcor… Polic… 0 1 Event
## 8 CIV-0011 "Yam… "" Polic… 0 1 Event
## 9 CIV-0012 "Gag… "" Unkno… 5 4 Event
## 10 CIV-0013 "Dal… "" Unkno… 3 4 Event
We can also use precision scores to rank variable values and prioritize the most or least precise values. For example, below we use the MAVERICK geographical precision scores to find the most precise city and location information:
df %>%
aggregateData(
group_var = "event_id",
find_most_precise = list(
list(var = "city", precision_var = "geo_precision"),
list(var = "location", precision_var = "geo_precision")
),
find_mode_na_ignore = "actor1",
find_max = "deaths_best",
tie_break = "source_classification",
second_tie_break = "certain",
) %>%
utils::head(10)
## # A tibble: 10 × 7
## event_id actor1 deaths_best city location number_of_sources unit_of_analysis
## <chr> <chr> <int> <chr> <chr> <int> <chr>
## 1 CIV-0001 Membe… 40 "Dué… "" 5 Event
## 2 CIV-0002 Unkno… 1 "" "" 2 Event
## 3 CIV-0003 Unkno… 5 "Abi… "Abobo" 12 Event
## 4 CIV-0004 Unkno… 7 "Abi… "Abobo" 6 Event
## 5 CIV-0008 Youth… 0 "Man" "" 1 Event
## 6 CIV-0009 Unkno… 0 "Vav… "" 2 Event
## 7 CIV-0010 Polic… 0 "Abi… "Marcor… 1 Event
## 8 CIV-0011 Polic… 0 "Yam… "" 1 Event
## 9 CIV-0012 Unkno… 5 "Gag… "" 4 Event
## 10 CIV-0013 Unkno… 3 "Dal… "Dioula… 4 Event
Finally, because some users may want to compare aggregation results across different rule sets (one of the main strengths of working with event report level data), we can assign a name to our aggregation rule set using the aggregation_name argument. Doing so allows us to generate different aggregation sets and compare results across aggregations:
conservative <- df %>%
aggregateData(
group_var = "event_id",
find_mode = c("city", "location"),
find_min = c("deaths_best", "injuries_best"),
tie_break = "source_classification",
second_tie_break = "certain",
aggregation_name = "Most-conservative"
) %>%
utils::head(10)
maximalist <- df %>%
aggregateData(
group_var = "event_id",
find_mode_na_ignore = c("city", "location"),
find_max = c("deaths_best", "injuries_best"),
tie_break = "source_classification",
second_tie_break = "certain",
aggregation_name = "Most-informative"
) %>%
utils::head(10)
rbind(conservative, maximalist) %>%
dplyr::arrange(event_id)
## # A tibble: 20 × 8
## event_id city location deaths_best injuries_best number_of_sources
## <chr> <chr> <chr> <int> <int> <int>
## 1 CIV-0001 "Duékoué" "" 8 0 5
## 2 CIV-0001 "Duékoué" "" 40 91 5
## 3 CIV-0002 "" "" 1 0 2
## 4 CIV-0002 "" "Maison … 1 0 2
## 5 CIV-0003 "Abidjan" "Abobo" 0 0 12
## 6 CIV-0003 "Abidjan" "Abobo" 5 2 12
## 7 CIV-0004 "Abidjan" "Abobo" 1 0 6
## 8 CIV-0004 "Abidjan" "Abobo" 7 3 6
## 9 CIV-0008 "Man" "" 0 0 1
## 10 CIV-0008 "Man" "" 0 0 1
## 11 CIV-0009 "Vavoua" "" 0 0 2
## 12 CIV-0009 "Vavoua" "" 0 0 2
## 13 CIV-0010 "Abidjan" "Marcory" 0 0 1
## 14 CIV-0010 "Abidjan" "Marcory" 0 0 1
## 15 CIV-0011 "Yamoussoukro" "" 0 0 1
## 16 CIV-0011 "Yamoussoukro" "" 0 0 1
## 17 CIV-0012 "Gagnoa" "" 5 0 4
## 18 CIV-0012 "Gagnoa" "" 5 12 4
## 19 CIV-0013 "Daloa" "" 2 0 4
## 20 CIV-0013 "Daloa" "Dioulab… 3 0 4
## # ℹ 2 more variables: unit_of_analysis <chr>, aggregation <chr>
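The logic of stacking two rule sets can be reproduced on toy data with base R. This sketch only contrasts minimum and maximum death counts and labels each result, mirroring the aggregation column in the output above. It is not a substitute for aggregateData, and the helper name and data are invented.

```r
# Toy report level data; all values are invented for illustration.
reports <- data.frame(
  event_id    = c("E1", "E1", "E1", "E2", "E2"),
  deaths_best = c(2, 4, 5, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: aggregate deaths per event with one rule
# and label the result with the name of the rule set.
aggregate_deaths <- function(data, fun, label) {
  out <- aggregate(deaths_best ~ event_id, data = data, FUN = fun)
  out$aggregation <- label
  out
}

both <- rbind(
  aggregate_deaths(reports, min, "Most-conservative"),
  aggregate_deaths(reports, max, "Most-informative")
)

# Sort so that the two estimates for each event sit next to each other.
both <- both[order(both$event_id, both$aggregation), ]
rownames(both) <- NULL

print(both)
```

Event E1 yields 2 deaths under the conservative rule and 5 under the informative rule, while E2 is unaffected because its reports agree.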
To demonstrate how eventreport enables users to account for aggregation sensitivity in their analyses based on event report level data, we end with a short empirical illustration. Let’s say that we want to explore the temporal dynamics of electoral violence severity during Côte d’Ivoire’s 2010-2011 election crisis. Using the full MAVERICK dataset, we can quickly see that such an analysis may be sensitive to aggregation choices:
# Calculate the average divergence score
mean_dscore(
maverick_event_report,
group_var = "event_id",
variables = c("date_start", "deaths_best")
)
## # A tibble: 2 × 2
## variable dscore
## <chr> <dbl>
## 1 date_start 0.0941
## 2 deaths_best 0.138
We then create two different event datasets for the variables we are interested in: a representative aggregation set that uses the mode date and mode death estimate, and an informative aggregation set that uses the latest date and highest death estimate. Finally, we combine these data frames into a single data frame.
# Create representative aggregation set
representative <- maverick_event_report %>%
aggregateData(
group_var = "event_id",
find_mode = "country",
find_mode_numeric = "deaths_best",
find_mode_date = "date_start",
tie_break = "source_classification",
second_tie_break = "certain",
aggregation_name = "Representative"
)
# Create informative aggregation set
informative <- maverick_event_report %>%
aggregateData(
group_var = "event_id",
find_mode = "country",
find_max = c("deaths_best", "date_start"),
tie_break = "source_classification",
second_tie_break = "certain",
aggregation_name = "Informative"
)
# Combine dataframes
combined <- rbind(representative, informative)
Because aggregation sensitivity is only an issue for events recorded in at least two sources, we subset the dataset to only contain multi-source events. To explore electoral violence severity over time during the Ivorian election crisis, we then use the dplyr and lubridate packages to convert date_start into a week variable, and then calculate the number of estimated electoral violence deaths per week.
# Subset and calculate deaths per week
maverick_time_series_week <- combined %>%
dplyr::filter(number_of_sources > 1) %>%
dplyr::mutate(date_start = as.Date(as.character(date_start), format = "%Y-%m-%d")) %>%
dplyr::mutate(week_start = lubridate::floor_date(date_start, unit = "week")) %>%
tidyr::complete(
week_start = seq(lubridate::ymd("1995-01-01"), lubridate::ymd("2023-12-31"), by = "1 week"),
country, aggregation, fill = list(deaths_best = 0)
) %>%
dplyr::group_by(week_start, country, aggregation) %>%
dplyr::summarize(deaths_best = sum(deaths_best, na.rm = TRUE), .groups = "drop")
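The weekly binning and zero-filling steps can be checked on a handful of dates using base R alone. The sketch below assumes weeks that start on Sunday, which to our understanding is also the lubridate::floor_date default. It uses invented dates and death counts and is meant only to show the mechanics, not to replace the pipeline above.

```r
# Toy event level data; dates and counts are invented for illustration.
dates  <- as.Date(c("2010-12-01", "2010-12-03", "2010-12-06", "2010-12-20"))
deaths <- c(2, 1, 4, 3)

# Floor each date to the preceding Sunday: "%w" is the weekday number
# with 0 = Sunday, so subtracting it moves the date back to Sunday.
week_start <- dates - as.integer(format(dates, "%w"))

# Sum deaths within each observed week.
observed <- tapply(deaths, as.character(week_start), sum)

# Build the full weekly sequence and fill unobserved weeks with 0.
all_weeks <- seq(min(week_start), max(week_start), by = "1 week")
weekly <- setNames(rep(0, length(all_weeks)), as.character(all_weeks))
weekly[names(observed)] <- as.vector(observed)

print(weekly)
```

The week starting 2010-12-12 has no events and therefore appears with 0 deaths, which prevents a line plot from interpolating across the gap.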
Finally, we filter the data to the relevant time period (October 2010 to June 2011) and plot the estimated number of deaths per week and aggregation approach using the ggplot2 package. As the figure clearly shows, the total number of estimated deaths per week (as reported by at least two sources) is highly sensitive to our aggregation choices:
maverick_time_series_week %>%
dplyr::filter(
week_start > "2010-09-30"
& week_start < "2011-06-01"
& country == "Ivory Coast"
) %>%
ggplot2::ggplot() +
ggplot2::geom_line(ggplot2::aes(y = deaths_best, x = week_start, color = aggregation), linewidth = 1) +
ggplot2::scale_x_date(
breaks = seq(as.Date("2010-10-01"), as.Date("2011-06-01"), by = "1 month"),
date_labels = "%b %Y"
) +
ggplot2::labs(
x = NULL,
y = "Best estimated number of weekly deaths"
) +
ggplot2::theme_bw()