The goal of this vignette is to illustrate how event data can be used for descriptive analysis in R. The data from the first municipality of the BPI Challenge 2015 will be used throughout this vignette. It is made available by the package under the name BPIC15_1 and is already preprocessed to an object of the class eventlog. For more information on the preprocessing of event data, look at the corresponding vignette.
library(edeaR)
data("BPIC15_1")
The most high-level way to describe an eventlog is to use the generic R function summary.
summary(BPIC15_1)
## Number of events: 52217
## Number of cases: 1199
## Number of traces: 1099
## Number of activities: 398
## Average trace length: 43.55046
##
## Start eventlog: 2010-10-04 22:00:00
## End eventlog: 2015-07-31 22:00:00
## case_concept.name event_question event_dateFinished
## Length:52217 Length:52217 Length:52217
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## event_dueDate event_action_code event_activityNameEN
## Length:52217 Length:52217 Length:52217
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## event_planned event_time.timestamp event_monitoringResource
## Length:52217 Min. :2010-10-04 22:00:00 Length:52217
## Class :character 1st Qu.:2011-11-07 09:32:10 Class :character
## Mode :character Median :2012-11-19 08:25:49 Mode :character
## Mean :2012-12-12 19:44:44
## 3rd Qu.:2014-01-15 23:00:00
## Max. :2015-07-31 22:00:00
## event_org.resource event_activityNameNL event_concept.name
## Length:52217 Length:52217 Length:52217
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## event_lifecycle.transition event_dateStop activity_instance
## Length:52217 Length:52217 Min. : 1
## Class :character Class :character 1st Qu.:13055
## Mode :character Mode :character Median :26109
## Mean :26109
## 3rd Qu.:39163
## Max. :52217
As can be observed above, the summary contains the number of events, activities, traces and cases, as well as the time span covered by the event log.
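To make the figures in this summary concrete, the following minimal sketch derives the same kind of counts by hand from a small toy event log in base R. The data, the column names (case, activity, timestamp) and the object names are hypothetical and not part of edeaR. A trace is assumed here to be the ordered sequence of activities of a case, and the average trace length is taken as the number of events divided by the number of cases, which is consistent with the output above (52217 / 1199 ≈ 43.55).

```r
# Toy event log (hypothetical data, not BPIC15_1): one row per event.
toy_log <- data.frame(
  case      = c("c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3"),
  activity  = c("A",  "B",  "C",  "A",  "B",  "C",  "A",  "C"),
  timestamp = as.POSIXct(c(
    "2015-01-01 09:00:00", "2015-01-01 10:00:00", "2015-01-02 09:00:00",
    "2015-01-03 09:00:00", "2015-01-03 11:00:00", "2015-01-04 09:00:00",
    "2015-01-05 09:00:00", "2015-01-06 09:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Order events within each case by time before building traces.
toy_log <- toy_log[order(toy_log$case, toy_log$timestamp), ]

# Assumption: a trace is the ordered activity sequence of one case.
traces <- tapply(toy_log$activity, toy_log$case, paste, collapse = ",")

log_summary <- list(
  n_events         = nrow(toy_log),
  n_cases          = length(unique(toy_log$case)),
  n_traces         = length(unique(traces)),       # distinct sequences
  n_activities     = length(unique(toy_log$activity)),
  avg_trace_length = nrow(toy_log) / length(unique(toy_log$case)),
  start            = min(toy_log$timestamp),
  end              = max(toy_log$timestamp)
)
str(log_summary)
```

In this toy log, cases c1 and c2 follow the same sequence, so there are fewer traces than cases, just as in BPIC15_1 (1099 traces for 1199 cases).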
The cases function returns a data.frame which contains general descriptives about each individual case.
case_information <- cases(BPIC15_1)
case_information
## Source: local data frame [1,199 x 10]
##
## case_concept.name trace_length number_of_activities start_timestamp
## (chr) (int) (int) (time)
## 1 10009138 45 45 2014-04-10 22:00:00
## 2 10051383 57 56 2014-04-16 22:00:00
## 3 10053042 57 56 2014-04-13 22:00:00
## 4 10083315 58 57 2014-04-16 22:00:00
## 5 10093171 46 46 2014-04-21 22:00:00
## 6 10128431 56 55 2014-04-24 22:00:00
## 7 10153084 58 57 2014-04-28 22:00:00
## 8 10154600 47 47 2014-04-29 22:00:00
## 9 10186016 71 70 2014-05-01 22:00:00
## 10 10186644 55 54 2014-04-30 22:00:00
## .. ... ... ... ...
## Variables not shown: complete_timestamp (time), trace (chr), trace_id
## (dbl), duration_in_days (dbl), first_activity (fctr), last_activity
## (fctr)
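For each case, the following values are reported, as can be seen from the column names in the output above: the trace length, the number of activities, the start and complete timestamps, the trace and its id, the duration in days, and the first and last activity.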
The resulting data.frame has little value on its own, as there might be hundreds of cases. However, it can be further summarized and visualized. Below, the most common start and end activities of a case are shown. While almost all cases start with 01_HOOFD_010, there is much more variation in the last activity.
library(dplyr)
summary(select(case_information, first_activity, last_activity))
## first_activity last_activity
## 01_HOOFD_010 :1182 01_HOOFD_530 :302
## 11_AH_II_040b : 7 01_HOOFD_510_2a:106
## 01_HOOFD_030_2: 2 01_HOOFD_820 : 95
## 01_HOOFD_065_2: 2 01_HOOFD_510_2 : 92
## 01_HOOFD_011 : 1 01_HOOFD_516 : 82
## 01_HOOFD_080 : 1 01_HOOFD_510_4 : 48
## (Other) : 4 (Other) :474
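As a rough illustration of what such per-case descriptives look like when computed by hand, the sketch below builds a comparable table from a toy event log using only base R. The data and all names are hypothetical. We also assume that number_of_activities counts the distinct activities in a case, which the output of cases does not state explicitly.

```r
# Toy event log (hypothetical data): one row per event.
toy_log <- data.frame(
  case      = c("c1", "c1", "c1", "c1", "c2", "c2", "c2", "c2", "c3", "c3"),
  activity  = c("A",  "B",  "B",  "C",  "A",  "B",  "C",  "B",  "A",  "C"),
  timestamp = as.POSIXct(c(
    "2015-01-01 09:00:00", "2015-01-01 12:00:00",
    "2015-01-02 09:00:00", "2015-01-03 09:00:00",
    "2015-01-05 09:00:00", "2015-01-05 15:00:00",
    "2015-01-06 09:00:00", "2015-01-06 21:00:00",
    "2015-01-10 09:00:00", "2015-01-10 21:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
toy_log <- toy_log[order(toy_log$case, toy_log$timestamp), ]

# Split the log into one data.frame per case and describe each one.
per_case <- lapply(split(toy_log, toy_log$case), function(ev) {
  data.frame(
    case                 = ev$case[1],
    trace_length         = nrow(ev),
    # Assumption: distinct activities within the case.
    number_of_activities = length(unique(ev$activity)),
    first_activity       = ev$activity[1],
    last_activity        = ev$activity[nrow(ev)],
    duration_in_days     = as.numeric(difftime(max(ev$timestamp),
                                               min(ev$timestamp),
                                               units = "days")),
    stringsAsFactors = FALSE
  )
})
case_info <- do.call(rbind, per_case)
case_info

# How often each activity starts or ends a case.
table(case_info$first_activity)
table(case_info$last_activity)
```

Every toy case starts with A, while the last activity differs between cases, which mirrors the pattern seen for BPIC15_1 on a much smaller scale.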
Using the package ggplot2, we can also visualize this information. The next piece of code visualizes the distribution of the throughput time, i.e. the duration of a case.
library(ggplot2)
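# geom_histogram bins a continuous variable; recent versions of ggplot2
# no longer accept binwidth in geom_bar
ggplot(case_information) +
  geom_histogram(aes(duration_in_days), binwidth = 30, fill = "#0072B2") +
  scale_x_continuous(limits = c(0, 500)) +
  xlab("Duration (in days)") +
  ylab("Number of cases")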
## Activities
The activities function shows the frequencies of the different activities.
activity_information <- activities(BPIC15_1)
activity_information
## Source: local data frame [398 x 3]
##
## event_concept.name absolute_frequency relative_frequency
## (chr) (int) (dbl)
## 1 01_BB_550 1 1.915085e-05
## 2 01_BB_560 1 1.915085e-05
## 3 01_BB_670_1 1 1.915085e-05
## 4 01_BB_680 1 1.915085e-05
## 5 01_HOOFD_197 1 1.915085e-05
## 6 01_HOOFD_331 1 1.915085e-05
## 7 01_HOOFD_446_1 1 1.915085e-05
## 8 01_HOOFD_446_2 1 1.915085e-05
## 9 01_HOOFD_456 1 1.915085e-05
## 10 01_HOOFD_496_1 1 1.915085e-05
## .. ... ... ...
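The relative frequencies in this output equal the absolute frequency divided by the total number of events (1 / 52217 ≈ 1.915085e-05). A minimal base R sketch of the same computation on a toy vector of activity labels is given here; the data and object names are hypothetical.

```r
# Activity label of each event in a toy log (hypothetical data).
activity <- c("A", "B", "B", "C", "A", "B", "C", "B", "A", "C")

# Absolute frequency: number of events per activity.
abs_freq <- table(activity)

activity_freq <- data.frame(
  activity           = names(abs_freq),
  absolute_frequency = as.integer(abs_freq),
  # Relative frequency: share of all events in the log.
  relative_frequency = as.numeric(abs_freq) / length(activity),
  stringsAsFactors   = FALSE
)

# Sort from least to most frequent activity.
activity_freq <- activity_freq[order(activity_freq$absolute_frequency), ]
activity_freq
```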
The following graph shows a cumulative distribution function for the absolute frequency of activities. It shows that about 75% of the activities occur fewer than 100 times.
ggplot(activity_information) +
stat_ecdf(aes(absolute_frequency), lwd = 1, col = "#0072B2") +
scale_x_continuous(breaks = seq(0, 1000, by = 100)) +
xlab("Absolute activity frequencies") +
ylab("Cumulative percentage")
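A statement such as "about 75% of the activities occur fewer than 100 times" can also be read off numerically instead of from the plot. The sketch below uses the base R function ecdf on a hypothetical vector of activity frequencies; with the real data, one would pass activity_information$absolute_frequency instead.

```r
# Hypothetical absolute frequencies of eight activities.
freq <- c(1, 1, 2, 5, 50, 120, 300, 900)

# ecdf() returns a function giving the share of values <= x.
freq_ecdf <- ecdf(freq)

# Frequencies are integers, so "fewer than 100" equals "<= 99".
share_below_100 <- freq_ecdf(99)
share_below_100

# The reverse question: which frequency bounds 75% of the activities?
quantile(freq, probs = 0.75)
```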
## Predefined descriptive metrics
Next to the more general descriptives seen so far, a series of specific descriptive metrics has been defined. Three different analysis levels are distinguished: log, trace and activity. The metrics look at aspects of time as well as the structuredness of the eventlog. Some of the metrics will be illustrated below.
The next piece of code computes the number of selfloops at the level of activities.
activity_selfloops <- number_of_selfloops(BPIC15_1, level_of_analysis = "activity")
activity_selfloops
## event_concept.name absolute relative
## 1 01_HOOFD_205 86 0.565789474
## 2 01_HOOFD_100 31 0.086834734
## 3 01_HOOFD_190_2 9 0.068181818
## 4 08_AWB45_005 5 0.006684492
## 5 01_HOOFD_065_2 2 0.003067485
## 6 01_HOOFD_110 1 0.001858736
## 7 01_HOOFD_120 1 0.001972387
## 8 01_HOOFD_180 1 0.000896861
## 9 01_HOOFD_200 1 0.001027749
## 10 01_HOOFD_510_2 1 0.001108647
## 11 01_HOOFD_790 1 0.015873016
## 12 02_DRZ_030_2 1 0.200000000
## 13 10_UOV_065 1 0.076923077
The output shows that 13 activities sometimes occur in a selfloop. The activity 01_HOOFD_205 has the most selfloops, namely 86.
Visualized:
ggplot(activity_selfloops) +
geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Activity") +
ylab("Number of selfloops")
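To illustrate the idea of a selfloop, the following sketch counts, per activity, how often it is directly followed by itself within a case in a toy event log. This is only one possible way of counting, using the base R function rle, and is not necessarily how number_of_selfloops is implemented. Each run of identical consecutive activities counts once, regardless of its length. The data and the helper find_selfloops are hypothetical, and the events are assumed to be in chronological order within each case.

```r
# Toy event log (hypothetical data), already ordered within each case.
toy_log <- data.frame(
  case     = c("c1", "c1", "c1", "c1", "c2", "c2", "c2", "c2", "c3", "c3"),
  activity = c("A",  "B",  "B",  "C",  "A",  "B",  "C",  "B",  "A",  "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: activities that form a run longer than one event.
find_selfloops <- function(activities) {
  runs <- rle(activities)
  runs$values[runs$lengths > 1]
}

# Apply the helper to each case and tabulate the results per activity.
looping <- unlist(lapply(split(toy_log$activity, toy_log$case),
                         find_selfloops))
selfloops <- table(factor(looping,
                          levels = sort(unique(toy_log$activity))))
selfloops
```

Only case c1 contains a selfloop (B directly followed by B). In case c2, B occurs twice but with C in between, so it is not counted here.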
### Repetitions
Complementary to selfloops are repetitions: activities which are repeated in a case, but not directly following each other.
activity_repetitions <- repetitions(BPIC15_1, level_of_analysis = "activity")
activity_repetitions
## event_concept.name relative_frequency absolute relative
## 1 01_HOOFD_180 0.0213723500 78 0.0650542118
## 2 01_HOOFD_200 0.0186529291 37 0.0308590492
## 3 01_HOOFD_510_2 0.0172932187 3 0.0025020851
## 4 08_AWB45_005 0.0144205910 143 0.1192660550
## 5 01_HOOFD_065_2 0.0125246567 1 0.0008340284
## 6 01_HOOFD_110 0.0103223088 71 0.0592160133
## 7 01_HOOFD_120 0.0097286324 67 0.0558798999
## 8 01_HOOFD_100 0.0074305303 156 0.1301084237
## 9 01_HOOFD_205 0.0045579026 3 0.0025020851
## 10 01_HOOFD_190_2 0.0027002700 10 0.0083402836
## 11 01_HOOFD_790 0.0012256545 12 0.0100083403
## 12 10_UOV_065 0.0002681119 0 0.0000000000
## 13 02_DRZ_030_2 0.0001149051 0 0.0000000000
Visualized:
ggplot(activity_repetitions) +
geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Activity") +
ylab("Number of repetitions")
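The difference with selfloops can be made concrete on a toy event log. In the sketch below, an activity counts as a repetition in a case when it occurs in more than one separate run of that case, i.e. it comes back after at least one other activity. This is an illustrative definition in base R and not necessarily the exact one used by repetitions. The data and the helper find_repetitions are hypothetical, and the events are assumed to be in chronological order within each case.

```r
# Toy event log (hypothetical data), already ordered within each case.
toy_log <- data.frame(
  case     = c("c1", "c1", "c1", "c1", "c2", "c2", "c2", "c2", "c3", "c3"),
  activity = c("A",  "B",  "B",  "C",  "A",  "B",  "C",  "B",  "A",  "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: activities that appear in more than one run.
find_repetitions <- function(activities) {
  runs <- rle(activities)           # collapse selfloops into single runs
  run_counts <- table(runs$values)  # number of separate runs per activity
  names(run_counts)[run_counts > 1]
}

# Apply the helper to each case and tabulate the results per activity.
repeated <- unlist(lapply(split(toy_log$activity, toy_log$case),
                          find_repetitions))
repetition_counts <- table(factor(repeated,
                                  levels = sort(unique(toy_log$activity))))
repetition_counts
```

Here only case c2 contributes a repetition of B. The consecutive B events in case c1 collapse into a single run and are therefore not counted.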
### Combining descriptives
Using some data manipulation in R, we can plot both descriptives together, to easily see whether repetitions and selfloops occur often for the same activities.
data <- bind_rows(mutate(activity_selfloops, type = "selfloops"),
mutate(select(activity_repetitions, event_concept.name, absolute), type = "repetitions"))
ggplot(data) +
geom_bar(aes(reorder(event_concept.name, -absolute), absolute), stat = "identity", fill = "#0072B2") +
facet_grid(type ~ .) +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Activity") +
ylab("Number of selfloops and repetitions")
## Other descriptives
Other available descriptives and the supported analysis levels are listed below: