Event data selection

Gert Janssenswillen

2/12/2015

The goal of this vignette is to illustrate the different methods for data selection provided by the package edeaR. The data from municipality 1 of the BPI Challenge 2015 will be used as a running example. A preprocessed event log is available in the package under the name BPIC15_1.

data(BPIC15_1)

Select on activity frequency

In the vignette discussing the descriptives, it was already observed that only a limited fraction of the activities occur very often. Below, we repeat the cumulative distribution of the activity frequencies.

A horizontal line has been added at 75%. Recall that roughly 75% of the activities occur fewer than 100 times. To select only these activities, we use filter_activity_frequency as follows.

filtered_log <- filter_activity_frequency(BPIC15_1, percentile_cut_off = 0.25, reverse = TRUE)
## Warning in eventlog(output, activity_id = activity_id(eventlog), case_id
## = case_id(eventlog), : No resource identifier provided nor found. Set to
## default: NA
activities(filtered_log) %>% select(absolute_frequency) %>% summary
##  absolute_frequency
##  Min.   :  1       
##  1st Qu.:  3       
##  Median : 18       
##  Mean   :103       
##  3rd Qu.: 87       
##  Max.   :898

Note that combining a percentile cut-off of 25% with reverse equal to TRUE selects all but the 25% most frequent activities, i.e. the 75% least frequent activities. The summary above shows the absolute frequencies of the remaining activities: three quarters of them occur 87 times or fewer, and the maximum is 898.
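To make the frequency reasoning concrete, the following base R sketch applies the same idea to a small set of made-up activity counts (not taken from BPIC15_1). It reads the cut-off as a share of distinct activities, as the paragraph above describes; how edeaR computes its percentile internally is not shown here and may differ.

```r
# Hypothetical activity counts; these are NOT taken from BPIC15_1.
freq <- c(register = 900, check = 450, decide = 120, notify = 60,
          archive = 20, appeal = 8, correct = 3, withdraw = 1)

# Frequency below which 75% of the activities fall. This corresponds to a
# horizontal line at 0.75 in a cumulative distribution plot.
cut <- quantile(freq, probs = 0.75, names = FALSE)

# Keep the least frequent activities: those at or below the cut.
infrequent <- freq[freq <= cut]

names(infrequent)   # activities that remain
mean(freq <= cut)   # share of activities that remain
```

With only eight activities the cut falls between two observed counts, so exactly six of them are retained.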

Select on throughput time

The throughput time of the original event log is visualized in the graph below. It can be observed that most cases have a throughput time lower than roughly 100 days, while there are some outliers.

case_throughput <- throughput_time(BPIC15_1, "case")
ggplot(case_throughput) +
    geom_histogram(aes(throughput_time), fill = "#0072B2", binwidth = 10) +
    xlab("Duration (in days)") +
    ylab("Number of cases")

To discard the outliers with a throughput time greater than 500 days, we can use the filter throughput_time as follows.

# Keep only the cases with a throughput time between 0 and 500 days
filtered_log <- filter_throughput_time(BPIC15_1, lower_threshold = 0, upper_threshold = 500)
## Warning in eventlog(f_eventlog, activity_id = activity_id(eventlog),
## case_id = case_id(eventlog), : No resource identifier provided nor found.
## Set to default: NA
case_throughput <- throughput_time(filtered_log, "case")
ggplot(case_throughput) +
    geom_histogram(aes(throughput_time), fill = "#0072B2", binwidth = 10) +
    xlab("Duration (in days)") +
    ylab("Number of cases")

Alternatively, we could look only at the outliers and select, for instance, the 1% longest cases.

filtered_log <- filter_throughput_time(BPIC15_1, percentile_cut_off = 0.99, reverse = TRUE)
## Warning in eventlog(f_eventlog, activity_id = activity_id(eventlog),
## case_id = case_id(eventlog), : No resource identifier provided nor found.
## Set to default: NA
throughput_time(filtered_log, "case")
## # A tibble: 12 × 2
##    case_concept.name throughput_time
##                <chr>           <dbl>
## 1            2929114       1486.0000
## 2            3564895       1096.0000
## 3            3690553        998.0000
## 4            5121833        930.4084
## 5            4565008        918.9583
## 6            3388781        882.0000
## 7            3327777        849.4875
## 8            5173437        844.0000
## 9            3026543        676.5204
## 10           4602366        637.0000
## 11           3741124        629.6059
## 12           7079364        615.0000
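The two selections above can be mimicked on a toy log in base R. This is only a sketch under stated assumptions: the data frame `events` and its columns are hypothetical, throughput time is taken as the span between a case's first and last event, and the percentile rule shown is R's default `quantile`, which is not necessarily the rule edeaR applies.

```r
# Hypothetical toy log: one row per event, two events per case.
events <- data.frame(
  case = c("A", "A", "B", "B", "C", "C"),
  time = as.POSIXct(c("2011-01-01", "2011-02-10",
                      "2011-03-01", "2013-01-15",
                      "2012-05-01", "2012-05-20"), tz = "UTC")
)

# Throughput time per case: last event minus first event, in days.
secs  <- split(as.numeric(events$time), events$case)
first <- sapply(secs, min)
last  <- sapply(secs, max)
throughput_days <- (last - first) / (60 * 60 * 24)

# Threshold selection: keep the cases that took at most 500 days.
within_limit <- names(throughput_days)[throughput_days <= 500]
filtered <- events[events$case %in% within_limit, ]

# Percentile selection: keep only the cases at or above the 99th percentile.
threshold <- quantile(throughput_days, probs = 0.99, names = FALSE)
longest <- names(throughput_days)[throughput_days >= threshold]
```

Here `within_limit` contains the two short cases and `longest` contains the single long one, so the two selections are complementary on this toy log.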

Select on time period

Finally, let us select cases within a specific time period. To get an idea of the distribution of cases over time, the graph below shows the number of cases by start timestamp and by completion timestamp.

start <- BPIC15_1 %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- BPIC15_1 %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>% 
    ggplot() +
    geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*30) +
    facet_grid(type ~ .) +
    scale_fill_brewer(palette = "Dark2") +
    theme(legend.position = "none")

Contained

Suppose we are only interested in the first quarter of the year 2012. We can filter the data in different ways. Firstly, we can select the cases that both started and completed in this period.

library(lubridate)
a <- ymd_hms("20120101 00:00:00")
b <- ymd_hms("20120331 23:59:59")
filtered_log <- filter_time_period(BPIC15_1, a, b, "contained")
## Warning in eventlog(eventlog = f_eventlog, activity_id =
## activity_id(eventlog), : No resource identifier provided nor found. Set to
## default: NA
start <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%  
    ggplot() +
    geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*7) +
    facet_grid(type ~ .) +
    scale_fill_brewer(palette = "Dark2") +
    theme(legend.position = "none")

Started

Alternatively, we could select the cases that started in this period, or those that completed in it. Both options are shown in turn.

filtered_log <- filter_time_period(BPIC15_1, a, b, "start")
## Warning in eventlog(eventlog = f_eventlog, activity_id =
## activity_id(eventlog), : No resource identifier provided nor found. Set to
## default: NA
start <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%
    ggplot() +
    geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*7) +
    facet_grid(type ~ .) +
    scale_fill_brewer(palette = "Dark2") +
    theme(legend.position = "none")

Completed

filtered_log <- filter_time_period(BPIC15_1, a, b, "complete")
## Warning in eventlog(eventlog = f_eventlog, activity_id =
## activity_id(eventlog), : No resource identifier provided nor found. Set to
## default: NA
start <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%  
    ggplot() +
    geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*7) +
    facet_grid(type ~ .) +
    scale_fill_brewer(palette = "Dark2") +
    theme(legend.position = "none")

Intersected

Yet another option is to select the cases that intersect the time period, i.e. at least part of the case happened in the time period.

filtered_log <- filter_time_period(BPIC15_1, a, b, "intersecting")
## Warning in eventlog(eventlog = f_eventlog, activity_id =
## activity_id(eventlog), : No resource identifier provided nor found. Set to
## default: NA
start <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%
    ggplot() +
    geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*7) +
    facet_grid(type ~ .) +
    scale_fill_brewer(palette = "Dark2") +
    theme(legend.position = "none")

Trim

Finally, we can trim the cases to the time period, i.e. keep only the events that fall within it.

filtered_log <- filter_time_period(BPIC15_1, a, b, "trim")
start <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%
    ggplot() +
    geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*7) +
    facet_grid(type ~ .) +
    scale_fill_brewer(palette = "Dark2") +
    theme(legend.position = "none")
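The difference between the five methods is easiest to see side by side. The base R sketch below reproduces their logic on a hypothetical four-case log. It is not edeaR's implementation: the data frame, the inclusive boundaries, and the reading of "intersecting" as an overlap between the case's time span and the period are assumptions made for illustration.

```r
# Hypothetical toy log with two events per case.
events <- data.frame(
  case = c("A", "A", "B", "B", "C", "C", "D", "D"),
  time = as.POSIXct(c("2012-01-10", "2012-02-20",   # A: entirely inside
                      "2011-12-01", "2012-01-15",   # B: ends inside
                      "2012-03-01", "2012-05-01",   # C: starts inside
                      "2011-11-01", "2012-06-01"),  # D: spans the period
                    tz = "UTC")
)
a <- as.POSIXct("2012-01-01 00:00:00", tz = "UTC")
b <- as.POSIXct("2012-03-31 23:59:59", tz = "UTC")

# First and last event per case, as seconds since the epoch.
secs  <- split(as.numeric(events$time), events$case)
first <- sapply(secs, min)
last  <- sapply(secs, max)
na <- as.numeric(a)
nb <- as.numeric(b)
cases <- names(first)

contained    <- cases[first >= na & last <= nb]   # started and completed inside
started      <- cases[first >= na & first <= nb]  # started inside
completed    <- cases[last >= na & last <= nb]    # completed inside
intersecting <- cases[first <= nb & last >= na]   # time span overlaps the period

# Trim works on events rather than cases: drop every event outside the period.
trimmed <- events[events$time >= a & events$time <= b, ]
```

Case D shows why the choice matters: its time span overlaps the period, so it counts as intersecting, but it has no events inside the period and therefore disappears entirely when trimming.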

Other filters

The other filters provided by the package are listed below. Consult their help files for details on how they work.