The goal of this vignette is to illustrate the different methods for performing data selection provided by the package edeaR. The data from municipality 1 of the BPI Challenge 2015 will be used as a running example. A preprocessed event log is available in the package under the name BPIC15_1.
data(BPIC15_1)
In the vignette discussing the descriptives, it was already observed that only a limited fraction of the activities occurs very often. Below, we repeat the cumulative distribution of the activity frequencies.
A horizontal line has been added at 75%. Recall that 75% of the activities occur fewer than approximately 100 times each. To select only these activities, we use the filter activity_frequency, as follows.
filtered_log <- filter_activity_frequency(BPIC15_1, percentile_cut_off = 0.25, reverse = TRUE)
activities(filtered_log) %>% select(absolute_frequency) %>% summary
## absolute_frequency
## Min. : 1.00
## 1st Qu.: 2.00
## Median : 9.00
## Mean : 19.58
## 3rd Qu.: 26.75
## Max. :105.00
Note that the combination of a percentile cut-off of 25% and reverse equal to TRUE will select all but the 25% most frequent activities, i.e. the 75% least frequent activities. It can be seen that the remaining activities have an absolute frequency of 105 or less.
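To make the effect of the reversed cut-off concrete, the following base R sketch applies the same idea to a handful of made-up activity frequencies. It follows the reading given in the text (drop the 25% most frequent activities, keep the rest) and is not the implementation used by edeaR; the vector `freq` and its values are hypothetical, and the exact definition of the cut-off should be checked in the help file of filter_activity_frequency.

```r
# Hypothetical activity frequencies (made-up numbers, not taken from BPIC15_1)
freq <- c(A = 500, B = 300, C = 120, D = 40, E = 25, F = 10, G = 4, H = 1)

# Rank the activities from most to least frequent
ranked <- sort(freq, decreasing = TRUE)

# Number of activities that make up the 25% most frequent ones
n_top <- ceiling(0.25 * length(ranked))

# "Reverse" selection: drop the top 25% and keep the remaining 75%
least_frequent <- ranked[-seq_len(n_top)]
least_frequent
```

With eight activities, the two most frequent ones (A and B) are dropped and the six least frequent ones remain.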
The throughput time of the original event log is visualized in the graph below. It can be observed that most cases have a throughput time lower than circa 100 days, while there are some outliers.
case_throughput <- throughput_time(BPIC15_1, "case")
ggplot(case_throughput) +
  geom_histogram(aes(throughput_time), fill = "#0072B2", binwidth = 10) +
  xlab("Duration (in days)") +
  ylab("Number of cases")
To discard the outliers with a throughput time greater than 500 days, we can use the filter throughput_time as follows.
filtered_log <- filter_throughput_time(BPIC15_1, lower_threshold = 0, upper_threshold = 500)
case_throughput <- throughput_time(filtered_log, "case")
ggplot(case_throughput) +
  geom_histogram(aes(throughput_time), fill = "#0072B2", binwidth = 10) +
  xlab("Duration (in days)") +
  ylab("Number of cases")
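As an illustration of what a threshold filter on throughput time amounts to, the sketch below computes the throughput time per case on a tiny, made-up event log and keeps the cases between 0 and 500 days. It uses base R only; the data frame `events` and its column names are hypothetical, and the code is a simplified stand-in rather than the logic inside filter_throughput_time.

```r
# Hypothetical toy event log: one row per event (column names are illustrative)
events <- data.frame(
  case_id = c("c1", "c1", "c2", "c2", "c3", "c3"),
  timestamp = as.POSIXct(c("2011-01-01", "2011-03-01",
                           "2011-02-01", "2013-01-15",
                           "2012-05-01", "2012-05-20"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Throughput time per case: last timestamp minus first timestamp, in days
secs <- as.numeric(events$timestamp)  # seconds since the epoch
span <- tapply(secs, events$case_id, function(s) max(s) - min(s))
throughput_days <- span / (60 * 60 * 24)

# Keep the cases whose throughput time lies between the two thresholds
keep <- names(throughput_days)[throughput_days >= 0 & throughput_days <= 500]
filtered <- events[events$case_id %in% keep, ]
filtered
```

In this toy log, case c2 runs for roughly two years and is discarded, while c1 and c3 are kept with all their events.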
Alternatively, we could look only at the outliers and select, for instance, the 1% longest cases.
filtered_log <- filter_throughput_time(BPIC15_1, percentile_cut_off = 0.99, reverse = TRUE)
throughput_time(filtered_log, "case")
## Source: local data frame [12 x 2]
##
## case_concept.name throughput_time
## (chr) (dbl)
## 1 7079364 615.0000
## 2 3741124 629.6059
## 3 4602366 637.0000
## 4 3026543 676.5204
## 5 5173437 844.0000
## 6 3327777 849.4875
## 7 3388781 882.0000
## 8 4565008 918.9583
## 9 5121833 930.4084
## 10 3690553 998.0000
## 11 3564895 1096.0000
## 12 2929114 1486.0000
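The percentile-based selection can be mimicked with the base R function quantile. The sketch below is only an approximation of the idea, run on simulated throughput times rather than on BPIC15_1; the vector `throughput_days` is hypothetical, and the package may handle ties and the boundary value differently.

```r
# Simulated throughput times in days for 200 hypothetical cases
set.seed(1)
throughput_days <- rexp(200, rate = 1 / 100)

# Value below which 99% of the cases fall
cut_off <- quantile(throughput_days, probs = 0.99)

# The "reverse" selection keeps the cases above that value
longest <- throughput_days[throughput_days > cut_off]
length(longest)
```

With 200 simulated cases, two of them lie above the 99th percentile, which corresponds to the 1% longest cases.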
Finally, let us select cases in a specific time period. To get an idea of the distribution of cases over time, the graph below shows the number of cases according to their start timestamp and complete timestamp.
start <- BPIC15_1 %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- BPIC15_1 %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%
  ggplot() +
  geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*30) +
  facet_grid(type ~ .) +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none")
#### Contained
Suppose we are only interested in the first quarter of the year 2012. We can filter the data in different ways. First, we can select the cases that both started and completed in this period.
library(lubridate)
a <- ymd("20120101")
b <- ymd("20120331")
filtered_log <- filter_time_period(BPIC15_1, a, b, "contained")
start <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%
  ggplot() +
  geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*7) +
  facet_grid(type ~ .) +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none")
#### Started
Alternatively, we could select the cases that started or completed in this period, respectively.
filtered_log <- filter_time_period(BPIC15_1, a, b, "start")
start <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%
  ggplot() +
  geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*7) +
  facet_grid(type ~ .) +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none")
#### Completed
filtered_log <- filter_time_period(BPIC15_1, a, b, "complete")
start <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%
  ggplot() +
  geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*7) +
  facet_grid(type ~ .) +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none")
#### Intersecting
Still another option is to select the cases that intersect the time period, i.e. at least part of the case happened in the time period.
filtered_log <- filter_time_period(BPIC15_1, a, b, "intersecting")
start <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%
  ggplot() +
  geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*7) +
  facet_grid(type ~ .) +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none")
#### Trim
Finally, we can trim the cases to the time period, so that only the events that fall within the period are kept.
filtered_log <- filter_time_period(BPIC15_1, a, b, "trim")
start <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = min(event_time.timestamp)) %>% mutate(type = "start")
complete <- filtered_log %>% group_by(case_concept.name) %>% summarize(timestamp = max(event_time.timestamp)) %>% mutate(type = "end")
bind_rows(start, complete) %>%
  ggplot() +
  geom_histogram(aes(timestamp, fill = type), binwidth = 60*60*24*7) +
  facet_grid(type ~ .) +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none")
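The difference between the time period methods is easiest to see on a few hand-made cases. The base R sketch below is an interpretation of the five options as they are described in this vignette, not the code behind filter_time_period. The data frames `cases` and `events`, their column names, and the helper `in_period` are hypothetical, and treating both ends of the period as inclusive is an assumption.

```r
# Hypothetical cases with the dates of their first and last event
# (made-up data; column names are illustrative, not those of BPIC15_1)
cases <- data.frame(
  case_id = c("before", "spans_start", "inside", "spans_end", "after"),
  start = as.Date(c("2011-10-01", "2011-12-15", "2012-02-01",
                    "2012-03-20", "2012-05-01")),
  end = as.Date(c("2011-11-01", "2012-01-20", "2012-02-20",
                  "2012-04-10", "2012-06-01")),
  stringsAsFactors = FALSE
)

# Period of interest; treating both bounds as inclusive is an assumption
a <- as.Date("2012-01-01")
b <- as.Date("2012-03-31")
in_period <- function(x) x >= a & x <= b

# Case-level selections
contained    <- cases$case_id[in_period(cases$start) & in_period(cases$end)]
started      <- cases$case_id[in_period(cases$start)]
completed    <- cases$case_id[in_period(cases$end)]
intersecting <- cases$case_id[cases$start <= b & cases$end >= a]

# Trimming acts on events rather than on whole cases:
# keep only the events whose timestamp falls inside the period
events <- data.frame(
  case_id = c("spans_start", "spans_start", "spans_start"),
  timestamp = as.Date(c("2011-12-15", "2012-01-05", "2012-01-20")),
  stringsAsFactors = FALSE
)
trimmed <- events[in_period(events$timestamp), ]
```

Here only `inside` is contained in the period, `inside` and `spans_end` started in it, `spans_start` and `inside` completed in it, and all three of these intersect it. Trimming `spans_start` keeps its two events from January 2012 and drops the one from December 2011.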
Other filters provided by the package are listed below. Consult their help files for details on how they work.