This vignette aims to give a short intoduction on how to use this package to run tree-based scan statistics in R. Tree-based scan statistic is a data driven method for identifying clusters of events across a hierarchical tree. This could, e.g., be disease clusters across the ICD-10 tree. This method is commonly used in pharmacovigilance to identify potential side effects of drugs, using an agnostic approach. Before we get started, let’s just define some terminology:
Please check Kulldorff et
al. (2013) for a more detailed description of the method. Let’s load
the TreeMineR
package to get started.
The TreeMineR
package comes with the
diagnosis
data set which includes simulated diagnosis data
on some exposed and unexposed individuals. In order to use the
TreeMineR
function, we need to prepare the data in a
special format. Each row in our data set needs to correspond to one
diagnosis with information on which individual received this diagnosis
and whether the individual was exposed or unexposed. The diagnosis
dataset, is fortunately, already in the right format. Let’s have a
look:
diagnoses
#> id leaf exposed
#> <int> <char> <num>
#> 1: 1 K251 0
#> 2: 2 Q702 0
#> 3: 3 G96 0
#> 4: 3 S949 0
#> 5: 4 S951 0
#> ---
#> 22985: 999 V539 1
#> 22986: 999 V625 1
#> 22987: 999 G823 1
#> 22988: 1000 L42 1
#> 22989: 1000 T524 1
Besides our diagnosis data, we also need an object that defines our
hierarchical tree. This data frame includes a variable called
pathString
, which defines the full path for each leaf.
Let’s look at an example. The pathString
for the ICD-10
code B369 looks as follows: the code is part of the group B36, which in
turn is part of the block B35-B49, which in turn is part of the ICD-10
chapter 1, which in turn is part of the ICD-10-SE coding system. The
full path, hence, is ICD-10-SE/01/B35-B49/B36/B369
.
Important is, that each of the leafs included in the data also has one
row in the tree. E.g., if we have an observation with a diagnosis code
B36, we also need to add a row to the tree file which finishes with B36,
i.e., ICD-10-SE/01/B35-B49/B36
.
The ICD-10-SE, is already included in the TreeMineR package and can
be used out of the box. If you want to use another tree structure, take
a look at the create_tree
function, which makes it easy to
define new tree structures. Also the drop_cuts
function can
be useful if you want to remove some leafs from your tree. This is,
e.g., helpful in situations, where you a priori want to remove
some ICD-10 chapters from your analysis.
Let’s have a look at the first rows of the ICD-10-SE tree file:
head(icd_10_se)
#> pathString
#> 1 ICD-10-SE/01/A00-A09
#> 2 ICD-10-SE/01/A15-A19
#> 3 ICD-10-SE/01/A20-A28
#> 4 ICD-10-SE/01/A30-A49
#> 5 ICD-10-SE/01/A50-A64
#> 6 ICD-10-SE/01/A65-A69
Once we have the tree file defined and the data in the right format,
we can use the TreeMineR()
function to identify potential
event clusters. The TreeMineR()
function has an inbuild
parallelisation via the future package, which can be set up using the
future_control
argument.
Let’s do a test run:
TreeMineR(
data = diagnoses,
tree = icd_10_se,
p = 1/11,
n_exposed = 1000,
n_unexposed = 10000,
n_monte_carlo_sim = 20,
random_seed = 124,
future_control = list("sequential")
) |> head()
#> cut n1 n0 risk1 risk0 RR llr p
#> 1 12 122 669 0.122 0.0669 1.823617 16.18714 0.04761905
#> 2 11 132 782 0.132 0.0782 1.687980 13.65786 0.04761905
#> 3 V01-X59 241 1687 0.241 0.1687 1.428571 12.26816 0.04761905
#> 4 V01-V99 210 1438 0.210 0.1438 1.460362 11.95732 0.09523810
#> 5 15 133 822 0.133 0.0822 1.618005 11.79775 0.09523810
#> 6 19 306 2281 0.306 0.2281 1.341517 10.80614 0.09523810
In our data, we could identify three event clusters (Ch. 12, Ch. 11, V01-X59) which passed the p < 0.05 threshold, suggesting that exposed individuals have a higher risk of being diagnosed with these disease groups than unexposed individuals. Usually, you would like to increase the number of Monte Carlo simulation runs to increase the stability of the results.