| Type: | Package | 
| Title: | See a Forest for the Trees | 
| Version: | 3.4.0 | 
| Imports: | MASS, partykit, rpart, RColorBrewer, grDevices, gridExtra, ggplot2, cluster, parallel, foreach, igraph, stats, graphics, plyr, ranger, randomForest, methods, doParallel | 
| LazyData: | true | 
| Encoding: | UTF-8 | 
| Date: | 2023-06-21 | 
| Description: | Get insight into a forest of classification trees, by calculating similarities between the trees, and subsequently clustering them. Each cluster is represented by its most central cluster member. The package implements the methodology described in Sies & Van Mechelen (2020) <doi:10.1007/s00357-019-09350-4>. | 
| URL: | https://github.com/KULeuven-PPW-OKPIV/C443 | 
| BugReports: | https://github.com/KULeuven-PPW-OKPIV/C443/issues | 
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] | 
| RoxygenNote: | 7.2.3 | 
| NeedsCompilation: | no | 
| Packaged: | 2023-06-21 09:11:52 UTC; aniek | 
| Author: | Aniek Sies [aut, cre], Kristof Meers [ctb], Iven Van Mechelen [ths] | 
| Maintainer: | Aniek Sies <aniek.sies@kuleuven.be> | 
| Repository: | CRAN | 
| Date/Publication: | 2023-06-21 09:30:02 UTC | 
Clustering the classification trees in a forest based on similarities
Description
A function to get insight into a forest of classification trees by clustering the trees in the forest using Partitioning Around Medoids (PAM; Kaufman & Rousseeuw, 2009). The clustering is based either on user-provided similarities or on similarities calculated by the package using a similarity measure chosen by the user (see Sies & Van Mechelen, 2020).
Usage
clusterforest(
  observeddata,
  treedata = NULL,
  trees,
  simmatrix = NULL,
  m = NULL,
  tol = NULL,
  weight = NULL,
  fromclus = 1,
  toclus = 1,
  treecov = NULL,
  sameobs = FALSE,
  seed = NULL,
  no_cores = detectCores(logical = FALSE)
)
Arguments
| observeddata | The entire observed dataset | 
| treedata | A list of dataframes on which the trees are based. Not necessary if the data set is included in the tree object already. | 
| trees | A list of trees of class party, classes inheriting from party (e.g., glmtree), classes that can be coerced to party (i.e., rpart, Weka_tree, XMLnode), or a randomForest or ranger object. | 
| simmatrix | A similarity matrix with the similarities between all trees. Should be square, symmetric and have ones on the diagonal. Default=NULL | 
| m | Similarity measure that should be used to calculate similarities, in the case that no similarity matrix was provided by the user. Default=NULL. m=1 is based on counting common predictors; m=2 is based on counting common predictor-split point combinations; m=3 is based on common ordered sets of predictor-range part combinations (see Shannon & Banks, 1999); m=4 is based on the agreement of partitions implied by leaf membership (Chipman, 1998); m=5 is based on the agreement of partitions implied by class labels (Chipman, 1998); m=6 is based on the number of predictor occurrences in definitions of leaves with the same class label; m=7 is based on the number of predictor-split point combinations in definitions of leaves with the same class label; m=8 measures closeness to logical equivalence (applicable in case of binary predictors only). | 
| tol | A vector containing, for each predictor, a number that defines the tolerance zone within which two split points on that predictor are considered equal. For example, if the tolerance for predictor X is 1, then a split on X in tree A is considered equal to a split on X in tree B as long as the split point in tree B lies within the split point in tree A plus or minus 1. Only applicable for m=1 and m=6. Default=NULL | 
| weight | If 1, the number of dissimilar paths in the Shannon and Banks measure (m=2) is weighted by 1 divided by their length; otherwise, all paths are weighted equally. Only applicable for m=2. Default=NULL | 
| fromclus | The lowest number of clusters for which the PAM algorithm should be run. Default=1. | 
| toclus | The highest number of clusters for which the PAM algorithm should be run. Default=1. | 
| treecov | A vector/dataframe with the covariate value(s) for each tree in the forest (1 column per covariate) in the case of known sources of variation underlying the forest, that should be linked to the clustering solution. | 
| sameobs | Are the same observations included in every tree data set? For example, in the case of subsamples or bootstrap samples, the answer is no. Default=FALSE | 
| seed | A seed number that should be used for the multi start procedure (based on which initial medoids are assigned). Default=NULL. | 
| no_cores | Number of CPU cores used for computations. Default=detectCores(logical=FALSE) | 
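The requirements on a user-provided simmatrix (square, symmetric, ones on the diagonal) can be checked before calling clusterforest. The following is a minimal sketch in base R. The toy forest, the helper names jaccard and is_valid_simmatrix, and the use of the Jaccard index are all assumptions made for illustration; the Jaccard index is in the spirit of counting common predictors, but it is not claimed to be the formula the package uses for m=1.

```r
# Hypothetical forest: each tree is reduced to the set of predictors it splits on.
tree_predictors <- list(
  t1 = c("glu", "age"),
  t2 = c("glu", "bmi"),
  t3 = c("glu", "age", "bmi"),
  t4 = c("ped")
)

# Illustrative similarity (assumption): shared predictors divided by all
# predictors used by either tree (Jaccard index). Not C443's own formula.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

n <- length(tree_predictors)
sim <- matrix(
  1, n, n,
  dimnames = list(names(tree_predictors), names(tree_predictors))
)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- jaccard(tree_predictors[[i]], tree_predictors[[j]])
  }
}

# Check the three properties the simmatrix argument asks for.
is_valid_simmatrix <- function(s) {
  is.matrix(s) &&
    nrow(s) == ncol(s) &&
    isSymmetric(unname(s)) &&
    all(diag(s) == 1)
}

is_valid_simmatrix(sim)
```

A matrix that passes this check has the shape the simmatrix argument describes; it would be supplied together with the list of trees it was computed from.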
Details
The user should provide the number of clusters that the solution should contain, or a range of numbers that should be explored. In the latter case, the resulting clusterforest object will contain clustering results for each solution. On this clusterforest object, several methods, such as plot, print and summary, can be used.
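To illustrate what exploring a range of cluster numbers involves, the sketch below runs PAM for each number of clusters in a range on a toy similarity matrix. It uses pam() from the cluster package as a stand-in; clusterforest has its own multi-start procedure (see the seed argument), so this is not the package's implementation. The toy matrix and the variable names are assumptions.

```r
library(cluster)  # recommended package distributed with R

# Toy similarity matrix for six hypothetical trees forming two groups.
sim <- matrix(0.2, 6, 6)
sim[1:3, 1:3] <- 0.9
sim[4:6, 4:6] <- 0.8
diag(sim) <- 1

# PAM works on dissimilarities, so convert similarity s to 1 - s.
diss <- as.dist(1 - sim)

# Range of solutions to explore (the silhouette needs at least 2 clusters).
fromclus <- 2
toclus <- 3
solutions <- lapply(fromclus:toclus, function(k) pam(diss, k = k, diss = TRUE))
names(solutions) <- paste0("k", fromclus:toclus)

# Average silhouette width per solution.
avg_sil <- vapply(solutions, function(s) s$silinfo$avg.width, numeric(1))
avg_sil

# Cluster assignment of each tree in the two-cluster solution.
solutions$k2$clustering
```

In this toy case the two-cluster solution recovers the two groups, and comparing the average silhouette widths across solutions is one way to choose among them.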
Value
The function returns an object of class clusterforest, with attributes:
| medoids | the position of the medoid trees in the forest (i.e., which element of the list of partytrees) | 
| medoidtrees | the medoid trees | 
| clusters | The cluster to which each tree in the forest is assigned | 
| avgsilwidth | The average silhouette width for each solution (see Kaufman and Rousseeuw, 2009) | 
| accuracy | For each solution, the accuracy of the predicted class labels based on the medoids. | 
| agreement | For each solution, the agreement between the predicted class label for each observation based on the forest as a whole, and those based on the medoids only (see Sies & Van Mechelen, 2020) | 
| withinsim | Within cluster similarity for each solution (see Sies & Van Mechelen, 2020) | 
| treesimilarities | Similarity matrix on which clustering was based | 
| treecov | covariate value(s) for each tree in the forest | 
| seed | seed number that was used for the multi start procedure (based on which initial medoids were assigned) | 
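The accuracy and agreement attributes can be made concrete with a small base R sketch. The prediction matrix, the observed labels, the medoid positions and the helper weighted_vote are hypothetical, and weighting each medoid's vote by the size of its cluster is an assumption made here for illustration; the exact definitions are given in Sies & Van Mechelen (2020).

```r
# Hypothetical class predictions: rows are observations, columns are trees.
preds <- matrix(
  c("No",  "No",  "Yes", "No",  "No",
    "Yes", "Yes", "Yes", "No",  "Yes",
    "No",  "Yes", "Yes", "Yes", "No",
    "No",  "No",  "No",  "No",  "Yes",
    "Yes", "Yes", "No",  "Yes", "No"),
  ncol = 5, byrow = TRUE
)
observed <- c("No", "Yes", "Yes", "No", "No")

# Hypothetical three-cluster solution: cluster of each tree, and the
# position of the medoid tree of clusters 1, 2 and 3.
cluster_of_tree <- c(1, 1, 2, 2, 3)
medoids <- c(1, 3, 5)

# Label with the largest total weight (the toy data contain no ties).
weighted_vote <- function(labels, weights) {
  totals <- tapply(weights, labels, sum)
  names(totals)[which.max(totals)]
}

# Forest prediction: unweighted majority vote over all trees.
forest_pred <- apply(preds, 1, weighted_vote, weights = rep(1, ncol(preds)))

# Medoid prediction: each medoid votes with the size of its cluster (assumption).
cluster_size <- as.vector(table(cluster_of_tree))
medoid_pred <- apply(preds[, medoids, drop = FALSE], 1, weighted_vote,
                     weights = cluster_size)

accuracy <- mean(medoid_pred == observed)      # medoids versus observed labels
agreement <- mean(medoid_pred == forest_pred)  # medoids versus whole forest
c(accuracy = accuracy, agreement = agreement)
```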
References
Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis (Vol. 344). John Wiley & Sons.
Sies, A. & Van Mechelen I. (2020). C443: An R-package to see a forest for the trees. Journal of Classification.
Shannon, W. D., & Banks, D. (1999). Combining classification trees using MLE. Statistics in medicine, 18(6), 727-740.
Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Making sense of a forest of trees. Computing Science and Statistics, 84-92.
Examples
require(MASS)
require(ranger)
require(rpart)
#Function to draw a bootstrap sample from a dataset
DrawBoots <- function(dataset, i){
set.seed(2394 + i)
Boot <- dataset[sample(1:nrow(dataset), size = nrow(dataset), replace = TRUE),]
return(Boot)
}
#Function to grow a tree using rpart on a dataset
GrowTree <- function(x,y,BootsSample, minsplit = 40, minbucket = 20, maxdepth =3){
 controlrpart <- rpart.control(minsplit = minsplit, minbucket = minbucket, maxdepth = maxdepth,
  maxsurrogate = 0, maxcompete = 0)
 tree <- rpart(as.formula(paste(noquote(paste(y, "~")), noquote(paste(x, collapse="+")))),
 data = BootsSample, control = controlrpart)
 return(tree)
}
#Use functions to draw 10 bootstrap samples and grow a tree on each sample
Boots<- lapply(1:10, function(k) DrawBoots(Pima.tr ,k))
Trees <- lapply(1:10, function (i) GrowTree(x=c("npreg", "glu",  "bp",  "skin",
"bmi", "ped", "age"), y="type", Boots[[i]] ))
#Clustering the trees in this forest
ClusterForest<- clusterforest(observeddata=Pima.tr,treedata=Boots,trees=Trees,m=1,
fromclus=1, toclus=2, sameobs=FALSE, no_cores=2)
#Example RandomForest
Pima.tr.ranger <- ranger(type ~ ., data = Pima.tr, keep.inbag = TRUE, num.trees=20,
max.depth=3)
ClusterForest<- clusterforest(observeddata=Pima.tr,trees=Pima.tr.ranger,m=5,
                           fromclus=1, toclus=2, sameobs=FALSE, no_cores=2)
Get the cluster assignments for a solution of a clusterforest object
Description
A function to get the cluster assignments for a given solution of a clusterforest object.
Usage
clusters(clusterforest, solution)
Arguments
| clusterforest | A clusterforest object | 
| solution | The solution for which cluster assignments should be returned. Default = 1 | 
Get the cluster assignments for a solution of a clusterforest object
Description
A function to get the cluster assignments for a given solution of a clusterforest object.
Usage
## S3 method for class 'clusterforest'
clusters(clusterforest, solution = 1)
Arguments
| clusterforest | The clusterforest object | 
| solution | The solution for which cluster assignments should be returned. Default = 1 | 
Get the cluster assignments for a solution of a clusterforest object
Description
A function to get the cluster assignments for a given solution of a clusterforest object.
Usage
## Default S3 method:
clusters(clusterforest, solution)
Arguments
| clusterforest | The clusterforest object | 
| solution | The solution for which cluster assignments should be returned. | 
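The three entries above (generic, method for class clusterforest, default method) follow R's S3 dispatch pattern. The toy sketch below shows that pattern with deliberately different, hypothetical names (get_clusters, class toyforest, and an invented field layout); it is not the package's code and says nothing about how a real clusterforest object is stored.

```r
# Toy stand-in object; the class name and field layout are hypothetical.
toy <- structure(
  list(clusters = list(c(1, 1, 1, 1), c(1, 1, 2, 2))),
  class = "toyforest"
)

# Generic: dispatches on the class of its first argument.
get_clusters <- function(x, solution = 1) UseMethod("get_clusters")

# Method for the toy class: return the assignments of the requested solution.
get_clusters.toyforest <- function(x, solution = 1) x$clusters[[solution]]

# Default method: reached for any object without a dedicated method.
get_clusters.default <- function(x, solution = 1) {
  stop("get_clusters() expects an object of class 'toyforest'")
}

get_clusters(toy, solution = 2)
```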
Drug consumption data set
Description
A dataset collected by Fehrman et al. (2017), freely available on the UCI Machine Learning Repository (Lichman, 2013), containing records of 1885 respondents regarding their use of 18 types of drugs, and their measurements on 12 predictors. All predictors were originally categorical and were quantified by Fehrman et al. (2017). The meaning of the values can be found on https://archive.ics.uci.edu/dataset/373/drug+consumption+quantified. The original response categories for each drug were: never used the drug, used it over a decade ago, or in the last decade, year, month, week, or day. We transformed these into binary response categories, where 0 (non-user) consists of the categories never used the drug and used it over a decade ago, and 1 (user) consists of all other categories.
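The binary recoding described above can be sketched in base R as follows. The category codes CL0 to CL6 are an assumption about how the seven original response categories are labelled in the raw UCI file (CL0 for never used, CL1 for used over a decade ago, and so on up to CL6 for used in the last day); the vector raw_use is made-up example input, not data from the package.

```r
# Hypothetical raw responses for one drug (codes assumed, see lead-in).
raw_use <- c("CL0", "CL3", "CL1", "CL6", "CL2", "CL0")

# Non-users: never used (CL0) or used over a decade ago (CL1).
non_user_levels <- c("CL0", "CL1")

# 0 = non-user, 1 = user (any other category).
user <- as.integer(!(raw_use %in% non_user_levels))
user
```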
Usage
drugs
Format
A data frame with 1885 rows and 32 variables:
| ID | Respondent ID | 
| Age | Age of respondent | 
| Gender | Gender of respondent, where 0.48 denotes female and -0.48 denotes male | 
| Edu | Level of education of participant | 
| Country | Country of current residence of participant | 
| Ethn | Ethnicity of participant | 
| Neuro | NEO-FFI-R Neuroticism score | 
| Extr | NEO-FFI-R Extraversion score | 
| Open | NEO-FFI-R Openness to experience score | 
| Agree | NEO-FFI-R Agreeableness score | 
| Consc | NEO-FFI-R Conscientiousness score | 
| Impul | Impulsiveness score measured by BIS-11 | 
| Sensat | Sensation seeking score measured by ImpSS | 
| Alc | Alcohol user (1) or non-user (0) | 
| Amphet | Amphetamine user (1) or non-user (0) | 
| Amyl | Amyl nitrite user (1) or non-user (0) | 
| Benzos | Benzodiazepine user (1) or non-user (0) | 
| Caff | Caffeine user (1) or non-user (0) | 
| Can | Cannabis user (1) or non-user (0) | 
| Choco | Chocolate user (1) or non-user (0) | 
| Coke | Cocaine user (1) or non-user (0) | 
| Crack | Crack user (1) or non-user (0) | 
| Ecst | Ecstasy user (1) or non-user (0) | 
| Her | Heroin user (1) or non-user (0) | 
| Ket | Ketamine user (1) or non-user (0) | 
| Leghighs | Legal highs user (1) or non-user (0) | 
| LSD | LSD user (1) or non-user (0) | 
| Meth | Methadone user (1) or non-user (0) | 
| Mush | Magic mushroom user (1) or non-user (0) | 
| Nico | Nicotine user (1) or non-user (0) | 
| Semeron | Semeron user (1) or non-user (0), fictitious drug to identify over-claimers | 
| VSA | Volatile substance abuse user (1) or non-user (0) | 
Source
https://archive.ics.uci.edu/dataset/373/drug+consumption+quantified
References
Fehrman, E., Muhammad, A. K., Mirkes, E. M., Egan, V., & Gorban, A. N. (2017). The Five Factor Model of personality and evaluation of drug consumption risk. In Data Science (pp. 231-242). Springer, Cham.
Lichman, M. (2013). UCI machine learning repository.
Get the medoid trees for a solution of a clusterforest object
Description
A function to get the medoid trees for a given solution of a clusterforest object.
Usage
medoidtrees(clusterforest, solution)
Arguments
| clusterforest | A clusterforest object | 
| solution | The solution for which medoid trees should be returned. Default = 1 | 
Get the medoid trees for a solution of a clusterforest object
Description
A function to get the medoid trees for a given solution of a clusterforest object.
Usage
## S3 method for class 'clusterforest'
medoidtrees(clusterforest, solution = 1)
Arguments
| clusterforest | A clusterforest object | 
| solution | The solution for which medoid trees should be returned. Default = 1 | 
Get the medoid trees for a solution of a clusterforest object
Description
A function to get the medoid trees for a given solution of a clusterforest object.
Usage
## Default S3 method:
medoidtrees(clusterforest, solution)
Arguments
| clusterforest | A clusterforest object | 
| solution | The solution for which medoid trees should be returned. Default = 1 | 
Plot a clusterforest object
Description
A function that can be used to plot a clusterforest object, either by returning plots with information such as average silhouette width and within cluster similarity on the cluster solutions, or plots of the medoid trees of each solution.
Usage
## S3 method for class 'clusterforest'
plot(x, solution = NULL, predictive_plots = FALSE, ...)
Arguments
| x | A clusterforest object | 
| solution | The solution to plot the medoid trees from. If NULL, plots with the average silhouette width, within cluster similarity (and predictive accuracy) per solution are returned. Default = NULL | 
| predictive_plots | Indicating whether predictive plots should be returned: A plot showing the predictive accuracy when making predictions based on the medoid trees, and a plot of the agreement between the class label for each object predicted on the basis of the random forest as a whole versus based on the medoid trees. Default = FALSE. | 
| ... | Additional arguments that can be used in generic plot function, or in plot.party. | 
Details
This function can be used to plot a clusterforest object in two ways. If it's used without specifying a solution, then the average silhouette width, and within cluster similarity measures are plotted for each solution. If additionally, predictive_plots=TRUE, two more plots are returned, namely a plot showing for each solution the predictive accuracy when making predictions based on the medoid trees, and a plot showing for each solution the agreement between the class label for each object predicted on the basis of the random forest as a whole versus based on the medoid trees. These plots may be helpful in deciding how many clusters are needed to summarize the forest (see Sies & Van Mechelen, 2020).
If the function is used with the clusterforest object and the number of the solution, then the medoid tree(s) of that solution are plotted.
References
Sies, A. & Van Mechelen I. (2020). C443: An R-package to see a forest for the trees. Journal of Classification.
Examples
require(MASS)
require(rpart)
#Function to draw a bootstrap sample from a dataset
DrawBoots <- function(dataset, i){
set.seed(2394 + i)
Boot <- dataset[sample(1:nrow(dataset), size = nrow(dataset), replace = TRUE),]
return(Boot)
}
#Function to grow a tree using rpart on a dataset
GrowTree <- function(x,y,BootsSample, minsplit = 40, minbucket = 20, maxdepth =3){
 controlrpart <- rpart.control(minsplit = minsplit, minbucket = minbucket,
 maxdepth = maxdepth, maxsurrogate = 0, maxcompete = 0)
 tree <- rpart(as.formula(paste(noquote(paste(y, "~")),
 noquote(paste(x, collapse="+")))), data = BootsSample,
 control = controlrpart)
 return(tree)
}
#Use functions to draw 10 bootstrap samples and grow a tree on each sample
Boots<- lapply(1:10, function(k) DrawBoots(Pima.tr ,k))
Trees <- lapply(1:10, function (i) GrowTree(x=c("npreg", "glu",  "bp",
 "skin",  "bmi", "ped", "age"), y="type",
Boots[[i]] ))
ClusterForest<- clusterforest(observeddata=Pima.tr,treedata=Boots,trees=Trees,m=1,
fromclus=1, toclus=5, sameobs=FALSE, no_cores=2)
plot(ClusterForest)
plot(ClusterForest,2)
Print a clusterforest object
Description
A function that can be used to print a clusterforest object.
Usage
## S3 method for class 'clusterforest'
print(x, solution = 1, ...)
Arguments
| x | A clusterforest object | 
| solution | The solution to print the medoid trees from. Default = 1 | 
| ... | Additional arguments that can be used in the generic print function. | 
Summarize a clusterforest object
Description
A function to summarize a clusterforest object.
Usage
## S3 method for class 'clusterforest'
summary(object, ...)
Arguments
| object | A clusterforest object | 
| ... | Additional arguments that can be used in the generic summary function. | 
Get the similarity matrix that was used to create a clusterforest object
Description
A function to get the similarity matrix used to obtain a clusterforest object.
Usage
treesimilarities(clusterforest)
Arguments
| clusterforest | A clusterforest object | 
Get the similarity matrix that was used to create a clusterforest object
Description
A function to get the similarity matrix used to obtain a clusterforest object.
Usage
## S3 method for class 'clusterforest'
treesimilarities(clusterforest)
Arguments
| clusterforest | A clusterforest object | 
Get the similarity matrix that was used to create a clusterforest object
Description
A function to get the similarity matrix used to obtain a clusterforest object.
Usage
## Default S3 method:
treesimilarities(clusterforest)
Arguments
| clusterforest | A clusterforest object | 
Mapping the tree clustering solution to a known source of variation underlying the forest
Description
A function that can be used to get insight into a clusterforest solution, in the case that there are known sources of variation underlying the forest. These known sources of variation must be included in the clusterforest object (and thus must be defined when running the clusterforest function). In case of a categorical covariate, it visualizes the number of trees from each value of the covariate that belong to each cluster. In case of a continuous covariate, it returns the mean and standard deviation of the covariate in each cluster.
Usage
treesource(clusterforest, solution)
Arguments
| clusterforest | The clusterforest object, including the treecov attribute. | 
| solution | The solution for which the clusters should be linked to the covariate(s). | 
Value
| multiplot | In case of categorical covariate, for each value of the covariate, a bar plot with the number of trees that belong to each cluster | 
| heatmap | In case of a categorical covariate, a heatmap with for each value of the covariate, the number of trees that belong to each cluster | 
| clustermeans | In case of a continuous covariate, the mean of the covariate in each cluster | 
| clusterstds | In case of a continuous covariate, the standard deviation of the covariate in each cluster | 
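The quantities behind these return values can be sketched in base R: a cross-tabulation of covariate value by cluster for a categorical covariate, and per-cluster means and standard deviations for a continuous one. The cluster assignments and covariate values below are made up for illustration, and the code only mirrors the idea described here; it is not the package's implementation and produces no plots.

```r
# Hypothetical two-cluster solution for a forest of ten trees.
cluster_of_tree <- c(1, 1, 1, 2, 1, 2, 2, 2, 2, 1)

# Categorical covariate: which outcome each tree was grown for.
source_of_tree <- rep(c("Amphet", "Coke"), each = 5)
counts <- table(source = source_of_tree, cluster = cluster_of_tree)
counts

# Continuous covariate (made-up values, one per tree).
tree_cov <- c(10, 12, 11, 20, 9, 22, 21, 19, 20, 13)
cluster_means <- tapply(tree_cov, cluster_of_tree, mean)
cluster_sds <- tapply(tree_cov, cluster_of_tree, sd)
rbind(mean = cluster_means, sd = cluster_sds)
```

In this toy forest most Amphet trees fall in cluster 1 and most Coke trees in cluster 2, which is the kind of pattern the bar plots and heatmap are meant to reveal.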
Examples
require(rpart)
data_Amphet <-drugs[,c ("Amphet","Age", "Gender", "Edu", "Neuro", "Extr", "Open", "Agree",
"Consc", "Impul","Sensat")]
data_cocaine <-drugs[,c ("Coke","Age", "Gender", "Edu", "Neuro", "Extr", "Open", "Agree",
                         "Consc", "Impul","Sensat")]
#Function to draw a bootstrap sample from a dataset
DrawBoots <- function(dataset, i){
set.seed(2394 + i)
Boot <- dataset[sample(1:nrow(dataset), size = nrow(dataset), replace = TRUE),]
return(Boot)
}
#Function to grow a tree using rpart on a dataset
GrowTree <- function(x,y,BootsSample, minsplit = 40, minbucket = 20, maxdepth =3){
 controlrpart <- rpart.control(minsplit = minsplit, minbucket = minbucket, maxdepth = maxdepth,
 maxsurrogate = 0, maxcompete = 0)
 tree <- rpart(as.formula(paste(noquote(paste(y, "~")), noquote(paste(x, collapse="+")))),
  data = BootsSample, control = controlrpart)
 return(tree)
}
#Draw bootstrap samples and grow trees
BootsA<- lapply(1:5, function(k) DrawBoots(data_Amphet,k))
BootsC<- lapply(1:5, function(k) DrawBoots(data_cocaine,k))
Boots = c(BootsA,BootsC)
TreesA <- lapply(1:5, function (i) GrowTree(x=c ("Age", "Gender", "Edu", "Neuro",
"Extr", "Open", "Agree","Consc", "Impul","Sensat"), y="Amphet", BootsA[[i]] ))
TreesC <- lapply(1:5, function (i) GrowTree(x=c ( "Age", "Gender", "Edu", "Neuro",
"Extr", "Open", "Agree", "Consc", "Impul","Sensat"), y="Coke", BootsC[[i]] ))
Trees=c(TreesA,TreesC)
#Cluster the trees
ClusterForest<- clusterforest(observeddata=drugs,treedata=Boots,trees=Trees,m=1,
fromclus=2, toclus=2, treecov=rep(c("Amphet","Coke"),each=5), sameobs=FALSE, no_cores=2)
#Link cluster result to known source of variation
treesource(ClusterForest, 2)
Mapping the tree clustering solution to a known source of variation underlying the forest
Description
A function that can be used to get insight into a clusterforest solution, in the case that there is a known source of variation underlying the forest. It visualizes the number of trees from each source that belong to each cluster.
Usage
## S3 method for class 'clusterforest'
treesource(clusterforest, solution)
Arguments
| clusterforest | The clusterforest object | 
| solution | The solution for which the clusters should be linked to the known source of variation. | 
Mapping the tree clustering solution to a known source of variation underlying the forest
Description
A function that can be used to get insight into a clusterforest solution, in the case that there is a known source of variation underlying the forest. It visualizes the number of trees from each source that belong to each cluster.
Usage
## Default S3 method:
treesource(clusterforest, solution)
Arguments
| clusterforest | The clusterforest object | 
| solution | The solution for which the clusters should be linked to the known source of variation. |