This function is at the heart of the FFTrees package. It takes a training dataset as an argument and generates several FFTs (more details about the algorithms are coming soon…).
Let’s start with an example: we’ll create FFTs fitted to the heartdisease dataset. Here’s how the dataset looks:
head(heartdisease)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0 6
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3 3
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2 7
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0 3
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0 3
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0 3
## diagnosis
## 1 0
## 2 1
## 3 1
## 4 0
## 5 0
## 6 0
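Before building any trees, it can help to check the size of the data and the base rate of the criterion. A quick sketch (the second line assumes diagnosis is coded 0/1, as the output above suggests):
nrow(heartdisease)             # number of cases
mean(heartdisease$diagnosis)   # base rate of positive diagnoses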
We’ll create a new fft object called heart.fft using the fft() function. We’ll set the criterion to heartdisease$diagnosis and use all other columns (heartdisease[,names(heartdisease) != "diagnosis"]) as potential predictors. Additionally, we’ll define two parameters:
train.p = .5: Train the trees on a random sample of 50% of the original training dataset, and test the trees on the remaining 50%.
max.levels = 4: The maximum number of levels (i.e., cues) the trees will consider is 4. Because each of the max.levels - 1 levels can have two exit structures, this will lead to \(2^{3} = 8\) possible trees.
set.seed(100) # For reproducibility
heart.fft <- fft(
  train.cue.df = heartdisease[,names(heartdisease) != "diagnosis"],
  train.criterion.v = heartdisease$diagnosis,
  train.p = .5,
  max.levels = 4
)
As you can see, fft() returns an object of class fft:
class(heart.fft)
## [1] "fft"
There are many elements in an fft object:
names(heart.fft)
## [1] "trees" "cue.accuracies" "cart"
## [4] "lr" "train.cue" "train.crit"
## [7] "test.cue" "test.crit" "train.decision.df"
## [10] "test.decision.df" "train.levelout.df" "test.levelout.df"
## [13] "best.train.tree" "best.test.tree"
The cue.accuracies dataframe contains the original, marginal cue accuracies. That is, for each cue, the threshold that maximizes v (HR - FAR) is chosen (this is done using the cuerank() function):
heart.fft$cue.accuracies
## cue.name cue.class level.threshold level.sigdirection hi mi fa cr
## 12 age numeric 57 >= 44 22 26 59
## 2 sex numeric 1 >= 53 13 48 37
## 4 cp numeric 4 >= 48 18 18 67
## 9 trestbps numeric 139 >= 26 40 21 64
## 7 chol numeric 218 >= 51 15 56 29
## 21 fbs numeric 1 >= 10 56 9 76
## 1 restecg numeric 0 > 40 26 35 50
## 121 thalach numeric 154 < 43 23 27 58
## 11 exang numeric 0 > 31 35 14 71
## 22 oldpeak numeric 1 >= 41 25 21 64
## 13 slope numeric 1 > 45 21 27 58
## 14 ca numeric 0 > 47 19 19 66
## 15 thal numeric 3 > 47 19 16 69
## hr far v dprime correction hr.weight
## 12 0.6666667 0.3058824 0.3607843 0.4691417 0.25 0.5
## 2 0.8030303 0.5647059 0.2383244 0.3447918 0.25 0.5
## 4 0.7272727 0.2117647 0.5155080 0.7024492 0.25 0.5
## 9 0.3939394 0.2470588 0.1468806 0.2073541 0.25 0.5
## 7 0.7727273 0.6588235 0.1139037 0.1693021 0.25 0.5
## 21 0.1515152 0.1058824 0.0456328 0.1093854 0.25 0.5
## 1 0.6060606 0.4117647 0.1942959 0.2460370 0.25 0.5
## 121 0.6515152 0.3176471 0.3338681 0.4318515 0.25 0.5
## 11 0.4696970 0.1647059 0.3049911 0.4496339 0.25 0.5
## 22 0.6212121 0.2470588 0.3741533 0.4962201 0.25 0.5
## 13 0.6818182 0.3176471 0.3641711 0.4735389 0.25 0.5
## 14 0.7121212 0.2235294 0.4885918 0.6599599 0.25 0.5
## 15 0.7121212 0.1882353 0.5238859 0.7220052 0.25 0.5
Here, we can see that the thal
cue had the highest v value of 0.5239 while cp
had the second highest v value of 0.5155.
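To make the v statistic concrete, here is a minimal sketch that recomputes HR, FAR, and v for the thal cue from the hi, mi, fa, and cr counts shown above:
hi <- 47; mi <- 19; fa <- 16; cr <- 69   # counts from the thal row
hr <- hi / (hi + mi)     # hit rate: 47 / 66 = 0.712
far <- fa / (fa + cr)    # false alarm rate: 16 / 85 = 0.188
hr - far                 # v = 0.524, matching the thal row above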
The trees
dataframe contains all tree definitions and training (and possibly test) statistics for all (\(2^{max.levels - 1}\)) trees. For our heart.fft
example, there are \(2^{4 - 1} = 8\) trees.
Tree definitions (exit directions, cue order, and cue thresholds) are contained in columns 1 through 6:
heart.fft$trees[,1:6] # Tree info are in columns 1:6
## tree.num level.name level.class level.exit
## 1 1 thal;cp;ca;oldpeak numeric;numeric;numeric;numeric 0;0;0;0.5
## 2 2 thal;cp;ca numeric;numeric;numeric 1;0;0.5
## 3 3 thal;cp;ca numeric;numeric;numeric 0;1;0.5
## 4 4 thal;cp;ca;oldpeak numeric;numeric;numeric;numeric 1;1;0;0.5
## 5 5 thal;cp;ca numeric;numeric;numeric 0;0;0.5
## 6 6 thal;cp;ca numeric;numeric;numeric 1;0;0.5
## 7 7 thal;cp;ca numeric;numeric;numeric 0;1;0.5
## 8 8 thal;cp;ca;oldpeak numeric;numeric;numeric;numeric 1;1;1;0.5
## level.threshold level.sigdirection
## 1 3;4;0;1 >;>=;>;>=
## 2 3;4;0 >;>=;>
## 3 3;4;0 >;>=;>
## 4 3;4;0;1 >;>=;>;>=
## 5 3;4;0 >;>=;>
## 6 3;4;0 >;>=;>
## 7 3;4;0 >;>=;>
## 8 3;4;0;1 >;>=;>;>=
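Because each tree definition is stored as a semicolon-separated string, you can split a row back into its per-level components. A minimal sketch for tree 2 (as.character() is a precaution in case the columns are stored as factors):
defs <- heart.fft$trees
unlist(strsplit(as.character(defs$level.name[2]), ";"))        # cue order
unlist(strsplit(as.character(defs$level.threshold[2]), ";"))   # cue thresholds
unlist(strsplit(as.character(defs$level.exit[2]), ";"))        # exit structure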
Training statistics are contained in columns 7:15 and have the .train
suffix.
heart.fft$trees[,7:15] # Training stats are in columns 7:15
## n.train hi.train mi.train fa.train cr.train hr.train far.train
## 1 151 21 45 0 85 0.3181818 0.00000000
## 2 151 54 12 18 67 0.8181818 0.21176471
## 3 151 44 22 7 78 0.6666667 0.08235294
## 4 151 59 7 32 53 0.8939394 0.37647059
## 5 151 28 38 2 83 0.4242424 0.02352941
## 6 151 54 12 18 67 0.8181818 0.21176471
## 7 151 44 22 7 78 0.6666667 0.08235294
## 8 151 64 2 52 33 0.9696970 0.61176471
## v.train dprime.train
## 1 0.3166249 1.1436132
## 2 0.6064171 0.8543855
## 3 0.5843137 0.9100723
## 4 0.5174688 0.7812587
## 5 0.4007130 0.8973592
## 6 0.6064171 0.8543855
## 7 0.5843137 0.9100723
## 8 0.3579323 0.7962186
For our heart disease dataset, it looks like trees 2 and 6 had the highest training v (HR - FAR) values.
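As a sanity check, the hr.train, far.train, and v.train columns can be recomputed directly from the confusion counts. A minimal sketch:
# Recompute training v = HR - FAR from the confusion counts
with(heart.fft$trees,
     hi.train / (hi.train + mi.train) - fa.train / (fa.train + cr.train))
# This should reproduce the v.train column above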
Test statistics are contained in columns 16:24 and have the .test
suffix.
heart.fft$trees[,16:24] # Test stats are in columns 16:24
## n.test hi.test mi.test fa.test cr.test hr.test far.test v.test
## 1 152 23 50 0 79 0.3150685 0.0000000 0.3131819
## 2 152 64 9 19 60 0.8767123 0.2405063 0.6362060
## 3 152 49 24 8 71 0.6712329 0.1012658 0.5699671
## 4 152 69 4 35 44 0.9452055 0.4430380 0.5021675
## 5 152 28 45 0 79 0.3835616 0.0000000 0.3812091
## 6 152 64 9 19 60 0.8767123 0.2405063 0.6362060
## 7 152 49 24 8 71 0.6712329 0.1012658 0.5699671
## 8 152 72 1 56 23 0.9863014 0.7088608 0.2774406
## dprime.test
## 1 1.1271540
## 2 0.9316912
## 3 0.8588460
## 4 0.8716571
## 5 1.2191190
## 6 0.9316912
## 7 0.8588460
## 8 0.8278755
It looks like trees 2 and 6 also had the highest test v (HR - FAR) values.
The best trees for training and testing are in best.train.tree
and best.test.tree
. That is, which of the trees had the best performance (in terms of v (HR - FAR)) in the training dataset and which had the best performance in the test dataset? We want these two values to be the same. If they are different, then the tree algorithm might be over-fitting to the training dataset.
# which tree had the best training statistics?
heart.fft$best.train.tree
## [1] 2
# Which tree had the best testing statistics?
heart.fft$best.test.tree
## [1] 2
This is a good sign for our heartdisease dataset: tree 2 performed best on both the training and the test data.
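One quick way to check for over-fitting is to put the best tree’s training and test v values side by side. A minimal sketch using the columns shown earlier:
# Compare training and test v for the best training tree (tree 2)
best <- heart.fft$best.train.tree
heart.fft$trees[best, c("v.train", "v.test")]
# v.train = 0.606 and v.test = 0.636, so performance holds up out of sample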
The train.decision.df
and test.decision.df
contain the raw classification decisions for each tree for each training (and test) case.
Here are each of the 8 tree decisions for the first 5 training cases.
heart.fft$train.decision.df[1:5,]
## tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 1 0 0 0 1
## 3 0 0 0 0 0 0 0 1
## 4 0 1 0 1 0 1 0 1
## 5 0 0 0 0 0 0 0 0
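You can cross-tabulate any tree’s decisions against the true criterion values to reproduce its confusion counts. A minimal sketch for tree 2 (this assumes train.crit holds the training criterion values, as its name suggests):
# Cross-tabulate tree 2's training decisions against the truth
table(decision = heart.fft$train.decision.df$tree.2,
      truth = heart.fft$train.crit)
# Cells correspond to cr (0,0), mi (0,1), fa (1,0), and hi (1,1)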
The train.levelout.df
and test.levelout.df
contain the levels at which each case was classified for each tree.
Here are the levels at which the first 5 training cases were classified:
heart.fft$train.levelout.df[1:5,]
## tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1 1 2 1 3 1 2 1 4
## 2 1 2 1 4 1 2 1 3
## 3 1 2 1 4 1 2 1 3
## 4 2 1 3 1 2 1 3 1
## 5 1 2 1 3 1 2 1 4
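Tabulating these values shows how frugal each tree is in practice, i.e., how many cases it classifies at each level. A minimal sketch for tree 2:
# How many training cases does tree 2 classify at each of its levels?
table(heart.fft$train.levelout.df$tree.2)
# Tree 2 has three levels (thal, cp, ca), so the values run from 1 to 3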
The cart
and lr
dataframes contain information about how CART (using the rpart
package) and Logistic Regression performed on the same data.
The cart
dataframe shows training and test statistics using different miss and false alarm costs (the standard tree is in the first row where the miss and false alarm costs are both set to 1).
heart.fft$cart
## miss.cost fa.cost hr.train far.train v.train dprime.train hr.test
## 1 1 1 0.8333333 0.14117647 0.6921569 1.0212351 0.6849315
## 2 2 1 0.8030303 0.11764706 0.6853832 1.0196632 0.6438356
## 3 3 1 0.5303030 0.04705882 0.4832442 0.8750488 0.4657534
## 4 4 1 0.3181818 0.00000000 0.3166249 1.1436132 0.3561644
## 5 5 1 0.4090909 0.01176471 0.3973262 1.0174217 0.3972603
## 6 1 2 0.9242424 0.24705882 0.6771836 1.0589873 0.7945205
## 7 3 2 0.8333333 0.14117647 0.6921569 1.0212351 0.6849315
## 8 4 2 0.8030303 0.11764706 0.6853832 1.0196632 0.6438356
## 9 5 2 0.8030303 0.11764706 0.6853832 1.0196632 0.7123288
## 10 1 3 0.9545455 0.28235294 0.6721925 1.1332437 0.8493151
## 11 2 3 0.8333333 0.14117647 0.6921569 1.0212351 0.6849315
## 14 1 4 0.9696970 0.35294118 0.6167558 1.1268753 0.8630137
## 15 2 4 0.9242424 0.24705882 0.6771836 1.0589873 0.7945205
## 16 3 4 0.8333333 0.14117647 0.6921569 1.0212351 0.6849315
## 18 1 5 1.0000000 0.44705882 0.5488722 1.4026303 0.9315068
## 19 2 5 0.9545455 0.28235294 0.6721925 1.1332437 0.8493151
## 20 3 5 0.9545455 0.27058824 0.6839572 1.1508283 0.7671233
## 21 4 5 0.8333333 0.14117647 0.6921569 1.0212351 0.6849315
## far.test v.test dprime.test cart.cues.vec
## 1 0.24050633 0.4444252 0.5931044 thal;ca;ca;chol
## 2 0.13924051 0.5045951 0.7262341 thal;ca;ca;age
## 3 0.03797468 0.4277787 0.8443696 thal;ca
## 4 0.01265823 0.3435062 0.9339046 thal;ca;oldpeak
## 5 0.05063291 0.3466274 0.6891513 thal;oldpeak;ca;oldpeak
## 6 0.39240506 0.4021155 0.5476317 thal;ca;thalach;age;cp;ca
## 7 0.24050633 0.4444252 0.5931044 thal;ca;ca;chol
## 8 0.13924051 0.5045951 0.7262341 thal;ca;ca;age
## 9 0.16455696 0.5477718 0.7680505 thal;ca;ca;cp
## 10 0.50632911 0.3429860 0.5088174 thal;ca;thalach;age;cp;chol
## 11 0.24050633 0.4444252 0.5931044 thal;ca;ca;chol
## 14 0.51898734 0.3440264 0.5231738 thal;ca;thalach;age
## 15 0.39240506 0.4021155 0.5476317 thal;ca;thalach;age;cp;ca
## 16 0.24050633 0.4444252 0.5931044 thal;ca;ca;chol
## 18 0.62025316 0.3112537 0.5904811 thal;ca;thalach;age;age
## 19 0.50632911 0.3429860 0.5088174 thal;ca;thalach;age;cp;chol
## 20 0.46835443 0.2987689 0.4044065 thal;ca;thalach;age;ca;chol
## 21 0.24050633 0.4444252 0.5931044 thal;ca;ca;chol
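To benchmark the FFTs against standard CART, you can line up the first CART row (equal costs) with the best tree’s test statistics. A minimal sketch:
# Standard CART (miss.cost = fa.cost = 1) vs. the best FFT, on test data
heart.fft$cart[1, c("hr.test", "far.test", "v.test")]
heart.fft$trees[heart.fft$best.train.tree, c("hr.test", "far.test", "v.test")]
# Here tree 2 (v.test = 0.636) outperforms standard CART (v.test = 0.444)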
The lr dataframe shows training and test statistics using different probabilistic decision thresholds. A threshold value of 0.5 corresponds to the standard logistic regression model.
heart.fft$lr
## threshold hr.train far.train hr.test far.test
## 1 0.9 0.4545455 0.01176471 0.4520548 0.01265823
## 2 0.8 0.6212121 0.03529412 0.6164384 0.01265823
## 3 0.7 0.6969697 0.05882353 0.6575342 0.03797468
## 4 0.6 0.7727273 0.08235294 0.7534247 0.06329114
## 5 0.5 0.8181818 0.09411765 0.7808219 0.16455696
## 6 0.4 0.8636364 0.14117647 0.8082192 0.20253165
## 7 0.3 0.8787879 0.21176471 0.8356164 0.26582278
## 8 0.2 0.9090909 0.25882353 0.9041096 0.40506329
## 9 0.1 0.9696970 0.51764706 0.9726027 0.58227848
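For intuition about what these thresholds do, here is a minimal sketch of thresholding logistic regression probabilities. It re-fits a plain glm() on the full dataset for illustration, so it is not the exact model stored in the lr element:
# Fit a logistic regression and classify with a probability threshold
lr.mod <- glm(diagnosis ~ ., data = heartdisease, family = "binomial")
lr.prob <- predict(lr.mod, type = "response")   # predicted probabilities
lr.dec <- as.numeric(lr.prob >= .5)             # decide 1 when p >= threshold
# Raising the threshold lowers both the hit rate and the false alarm rate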
Once you’ve created an fft object using fft(), you can visualize the tree (and ROC curves) using plot(). The following code will visualize the best training tree (tree 2) applied to the test data:
plot(heart.fft,
     which.tree = "best.train",
     which.data = "test",
     description = "Heart Disease",
     decision.names = c("Healthy", "Disease")
)
See the vignette on plot.fft (vignette("fft_plot", package = "fft")) for more details.
The fft() function has several additional arguments that change how trees are built. Note: not all of these arguments have been fully tested yet!
train.p: What percent of the data should be used for training? train.p = .1 will randomly select 10% of the data for training and leave the remaining 90% for testing. Setting train.p = 1 will fit the trees to the entire dataset (with no testing).
test.cue.df, test.criterion.v: If you have a specific dataset that you want to test the trees on, you can specify it here. If you do, the function will use the entire training data (train.cue.df, train.criterion.v) for training and will then apply the resulting trees to the test data you specify. This bypasses the train.p argument (see the sketch after this list).
rank.method: As trees are being built, should cues be selected based on their marginal accuracy (rank.method = "m"), computed on the entire dataset, or on their conditional accuracy (rank.method = "c"), computed on the cases that have not yet been classified? Each method has potential pros and cons. The marginal method is much faster and may be less prone to over-fitting. However, the conditional method could capture important conditional dependencies between cues that the marginal method misses (see the sketch after this list).
stopping.rule, stopping.par: When should trees stop growing? While all trees will (currently) stop if the number of levels hits max.levels, you can also stop trees using additional criteria.
stopping.rule = "levels": will always stop the tree at the level indicated by stopping.par (in this case, it makes more sense to just set max.levels to the number of levels you want to stop at).
stopping.rule = "exemplars": will stop the tree if only a small percentage of cases remain unclassified. This percentage is indicated by stopping.par. For example, stopping.par = .05 will stop the tree if fewer than 5% of cases remain. A sketch combining these arguments follows below.
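To tie these arguments together, here is a minimal sketch of a single call that supplies an explicit test set, conditional cue ranking, and an exemplar-based stopping rule. The manual 50/50 split and the object names (heart.train, heart.test, heart.fft.2) are purely illustrative:
# Manually split the data, then pass the test half to fft() directly
set.seed(100)
train.cases <- sample(nrow(heartdisease), size = floor(nrow(heartdisease) / 2))
heart.train <- heartdisease[train.cases, ]
heart.test <- heartdisease[-train.cases, ]

heart.fft.2 <- fft(
  train.cue.df = heart.train[, names(heart.train) != "diagnosis"],
  train.criterion.v = heart.train$diagnosis,
  test.cue.df = heart.test[, names(heart.test) != "diagnosis"],
  test.criterion.v = heart.test$diagnosis,
  rank.method = "c",            # conditional cue ranking
  stopping.rule = "exemplars",  # stop when few cases remain unclassified...
  stopping.par = .05            # ...specifically, fewer than 5%
)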