fft() function

Nathaniel Phillips

2016-07-12

This function is at the heart of the FFTrees package. The function takes a training dataset as an argument and generates several fast-and-frugal trees (FFTs) (more details about the algorithm coming soon…)

Example with heartdisease

Let’s start with an example: we’ll create FFTs fitted to the heartdisease dataset. Here’s how the dataset looks:

head(heartdisease)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  1      145  233   1       2     150     0     2.3     3  0    6
## 2  67   1  4      160  286   0       2     108     1     1.5     2  3    3
## 3  67   1  4      120  229   0       2     129     1     2.6     2  2    7
## 4  37   1  3      130  250   0       0     187     0     3.5     3  0    3
## 5  41   0  2      130  204   0       2     172     0     1.4     1  0    3
## 6  56   1  2      120  236   0       0     178     0     0.8     1  0    3
##   diagnosis
## 1         0
## 2         1
## 3         1
## 4         0
## 5         0
## 6         0

We’ll create a new fft object called heart.fft using the fft() function. We’ll set the criterion to heartdisease$diagnosis and use all other columns (heartdisease[,names(heartdisease) != "diagnosis"]) as potential predictors. Additionally, we’ll define two parameters: train.p = .5, which fits the trees to a random half of the data and reserves the other half for testing, and max.levels = 4, which caps the trees at four levels:

set.seed(100) # For reproducibility

heart.fft <- fft(
  train.cue.df = heartdisease[,names(heartdisease) != "diagnosis"],
  train.criterion.v = heartdisease$diagnosis,
  train.p = .5,
  max.levels = 4
  )

Elements of an fft object

As you can see, fft() returns an object of class fft:

class(heart.fft)
## [1] "fft"

There are many elements in an fft object:

names(heart.fft)
##  [1] "trees"             "cue.accuracies"    "cart"             
##  [4] "lr"                "train.cue"         "train.crit"       
##  [7] "test.cue"          "test.crit"         "train.decision.df"
## [10] "test.decision.df"  "train.levelout.df" "test.levelout.df" 
## [13] "best.train.tree"   "best.test.tree"

cue.accuracies

The cue.accuracies dataframe contains the original, marginal cue accuracies. That is, for each cue, the threshold that maximizes v (HR - FAR) is chosen (this is done using the cuerank() function):

heart.fft$cue.accuracies
##     cue.name cue.class level.threshold level.sigdirection hi mi fa cr
## 12       age   numeric              57                 >= 44 22 26 59
## 2        sex   numeric               1                 >= 53 13 48 37
## 4         cp   numeric               4                 >= 48 18 18 67
## 9   trestbps   numeric             139                 >= 26 40 21 64
## 7       chol   numeric             218                 >= 51 15 56 29
## 21       fbs   numeric               1                 >= 10 56  9 76
## 1    restecg   numeric               0                  > 40 26 35 50
## 121  thalach   numeric             154                  < 43 23 27 58
## 11     exang   numeric               0                  > 31 35 14 71
## 22   oldpeak   numeric               1                 >= 41 25 21 64
## 13     slope   numeric               1                  > 45 21 27 58
## 14        ca   numeric               0                  > 47 19 19 66
## 15      thal   numeric               3                  > 47 19 16 69
##            hr       far         v    dprime correction hr.weight
## 12  0.6666667 0.3058824 0.3607843 0.4691417       0.25       0.5
## 2   0.8030303 0.5647059 0.2383244 0.3447918       0.25       0.5
## 4   0.7272727 0.2117647 0.5155080 0.7024492       0.25       0.5
## 9   0.3939394 0.2470588 0.1468806 0.2073541       0.25       0.5
## 7   0.7727273 0.6588235 0.1139037 0.1693021       0.25       0.5
## 21  0.1515152 0.1058824 0.0456328 0.1093854       0.25       0.5
## 1   0.6060606 0.4117647 0.1942959 0.2460370       0.25       0.5
## 121 0.6515152 0.3176471 0.3338681 0.4318515       0.25       0.5
## 11  0.4696970 0.1647059 0.3049911 0.4496339       0.25       0.5
## 22  0.6212121 0.2470588 0.3741533 0.4962201       0.25       0.5
## 13  0.6818182 0.3176471 0.3641711 0.4735389       0.25       0.5
## 14  0.7121212 0.2235294 0.4885918 0.6599599       0.25       0.5
## 15  0.7121212 0.1882353 0.5238859 0.7220052       0.25       0.5

Here, we can see that the thal cue had the highest v value (0.5239), while cp had the second highest (0.5155).
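
These statistics can be reproduced by hand for any single cue and threshold. Here is a minimal sketch of the calculation behind v (it mirrors what cuerank() computes, but uses the full dataset, while the table above is based on the training half only, so the exact numbers will differ):

```r
# Marginal accuracy of a single cue threshold: thal > 3
criterion <- heartdisease$diagnosis == 1   # TRUE = disease present
decision  <- heartdisease$thal > 3         # cue-based decision

hr  <- mean(decision[criterion])    # hit rate: p(decide 1 | disease)
far <- mean(decision[!criterion])   # false alarm rate: p(decide 1 | healthy)
v   <- hr - far                     # the v statistic used to rank cues
```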

trees

The trees dataframe contains all tree definitions and training (and possibly test) statistics for all (\(2^{max.levels - 1}\)) trees. For our heart.fft example, there are \(2^{4 - 1} = 8\) trees.
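
The tree count follows directly from max.levels, since each non-final level can exit in one of two directions:

```r
# Number of candidate trees for a given max.levels
max.levels <- 4
2^(max.levels - 1)
## [1] 8
```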

Tree definitions (exit directions, cue order, and cue thresholds) are contained in columns 1 through 6:

heart.fft$trees[,1:6]   # Tree info are in columns 1:6
##   tree.num         level.name                     level.class level.exit
## 1        1 thal;cp;ca;oldpeak numeric;numeric;numeric;numeric  0;0;0;0.5
## 2        2         thal;cp;ca         numeric;numeric;numeric    1;0;0.5
## 3        3         thal;cp;ca         numeric;numeric;numeric    0;1;0.5
## 4        4 thal;cp;ca;oldpeak numeric;numeric;numeric;numeric  1;1;0;0.5
## 5        5         thal;cp;ca         numeric;numeric;numeric    0;0;0.5
## 6        6         thal;cp;ca         numeric;numeric;numeric    1;0;0.5
## 7        7         thal;cp;ca         numeric;numeric;numeric    0;1;0.5
## 8        8 thal;cp;ca;oldpeak numeric;numeric;numeric;numeric  1;1;1;0.5
##   level.threshold level.sigdirection
## 1         3;4;0;1          >;>=;>;>=
## 2           3;4;0             >;>=;>
## 3           3;4;0             >;>=;>
## 4         3;4;0;1          >;>=;>;>=
## 5           3;4;0             >;>=;>
## 6           3;4;0             >;>=;>
## 7           3;4;0             >;>=;>
## 8         3;4;0;1          >;>=;>;>=

Training statistics are contained in columns 7:15 and have the .train suffix.

heart.fft$trees[,7:15]   # Training stats are in columns 7:15
##   n.train hi.train mi.train fa.train cr.train  hr.train  far.train
## 1     151       21       45        0       85 0.3181818 0.00000000
## 2     151       54       12       18       67 0.8181818 0.21176471
## 3     151       44       22        7       78 0.6666667 0.08235294
## 4     151       59        7       32       53 0.8939394 0.37647059
## 5     151       28       38        2       83 0.4242424 0.02352941
## 6     151       54       12       18       67 0.8181818 0.21176471
## 7     151       44       22        7       78 0.6666667 0.08235294
## 8     151       64        2       52       33 0.9696970 0.61176471
##     v.train dprime.train
## 1 0.3166249    1.1436132
## 2 0.6064171    0.8543855
## 3 0.5843137    0.9100723
## 4 0.5174688    0.7812587
## 5 0.4007130    0.8973592
## 6 0.6064171    0.8543855
## 7 0.5843137    0.9100723
## 8 0.3579323    0.7962186

For our heart disease dataset, it looks like trees 2 and 6 had the highest training v (HR - FAR) values.

Test statistics are contained in columns 16:24 and have the .test suffix.

heart.fft$trees[,16:24]   # Test stats are in columns 16:24
##   n.test hi.test mi.test fa.test cr.test   hr.test  far.test    v.test
## 1    152      23      50       0      79 0.3150685 0.0000000 0.3131819
## 2    152      64       9      19      60 0.8767123 0.2405063 0.6362060
## 3    152      49      24       8      71 0.6712329 0.1012658 0.5699671
## 4    152      69       4      35      44 0.9452055 0.4430380 0.5021675
## 5    152      28      45       0      79 0.3835616 0.0000000 0.3812091
## 6    152      64       9      19      60 0.8767123 0.2405063 0.6362060
## 7    152      49      24       8      71 0.6712329 0.1012658 0.5699671
## 8    152      72       1      56      23 0.9863014 0.7088608 0.2774406
##   dprime.test
## 1   1.1271540
## 2   0.9316912
## 3   0.8588460
## 4   0.8716571
## 5   1.2191190
## 6   0.9316912
## 7   0.8588460
## 8   0.8278755

It looks like trees 2 and 6 also had the highest test v (HR - FAR) values.

best.train.tree, best.test.tree

The best trees for training and testing are in best.train.tree and best.test.tree. That is, which of the trees had the best performance (in terms of v (HR - FAR)) in the training dataset, and which had the best performance in the test dataset? We want these two values to be the same: if they differ, the tree algorithm may be over-fitting to the training data.

# which tree had the best training statistics?
heart.fft$best.train.tree
## [1] 2
# Which tree had the best testing statistics?
heart.fft$best.test.tree
## [1] 2

This is a good sign for our heartdisease dataset: it means that tree 2 performed best on both the training and test data.
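
If you want to verify these indices yourself, they can be recovered from the trees dataframe. A sketch, assuming the v.train and v.test columns shown above (note that with tied values, which.max() returns the first maximum):

```r
# Recover the best-tree indices by hand from the tree statistics
which.max(heart.fft$trees$v.train)   # best training tree
which.max(heart.fft$trees$v.test)    # best test tree
```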

Other information

train.decision.df, test.decision.df

The train.decision.df and test.decision.df dataframes contain the raw classification decisions of each tree for each training (and test) case.

Here are the decisions of all 8 trees for the first 5 training cases:

heart.fft$train.decision.df[1:5,]
##   tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1      0      0      0      0      0      0      0      0
## 2      0      0      0      1      0      0      0      1
## 3      0      0      0      0      0      0      0      1
## 4      0      1      0      1      0      1      0      1
## 5      0      0      0      0      0      0      0      0
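
One use of these raw decisions is to score the trees yourself. Here is a sketch that computes each tree’s raw training accuracy by comparing its decisions to the true criterion values (assuming train.crit holds the training criterion vector, as the element names above suggest):

```r
# Proportion of correct training decisions for each of the 8 trees
colMeans(heart.fft$train.decision.df == heart.fft$train.crit)
```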

train.levelout.df, test.levelout.df

The train.levelout.df and test.levelout.df contain the levels at which each case was classified for each tree.

Here are the levels at which the first 5 training cases were classified:

heart.fft$train.levelout.df[1:5,]
##   tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1      1      2      1      3      1      2      1      4
## 2      1      2      1      4      1      2      1      3
## 3      1      2      1      4      1      2      1      3
## 4      2      1      3      1      2      1      3      1
## 5      1      2      1      3      1      2      1      4

cart, lr

The cart and lr dataframes contain information about how CART (using the rpart package) and logistic regression performed on the same data.

The cart dataframe shows training and test statistics using different miss and false alarm costs (the standard tree is in the first row where the miss and false alarm costs are both set to 1).

heart.fft$cart
##    miss.cost fa.cost  hr.train  far.train   v.train dprime.train   hr.test
## 1          1       1 0.8333333 0.14117647 0.6921569    1.0212351 0.6849315
## 2          2       1 0.8030303 0.11764706 0.6853832    1.0196632 0.6438356
## 3          3       1 0.5303030 0.04705882 0.4832442    0.8750488 0.4657534
## 4          4       1 0.3181818 0.00000000 0.3166249    1.1436132 0.3561644
## 5          5       1 0.4090909 0.01176471 0.3973262    1.0174217 0.3972603
## 6          1       2 0.9242424 0.24705882 0.6771836    1.0589873 0.7945205
## 7          3       2 0.8333333 0.14117647 0.6921569    1.0212351 0.6849315
## 8          4       2 0.8030303 0.11764706 0.6853832    1.0196632 0.6438356
## 9          5       2 0.8030303 0.11764706 0.6853832    1.0196632 0.7123288
## 10         1       3 0.9545455 0.28235294 0.6721925    1.1332437 0.8493151
## 11         2       3 0.8333333 0.14117647 0.6921569    1.0212351 0.6849315
## 14         1       4 0.9696970 0.35294118 0.6167558    1.1268753 0.8630137
## 15         2       4 0.9242424 0.24705882 0.6771836    1.0589873 0.7945205
## 16         3       4 0.8333333 0.14117647 0.6921569    1.0212351 0.6849315
## 18         1       5 1.0000000 0.44705882 0.5488722    1.4026303 0.9315068
## 19         2       5 0.9545455 0.28235294 0.6721925    1.1332437 0.8493151
## 20         3       5 0.9545455 0.27058824 0.6839572    1.1508283 0.7671233
## 21         4       5 0.8333333 0.14117647 0.6921569    1.0212351 0.6849315
##      far.test    v.test dprime.test               cart.cues.vec
## 1  0.24050633 0.4444252   0.5931044             thal;ca;ca;chol
## 2  0.13924051 0.5045951   0.7262341              thal;ca;ca;age
## 3  0.03797468 0.4277787   0.8443696                     thal;ca
## 4  0.01265823 0.3435062   0.9339046             thal;ca;oldpeak
## 5  0.05063291 0.3466274   0.6891513     thal;oldpeak;ca;oldpeak
## 6  0.39240506 0.4021155   0.5476317   thal;ca;thalach;age;cp;ca
## 7  0.24050633 0.4444252   0.5931044             thal;ca;ca;chol
## 8  0.13924051 0.5045951   0.7262341              thal;ca;ca;age
## 9  0.16455696 0.5477718   0.7680505               thal;ca;ca;cp
## 10 0.50632911 0.3429860   0.5088174 thal;ca;thalach;age;cp;chol
## 11 0.24050633 0.4444252   0.5931044             thal;ca;ca;chol
## 14 0.51898734 0.3440264   0.5231738         thal;ca;thalach;age
## 15 0.39240506 0.4021155   0.5476317   thal;ca;thalach;age;cp;ca
## 16 0.24050633 0.4444252   0.5931044             thal;ca;ca;chol
## 18 0.62025316 0.3112537   0.5904811     thal;ca;thalach;age;age
## 19 0.50632911 0.3429860   0.5088174 thal;ca;thalach;age;cp;chol
## 20 0.46835443 0.2987689   0.4044065 thal;ca;thalach;age;ca;chol
## 21 0.24050633 0.4444252   0.5931044             thal;ca;ca;chol
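
The cost manipulation in the first two columns can be reproduced directly with rpart. The following is a minimal sketch (not the package’s internal code) of a cost-sensitive CART fit, assuming a miss cost of 2 and a false alarm cost of 1; in rpart’s loss matrix, rows index the true class and columns the predicted class:

```r
library(rpart)

# Loss matrix: rows = true class (0, 1), columns = predicted class (0, 1).
# A miss (true 1, predicted 0) costs 2; a false alarm (true 0, predicted 1) costs 1.
loss.mat <- matrix(c(0, 2,
                     1, 0), nrow = 2)

heart.cart <- rpart(factor(diagnosis) ~ .,
                    data = heartdisease,
                    method = "class",
                    parms = list(loss = loss.mat))
```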

The lr dataframe shows training and test statistics using different probability thresholds for decisions. A threshold of 0.5 corresponds to the standard logistic regression model.

heart.fft$lr
##   threshold  hr.train  far.train   hr.test   far.test
## 1       0.9 0.4545455 0.01176471 0.4520548 0.01265823
## 2       0.8 0.6212121 0.03529412 0.6164384 0.01265823
## 3       0.7 0.6969697 0.05882353 0.6575342 0.03797468
## 4       0.6 0.7727273 0.08235294 0.7534247 0.06329114
## 5       0.5 0.8181818 0.09411765 0.7808219 0.16455696
## 6       0.4 0.8636364 0.14117647 0.8082192 0.20253165
## 7       0.3 0.8787879 0.21176471 0.8356164 0.26582278
## 8       0.2 0.9090909 0.25882353 0.9041096 0.40506329
## 9       0.1 0.9696970 0.51764706 0.9726027 0.58227848
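
The threshold manipulation can likewise be sketched with glm(). This is an illustration, not the package’s internal code, and it fits to the full dataset rather than the training half:

```r
# Standard logistic regression on the heartdisease data
heart.glm <- glm(diagnosis ~ ., data = heartdisease, family = binomial)

# Predicted probabilities, then decisions at a custom threshold
pred.prob <- predict(heart.glm, type = "response")
decisions <- as.numeric(pred.prob >= .3)   # lower threshold -> higher HR and FAR
```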

Plotting trees

Once you’ve created an fft object using fft(), you can visualize a tree (and ROC curves) using plot(). The following code visualizes the best training tree (tree 2) applied to the test data:

plot(heart.fft,
     which.tree = "best.train",
     which.data = "test",
     description = "Heart Disease",
     decision.names = c("Healthy", "Disease")
     )

See the plot.fft vignette (vignette("fft_plot", package = "fft")) for more details.

Additional arguments

The fft() function has several additional arguments that change how trees are built. Note: not all of these arguments have been fully tested yet!