simfft()

Nathaniel Phillips

2016-07-12

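The simfft() function runs repeated cross-validation simulations with the fft() function: in each simulation, trees are fitted to a random training sample and then evaluated on the remaining test data.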

Example with breastcancer

Let’s start with an example: we’ll create FFTs fitted to the breastcancer dataset. Here are the first few rows of the dataset:

head(breastcancer)
##   thickness cellsize.unif cellshape.unif adhesion epithelial nuclei.bare
## 1         5             1              1        1          2           1
## 2         5             4              4        5          7          10
## 3         3             1              1        1          2           2
## 4         6             8              8        1          3           4
## 5         4             1              1        3          2           1
## 6         8            10             10        8          7          10
##   chromatin nucleoli mitoses diagnosis
## 1         3        1       1         B
## 2         3        2       1         B
## 3         3        1       1         B
## 4         3        7       1         B
## 5         3        1       1         B
## 6         9        7       1         M

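We’ll create a new object called bcancer.fft.sim using the simfft() function. We’ll set the criterion to breastcancer$diagnosis == "M" and use all other columns (breastcancer[,names(breastcancer) != "diagnosis"]) as potential predictors. Additionally, we’ll define two parameters: train.p = .1, so that 10% of the data is used to train the trees in each simulation, and sim.n = 10, the number of simulations to run: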

set.seed(100) # For reproducibility

bcancer.fft.sim <- simfft(
  train.cue.df = breastcancer[,names(breastcancer) != "diagnosis"],
  train.criterion.v = breastcancer$diagnosis == "M",
  train.p = .1,
  sim.n = 10
)
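To make the role of train.p concrete, here is a minimal base R sketch of a single random train/test split. It assumes that train.p is the proportion of cases assigned to the training set; the names toy.df, toy.criterion, and train.idx are hypothetical, and the split logic is illustrative rather than taken from the source of simfft().

```r
# Illustrative only: one random train/test split, assuming train.p is the
# proportion of cases used for training. This is not simfft()'s actual code.
set.seed(100)

# Hypothetical toy data standing in for a cue data frame and a criterion
toy.df <- data.frame(cue.a = 1:50, cue.b = rep(c(0, 1), 25))
toy.criterion <- toy.df$cue.a > 25

train.p <- .1
n.train <- round(train.p * nrow(toy.df))

# Randomly pick the training cases; all remaining cases form the test set
train.idx <- sample(nrow(toy.df), size = n.train)
train.df <- toy.df[train.idx, ]
test.df <- toy.df[-train.idx, ]

nrow(train.df) # 5 training cases
nrow(test.df)  # 45 test cases
```

Repeating a split like this sim.n times, with a different random sample each time, is what gives one row of results per simulation.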

Results

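The function returns a data frame with one row per simulation. It contains training (fitting) and test results for the FFTs, logistic regression (lr), and CART, along with the cues, thresholds, directions, and exits used in each tree: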

bcancer.fft.sim
##    train.p sim fft.hr.train fft.far.train fft.hr.test fft.far.test
## 1      0.1   1    0.9523810    0.04255319   0.9220183   0.04534005
## 2      0.1   2    0.9545455    0.04347826   0.7926267   0.02261307
## 3      0.1   3    1.0000000    0.05000000   0.9573460   0.10396040
## 4      0.1   4    0.9523810    0.04255319   0.9587156   0.06045340
## 5      0.1   5    1.0000000    0.02040816   0.8636364   0.05063291
## 6      0.1   6    1.0000000    0.04444444   0.9722222   0.08020050
## 7      0.1   7    1.0000000    0.00000000   0.6711712   0.01526718
## 8      0.1   8    1.0000000    0.06976744   0.9953271   0.18703242
## 9      0.1   9    1.0000000    0.04255319   0.9954128   0.16372796
## 10     0.1  10    0.9583333    0.06818182   0.9395349   0.06250000
##    fft.level.class               fft.level.name fft.level.exit
## 1  integer;integer cellsize.unif;cellshape.unif          1;0.5
## 2  integer;integer    cellshape.unif;epithelial          0;0.5
## 3  integer;integer cellsize.unif;cellshape.unif          1;0.5
## 4  integer;integer cellsize.unif;cellshape.unif          0;0.5
## 5  integer;integer     epithelial;cellsize.unif          0;0.5
## 6  numeric;integer   nuclei.bare;cellshape.unif          1;0.5
## 7  integer;integer cellshape.unif;cellsize.unif          0;0.5
## 8  integer;numeric    cellsize.unif;nuclei.bare          1;0.5
## 9  integer;numeric       epithelial;nuclei.bare          1;0.5
## 10 numeric;integer        nuclei.bare;chromatin          1;0.5
##    fft.level.threshold fft.level.sigdirection lr.hr.train lr.far.train
## 1                  4;3                   >=;>           1            0
## 2                  3;2                    >;>           1            0
## 3                  3;4                  >=;>=           1            0
## 4                  1;2                    >;>           1            0
## 5                  3;3                  >=;>=           1            0
## 6                  4;4                  >=;>=           1            0
## 7                  4;3                    >;>           1            0
## 8                  2;2                   >;>=           1            0
## 9                  2;2                    >;>           1            0
## 10                 5;3                   >=;>           1            0
##    lr.hr.test lr.far.test cart.hr.train cart.far.train cart.hr.test
## 1   0.7844037  0.05289673     0.9523810     0.04255319    0.8348624
## 2   0.8341014  0.02010050     1.0000000     0.08695652    0.8525346
## 3   0.9241706  0.04950495     1.0000000     0.05000000    0.9431280
## 4   0.8761468  0.04534005     1.0000000     0.08510638    0.9816514
## 5   0.9181818  0.03291139     1.0000000     0.06122449    0.9454545
## 6   0.7500000  0.03759398     0.9565217     0.04444444    0.8333333
## 7   0.8603604  0.02290076     1.0000000     0.00000000    0.7162162
## 8   0.9345794  0.05985037     0.9200000     0.02325581    0.9532710
## 9   0.7706422  0.03778338     0.9523810     0.02127660    0.9036697
## 10  0.8232558  0.03250000     0.8750000     0.04545455    0.7767442
##    cart.far.test
## 1     0.02267003
## 2     0.03768844
## 3     0.08910891
## 4     0.17884131
## 5     0.08860759
## 6     0.02255639
## 7     0.01781170
## 8     0.09226933
## 9     0.11335013
## 10    0.03500000

Plotting the results

You can plot the results using the simfftplot() function.

Which cues were selected the most often?

If you set roc = F, you’ll see a bar chart showing how often each of the possible cues was used in trees. This gives you an indication of how important each cue is. For example, if a cue is used in more than 95% of simulations, this suggests that the cue is a consistently good predictor of the criterion across a wide range of training samples.

simfftplot(bcancer.fft.sim,
           roc = F
           )
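The counts behind such a bar chart can also be tallied by hand. The sketch below is one possible way to do it in base R and is not necessarily how simfftplot() works internally. It uses the cue names from the fft.level.name column of the output above, copied into a hypothetical vector level.names so that the example is self-contained.

```r
# Cue names per simulation, copied from the fft.level.name column above.
# With the real object you could use bcancer.fft.sim$fft.level.name instead.
level.names <- c(
  "cellsize.unif;cellshape.unif", "cellshape.unif;epithelial",
  "cellsize.unif;cellshape.unif", "cellsize.unif;cellshape.unif",
  "epithelial;cellsize.unif",     "nuclei.bare;cellshape.unif",
  "cellshape.unif;cellsize.unif", "cellsize.unif;nuclei.bare",
  "epithelial;nuclei.bare",       "nuclei.bare;chromatin"
)

# Split each tree's cue string on ";" and pool the cues across simulations
cues.used <- unlist(strsplit(level.names, split = ";", fixed = TRUE))

# Count how often each cue appears, most frequent first
cue.counts <- sort(table(cues.used), decreasing = TRUE)
cue.counts

# Proportion of simulations in which each cue was used
cue.counts / length(level.names)
```

In these 10 simulations, cellsize.unif and cellshape.unif each appear in 6 trees, nuclei.bare in 4, epithelial in 3, and chromatin in 1.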

How well (and consistently) did the trees perform?

If you set roc = T, you’ll see a distribution of hit rates (HR) and false-alarm rates (FAR) for trees across all simulations. You can also specify which data (training or test) to display with which.data.

Here, we can see the distribution of HR and FAR for the training data:

simfftplot(bcancer.fft.sim,
           roc = T,
           which.data = "train"
           )

Now let’s look at the test data. We should expect the trees to do a bit worse here than on the training data:

simfftplot(bcancer.fft.sim,
           roc = T,
           which.data = "test"
           )
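The drop from training to test performance can also be checked numerically. The following sketch averages the FFT hit rates and false-alarm rates over the 10 simulations, using values copied from the output above; the data frame name fft.res is hypothetical.

```r
# FFT training and test results, copied from the simulation output above.
# With the real object you could index bcancer.fft.sim directly instead.
fft.res <- data.frame(
  fft.hr.train  = c(0.9523810, 0.9545455, 1.0000000, 0.9523810, 1.0000000,
                    1.0000000, 1.0000000, 1.0000000, 1.0000000, 0.9583333),
  fft.far.train = c(0.04255319, 0.04347826, 0.05000000, 0.04255319, 0.02040816,
                    0.04444444, 0.00000000, 0.06976744, 0.04255319, 0.06818182),
  fft.hr.test   = c(0.9220183, 0.7926267, 0.9573460, 0.9587156, 0.8636364,
                    0.9722222, 0.6711712, 0.9953271, 0.9954128, 0.9395349),
  fft.far.test  = c(0.04534005, 0.02261307, 0.10396040, 0.06045340, 0.05063291,
                    0.08020050, 0.01526718, 0.18703242, 0.16372796, 0.06250000)
)

# Mean of each column across simulations
fft.means <- colMeans(fft.res)
round(fft.means, 2)
```

For these simulations, the mean hit rate falls from about 0.98 in training to about 0.91 in testing, while the mean false-alarm rate rises from about 0.04 to about 0.08.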

You can add curves for CART and logistic regression (LR) by including the arguments lr = T and cart = T. Let’s look at the performance of CART and LR compared to the trees for the training data:

simfftplot(bcancer.fft.sim,
           roc = T,
           lr = T,
           cart = T,
           which.data = "train"
           )

It looks like LR dominated both CART and FFTs for the training data (in fact, for this simulation, LR always gave a perfect fit). Now let’s look at the test data:

simfftplot(bcancer.fft.sim,
           roc = T,
           lr = T,
           cart = T,
           which.data = "test"
           )

Here, we can see that for the testing data, all three algorithms performed similarly well.
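As a rough numerical companion to the plot, the sketch below averages the test hit rates and false-alarm rates of the three algorithms across simulations. The values are copied from the output above, and the data frame name test.res is hypothetical. Simple means hide the spread across simulations, so treat this as a summary rather than a replacement for the plot.

```r
# Test-data results for FFTs, LR, and CART, copied from the output above.
# With the real object you could index bcancer.fft.sim directly instead.
test.res <- data.frame(
  fft.hr.test   = c(0.9220183, 0.7926267, 0.9573460, 0.9587156, 0.8636364,
                    0.9722222, 0.6711712, 0.9953271, 0.9954128, 0.9395349),
  fft.far.test  = c(0.04534005, 0.02261307, 0.10396040, 0.06045340, 0.05063291,
                    0.08020050, 0.01526718, 0.18703242, 0.16372796, 0.06250000),
  lr.hr.test    = c(0.7844037, 0.8341014, 0.9241706, 0.8761468, 0.9181818,
                    0.7500000, 0.8603604, 0.9345794, 0.7706422, 0.8232558),
  lr.far.test   = c(0.05289673, 0.02010050, 0.04950495, 0.04534005, 0.03291139,
                    0.03759398, 0.02290076, 0.05985037, 0.03778338, 0.03250000),
  cart.hr.test  = c(0.8348624, 0.8525346, 0.9431280, 0.9816514, 0.9454545,
                    0.8333333, 0.7162162, 0.9532710, 0.9036697, 0.7767442),
  cart.far.test = c(0.02267003, 0.03768844, 0.08910891, 0.17884131, 0.08860759,
                    0.02255639, 0.01781170, 0.09226933, 0.11335013, 0.03500000)
)

# Mean test HR and FAR per algorithm across the 10 simulations
test.means <- colMeans(test.res)
round(test.means, 2)
```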