The simfft() function simulates multiple cross-validations of a dataset using trees built with the fft() function: on each simulation it splits the data into training and test sets, fits FFTs to the training set, and evaluates them on the test set. For comparison, it also reports the performance of logistic regression and CART on the same data.
Let’s start with an example: we’ll create FFTs fitted to the breastcancer dataset. Here’s how the dataset looks:
head(breastcancer)
##   thickness cellsize.unif cellshape.unif adhesion epithelial nuclei.bare
## 1         5             1              1        1          2           1
## 2         5             4              4        5          7          10
## 3         3             1              1        1          2           2
## 4         6             8              8        1          3           4
## 5         4             1              1        3          2           1
## 6         8            10             10        8          7          10
##   chromatin nucleoli mitoses diagnosis
## 1         3        1       1         B
## 2         3        2       1         B
## 3         3        1       1         B
## 4         3        7       1         B
## 5         3        1       1         B
## 6         9        7       1         M
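The criterion we’ll try to predict is the diagnosis column, where "M" indicates a malignant case. As a quick sanity check (assuming the breastcancer data are already loaded in your session), you can look at the criterion base rate with base R:
table(breastcancer$diagnosis)        # counts of benign (B) and malignant (M) cases
mean(breastcancer$diagnosis == "M")  # proportion of malignant cases (the criterion base rate)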
We’ll create a new object called bcancer.fft.sim using the simfft() function. We’ll set the criterion (train.criterion.v) to breastcancer$diagnosis == "M" and use all other columns (train.cue.df = breastcancer[, names(breastcancer) != "diagnosis"]) as potential predictors. Additionally, we’ll define two parameters:
train.p = .1: Train the trees on a random sample of 10% of the original dataset, and test the trees on the remaining 90%.
sim.n = 10: Run 10 simulations.

set.seed(100) # For reproducibility
bcancer.fft.sim <- simfft(
  train.cue.df = breastcancer[, names(breastcancer) != "diagnosis"],
  train.criterion.v = breastcancer$diagnosis == "M",
  train.p = .1,
  sim.n = 10
)
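To make train.p = .1 concrete, here is a minimal base R sketch of what a single split amounts to (illustrative only, not the package’s internal code): sample 10% of the rows for training and keep the remaining 90% for testing.
# Illustrative split-half logic; simfft() handles this internally on each simulation
n <- nrow(breastcancer)
train.rows <- sample(1:n, size = round(.1 * n))  # 10% of rows used for fitting
train.data <- breastcancer[train.rows, ]         # data used to build the trees
test.data  <- breastcancer[-train.rows, ]        # remaining 90% used for testing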
The function will return a dataframe with fitting and test results for each simulation:
bcancer.fft.sim
##    train.p sim fft.hr.train fft.far.train fft.hr.test fft.far.test
## 1      0.1   1    0.9523810    0.04255319   0.9220183   0.04534005
## 2      0.1   2    0.9545455    0.04347826   0.7926267   0.02261307
## 3      0.1   3    1.0000000    0.05000000   0.9573460   0.10396040
## 4      0.1   4    0.9523810    0.04255319   0.9587156   0.06045340
## 5      0.1   5    1.0000000    0.02040816   0.8636364   0.05063291
## 6      0.1   6    1.0000000    0.04444444   0.9722222   0.08020050
## 7      0.1   7    1.0000000    0.00000000   0.6711712   0.01526718
## 8      0.1   8    1.0000000    0.06976744   0.9953271   0.18703242
## 9      0.1   9    1.0000000    0.04255319   0.9954128   0.16372796
## 10     0.1  10    0.9583333    0.06818182   0.9395349   0.06250000
##    fft.level.class               fft.level.name fft.level.exit
## 1  integer;integer cellsize.unif;cellshape.unif          1;0.5
## 2  integer;integer    cellshape.unif;epithelial          0;0.5
## 3  integer;integer cellsize.unif;cellshape.unif          1;0.5
## 4  integer;integer cellsize.unif;cellshape.unif          0;0.5
## 5  integer;integer     epithelial;cellsize.unif          0;0.5
## 6  numeric;integer   nuclei.bare;cellshape.unif          1;0.5
## 7  integer;integer cellshape.unif;cellsize.unif          0;0.5
## 8  integer;numeric    cellsize.unif;nuclei.bare          1;0.5
## 9  integer;numeric       epithelial;nuclei.bare          1;0.5
## 10 numeric;integer        nuclei.bare;chromatin          1;0.5
##    fft.level.threshold fft.level.sigdirection lr.hr.train lr.far.train
## 1                  4;3                   >=;>           1            0
## 2                  3;2                    >;>           1            0
## 3                  3;4                  >=;>=           1            0
## 4                  1;2                    >;>           1            0
## 5                  3;3                  >=;>=           1            0
## 6                  4;4                  >=;>=           1            0
## 7                  4;3                    >;>           1            0
## 8                  2;2                   >;>=           1            0
## 9                  2;2                    >;>           1            0
## 10                 5;3                   >=;>           1            0
##    lr.hr.test lr.far.test cart.hr.train cart.far.train cart.hr.test
## 1   0.7844037  0.05289673     0.9523810     0.04255319    0.8348624
## 2   0.8341014  0.02010050     1.0000000     0.08695652    0.8525346
## 3   0.9241706  0.04950495     1.0000000     0.05000000    0.9431280
## 4   0.8761468  0.04534005     1.0000000     0.08510638    0.9816514
## 5   0.9181818  0.03291139     1.0000000     0.06122449    0.9454545
## 6   0.7500000  0.03759398     0.9565217     0.04444444    0.8333333
## 7   0.8603604  0.02290076     1.0000000     0.00000000    0.7162162
## 8   0.9345794  0.05985037     0.9200000     0.02325581    0.9532710
## 9   0.7706422  0.03778338     0.9523810     0.02127660    0.9036697
## 10  0.8232558  0.03250000     0.8750000     0.04545455    0.7767442
##    cart.far.test
## 1     0.02267003
## 2     0.03768844
## 3     0.08910891
## 4     0.17884131
## 5     0.08860759
## 6     0.02255639
## 7     0.01781170
## 8     0.09226933
## 9     0.11335013
## 10    0.03500000
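Because the result is a regular dataframe, you can inspect it with standard R functions. For example, to see how much the trees’ test performance varied across the 10 simulations:
range(bcancer.fft.sim$fft.hr.test)   # spread of test hit rates
range(bcancer.fft.sim$fft.far.test)  # spread of test false-alarm rates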
You can plot the results using the simfftplot() function. If you set roc = F, you’ll see a bar chart showing how often each of the possible cues was used in the trees. This gives you an indication of how important each cue is. For example, if a cue is used in >95% of simulations, this suggests that the cue is a consistently good predictor of the criterion across a wide range of training samples.
simfftplot(bcancer.fft.sim,
           roc = F
)
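If you want the counts behind this kind of plot, you can also tally cue usage directly from the fft.level.name column, which (as in the output above) stores each tree’s cues separated by semicolons; the counts should roughly correspond to what the bar chart shows:
# How often each cue appeared in a tree across the 10 simulations
cue.names <- unlist(strsplit(as.character(bcancer.fft.sim$fft.level.name), split = ";"))
sort(table(cue.names), decreasing = TRUE)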
If you set roc = T, you’ll see a distribution of hit rates and false-alarm rates for the trees across all simulations. You can also specify which data (training or test) to display with the which.data argument. Here, we can see the distribution of HRs and FARs for the training data:
simfftplot(bcancer.fft.sim,
           roc = T,
           which.data = "train"
)
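If you prefer to build a similar plot by hand, the training hit rates and false-alarm rates are just two columns of the simulation dataframe. Here is a minimal base R sketch (not a replacement for simfftplot()):
# Training performance of the trees in ROC space
plot(bcancer.fft.sim$fft.far.train, bcancer.fft.sim$fft.hr.train,
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "False-alarm rate (FAR)", ylab = "Hit rate (HR)")
abline(0, 1, lty = 2)  # chance performance line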
Now let’s do the testing data. We should expect the trees to do a bit worse here:
simfftplot(bcancer.fft.sim,
           roc = T,
           which.data = "test"
)
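You can also quantify how much worse the trees do in testing by comparing the training and test columns directly:
# Difference between mean fitting and mean prediction performance of the trees
mean(bcancer.fft.sim$fft.hr.train) - mean(bcancer.fft.sim$fft.hr.test)
mean(bcancer.fft.sim$fft.far.train) - mean(bcancer.fft.sim$fft.far.test)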
You can add curves for CART and logistic regression (LR) by including the arguments lr = T and cart = T. Let’s look at the performance of CART and LR compared to the trees for the training data:
simfftplot(bcancer.fft.sim,
           roc = T,
           lr = T,
           cart = T,
           which.data = "train"
)
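For a numeric comparison of the training fits, you can summarize the relevant columns of the simulation dataframe:
# Mean fitting performance of FFTs, logistic regression (lr), and CART across simulations
colMeans(bcancer.fft.sim[, c("fft.hr.train", "fft.far.train",
                             "lr.hr.train", "lr.far.train",
                             "cart.hr.train", "cart.far.train")])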
It looks like LR dominated both CART and FFTs for the training data (in fact, for this simulation, LR always gave a perfect fit). Now let’s look at the test data:
simfftplot(bcancer.fft.sim,
           roc = T,
           lr = T,
           cart = T,
           which.data = "test"
)
Here, we can see that for the testing data, all three algorithms performed similarly well.
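As before, you can back this impression up by summarizing the test columns of the simulation dataframe:
# Mean prediction performance of FFTs, logistic regression (lr), and CART across simulations
colMeans(bcancer.fft.sim[, c("fft.hr.test", "fft.far.test",
                             "lr.hr.test", "lr.far.test",
                             "cart.hr.test", "cart.far.test")])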