The R package fuseMLR offers a framework for late integrative predictive modeling using multi-omics data. This vignette serves as a user guide to use the package effectively. Three key functionalities are provided: variable selection for modality-specific data, late integrative multi-omics training, and predictions for testing multi-omics datasets.
The following example is based on multi-omics simulated data
available in fuseMLR. Data have been simulated using the R
package InterSIM, version 2.2.0. Two types of data were
simulated: training and testing datasets. Each consists of four
data.frames—gene expression, protein expression,
methylation data modalities, and targeted observations. Individuals are
organized in rows, variables in columns, with an additional column for
individual IDs. In total, \(70\)
individuals with \(50\) (not
necessarily overlapping) individuals pro layer have been simulated for
training, and \(23\) (\(20\) per layer) for testing. Effects have
been introduced for across data modalities by shifting the means by
\(0.5\) to create case-control study
with \(50\)% prevalence. For
illustration, the number of variables was kept smaller than what is
typically expected in reality. The data simulation code is available here.
data("multi_omics")
# This list contains two lists of data: training and test.
# Each sublist contains three omics data as described above.
str(object = multi_omics, max.level = 2L)
#> List of 2
#>  $ training:List of 4
#>   ..$ geneexpr   :'data.frame':  50 obs. of  132 variables:
#>   ..$ proteinexpr:'data.frame':  50 obs. of  161 variables:
#>   ..$ methylation:'data.frame':  50 obs. of  368 variables:
#>   ..$ target     :'data.frame':  70 obs. of  2 variables:
#>  $ testing :List of 4
#>   ..$ geneexpr   :'data.frame':  20 obs. of  132 variables:
#>   ..$ proteinexpr:'data.frame':  20 obs. of  161 variables:
#>   ..$ methylation:'data.frame':  20 obs. of  368 variables:
#>   ..$ target     :'data.frame':  30 obs. of  2 variables:The training process is handled in fuseMLR by a
class called Training. We will use
training to refer to an instance from this class. Function
createTraining is available to create an empty (without
layers) training.
training <- createTraining(id = "training",
                           ind_col = "IDS",
                           target = "disease",
                           target_df = multi_omics$training$target,
                           verbose = TRUE)
print(training)
#> Training        : training
#> Problem type    : classification
#> Status          : Not trained
#> Number of layers: 0
#> Layers trained  : 0A training contains modality-specific training layers
(instances of the TrainLayer class) and a meta-layer
(instance of the TrainMetaLayer class). The
modality-specific training layers encapsulate training data modalities,
variable selection functions and learners (i.e. R functions
implementing statistical predicting methods). The meta-layer
encapsulates the meta-learner. The terms modality-specific and
layer-specific are used interchangeably as synonyms in the
following.
Three main functions are necessary to create the
training: createTraining(),
createTrainLayer() and createTrainMetaLayer().
createTraining() creates an empty training, on
which modalitity-specific training layers are added using the function
createTrainLayer(). Function
createTrainMetaLayer() is used to add meta-layer to
training.
The following code adds a gene expression, a protein abundance, and a
methylation layer to training. For illustration purpose we
use the same variable selection method Boruta and the same
learner (ranger) for all layers. These functions fulfill
fuseMLR requirements in terms of arguments and outputs (see
createTrainLayer documentation for details). We expect the
variable selection function to accept two arguments: x (the
predictor design matrix) and y (the response vector). The
function must return a vector of selected variables. Methods that do not
follow this format – such as Vita (via ranger) or Lasso
(via glmnet) – require additional wrapping or a custom
function to extract the selected variables, for example, based on a
significance threshold. An exception, however, is made for the Boruta
function, for which an internal adjustment is implemented; its use
requires no further modifications. For learners, the arguments
x and y are mandatory as well, and the
resulting model must support the generic predict function.
Predictions should be returned either as a vector or a list with a
predictions field containing the predicted values (also a
vector). If the function does not meet the required input and output
criteria, users should follow the steps in E-Interface and wrapping to
create an interface or wrap their function for use with
fuseMLR.
# Create gene expression layer.
createTrainLayer(training = training,
                 train_layer_id = "geneexpr",
                 train_data = multi_omics$training$geneexpr,
                 varsel_package = "Boruta",
                 varsel_fct = "Boruta",
                 varsel_param = list(num.trees = 1000L,
                                     mtry = 3L,
                                     probability = TRUE),
                 lrner_package = "ranger",
                 lrn_fct = "ranger",
                 param_train_list = list(probability = TRUE,
                                         mtry = 1L),
                 param_pred_list = list(),
                 na_action = "na.keep")
#> Training        : training
#> Problem type    : classification
#> Status          : Not trained
#> Number of layers: 1
#> Layers trained  : 0
#> p               : 131
#> n               :  50
#> na.action       : na.keep# Create gene protein abundance layer
createTrainLayer(training = training,
                 train_layer_id = "proteinexpr",
                 train_data = multi_omics$training$proteinexpr,
                 varsel_package = "Boruta",
                 varsel_fct = "Boruta",
                 varsel_param = list(num.trees = 1000L,
                                     mtry = 3L,
                                     probability = TRUE),
                 lrner_package = "ranger",
                 lrn_fct = "ranger",
                 param_train_list = list(probability = TRUE,
                                         mtry = 1L),
                 param_pred_list = list(type = "response"),
                 na_action = "na.keep")
#> Training        : training
#> Problem type    : classification
#> Status          : Not trained
#> Number of layers: 2
#> Layers trained  : 0
#> p               : 131 | 160
#> n               :  50 |  50
#> na.action       : na.keep | na.keep# Create methylation layer
createTrainLayer(training = training,
                 train_layer_id = "methylation",
                 train_data = multi_omics$training$methylation,
                 varsel_package = "Boruta",
                 varsel_fct = "Boruta",
                 varsel_param = list(num.trees = 1000L,
                                     mtry = 3L,
                                     probability = TRUE),
                 lrner_package = "ranger",
                 lrn_fct = "ranger",
                 param_train_list = list(probability = TRUE,
                                         mtry = 1L),
                 param_pred_list = list(),
                 na_action = "na.keep")
#> Training        : training
#> Problem type    : classification
#> Status          : Not trained
#> Number of layers: 3
#> Layers trained  : 0
#> p               : 131 | 160 | 367
#> n               :  50 |  50 |  50
#> na.action       : na.keep | na.keep | na.keepAlso add a meta-layer. We use the weighted mean (internal function to
fuseMLR) as meta-learner. Similarly to learners, a
meta-learner should allow at least the arguments x and
y to pass the design matrix of predictors and should return
a model that allow to call the generic function predict to
make predictions for a new dataset (see documentation of
createTrainMetaLayer for details). If these criteria are
not fulfilled, the explanation provided in E-Interface and wrapping details
how to map the x and y to the original
argument names or to wrap the original function. In Appendix we provide information about available
meta-learners.
We use the weighted mean as the meta-learner. Weighted learners allow the meta-model to be more robust against outliers or noisy data. By adjusting the weights, it can downplay the influence of less reliable models or predictions, thus making the system more stable in the face of unpredictable data. Theoretically, we do not expect this meta-learner to outperform all the modality-specific learners but rather to achieve a performance level between the worst and the best of the modality-specific learners.
# Create meta layer with imputation of missing values.
createTrainMetaLayer(training = training,
                     meta_layer_id = "meta_layer",
                     lrner_package = NULL,
                     lrn_fct = "weightedMeanLearner",
                     param_train_list = list(),
                     param_pred_list = list(na_rm = TRUE),
                     na_action = "na.rm")
#> Training        : training
#> Problem type    : classification
#> Status          : Not trained
#> Number of layers: 4
#> Layers trained  : 0
#> p               : 131 | 160 | 367
#> n               :  50 |  50 |  50
#> na.action       : na.keep | na.keep | na.keep
print(training)
#> Training        : training
#> Problem type    : classification
#> Status          : Not trained
#> Number of layers: 4
#> Layers trained  : 0
#> p               : 131 | 160 | 367
#> n               :  50 |  50 |  50
#> na.action       : na.keep | na.keep | na.keepFunction upsetplot() is available to generate an upset
of the training data, i.e. an overview how patients overlap across
layers.
upsetplot(object = training, order.by = "freq")
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the UpSetR package.
#>   Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> ℹ The deprecated feature was likely used in the UpSetR package.
#>   Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
#> ℹ Please use the `linewidth` argument instead.
#> ℹ The deprecated feature was likely used in the UpSetR package.
#>   Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.Function varSelection() performs modality-specific
variable selection. So, a user can opt to conduct variable selection
separately without training.
# Variable selection
set.seed(5467)
var_sel_res <- varSelection(training = training)
#> Variable selection on layer geneexpr started.
#> Variable selection on layer geneexpr done.
#> Variable selection on layer proteinexpr started.
#> Variable selection on layer proteinexpr done.
#> Variable selection on layer methylation started.
#> Variable selection on layer methylation done.
print(var_sel_res)
#>          Layer   variable
#> 1     geneexpr       ASNS
#> 2     geneexpr        ATM
#> 3     geneexpr       BRAF
#> 4     geneexpr       DVL3
#> 5     geneexpr      FOXM1
#> 6     geneexpr     MAP2K1
#> 7     geneexpr     MAPK14
#> 8     geneexpr        NF2
#> 9     geneexpr     NOTCH1
#> 10    geneexpr       PCNA
#> 11    geneexpr      PEA15
#> 12    geneexpr      PRDX1
#> 13    geneexpr      PRKCD
#> 14    geneexpr      RBM15
#> 15    geneexpr       SHC1
#> 16    geneexpr        SRC
#> 17    geneexpr    TP53BP1
#> 18    geneexpr      XRCC5
#> 19    geneexpr       YBX1
#> 20 proteinexpr      GAPDH
#> 21 methylation cg01894895
#> 22 methylation cg14785449
#> 23 methylation cg24747396
#> 24 methylation cg03813215
#> 25 methylation cg25059899
#> 26 methylation cg16225441
#> 27 methylation cg23935746
#> 28 methylation cg17655614
#> 29 methylation cg23989635
#> 30 methylation cg22679003
#> 31 methylation cg10936763
#> 32 methylation cg01663570
#> 33 methylation cg25500285
#> 34 methylation cg02720618
#> 35 methylation cg23244421
#> 36 methylation cg09976774
#> 37 methylation cg18470891
#> 38 methylation cg20484306
#> 39 methylation cg02412050
#> 40 methylation cg16293088
#> 41 methylation cg20042228
#> 42 methylation cg07566050
#> 43 methylation cg23641145
#> 44 methylation cg10786880
#> 45 methylation cg13908523
#> 46 methylation cg20849549
#> 47 methylation cg18331396
#> 48 methylation cg08383063
#> 49 methylation cg16153267
#> 50 methylation cg19254235
#> 51 methylation cg20406374
#> 52 methylation cg22175811
#> 53 methylation cg19393006
#> 54 methylation cg12507125
#> 55 methylation cg01442799Let us display the training object again to see the update on variable level.
print(training)
#> Training        : training
#> Problem type    : classification
#> Status          : Not trained
#> Number of layers: 4
#> Layers trained  : 0
#> p               : 19 |  1 | 35
#> n               : 50 | 50 | 50
#> na.action       : na.keep | na.keep | na.keepFor each layer, the variable selection results show the chosen variables.
We use the function fusemlr() to train our models using
the subset of selected variables. Here we set
use_var_sel = TRUE to use variables obtained from the
variable selection step.
set.seed(5462)
fusemlr(training = training,
        use_var_sel = TRUE)
#> Creating fold predictions.
#>   |                                                                              |                                                                      |   0%  |                                                                              |=======                                                               |  10%  |                                                                              |==============                                                        |  20%  |                                                                              |=====================                                                 |  30%  |                                                                              |============================                                          |  40%  |                                                                              |===================================                                   |  50%  |                                                                              |==========================================                            |  60%  |                                                                              |=================================================                     |  70%  |                                                                              |========================================================              |  80%  |                                                                              |===============================================================       |  90%  |                                                                              |======================================================================| 100%
#> Training of base model on layer geneexpr started.
#> Training of base model on layer geneexpr done.
#> Training of base model on layer proteinexpr started.
#> Training of base model on layer proteinexpr done.
#> Training of base model on layer methylation started.
#> Training of base model on layer methylation done.The display of the training object now updates information about the trained layers.
print(training)
#> Training        : training
#> Problem type    : classification
#> Status          : Trained
#> Number of layers: 4
#> Layers trained  : 4
#> Var. sel. used  : Yes
#> p               : 19 |  1 | 35 |  3
#> n               : 50 | 50 | 50 | 26
#> na.action       : na.keep | na.keep | na.keep | na.rmWe can also display a summary of training to see more
details on layer levels. Information about the training data modality,
the variable selection method and the learner stored at each layer are
displayed.
summary(training)
#> Training training
#> ----------------
#> Training        : training
#> Problem type    : classification
#> Status          : Trained
#> Number of layers: 4
#> Layers trained  : 4
#> Var. sel. used  : Yes
#> p               : 19 |  1 | 35 |  3
#> n               : 50 | 50 | 50 | 26
#> na.action       : na.keep | na.keep | na.keep | na.rm
#> ----------------
#> 
#>    Layer geneexpr
#>    ----------------
#>    TrainLayer            : geneexpr
#>    Status                : Trained
#>    Nb. of objects stored : 4
#>    ----------------
#>    Object(s) on layer geneexpr
#> 
#>       ----------------
#>       TrainData : geneexpr_data
#>       Layer      : geneexpr
#>       Ind. id.   : IDS
#>       Target     : disease
#>       n          : 50
#>       Missing    : 0
#>       p          : 19
#>       ----------------
#> 
#>       ----------------
#>       VarSel           : geneexpr_varsel
#>       TrainLayer       : geneexpr
#>       Package          : Boruta
#>       Function         : Boruta
#>       ----------------
#> 
#>       ----------------
#>       Learner          : geneexpr_lrner
#>       TrainLayer       : geneexpr
#>       Package          : ranger
#>       Learn function   : ranger
#>       ----------------
#> 
#> 
#>    Layer proteinexpr
#>    ----------------
#>    TrainLayer            : proteinexpr
#>    Status                : Trained
#>    Nb. of objects stored : 4
#>    ----------------
#>    Object(s) on layer proteinexpr
#> 
#>       ----------------
#>       TrainData : proteinexpr_data
#>       Layer      : proteinexpr
#>       Ind. id.   : IDS
#>       Target     : disease
#>       n          : 50
#>       Missing    : 0
#>       p          : 1
#>       ----------------
#> 
#>       ----------------
#>       VarSel           : proteinexpr_varsel
#>       TrainLayer       : proteinexpr
#>       Package          : Boruta
#>       Function         : Boruta
#>       ----------------
#> 
#>       ----------------
#>       Learner          : proteinexpr_lrner
#>       TrainLayer       : proteinexpr
#>       Package          : ranger
#>       Learn function   : ranger
#>       ----------------
#> 
#> 
#>    Layer methylation
#>    ----------------
#>    TrainLayer            : methylation
#>    Status                : Trained
#>    Nb. of objects stored : 4
#>    ----------------
#>    Object(s) on layer methylation
#> 
#>       ----------------
#>       TrainData : methylation_data
#>       Layer      : methylation
#>       Ind. id.   : IDS
#>       Target     : disease
#>       n          : 50
#>       Missing    : 0
#>       p          : 35
#>       ----------------
#> 
#>       ----------------
#>       VarSel           : methylation_varsel
#>       TrainLayer       : methylation
#>       Package          : Boruta
#>       Function         : Boruta
#>       ----------------
#> 
#>       ----------------
#>       Learner          : methylation_lrner
#>       TrainLayer       : methylation
#>       Package          : ranger
#>       Learn function   : ranger
#>       ----------------
#> 
#> 
#>    MetaLayer
#>    ----------------
#>    TrainMetaLayer    : meta_layer
#>    Status            : Trained
#>    Nb. of objects stored : 3
#> 
#>    ----------------
#>    Object(s) on MetaLayer
#> 
#>       ----------------
#>       Learner          : meta_layer_lrner
#>       TrainLayer       : meta_layer
#>       Learn function   : weightedMeanLearner
#>       ----------------
#> 
#>       ----------------
#>       TrainData : modality-specific predictions
#>       Layer      : meta_layer
#>       Ind. id.   : IDS
#>       Target     : disease
#>       n          : 26
#>       Missing    : 0
#>       p          : 3
#>       ----------------We use extractModel() to retrieve the list of stored
models and extractData() to retrieve training data.
models_list <- extractModel(training = training)
str(object = models_list, max.level = 1L)
#> List of 4
#>  $ geneexpr   :List of 14
#>  $ proteinexpr:List of 14
#>  $ methylation:List of 14
#>  $ meta_layer : 'weightedMeanLearner' Named num [1:3] 0.599 0.157 0.244
#>   ..- attr(*, "names")= chr [1:3] "geneexpr" "proteinexpr" "methylation"Three random forests and one weighted meta-model trained on each layer are returned. The smallest weight is assigned to protein abundance, while the highest is given to gene expression.
data_list <- extractData(object = training)
str(object = data_list, max.level = 1)
#> List of 4
#>  $ geneexpr   :'data.frame': 50 obs. of  21 variables:
#>  $ proteinexpr:'data.frame': 50 obs. of  3 variables:
#>  $ methylation:'data.frame': 50 obs. of  37 variables:
#>  $ meta_layer :'data.frame': 26 obs. of  5 variables:The three simulated training modalities and the meta-data are returned.
In this section, we create a testing instance (from the
Testing class) and make predictions for new data. This is done
analogously to training. Only the testing data modalities
are required. Relevant functions are createTesting() and
createTestLayer().
# Create gene expression layer
createTestLayer(testing = testing,
                test_layer_id = "geneexpr",
                test_data = multi_omics$testing$geneexpr)
#> Testing         : testing
#> Number of layers: 1
#> p               : 131
#> n               :  20# Create gene protein abundance layer
createTestLayer(testing = testing,
                test_layer_id = "proteinexpr",
                test_data = multi_omics$testing$proteinexpr)
#> Testing         : testing
#> Number of layers: 2
#> p               : 131 | 160
#> n               :  20 |  20# Create methylation layer
createTestLayer(testing = testing,
                test_layer_id = "methylation",
                test_data = multi_omics$testing$methylation)
#> Testing         : testing
#> Number of layers: 3
#> p               : 131 | 160 | 367
#> n               :  20 |  20 |  20A summary of testing.
summary(testing)
#> Testing testing
#> ----------------
#> Testing         : testing
#> Number of layers: 3
#> p               : 131 | 160 | 367
#> n               :  20 |  20 |  20
#> ----------------
#> 
#> Class     : TestData
#> name      : geneexpr_data
#> ind. id.  : IDS
#> n         : 20
#> p         : 132
#> 
#> 
#> Class     : TestData
#> name      : proteinexpr_data
#> ind. id.  : IDS
#> n         : 20
#> p         : 161
#> 
#> 
#> Class     : TestData
#> name      : methylation_data
#> ind. id.  : IDS
#> n         : 20
#> p         : 368A look on testing data.
data_list <- extractData(object = testing)
str(object = data_list, max.level = 1)
#> List of 3
#>  $ geneexpr   :'data.frame': 20 obs. of  132 variables:
#>  $ proteinexpr:'data.frame': 20 obs. of  161 variables:
#>  $ methylation:'data.frame': 20 obs. of  368 variables:An upset plot to visualize patient overlap across testing layers.
Function predict() is available for predicting.
predictions <- predict(object = training, testing = testing)
print(predictions)
#> $predicting
#> Predicting   : testing
#> Nb. layers   : 4
#> 
#> $predicted_values
#>               IDS  geneexpr proteinexpr methylation meta_layer
#> 1  participant100        NA   0.6326063   0.1251825  0.3242845
#> 2   participant20        NA          NA   0.1571294  0.1571294
#> 3   participant24 0.7400333   0.6601421   0.6843302  0.7138681
#> 4   participant25 0.3868008          NA          NA  0.3868008
#> 5   participant27 0.3497389          NA   0.3822556  0.3591507
#> 6   participant28 0.6292540   0.1459032          NA  0.5285877
#> 7    participant3 0.6919667   0.8694921          NA  0.7289395
#> 8   participant32 0.1178167   0.4782968   0.2063500  0.1961747
#> 9   participant34 0.5838603   0.8426365   0.7550254  0.6663532
#> 10  participant39 0.5797437          NA   0.7959706  0.6423299
#> 11  participant42 0.7620048   0.5215286   0.4115976  0.6386829
#> 12  participant51        NA   0.2175794          NA  0.2175794
#> 13  participant53        NA   0.5215286          NA  0.5215286
#> 14  participant54 0.3963286          NA          NA  0.3963286
#> 15  participant55 0.4559016          NA          NA  0.4559016
#> 16   participant6        NA   0.2678444   0.7796278  0.5788153
#> 17  participant63        NA   0.5215286   0.2305881  0.3447467
#> 18  participant64 0.2425183   0.8734579          NA  0.3739225
#> 19  participant68 0.6432230          NA   0.8299405  0.6972678
#> 20  participant71 0.5282675   0.1441889   0.5876659  0.4822686
#> 21  participant75        NA   0.6960524   0.2517754  0.4260999
#> 22  participant77 0.2102119   0.9070841   0.1535889  0.3061458
#> 23  participant79 0.3001992   0.2175794          NA  0.2829922
#> 24  participant81 0.2856817   0.3816246   0.2459389  0.2910988
#> 25  participant84 0.7755270   0.4325476   0.6476595  0.6903327
#> 26  participant86 0.5468381          NA   0.6990262  0.5908885
#> 27  participant94 0.4678897   0.4060643   0.7376230  0.5239321
#> 28  participant97        NA          NA   0.2271817  0.2271817
#> 29  participant98        NA   0.1441889   0.1903873  0.1722601Prediction performances for layer-specific levels and the meta-layer are estimated. Brier Score (BS) is utilized to assess calibration performance and the Area Under the Curve (AUC) to evaluate classification accuracy.
pred_values <- predictions$predicted_values
actual_pred <- merge(x = pred_values,
                     y = multi_omics$testing$target,
                     by = "IDS",
                     all.y = TRUE)
y <- as.numeric(actual_pred$disease == "1")
# On all patients
perf_bs <- sapply(X = actual_pred[ , 2L:5L], FUN = function (my_pred) {
  bs <- mean((y[complete.cases(my_pred)] - my_pred[complete.cases(my_pred)])^2)
  roc_obj <- pROC::roc(y[complete.cases(my_pred)], my_pred[complete.cases(my_pred)])
  auc <- pROC::auc(roc_obj)
  performances = rbind(bs, auc)
  return(performances)
})
rownames(perf_bs) <- c("BS", "AUC")
print(perf_bs)
#>      geneexpr proteinexpr methylation meta_layer
#> BS  0.1304363   0.3260078  0.07980678  0.1286471
#> AUC 1.0000000   0.5350000  1.00000000  1.0000000As expected, the performance of the meta-learner in terms of Brier Score is not the best; it falls between the worst and best modality-specific performance measures. For AUC, the meta-learner performs as well as the best modality-specific learner. We observe that performance on protein abundance is the worst, consistent with its lowest assigned weight. There is a reversal in the weight order between the methylation layer and gene expression. However, the performance difference between these two modalities is relatively small, approximately \(0.05\).
This section explains how to resolve argument discrepancies when the
original function does not conform to the fuseMLR format
introduced in C.1 - Creating a
training. These discrepancies can occur in input (argument names) or
output formats of the user-provided functions.
At the input level, we distinguish common supervised learning
arguments from method specific arguments. The common arguments are a
matrix x of independent variables and y
representing a response variable. If the provided original function
(variable selection or learning function) does not have these two
arguments, then discrepancies must be resolved. Moreover, the
predict function extending the generic R
predict function must allow arguments object and
data. If this is not the case, discrepancies must be
resolved.
The provided learner must return a model compatible with an extension
of the generic predict function (e.g.,
predict.glmnet for glmnet models). The
predict function should return either a vector of
predictions or a list with a predictions field
(a vector of predicted values). For binary classification (classes \(0\) and \(1\)), learners can return a two-column
matrix or data.frame of class probabilities,
where the second column represents the probability of class \(1\) (used by fuseMLR) and the
first column that of \(0\) (ignored by
fuseMLR). For variable selection, the output should be a
vector of selected variables. If these criteria are not met,
discrepancies must be resolved.
We offer two main ways to resolve discrepancies: either using an interface or a wrapping of the original function.
The interface approach maps the argument names of the original
learning function using arguments of createTrainLayer(). In
the example below, the gene expression layer is recreated using the
svm function from the e1071 package as the
learner. A discrepancy arises because predict.svm uses
object and newdata as argument names. We also
provide a function using the argument extract_pred_fct of
createTrainLayer to extract the predicted values. Similar
arguments are also available for the createTrainMetaLayer
function to generate meta-layer.
# Re-create the gene expression layer with support vector machine as learner.
createTrainLayer(training = training,
                 train_layer_id = "geneexpr",
                 train_data = multi_omics$training$geneexpr,
                 varsel_package = "Boruta",
                 varsel_fct = "Boruta",
                 varsel_param = list(num.trees = 1000L,
                                     mtry = 3L,
                                     probability = TRUE),
                 lrner_package = "e1071",
                 lrn_fct = "svm",
                 param_train_list = list(type = 'C-classification',
                                         kernel = 'radial',
                                         probability = TRUE),
                 param_pred_list = list(probability = TRUE),
                 na_action = "na.rm",
                 x_lrn = "x",
                 y_lrn = "y",
                 object = "object",
                 data = "newdata", # Name discrepancy resolved.
                 extract_pred_fct = function (pred) { 
                   pred <- attr(pred, "probabilities")
                   return(pred[ , 1L])
                 }
)
# Variable selection
set.seed(5467)
var_sel_res <- varSelection(training = training)
set.seed(5462)
training <- fusemlr(training = training,
                    use_var_sel = TRUE)
print(training)In the wrapping approach we define a new function
mylasso to run a Lasso regression from the
glmnet package as the meta-leaner.
# We wrap the original functions
mylasso <- function (x, y,
                     nlambda = 25,
                     nfolds = 5) {
  # Perform cross-validation to find the optimal lambda
  cv_lasso <- glmnet::cv.glmnet(x = as.matrix(x), y = y,
                        family = "binomial",
                        type.measure = "deviance",
                        nfolds = nfolds)
  best_lambda <- cv_lasso$lambda.min
  lasso_best <- glmnet::glmnet(x = as.matrix(x), y = y,
                       family = "binomial",
                       alpha = 1,
                       lambda = best_lambda
  )
  lasso_model <- list(model = lasso_best)
  class(lasso_model) <- "mylasso"
  return(lasso_model)
}predict.# We extend the generic predict function mylasso. 
predict.mylasso <- function (object, data) {
  glmnet_pred <- predict(object = object$model,
                         newx = as.matrix(data),
                         type = "response",
                         s = object$model$lambda)
  return(as.vector(glmnet_pred))
}
# Re-create the gene expression layer with support vector machine as learner.
createTrainMetaLayer(training = training,
                     meta_layer_id = "meta_layer",
                     lrner_package = NULL,
                     lrn_fct = "mylasso",
                     param_train_list = list(nlambda = 100L),
                     na_action = "na.impute")
set.seed(5462)
training <- fusemlr(training = training,
                    use_var_sel = TRUE)
print(training)# Re-create the gene expression layer with support vector machine as learner.
createTrainMetaLayer(training = training,
                     meta_layer_id = "meta_layer",
                     lrner_package = NULL,
                     lrn_fct = "mylasso",
                     param_train_list = list(nlambda = 100L),
                     na_action = "na.impute")
set.seed(5462)
training <- fusemlr(training = training,
                    use_var_sel = TRUE)
print(training)In addition to any pre-existing learner in R as a meta-learner, we have implemented the following ones.
| Leaner | Description | 
|---|---|
| weightedMeanLearner | The weighted mean meta learner. It uses modality-specific predictions to estimate the weights of the modality-specific models | 
| bestLayerLearner | The best layer-specific model is used as meta model. | 
| cobra | cobra implements the COBRA (COmBined Regression Alternative), an aggregation method for combining predictions from multiple individual learners | 
Fouodo, C. J. K., Bleskina, M. & Szymczak, S. fuseMLR: an R package for integrative prediction modeling of multi-omics data. BMC Bioinformatics 26, 221 (2025). https://doi.org/10.1186/s12859-025-06248-4
Wright MN, Ziegler A.. Ranger: a fast implementation of random forests for high dimensional data in C ++ and R. J Stat Softw 2017;77:1–17, https://doi.org/10.18637/jss.v077.i01.
Tay JK, Narasimhan B, Hastie T.. (2023). Elastic Net Regularization Paths for All Generalized Linear Models. Journal of Statistical Software, 106(1), 1-31, https://doi.org/10.18637/jss.v106.i01.
Kursa M, Rudnicki W.. Feature selection with the Boruta package. J Stat Softw 2010;36:1–13, https://doi.org/10.18637/jss.v036.i11.
Janitza S, Celik E, Boulesteix A-L.. A computationally fast variable importance test for random forests for high-dimensional data. Adv Data Anal Classif 2016, in press, https://doi.org/10.1007/s11634-016-0276-4
Biau, G., Fischer, A., Guedj, B., & Malley, J. D. (2014). COBRA: A combined regression strategy. The Journal of Multivariate Analysis 46:18-28, https://doi.org/10.1016/j.jmva.2015.04.007.
© 2024 Institute of Medical Biometry and Statistics (IMBS). All rights reserved.