| Title: | End-to-End Automated Machine Learning and Model Evaluation | 
| Version: | 1.5.0 | 
| Description: | Single unified interface for end-to-end modelling of regression, 
    categorical and time-to-event (survival) outcomes. Models created using
    familiar are self-containing, and their use does not require additional
    information such as baseline survival, feature clustering, or feature
    transformation and normalisation parameters. Model performance,
    calibration, risk group stratification, (permutation) variable importance,
    individual conditional expectation, partial dependence, and more, are
    assessed automatically as part of the evaluation process and exported in
    tabular format and plotted, and may also be computed manually using export
    and plot functions. Where possible, metrics and values obtained during the
    evaluation process come with confidence intervals. | 
| URL: | https://github.com/alexzwanenburg/familiar | 
| BugReports: | https://github.com/alexzwanenburg/familiar/issues | 
| Depends: | R (≥ 4.0.0) | 
| License: | EUPL version 1.1 | EUPL version 1.2 [expanded from: EUPL] | 
| Encoding: | UTF-8 | 
| RoxygenNote: | 7.3.2 | 
| VignetteBuilder: | knitr | 
| Imports: | data.table, methods, rlang (≥ 0.3.4), rstream, survival | 
| Suggests: | BART, callr (≥ 3.4.3), cluster, CORElearn, coro,
dynamicTreeCut, e1071 (≥ 1.7.5), Ecdat, fastcluster, fastglm,
ggplot2 (≥ 3.0.0), glmnet, gtable, harmonicmeanp, isotree (≥
0.3.0), knitr, labeling, laGP, MASS, maxstat, mboost (≥
2.9.0), microbenchmark, nnet, partykit, power.transform,
praznik, proxy, qvalue, randomForestSRC, ranger, rmarkdown,
scales, testthat (≥ 3.0.0), xml2, VGAM, xgboost | 
| Collate: | 'FamiliarS4Classes.R' 'FamiliarS4Generics.R'
'BatchNormalisation.R' 'BootstrapConfidenceInterval.R'
'CheckArguments.R' 'CheckHyperparameters.R' 'CheckPackages.R'
'ClassBalance.R' 'ClusteringMethod.R' 'Clustering.R'
'ClusterRepresentation.R' 'Normalisation.R'
'CombatNormalisation.R' 'LearnerS4Naive.R' 'DataObject.R'
'DataParameterChecks.R' 'DataPreProcessing.R'
'DataProcessing.R' 'DataServerBackend.R' 'ErrorMessages.R'
'Evaluation.R' 'ExperimentData.R' 'ExperimentSetup.R'
'Familiar.R' 'FamiliarCollection.R'
'FamiliarCollectionExport.R' 'FamiliarData.R'
'FamiliarDataComputation.R'
'FamiliarDataComputationAUCCurves.R'
'FamiliarDataComputationCalibrationData.R'
'FamiliarDataComputationCalibrationInfo.R'
'FamiliarDataComputationConfusionMatrix.R'
'FamiliarDataComputationDecisionCurveAnalysis.R'
'FamiliarDataComputationFeatureExpression.R'
'FamiliarDataComputationFeatureSimilarity.R'
'FamiliarDataComputationHyperparameters.R'
'FamiliarDataComputationICE.R'
'FamiliarDataComputationModelPerformance.R'
'FamiliarDataComputationPermutationVimp.R'
'FamiliarDataComputationPredictionData.R'
'FamiliarDataComputationRiskStratificationData.R'
'FamiliarDataComputationRiskStratificationInfo.R'
'FamiliarDataComputationSampleSimilarity.R'
'FamiliarDataComputationUnivariateAnalysis.R'
'FamiliarDataComputationVimp.R' 'FamiliarDataElement.R'
'FamiliarEnsemble.R' 'FamiliarHyperparameterLearner.R'
'FamiliarModel.R' 'FamiliarNoveltyDetector.R'
'FamiliarObjectConversion.R' 'Transformation.R'
'FamiliarObjectUpdate.R' 'FamiliarSharedS4Methods.R'
'FamiliarVimpMethod.R' 'FeatureInfo.R'
'FeatureInfoParameters.R' 'FeatureSelection.R'
'FunctionWrapperUtilities.R' 'HyperparameterOptimisation.R'
'HyperparameterOptimisationMetaLearners.R'
'HyperparameterOptimisationUtilities.R'
'HyperparameterS4BayesianAdditiveRegressionTrees.R'
'HyperparameterS4GaussianProcess.R'
'HyperparameterS4RandomSearch.R' 'HyperparameterS4Ranger.R'
'Imputation.R' 'Iterations.R' 'LearnerMain.R'
'LearnerRecalibration.R' 'LearnerS4Cox.R' 'LearnerS4GLM.R'
'LearnerS4GLMnet.R' 'LearnerS4KNN.R' 'LearnerS4MBoost.R'
'LearnerS4NaiveBayes.R' 'LearnerS4RFSRC.R' 'LearnerS4Ranger.R'
'LearnerS4SVM.R' 'LearnerS4SurvivalRegression.R'
'LearnerS4XGBoost.R' 'LearnerSurvivalGrouping.R'
'LearnerSurvivalProbability.R' 'Logger.R' 'MetricS4.R'
'MetricS4AUC.R' 'MetricS4Brier.R' 'MetricS4ConcordanceIndex.R'
'MetricS4ConfusionMatrixMetrics.R' 'MetricS4Regression.R'
'ModelBuilding.R' 'NoveltyDetectorS4IsolationTree.R'
'NoveltyDetectorMain.R'
'NoveltyDetectorS4NoneNoveltyDetector.R' 'OutcomeInfo.R'
'PairwiseSimilarity.R' 'ParallelFunctions.R' 'ParseData.R'
'ParseSettings.R' 'PlotAUCcurves.R' 'PlotAll.R'
'PlotCalibration.R' 'PlotColours.R' 'PlotConfusionMatrix.R'
'PlotDecisionCurves.R' 'PlotFeatureRanking.R'
'PlotFeatureSimilarity.R' 'PlotGTable.R' 'PlotICE.R'
'PlotInputArguments.R' 'PlotKaplanMeier.R'
'PlotModelPerformance.R' 'PlotPermutationVariableImportance.R'
'PlotSampleClustering.R' 'PlotUnivariateImportance.R'
'PlotUtilities.R' 'PredictS4Methods.R' 'ProcessTimeUtilities.R'
'Random.R' 'RandomGrouping.R' 'RankBordaAggregation.R'
'RankMain.R' 'RankSimpleAggregation.R'
'RankStabilityAggregation.R' 'SocketServer.R'
'StringUtilities.R' 'TestDataCreators.R' 'TestFunctions.R'
'TrainS4Methods.R' 'TrimUtilities.R' 'Utilities.R'
'UtilitiesS4.R' 'VimpMain.R' 'VimpS4Concordance.R'
'VimpS4CoreLearn.R' 'VimpS4Correlation.R'
'VimpS4MutualInformation.R' 'VimpS4OtherMethods.R'
'VimpS4Regression.R' 'VimpTable.R' 'aaa.R' | 
| Config/testthat/parallel: | true | 
| Config/testthat/edition: | 3 | 
| NeedsCompilation: | no | 
| Packaged: | 2024-09-23 15:26:57 UTC; alexz | 
| Author: | Alex Zwanenburg  [aut, cre],
  Steffen Löck [aut],
  Stefan Leger [ctb],
  Iram Shahzadi [ctb],
  Asier Rabasco Meneghetti [ctb],
  Sebastian Starke [ctb],
  Technische Universität Dresden [cph],
  German Cancer Research Center (DKFZ) [cph] | 
| Maintainer: | Alex Zwanenburg <alexander.zwanenburg@nct-dresden.de> | 
| Repository: | CRAN | 
| Date/Publication: | 2024-09-23 15:50:02 UTC | 
familiar: Fully Automated Machine Learning with Interpretable Analysis of Results
Description
End-to-end, automated machine learning package for creating
trustworthy and interpretable machine learning models. Familiar supports
modelling of regression, categorical and time-to-event (survival) outcomes.
Models created using familiar are self-containing, and their use does not
require additional information such as baseline survival, feature
clustering, or feature transformation and normalisation parameters. In
addition, a novelty or out-of-distribution detector is trained
simultaneously and contained with every model. Model performance,
calibration, risk group stratification, (permutation) variable importance,
individual conditional expectation, partial dependence, and more, are
assessed automatically as part of the evaluation process and exported in
tabular format and plotted, and may also be computed manually using export
and plot functions. Where possible, metrics and values obtained during the
evaluation process come with confidence intervals.
Author(s)
Maintainer: Alex Zwanenburg alexander.zwanenburg@nct-dresden.de (ORCID)
Authors:
-  Steffen Löck
 
Other contributors:
-  Stefan Leger [contributor]
 
-  Iram Shahzadi [contributor]
 
-  Asier Rabasco Meneghetti [contributor]
 
-  Sebastian Starke [contributor]
 
-  Technische Universität Dresden [copyright holder]
 
-  German Cancer Research Center (DKFZ) [copyright holder]
 
See Also
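Useful links:
-  https://github.com/alexzwanenburg/familiar
 
-  Report bugs at https://github.com/alexzwanenburg/familiar/issues
 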
Internal function to test plausibility of provided class levels
Description
This function checks whether categorical levels are present in the data that
are not found in the user-provided class levels.
Usage
.check_class_level_plausibility(
  data,
  outcome_type,
  outcome_column,
  class_levels,
  check_stringency = "strict"
)
Arguments
| data | Data set as loaded using the .load_data function. | 
| outcome_type | (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from
contents of the outcome column. This may lead to unexpected results, and we
therefore advise to provide this information manually.
 Note that competing_risk survival analysis is not fully supported, and
is currently not a valid choice for outcome_type. | 
| outcome_column | (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that survival
and competing_risk outcome types require two columns that
indicate the time-to-event or the time of last follow-up and the event
status. | 
| class_levels | (optional) Class levels for binomial or multinomial
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column. | 
| check_stringency | Specifies stringency of various checks. One of three values:
 
 strict: default value used for summon_familiar. Thoroughly checks
input data. Used internally for checking development data.
 external_warn: value used for extract_data and related methods. Less
stringent checks, but will warn for possible issues. Used internally for
checking data for evaluation and explanation.
 external: value used for external methods such as predict. Less
stringent checks, particularly for identifier and outcome columns, which may
be completely absent. Used internally for predict.
 | 
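The check described above can be illustrated with a minimal base R sketch. The helper name `find_unexpected_levels` is hypothetical and is not part of familiar; it only shows the idea of comparing the levels found in the outcome column against the user-provided class levels.

```r
# Hypothetical helper (an assumption for illustration, not familiar's
# internal implementation).
find_unexpected_levels <- function(x, class_levels) {
  # Levels observed in the outcome column, ignoring missing values.
  observed <- unique(as.character(x[!is.na(x)]))

  # Levels present in the data that the user did not provide.
  setdiff(observed, class_levels)
}

outcome <- c("benign", "malignant", "benign", "unknown", NA)
unexpected <- find_unexpected_levels(
  outcome,
  class_levels = c("benign", "malignant")
)

# A strict check would raise an error here; a lenient one might only warn.
if (length(unexpected) > 0) {
  message("Levels not in class_levels: ", paste(unexpected, collapse = ", "))
}
```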
Internal function to check whether feature columns are found in the data
Description
This function checks whether feature columns can be found in the data set.
It will raise an error if any feature columns are missing from the data set.
Usage
.check_feature_availability(data, feature)
Arguments
| data | Data set as loaded using the .load_data function. | 
| feature | Character string(s) indicating one or more features. | 
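As a rough sketch of this kind of check, the following base R snippet raises an error when requested feature columns are absent. The function name `check_features_present` and the example columns are assumptions made for illustration; a plain data.frame is used instead of the data object that familiar works with internally.

```r
# Hypothetical helper (an assumption for illustration, not familiar's
# internal implementation).
check_features_present <- function(data, feature) {
  # Requested features that do not appear among the column names.
  missing_features <- setdiff(feature, colnames(data))

  if (length(missing_features) > 0) {
    stop(
      "Features not found in the data set: ",
      paste(missing_features, collapse = ", ")
    )
  }

  invisible(TRUE)
}

data <- data.frame(age = c(54, 61), stage = c("I", "II"))

# Passes silently because both columns exist.
check_features_present(data, c("age", "stage"))

# Fails because "grade" is missing; the error message is captured here.
result <- tryCatch(
  check_features_present(data, c("age", "grade")),
  error = function(e) conditionMessage(e)
)
```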
Description
This function checks whether an identifier column is consistent, i.e. that
it exists, that there is only one, and that there is no overlap with any
user-provided feature columns, other identifier columns, or outcome columns.
Usage
.check_input_identifier_column(
  id_column,
  data,
  signature = NULL,
  exclude_features = NULL,
  include_features = NULL,
  other_id_column = NULL,
  outcome_column = NULL,
  col_type,
  check_stringency = "strict"
)
Arguments
| id_column | Character string indicating the currently inspected
identifier column. | 
| data | Data set as loaded using the .load_data function. | 
| signature | (optional) One or more names of feature columns that are
considered part of a specific signature. Features specified here will
always be used for modelling. Ranking from feature selection has no effect
for these features. | 
| exclude_features | (optional) Feature columns that will be removed
from the data set. Cannot overlap with features in signature,
novelty_features or include_features. | 
| include_features | (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap signature. Features in
signature and novelty_features are always included. If both
exclude_features and include_features are provided, include_features
takes precedence, provided that there is no overlap between the two. | 
| other_id_column | Character string indicating another identifier column. | 
| outcome_column | Character string indicating the outcome column(s). | 
| col_type | Character string indicating the type of column, i.e. sample or batch. | 
| check_stringency | Specifies stringency of various checks. One of three values:
 
 strict: default value used for summon_familiar. Thoroughly checks
input data. Used internally for checking development data.
 external_warn: value used for extract_data and related methods. Less
stringent checks, but will warn for possible issues. Used internally for
checking data for evaluation and explanation.
 external: value used for external methods such as predict. Less
stringent checks, particularly for identifier and outcome columns, which may
be completely absent. Used internally for predict.
 | 
Description
Internal checks on common plot input arguments
Usage
.check_input_plot_args(
  x_range = waiver(),
  y_range = waiver(),
  x_n_breaks = waiver(),
  y_n_breaks = waiver(),
  x_breaks = waiver(),
  y_breaks = waiver(),
  conf_int = waiver(),
  conf_int_alpha = waiver(),
  conf_int_style = waiver(),
  conf_int_default = c("step", "ribbon", "none"),
  facet_wrap_cols = waiver(),
  x_label = waiver(),
  y_label = waiver(),
  x_label_shared = waiver(),
  y_label_shared = waiver(),
  rotate_x_tick_labels = waiver(),
  rotate_y_tick_labels = waiver(),
  legend_label = waiver(),
  combine_legend = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = waiver()
)
Arguments
| x_range | (optional) Value range for the x-axis. | 
| y_range | (optional) Value range for the y-axis. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| conf_int | (optional) | 
| conf_int_alpha | (optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed. | 
| conf_int_style | (optional) Confidence interval style. See details for
allowed styles. | 
| conf_int_default | Sets the default options for the confidence interval. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| x_label_shared | (optional) Sharing of x-axis labels between facets.
One of three values:
 
 overall: A single label is placed at the bottom of the figure. Tick
text (but not the ticks themselves) is removed for all but the bottom facet
plot(s).
 column: A label is placed at the bottom of each column. Tick text (but
not the ticks themselves) is removed for all but the bottom facet plot(s).
 individual: A label is placed below each facet plot. Tick text is kept.
 | 
| y_label_shared | (optional) Sharing of y-axis labels between facets.
One of three values:
 
 overall: A single label is placed to the left of the figure. Tick text
(but not the ticks themselves) is removed for all but the left-most facet
plot(s).
 row: A label is placed to the left of each row. Tick text (but not the
ticks themselves) is removed for all but the left-most facet plot(s).
 individual: A label is placed to the left of each facet plot. Tick text is kept.
 | 
| rotate_x_tick_labels | (optional) Rotate tick labels on the x-axis by
90 degrees. Defaults to TRUE. Rotation of x-axis tick labels may also be
controlled through the ggtheme. In this case, FALSE should be provided
explicitly. | 
| rotate_y_tick_labels | (optional) Rotate tick labels on the y-axis by
45 degrees. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| combine_legend | (optional) Flag to indicate whether the same legend
is to be shared by multiple aesthetics, such as those specified by
color_by and linetype_by arguments. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
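Two of the argument rules above lend themselves to a short sketch: the allowed range of conf_int_alpha, and the derivation of break points from a number of breaks when no break points were provided. The helper names are hypothetical, `NULL` stands in for an unset argument (the usage above relies on waiver() instead), and `pretty()` is only one plausible way to derive breaks; none of this is confirmed to be what familiar does internally.

```r
# Hypothetical helpers (assumptions for illustration only).

# conf_int_alpha must be a single number between 0.0 (fully transparent)
# and 1.0 (fully opaque).
is_valid_alpha <- function(alpha) {
  is.numeric(alpha) && length(alpha) == 1 && !is.na(alpha) &&
    alpha >= 0.0 && alpha <= 1.0
}

# Use explicit break points when given; otherwise derive them from the
# value range and the requested number of breaks.
derive_breaks <- function(value_range, n_breaks = 5, breaks = NULL) {
  if (!is.null(breaks)) {
    return(breaks)
  }
  pretty(value_range, n = n_breaks)
}

is_valid_alpha(0.4)
is_valid_alpha(1.5)

x_breaks <- derive_breaks(c(0, 10), n_breaks = 5)
```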
Internal function for checking if the outcome type fits well to the data
Description
This function may help identify if the outcome type is plausible
given the outcome data. In practice it also tests whether the outcome column
is actually correct given the outcome type.
Usage
.check_outcome_type_plausibility(
  data,
  outcome_type,
  outcome_column,
  censoring_indicator,
  event_indicator,
  competing_risk_indicator,
  check_stringency = "strict"
)
Arguments
| data | Data set as loaded using the .load_data function. | 
| outcome_type | Character string indicating the type of outcome being
assessed. | 
| outcome_column | Name of the outcome column in the data set. | 
| censoring_indicator | Name of censoring indicator. | 
| event_indicator | Name of event indicator. | 
| competing_risk_indicator | Name of competing risk indicator. | 
| check_stringency | Specifies stringency of various checks. One of three values:
 
 strict: default value used for summon_familiar. Thoroughly checks
input data. Used internally for checking development data.
 external_warn: value used for extract_data and related methods. Less
stringent checks, but will warn for possible issues. Used internally for
checking data for evaluation and explanation.
 external: value used for external methods such as predict. Less
stringent checks, particularly for identifier and outcome columns, which may
be completely absent. Used internally for predict.
 | 
Checks and sanitizes splitting variables for plotting.
Description
Checks and sanitizes splitting variables for plotting.
Usage
.check_plot_splitting_variables(
  x,
  split_by = NULL,
  color_by = NULL,
  linetype_by = NULL,
  facet_by = NULL,
  x_axis_by = NULL,
  y_axis_by = NULL,
  available = NULL
)
Arguments
| x | data.table or data.frame containing the data used for splitting. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables. | 
| linetype_by | (optional) Variables that are used to determine the
linetype of lines in a plot. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| x_axis_by | (optional) Variable plotted along the x-axis of a plot.
The variable cannot overlap with variables provided to the split_by and
y_axis_by arguments (if used), but may overlap with other arguments. Only
one variable is allowed for this argument. See details for available
variables. | 
| y_axis_by | (optional) Variable plotted along the y-axis of a plot.
The variable cannot overlap with variables provided to the split_by and
x_axis_by arguments (if used), but may overlap with other arguments. Only
one variable is allowed for this argument. See details for available
variables. | 
| available | Names of columns available for splitting. | 
Details
This internal function allows some flexibility regarding the exact
input. Allowed splitting variables should be defined by the available
argument.
Value
A sanitized list of splitting variables.
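The overlap rules for splitting variables can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `sanitise_splitting` is hypothetical, only split_by and color_by are handled, and the error messages are invented. It is not the internal implementation.

```r
# Hypothetical helper (an assumption for illustration only).
sanitise_splitting <- function(split_by = NULL, color_by = NULL, available) {
  # Every requested variable must be one of the available columns.
  unknown <- setdiff(c(split_by, color_by), available)
  if (length(unknown) > 0) {
    stop("Unknown splitting variables: ", paste(unknown, collapse = ", "))
  }

  # Variables used for colouring cannot also be used to split figures.
  overlap <- intersect(split_by, color_by)
  if (length(overlap) > 0) {
    stop("color_by overlaps with split_by: ", paste(overlap, collapse = ", "))
  }

  list(split_by = split_by, color_by = color_by)
}

available <- c("learner", "data_set", "fs_method")

splitting <- sanitise_splitting(
  split_by = "learner",
  color_by = "data_set",
  available = available
)

# Requesting the same variable twice is rejected.
overlap_error <- tryCatch(
  sanitise_splitting(split_by = "learner", color_by = "learner", available = available),
  error = function(e) conditionMessage(e)
)
```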
Internal function to test plausibility of provided survival times.
Description
This function checks whether non-positive outcome time is present in the
data. This may produce unexpected results for some packages. For example,
glmnet will not train if an instance has a survival time of 0 or lower.
Usage
.check_survival_time_plausibility(
  data,
  outcome_type,
  outcome_column,
  check_stringency = "strict"
)
Arguments
| data | Data set as loaded using the .load_data function. | 
| outcome_type | (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from
contents of the outcome column. This may lead to unexpected results, and we
therefore advise to provide this information manually.
 Note that competing_risk survival analysis is not fully supported, and
is currently not a valid choice for outcome_type. | 
| outcome_column | (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that survival
and competing_risk outcome types require two columns that
indicate the time-to-event or the time of last follow-up and the event
status. | 
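A minimal sketch of the survival time check, assuming the time column is available as a plain numeric vector. The helper name `find_non_positive_times` is hypothetical, and how flagged rows are handled (error or warning) would depend on check_stringency.

```r
# Hypothetical helper (an assumption for illustration only).
find_non_positive_times <- function(time) {
  # Row positions with a survival time of 0 or lower; missing values are
  # not flagged by this particular check.
  which(!is.na(time) & time <= 0)
}

time <- c(12.5, 0, 30, -1, NA)
flagged <- find_non_positive_times(time)

if (length(flagged) > 0) {
  message("Non-positive survival times in rows: ", paste(flagged, collapse = ", "))
}
```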
Internal function for finalising generic data processing
Description
Internal function for finalising generic data processing
Usage
.finish_data_preparation(
  data,
  sample_id_column,
  batch_id_column,
  series_id_column,
  outcome_column,
  outcome_type,
  include_features,
  class_levels,
  censoring_indicator,
  event_indicator,
  competing_risk_indicator,
  check_stringency = "strict",
  reference_method = "auto"
)
Arguments
| data | data.table with feature data | 
| sample_id_column | (recommended) Name of the column containing
sample or subject identifiers. See batch_id_column for more
details. If unset, every row will be identified as a single sample. | 
| batch_id_column | (recommended) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.
 In familiar any row of data is organised by four identifiers:
 
 The batch identifier batch_id_column: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
 The sample identifier sample_id_column: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
 The series identifier series_id_column: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
 The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
 | 
| series_id_column | (optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See batch_id_column above for more details.
If unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers. | 
| outcome_column | (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that survival
and competing_risk outcome types require two columns that
indicate the time-to-event or the time of last follow-up and the event
status. | 
| outcome_type | (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from
contents of the outcome column. This may lead to unexpected results, and we
therefore advise to provide this information manually.
 Note that competing_risk survival analysis is not fully supported, and
is currently not a valid choice for outcome_type. | 
| include_features | (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap signature. Features in
signature and novelty_features are always included. If both
exclude_features and include_features are provided, include_features
takes precedence, provided that there is no overlap between the two. | 
| class_levels | (optional) Class levels for binomial or multinomial
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column. | 
| censoring_indicator | (recommended) Indicator for right-censoring in
survival and competing_risk analyses. familiar will automatically
recognise 0, false, f, n, no as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values. | 
| event_indicator | (recommended) Indicator for events in survival and
competing_risk analyses. familiar will automatically recognise 1,
true, t, y and yes as event indicators, including different
capitalisations. If this parameter is set, it replaces the default values. | 
| competing_risk_indicator | (recommended) Indicator for competing
risks in competing_risk analyses. There are no default values, and if
unset, all values other than those specified by the event_indicator and
censoring_indicator parameters are considered to indicate competing
risks. | 
| check_stringency | Specifies stringency of various checks. One of three values:
 
 strict: default value used for summon_familiar. Thoroughly checks
input data. Used internally for checking development data.
 external_warn: value used for extract_data and related methods. Less
stringent checks, but will warn for possible issues. Used internally for
checking data for evaluation and explanation.
 external: value used for external methods such as predict. Less
stringent checks, particularly for identifier and outcome columns, which may
be completely absent. Used internally for predict.
 | 
| reference_method | (optional) Method used to set reference levels for
categorical features. There are several options:
 
 auto (default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
 always: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
 never: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
 | 
Details
This function is used to update data.table provided by loading the
data. When part of the main familiar workflow, this function is used after
.parse_initial_settings –> .load_data –> .update_initial_settings.
When used to parse external data (e.g. in conjunction with familiarModel)
it follows after .load_data. Hence the function contains several checks
which are otherwise part of .update_initial_settings.
Value
data.table with expected column names.
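The default censoring and event indicators listed for .finish_data_preparation are matched regardless of capitalisation. The sketch below shows one way such matching could work in base R. The function name `encode_event_status` is hypothetical, and returning NA for unrecognised values is an assumption made here for simplicity; it is not a statement about how familiar treats such values.

```r
# Default indicator values as listed in the argument descriptions.
default_censoring <- c("0", "false", "f", "n", "no")
default_event <- c("1", "true", "t", "y", "yes")

# Hypothetical helper (an assumption for illustration only).
encode_event_status <- function(
    status,
    censoring_indicator = default_censoring,
    event_indicator = default_event) {
  # Compare in lower case so that different capitalisations are recognised.
  status <- tolower(as.character(status))

  encoded <- rep(NA_integer_, length(status))
  encoded[status %in% tolower(censoring_indicator)] <- 0L
  encoded[status %in% tolower(event_indicator)] <- 1L

  encoded
}

status <- c("Yes", "no", "TRUE", "0", "relapse")
encoded <- encode_event_status(status)

# User-provided indicators replace the defaults.
custom <- encode_event_status(
  c("dead", "alive"),
  censoring_indicator = "alive",
  event_indicator = "dead"
)
```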
Internal function for obtaining a default signature size parameter
Description
Internal function for obtaining a default signature size parameter
Usage
.get_default_sign_size(data, restrict_samples = FALSE)
Arguments
| data | dataObject class object which contains the data on which the
preset parameters are determined. | 
| restrict_samples | Logical indicating whether the signature size should
be limited by the number of samples in addition to the number of available
features. This may help convergence of OLS-based methods. | 
Value
List containing the preset values for the signature size parameter.
Internal function for creating or retrieving iteration data
Description
Internal function for creating or retrieving iteration data
Usage
.get_iteration_data(
  file_paths,
  data,
  experiment_setup,
  settings,
  message_indent = 0L,
  verbose = TRUE
)
Arguments
| file_paths | Set of paths to relevant files and directories. | 
| data | Data set as loaded using the .load_data function. | 
| experiment_setup | data.table with subsampler information at different
levels of the experimental design. | 
| settings | List of parameter settings. Some of these parameters are
relevant to creating iterations. | 
| message_indent | Indenting of messages. | 
| verbose | Sets verbosity. | 
Value
A list with the following elements:
-  iter_list: A list containing iteration data at the different levels of
the experiment.
 
-  project_id: The unique project identifier.
 
-  experiment_setup: data.table with subsampler information at different
levels of the experimental design.
 
Internal imputation function for the outcome type.
Description
This function allows for imputation of the most plausible outcome type.
This imputation is only done for trivial cases, where there is little doubt.
As a consequence count and continuous outcome types are never imputed.
Usage
.impute_outcome_type(
  data,
  outcome_column,
  class_levels,
  censoring_indicator,
  event_indicator,
  competing_risk_indicator
)
Arguments
| data | Data set as loaded using the .load_data function. | 
| outcome_column | Name of the outcome column in the data set. | 
| class_levels | User-provided class levels for the outcome. | 
| censoring_indicator | Name of censoring indicator. | 
| event_indicator | Name of event indicator. | 
| competing_risk_indicator | Name of competing risk indicator. | 
Value
The imputed outcome type.
Note
It is highly recommended that the user provides the outcome type.
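To make the idea of imputing only trivial cases concrete, the sketch below guesses an outcome type from a single outcome vector. The rules shown are illustrative assumptions, not the rules familiar applies, and survival outcomes (which involve two columns) are left out. In line with the description above, numeric outcomes are never assigned a type.

```r
# Hypothetical helper (illustrative rules only, not familiar's logic).
guess_outcome_type <- function(x) {
  # Logical outcomes have two possible values.
  if (is.logical(x)) {
    return("binomial")
  }

  # Character or factor outcomes are categorical; count distinct values.
  if (is.factor(x) || is.character(x)) {
    n_levels <- length(unique(x[!is.na(x)]))
    if (n_levels == 2) {
      return("binomial")
    }
    if (n_levels > 2) {
      return("multinomial")
    }
  }

  # Numeric outcomes could be count or continuous, so no guess is made.
  NA_character_
}

guess_outcome_type(c("yes", "no", "yes"))
guess_outcome_type(factor(c("a", "b", "c")))
guess_outcome_type(c(1.2, 3.4, 5.6))
```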
Internal function for loading iteration data from the file system
Description
Loads iterations generated by .create_iterations that were created in a
previous session. If these are not available, this is indicated by setting a
return flag.
Usage
.load_iterations(file_dir, iteration_file = NULL)
Arguments
| file_dir | Path to directory where iteration files are stored. | 
Value
List containing:
-  iteration_file_exists: An indicator whether an iteration file was found.
 
-  iteration_list: The list of iterations (if available).
 
-  project_id: The unique project identifier (if available).
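The pattern of returning a flag when no iteration file is found can be sketched as below. The function name, the default file name "iterations.RDS" and the RDS storage format are assumptions chosen for illustration; the actual file naming and format used by familiar are not stated here.

```r
# Hypothetical helper (file name and format are assumptions).
load_iterations_sketch <- function(file_dir, iteration_file = "iterations.RDS") {
  file_path <- file.path(file_dir, iteration_file)

  # Signal absence through the return flag instead of raising an error.
  if (!file.exists(file_path)) {
    return(list(iteration_file_exists = FALSE))
  }

  contents <- readRDS(file_path)
  list(
    iteration_file_exists = TRUE,
    iteration_list = contents$iteration_list,
    project_id = contents$project_id
  )
}

# Demonstration in a temporary directory.
demo_dir <- tempfile("iterations_")
dir.create(demo_dir)

before <- load_iterations_sketch(demo_dir)

saveRDS(
  list(iteration_list = list(1:3), project_id = 42L),
  file.path(demo_dir, "iterations.RDS")
)
after <- load_iterations_sketch(demo_dir)
```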
 
Internal function for setting categorical features
Description
Internal function for setting categorical features
Usage
.parse_categorical_features(data, outcome_type, reference_method = "auto")
Arguments
| data | data.table with feature data | 
| outcome_type | character, indicating the type of outcome | 
| reference_method | character, indicating the type of method used to set
the reference level. | 
Details
This function parses columns containing feature data to factors if
the data contained therein have logical (TRUE, FALSE), character, or factor
classes.  Unless passed as feature names with reference, numerical data,
including integers, are not converted to factors.
Value
data.table with several features converted to factor.
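
The conversion rule described under Details can be sketched with data.table as follows. The helper name parse_categorical_sketch is hypothetical, and the sketch ignores outcome columns and reference-level handling, which the real function also has to deal with.

```r
library(data.table)

# Hypothetical sketch: convert logical and character columns to factors,
# leaving numeric (including integer) columns untouched.
parse_categorical_sketch <- function(data) {
  data <- copy(data)  # avoid modifying the caller's table by reference
  for (col in names(data)) {
    x <- data[[col]]
    if (is.logical(x) || is.character(x)) {
      set(data, j = col, value = factor(x))
    }
  }
  data
}

dt <- data.table(
  smoker = c(TRUE, FALSE, TRUE),
  site = c("a", "b", "a"),
  stage = c(1L, 2L, 3L),
  age = c(54.2, 61.0, 47.5)
)
parsed <- parse_categorical_sketch(dt)
sapply(parsed, class)
```

Columns that are already factors are left as they are, which matches the statement that factor data stay factors.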
Internal function for parsing settings related to model evaluation
Description
Internal function for parsing settings related to model evaluation
Usage
.parse_evaluation_settings(
  config = NULL,
  data,
  parallel,
  outcome_type,
  hpo_metric,
  development_batch_id,
  vimp_aggregation_method,
  vimp_aggregation_rank_threshold,
  prep_cluster_method,
  prep_cluster_linkage_method,
  prep_cluster_cut_method,
  prep_cluster_similarity_threshold,
  prep_cluster_similarity_metric,
  evaluate_top_level_only = waiver(),
  skip_evaluation_elements = waiver(),
  ensemble_method = waiver(),
  evaluation_metric = waiver(),
  sample_limit = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  feature_cluster_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_linkage_method = waiver(),
  feature_similarity_metric = waiver(),
  feature_similarity_threshold = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_similarity_metric = waiver(),
  eval_aggregation_method = waiver(),
  eval_aggregation_rank_threshold = waiver(),
  eval_icc_type = waiver(),
  stratification_method = waiver(),
  stratification_threshold = waiver(),
  time_max = waiver(),
  evaluation_times = waiver(),
  dynamic_model_loading = waiver(),
  parallel_evaluation = waiver(),
  ...
)
Arguments
| config | A list of settings, e.g. from an xml file. | 
| data | Data set as loaded using the .load_data function. | 
| parallel | Logical value that indicates whether familiar uses parallelisation. If
FALSE, it will override parallel_evaluation. | 
| outcome_type | Type of outcome found in the data set. | 
| hpo_metric | Metric defined for hyperparameter optimisation. | 
| development_batch_id | Identifiers of batches used for model development.
These identifiers determine which cohorts are used to derive a setting for
time_max, if the outcome_type is survival and neither time_max nor
evaluation_times is provided. | 
| vimp_aggregation_method | Method for variable importance aggregation that
was used for feature selection. | 
| vimp_aggregation_rank_threshold | Rank threshold for variable importance
aggregation used during feature selection. | 
| prep_cluster_method | Cluster method used during pre-processing. | 
| prep_cluster_linkage_method | Cluster linkage method used during
pre-processing. | 
| prep_cluster_cut_method | Cluster cut method used during pre-processing. | 
| prep_cluster_similarity_threshold | Cluster similarity threshold used
during pre-processing. | 
| prep_cluster_similarity_metric | Cluster similarity metric used during
pre-processing. | 
| evaluate_top_level_only | (optional) Flag that signals that only
evaluation at the most global experiment level is required. Consider a
cross-validation experiment with additional external validation. The global
experiment level consists of data that are used for development, internal
validation and external validation. The next lower experiment level are the
individual cross-validation iterations.
 When the flag is true, evaluations take place on the global level only,
and no results are generated for the next lower experiment levels. In our
example, this means that results from individual cross-validation iterations
are not computed and shown. When the flag is false, results are computed
from both the global layer and the next lower level. Setting the flag to
true saves computation time. | 
| skip_evaluation_elements | (optional) Specifies which evaluation steps,
if any, should be skipped as part of the evaluation process. Defaults to
none, which means that all relevant evaluation steps are performed. It can
have one or more of the following values: 
 none, false: no steps are skipped.
 all, true: all steps are skipped.
 auc_data: data for assessing and plotting the area under the receiver
operating characteristic curve are not computed.
 calibration_data: data for assessing and plotting model calibration are
not computed.
 calibration_info: data required to assess calibration, such as baseline
survival curves, are not collected. These data will still be present in the
models.
 confusion_matrix: data for assessing and plotting a confusion matrix are
not collected.
 decision_curve_analyis: data for performing a decision curve analysis
are not computed.
 feature_expressions: data for assessing and plotting sample clustering
are not computed.
 feature_similarity: data for assessing and plotting feature clusters are
not computed.
 fs_vimp: data for assessing and plotting feature selection-based
variable importance are not collected.
 hyperparameters: data for assessing model hyperparameters are not
collected. These data will still be present in the models.
 ice_data: data for individual conditional expectation and partial
dependence plots are not created.
 model_performance: data for assessing and visualising model performance
are not created.
 model_vimp: data for assessing and plotting model-based variable
importance are not collected.
 permutation_vimp: data for assessing and plotting model-agnostic
permutation variable importance are not computed.
 prediction_data: predictions for each sample are not made and exported.
 risk_stratification_data: data for assessing and plotting Kaplan-Meier
survival curves are not collected.
 risk_stratification_info: data for assessing stratification into risk
groups are not computed.
 univariate_analysis: data for assessing and plotting univariate feature
importance are not computed.
 | 
| ensemble_method | (optional) Method for ensembling predictions from
models for the same sample. Available methods are:
 This parameter is only used if detail_level is ensemble. | 
| evaluation_metric | (optional) One or more metrics for assessing model
performance. See the vignette on performance metrics for the available
metrics.
 Confidence intervals (or rather credibility intervals) are computed for each
metric during evaluation. This is done using bootstraps, the number of which
depends on the value of confidence_level (Davison and Hinkley, 1997).
 If unset, the metric in the optimisation_metric variable is used. | 
| sample_limit | (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000).
 This parameter can be set for the following data elements:
sample_similarity and ice_data. | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble, where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix. | 
| estimation_type | (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci",
"model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance,
permutation_vimp, ice_data, and prediction_data. | 
| aggregate_results | (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is
bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if
estimation_type is point.
 The default value is TRUE, except when computing model performance
metrics, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default
aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g.
list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type. | 
| confidence_level | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to determine
the confidence intervals, familiar uses the rule of thumb
n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95. | 
| bootstrap_ci_method | (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
| feature_cluster_method | (optional) Method used to perform clustering
of features. The same methods as for the cluster_method configuration
parameter are available: none, hclust, agnes, diana and pam.
 The value for the cluster_method configuration parameter is used by
default. When generating clusters for the purpose of determining mutual
correlation and ordering feature expressions, none is ignored and hclust
is used instead. | 
| feature_cluster_cut_method | (optional) Method used to divide features
into separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and
dynamic_cut.
 silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and
diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 The value for the cluster_cut_method configuration parameter is used by
default. | 
| feature_linkage_method | (optional) Method used for agglomerative
clustering with hclust and agnes. Linkage determines how features are
sequentially combined into clusters based on distance. The methods are
shared with the cluster_linkage_method configuration parameter: average,
single, complete, weighted, and ward.
 The value for the cluster_linkage_method configuration parameter is used
by default. | 
| feature_similarity_metric | (optional) Metric to determine pairwise
similarity between features. Similarity is computed in the same manner as
for clustering, and feature_similarity_metric therefore has the same
options as cluster_similarity_metric: mcfadden_r2, cox_snell_r2,
nagelkerke_r2, mutual_information, spearman, kendall and pearson.
 The value for the cluster_similarity_metric configuration parameter
is used by default. | 
| feature_similarity_threshold | (optional) The threshold level for
pair-wise similarity that is required to form feature clusters with the
fixed_cut method. This threshold functions in the same manner as the one
defined using the cluster_similarity_threshold parameter.
 By default, the value for the cluster_similarity_threshold configuration
parameter is used.
 Unlike for cluster_similarity_threshold, more than one value can be
supplied here. | 
| sample_cluster_method | (optional) The method used to perform
clustering based on distance between samples. These are the same methods as
for the cluster_method configuration parameter: hclust, agnes, diana
and pam.
 The value for the cluster_method configuration parameter is used by
default. When generating clusters for the purpose of ordering samples in
feature expressions, none is ignored and hclust is used instead. | 
| sample_linkage_method | (optional) The method used for agglomerative
clustering in hclust and agnes. These are the same methods as for the
cluster_linkage_method configuration parameter: average, single,
complete, weighted, and ward.
 The value for the cluster_linkage_method configuration parameter is used
by default. | 
| sample_similarity_metric | (optional) Metric to determine pairwise
similarity between samples. Similarity is computed in the same manner as for
clustering, but sample_similarity_metric has different options that are
better suited to computing distance between samples instead of between
features. The following metrics are available:
 gower (default): compute Gower's distance between samples. By default,
Gower's distance is computed based on winsorised data to reduce the effect
of outliers (see below).
 euclidean: compute the Euclidean distance between samples.
 The underlying feature data for numerical features is scaled to the
[0,1] range using the feature values across the samples. The
normalisation parameters required can optionally be computed from feature
data with the outer 5% (on both sides) of feature values trimmed or
winsorised. To do so, append _trim (trimming) or _winsor (winsorising) to
the metric name. This reduces the effect of outliers somewhat.
 Regardless of metric, all categorical features are handled as for
Gower's distance: distance is 0 if the values in a pair of samples match,
and 1 if they do not. | 
| eval_aggregation_method | (optional) Method for aggregating variable
importances for the purpose of evaluation. Variable importances are
determined during feature selection steps and after training the model. Both
types are evaluated, but feature selection variable importance is only
evaluated at run-time.
 See the documentation for the vimp_aggregation_method argument for
information concerning the different methods available. | 
| eval_aggregation_rank_threshold | (optional) The threshold used to
define the subset of highly important features during evaluation.
 See the documentation for the vimp_aggregation_rank_threshold argument for
more information. | 
| eval_icc_type | (optional) String indicating the type of intraclass
correlation coefficient (1, 2 or 3) that should be used to compute
robustness for features in repeated measurements during the evaluation of
univariate importance. These types correspond to the types in Shrout and
Fleiss (1979). The default value is 1. | 
| stratification_method | (optional) Method for determining the
stratification threshold for creating survival groups. The actual,
model-dependent, threshold value is obtained from the development data, and
can afterwards be used to perform stratification on validation data.
 The following stratification methods are available:
 
 median (default): The median predicted value in the development cohort
is used to stratify the samples into two risk groups. For predicted outcome
values that build a continuous spectrum, the two risk groups in the
development cohort will be roughly equal in size.
 mean: The mean predicted value in the development cohort is used to
stratify the samples into two risk groups.
 mean_trim: As mean, but based on the set of predicted values
where the 5% lowest and 5% highest values are discarded. This reduces the
effect of outliers.
 mean_winsor: As mean, but based on the set of predicted values where
the 5% lowest and 5% highest values are winsorised. This reduces the effect
of outliers.
 fixed: Samples are stratified based on the sample quantiles of the
predicted values. These quantiles are defined using the
stratification_threshold parameter.
 optimised: Use maximally selected rank statistics to determine the
optimal threshold (Lausen and Schumacher, 1992; Hothorn et al., 2003) to
stratify samples into two optimally separated risk groups.
 One or more stratification methods can be selected simultaneously.
 This parameter is only relevant for survival outcomes. | 
| stratification_threshold | (optional) Numeric value(s) signifying the
sample quantiles for stratification using the fixed method. The number of
risk groups will be the number of values +1.
 The default value is c(1/3, 2/3), which will yield two thresholds that
divide samples into three equally sized groups. If fixed is not among the
selected stratification methods, this parameter is ignored.
 This parameter is only relevant for survival outcomes. | 
| time_max | (optional) Time point which is used as the benchmark for
e.g. cumulative risks generated by random forest, or the cutoff for Uno's
concordance index.
 If time_max is not provided, but evaluation_times is, the largest value
of evaluation_times is used. If neither is provided, time_max is set
to the 98th percentile of the distribution of survival times for samples
with an event in the development data set.
 This parameter is only relevant for survival outcomes. | 
| evaluation_times | (optional) One or more time points that are used for
assessing calibration in survival problems. This is done as expected and
observed survival probabilities depend on time.
 If unset, evaluation_times will be equal to time_max.
 This parameter is only relevant for survival outcomes. | 
| dynamic_model_loading | (optional) Enables dynamic loading of models
during the evaluation process, if TRUE. Defaults to FALSE. Dynamic
loading of models may reduce the overall memory footprint, at the cost of
increased disk or network IO. Models can only be dynamically loaded if they
are found at an accessible disk or network location. Setting this parameter
to TRUE may help if parallel processing causes out-of-memory issues during
evaluation. | 
| parallel_evaluation | (optional) Enable parallel processing for
model evaluation. Defaults to TRUE. When set to FALSE, this
will disable the use of parallel processing while performing evaluation,
regardless of the settings of the parallel parameter. The parameter
moreover specifies whether parallelisation takes place within the evaluation
process steps (inner, default), or in an outer loop (outer) over
learners, data subsamples, etc.
 parallel_evaluation is ignored if parallel=FALSE.
 | 
| ... | Unused arguments. | 
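
Several of the arguments above (sample_limit, detail_level, estimation_type and aggregate_results) accept a named list keyed by data element. The sketch below builds such lists as plain R objects; the helper setting_for is hypothetical and only illustrates how a per-element value with a fallback could be looked up. It does not reflect how familiar resolves these settings internally.

```r
# Per-element settings are plain named lists keyed by data element.
detail_level <- list("auc_data" = "ensemble", "model_performance" = "hybrid")
estimation_type <- list("auc_data" = "bci", "model_performance" = "point")
aggregate_results <- list("auc_data" = TRUE, "model_performance" = FALSE)
sample_limit <- list("sample_similarity" = 100, "ice_data" = 1000)

# Hypothetical helper: return the value for one data element, or a default
# when the element is not named in the list.
setting_for <- function(setting, element, default) {
  if (element %in% names(setting)) setting[[element]] else default
}

setting_for(detail_level, "auc_data", default = "hybrid")
setting_for(detail_level, "permutation_vimp", default = "hybrid")
```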
Value
List of parameters related to model evaluation.
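
As an illustration of the survival-specific settings, the base R sketch below derives thresholds for the median, mean, mean_trim, mean_winsor and fixed stratification methods, and a default time_max, from made-up data. The simulated predictions, the risk group labels, and the exact trimming and quantile conventions are assumptions; familiar's own computation may differ in detail.

```r
set.seed(1)
predicted <- rexp(300)  # hypothetical predicted risk values, development cohort

# 5% and 95% sample quantiles used for trimming and winsorising.
bounds <- unname(quantile(predicted, c(0.05, 0.95)))

thresholds <- list(
  median = median(predicted),
  mean = mean(predicted),
  # mean_trim: discard values outside the 5%-95% range.
  mean_trim = mean(predicted[predicted >= bounds[1] & predicted <= bounds[2]]),
  # mean_winsor: clamp values to the 5%-95% range instead of discarding them.
  mean_winsor = mean(pmin(pmax(predicted, bounds[1]), bounds[2])),
  # fixed: sample quantiles given by stratification_threshold = c(1/3, 2/3).
  fixed = unname(quantile(predicted, c(1 / 3, 2 / 3)))
)

# Two fixed thresholds yield three risk groups (labels are illustrative).
risk_group <- cut(
  predicted,
  breaks = c(-Inf, thresholds$fixed, Inf),
  labels = c("low", "moderate", "high")
)
table(risk_group)

# Default time_max: 98th percentile of survival times of samples with an event.
time <- c(5, 12, 20, 33, 41, 58, 60, 75)
event <- c(1, 0, 1, 1, 0, 1, 1, 0)
time_max <- unname(quantile(time[event == 1], 0.98))
```

The thresholds are derived from development data only; on validation data they would be applied as fixed cut-offs.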
References
-  Davison, A. C. & Hinkley, D. V. Bootstrap methods and their
application. (Cambridge University Press, 1997).
 
-  Efron, B. & Hastie, T. Computer Age Statistical Inference. (Cambridge
University Press, 2016).
 
-  Lausen, B. & Schumacher, M. Maximally Selected Rank Statistics.
Biometrics 48, 73 (1992).
 
-  Hothorn, T. & Lausen, B. On the exact distribution of maximally selected
rank statistics. Comput. Stat. Data Anal. 43, 121–137 (2003).
 
Internal function for parsing settings related to the experimental setup
Description
Internal function for parsing settings related to the experimental setup
Usage
.parse_experiment_settings(
  config = NULL,
  batch_id_column = waiver(),
  sample_id_column = waiver(),
  series_id_column = waiver(),
  development_batch_id = waiver(),
  validation_batch_id = waiver(),
  outcome_name = waiver(),
  outcome_column = waiver(),
  outcome_type = waiver(),
  event_indicator = waiver(),
  censoring_indicator = waiver(),
  competing_risk_indicator = waiver(),
  class_levels = waiver(),
  signature = waiver(),
  novelty_features = waiver(),
  exclude_features = waiver(),
  include_features = waiver(),
  reference_method = waiver(),
  experimental_design = waiver(),
  imbalance_correction_method = waiver(),
  imbalance_n_partitions = waiver(),
  ...
)
Arguments
| config | A list of settings, e.g. from an xml file. | 
| batch_id_column | (recommended) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.
 In familiar any row of data is organised by four identifiers:
 
 The batch identifier batch_id_column: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
 The sample identifier sample_id_column: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
 The series identifier series_id_column: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
 The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
 | 
| sample_id_column | (recommended) Name of the column containing
sample or subject identifiers. See batch_id_column above for more
details.
 If unset, every row will be identified as a single sample. | 
| series_id_column | (optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See batch_id_column above for more details.
 If unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers. | 
| development_batch_id | (optional) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all, or
all minus the identifiers in validation_batch_id for external validation.
Required if external validation is performed and validation_batch_id is
not provided. | 
| validation_batch_id | (optional) One or more batch or cohort
identifiers to constitute data sets for external validation. Defaults to
all data sets except those in development_batch_id for external
validation, or none if not. Required if development_batch_id is not
provided. | 
| outcome_name | (optional) Name of the modelled outcome. This name will
be used in figures created by familiar.
 If not set, the column name in outcome_column will be used for binomial,
multinomial, count and continuous outcomes. For other
outcomes (survival and competing_risk) no default is used. | 
| outcome_column | (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised.
 Note that survival and competing_risk outcome types require two columns that
indicate the time-to-event or the time of last follow-up and the event
status. | 
| outcome_type | (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from the
contents of the outcome column. This may lead to unexpected results, and we
therefore advise providing this information manually.
 Note that competing_risk survival analyses are not fully supported, and
competing_risk is currently not a valid choice for outcome_type. | 
| event_indicator | (recommended) Indicator for events in survival and
competing_risk analyses. familiar will automatically recognise 1, true,
t, y and yes as event indicators, including different
capitalisations. If this parameter is set, it replaces the default values. | 
| censoring_indicator | (recommended) Indicator for right-censoring in
survival and competing_risk analyses. familiar will automatically
recognise 0, false, f, n and no as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values. | 
| competing_risk_indicator | (recommended) Indicator for competing
risks in competing_risk analyses. There are no default values, and if
unset, all values other than those specified by the event_indicator and
censoring_indicator parameters are considered to indicate competing
risks. | 
| class_levels | (optional) Class levels for binomial or multinomial
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column. | 
| signature | (optional) One or more names of feature columns that are
considered part of a specific signature. Features specified here will
always be used for modelling. Ranking from feature selection has no effect
for these features. | 
| novelty_features | (optional) One or more names of feature columns
that should be included for the purpose of novelty detection. | 
| exclude_features | (optional) Feature columns that will be removed
from the data set. Cannot overlap with features in signature,
novelty_features or include_features. | 
| include_features | (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap signature. Features in
signature and novelty_features are always included. If both
exclude_features and include_features are provided, include_features
takes precedence, provided that there is no overlap between the two. | 
| reference_method | (optional) Method used to set reference levels for
categorical features. There are several options:
 
 auto (default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
 always: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
 never: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
 | 
| experimental_design | (required) Defines what the experiment looks
like, e.g. cv(bt(fs,20)+mb,3,2)+ev for 2 times repeated 3-fold
cross-validation with nested feature selection on 20 bootstraps and
model-building, and external validation. The basic workflow components are: 
 fs: (required) feature selection step.
 mb: (required) model building step.
 ev: (optional) external validation. Note that internal validation due
to subsampling will always be conducted if the subsampling methods create
any validation data sets.
 The different components are linked using +.
 Different subsampling methods can be used in conjunction with the basic
workflow components:
 
 bs(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. In contrast to bt, feature pre-processing parameters and
hyperparameter optimisation are conducted on individual bootstraps.
 bt(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. Unlike bs and other subsampling methods, no separate
pre-processing parameters or optimised hyperparameters will be determined
for each bootstrap.
 cv(x,n,p): (stratified) n-fold cross-validation, repeated p times.
Pre-processing parameters are determined for each iteration.
 lv(x): leave-one-out-cross-validation. Pre-processing parameters are
determined for each iteration.
 ip(x): imbalance partitioning for addressing class imbalances on the
data set. Pre-processing parameters are determined for each partition. The
number of partitions generated depends on the imbalance correction method
(see the imbalance_correction_method parameter). Imbalance partitioning
does not generate validation sets.
 As shown in the example above, sampling algorithms can be nested.
 The simplest valid experimental design is fs+mb, which corresponds to a
TRIPOD type 1a analysis. Type 1b analyses are only possible using
bootstraps, e.g. bt(fs+mb,100). Type 2a analyses can be conducted using
cross-validation, e.g. cv(bt(fs,100)+mb,10,1). Depending on the origin of
the external validation data, designs such as fs+mb+ev or
cv(bt(fs,100)+mb,10,1)+ev constitute type 2b or type 3 analyses. Type 4
analyses can be done by obtaining one or more familiarModel objects from
others and applying them to your own data set.
 Alternatively, the experimental_design parameter may be used to provide a
path to a file containing iterations, which is named ####_iterations.RDS
by convention. This path can be relative to the directory of the current
experiment (experiment_dir), or an absolute path. The absolute path may
thus also point to a file from a different experiment. | 
| imbalance_correction_method | (optional) Type of method used to
address class imbalances. Available options are:
 
 full_undersampling (default): All data will be used in an ensemble
fashion. The full minority class will appear in each partition, but
majority classes are undersampled until all data have been used.
 random_undersampling: Randomly undersamples majority classes. This is
useful in cases where full undersampling would lead to the formation of
many models due to major overrepresentation of the largest class.
 This parameter is only used in combination with imbalance partitioning in
the experimental design, and ip should therefore appear in the string
that defines the design. | 
| imbalance_n_partitions | (optional) Number of times random
undersampling should be repeated. 10 undersampled subsets with balanced
classes are formed by default. | 
| ... | Unused arguments. | 
Value
List of parameters related to data parsing and the experiment.
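The full_undersampling behaviour described above can be illustrated with a minimal base R sketch for a binary outcome. The helper full_undersampling_sketch and its seed argument are hypothetical and are not part of familiar; the actual partitioning code may differ, notably for outcomes with more than two classes.

```r
# Illustrative sketch of full undersampling for a binary outcome.
# full_undersampling_sketch is a hypothetical helper, not a familiar function.
full_undersampling_sketch <- function(outcome, seed = 1L) {
  set.seed(seed)

  class_sizes <- table(outcome)
  minority_class <- names(class_sizes)[which.min(class_sizes)]
  n_minority <- as.integer(min(class_sizes))

  minority_id <- which(outcome == minority_class)
  majority_id <- which(outcome != minority_class)

  # Shuffle the majority class so that partitions are formed at random.
  majority_id <- majority_id[sample.int(length(majority_id))]

  # Use as many partitions as needed to cover every majority sample once.
  n_partitions <- ceiling(length(majority_id) / n_minority)
  partition_label <- rep(
    seq_len(n_partitions),
    length.out = length(majority_id)
  )

  # Each partition holds the full minority class plus one majority chunk.
  lapply(
    unname(split(majority_id, partition_label)),
    function(chunk) sort(c(minority_id, chunk))
  )
}

# Example: 5 minority samples and 20 majority samples.
outcome <- c(rep("event", 5), rep("no_event", 20))
partitions <- full_undersampling_sketch(outcome)
```

In this example every partition contains all five minority samples, and the 20 majority samples are spread over the partitions so that each appears exactly once. The number of partitions therefore grows with the degree of imbalance, which is why random_undersampling with a fixed imbalance_n_partitions may be preferable for strongly imbalanced data.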
Internal function for parsing settings related to feature selection
Description
Internal function for parsing settings related to feature selection
Usage
.parse_feature_selection_settings(
  config = NULL,
  data,
  parallel,
  outcome_type,
  fs_method = waiver(),
  fs_method_parameter = waiver(),
  vimp_aggregation_method = waiver(),
  vimp_aggregation_rank_threshold = waiver(),
  parallel_feature_selection = waiver(),
  ...
)
Arguments
| config | A list of settings, e.g. from an xml file. | 
| data | Data set as loaded using the .load_data function. | 
| parallel | Logical value that indicates whether familiar uses
parallelisation. If FALSE, it will override parallel_feature_selection. | 
| outcome_type | Type of outcome found in the data set. | 
| fs_method | (required) Feature selection method to be used for
determining variable importance. familiar implements various feature
selection methods. Please refer to the vignette on feature selection
methods for more details.
 More than one feature selection method can be chosen. The experiment will
then be repeated for each feature selection method.
 Feature selection methods determine the ranking of features. Actual
selection of features is done by optimising the signature size model
hyperparameter during the hyperparameter optimisation step. | 
| fs_method_parameter | (optional) List of lists containing parameters
for feature selection methods. Each sublist should have the name of the
feature selection method it corresponds to.
 Most feature selection methods do not have parameters that can be set.
Please refer to the vignette on feature selection methods for more details.
Note that if the feature selection method is based on a learner (e.g. lasso
regression), hyperparameter optimisation may be performed prior to
assessing variable importance. | 
| vimp_aggregation_method | (optional) The method used to aggregate
variable importances over different data subsets, e.g. bootstraps. The
following methods can be selected:
 
 none: Don't aggregate ranks, but rather aggregate the variable
importance scores themselves.
 mean: Use the mean rank of a feature over the subsets to
determine the aggregated feature rank.
 median: Use the median rank of a feature over the subsets to determine
the aggregated feature rank.
 best: Use the best rank the feature obtained in any subset to determine
the aggregated feature rank.
 worst: Use the worst rank the feature obtained in any subset to
determine the aggregated feature rank.
 stability: Use the frequency of the feature being in the subset of
highly ranked features as measure for the aggregated feature rank
(Meinshausen and Buehlmann, 2010).
 exponential: Use a rank-weighted frequency of occurrence in the subset
of highly ranked features as measure for the aggregated feature rank (Haury
et al., 2011).
 borda (default): Use the borda count as measure for the aggregated
feature rank (Wald et al., 2012).
 enhanced_borda: Use an occurrence frequency-weighted borda count as
measure for the aggregated feature rank (Wald et al., 2012).
 truncated_borda: Use borda count computed only on features within the
subset of highly ranked features.
 enhanced_truncated_borda: Apply both the enhanced borda method and the
truncated borda method and use the resulting borda count as the aggregated
feature rank.
 The feature selection methods vignette provides additional information. | 
| vimp_aggregation_rank_threshold | (optional) The threshold used to
define the subset of highly important features. If not set, this threshold
is determined by maximising the variance in the occurrence value over all
features over the subset size.
 This parameter is only relevant for stability, exponential,
enhanced_borda, truncated_borda and enhanced_truncated_borda methods. | 
| parallel_feature_selection | (optional) Enable parallel processing for
the feature selection workflow. Defaults to TRUE. When set to FALSE,
this will disable the use of parallel processing while performing feature
selection, regardless of the settings of the parallel parameter.
parallel_feature_selection is ignored if parallel=FALSE. | 
| ... | Unused arguments. | 
Value
List of parameters related to feature selection.
References
-  Wald, R., Khoshgoftaar, T. M., Dittman, D., Awada, W. &
Napolitano, A. An extensive comparison of feature ranking aggregation
techniques in bioinformatics. in 2012 IEEE 13th International Conference on
Information Reuse Integration (IRI) 377–384 (2012).
 
-  Meinshausen, N. & Buehlmann, P. Stability selection. J. R. Stat. Soc.
Series B Stat. Methodol. 72, 417–473 (2010).
 
-  Haury, A.-C., Gestraud, P. & Vert, J.-P. The influence of feature
selection methods on accuracy, stability and interpretability of molecular
signatures. PLoS One 6, e28210 (2011).
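
The rank-based options of vimp_aggregation_method can be illustrated with a small base R sketch. The function aggregate_ranks_sketch, the example rank matrix and the rank_threshold argument are hypothetical. The borda score below is a simple textbook formulation (each subset awards n - rank + 1 points) and is an assumption; familiar's own scoring, and the enhanced and truncated variants, may be computed differently.

```r
# Rows are data subsets (e.g. bootstraps), columns are features.
# Values are the rank of each feature within a subset (1 = most important).
ranks <- matrix(
  c(1, 2, 3, 4,
    2, 1, 4, 3,
    1, 3, 2, 4),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(paste0("bootstrap_", 1:3), c("f1", "f2", "f3", "f4"))
)

# Hypothetical helper; not part of familiar.
aggregate_ranks_sketch <- function(ranks, method = "mean", rank_threshold = 2) {
  # Compute a score per feature for which lower is better.
  score <- switch(
    method,
    mean = colMeans(ranks),
    median = apply(ranks, 2, stats::median),
    best = apply(ranks, 2, min),
    worst = apply(ranks, 2, max),
    # Frequency of appearing among the highly ranked features (negated).
    stability = -colMeans(ranks <= rank_threshold),
    # Simple Borda count: n - rank + 1 points per subset (negated).
    borda = -colSums(ncol(ranks) - ranks + 1),
    stop("unknown aggregation method: ", method)
  )

  # Convert scores to aggregated ranks; tied features share the lowest rank.
  rank(score, ties.method = "min")
}

aggregate_ranks_sketch(ranks, method = "mean")
aggregate_ranks_sketch(ranks, method = "best")
aggregate_ranks_sketch(ranks, method = "stability", rank_threshold = 2)
```

Note how best produces a tie between f1 and f2, since both were ranked first in at least one subset, whereas mean and borda separate them. The rank_threshold argument plays the role of vimp_aggregation_rank_threshold and only affects methods that rely on a subset of highly ranked features.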
 
Internal function for parsing file paths
Description
Internal function for parsing file paths
Usage
.parse_file_paths(
  config = NULL,
  project_dir = waiver(),
  experiment_dir = waiver(),
  data_file = waiver(),
  verbose = TRUE,
  ...
)
Arguments
| config | A list of settings, e.g. from an xml file. | 
| project_dir | (optional) Path to the project directory. familiar checks
if the directory indicated by experiment_dir and data files in data_file
are relative to the project_dir. | 
| experiment_dir | (recommended) Path to the directory where all
intermediate and final results produced by familiar are written to.
 The experiment_dir can be a path relative to project_dir or an absolute
path.
 In case no project directory is provided and the experiment directory is
not on an absolute path, a directory will be created in the temporary R
directory indicated by tempdir(). This directory is deleted after closing
the R session or once data analysis has finished. All information will be
lost afterwards. Hence, it is recommended to provide either experiment_dir
as an absolute path, or provide both project_dir and experiment_dir. | 
| data_file | (optional) Path to files containing data that should be
analysed. The paths can be relative to project_dir or absolute paths. An
error will be raised if the file cannot be found.
 The following types of data are supported.
 
 csv files containing column headers on the first row, and samples per
row. csv files are read using data.table::fread.
 rds files that contain a data.table or data.frame object. rds files are
imported using base::readRDS.
 RData files that contain a single data.table or data.frame object. RData
files are imported using base::load.
 All data are expected in wide format, with sample information organised
row-wise.
 More than one data file can be provided. familiar will try to combine
data files based on column names and identifier columns.
 Alternatively, data can be provided using the data argument. These data
are expected to be data.frame or data.table objects or paths to data
files. The latter are handled in the same way as file paths provided to
data_file. | 
| verbose | Sets verbosity. | 
| ... | Unused arguments. | 
Value
List of paths to important directories and files.
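The way experiment_dir is interpreted relative to project_dir can be sketched in base R as follows. Both helpers are hypothetical and only mirror the rules stated in the argument descriptions above; familiar's internal path handling may differ, for instance in how it normalises or creates directories.

```r
# Hypothetical helpers; not part of familiar.
is_absolute_path_sketch <- function(path) {
  # Matches Unix roots ("/..."), home shortcuts ("~...") and
  # Windows drive letters ("C:/..." or "C:\...").
  grepl("^(/|~|[A-Za-z]:[/\\\\])", path, perl = TRUE)
}

resolve_experiment_dir_sketch <- function(experiment_dir, project_dir = NULL) {
  # An absolute experiment directory is used as is.
  if (is_absolute_path_sketch(experiment_dir)) {
    return(experiment_dir)
  }

  # A relative experiment directory is placed under the project directory.
  if (!is.null(project_dir)) {
    return(file.path(project_dir, experiment_dir))
  }

  # Without a project directory, fall back to the temporary R directory.
  # Results stored here are lost when the R session ends.
  file.path(tempdir(), experiment_dir)
}

resolve_experiment_dir_sketch("results", project_dir = "/data/project")
resolve_experiment_dir_sketch("/data/results")
resolve_experiment_dir_sketch("results")
```

The last call falls back to a directory under tempdir(), which illustrates why providing an absolute experiment_dir, or both project_dir and experiment_dir, is recommended.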
Internal function for parsing settings that configure various aspects of the
workflow
Description
Internal function for parsing settings that configure various aspects of the
workflow
Usage
.parse_general_settings(settings, config = NULL, data, ...)
Arguments
| settings | List of settings that was previously generated by
.parse_initial_settings. | 
| config | A list of settings, e.g. from an xml file. | 
| data | Data set as loaded using the .load_data function. | 
| ... | Arguments passed on to .parse_setup_settings,
.parse_preprocessing_settings, .parse_feature_selection_settings,
.parse_model_development_settings,
.parse_hyperparameter_optimisation_settings, .parse_evaluation_settings 
parallel: (optional) Enable parallel processing. Defaults to TRUE.
When set to FALSE, this disables all parallel processing, regardless of
specific parameters such as parallel_preprocessing. However, when
parallel is TRUE, parallel processing of different parts of the
workflow can be disabled by setting respective flags to FALSE.
parallel_nr_cores: (optional) Number of cores available for
parallelisation. Defaults to 2. This setting does nothing if
parallelisation is disabled.
restart_cluster: (optional) Restart nodes used for parallel computing
to free up memory prior to starting a parallel process. Note that it does
take time to set up the clusters. Therefore setting this argument to TRUE
may impact processing speed. This argument is ignored if parallel is
FALSE or the cluster was initialised outside of familiar. Default is
FALSE, which causes the clusters to be initialised only once.
cluster_type: (optional) Selection of the cluster type for parallel
processing. Available types are the ones supported by the parallel package
that is part of the base R distribution: psock (default), fork, mpi,
nws, sock. In addition, none is available, which also disables
parallel processing.
backend_type: (optional) Selection of the backend for distributing
copies of the data. This backend ensures that only a single master copy is
kept in memory. This limits memory usage during parallel processing.
 Several backend options are available, notably socket_server, and none
(default). socket_server is based on the callr package and R sockets,
comes with familiar and is available for any OS. none uses the package
environment of familiar to store data, and is available for any OS.
However, none requires copying of data to any parallel process, and has a
larger memory footprint.
server_port: (optional) Integer indicating the port on which the
socket server or RServe process should communicate. Defaults to port 6311.
Note that ports 0 to 1024 and 49152 to 65535 cannot be used.
feature_max_fraction_missing: (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a feature to be included in the data set. All features with a
missing value fraction over this threshold are not processed further. The
default value is 0.30.
sample_max_fraction_missing: (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a sample to be included in the data set. All samples with a
missing value fraction over this threshold are excluded and not processed
further. The default value is 0.30.
filter_method: (optional) One or more methods used to reduce
dimensionality of the data set by removing irrelevant or poorly
reproducible features.
 Several methods are available:
 
 none (default): None of the features will be filtered.
 low_variance: Features with a variance below the
low_var_minimum_variance_threshold are filtered. This can be useful to
filter, for example, genes that are not differentially expressed.
 univariate_test: Features undergo a univariate regression using an
outcome-appropriate regression model. The p-value of the model coefficient
is collected. Features with coefficient p or q-value above the
univariate_test_threshold are subsequently filtered.
 robustness: Features that are not sufficiently robust according to the
intraclass correlation coefficient are filtered. Use of this method
requires that repeated measurements are present in the data set, i.e. there
should be entries for which the sample and cohort identifiers are the same.
 More than one method can be used simultaneously. Features with singular
values are always filtered, as these do not contain information.
univariate_test_threshold: (optional) Numeric value between 1.0 and
0.0 that determines which features are irrelevant and will be filtered by
the univariate_test. The p or q-values are compared to this threshold.
All features with values above the threshold are filtered. The default
value is 0.20.
univariate_test_threshold_metric: (optional) Metric to compare the
univariate_test_threshold against. The following metrics can
be chosen: 
 p_value (default): The unadjusted p-value of each feature is used
to filter features.
 q_value: The q-value (Storey, 2002) is used to filter features. Some
data sets may have insufficient samples to compute the q-value. The
qvalue package must be installed from Bioconductor to use this method.
univariate_test_max_feature_set_size: (optional) Maximum size of the
feature set after the univariate test. P or q values of features are
compared against the threshold, but if the resulting data set would be
larger than this setting, only the most relevant features up to the desired
feature set size are selected.
 The default value is NULL, which causes features to be filtered based on
their relevance only.
low_var_minimum_variance_threshold: (required, if used) Numeric value
that determines which features will be filtered by the low_variance
method. The variance of each feature is computed and compared to the
threshold. If it is below the threshold, the feature is removed.
 This parameter has no default value and should be set if low_variance is
used.
low_var_max_feature_set_size: (optional) Maximum size of the feature
set after filtering features with a low variance. All features are first
compared against low_var_minimum_variance_threshold. If the resulting
feature set would be larger than specified, only the most strongly varying
features will be selected, up to the desired size of the feature set.
 The default value is NULL, which causes features to be filtered based on
their variance only.
robustness_icc_type: (optional) String indicating the type of
intraclass correlation coefficient (1, 2 or 3) that should be used to
compute robustness for features in repeated measurements. These types
correspond to the types in Shrout and Fleiss (1979). The default value is
1.
robustness_threshold_metric: (optional) String indicating which
specific intraclass correlation coefficient (ICC) metric should be used to
filter features. This should be one of:
 
 icc: The estimated ICC value itself.
 icc_low (default): The estimated lower limit of the 95% confidence
interval of the ICC, as suggested by Koo and Li (2016).
 icc_panel: The estimated ICC value over the panel average, i.e. the ICC
that would be obtained if all repeated measurements were averaged.
 icc_panel_low: The estimated lower limit of the 95% confidence interval
of the panel ICC.
robustness_threshold_value: (optional) The intraclass correlation
coefficient value that is used as threshold. The default value is 0.70.
transformation_method: (optional) The transformation method used to
change the distribution of the data to be more normal-like. The following
methods are available:
 
 none: This disables transformation of features.
 yeo_johnson: Transformation using the location and scale invariant
version of the Yeo-Johnson transformation (Yeo and Johnson, 2000;
Zwanenburg and Löck, 2023).
 yeo_johnson_robust (default): A robust version of yeo_johnson.
This method is less sensitive to outliers.
 yeo_johnson_conventional: As yeo_johnson, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Yeo and Johnson (2000).
 box_cox: Transformation using the location and scale invariant version
of the Box-Cox transformation (Box and Cox, 1964; Zwanenburg and Löck,
2023).
 box_cox_robust: A robust version of box_cox. This method is less
sensitive to outliers.
 box_cox_conventional: As box_cox, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Box and Cox (1964). This method requires
strictly positive feature values.
 Transformation requires the power.transform package. Only features that
contain numerical data are transformed. Transformation parameters obtained
in development data are stored within featureInfo objects for later use
with validation data sets.
transformation_optimisation_criterion: (optional) Transformation
parameters are optimised using a criterion, conventionally
maximum-likelihood-estimation. power.transform implements multiple
optimisation criteria, of which the following are available: 
 mle (default): Optimisation using maximum likelihood estimation.
 cramer_von_mises: Optimisation using the Cramér-von Mises
criterion. Zwanenburg and Löck (2023) found that this criterion was
relatively robust against outliers.
transformation_gof_test_p_value: (optional) Not all transformations
will lead to features that are roughly normally distributed. Zwanenburg and
Löck (2023) established an empirical goodness-of-fit test for central
normality. This parameter sets the significance for rejecting the
null-hypothesis that a feature distribution is centrally normal. When the
null-hypothesis is rejected, no transformation is performed. The default
value is NULL, which disables the test.
normalisation_method: (optional) The normalisation method used to
improve the comparability between numerical features that may have very
different scales. The following normalisation methods can be chosen:
 
 none: This disables feature normalisation.
 standardisation: Features are normalised by subtraction of their mean
values and division by their standard deviations. This causes every feature
to have a center value of 0.0 and standard deviation of 1.0.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust (default): A robust version of standardisation
that relies on computing Huber's M-estimators for location and scale.
 normalisation: Features are normalised by subtraction of their minimum
values and division by their ranges. This maps all feature values to a
[0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features are normalised by subtraction of their median values
and division by their interquartile range.
 mean_centering: Features are centered by subtracting the mean, but do
not undergo rescaling.
 Only features that contain numerical data are normalised. Normalisation
parameters obtained in development data are stored within featureInfo
objects for later use with validation data sets.
batch_normalisation_method: (optional) The method used for batch
normalisation. Available methods are:
 
 none (default): This disables batch normalisation of features.
 standardisation: Features within each batch are normalised by
subtraction of the mean value and division by the standard deviation in
each batch.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust: A robust version of standardisation that
relies on computing Huber's M-estimators for location and scale within each
batch.
 normalisation: Features within each batch are normalised by subtraction
of their minimum values and division by their range in each batch. This
maps all feature values in each batch to a [0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features in each batch are normalised by subtraction of the
median value and division by the interquartile range of each batch.
 mean_centering: Features in each batch are centered on 0.0 by
subtracting the mean value in each batch, but are not rescaled.
 combat_parametric: Batch adjustments using parametric empirical Bayes
(Johnson et al., 2007). combat_p leads to the same method.
 combat_non_parametric: Batch adjustments using non-parametric empirical
Bayes (Johnson et al., 2007). combat_np and combat lead to the same
method. Note that we reduced complexity from O(n^2) to O(n) by
only computing batch adjustment parameters for each feature on a subset of
50 randomly selected features, instead of all features.
 Only features that contain numerical data are normalised using batch
normalisation. Batch normalisation parameters obtained in development data
are stored within featureInfo objects for later use with validation data
sets, in case the validation data is from the same batch.
 If validation data contains data from unknown batches, normalisation
parameters are separately determined for these batches.
 Note that for both empirical Bayes methods, the batch effect is assumed to
affect all features in a similar way. This is often true for things such as
gene expressions, but the assumption may not hold generally.
 When performing batch normalisation, it is moreover important to check that
differences between batches or cohorts are not related to the studied
endpoint.
imputation_method: (optional) Method used for imputing missing
feature values. Two methods are implemented:
 
 simple: Simple replacement of a missing value by the median value (for
numeric features) or the modal value (for categorical features).
 lasso: Imputation of missing value by lasso regression (using glmnet)
based on information contained in other features.
 simple imputation precedes lasso imputation to ensure that any missing
values in predictors required for lasso regression are resolved. The
lasso estimate is then used to replace the missing value.
 The default value depends on the number of features in the dataset. If the
number is lower than 100, lasso is used by default, and simple otherwise.
 Only single imputation is performed. Imputation models and parameters are
stored within featureInfo objects for later use with validation data
sets.
cluster_method: (optional) Clustering is performed to identify and
replace redundant features, for example those that are highly correlated.
Such features do not carry much additional information and may be removed
or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011).
 The cluster method determines the algorithm used to form the clusters. The
following cluster methods are implemented:
 
 none: No clustering is performed.
 hclust (default): Hierarchical agglomerative clustering. If the
fastcluster package is installed, fastcluster::hclust is used (Muellner
2013), otherwise stats::hclust is used.
 agnes: Hierarchical clustering using agglomerative nesting (Kaufman and
Rousseeuw, 1990). This algorithm is similar to hclust, but uses the
cluster::agnes implementation.
 diana: Divisive analysis hierarchical clustering. This method uses
divisive instead of agglomerative clustering (Kaufman and Rousseeuw, 1990).
cluster::diana is used.
 pam: Partitioning around medoids. This partitions the data into $k$
clusters around medoids (Kaufman and Rousseeuw, 1990). $k$ is selected
using the silhouette metric. pam is implemented using the cluster::pam
function.
 Clusters and cluster information are stored within featureInfo objects for
later use with validation data sets. This enables reproduction of the same
clusters as formed in the development data set.
cluster_linkage_method: (optional) Linkage method used for
agglomerative clustering in hclust and agnes. The following linkage
methods can be used: 
 average (default): Average linkage.
 single: Single linkage.
 complete: Complete linkage.
 weighted: Weighted linkage, also known as McQuitty linkage.
 ward: Linkage using Ward's minimum variance method.
 diana and pam do not require a linkage method.
cluster_cut_method: (optional) The method used to define the actual
clusters. The following methods can be used:
 
 silhouette: Clusters are formed based on the silhouette score
(Rousseeuw, 1987). The average silhouette score is computed from 2 to n
clusters, with n the number of features. Clusters are only
formed if the average silhouette exceeds 0.50, which indicates reasonable
evidence for structure. This procedure may be slow if the number of
features is large (>100s).
 fixed_cut: Clusters are formed by cutting the hierarchical tree at the
point indicated by the cluster_similarity_threshold, e.g. where features
in a cluster have an average Spearman correlation of 0.90. fixed_cut is
only available for agnes, diana and hclust.
 dynamic_cut: Dynamic cluster formation using the cutting algorithm in
the dynamicTreeCut package. This package should be installed to select
this option. dynamic_cut can only be used with agnes and hclust.
 The default options are silhouette for partitioning around medoids (pam)
and fixed_cut otherwise.
cluster_similarity_metric: (optional) Clusters are formed based on
feature similarity. All features are compared in a pair-wise fashion to
compute similarity, for example correlation. The resulting similarity grid
is converted into a distance matrix that is subsequently used for
clustering. The following metrics are supported to compute pairwise
similarities:
 
 mutual_information (default): normalised mutual information.
 mcfadden_r2: McFadden's pseudo R-squared (McFadden, 1974).
 cox_snell_r2: Cox and Snell's pseudo R-squared (Cox and Snell, 1989).
 nagelkerke_r2: Nagelkerke's pseudo R-squared (Nagelkerke, 1991).
 spearman: Spearman's rank order correlation.
 kendall: Kendall rank correlation.
 pearson: Pearson product-moment correlation.
 The pseudo R-squared metrics can be used to assess similarity between mixed
pairs of numeric and categorical features, as these are based on the
log-likelihood of regression models. In familiar, the more informative
feature is used as the predictor and the other feature as the response
variable. In numeric-categorical pairs, the numeric feature is considered
to be more informative and is thus used as the predictor. In
categorical-categorical pairs, the feature with most levels is used as the
predictor.
 In case any of the classical correlation coefficients (pearson,
spearman and kendall) are used with (mixed) categorical features, the
categorical features are one-hot encoded and the mean correlation over all
resulting pairs is used as similarity.
cluster_similarity_threshold: (optional) The threshold level for
pair-wise similarity that is required to form clusters using fixed_cut.
This should be a numerical value between 0.0 and 1.0. Note however, that a
reasonable threshold value depends strongly on the similarity metric. The
following are the default values used: 
 mcfadden_r2 and mutual_information: 0.30
 cox_snell_r2 and nagelkerke_r2: 0.75
 spearman, kendall and pearson: 0.90
 Alternatively, if the fixed_cut method is not used, this value determines
whether any clustering should be performed, because the data may not
contain highly similar features. The default values in this situation are: 
 mcfadden_r2 and mutual_information: 0.25
 cox_snell_r2 and nagelkerke_r2: 0.40
 spearman, kendall and pearson: 0.70
 The threshold value is converted to a distance (1-similarity) prior to
cutting hierarchical trees.
cluster_representation_method: (optional) Method used to determine
how the information of co-clustered features is summarised and used to
represent the cluster. The following methods can be selected:
 
 best_predictor (default): The feature with the highest importance
according to univariate regression with the outcome is used to represent
the cluster.
 medioid: The feature closest to the cluster center, i.e. the feature
that is most similar to the remaining features in the cluster, is used to
represent the cluster.
 mean: A meta-feature is generated by averaging the feature values for
all features in a cluster. This method aligns all features so that all
features will be positively correlated prior to averaging. Should a cluster
contain one or more categorical features, the medioid method will be used
instead, as averaging is not possible. Note that if this method is chosen,
the normalisation_method parameter should be one of standardisation,
standardisation_trim, standardisation_winsor or quantile.
 If the pam cluster method is selected, only the medioid method can be
used. In that case 1 medoid is used by default.
parallel_preprocessing: (optional) Enable parallel processing for the
preprocessing workflow. Defaults to TRUE. When set to FALSE, this will
disable the use of parallel processing while preprocessing, regardless of
the settings of the parallel parameter. parallel_preprocessing is
ignored if parallel=FALSE.
fs_method: (required) Feature selection method to be used for
determining variable importance. familiar implements various feature
selection methods. Please refer to the vignette on feature selection
methods for more details.
 More than one feature selection method can be chosen. The experiment will
then be repeated for each feature selection method.
 Feature selection methods determine the ranking of features. Actual
selection of features is done by optimising the signature size model
hyperparameter during the hyperparameter optimisation step.
fs_method_parameter: (optional) List of lists containing parameters
for feature selection methods. Each sublist should have the name of the
feature selection method it corresponds to.
 Most feature selection methods do not have parameters that can be set.
Please refer to the vignette on feature selection methods for more details.
Note that if the feature selection method is based on a learner (e.g. lasso
regression), hyperparameter optimisation may be performed prior to
assessing variable importance.
vimp_aggregation_method: (optional) The method used to aggregate
variable importances over different data subsets, e.g. bootstraps. The
following methods can be selected:
 
 none: Don't aggregate ranks, but rather aggregate the variable
importance scores themselves.
 mean: Use the mean rank of a feature over the subsets to
determine the aggregated feature rank.
 median: Use the median rank of a feature over the subsets to determine
the aggregated feature rank.
 best: Use the best rank the feature obtained in any subset to determine
the aggregated feature rank.
 worst: Use the worst rank the feature obtained in any subset to
determine the aggregated feature rank.
 stability: Use the frequency of the feature being in the subset of
highly ranked features as measure for the aggregated feature rank
(Meinshausen and Buehlmann, 2010).
 exponential: Use a rank-weighted frequency of occurrence in the subset
of highly ranked features as measure for the aggregated feature rank (Haury
et al., 2011).
 borda (default): Use the borda count as measure for the aggregated
feature rank (Wald et al., 2012).
 enhanced_borda: Use an occurrence frequency-weighted borda count as
measure for the aggregated feature rank (Wald et al., 2012).
 truncated_borda: Use borda count computed only on features within the
subset of highly ranked features.
 enhanced_truncated_borda: Apply both the enhanced borda method and the
truncated borda method and use the resulting borda count as the aggregated
feature rank.
 The feature selection methods vignette provides additional information.
vimp_aggregation_rank_threshold: (optional) The threshold used to
define the subset of highly important features. If not set, this threshold
is determined by maximising the variance in the occurrence value over all
features over the subset size.
 This parameter is only relevant for the stability, exponential, enhanced_borda, truncated_borda and enhanced_truncated_borda methods.
parallel_feature_selection (optional) Enable parallel processing for
the feature selection workflow. Defaults to TRUE. When set to FALSE,
this will disable the use of parallel processing while performing feature
selection, regardless of the settings of the parallel parameter. parallel_feature_selection is ignored if parallel=FALSE.
learner (required) One or more algorithms used for model
development. A sizeable number of learners is supported in familiar. Please
see the vignette on learners for more information concerning the available
learners.
hyperparameter (optional) List of lists containing hyperparameters
for learners. Each sublist should have the name of the learner method it
corresponds to, with list elements being named after the intended
hyperparameter, e.g. "glm_logistic"=list("sign_size"=3).
 All learners have hyperparameters. Please refer to the vignette on learners
for more details. If no parameters are provided, sequential model-based
optimisation is used to determine optimal hyperparameters.
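 As a minimal sketch of the expected shape, the example from the text can be written as a full list of lists. The two checking helpers are hypothetical and not part of familiar; they only verify that every level of the list is named.

```r
# The example from the text: set sign_size to 3 for the glm_logistic learner.
hyperparameter <- list(
  "glm_logistic" = list("sign_size" = 3)
)

# Hypothetical helper: TRUE for a non-empty list in which every element is named.
is_named_list <- function(x) {
  is.list(x) && length(x) > 0 && !is.null(names(x)) && all(nzchar(names(x)))
}

# Hypothetical helper: TRUE for a named list of named lists.
has_expected_shape <- function(x) {
  is_named_list(x) && all(vapply(x, is_named_list, logical(1)))
}

has_expected_shape(hyperparameter)
```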
 Hyperparameters provided by the user are never optimised. However, if more
than one value is provided for a single hyperparameter, optimisation will
be conducted using these values.
novelty_detector (optional) Specify the algorithm used for training
a novelty detector. This detector can be used to identify
out-of-distribution data prospectively.
detector_parameters (optional) List of lists containing hyperparameters
for novelty detectors. Currently not used.
parallel_model_development (optional) Enable parallel processing for
the model development workflow. Defaults to TRUE. When set to FALSE,
this will disable the use of parallel processing while developing models,
regardless of the settings of the parallel parameter. parallel_model_development is ignored if parallel=FALSE.
optimisation_bootstraps (optional) Number of bootstraps that should
be generated from the development data set. During the optimisation
procedure one or more of these bootstraps (indicated by
smbo_step_bootstraps) are used for model development using different
combinations of hyperparameters. The effect of the hyperparameters is then
assessed by comparing in-bag and out-of-bag model performance.
 The default number of bootstraps is 50. Hyperparameter optimisation may
finish before exhausting the set of bootstraps.
optimisation_determine_vimp (optional) Logical value that indicates
whether variable importance is determined separately for each of the
bootstraps created during the optimisation process (TRUE) or the
applicable results from the feature selection step are used (FALSE).
 Determining variable importance increases the initial computational
overhead. However, it prevents positive biases for the out-of-bag data due
to overlap of these data with the development data set used for the feature
selection step. In this case, any hyperparameters of the variable
importance method are not determined separately for each bootstrap, but
those obtained during the feature selection step are used instead. If
multiple such hyperparameter sets are applicable, the set that is
used is selected at random for each bootstrap.
 This parameter only affects hyperparameter optimisation of learners. The
default is TRUE.
smbo_random_initialisation (optional) String indicating the
initialisation method for the hyperparameter space. Can be one of
fixed_subsample (default), fixed, or random. fixed and fixed_subsample first create hyperparameter sets from a range of default
values set by familiar. fixed_subsample then randomly draws up to smbo_n_random_sets from the grid. random does not rely upon a fixed
grid, and randomly draws up to smbo_n_random_sets hyperparameter sets
from the hyperparameter space.
smbo_n_random_sets (optional) Number of random or subsampled
hyperparameters drawn during the initialisation process. Default: 100.
Cannot be smaller than 10. The parameter is not used when smbo_random_initialisation is fixed, as the entire pre-defined grid
will be explored.
max_smbo_iterations (optional) Maximum number of intensify
iterations of the SMBO algorithm. During an intensify iteration a run-off
occurs between the current best hyperparameter combination and either the 10
challenger combinations with the highest expected improvement or a set of 20
random combinations.
 Run-off with random combinations is used to force exploration of the
hyperparameter space, and is performed every second intensify iteration, or
if there is no expected improvement for any challenger combination.
 If a combination of hyperparameters leads to better performance on the same
data than the incumbent best set of hyperparameters, it replaces the
incumbent set at the end of the intensify iteration.
 The default number of intensify iterations is 20. Iterations may be
stopped early if the incumbent set of hyperparameters remains the same for smbo_stop_convergent_iterations iterations, or performance improvement is
minimal. This behaviour is suppressed during the first 4 iterations to
enable the algorithm to explore the hyperparameter space.
smbo_stop_convergent_iterations (optional) The number of consecutive
convergent SMBO iterations required to stop hyperparameter optimisation
early. An iteration is convergent if the best parameter set has not
changed or the optimisation score over the 4 most recent iterations has not
changed beyond the tolerance level in smbo_stop_tolerance.
 The default value is 3.
smbo_stop_tolerance (optional) Tolerance for early stopping due to
convergent optimisation score.
 The default value depends on the square root of the number of samples (at
the series level), and is 0.01 for 100 samples. This value is computed as 0.1 * 1 / sqrt(n_samples). The value is limited to 0.0001 for 1M or more
samples.
smbo_time_limit (optional) Time limit (in minutes) for the
optimisation process. Optimisation is stopped after this limit is exceeded.
Time taken to determine variable importance for the optimisation process
(see the optimisation_determine_vimp parameter) does not count.
 The default is NULL, indicating that there is no time limit for the
optimisation process. The time limit cannot be less than 1 minute.
smbo_initial_bootstraps (optional) The number of bootstraps taken
from the set of optimisation_bootstraps as the bootstraps assessed
initially.
 The default value is 1. The value cannot be larger than optimisation_bootstraps.
smbo_step_bootstraps (optional) The number of bootstraps taken from
the set of optimisation_bootstraps bootstraps as the bootstraps assessed
during the steps of each intensify iteration.
 The default value is 3. The value cannot be larger than optimisation_bootstraps.
smbo_intensify_steps (optional) The number of steps in each SMBO
intensify iteration. Each step a new set of smbo_step_bootstraps bootstraps is drawn and used in the run-off between the incumbent best
hyperparameter combination and its challengers.
 The default value is 5. Higher numbers allow for a more detailed
comparison, but this comes with added computational cost.
optimisation_metric (optional) One or more metrics used to compute
performance scores. See the vignette on performance metrics for the
available metrics.
 If unset, the following metrics are used by default:
 
 auc_roc: For binomial and multinomial models.
 mse: Mean squared error for continuous models.
 msle: Mean squared logarithmic error for count models.
 concordance_index: For survival models.
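 The defaults listed above amount to a lookup from outcome type to metric. The following sketch restates that mapping; default_optimisation_metric is a hypothetical helper and not a function exported by familiar.

```r
# Hypothetical helper: default optimisation metric per outcome type,
# restating the list in the documentation.
default_optimisation_metric <- function(outcome_type) {
  switch(
    outcome_type,
    binomial = ,
    multinomial = "auc_roc",
    continuous = "mse",
    count = "msle",
    survival = "concordance_index",
    stop("Unknown outcome type: ", outcome_type)
  )
}

default_optimisation_metric("multinomial")
default_optimisation_metric("survival")
```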
 Multiple optimisation metrics can be specified. Actual metric values are
converted to an objective value by comparison with a baseline metric value
that derives from a trivial model, i.e. majority class for binomial and
multinomial outcomes, the median outcome for count and continuous outcomes
and a fixed risk or time for survival outcomes.
optimisation_function (optional) Type of optimisation function used
to quantify the performance of a hyperparameter set. Model performance is
assessed using the metric(s) specified by optimisation_metric on the
in-bag (IB) and out-of-bag (OOB) samples of a bootstrap. These values are
converted to objective scores with a standardised interval of [-1.0, 1.0]. Each pair of objective scores is subsequently used to compute an
optimisation score. The optimisation score across different bootstraps is
then aggregated to a summary score. This summary score is used to rank
hyperparameter sets, and select the optimal set.
 The combination of optimisation score and summary score is determined by
the optimisation function indicated by this parameter:
 
 validation or max_validation (default): seeks to maximise OOB score.
 balanced: seeks to balance IB and OOB score.
 stronger_balance: similar to balanced, but with stronger penalty for
differences between IB and OOB scores.
 validation_minus_sd: seeks to optimise the average OOB score minus its
standard deviation.
 validation_25th_percentile: seeks to optimise the 25th percentile of
OOB scores, and is conceptually similar to validation_minus_sd.
 model_estimate: seeks to maximise the OOB score estimate predicted by
the hyperparameter learner (not available for random search).
 model_estimate_minus_sd: seeks to maximise the OOB score estimate minus
its estimated standard deviation, as predicted by the hyperparameter
learner (not available for random search).
 model_balanced_estimate: seeks to maximise the estimate of the balanced
IB and OOB score. This is similar to the balanced score, and in fact uses
a hyperparameter learner to predict said score (not available for random
search).
 model_balanced_estimate_minus_sd: seeks to maximise the estimate of the
balanced IB and OOB score, minus its estimated standard deviation. This is
similar to the balanced score, but takes into account its estimated
spread.
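 Two of the summary scores in this list follow directly from their description and can be sketched in base R. The scores are made-up illustrative values, summary_score is a hypothetical helper, and the quantile definition (R's default) is an assumption; the other optimisation functions are not covered here.

```r
# Out-of-bag objective scores of one hyperparameter set over five bootstraps.
# Values are invented for illustration and lie within [-1.0, 1.0].
oob_score <- c(0.40, 0.50, 0.60, 0.50, 0.50)

# Hypothetical helper: aggregate OOB scores into a summary score.
summary_score <- function(oob_score, method) {
  switch(
    method,
    # Average OOB score minus its standard deviation.
    validation_minus_sd = mean(oob_score) - stats::sd(oob_score),
    # 25th percentile of the OOB scores.
    validation_25th_percentile = unname(stats::quantile(oob_score, probs = 0.25)),
    stop("Method not covered by this sketch: ", method)
  )
}

summary_score(oob_score, "validation_minus_sd")
summary_score(oob_score, "validation_25th_percentile")
```

 Both variants penalise hyperparameter sets whose OOB performance varies strongly between bootstraps, which is why they are described as conceptually similar.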
 Additional details are provided in the Learning algorithms and
hyperparameter optimisation vignette.
hyperparameter_learner (optional) Any point in the hyperparameter
space has a single, scalar, optimisation score value that is a priori
unknown. During the optimisation process, the algorithm samples from the
hyperparameter space by selecting hyperparameter sets and computing the
optimisation score value for one or more bootstraps. For each
hyperparameter set the resulting values are distributed around the actual
value. The learner indicated by hyperparameter_learner is then used to
infer optimisation score estimates for unsampled parts of the
hyperparameter space. The following models are available:
 
 bayesian_additive_regression_trees or bart: Uses Bayesian Additive
Regression Trees (Sparapani et al., 2021) for inference. Unlike standard
random forests, BART allows for estimating posterior distributions directly
and can extrapolate.
 gaussian_process (default): Creates a localised approximate Gaussian
process for inference (Gramacy, 2016). This allows for better scaling than
deterministic Gaussian Processes.
 random_forest: Creates a random forest for inference. Originally
suggested by Hutter et al. (2011). A weakness of random forests is their
lack of extrapolation beyond observed values, which limits their usefulness
in exploiting promising areas of hyperparameter space.
 random or random_search: Forgoes the use of models to steer
optimisation. Instead, a random search is performed.
acquisition_function (optional) The acquisition function influences
how new hyperparameter sets are selected. The algorithm uses the model
learned by the learner indicated by hyperparameter_learner to search the
hyperparameter space for hyperparameter sets that are either likely better
than the best known set (exploitation) or where there is considerable
uncertainty (exploration). The acquisition function quantifies this
(Shahriari et al., 2016).
 The following acquisition functions are available, and are described in
more detail in the learner algorithms vignette:
 
 improvement_probability: The probability of improvement quantifies the
probability that the expected optimisation score for a set is better than
the best observed optimisation score.
 improvement_empirical_probability: Similar to improvement_probability, but based directly on optimisation scores
predicted by the individual decision trees.
 expected_improvement (default): Computes expected improvement.
 upper_confidence_bound: This acquisition function is based on the upper
confidence bound of the distribution (Srinivas et al., 2012).
 bayes_upper_confidence_bound: This acquisition function is based on the
upper confidence bound of the distribution (Kaufmann et al., 2012).
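 To make the trade-off between exploitation and exploration concrete, the sketch below uses the textbook forms of three acquisition functions under the assumption of a normally distributed prediction. These are not necessarily the exact expressions familiar uses; the function signatures, the kappa parameter and the example values are assumptions made for illustration.

```r
# mu, sigma: predicted mean and standard deviation (sigma > 0) of the
# optimisation score for candidate hyperparameter sets.
# best: best observed optimisation score so far (higher is better).
improvement_probability <- function(mu, sigma, best) {
  stats::pnorm((mu - best) / sigma)
}

expected_improvement <- function(mu, sigma, best) {
  z <- (mu - best) / sigma
  (mu - best) * stats::pnorm(z) + sigma * stats::dnorm(z)
}

# kappa controls the weight given to uncertainty; its value here is arbitrary.
upper_confidence_bound <- function(mu, sigma, kappa = 2) {
  mu + kappa * sigma
}

# Candidate 1: as good as the incumbent, low uncertainty.
# Candidate 2: slightly worse on average, but highly uncertain.
mu <- c(0.50, 0.45)
sigma <- c(0.02, 0.20)
best <- 0.50

improvement_probability(mu, sigma, best)
expected_improvement(mu, sigma, best)
upper_confidence_bound(mu, sigma)
```

 In this example the probability of improvement favours the first candidate, whereas expected improvement and the upper confidence bound favour the second, more uncertain candidate.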
exploration_method (optional) Method used to steer exploration in
post-initialisation intensive searching steps. As stated earlier, each SMBO
iteration step compares suggested alternative parameter sets with an
incumbent best set in a series of steps. The exploration method
controls how the set of alternative parameter sets is pruned after each
step in an iteration. Can be one of the following:
 
 single_shot (default): The set of alternative parameter sets is not
pruned, and each intensification iteration contains only a single
intensification step that only uses a single bootstrap. This is the fastest
exploration method, but only superficially tests each parameter set.
 successive_halving: The set of alternative parameter sets is
pruned by removing the worst performing half of the sets after each step
(Jamieson and Talwalkar, 2016).
 stochastic_reject: The set of alternative parameter sets is pruned by
comparing the performance of each parameter set with that of the incumbent
best parameter set using a paired Wilcoxon test based on shared
bootstraps. Parameter sets that perform significantly worse, at an alpha
level indicated by smbo_stochastic_reject_p_value, are pruned.
 none: The set of alternative parameter sets is not pruned.
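 The two pruning strategies can be sketched in base R as follows. This is an illustration under stated assumptions rather than familiar's implementation: the function names are hypothetical, candidates are ranked by their mean score for halving, and the paired Wilcoxon test is taken to be one-sided with a normal approximation.

```r
# scores: one row per alternative parameter set, one column per shared
# bootstrap; higher values are better.

# Keep the better-performing half of the alternative parameter sets.
successive_halving <- function(scores) {
  mean_score <- rowMeans(scores)
  n_keep <- ceiling(nrow(scores) / 2)
  sort(order(mean_score, decreasing = TRUE)[seq_len(n_keep)])
}

# Keep sets that do not perform significantly worse than the incumbent.
stochastic_reject <- function(scores, incumbent, p_value = 0.05) {
  rejected <- apply(scores, 1, function(candidate) {
    test <- stats::wilcox.test(
      candidate, incumbent,
      paired = TRUE, alternative = "less", exact = FALSE
    )
    !is.na(test$p.value) && test$p.value < p_value
  })
  which(!rejected)
}

incumbent <- c(0.60, 0.62, 0.58, 0.61, 0.59, 0.63, 0.60, 0.62)
scores <- rbind(
  worse = incumbent - 0.10,
  similar = incumbent + c(0.01, -0.01, 0.02, -0.02, 0.01, -0.01, 0.02, -0.02)
)

successive_halving(scores)
stochastic_reject(scores, incumbent)
```

 Both calls return the row index of the similar set: the worse set is dropped by halving because of its lower mean score, and by the Wilcoxon test because it scores lower than the incumbent on every shared bootstrap.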
smbo_stochastic_reject_p_value (optional) The p-value threshold used
for the stochastic_reject exploration method.
 The default value is 0.05.
parallel_hyperparameter_optimisation (optional) Enable parallel
processing for hyperparameter optimisation. Defaults to TRUE. When set to FALSE, this will disable the use of parallel processing while performing
optimisation, regardless of the settings of the parallel parameter. The
parameter moreover specifies whether parallelisation takes place within the
optimisation algorithm (inner, default), or in an outer loop (outer)
over learners, data subsamples, etc.
 parallel_hyperparameter_optimisation is ignored if parallel=FALSE.
evaluate_top_level_only (optional) Flag that signals that only
evaluation at the most global experiment level is required. Consider a
cross-validation experiment with additional external validation. The global
experiment level consists of data that are used for development, internal
validation and external validation. The next lower experiment level are the
individual cross-validation iterations.
 When the flag is true, evaluations take place on the global level only,
and no results are generated for the next lower experiment levels. In our
example, this means that results from individual cross-validation iterations
are not computed and shown. When the flag is false, results are computed
from both the global layer and the next lower level.
 Setting the flag to true saves computation time.
skip_evaluation_elements (optional) Specifies which evaluation steps,
if any, should be skipped as part of the evaluation process. Defaults to
none, which means that all relevant evaluation steps are performed. It can
have one or more of the following values: 
 none, false: no steps are skipped.
 all, true: all steps are skipped.
 auc_data: data for assessing and plotting the area under the receiver
operating characteristic curve are not computed.
 calibration_data: data for assessing and plotting model calibration are
not computed.
 calibration_info: data required to assess calibration, such as baseline
survival curves, are not collected. These data will still be present in the
models.
 confusion_matrix: data for assessing and plotting a confusion matrix are
not collected.
 decision_curve_analyis: data for performing a decision curve analysis
are not computed.
 feature_expressions: data for assessing and plotting sample clustering
are not computed.
 feature_similarity: data for assessing and plotting feature clusters are
not computed.
 fs_vimp: data for assessing and plotting feature selection-based
variable importance are not collected.
 hyperparameters: data for assessing model hyperparameters are not
collected. These data will still be present in the models.
 ice_data: data for individual conditional expectation and partial
dependence plots are not created.
 model_performance: data for assessing and visualising model performance
are not created.
 model_vimp: data for assessing and plotting model-based variable
importance are not collected.
 permutation_vimp: data for assessing and plotting model-agnostic
permutation variable importance are not computed.
 prediction_data: predictions for each sample are not made and exported.
 risk_stratification_data: data for assessing and plotting Kaplan-Meier
survival curves are not collected.
 risk_stratification_info: data for assessing stratification into risk
groups are not computed.
 univariate_analysis: data for assessing and plotting univariate feature
importance are not computed.
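Because the element names must match exactly, it can help to check a selection against the documented values before starting a long experiment. The sketch below does so in base R; unknown_skip_elements is a hypothetical helper and not part of familiar, and the element names are copied from the list above with their documented spelling.

```r
# Element names copied from the documented list (spelling as documented).
valid_skip_elements <- c(
  "none", "false", "all", "true", "auc_data", "calibration_data",
  "calibration_info", "confusion_matrix", "decision_curve_analyis",
  "feature_expressions", "feature_similarity", "fs_vimp", "hyperparameters",
  "ice_data", "model_performance", "model_vimp", "permutation_vimp",
  "prediction_data", "risk_stratification_data", "risk_stratification_info",
  "univariate_analysis"
)

# Hypothetical helper: return the names that are not documented elements.
unknown_skip_elements <- function(skip_evaluation_elements) {
  setdiff(skip_evaluation_elements, valid_skip_elements)
}

# Skip the two computationally heavier steps chosen for this example.
skip_evaluation_elements <- c("ice_data", "permutation_vimp")
unknown_skip_elements(skip_evaluation_elements)   # character(0)

# A differently spelled name is reported back.
unknown_skip_elements("decision_curve_analysis")
```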
ensemble_method (optional) Method for ensembling predictions from
models for the same sample. Available methods are:
 This parameter is only used if detail_level is ensemble.
evaluation_metric (optional) One or more metrics for assessing model
performance. See the vignette on performance metrics for the available
metrics.
 Confidence intervals (or rather credibility intervals) are computed for each
metric during evaluation. This is done using bootstraps, the number of which
depends on the value of confidence_level (Davison and Hinkley, 1997).
 If unset, the metric in the optimisation_metric variable is used.
sample_limit (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000).
 This parameter can be set for the following data elements:
sample_similarity and ice_data.
detail_level (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
estimation_type (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
aggregate_results (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point.
 The default value is TRUE, except when assessing model performance
metrics, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
confidence_level (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps.
 The default value is 0.95.
bootstrap_ci_method (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
feature_cluster_method (optional) Method used to perform clustering
of features. The same methods as for the cluster_method configuration
parameter are available: none, hclust, agnes, diana and pam.
 The value for the cluster_method configuration parameter is used by
default. When generating clusters for the purpose of determining mutual
correlation and ordering feature expressions, none is ignored and hclust is used instead.
feature_linkage_method (optional) Method used for agglomerative
clustering with hclust and agnes. Linkage determines how features are
sequentially combined into clusters based on distance. The methods are
shared with the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward.
 The value for the cluster_linkage_method configuration parameter is used
by default.
feature_cluster_cut_method (optional) Method used to divide features
into separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut.
 silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 The value for the cluster_cut_method configuration parameter is used by
default.
feature_similarity_metric (optional) Metric to determine pairwise
similarity between features. Similarity is computed in the same manner as
for clustering, and feature_similarity_metric therefore has the same
options as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, mutual_information, spearman, kendall and pearson.
 The value used for the cluster_similarity_metric configuration parameter
is used by default.
feature_similarity_threshold (optional) The threshold level for
pair-wise similarity that is required to form feature clusters with the
fixed_cut method. This threshold functions in the same manner as the one
defined using the cluster_similarity_threshold parameter.
 By default, the value for the cluster_similarity_threshold configuration
parameter is used.
 Unlike for cluster_similarity_threshold, more than one value can be
supplied here.
sample_cluster_method (optional) The method used to perform
clustering based on distance between samples. These are the same methods as
for the cluster_method configuration parameter: hclust, agnes, diana and pam.
 The value for the cluster_method configuration parameter is used by
default. When generating clusters for the purpose of ordering samples in
feature expressions, none is ignored and hclust is used instead.
sample_linkage_method (optional) The method used for agglomerative
clustering in hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward.
 The value for the cluster_linkage_method configuration parameter is used
by default.
sample_similarity_metric (optional) Metric to determine pairwise
similarity between samples. Similarity is computed in the same manner as for
clustering, but sample_similarity_metric has different options that are
better suited to computing distance between samples instead of between
features. The following metrics are available. 
 gower (default): compute Gower's distance between samples. By default,
Gower's distance is computed based on winsorised data to reduce the effect
of outliers (see below).
 euclidean: compute the Euclidean distance between samples.
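 As an illustration of the gower metric, the sketch below computes Gower's distance for a small data frame in base R: numeric features are scaled to the [0,1] range across samples, and categorical features contribute 0 for matching values and 1 otherwise. The function name, the equal weighting of features and the example data are assumptions; trimming and winsorising are left out.

```r
# Hypothetical helper: pairwise Gower's distance between the rows of a
# data frame, with equal weight for every feature.
gower_distance <- function(data) {
  n <- nrow(data)
  per_feature <- lapply(data, function(feature) {
    if (is.numeric(feature)) {
      # Scale to [0, 1] using the range across samples, then take the
      # absolute difference between each pair of samples.
      value_range <- diff(range(feature))
      scaled <- if (value_range > 0) {
        (feature - min(feature)) / value_range
      } else {
        rep(0, n)
      }
      abs(outer(scaled, scaled, "-"))
    } else {
      # Categorical: 0 if the values match, 1 if they do not.
      values <- as.character(feature)
      1 * outer(values, values, "!=")
    }
  })
  # Average the per-feature distances.
  Reduce(`+`, per_feature) / length(per_feature)
}

samples <- data.frame(
  age = c(20, 40, 60),
  sex = factor(c("f", "m", "f"))
)
gower_distance(samples)
```

 For the first two samples the scaled age difference is 0.5 and the categorical feature differs, so their distance is (0.5 + 1) / 2 = 0.75.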
 The underlying feature data for numerical features is scaled to the
[0,1] range using the feature values across the samples. The
normalisation parameters required can optionally be computed from feature
data with the outer 5% (on both sides) of feature values trimmed or
winsorised. To do so, append _trim (trimming) or _winsor (winsorising) to
the metric name. This reduces the effect of outliers somewhat.
 Regardless of metric, all categorical features are handled as for the
Gower's distance: distance is 0 if the values in a pair of samples match,
and 1 if they do not.
eval_aggregation_method (optional) Method for aggregating variable
importances for the purpose of evaluation. Variable importances are
determined during feature selection steps and after training the model. Both
types are evaluated, but feature selection variable importance is only
evaluated at run-time.
 See the documentation for the vimp_aggregation_method argument for
information concerning the different methods available.
eval_aggregation_rank_threshold (optional) The threshold used to
define the subset of highly important features during evaluation.
 See the documentation for the vimp_aggregation_rank_threshold argument for
more information.
eval_icc_type (optional) String indicating the type of intraclass
correlation coefficient (1, 2 or 3) that should be used to compute
robustness for features in repeated measurements during the evaluation of
univariate importance. These types correspond to the types in Shrout and
Fleiss (1979). The default value is 1.
stratification_method (optional) Method for determining the
stratification threshold for creating survival groups. The actual,
model-dependent, threshold value is obtained from the development data, and
can afterwards be used to perform stratification on validation data.
 The following stratification methods are available:
 
 median (default): The median predicted value in the development cohort
is used to stratify the samples into two risk groups. For predicted outcome
values that build a continuous spectrum, the two risk groups in the
development cohort will be roughly equal in size.
 mean: The mean predicted value in the development cohort is used to
stratify the samples into two risk groups.
 mean_trim: As mean, but based on the set of predicted values
where the 5% lowest and 5% highest values are discarded. This reduces the
effect of outliers.
 mean_winsor: As mean, but based on the set of predicted values where
the 5% lowest and 5% highest values are winsorised. This reduces the effect
of outliers.
 fixed: Samples are stratified based on the sample quantiles of the
predicted values. These quantiles are defined using the stratification_threshold parameter.
 optimised: Use maximally selected rank statistics to determine the
optimal threshold (Lausen and Schumacher, 1992; Hothorn et al., 2003) to
stratify samples into two optimally separated risk groups.
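 Most of these thresholds are simple statistics of the predicted values in the development cohort, as the following base R sketch shows. It is an illustration, not familiar's implementation: find_threshold and assign_risk_group are hypothetical helpers, the quantile definition is R's default, the treatment of values that fall exactly on a threshold is an assumption, and the optimised method is not covered.

```r
# Hypothetical helper: derive stratification threshold(s) from predicted
# values in the development cohort.
find_threshold <- function(predicted, method, quantiles = c(1 / 3, 2 / 3)) {
  # 5% and 95% sample quantiles, used for trimming and winsorising.
  limits <- stats::quantile(predicted, probs = c(0.05, 0.95), names = FALSE)
  switch(
    method,
    median = stats::median(predicted),
    mean = mean(predicted),
    mean_trim = mean(predicted[predicted >= limits[1] & predicted <= limits[2]]),
    mean_winsor = mean(pmin(pmax(predicted, limits[1]), limits[2])),
    fixed = stats::quantile(predicted, probs = quantiles, names = FALSE),
    stop("Method not covered by this sketch: ", method)
  )
}

# Hypothetical helper: group 1 lies below the lowest threshold. Values equal
# to a threshold are placed in the higher group (an assumption).
assign_risk_group <- function(predicted, threshold) {
  findInterval(predicted, sort(threshold)) + 1L
}

# Thresholds are derived from development data and reused on validation data.
development <- 1:9
threshold <- find_threshold(development, "fixed")
table(assign_risk_group(development, threshold))

validation <- c(2, 5, 8, 12)
assign_risk_group(validation, threshold)
```

 With two quantile thresholds the development samples fall into three equally sized groups, and the same thresholds assign the validation samples to groups 1, 2, 3 and 3.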
 One or more stratification methods can be selected simultaneously.
 This parameter is only relevant for survival outcomes.
stratification_threshold (optional) Numeric value(s) signifying the
sample quantiles for stratification using the fixed method. The number of
risk groups will be the number of values +1.
 The default value is c(1/3, 2/3), which will yield two thresholds that
divide samples into three equally sized groups. If fixed is not among the
selected stratification methods, this parameter is ignored.
 This parameter is only relevant for survival outcomes.
time_max (optional) Time point which is used as the benchmark for
e.g. cumulative risks generated by random forest, or the cutoff for Uno's
concordance index.
 If time_max is not provided, but evaluation_times is, the largest value
of evaluation_times is used. If neither is provided, time_max is set
to the 98th percentile of the distribution of survival times for samples
with an event in the development data set.
 This parameter is only relevant for survival outcomes.
evaluation_times (optional) One or more time points that are used for
assessing calibration in survival problems. This is done as expected and
observed survival probabilities depend on time.
 If unset, evaluation_times will be equal to time_max.
 This parameter is only relevant for survival outcomes.
dynamic_model_loading (optional) Enables dynamic loading of models
during the evaluation process, if TRUE. Defaults to FALSE. Dynamic
loading of models may reduce the overall memory footprint, at the cost of
increased disk or network IO. Models can only be dynamically loaded if they
are found at an accessible disk or network location. Setting this parameter
to TRUE may help if parallel processing causes out-of-memory issues during
evaluation.
parallel_evaluation (optional) Enable parallel processing for
the evaluation process. Defaults to TRUE. When set to FALSE, this
will disable the use of parallel processing while performing evaluation,
regardless of the settings of the parallel parameter. The parameter
moreover specifies whether parallelisation takes place within the evaluation
process steps (inner, default), or in an outer loop (outer) over
learners, data subsamples, etc.
 parallel_evaluation is ignored if parallel=FALSE.
 | 
Value
A list of settings to be used within the workflow
References
-  Storey, J. D. A direct approach to false discovery rates. J.
R. Stat. Soc. Series B Stat. Methodol. 64, 479–498 (2002).
 
-  Shrout, P. E. & Fleiss, J. L. Intraclass correlations: uses in assessing
rater reliability. Psychol. Bull. 86, 420–428 (1979).
 
-  Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass
correlation coefficients for reliability research. J. Chiropr. Med. 15,
155–163 (2016).
 
-  Yeo, I. & Johnson, R. A. A new family of power transformations to
improve normality or symmetry. Biometrika 87, 954–959 (2000).
 
-  Box, G. E. P. & Cox, D. R. An analysis of transformations. J. R. Stat.
Soc. Series B Stat. Methodol. 26, 211–252 (1964).
 
-  Raymaekers, J. & Rousseeuw, P. J. Transforming variables to central
normality. Mach. Learn. (2021).
 
-  Park, M. Y., Hastie, T. & Tibshirani, R. Averaged gene expressions for
regression. Biostatistics 8, 212–227 (2007).
 
-  Tolosi, L. & Lengauer, T. Classification with correlated features:
unreliability of feature ranking and solutions. Bioinformatics 27,
1986–1994 (2011).
 
-  Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in
microarray expression data using empirical Bayes methods. Biostatistics 8,
118–127 (2007).
 
-  Kaufman, L. & Rousseeuw, P. J. Finding groups in data: an introduction
to cluster analysis. (John Wiley & Sons, 2009).
 
-  Muellner, D. fastcluster: fast hierarchical, agglomerative clustering
routines for R and Python. J. Stat. Softw. 53, 1–18 (2013).
 
-  Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and
validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
 
-  Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a
hierarchical cluster tree: the Dynamic Tree Cut package for R.
Bioinformatics 24, 719–720 (2008).
 
-  McFadden, D. Conditional logit analysis of qualitative choice behavior.
in Frontiers in Econometrics (ed. Zarembka, P.) 105–142 (Academic Press,
1974).
 
-  Cox, D. R. & Snell, E. J. Analysis of binary data. (Chapman and Hall,
1989).
 
-  Nagelkerke, N. J. D. A note on a general definition of the coefficient
of determination. Biometrika 78, 691–692 (1991).
 
-  Meinshausen, N. & Buehlmann, P. Stability selection. J. R. Stat. Soc.
Series B Stat. Methodol. 72, 417–473 (2010).
 
-  Haury, A.-C., Gestraud, P. & Vert, J.-P. The influence of feature
selection methods on accuracy, stability and interpretability of molecular
signatures. PLoS One 6, e28210 (2011).
 
-  Wald, R., Khoshgoftaar, T. M., Dittman, D., Awada, W. & Napolitano, A. An
extensive comparison of feature ranking aggregation techniques in
bioinformatics. in 2012 IEEE 13th International Conference on Information
Reuse Integration (IRI) 377–384 (2012).
 
-  Hutter, F., Hoos, H. H. & Leyton-Brown, K. Sequential model-based
optimization for general algorithm configuration. in Learning and
Intelligent Optimization (ed. Coello, C. A. C.) 6683, 507–523 (Springer
Berlin Heidelberg, 2011).
 
-  Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & de Freitas, N.
Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc.
IEEE 104, 148–175 (2016).
 
-  Srinivas, N., Krause, A., Kakade, S. M. & Seeger, M. W.
Information-Theoretic Regret Bounds for Gaussian Process Optimization in
the Bandit Setting. IEEE Trans. Inf. Theory 58, 3250–3265 (2012).
 
-  Kaufmann, E., Cappé, O. & Garivier, A. On Bayesian upper confidence
bounds for bandit problems. in Artificial intelligence and statistics
592–600 (2012).
 
-  Jamieson, K. & Talwalkar, A. Non-stochastic Best Arm Identification and
Hyperparameter Optimization. in Proceedings of the 19th International
Conference on Artificial Intelligence and Statistics (eds. Gretton, A. &
Robert, C. C.) vol. 51 240–248 (PMLR, 2016).
 
-  Gramacy, R. B. laGP: Large-Scale Spatial Modeling via Local Approximate
Gaussian Processes in R. Journal of Statistical Software 72, 1–46 (2016).
 
-  Sparapani, R., Spanbauer, C. & McCulloch, R. Nonparametric Machine
Learning and Efficient Computation with Bayesian Additive Regression Trees:
The BART R Package. Journal of Statistical Software 97, 1–66 (2021).
 
-  Davison, A. C. & Hinkley, D. V. Bootstrap methods and their application.
(Cambridge University Press, 1997).
 
-  Efron, B. & Hastie, T. Computer Age Statistical Inference. (Cambridge
University Press, 2016).
 
-  Lausen, B. & Schumacher, M. Maximally Selected Rank Statistics.
Biometrics 48, 73 (1992).
 
-  Hothorn, T. & Lausen, B. On the exact distribution of maximally selected
rank statistics. Comput. Stat. Data Anal. 43, 121–137 (2003).
 
Internal function for parsing settings related to hyperparameter optimisation
Description
Internal function for parsing settings related to hyperparameter optimisation
Usage
.parse_hyperparameter_optimisation_settings(
  config = NULL,
  parallel,
  outcome_type,
  optimisation_bootstraps = waiver(),
  optimisation_determine_vimp = waiver(),
  smbo_random_initialisation = waiver(),
  smbo_n_random_sets = waiver(),
  max_smbo_iterations = waiver(),
  smbo_stop_convergent_iterations = waiver(),
  smbo_stop_tolerance = waiver(),
  smbo_time_limit = waiver(),
  smbo_initial_bootstraps = waiver(),
  smbo_step_bootstraps = waiver(),
  smbo_intensify_steps = waiver(),
  smbo_stochastic_reject_p_value = waiver(),
  optimisation_function = waiver(),
  optimisation_metric = waiver(),
  acquisition_function = waiver(),
  exploration_method = waiver(),
  hyperparameter_learner = waiver(),
  parallel_hyperparameter_optimisation = waiver(),
  ...
)
Arguments
| config | A list of settings, e.g. from an xml file. | 
| parallel | Logical value that indicates whether familiar uses parallelisation. If
FALSE, it will override parallel_hyperparameter_optimisation. | 
| outcome_type | Type of outcome found in the data set. | 
| optimisation_bootstraps | (optional) Number of bootstraps that should
be generated from the development data set. During the optimisation
procedure one or more of these bootstraps (indicated by
smbo_step_bootstraps) are used for model development using different
combinations of hyperparameters. The effect of the hyperparameters is then
assessed by comparing in-bag and out-of-bag model performance. The default number of bootstraps is 50. Hyperparameter optimisation may
finish before exhausting the set of bootstraps. | 
| optimisation_determine_vimp | (optional) Logical value that indicates
whether variable importance is determined separately for each of the
bootstraps created during the optimisation process (TRUE) or the
applicable results from the feature selection step are used (FALSE). Determining variable importance increases the initial computational
overhead. However, it prevents positive biases for the out-of-bag data due
to overlap of these data with the development data set used for the feature
selection step. In this case, any hyperparameters of the variable
importance method are not determined separately for each bootstrap, but
those obtained during the feature selection step are used instead. In case
multiple of such hyperparameter sets could be applicable, the set that will
be used is randomly selected for each bootstrap.
 This parameter only affects hyperparameter optimisation of learners. The
default is TRUE. | 
| smbo_random_initialisation | (optional) String indicating the
initialisation method for the hyperparameter space. Can be one of
fixed_subsample (default), fixed, or random. fixed and fixed_subsample first create hyperparameter sets from a range of default
values set by familiar. fixed_subsample then randomly draws up to smbo_n_random_sets from the grid. random does not rely upon a fixed
grid, and randomly draws up to smbo_n_random_sets hyperparameter sets
from the hyperparameter space. | 
| smbo_n_random_sets | (optional) Number of random or subsampled
hyperparameters drawn during the initialisation process. Default: 100.
Cannot be smaller than 10. The parameter is not used when smbo_random_initialisation is fixed, as the entire pre-defined grid
will be explored. | 
| max_smbo_iterations | (optional) Maximum number of intensify
iterations of the SMBO algorithm. During an intensify iteration a run-off
occurs between the current best hyperparameter combination and either the 10
challenger combinations with the highest expected improvement or a set of 20
random combinations.
 Run-off with random combinations is used to force exploration of the
hyperparameter space, and is performed every second intensify iteration, or
if there is no expected improvement for any challenger combination.
 If a combination of hyperparameters leads to better performance on the same
data than the incumbent best set of hyperparameters, it replaces the
incumbent set at the end of the intensify iteration.
 The default number of intensify iterations is 20. Iterations may be
stopped early if the incumbent set of hyperparameters remains the same for smbo_stop_convergent_iterations iterations, or performance improvement is
minimal. This behaviour is suppressed during the first 4 iterations to
enable the algorithm to explore the hyperparameter space. | 
| smbo_stop_convergent_iterations | (optional) The number of subsequent
convergent SMBO iterations required to stop hyperparameter optimisation
early. An iteration is convergent if the best parameter set has not
changed or the optimisation score over the 4 most recent iterations has not
changed beyond the tolerance level in smbo_stop_tolerance. The default value is 3. | 
| smbo_stop_tolerance | (optional) Tolerance for early stopping due to
convergent optimisation score.
 The default value depends on the square root of the number of samples (at
the series level), and is 0.01 for 100 samples. This value is computed as 0.1 * 1 / sqrt(n_samples). The lower limit of the tolerance is 0.0001, which is reached for 1M or more
samples. | 
| smbo_time_limit | (optional) Time limit (in minutes) for the
optimisation process. Optimisation is stopped after this limit is exceeded.
Time taken to determine variable importance for the optimisation process
(see the optimisation_determine_vimp parameter) does not count. The default is NULL, indicating that there is no time limit for the
optimisation process. The time limit cannot be less than 1 minute. | 
| smbo_initial_bootstraps | (optional) The number of bootstraps taken
from the set of optimisation_bootstraps as the bootstraps assessed
initially. The default value is 1. The value cannot be larger than optimisation_bootstraps. | 
| smbo_step_bootstraps | (optional) The number of bootstraps taken from
the set of optimisation_bootstraps bootstraps as the bootstraps assessed
during the steps of each intensify iteration. The default value is 3. The value cannot be larger than optimisation_bootstraps. | 
| smbo_intensify_steps | (optional) The number of steps in each SMBO
intensify iteration. In each step a new set of smbo_step_bootstraps bootstraps is drawn and used in the run-off between the incumbent best
hyperparameter combination and its challengers. The default value is 5. Higher numbers allow for a more detailed
comparison, but this comes with added computational cost. | 
| smbo_stochastic_reject_p_value | (optional) The p-value threshold used
for the stochastic_reject exploration method. The default value is 0.05. | 
| optimisation_function | (optional) Type of optimisation function used
to quantify the performance of a hyperparameter set. Model performance is
assessed using the metric(s) specified by optimisation_metric on the
in-bag (IB) and out-of-bag (OOB) samples of a bootstrap. These values are
converted to objective scores with a standardised interval of [-1.0, 1.0]. Each pair of objective scores is subsequently used to compute an
optimisation score. The optimisation score across different bootstraps is
then aggregated to a summary score. This summary score is used to rank
hyperparameter sets, and select the optimal set. The combination of optimisation score and summary score is determined by
the optimisation function indicated by this parameter:
 
 validation or max_validation (default): seeks to maximise OOB score.
 balanced: seeks to balance IB and OOB score.
 stronger_balance: similar to balanced, but with stronger penalty for
differences between IB and OOB scores.
 validation_minus_sd: seeks to optimise the average OOB score minus its
standard deviation.
 validation_25th_percentile: seeks to optimise the 25th percentile of
OOB scores, and is conceptually similar to validation_minus_sd.
 model_estimate: seeks to maximise the OOB score estimate predicted by
the hyperparameter learner (not available for random search).
 model_estimate_minus_sd: seeks to maximise the OOB score estimate minus
its estimated standard deviation, as predicted by the hyperparameter
learner (not available for random search).
 model_balanced_estimate: seeks to maximise the estimate of the balanced
IB and OOB score. This is similar to the balanced score, and in fact uses
a hyperparameter learner to predict said score (not available for random
search).
 model_balanced_estimate_minus_sd: seeks to maximise the estimate of the
balanced IB and OOB score, minus its estimated standard deviation. This is
similar to the balanced score, but takes into account its estimated
spread.
 Additional details are provided in the Learning algorithms and
hyperparameter optimisation vignette. | 
| optimisation_metric | (optional) One or more metrics used to compute
performance scores. See the vignette on performance metrics for the
available metrics.
 If unset, the following metrics are used by default:
 
 auc_roc: For binomial and multinomial models.
 mse: Mean squared error for continuous models.
 msle: Mean squared logarithmic error for count models.
 concordance_index: For survival models.
 Multiple optimisation metrics can be specified. Actual metric values are
converted to an objective value by comparison with a baseline metric value
that derives from a trivial model, i.e. majority class for binomial and
multinomial outcomes, the median outcome for count and continuous outcomes
and a fixed risk or time for survival outcomes. | 
| acquisition_function | (optional) The acquisition function influences
how new hyperparameter sets are selected. The algorithm uses the model
learned by the learner indicated by hyperparameter_learner to search the
hyperparameter space for hyperparameter sets that are either likely better
than the best known set (exploitation) or where there is considerable
uncertainty (exploration). The acquisition function quantifies this
(Shahriari et al., 2016). The following acquisition functions are available, and are described in
more detail in the learner algorithms vignette:
 
 improvement_probability: The probability of improvement quantifies the
probability that the expected optimisation score for a set is better than
the best observed optimisation score.
 improvement_empirical_probability: Similar to improvement_probability, but based directly on optimisation scores
predicted by the individual decision trees.
 expected_improvement (default): Computes expected improvement.
 upper_confidence_bound: This acquisition function is based on the upper
confidence bound of the distribution (Srinivas et al., 2012).
 bayes_upper_confidence_bound: This acquisition function is based on the
upper confidence bound of the distribution (Kaufmann et al., 2012).
 | 
| exploration_method | (optional) Method used to steer exploration in
post-initialisation intensive searching steps. As stated earlier, each SMBO
iteration step compares suggested alternative parameter sets with an
incumbent best set in a series of steps. The exploration method
controls how the set of alternative parameter sets is pruned after each
step in an iteration. Can be one of the following:
 
 single_shot (default): The set of alternative parameter sets is not
pruned, and each intensification iteration contains only a single
intensification step that only uses a single bootstrap. This is the fastest
exploration method, but only superficially tests each parameter set.
 successive_halving: The set of alternative parameter sets is
pruned by removing the worst performing half of the sets after each step
(Jamieson and Talwalkar, 2016).
 stochastic_reject: The set of alternative parameter sets is pruned by
comparing the performance of each parameter set with that of the incumbent
best parameter set using a paired Wilcoxon test based on shared
bootstraps. Parameter sets that perform significantly worse, at an alpha
level indicated by smbo_stochastic_reject_p_value, are pruned.
 none: The set of alternative parameter sets is not pruned.
 | 
| hyperparameter_learner | (optional) Any point in the hyperparameter
space has a single, scalar, optimisation score value that is a priori
unknown. During the optimisation process, the algorithm samples from the
hyperparameter space by selecting hyperparameter sets and computing the
optimisation score value for one or more bootstraps. For each
hyperparameter set the resulting values are distributed around the actual
value. The learner indicated by hyperparameter_learner is then used to
infer optimisation score estimates for unsampled parts of the
hyperparameter space. The following models are available:
 
 bayesian_additive_regression_trees or bart: Uses Bayesian Additive
Regression Trees (Sparapani et al., 2021) for inference. Unlike standard
random forests, BART allows for estimating posterior distributions directly
and can extrapolate.
 gaussian_process (default): Creates a localised approximate Gaussian
process for inference (Gramacy, 2016). This allows for better scaling than
deterministic Gaussian Processes.
 random_forest: Creates a random forest for inference. Originally
suggested by Hutter et al. (2011). A weakness of random forests is their
lack of extrapolation beyond observed values, which limits their usefulness
in exploiting promising areas of hyperparameter space.
 random or random_search: Forgoes the use of models to steer
optimisation. Instead, a random search is performed.
 | 
| parallel_hyperparameter_optimisation | (optional) Enable parallel
processing for hyperparameter optimisation. Defaults to TRUE. When set to FALSE, this will disable the use of parallel processing while performing
optimisation, regardless of the settings of the parallel parameter. The
parameter moreover specifies whether parallelisation takes place within the
optimisation algorithm (inner, default), or in an outer loop (outer)
over learners, data subsamples, etc. parallel_hyperparameter_optimisation is ignored if parallel=FALSE.
 | 
| ... | Unused arguments. | 
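The default for smbo_stop_tolerance follows directly from the formula given in the arguments. The sketch below reproduces it in base R. The helper name default_smbo_stop_tolerance is hypothetical, and treating 0.0001 as the value at which the tolerance stops decreasing is an assumption based on the stated limit for 1M or more samples.

```r
# Hypothetical helper: default tolerance computed as 0.1 * 1 / sqrt(n_samples).
# Assumption: the tolerance is not reduced below 0.0001 (reached at 1M samples).
default_smbo_stop_tolerance <- function(n_samples) {
  pmax(0.1 * 1 / sqrt(n_samples), 0.0001)
}

# Tolerance for 100, 10 000, 1M and 100M samples.
default_smbo_stop_tolerance(c(100, 10000, 1e6, 1e8))
```

Larger data sets thus use a stricter tolerance before the optimisation score is considered convergent.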
Value
List of parameters related to model hyperparameter optimisation.
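To illustrate how two of the optimisation_function options aggregate scores over bootstraps, the base R sketch below computes the summary score for validation_minus_sd and validation_25th_percentile. It follows the descriptions in the arguments only. The scores are made-up toy values, and the variable names and R's default quantile definition are assumptions rather than familiar's internal implementation.

```r
# Toy out-of-bag (OOB) optimisation scores of one hyperparameter set,
# one value per bootstrap, on the standardised [-1, 1] scale.
oob_scores <- c(0.62, 0.55, 0.70, 0.48, 0.65)

# validation_minus_sd: average OOB score minus its standard deviation.
summary_minus_sd <- mean(oob_scores) - stats::sd(oob_scores)

# validation_25th_percentile: 25th percentile of the OOB scores.
summary_q25 <- unname(stats::quantile(oob_scores, probs = 0.25))

c(minus_sd = summary_minus_sd, q25 = summary_q25)
```

Both summaries penalise hyperparameter sets whose OOB scores vary strongly between bootstraps, which is why the two options are described as conceptually similar.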
References
-  Hutter, F., Hoos, H. H. & Leyton-Brown, K. Sequential
model-based optimization for general algorithm configuration. in Learning
and Intelligent Optimization (ed. Coello, C. A. C.) 6683, 507–523 (Springer
Berlin Heidelberg, 2011).
 
-  Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & de Freitas, N.
Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc.
IEEE 104, 148–175 (2016).
 
-  Srinivas, N., Krause, A., Kakade, S. M. & Seeger, M. W.
Information-Theoretic Regret Bounds for Gaussian Process Optimization in
the Bandit Setting. IEEE Trans. Inf. Theory 58, 3250–3265 (2012).
 
-  Kaufmann, E., Cappé, O. & Garivier, A. On Bayesian upper confidence
bounds for bandit problems. in Artificial intelligence and statistics
592–600 (2012).
 
-  Jamieson, K. & Talwalkar, A. Non-stochastic Best Arm Identification and
Hyperparameter Optimization. in Proceedings of the 19th International
Conference on Artificial Intelligence and Statistics (eds. Gretton, A. &
Robert, C. C.) vol. 51 240–248 (PMLR, 2016).
 
-  Gramacy, R. B. laGP: Large-Scale Spatial Modeling via Local Approximate
Gaussian Processes in R. Journal of Statistical Software 72, 1–46 (2016).
 
-  Sparapani, R., Spanbauer, C. & McCulloch, R. Nonparametric Machine
Learning and Efficient Computation with Bayesian Additive Regression Trees:
The BART R Package. Journal of Statistical Software 97, 1–66 (2021).
 
Internal function for parsing settings required to parse the input data and
define the experiment
Description
This function parses settings required to parse the data set, e.g. which
columns are identifier columns, which column contains outcome data, and
what the type of outcome is.
Usage
.parse_initial_settings(config = NULL, ...)
Arguments
| config | A list of settings, e.g. from an xml file. | 
| ... | Arguments passed on to .parse_experiment_settings 
batch_id_column (recommended) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.
 In familiar any row of data is organised by four identifiers:
 
 The batch identifier batch_id_column: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
 The sample identifier sample_id_column: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
 The series identifier series_id_column: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
 The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
sample_id_column (recommended) Name of the column containing
sample or subject identifiers. See batch_id_column above for more
details. If unset, every row will be identified as a single sample.
series_id_column (optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See batch_id_column above for more details. If unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers.
development_batch_id (optional) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all, or
all minus the identifiers in validation_batch_id for external validation.
Required if external validation is performed and validation_batch_id is
not provided.
validation_batch_id (optional) One or more batch or cohort
identifiers to constitute data sets for external validation. Defaults to
all data sets except those in development_batch_id for external
validation, or none if not. Required if development_batch_id is not
provided.
outcome_name (optional) Name of the modelled outcome. This name will
be used in figures created by familiar. If not set, the column name in outcome_column will be used for binomial, multinomial, count and continuous outcomes. For other
outcomes (survival and competing_risk) no default is used.
outcome_column (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that survival and competing_risk outcomes require two columns that
indicate the time-to-event or the time of last follow-up and the event
status.
outcome_type (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from
contents of the outcome column. This may lead to unexpected results, and we
therefore advise to provide this information manually.
 Note that competing_risk survival analysis is not fully supported, and
is currently not a valid choice for outcome_type.
class_levels (optional) Class levels for binomial or multinomial outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column.
event_indicator (recommended) Indicator for events in survival and competing_risk analyses. familiar will automatically recognise 1, true, t, y and yes as event indicators, including different
capitalisations. If this parameter is set, it replaces the default values.
censoring_indicator (recommended) Indicator for right-censoring in
survival and competing_risk analyses. familiar will automatically
recognise 0, false, f, n and no as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values.
competing_risk_indicator (recommended) Indicator for competing
risks in competing_risk analyses. There are no default values, and if
unset, all values other than those specified by the event_indicator and censoring_indicator parameters are considered to indicate competing
risks.
signature (optional) One or more names of feature columns that are
considered part of a specific signature. Features specified here will
always be used for modelling. Ranking from feature selection has no effect
for these features.
novelty_features (optional) One or more names of feature columns
that should be included for the purpose of novelty detection.
exclude_features (optional) Feature columns that will be removed
from the data set. Cannot overlap with features in signature, novelty_features or include_features.
include_features (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap with signature. Features in signature and novelty_features are always included. If both exclude_features and include_features are provided, include_features takes precedence, provided that there is no overlap between the two.
reference_method (optional) Method used to set reference levels for
categorical features. There are several options:
 
 auto(default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
 always: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
 never: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
experimental_design (required) Defines what the experiment looks
like, e.g. cv(bt(fs,20)+mb,3,2)+ev for 2 times repeated 3-fold
cross-validation with nested feature selection on 20 bootstraps and
model-building, and external validation. The basic workflow components are: 
 fs: (required) feature selection step.
 mb: (required) model building step.
 ev: (optional) external validation. Note that internal validation due
to subsampling will always be conducted if the subsampling methods create
any validation data sets.
 The different components are linked using +. Different subsampling methods can be used in conjunction with the basic
workflow components:
 
 bs(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. In contrast to bt, feature pre-processing parameters and
hyperparameter optimisation are conducted on individual bootstraps.
 bt(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. Unlike bs and other subsampling methods, no separate
pre-processing parameters or optimised hyperparameters will be determined
for each bootstrap.
 cv(x,n,p): (stratified) n-fold cross-validation, repeated p times.
Pre-processing parameters are determined for each iteration.
 lv(x): leave-one-out-cross-validation. Pre-processing parameters are
determined for each iteration.
 ip(x): imbalance partitioning for addressing class imbalances on the
data set. Pre-processing parameters are determined for each partition. The
number of partitions generated depends on the imbalance correction method
(see the imbalance_correction_method parameter). Imbalance partitioning
does not generate validation sets.
 As shown in the example above, sampling algorithms can be nested.
 The simplest valid experimental design is fs+mb, which corresponds to a
TRIPOD type 1a analysis. Type 1b analyses are only possible using
bootstraps, e.g. bt(fs+mb,100). Type 2a analyses can be conducted using
cross-validation, e.g. cv(bt(fs,100)+mb,10,1). Depending on the origin of
the external validation data, designs such as fs+mb+ev or cv(bt(fs,100)+mb,10,1)+ev constitute type 2b or type 3 analyses. Type 4
analyses can be done by obtaining one or more familiarModel objects from
others and applying them to your own data set.
 Alternatively, the experimental_design parameter may be used to provide a
path to a file containing iterations, which is named ####_iterations.RDS by convention. This path can be relative to the directory of the current
experiment (experiment_dir), or an absolute path. The absolute path may
thus also point to a file from a different experiment.
imbalance_correction_method (optional) Type of method used to
address class imbalances. Available options are:
 
 full_undersampling(default): All data will be used in an ensemble
fashion. The full minority class will appear in each partition, but
majority classes are undersampled until all data have been used.
 random_undersampling: Randomly undersamples majority classes. This is
useful in cases where full undersampling would lead to the formation of
many models due to major overrepresentation of the largest class.
 This parameter is only used in combination with imbalance partitioning in
the experimental design, and ip should therefore appear in the string
that defines the design.
imbalance_n_partitions (optional) Number of times random
undersampling should be repeated. 10 undersampled subsets with balanced
classes are formed by default. | 
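The default handling of event_indicator and censoring_indicator described in the arguments can be sketched in base R as follows. The helper classify_status is hypothetical and is not part of familiar. It only mirrors the documented behaviour: the default values are matched regardless of capitalisation, and user-provided indicators replace the defaults.

```r
# Default indicator values listed in the documentation.
default_event <- c("1", "true", "t", "y", "yes")
default_censoring <- c("0", "false", "f", "n", "no")

# Hypothetical helper: label raw status values as "event" or "censored".
# Values that match neither set of indicators are returned as NA.
classify_status <- function(x,
                            event_indicator = default_event,
                            censoring_indicator = default_censoring) {
  # Compare in lower case so that different capitalisations are recognised.
  x <- tolower(as.character(x))
  status <- rep(NA_character_, length(x))
  status[x %in% tolower(event_indicator)] <- "event"
  status[x %in% tolower(censoring_indicator)] <- "censored"
  status
}

classify_status(c("Yes", "no", "TRUE", "0", "dead"))

# A user-provided indicator replaces the defaults.
classify_status(c("dead", "yes"), event_indicator = "dead")
```

In the first call "dead" is not recognised, which shows why setting the indicators explicitly is recommended when the data use non-default labels.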
Details
Three variants of parameters exist:
-  required: this parameter is required and must be set by the user.
 
-  recommended: not setting this parameter might cause an error to be thrown,
dependent on other input.
 
-  optional: these parameters have default values that may be altered if
required.
 
Value
A list of settings to be used for configuring the experiments.
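As an illustration of what a cv(x,n,p) component implies, the base R sketch below creates the development and validation subsets for 3-fold cross-validation repeated 2 times, as in the cv(bt(fs,20)+mb,3,2)+ev example from the arguments. This is a simplified, unstratified sketch under assumed sample identifiers. It is not how familiar generates its iterations internally.

```r
set.seed(42)

# Assumed sample identifiers; subsets are created at the sample level.
sample_id <- paste0("sample_", 1:12)
n_folds <- 3
n_repetitions <- 2

splits <- list()
for (repetition in seq_len(n_repetitions)) {
  # Randomly assign each sample to one of n_folds folds of equal size.
  fold <- sample(rep(seq_len(n_folds), length.out = length(sample_id)))
  for (k in seq_len(n_folds)) {
    splits[[length(splits) + 1]] <- list(
      repetition = repetition,
      fold = k,
      development = sample_id[fold != k],
      validation = sample_id[fold == k]
    )
  }
}

# Number of development/validation pairs: n_folds * n_repetitions.
length(splits)
```

Each of the resulting pairs would then be passed through the nested part of the design, here bt(fs,20)+mb, using only its development subset.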
Internal function for converting integer features
Description
Internal function for converting integer features
Usage
.parse_integer_features(data, outcome_type)
Arguments
| data | data.table with feature data | 
| outcome_type | character, indicating the type of outcome | 
Details
This function converts columns containing integer feature data to
double. This prevents, e.g., errors when the result of an
operation on the feature data yields a non-integer (i.e. floating point)
result.
Value
data.table with integer features converted to double.
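The conversion described in the details can be sketched in base R. The sketch uses a plain data.frame instead of a data.table to stay self-contained. The helper convert_integer_features and its feature_columns argument are hypothetical and differ from the signature of the internal function.

```r
# Hypothetical stand-in for the documented conversion of integer features.
convert_integer_features <- function(data, feature_columns) {
  for (column in feature_columns) {
    # Only true integer columns are converted; factors are left untouched,
    # because is.integer() returns FALSE for factors.
    if (is.integer(data[[column]])) {
      data[[column]] <- as.double(data[[column]])
    }
  }
  data
}

data <- data.frame(
  outcome = c(0L, 1L, 1L),
  n_lesions = c(1L, 4L, 2L),
  grade = factor(c("low", "high", "low"))
)

converted <- convert_integer_features(data, feature_columns = c("n_lesions", "grade"))
vapply(converted, function(x) class(x)[1], character(1))
```

After conversion, operations such as normalisation can store fractional values in n_lesions without a type mismatch, while the outcome column keeps its original type.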
Internal function for parsing settings related to model development
Description
Internal function for parsing settings related to model development
Usage
.parse_model_development_settings(
  config = NULL,
  data,
  parallel,
  outcome_type,
  learner = waiver(),
  hyperparameter = waiver(),
  novelty_detector = waiver(),
  detector_parameters = waiver(),
  parallel_model_development = waiver(),
  ...
)
Arguments
| config | A list of settings, e.g. from an xml file. | 
| data | Data set as loaded using the .load_data function. | 
| parallel | Logical value that indicates whether familiar uses parallelisation. If
FALSE, it will override parallel_model_development. | 
| outcome_type | Type of outcome found in the data set. | 
| learner | (required) One or more algorithms used for model
development. A sizeable number of learners is supported in familiar. Please
see the vignette on learners for more information concerning the available
learners. | 
| hyperparameter | (optional) List of lists containing hyperparameters
for learners. Each sublist should have the name of the learner method it
corresponds to, with list elements being named after the intended
hyperparameter, e.g. "glm_logistic"=list("sign_size"=3).
 All learners have hyperparameters. Please refer to the vignette on learners
for more details. If no parameters are provided, sequential model-based
optimisation is used to determine optimal hyperparameters.
 Hyperparameters provided by the user are never optimised. However, if more
than one value is provided for a single hyperparameter, optimisation will
be conducted using these values. | 
| novelty_detector | (optional) Specify the algorithm used for training
a novelty detector. This detector can be used to identify
out-of-distribution data prospectively. | 
| detector_parameters | (optional) List of lists containing hyperparameters
for novelty detectors. Currently not used. | 
| parallel_model_development | (optional) Enable parallel processing for
the model development workflow. Defaults to TRUE. When set to FALSE,
this will disable the use of parallel processing while developing models,
regardless of the settings of the parallel parameter.
parallel_model_development is ignored if parallel=FALSE. | 
| ... | Unused arguments. | 
Value
List of parameters related to model development.
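The list-of-lists structure expected for the hyperparameter argument can be sketched as follows. The learner name and the sign_size hyperparameter are taken from the argument description; the candidate values 3, 5 and 10 are made up for illustration.

```r
# Fixed value: sign_size is set to 3 and is therefore not optimised.
hyperparameter_fixed <- list(
  "glm_logistic" = list("sign_size" = 3)
)

# Several candidate values (made-up numbers): according to the description,
# optimisation is then conducted using only these values.
hyperparameter_candidates <- list(
  "glm_logistic" = list("sign_size" = c(3, 5, 10))
)

# Each sublist is named after the learner it belongs to.
names(hyperparameter_fixed)

# Number of values supplied per hyperparameter.
lengths(hyperparameter_candidates[["glm_logistic"]])
```

Keeping one sublist per learner means that settings for several learners can be passed in a single argument without ambiguity.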
Internal function for parsing settings related to preprocessing
Description
Internal function for parsing settings related to preprocessing
Usage
.parse_preprocessing_settings(
  config = NULL,
  data,
  parallel,
  outcome_type,
  feature_max_fraction_missing = waiver(),
  sample_max_fraction_missing = waiver(),
  filter_method = waiver(),
  univariate_test_threshold = waiver(),
  univariate_test_threshold_metric = waiver(),
  univariate_test_max_feature_set_size = waiver(),
  low_var_minimum_variance_threshold = waiver(),
  low_var_max_feature_set_size = waiver(),
  robustness_icc_type = waiver(),
  robustness_threshold_metric = waiver(),
  robustness_threshold_value = waiver(),
  transformation_method = waiver(),
  transformation_optimisation_criterion = waiver(),
  transformation_gof_test_p_value = waiver(),
  normalisation_method = waiver(),
  batch_normalisation_method = waiver(),
  imputation_method = waiver(),
  cluster_method = waiver(),
  cluster_linkage_method = waiver(),
  cluster_cut_method = waiver(),
  cluster_similarity_metric = waiver(),
  cluster_similarity_threshold = waiver(),
  cluster_representation_method = waiver(),
  parallel_preprocessing = waiver(),
  ...
)
Arguments
| config | A list of settings, e.g. from an xml file. | 
| data | Data set as loaded using the .load_data function. | 
| parallel | Logical value that indicates whether familiar uses parallelisation. If
FALSE, it will override parallel_preprocessing. | 
| outcome_type | Type of outcome found in the data set. | 
| feature_max_fraction_missing | (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a feature to be included in the data set. All features with a
missing value fraction over this threshold are not processed further. The
default value is 0.30. | 
| sample_max_fraction_missing | (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a sample to be included in the data set. All samples with a
missing value fraction over this threshold are excluded and not processed
further. The default value is 0.30. | 
| filter_method | (optional) One or more methods used to reduce
dimensionality of the data set by removing irrelevant or poorly
reproducible features.
 Several methods are available:
 
 none (default): None of the features will be filtered.
 low_variance: Features with a variance below the
low_var_minimum_variance_threshold are filtered. This can be useful to
filter, for example, genes that are not differentially expressed.
 univariate_test: Features undergo a univariate regression using an
outcome-appropriate regression model. The p-value of the model coefficient
is collected. Features with coefficient p or q-value above the
univariate_test_threshold are subsequently filtered.
 robustness: Features that are not sufficiently robust according to the
intraclass correlation coefficient are filtered. Use of this method
requires that repeated measurements are present in the data set, i.e. there
should be entries for which the sample and cohort identifiers are the same.
 More than one method can be used simultaneously. Features with singular
values are always filtered, as these do not contain information. | 
| univariate_test_threshold | (optional) Numeric value between 0.0 and
1.0 that determines which features are irrelevant and will be filtered by
the univariate_test. The p or q-values are compared to this threshold.
All features with values above the threshold are filtered. The default
value is 0.20. | 
| univariate_test_threshold_metric | (optional) Metric that is compared
against the univariate_test_threshold. The following metrics can
be chosen: 
 p_value (default): The unadjusted p-value of each feature is used
to filter features.
 q_value: The q-value (Storey, 2002) is used to filter features. Some
data sets may have insufficient samples to compute the q-value. The
qvalue package must be installed from Bioconductor to use this method.
 | 
| univariate_test_max_feature_set_size | (optional) Maximum size of the
feature set after the univariate test. P or q values of features are
compared against the threshold, but if the resulting data set would be
larger than this setting, only the most relevant features up to the desired
feature set size are selected.
 The default value is NULL, which causes features to be filtered based on
their relevance only. | 
| low_var_minimum_variance_threshold | (required, if used) Numeric value
that determines which features will be filtered by the low_variance
method. The variance of each feature is computed and compared to the
threshold. If it is below the threshold, the feature is removed.
 This parameter has no default value and should be set if low_variance is
used. | 
| low_var_max_feature_set_size | (optional) Maximum size of the feature
set after filtering features with a low variance. All features are first
compared against low_var_minimum_variance_threshold. If the resulting
feature set would be larger than specified, only the most strongly varying
features will be selected, up to the desired size of the feature set. The default value is NULL, which causes features to be filtered based on
their variance only. | 
| robustness_icc_type | (optional) String indicating the type of
intraclass correlation coefficient (1, 2 or 3) that should be used to
compute robustness for features in repeated measurements. These types
correspond to the types in Shrout and Fleiss (1979). The default value is 1. | 
| robustness_threshold_metric | (optional) String indicating which
specific intraclass correlation coefficient (ICC) metric should be used to
filter features. This should be one of:
 
 icc: The estimated ICC value itself.
 icc_low (default): The estimated lower limit of the 95% confidence
interval of the ICC, as suggested by Koo and Li (2016).
 icc_panel: The estimated ICC value over the panel average, i.e. the ICC
that would be obtained if all repeated measurements were averaged.
 icc_panel_low: The estimated lower limit of the 95% confidence interval
of the panel ICC.
 | 
| robustness_threshold_value | (optional) The intraclass correlation
coefficient value that is used as threshold. The default value is 0.70. | 
| transformation_method | (optional) The transformation method used to
change the distribution of the data to be more normal-like. The following
methods are available:
 
 none: This disables transformation of features.
 yeo_johnson: Transformation using the location and scale invariant
version of the Yeo-Johnson transformation (Yeo and Johnson, 2000;
Zwanenburg and Löck, 2023).
 yeo_johnson_robust (default): A robust version of yeo_johnson.
This method is less sensitive to outliers.
 yeo_johnson_conventional: As yeo_johnson, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Yeo and Johnson (2000).
 box_cox: Transformation using the location and scale invariant version
of the Box-Cox transformation (Box and Cox, 1964; Zwanenburg and Löck,
2023).
 box_cox_robust: A robust version of box_cox. This method is less
sensitive to outliers.
 box_cox_conventional: As box_cox, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Box and Cox (1964). This method requires
strictly positive feature values.
 Transformation requires the power.transform package. Only features that
contain numerical data are transformed. Transformation parameters obtained
in development data are stored within featureInfo objects for later use
with validation data sets. | 
| transformation_optimisation_criterion | (optional) Transformation
parameters are optimised using a criterion, conventionally
maximum-likelihood-estimation. power.transform implements multiple
optimisation criteria, of which the following are available: 
 mle(default): Optimisation using maximum likelihood estimation.
 cramer_von_mises: Optimisation using the Cramér-von Mises
criterion. Zwanenburg and Löck (2023) found that this criterion was
relatively robust against outliers.
 | 
| transformation_gof_test_p_value | (optional) Not all transformations
will lead to features that are roughly normally distributed. Zwanenburg and
Löck (2023) established an empirical goodness-of-fit test for central
normality. This parameter sets the significance for rejecting the
null-hypothesis that a feature distribution is centrally normal. When the
null-hypothesis is rejected, no transformation is performed. The default
value is NULL, which disables the test. | 
| normalisation_method | (optional) The normalisation method used to
improve the comparability between numerical features that may have very
different scales. The following normalisation methods can be chosen:
 
 none: This disables feature normalisation.
 standardisation: Features are normalised by subtraction of their mean
values and division by their standard deviations. This causes every feature
to have a center value of 0.0 and a standard deviation of 1.0.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust (default): A robust version of standardisation
that relies on computing Huber's M-estimators for location and scale.
 normalisation: Features are normalised by subtraction of their minimum
values and division by their ranges. This maps all feature values to a
[0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features are normalised by subtraction of their median values
and division by their interquartile range.
 mean_centering: Features are centered by subtracting the mean, but do
not undergo rescaling.
 Only features that contain numerical data are normalised. Normalisation
parameters obtained in development data are stored within featureInfo
objects for later use with validation data sets. | 
| batch_normalisation_method | (optional) The method used for batch
normalisation. Available methods are:
 
 none (default): This disables batch normalisation of features.
 standardisation: Features within each batch are normalised by
subtraction of the mean value and division by the standard deviation in
each batch.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust: A robust version of standardisation that
relies on computing Huber's M-estimators for location and scale within each
batch.
 normalisation: Features within each batch are normalised by subtraction
of their minimum values and division by their range in each batch. This
maps all feature values in each batch to a [0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features in each batch are normalised by subtraction of the
median value and division by the interquartile range of each batch.
 mean_centering: Features in each batch are centered on 0.0 by
subtracting the mean value in each batch, but are not rescaled.
 combat_parametric: Batch adjustments using parametric empirical Bayes
(Johnson et al, 2007). combat_p leads to the same method.
 combat_non_parametric: Batch adjustments using non-parametric empirical
Bayes (Johnson et al, 2007). combat_np and combat lead to the same
method. Note that we reduced complexity from O(n^2) to O(n) by
only computing batch adjustment parameters for each feature on a subset of
50 randomly selected features, instead of all features.
 Only features that contain numerical data are normalised using batch
normalisation. Batch normalisation parameters obtained in development data
are stored within featureInfo objects for later use with validation data
sets, in case the validation data is from the same batch.
 If validation data contains data from unknown batches, normalisation
parameters are separately determined for these batches.
 Note that for both empirical Bayes methods, the batch effect is assumed to
produce results across the features. This is often true for things such as
gene expressions, but the assumption may not hold generally.
 When performing batch normalisation, it is moreover important to check that
differences between batches or cohorts are not related to the studied
endpoint. | 
| imputation_method | (optional) Method used for imputing missing
feature values. Two methods are implemented:
 
 simple: Simple replacement of a missing value by the median value (for
numeric features) or the modal value (for categorical features).
 lasso: Imputation of missing value by lasso regression (using glmnet)
based on information contained in other features.
 simple imputation precedes lasso imputation to ensure that any missing
values in predictors required for lasso regression are resolved. The
lasso estimate is then used to replace the missing value.
 The default value depends on the number of features in the dataset. If the
number is lower than 100, lasso is used by default, and simple otherwise.
 Only single imputation is performed. Imputation models and parameters are
stored within featureInfo objects for later use with validation data
sets. | 
| cluster_method | (optional) Clustering is performed to identify and
replace redundant features, for example those that are highly correlated.
Such features do not carry much additional information and may be removed
or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011).
 The cluster method determines the algorithm used to form the clusters. The
following cluster methods are implemented:
 
 none: No clustering is performed.
 hclust (default): Hierarchical agglomerative clustering. If the
fastcluster package is installed, fastcluster::hclust is used (Muellner
2013), otherwise stats::hclust is used.
 agnes: Hierarchical clustering using agglomerative nesting (Kaufman and
Rousseeuw, 1990). This algorithm is similar to hclust, but uses the
cluster::agnes implementation.
 diana: Divisive analysis hierarchical clustering. This method uses
divisive instead of agglomerative clustering (Kaufman and Rousseeuw, 1990).
cluster::diana is used.
 pam: Partitioning around medoids. This partitions the data into $k$
clusters around medoids (Kaufman and Rousseeuw, 1990). $k$ is selected
using the silhouette metric. pam is implemented using the cluster::pam
function.
 Clusters and cluster information are stored within featureInfo objects for
later use with validation data sets. This enables reproduction of the same
clusters as formed in the development data set. | 
| cluster_linkage_method | (optional) Linkage method used for
agglomerative clustering in hclust and agnes. The following linkage
methods can be used: 
 average (default): Average linkage.
 single: Single linkage.
 complete: Complete linkage.
 weighted: Weighted linkage, also known as McQuitty linkage.
 ward: Linkage using Ward's minimum variance method.
 diana and pam do not require a linkage method.
 | 
| cluster_cut_method | (optional) The method used to define the actual
clusters. The following methods can be used:
 
 silhouette: Clusters are formed based on the silhouette score
(Rousseeuw, 1987). The average silhouette score is computed from 2 to
n clusters, with n the number of features. Clusters are only
formed if the average silhouette exceeds 0.50, which indicates reasonable
evidence for structure. This procedure may be slow if the number of
features is large (>100s).
 fixed_cut: Clusters are formed by cutting the hierarchical tree at the
point indicated by the cluster_similarity_threshold, e.g. where features
in a cluster have an average Spearman correlation of 0.90. fixed_cut is
only available for agnes, diana and hclust.
 dynamic_cut: Dynamic cluster formation using the cutting algorithm in
the dynamicTreeCut package. This package should be installed to select
this option. dynamic_cut can only be used with agnes and hclust.
 The default options are silhouette for partitioning around medoids (pam)
and fixed_cut otherwise. | 
| cluster_similarity_metric | (optional) Clusters are formed based on
feature similarity. All features are compared in a pair-wise fashion to
compute similarity, for example correlation. The resulting similarity grid
is converted into a distance matrix that is subsequently used for
clustering. The following metrics are supported to compute pairwise
similarities:
 
 mutual_information (default): normalised mutual information.
 mcfadden_r2: McFadden's pseudo R-squared (McFadden, 1974).
 cox_snell_r2: Cox and Snell's pseudo R-squared (Cox and Snell, 1989).
 nagelkerke_r2: Nagelkerke's pseudo R-squared (Nagelkerke, 1991).
 spearman: Spearman's rank order correlation.
 kendall: Kendall rank correlation.
 pearson: Pearson product-moment correlation.
 The pseudo R-squared metrics can be used to assess similarity between mixed
pairs of numeric and categorical features, as these are based on the
log-likelihood of regression models. In familiar, the more informative
feature is used as the predictor and the other feature as the response
variable. In numeric-categorical pairs, the numeric feature is considered
to be more informative and is thus used as the predictor. In
categorical-categorical pairs, the feature with most levels is used as the
predictor. In case any of the classical correlation coefficients (pearson,
spearman and kendall) are used with (mixed) categorical features, the
categorical features are one-hot encoded and the mean correlation over all
resulting pairs is used as similarity. | 
| cluster_similarity_threshold | (optional) The threshold level for
pair-wise similarity that is required to form clusters using fixed_cut.
This should be a numerical value between 0.0 and 1.0. Note however, that a
reasonable threshold value depends strongly on the similarity metric. The
following are the default values used: 
 mcfadden_r2 and mutual_information: 0.30
 cox_snell_r2 and nagelkerke_r2: 0.75
 spearman, kendall and pearson: 0.90
 Alternatively, if the fixed_cut method is not used, this value determines
whether any clustering should be performed, because the data may not
contain highly similar features. The default values in this situation are: 
 mcfadden_r2 and mutual_information: 0.25
 cox_snell_r2 and nagelkerke_r2: 0.40
 spearman, kendall and pearson: 0.70
 The threshold value is converted to a distance (1-similarity) prior to
cutting hierarchical trees. | 
| cluster_representation_method | (optional) Method used to determine
how the information of co-clustered features is summarised and used to
represent the cluster. The following methods can be selected:
 
 best_predictor (default): The feature with the highest importance
according to univariate regression with the outcome is used to represent
the cluster.
 medioid: The feature closest to the cluster center, i.e. the feature
that is most similar to the remaining features in the cluster, is used to
represent the cluster.
 mean: A meta-feature is generated by averaging the feature values for
all features in a cluster. This method aligns all features so that all
features will be positively correlated prior to averaging. Should a cluster
contain one or more categorical features, the medioid method will be used
instead, as averaging is not possible. Note that if this method is chosen,
the normalisation_method parameter should be one of standardisation,
standardisation_trim, standardisation_winsor or quantile.
 If the pam cluster method is selected, only the medioid method can be
used. In that case 1 medoid is used by default. | 
| parallel_preprocessing | (optional) Enable parallel processing for the
preprocessing workflow. Defaults to TRUE. When set to FALSE, this will
disable the use of parallel processing while preprocessing, regardless of
the settings of the parallel parameter. parallel_preprocessing is
ignored if parallel=FALSE. | 
| ... | Unused arguments. | 
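Four of the normalisation methods listed for normalisation_method (standardisation, normalisation, quantile and mean_centering) can be written out in a few lines of base R. The function names and the example values below are hypothetical, and the sketch ignores missing values, trimming, winsorising and the robust M-estimator variants. It illustrates the arithmetic only and is not the package's internal code.

```r
# Illustrative base R versions of four normalisation methods.
# Function names are hypothetical; missing values are not handled.

# standardisation: subtract the mean, divide by the standard deviation.
standardise <- function(x) (x - mean(x)) / stats::sd(x)

# normalisation: subtract the minimum, divide by the range.
normalise <- function(x) (x - min(x)) / (max(x) - min(x))

# quantile: subtract the median, divide by the interquartile range.
quantile_scale <- function(x) (x - stats::median(x)) / stats::IQR(x)

# mean_centering: subtract the mean, no rescaling.
mean_center <- function(x) x - mean(x)

# Made-up feature values with a single large outlier.
x <- c(1, 2, 3, 4, 100)

normalised_values <- list(
  standardisation = standardise(x),
  normalisation = normalise(x),
  quantile = quantile_scale(x),
  mean_centering = mean_center(x)
)

# Compare how strongly the outlier compresses the remaining values.
lapply(normalised_values, round, digits = 2)
```

In this example the outlier dominates the mean, standard deviation and range, which is the situation the trimmed, winsorised and robust variants are meant to address.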
Value
List of parameters related to preprocessing.
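The fixed_cut procedure described under cluster_cut_method and cluster_similarity_threshold can be approximated with stats::hclust and stats::cutree. The sketch below uses simulated features and assumes that the absolute Spearman correlation is taken as similarity; that choice, the variable names and the data are assumptions made for illustration and may differ from what familiar does internally.

```r
set.seed(1)

# Simulated data: feature_b is almost a copy of feature_a, while
# feature_c is unrelated to both.
n <- 200
base_signal <- rnorm(n)
features <- data.frame(
  feature_a = base_signal,
  feature_b = base_signal + rnorm(n, sd = 0.05),
  feature_c = rnorm(n)
)

# Pairwise similarity (assumption: absolute Spearman correlation).
similarity <- abs(stats::cor(features, method = "spearman"))

# Convert similarity to distance (1 - similarity), as stated in the
# description of cluster_similarity_threshold.
distance <- stats::as.dist(1 - similarity)

# Agglomerative clustering with average linkage (the documented default).
tree <- stats::hclust(distance, method = "average")

# Cut the tree at the distance that corresponds to the similarity threshold.
similarity_threshold <- 0.90
clusters <- stats::cutree(tree, h = 1 - similarity_threshold)

clusters
```

Features that share a cluster label would subsequently be replaced by a single representative, as described under cluster_representation_method.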
References
-  Storey, J. D. A direct approach to false discovery rates. J.
R. Stat. Soc. Series B Stat. Methodol. 64, 479–498 (2002).
 
-  Shrout, P. E. & Fleiss, J. L. Intraclass correlations: uses in assessing
rater reliability. Psychol. Bull. 86, 420–428 (1979).
 
-  Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass
correlation coefficients for reliability research. J. Chiropr. Med. 15,
155–163 (2016).
 
-  Yeo, I. & Johnson, R. A. A new family of power transformations to
improve normality or symmetry. Biometrika 87, 954–959 (2000).
 
-  Box, G. E. P. & Cox, D. R. An analysis of transformations. J. R. Stat.
Soc. Series B Stat. Methodol. 26, 211–252 (1964).
 
-  Raymaekers, J. & Rousseeuw, P. J. Transforming variables to central
normality. Mach. Learn. (2021).
 
-  Park, M. Y., Hastie, T. & Tibshirani, R. Averaged gene expressions for
regression. Biostatistics 8, 212–227 (2007).
 
-  Tolosi, L. & Lengauer, T. Classification with correlated features:
unreliability of feature ranking and solutions. Bioinformatics 27,
1986–1994 (2011).
 
-  Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in
microarray expression data using empirical Bayes methods. Biostatistics 8,
118–127 (2007).
 
-  Kaufman, L. & Rousseeuw, P. J. Finding groups in data: an introduction
to cluster analysis. (John Wiley & Sons, 2009).
 
-  Muellner, D. fastcluster: fast hierarchical, agglomerative clustering
routines for R and Python. J. Stat. Softw. 53, 1–18 (2013).
 
-  Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and
validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
 
-  Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a
hierarchical cluster tree: the Dynamic Tree Cut package for R.
Bioinformatics 24, 719–720 (2008).
 
-  McFadden, D. Conditional logit analysis of qualitative choice behavior.
in Frontiers in Econometrics (ed. Zarembka, P.) 105–142 (Academic Press,
1974).
 
-  Cox, D. R. & Snell, E. J. Analysis of binary data. (Chapman and Hall,
1989).
 
-  Nagelkerke, N. J. D. A note on a general definition of the coefficient
of determination. Biometrika 78, 691–692 (1991).
 
Internal function for parsing settings related to the computational setup
Description
Internal function for parsing settings related to the computational setup
Usage
.parse_setup_settings(
  config = NULL,
  parallel = waiver(),
  parallel_nr_cores = waiver(),
  restart_cluster = waiver(),
  cluster_type = waiver(),
  backend_type = waiver(),
  server_port = waiver(),
  ...
)
Arguments
| config | A list of settings, e.g. from an xml file. | 
| parallel | (optional) Enable parallel processing. Defaults to TRUE.
When set to FALSE, this disables all parallel processing, regardless of
specific parameters such as parallel_preprocessing. However, when
parallel is TRUE, parallel processing of different parts of the
workflow can be disabled by setting respective flags to FALSE. | 
| parallel_nr_cores | (optional) Number of cores available for
parallelisation. Defaults to 2. This setting does nothing if
parallelisation is disabled. | 
| restart_cluster | (optional) Restart nodes used for parallel computing
to free up memory prior to starting a parallel process. Note that it does
take time to set up the clusters. Therefore setting this argument to TRUE
may impact processing speed. This argument is ignored if parallel is
FALSE or the cluster was initialised outside of familiar. Default is
FALSE, which causes the clusters to be initialised only once. | 
| cluster_type | (optional) Selection of the cluster type for parallel
processing. Available types are the ones supported by the parallel package
that is part of the base R distribution: psock (default), fork, mpi,
nws, sock. In addition, none is available, which also disables
parallel processing. | 
| backend_type | (optional) Selection of the backend for distributing
copies of the data. This backend ensures that only a single master copy is
kept in memory. This limits memory usage during parallel processing.
 Several backend options are available, notably socket_server, and none
(default). socket_server is based on the callr package and R sockets,
comes with familiar and is available for any OS. none uses the package
environment of familiar to store data, and is available for any OS.
However, none requires copying of data to any parallel process, and has a
larger memory footprint. | 
| server_port | (optional) Integer indicating the port on which the
socket server or RServe process should communicate. Defaults to port 6311.
Note that ports 0 to 1024 and 49152 to 65535 cannot be used. | 
| ... | Unused arguments. | 
Value
List of parameters related to the computational setup.
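Two rules from the argument descriptions lend themselves to a short sketch: a workflow-specific flag such as parallel_preprocessing only has an effect when parallel is TRUE, and the server port must lie outside the ranges 0 to 1024 and 49152 to 65535. The helper names below are hypothetical and are not functions provided by familiar.

```r
# Hypothetical helpers that mirror the documented rules; they are not part
# of familiar.

# A workflow step runs in parallel only if both the global parallel flag
# and the step-specific flag are TRUE.
resolve_parallel_flag <- function(parallel, step_flag = TRUE) {
  isTRUE(parallel) && isTRUE(step_flag)
}

# Ports 0 to 1024 and 49152 to 65535 are documented as unusable, which
# leaves whole numbers from 1025 to 49151.
is_allowed_server_port <- function(port) {
  is.numeric(port) &&
    length(port) == 1L &&
    !is.na(port) &&
    port == round(port) &&
    port > 1024 &&
    port < 49152
}

# parallel = FALSE overrides the step-specific flag.
resolve_parallel_flag(parallel = FALSE, step_flag = TRUE)

# The documented default port falls within the allowed range.
is_allowed_server_port(6311)
```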
Internal plotting function for permutation variable importance plots
Description
Internal plotting function for permutation variable importance plots
Usage
.plot_permutation_variable_importance(
  x,
  color_by,
  facet_by,
  facet_wrap_cols,
  ggtheme,
  discrete_palette,
  x_label,
  y_label,
  legend_label,
  plot_title,
  plot_sub_title,
  caption,
  conf_int_style,
  conf_int_alpha,
  x_range,
  x_breaks
)
Arguments
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette used to fill the bars in case a
non-singular variable was provided to the color_by argument. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| conf_int_style | (optional) Confidence interval style. See details for
allowed styles. | 
| conf_int_alpha | (optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed. | 
| x_range | (optional) Value range for the x-axis. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
Value
ggplot plot object.
Internal plotting function for univariate plots
Description
Internal plotting function for univariate plots
Usage
.plot_univariate_importance(
  x,
  color_by,
  facet_by,
  facet_wrap_cols,
  ggtheme,
  show_cluster,
  discrete_palette,
  gradient_palette,
  x_label,
  y_label,
  legend_label,
  plot_title,
  plot_sub_title,
  caption,
  x_range,
  x_breaks,
  significance_level_shown
)
Arguments
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| show_cluster | (optional) Show which features were clustered together. | 
| discrete_palette | (optional) Palette used to fill the bars in case a
non-singular variable was provided to the color_by argument. | 
| gradient_palette | (optional) Palette to use for filling the bars in
case the color_by argument is not set. The bars are then coloured
according to their importance. By default, no gradient is used, and the
bars are not filled according to importance. Use NULL to fill the bars
using the default palette in familiar. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_range | (optional) Value range for the x-axis. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| significance_level_shown | Position(s) to draw vertical lines indicating
a significance level, e.g. 0.05. Can be NULL to not draw anything. | 
Value
ggplot plot object.
Prepare familiarData objects for evaluation at runtime.
Description
Information concerning models, features and the experiment is
processed and stored in familiarData objects. Information can be extracted
from these objects as csv files, or by plotting, or multiple objects can be
combined into familiarCollection objects, which allows aggregated exports.
Usage
.prepare_familiar_data_sets(
  cl = NULL,
  only_pooling = FALSE,
  message_indent = 0L,
  verbose = FALSE
)
Arguments
| cl | Cluster for parallel processing. | 
| only_pooling | Flag that, if set, forces evaluation of only the
top-level data, and not e.g. ensembles. | 
| message_indent | Indent that messages should have. | 
| verbose | Sets verbosity. | 
Details
This function generates the names of familiarData object files, and
their corresponding generating ensemble, which allows the familiarData
objects to be created.
Value
A data.table with created links to created data objects.
Internal function to check batch assignment to development and validation
Description
This function checks which batches in the data set are assigned to model
development and external validation. Several errors may be raised if there
are inconsistencies such as an overlapping assignment, name mismatches etc.
Usage
.update_experimental_design_settings(section_table, data, settings)
Arguments
| section_table | data.table generated by the extract_experimental_setup
function. Contains information regarding the experiment. | 
| data | Data set as loaded using the .load_data function. | 
| settings | List of parameter settings for data set parsing and setting
up the experiment. | 
Value
A verified and updated list of parameter settings.
Internal check and update of settings related to data set parsing
Description
This function updates and checks parameters related to data set parsing based
on the available data set.
Usage
.update_initial_settings(
  formula = NULL,
  data,
  settings,
  check_stringency = "strict"
)
Arguments
| formula | User-provided formula, may be absent (NULL). | 
| data | Data set as loaded using the .load_data function. | 
| settings | List of parameter settings for data set parsing. | 
| check_stringency | Specifies stringency of various checks. The available
options are:
 
 strict: default value used for summon_familiar. Thoroughly checks
input data. Used internally for checking development data.
 external_warn: value used for extract_data and related methods. Less
stringent checks, but will warn for possible issues. Used internally for
checking data for evaluation and explanation.
 external: value used for external methods such as predict. Less
stringent checks, particularly for identifier and outcome columns, which may
be completely absent. Used internally for predict.
 | 
Value
A verified and updated list of parameter settings.
Aggregate variable importance from multiple variable importance
objects.
Description
This method aggregates variable importance from one or more
vimpTable objects.
Usage
aggregate_vimp_table(x, aggregation_method, rank_threshold = NULL, ...)
## S4 method for signature 'list'
aggregate_vimp_table(x, aggregation_method, rank_threshold = NULL, ...)
## S4 method for signature 'character'
aggregate_vimp_table(x, aggregation_method, rank_threshold = NULL, ...)
## S4 method for signature 'vimpTable'
aggregate_vimp_table(x, aggregation_method, rank_threshold = NULL, ...)
## S4 method for signature 'NULL'
aggregate_vimp_table(x, aggregation_method, rank_threshold = NULL, ...)
## S4 method for signature 'experimentData'
aggregate_vimp_table(x, aggregation_method, rank_threshold = NULL, ...)
Arguments
| x | Variable importance (vimpTable) object, a list thereof, or one or
more paths to these objects. | 
| aggregation_method | Method used to aggregate variable importance. The
available methods are described in the feature selection methods vignette. | 
| rank_threshold | Rank threshold used within several aggregation methods.
See the feature selection methods vignette for more details. | 
| ... | unused parameters. | 
Value
A vimpTable object with aggregated variable importance data.
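The idea of aggregating variable importance over several runs can be sketched in base R. The sketch below does not call familiar and does not reproduce any of its aggregation methods. The toy ranks, the mean-rank rule and the way the rank threshold is applied are all assumptions made for illustration; the actual methods are described in the feature selection methods vignette.

```r
# Hypothetical ranks of four features in three feature selection runs
# (rank 1 = most important). Names and values are illustrative only.
ranks <- data.frame(
  feature = rep(c("age", "stage", "volume", "sex"), times = 3),
  run = rep(1:3, each = 4),
  rank = c(1, 2, 3, 4,
           2, 1, 3, 4,
           1, 3, 2, 4),
  stringsAsFactors = FALSE
)

# Collect the ranks of each feature across runs.
rank_by_feature <- split(ranks$rank, ranks$feature)

# Assumed aggregation rule 1: the mean rank of each feature.
mean_rank <- vapply(rank_by_feature, mean, numeric(1))

# Assumed aggregation rule 2: the fraction of runs in which a feature is
# ranked at or above a rank threshold.
rank_threshold <- 2
top_fraction <- vapply(
  rank_by_feature,
  function(r) mean(r <= rank_threshold),
  numeric(1)
)

# Features ordered from most to least important by mean rank.
aggregated_order <- names(sort(mean_rank))
print(aggregated_order)
print(top_fraction)
```

The second rule illustrates why a rank threshold can matter for aggregation: it turns each run into a yes/no statement about whether a feature was highly ranked.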
Creates a valid data object from input data.
Description
Creates a dataObject object from input data. Input data can be
a data.frame or data.table, a path to such tables on a local or network
drive, or a path to tabular data that may be converted to these formats.
In addition, a familiarEnsemble or familiarModel object can be passed
along to check whether the data are formatted correctly, e.g. by checking
the levels of categorical features, whether all expected columns are
present, etc.
Usage
as_data_object(data, ...)
## S4 method for signature 'dataObject'
as_data_object(data, object = NULL, ...)
## S4 method for signature 'data.table'
as_data_object(
  data,
  object = NULL,
  sample_id_column = waiver(),
  batch_id_column = waiver(),
  series_id_column = waiver(),
  development_batch_id = waiver(),
  validation_batch_id = waiver(),
  outcome_name = waiver(),
  outcome_column = waiver(),
  outcome_type = waiver(),
  event_indicator = waiver(),
  censoring_indicator = waiver(),
  competing_risk_indicator = waiver(),
  class_levels = waiver(),
  exclude_features = waiver(),
  include_features = waiver(),
  reference_method = waiver(),
  check_stringency = "strict",
  ...
)
## S4 method for signature 'ANY'
as_data_object(
  data,
  object = NULL,
  sample_id_column = waiver(),
  batch_id_column = waiver(),
  series_id_column = waiver(),
  ...
)
Arguments
| data | A data.frame or data.table, a path to such tables on a local
or network drive, or a path to tabular data that may be converted to these
formats. | 
| ... | Unused arguments. | 
| object | A familiarEnsemble or familiarModel object that is used to
check consistency of these objects. | 
| sample_id_column | (recommended) Name of the column containing
sample or subject identifiers. See batch_id_column for more
details. If unset, every row will be identified as a single sample. | 
| batch_id_column | (recommended) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.
 In familiar any row of data is organised by four identifiers:
 
 The batch identifier batch_id_column: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
 The sample identifier sample_id_column: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
 The series identifier series_id_column: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
 The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
 | 
| series_id_column | (optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See batch_id_column above for more details.
 If unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers. | 
| development_batch_id | (optional) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all, or
all minus the identifiers in validation_batch_id for external validation.
Required if external validation is performed and validation_batch_id is
not provided. | 
| validation_batch_id | (optional) One or more batch or cohort
identifiers to constitute data sets for external validation. Defaults to
all data sets except those in development_batch_id for external
validation, or none if not. Required if development_batch_id is not
provided. | 
| outcome_name | (optional) Name of the modelled outcome. This name will
be used in figures created by familiar.
 If not set, the column name in outcome_column will be used for binomial,
multinomial, count and continuous outcomes. For other outcomes (survival
and competing_risk) no default is used. | 
| outcome_column | (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that survival
and competing_risk outcomes require two columns that indicate the
time-to-event or the time of last follow-up and the event status. | 
| outcome_type | (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from
contents of the outcome column. This may lead to unexpected results, and we
therefore advise to provide this information manually.
 Note that competing_risk survival analysis is not fully supported, and
competing_risk is currently not a valid choice for outcome_type. | 
| event_indicator | (recommended) Indicator for events in survival and
competing_risk analyses. familiar will automatically recognise 1, true, t,
y and yes as event indicators, including different capitalisations. If this
parameter is set, it replaces the default values. | 
| censoring_indicator | (recommended) Indicator for right-censoring in
survival and competing_risk analyses. familiar will automatically
recognise 0, false, f, n and no as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values. | 
| competing_risk_indicator | (recommended) Indicator for competing
risks in competing_risk analyses. There are no default values, and if
unset, all values other than those specified by the event_indicator and
censoring_indicator parameters are considered to indicate competing
risks. | 
| class_levels | (optional) Class levels for binomial or multinomial
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column. | 
| exclude_features | (optional) Feature columns that will be removed
from the data set. Cannot overlap with features in signature,
novelty_features or include_features. | 
| include_features | (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap with signature. Features in
signature and novelty_features are always included. If both
exclude_features and include_features are provided, include_features
takes precedence, provided that there is no overlap between the two. | 
| reference_method | (optional) Method used to set reference levels for
categorical features. There are several options:
 
 auto (default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
 always: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
 never: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
 | 
| check_stringency | Specifies stringency of various checks. The available
options are:
 
 strict: default value used for summon_familiar. Thoroughly checks
input data. Used internally for checking development data.
 external_warn: value used for extract_data and related methods. Less
stringent checks, but will warn for possible issues. Used internally for
checking data for evaluation and explanation.
 external: value used for external methods such as predict. Less
stringent checks, particularly for identifier and outcome columns, which may
be completely absent. Used internally for predict.
 | 
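The default matching of event and censoring indicators described for event_indicator and censoring_indicator can be sketched in base R. This is not familiar's implementation: classify_status is a hypothetical helper written for this illustration, and only the default values and the case-insensitive matching are taken from the argument descriptions.

```r
# Default indicator values as listed in the argument descriptions.
default_event <- c("1", "true", "t", "y", "yes")
default_censoring <- c("0", "false", "f", "n", "no")

# Hypothetical helper (not part of familiar): map raw status values to
# "event", "censored" or NA, ignoring capitalisation.
classify_status <- function(status,
                            event_indicator = default_event,
                            censoring_indicator = default_censoring) {
  value <- tolower(trimws(as.character(status)))
  result <- rep(NA_character_, length(value))
  result[value %in% tolower(event_indicator)] <- "event"
  result[value %in% tolower(censoring_indicator)] <- "censored"
  result
}

# With default indicators; "dead" is not recognised and stays NA.
status_default <- classify_status(c("Yes", "no", "TRUE", "0", "T", "dead"))

# User-provided indicators replace the defaults.
status_custom <- classify_status(
  c("dead", "alive", "yes"),
  event_indicator = "dead",
  censoring_indicator = "alive"
)

print(status_default)
print(status_custom)
```

In the second call "yes" is no longer recognised, which mirrors the statement that a user-set indicator replaces the default values rather than extending them.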
Details
You can specify settings for your data manually, e.g. the column for
sample identifiers (sample_id_column). This prevents you from having to
change the column name externally. In the case you provide a familiarModel
or familiarEnsemble for the object argument, any parameters you provide
take precedence over parameters specified by the object.
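The difference between the reference_method options for automatically detected categorical features can be sketched in base R. The sketch does not call familiar; it only shows the two rules named in the argument description (most frequent level versus first level after sorting) on a made-up feature.

```r
# Hypothetical categorical feature, as it might be read from a text column.
x <- c("b", "a", "b", "c", "b", "a")

# Most frequent level as reference, as described for "auto" when the
# feature was not explicitly set as a factor by the user.
most_frequent <- names(which.max(table(x)))
auto_factor <- relevel(factor(x), ref = most_frequent)

# Sorted levels with the first level as reference, as described for "never".
never_factor <- factor(x, levels = sort(unique(x)))

print(levels(auto_factor))
print(levels(never_factor))
```

The first level of a factor is its reference level in most R modelling functions, so the two rules can lead to differently coded model coefficients for the same data.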
Value
A dataObject object.
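The four identifiers described under batch_id_column can be illustrated with a small table in base R. The column names and values are hypothetical, and the way the repetition identifier is derived here is only a sketch of the description (rows that share batch, sample, series and outcome), not familiar's own code.

```r
# Hypothetical long-format data: patient p1 is measured at two time points,
# and the second time point is measured twice.
measurements <- data.frame(
  cohort = c("A", "A", "A", "A"),          # batch identifier
  patient = c("p1", "p1", "p1", "p2"),     # sample identifier
  time_point = c(1, 2, 2, 1),              # series identifier
  outcome = c(0.5, 0.8, 0.8, 0.3),
  feature = c(10.1, 11.4, 11.9, 9.7),
  stringsAsFactors = FALSE
)

# Rows that share batch, sample, series and outcome are treated as repeated
# measurements and numbered within their group.
group <- interaction(
  measurements$cohort,
  measurements$patient,
  measurements$time_point,
  measurements$outcome,
  drop = TRUE
)
measurements$repetition_id <- ave(
  seq_len(nrow(measurements)),
  group,
  FUN = seq_along
)

print(measurements)
```

Here the two rows for patient p1 at time point 2 receive repetition identifiers 1 and 2, while all other rows are single measurements.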
Conversion to familiarCollection object.
Description
Creates a familiarCollection object from familiarData,
familiarEnsemble or familiarModel objects.
Usage
as_familiar_collection(
  object,
  familiar_data_names = NULL,
  collection_name = NULL,
  ...
)
## S4 method for signature 'familiarCollection'
as_familiar_collection(
  object,
  familiar_data_names = NULL,
  collection_name = NULL,
  ...
)
## S4 method for signature 'familiarData'
as_familiar_collection(
  object,
  familiar_data_names = NULL,
  collection_name = NULL,
  ...
)
## S4 method for signature 'familiarEnsemble'
as_familiar_collection(
  object,
  familiar_data_names = NULL,
  collection_name = NULL,
  ...
)
## S4 method for signature 'familiarModel'
as_familiar_collection(
  object,
  familiar_data_names = NULL,
  collection_name = NULL,
  ...
)
## S4 method for signature 'list'
as_familiar_collection(
  object,
  familiar_data_names = NULL,
  collection_name = NULL,
  ...
)
## S4 method for signature 'character'
as_familiar_collection(
  object,
  familiar_data_names = NULL,
  collection_name = NULL,
  ...
)
## S4 method for signature 'ANY'
as_familiar_collection(
  object,
  familiar_data_names = NULL,
  collection_name = NULL,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData
objects, that will be internally converted to a familiarCollection object.
It is also possible to provide a familiarEnsemble or one or more
familiarModel objects together with the data from which data is computed
prior to export. Paths to such files can also be provided.
 | 
| familiar_data_names | Names of the dataset(s). Only used if the object
parameter is one or more familiarData objects. | 
| collection_name | Name of the collection. | 
| ... | Arguments passed on to extract_data:
 
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 
 time_max: Time point which is used as the benchmark for e.g. cumulative
risks generated by random forest, or the cut-off value for Uno's concordance
index. If not provided explicitly, this parameter is read from settings used
at creation of the underlying familiarModel objects. Only used for
survival outcomes.
 
 evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
 
 aggregation_method: Method for aggregating variable importances for the
purpose of evaluation. Variable importances are determined during feature
selection steps and after training the model. Both types are evaluated, but
feature selection variable importance is only evaluated at run-time.
 See the documentation for the vimp_aggregation_method argument in
summon_familiar for information concerning the different available
methods.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 rank_threshold: The threshold used to define the subset of highly
important features during evaluation.
 See the documentation for the vimp_aggregation_rank_threshold argument in
summon_familiar for more information.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 
 metric: One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects.
 
 feature_cluster_method: The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none,
hclust, agnes, diana and pam.
 none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 feature_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the
cluster_linkage_method configuration parameter: average, single,
complete, weighted, and ward.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 feature_cluster_cut_method: The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and
dynamic_cut.
 silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and
diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 feature_similarity_threshold: The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut
method.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 feature_similarity_metric: Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2,
spearman, kendall and pearson.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 sample_cluster_method: The method used to perform clustering based on
distance between samples. These are the same methods as for the
cluster_method configuration parameter: hclust, agnes, diana and pam.
 none cannot be used when extracting data for feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 sample_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the
cluster_linkage_method configuration parameter: average, single,
complete, weighted, and ward.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 sample_similarity_metric: Metric to determine pairwise similarity
between samples. Similarity is computed in the same manner as for
clustering, but sample_similarity_metric has different options that are
better suited to computing distance between samples instead of between
features: gower, euclidean.
 The underlying feature data is scaled to the [0, 1] range (for
numerical features) using the feature values across the samples. The
normalisation parameters required can optionally be computed from feature
data with the outer 5% (on both sides) of feature values trimmed or
winsorised. To do so append _trim (trimming) or _winsor (winsorising) to
the metric name. This reduces the effect of outliers somewhat.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 icc_type: String indicating the type of intraclass correlation
coefficient (1, 2 or 3) that should be used to compute robustness for
features in repeated measurements during the evaluation of univariate
importance. These types correspond to the types in Shrout and Fleiss (1979).
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 
 data_element: String indicating which data elements are to be extracted.
Default is all, but specific elements can be specified to speed up
computations if not all elements are to be computed. This is an internal
parameter that is set by, e.g. the export_model_vimp method.
 
 sample_limit: (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000).
 This parameter can be set for the following data elements:
sample_similarity and ice_data.
 
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
   ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
   hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
   model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix.
 
 estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
   point: Point estimates.
   bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
   bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g.
list("auc_data"="bci", "model_performance"="point"). This parameter can be
set for the following data elements: auc_data, decision_curve_analyis,
model_performance, permutation_vimp, ice_data, and prediction_data.
 
 aggregate_results: (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is
bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if
estimation_type is point.
 The default value is TRUE except when assessing metrics to assess
model performance, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default
aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g.
list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
 
 confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to determine
the confidence intervals, familiar uses the rule of thumb
n = 20 / ci.level to determine the number of required bootstraps.
 The default value is 0.95.
 
 bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
 
 stratification_method: (optional) Method for determining the
stratification threshold for creating survival groups. The actual,
model-dependent, threshold value is obtained from the development data, and
can afterwards be used to perform stratification on validation data.
 The following stratification methods are available:
 
   median (default): The median predicted value in the development cohort
is used to stratify the samples into two risk groups. For predicted outcome
values that build a continuous spectrum, the two risk groups in the
development cohort will be roughly equal in size.
   mean: The mean predicted value in the development cohort is used to
stratify the samples into two risk groups.
   mean_trim: As mean, but based on the set of predicted values
where the 5% lowest and 5% highest values are discarded. This reduces the
effect of outliers.
   mean_winsor: As mean, but based on the set of predicted values where
the 5% lowest and 5% highest values are winsorised. This reduces the effect
of outliers.
   fixed: Samples are stratified based on the sample quantiles of the
predicted values. These quantiles are defined using the
stratification_threshold parameter.
   optimised: Use maximally selected rank statistics to determine the
optimal threshold (Lausen and Schumacher, 1992; Hothorn et al., 2003) to
stratify samples into two optimally separated risk groups.
 One or more stratification methods can be selected simultaneously.
 This parameter is only relevant for survival outcomes.
 
 dynamic_model_loading: (optional) Enables dynamic loading of models
during the evaluation process, if TRUE. Defaults to FALSE. Dynamic
loading of models may reduce the overall memory footprint, at the cost of
increased disk or network IO. Models can only be dynamically loaded if they
are found at an accessible disk or network location. Setting this parameter
to TRUE may help if parallel processing causes out-of-memory issues during
evaluation. | 
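The threshold-based stratification methods listed under stratification_method can be sketched in base R. The predicted values are made up, the 5% trimming and winsorising are implemented with sample quantiles as an assumption, the quantiles used for fixed are example values for stratification_threshold, and the choice to assign values above the threshold to the high-risk group is also an assumption. The optimised method is not covered.

```r
# Hypothetical predicted risk scores in the development cohort; the last
# value is an outlier.
predicted <- c(0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.45, 0.60, 0.80, 5.00)

# Values outside the 5% and 95% sample quantiles are treated as outliers.
limits <- quantile(predicted, probs = c(0.05, 0.95), names = FALSE)
trimmed <- predicted[predicted >= limits[1] & predicted <= limits[2]]
winsorised <- pmin(pmax(predicted, limits[1]), limits[2])

thresholds <- list(
  median = median(predicted),
  mean = mean(predicted),
  mean_trim = mean(trimmed),
  mean_winsor = mean(winsorised),
  # fixed: sample quantiles; 1/3 and 2/3 are assumed example values.
  fixed = quantile(predicted, probs = c(1 / 3, 2 / 3), names = FALSE)
)

# Apply the median threshold (assumption: values above it are high risk).
# The same threshold would then be reused on validation data.
risk_group <- ifelse(predicted > thresholds$median, "high", "low")

print(thresholds)
print(table(risk_group))
```

With the outlier present, the plain mean lies above all but two predicted values, whereas the trimmed and winsorised means stay closer to the bulk of the data.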
Details
A data argument is expected if the object argument is a
familiarEnsemble object or one or more familiarModel objects.
Value
A familiarCollection object.
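The scaling to the [0, 1] range described for sample_similarity_metric, and the effect of the _trim and _winsor suffixes, can be sketched in base R. This is not familiar's implementation: scale_range is a hypothetical helper, and deriving the normalisation parameters from sample quantiles and clamping values outside the reference range to [0, 1] are assumptions.

```r
# Hypothetical feature values across samples, with one outlier.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Hypothetical helper: min-max scaling with normalisation parameters taken
# from a reference set of values.
scale_range <- function(values, reference = values) {
  lower <- min(reference)
  upper <- max(reference)
  # Clamp so that values outside the reference range stay within [0, 1].
  pmin(pmax((values - lower) / (upper - lower), 0), 1)
}

# Normalisation parameters from all values.
plain <- scale_range(x)

# Assumed trimming and winsorising rules based on the 5% and 95% quantiles.
limits <- quantile(x, probs = c(0.05, 0.95), names = FALSE)
trimmed <- scale_range(x, reference = x[x >= limits[1] & x <= limits[2]])
winsorised <- scale_range(x, reference = pmin(pmax(x, limits[1]), limits[2]))

print(round(plain, 3))
print(round(trimmed, 3))
print(round(winsorised, 3))
```

Without trimming, the outlier compresses the other nine values into less than a tenth of the range, which would make most samples look nearly identical on this feature when distances are computed.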
Conversion to familiarData object.
Description
Creates a familiarData object from familiarEnsemble or
familiarModel objects.
Usage
as_familiar_data(object, ...)
## S4 method for signature 'familiarData'
as_familiar_data(object, ...)
## S4 method for signature 'familiarEnsemble'
as_familiar_data(object, name = NULL, ...)
## S4 method for signature 'familiarModel'
as_familiar_data(object, ...)
## S4 method for signature 'list'
as_familiar_data(object, ...)
## S4 method for signature 'character'
as_familiar_data(object, ...)
## S4 method for signature 'ANY'
as_familiar_data(object, ...)
Arguments
| object | A familiarData object, or a familiarEnsemble or
familiarModel objects that will be internally converted to a
familiarData object. Paths to such objects can also be provided. | 
| ... | Arguments passed on to extract_data:
 
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 
 time_max: Time point which is used as the benchmark for e.g. cumulative
risks generated by random forest, or the cut-off value for Uno's concordance
index. If not provided explicitly, this parameter is read from settings used
at creation of the underlying familiarModel objects. Only used for
survival outcomes.
 
 evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
 
 aggregation_method: Method for aggregating variable importances for the
purpose of evaluation. Variable importances are determined during feature
selection steps and after training the model. Both types are evaluated, but
feature selection variable importance is only evaluated at run-time.
 See the documentation for the vimp_aggregation_method argument in
summon_familiar for information concerning the different available
methods.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 rank_threshold: The threshold used to define the subset of highly
important features during evaluation.
 See the documentation for the vimp_aggregation_rank_threshold argument in
summon_familiar for more information.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
metric: One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects.
feature_cluster_method: The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_cluster_cut_method: The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut. silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_threshold: The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut method. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_metric: Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, spearman, kendall and pearson. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
sample_cluster_method: The method used to perform clustering based on
distance between samples. These are the same methods as for the
cluster_method configuration parameter: hclust, agnes, diana and pam. none cannot be used when extracting data for feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
sample_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
sample_similarity_metric: Metric to determine pairwise similarity
between samples. Similarity is computed in the same manner as for
clustering, but sample_similarity_metric has different options that are
better suited to computing distance between samples instead of between
features: gower, euclidean. The underlying feature data is scaled to the [0, 1] range (for
numerical features) using the feature values across the samples. The
normalisation parameters required can optionally be computed from feature
data with the outer 5% (on both sides) of feature values trimmed or
winsorised. To do so, append _trim (trimming) or _winsor (winsorising) to
the metric name. This reduces the effect of outliers somewhat. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
icc_type: String indicating the type of intraclass correlation
coefficient (1, 2 or 3) that should be used to compute robustness for
features in repeated measurements during the evaluation of univariate
importance. These types correspond to the types in Shrout and Fleiss (1979).
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
data_element: String indicating which data elements are to be extracted.
Default is all, but specific elements can be specified to speed up
computations if not all elements are to be computed. This is an internal
parameter that is set by, e.g. the export_model_vimp method.
sample_limit: (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can be set for the following data elements:
sample_similarity and ice_data.
detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
aggregate_results: (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point. The default value is TRUE, except when assessing metrics to assess
model performance, as the default violin plot requires underlying data. As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
stratification_method: (optional) Method for determining the
stratification threshold for creating survival groups. The actual,
model-dependent, threshold value is obtained from the development data, and
can afterwards be used to perform stratification on validation data.
 The following stratification methods are available:
 
 median (default): The median predicted value in the development cohort
is used to stratify the samples into two risk groups. For predicted outcome
values that build a continuous spectrum, the two risk groups in the
development cohort will be roughly equal in size.
 mean: The mean predicted value in the development cohort is used to
stratify the samples into two risk groups.
 mean_trim: As mean, but based on the set of predicted values
where the 5% lowest and 5% highest values are discarded. This reduces the
effect of outliers.
 mean_winsor: As mean, but based on the set of predicted values where
the 5% lowest and 5% highest values are winsorised. This reduces the effect
of outliers.
 fixed: Samples are stratified based on the sample quantiles of the
predicted values. These quantiles are defined using the stratification_threshold parameter.
 optimised: Use maximally selected rank statistics to determine the
optimal threshold (Lausen and Schumacher, 1992; Hothorn et al., 2003) to
stratify samples into two optimally separated risk groups.
 One or more stratification methods can be selected simultaneously.
 This parameter is only relevant for survival outcomes.
dynamic_model_loading: (optional) Enables dynamic loading of models
during the evaluation process, if TRUE. Defaults to FALSE. Dynamic
loading of models may reduce the overall memory footprint, at the cost of
increased disk or network IO. Models can only be dynamically loaded if they
are found at an accessible disk or network location. Setting this parameter
to TRUE may help if parallel processing causes out-of-memory issues during
evaluation. | 
| name | Name of the familiarData object. If not set, a name is
automatically generated. | 
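The median, mean, mean_trim and mean_winsor options of stratification_method described in the table above can be illustrated with a minimal sketch in base R. The helper stratification_threshold_sketch is hypothetical and is not part of familiar; the direction of the comparison (values above the threshold form the high-risk group) and the quantile-based trimming are assumptions made for illustration only.

```r
# Hypothetical helper (not part of familiar): derive a stratification
# threshold from predicted values in the development cohort.
stratification_threshold_sketch <- function(predicted, method = "median", fraction = 0.05) {
  # Lower and upper quantiles that bound the central part of the data.
  bounds <- stats::quantile(
    predicted, probs = c(fraction, 1 - fraction), names = FALSE)

  switch(
    method,
    median = stats::median(predicted),
    mean = mean(predicted),
    # Discard values outside the bounds before averaging.
    mean_trim = mean(predicted[predicted >= bounds[1] & predicted <= bounds[2]]),
    # Clamp values outside the bounds to the bounds before averaging.
    mean_winsor = mean(pmin(pmax(predicted, bounds[1]), bounds[2])),
    stop("Unknown method: ", method)
  )
}

set.seed(1)
development <- c(stats::rnorm(99), 50)  # development cohort with one outlier
validation <- stats::rnorm(20)

# The threshold is obtained from development data only ...
threshold <- stratification_threshold_sketch(development, method = "median")

# ... and is then applied unchanged to validation data (assumed direction).
risk_group <- ifelse(validation > threshold, "high", "low")
```

With an outlier such as the one above, the plain mean is pulled upwards, whereas the trimmed and winsorised variants stay close to the bulk of the predicted values.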
Details
The data argument is required if familiarEnsemble or
familiarModel objects are provided.
Value
A familiarData object.
Conversion to familiarEnsemble object.
Description
Creates a familiarEnsemble object from familiarModel
objects.
Usage
as_familiar_ensemble(object, ...)
## S4 method for signature 'familiarEnsemble'
as_familiar_ensemble(object, ...)
## S4 method for signature 'familiarModel'
as_familiar_ensemble(object, ...)
## S4 method for signature 'list'
as_familiar_ensemble(object, ...)
## S4 method for signature 'character'
as_familiar_ensemble(object, ...)
## S4 method for signature 'ANY'
as_familiar_ensemble(object, ...)
Arguments
| object | A familiarEnsemble object, or one or more familiarModel objects that will be internally converted to a familiarEnsemble object.
Paths to such objects can also be provided. | 
| ... | Unused arguments. | 
Value
A familiarEnsemble object.
Extract model coefficients
Description
Extract model coefficients
Usage
coef(object, ...)
## S4 method for signature 'familiarModel'
coef(object, ...)
Arguments
| object | a familiarModel object | 
| ... | additional arguments passed to coef methods for the underlying
model, when available. | 
Details
This method extends the coef S3 method. For some models, coef
requires information that is trimmed from the model. In this case a copy of
the model coefficients is stored with the model, and returned.
Value
Coefficients extracted from the model in the familiarModel object, if
any.
Create randomised groups
Creates randomised groups, e.g. for tests that
depend on splitting (continuous) data into groups, such as the
Hosmer-Lemeshow test.
Description
The default fast mode is based on random sampling, whereas the slow mode is
based on probabilistic joining of adjacent groups. As the name suggests, fast
mode operates considerably more efficiently.
Usage
create_randomised_groups(
  x,
  y = NULL,
  sample_identifiers,
  n_max_groups = NULL,
  n_min_groups = NULL,
  n_min_y_in_group = NULL,
  n_groups_init = 30,
  fast_mode = TRUE
)
Arguments
| x | Vector with data used for sorting. Groups are formed based on
adjacent values. | 
| y | Vector with markers, e.g. the events. Should be 0 or 1 (for an
event). | 
| sample_identifiers | data.table with sample_identifiers. If provided, a
list of grouped sample_identifiers will be returned, and integers
otherwise. | 
| n_max_groups | Maximum number of groups that need to be formed. | 
| n_min_groups | Minimum number of groups that need to be formed. | 
| n_min_y_in_group | Minimum number of y=1 in each group for a valid
group. | 
| n_groups_init | Number of initial groups (default: 30) | 
| fast_mode | Enables fast randomised grouping mode (default: TRUE) | 
Details
Creates randomised groups, e.g. for tests that depend on splitting
(continuous) data into groups, such as the Hosmer-Lemeshow test
-  Determine the maximum number of groups: either 10 or, if smaller, the
number for which each group has 5 events.
 
-  Determine the minimum number of groups (half the maximum, or 2). Groups cannot
exceed the corresponding group size.
 
-  Start with n_groups_init very small groups (default: 30).
 
-  Iterate while the maximum number of groups has not been reached:
 
   -  Selection probability is 1/n_j.
 
   -  If a group exceeds the maximum group size, selection probability is 0.
 
   -  Get cumulative probability and normalise by total.
 
   -  Draw random number between 0 and 1.
 
   -  Select the group which has a cumulative probability range that contains
   the random number.
 
   -  Draw a random number to decide whether to join the group with the right or
   left adjacent group, and assign the group number to the adjacent group.
   Probability depends on the size of adjacent groups. Smaller sizes have
   greater probability of being joined. No joining with groups already
   exceeding the maximum group size. If surrounded on both sides, force
   selection probability for current group to 0. If joining is possible,
   update group size, and selection probability for the new group.
 
 
-  Check that 5 events are present in each group. For each group with < 5
events, try to join with neighbours.
 
-  Start over if the number of groups is smaller than the minimum number.
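 
The probabilistic joining of adjacent groups listed above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the implementation used by familiar: the function join_adjacent_groups_sketch and its arguments are hypothetical, n_j is taken to be the size of group j, only group sizes are tracked, and the event-count check and restart steps are omitted.

```r
# Hypothetical sketch (not familiar's implementation): repeatedly join a
# randomly selected group with one of its adjacent groups.
join_adjacent_groups_sketch <- function(group_sizes, n_max_groups, max_group_size) {
  # Groups surrounded by over-sized neighbours can no longer be selected.
  blocked <- rep(FALSE, length(group_sizes))

  while (length(group_sizes) > n_max_groups) {
    # Selection probability is 1 / n_j, or 0 for blocked or over-sized groups.
    p <- ifelse(blocked | group_sizes >= max_group_size, 0, 1 / group_sizes)
    if (sum(p) == 0) break

    # Cumulative probability, normalised by the total.
    cumulative_p <- cumsum(p)
    cumulative_p <- cumulative_p / cumulative_p[length(cumulative_p)]

    # Select the group whose cumulative range contains the random number.
    j <- which(stats::runif(1) < cumulative_p)[1]

    # Adjacent groups that exist and do not exceed the maximum size.
    neighbours <- c(j - 1, j + 1)
    neighbours <- neighbours[neighbours >= 1 & neighbours <= length(group_sizes)]
    neighbours <- neighbours[group_sizes[neighbours] < max_group_size]

    if (length(neighbours) == 0) {
      # Surrounded on both sides: force the selection probability to 0.
      blocked[j] <- TRUE
      next
    }

    # Smaller adjacent groups have a greater probability of being joined.
    k <- neighbours[sample.int(
      length(neighbours), size = 1, prob = 1 / group_sizes[neighbours])]

    # Join group j with neighbour k and update the group size.
    group_sizes[k] <- group_sizes[k] + group_sizes[j]
    blocked[k] <- FALSE
    group_sizes <- group_sizes[-j]
    blocked <- blocked[-j]
  }

  group_sizes
}

set.seed(42)
initial_sizes <- rep(2, 30)  # 30 very small initial groups
final_sizes <- join_adjacent_groups_sketch(
  initial_sizes, n_max_groups = 10, max_group_size = 10)
```

Joining never adds or removes samples, so sum(final_sizes) equals sum(initial_sizes); only the number of groups and their sizes change.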
 
Value
List of group sample ids or indices.
Data object
Description
The dataObject class is used to resolve the issue of keeping track of
pre-processing status and data loading inside complex workflows, e.g. nested
predict functions inside a calibration function.
Slots
- data: NULL or data table containing the data. This is the data which
will be read and used. 
- preprocessing_level: character indicating the level of pre-processing
already conducted. 
- outcome_type: character, determines the outcome type. 
- data_column_info: Object containing column information. 
- delay_loading: logical. Allows delayed loading of data, which enables data
parsing downstream without additional workflow complexity or memory
utilisation. 
- perturb_level: numeric. This is the perturbation level for data which
has not been loaded. Used for data retrieval by interacting with the run
table of the accompanying model. 
- load_validation: logical. This determines which internal data set will
be loaded. If TRUE, the validation data will be loaded, whereas FALSE loads
the development data. 
- aggregate_on_load: logical. Determines whether data is aggregated after
loading. 
- sample_set_on_load: NULL or vector of sample identifiers to be loaded. 
Encapsulate path
Description
This function is used to encapsulate paths to allow for behaviour switches.
One use is for example when plotting. The plot_all method will encapsulate a
path so that plots may be saved to a directory structure. Other plot methods,
e.g. plot_model_performance do not encapsulate a path, and if the user calls
these functions directly, the plot may be written to the provided path
instead of a directory structure.
Usage
encapsulate_path(path)
Value
encapsulated_path object
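The behaviour switch described above can be sketched with a small S3 wrapper in base R. The names encapsulate_path_sketch and resolve_plot_file, the class name, and the directory layout are all hypothetical; they only illustrate the idea of marking a path so that downstream code treats it differently, and do not reflect familiar's internals.

```r
# Hypothetical sketch: wrap a path in a classed object so that downstream
# code can distinguish it from a plain character path.
encapsulate_path_sketch <- function(path) {
  structure(list(path = path), class = "encapsulated_path_sketch")
}

# Hypothetical consumer: decide where a plot file should be written.
resolve_plot_file <- function(path, plot_type) {
  if (inherits(path, "encapsulated_path_sketch")) {
    # Encapsulated: treat the path as the root of a directory structure.
    file.path(path$path, plot_type, paste0(plot_type, ".png"))
  } else {
    # Plain path supplied by the user: write to it as provided.
    path
  }
}

direct_file <- resolve_plot_file("results/plot.png", "model_performance")
nested_file <- resolve_plot_file(
  encapsulate_path_sketch("results"), "model_performance")
```

In this sketch, direct_file is the path exactly as provided, whereas nested_file points into a subdirectory named after the plot type.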
Experiment data
Description
An experimentData object contains information concerning the experiment.
These objects can be used to instantiate multiple experiments using the same
iterations, feature information and variable importance.
Details
experimentData objects are primarily used to improve
reproducibility, since these allow for training models on a shared
foundation.
Slots
- experiment_setup: Contains information regarding the experimental setup that is used
to generate the iteration list. 
- iteration_list: List of iteration data that determines which instances
are assigned to training, validation and test sets. 
- feature_info: Feature information objects. Only available if the
experimentData object was generated using the precompute_feature_info or precompute_vimp functions.
- vimp_table_list: List of variable importance table objects. Only
available if the experimentData object was created using the
precompute_vimp function.
- project_id: Identifier of the project that generated the experimentData
object. 
- familiar_version: Version of the familiar package used to create this
experimentData. 
See Also
precompute_data_assignment
precompute_feature_info, precompute_vimp
Extract and export all data.
Description
Extract and export all data from a familiarCollection.
Usage
export_all(object, dir_path = NULL, aggregate_results = waiver(), ...)
## S4 method for signature 'familiarCollection'
export_all(object, dir_path = NULL, aggregate_results = waiver(), ...)
## S4 method for signature 'ANY'
export_all(object, dir_path = NULL, aggregate_results = waiver(), ...)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| ... | Arguments passed on to extract_data, as_familiar_collection 
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
time_max: Time point which is used as the benchmark for e.g. cumulative
risks generated by random forest, or the cut-off value for Uno's concordance
index. If not provided explicitly, this parameter is read from settings used
at creation of the underlying familiarModel objects. Only used for survival outcomes.
evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
aggregation_method: Method for aggregating variable importances for the
purpose of evaluation. Variable importances are determined during feature
selection steps and after training the model. Both types are evaluated, but
feature selection variable importance is only evaluated at run-time.
 See the documentation for the vimp_aggregation_method argument in summon_familiar for information concerning the different available
methods. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
rank_threshold: The threshold used to define the subset of highly
important features during evaluation.
 See the documentation for the vimp_aggregation_rank_threshold argument in summon_familiar for more information. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
metric: One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects.
feature_cluster_method: The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_cluster_cut_method: The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut. silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_threshold: The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut method. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_metric: Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, spearman, kendall and pearson. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
sample_cluster_method: The method used to perform clustering based on
distance between samples. These are the same methods as for the
cluster_method configuration parameter: hclust, agnes, diana and pam. none cannot be used when extracting data for feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
sample_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
sample_similarity_metric: Metric to determine pairwise similarity
between samples. Similarity is computed in the same manner as for
clustering, but sample_similarity_metric has different options that are
better suited to computing distance between samples instead of between
features: gower, euclidean. The underlying feature data is scaled to the [0, 1] range (for
numerical features) using the feature values across the samples. The
normalisation parameters required can optionally be computed from feature
data with the outer 5% (on both sides) of feature values trimmed or
winsorised. To do so, append _trim (trimming) or _winsor (winsorising) to
the metric name. This reduces the effect of outliers somewhat. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
icc_type: String indicating the type of intraclass correlation
coefficient (1, 2 or 3) that should be used to compute robustness for
features in repeated measurements during the evaluation of univariate
importance. These types correspond to the types in Shrout and Fleiss (1979).
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
data_element: String indicating which data elements are to be extracted.
Default is all, but specific elements can be specified to speed up
computations if not all elements are to be computed. This is an internal
parameter that is set by, e.g. the export_model_vimp method.
sample_limit: (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can be set for the following data elements:
sample_similarity and ice_data.
detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
stratification_method: (optional) Method for determining the
stratification threshold for creating survival groups. The actual,
model-dependent, threshold value is obtained from the development data, and
can afterwards be used to perform stratification on validation data.
 The following stratification methods are available:
 
 median (default): The median predicted value in the development cohort
is used to stratify the samples into two risk groups. For predicted outcome
values that build a continuous spectrum, the two risk groups in the
development cohort will be roughly equal in size.
 mean: The mean predicted value in the development cohort is used to
stratify the samples into two risk groups.
 mean_trim: As mean, but based on the set of predicted values
where the 5% lowest and 5% highest values are discarded. This reduces the
effect of outliers.
 mean_winsor: As mean, but based on the set of predicted values where
the 5% lowest and 5% highest values are winsorised. This reduces the effect
of outliers.
 fixed: Samples are stratified based on the sample quantiles of the
predicted values. These quantiles are defined using the stratification_threshold parameter.
 optimised: Use maximally selected rank statistics to determine the
optimal threshold (Lausen and Schumacher, 1992; Hothorn et al., 2003) to
stratify samples into two optimally separated risk groups.
 One or more stratification methods can be selected simultaneously.
 This parameter is only relevant for survival outcomes.
dynamic_model_loading: (optional) Enables dynamic loading of models
during the evaluation process, if TRUE. Defaults to FALSE. Dynamic
loading of models may reduce the overall memory footprint, at the cost of
increased disk or network IO. Models can only be dynamically loaded if they
are found at an accessible disk or network location. Setting this parameter
to TRUE may help if parallel processing causes out-of-memory issues during
evaluation.
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
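The stratification thresholds described for stratification_method can be illustrated with a minimal base R sketch. This is not familiar's internal code: the predicted values are simulated, the winsorise helper is hypothetical, and the tertile probabilities used for the fixed method are an assumed example value.

```r
# Hypothetical predicted risk scores for a development cohort,
# including two large outliers.
set.seed(1)
predicted <- c(rexp(98, rate = 1), 25, 40)

# Hypothetical helper: clamp values outside the 5%-95% quantile range.
winsorise <- function(x, fraction = 0.05) {
  limits <- quantile(x, probs = c(fraction, 1 - fraction), names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

thresholds <- list(
  median = median(predicted),
  mean = mean(predicted),
  # trim = 0.05 discards the 5% lowest and 5% highest values.
  mean_trim = mean(predicted, trim = 0.05),
  mean_winsor = mean(winsorise(predicted)),
  # "fixed": sample quantiles of the predictions (tertiles are assumed here).
  fixed = quantile(predicted, probs = c(1 / 3, 2 / 3), names = FALSE)
)

# Assign two risk groups in the development cohort using the median threshold.
risk_group <- ifelse(predicted > thresholds$median, "high", "low")
table(risk_group)
```

The trimmed and winsorised means are pulled less towards the two outliers than the plain mean, which is the effect the descriptions above refer to.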
Details
Data, such as model performance and calibration information, is
usually collected from a familiarCollection object. However, you can also
provide one or more familiarData objects, that will be internally
converted to a familiarCollection object. It is also possible to provide a
familiarEnsemble or one or more familiarModel objects together with the
data from which results are computed prior to export. Paths to files
containing these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Value
A list of data.tables (if dir_path is not provided), or nothing, as
all data is exported to csv files.
Extract and export ROC and Precision-Recall curves.
Description
Extract and export ROC and Precision-Recall curves for models in
a familiarCollection.
Usage
export_auc_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_auc_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_auc_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_auc_data, as_familiar_collection 
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which results are computed prior to export.
Paths to files containing these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
ROC curve data are exported for individual and ensemble models. For ensemble
models, a credibility interval for the ROC curve is determined using
bootstrapping for each metric. In case of multinomial outcomes, ROC-curves
are computed for each class, using a one-against-all approach.
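The one-against-all approach for multinomial outcomes can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: roc_points and auc_trapezoid are hypothetical helpers, and the class labels and probabilities are simulated.

```r
# Hypothetical helper: ROC curve points for one positive class,
# obtained by sweeping the decision threshold over all scores.
roc_points <- function(score, is_positive) {
  ord <- order(score, decreasing = TRUE)
  is_positive <- is_positive[ord]
  data.frame(
    fpr = c(0, cumsum(!is_positive) / sum(!is_positive)),
    tpr = c(0, cumsum(is_positive) / sum(is_positive))
  )
}

# Hypothetical helper: area under the curve by the trapezoidal rule.
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Simulated three-class problem with predicted class probabilities.
set.seed(42)
classes <- c("a", "b", "c")
observed <- sample(classes, size = 150, replace = TRUE)
raw <- sapply(classes, function(cl) runif(150) + 1.0 * (observed == cl))
probability <- raw / rowSums(raw)

# One-against-all: each class in turn is positive, all others are negative.
auc_per_class <- sapply(classes, function(cl) {
  auc_trapezoid(roc_points(probability[, cl], observed == cl))
})
auc_per_class
```

Each class yields its own curve, so a three-class outcome produces three ROC curves rather than one.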
Value
A list of data.tables (if dir_path is not provided), or nothing, as
all data is exported to csv files.
Extract and export calibration and goodness-of-fit tests.
Description
Extract and export calibration and goodness-of-fit tests for data
in a familiarCollection.
Usage
export_calibration_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_calibration_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_calibration_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_calibration_data, as_familiar_collection 
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which results are computed prior to export.
Paths to files containing these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Calibration tests are performed based on expected (predicted) and observed
outcomes. For all outcomes, calibration-in-the-large and calibration slopes
are determined. Furthermore, for all but survival outcomes, a repeated,
randomised grouping Hosmer-Lemeshow test is performed. For survival
outcomes, the Nam-D'Agostino and Greenwood-Nam-D'Agostino tests are
performed.
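As an illustration of the intercept and slope idea for a regression outcome, one common approach is a linear fit of observed against predicted values, where a perfectly calibrated model would give an intercept of 0 and a slope of 1. The sketch below uses simulated data and is an assumption about the general technique, not a description of how familiar computes these quantities.

```r
set.seed(7)
predicted <- runif(200, min = 0, max = 10)

# Simulated observations: offset by 0.5 and with a slope of 0.8
# relative to the predictions, plus noise.
observed <- 0.5 + 0.8 * predicted + rnorm(200, sd = 0.5)

# Linear calibration fit of observed against predicted values.
fit <- lm(observed ~ predicted)
calibration <- c(
  intercept = unname(coef(fit)["(Intercept)"]),
  slope = unname(coef(fit)["predicted"])
)
calibration
```

A slope below 1, as simulated here, indicates that high predictions overshoot and low predictions undershoot the observed values.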
Value
A list of data.tables (if dir_path is not provided), or nothing, as
all data is exported to csv files.
Extract and export calibration information.
Description
Extract and export calibration information (e.g. baseline
survival) for data in a familiarCollection.
Usage
export_calibration_info(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_calibration_info(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_calibration_info(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection 
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which results are computed prior to export.
Paths to files containing these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Currently only baseline survival is exported as supporting calibration
information. See export_calibration_data for export of direct assessment
of calibration, including calibration and goodness-of-fit tests.
Value
A data.table (if dir_path is not provided), or nothing, as all data
is exported to csv files.
Extract and export confusion matrices.
Description
Extract and export confusion matrices for models in a
familiarCollection.
Usage
export_confusion_matrix_data(
  object,
  dir_path = NULL,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_confusion_matrix_data(
  object,
  dir_path = NULL,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_confusion_matrix_data(
  object,
  dir_path = NULL,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_confusion_matrix, as_familiar_collection 
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which results are computed prior to export.
Paths to files containing these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Confusion matrices are exported for individual and ensemble models.
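For readers unfamiliar with the structure of a confusion matrix, the following base R sketch cross-tabulates observed against predicted classes. The class labels and values are made up for illustration, and the layout of the tables exported by familiar may differ.

```r
classes <- c("benign", "malignant")

# Made-up observed and predicted classes for six samples.
observed <- factor(
  c("benign", "benign", "malignant", "malignant", "malignant", "benign"),
  levels = classes
)
predicted <- factor(
  c("benign", "malignant", "malignant", "malignant", "benign", "benign"),
  levels = classes
)

# Rows are observed classes, columns are predicted classes.
confusion <- table(observed = observed, predicted = predicted)
confusion

# The diagonal holds the correctly classified samples.
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```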
Value
A list of data.tables (if dir_path is not provided), or nothing, as
all data is exported to csv files.
Extract and export decision curve analysis data.
Description
Extract and export decision curve analysis data in a
familiarCollection.
Usage
export_decision_curve_analysis_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  ...
)
## S4 method for signature 'familiarCollection'
export_decision_curve_analysis_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  ...
)
## S4 method for signature 'ANY'
export_decision_curve_analysis_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| ... | Arguments passed on to as_familiar_collection 
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which results are computed prior to export.
Paths to files containing these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Decision curve analysis data is computed for categorical outcomes, i.e.
binomial and multinomial, as well as survival outcomes.
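For a binomial outcome, decision curve analysis is usually based on the net benefit at a range of threshold probabilities. The sketch below uses the standard net benefit definition (true positives minus false positives weighted by the threshold odds, both per sample) on simulated data. The net_benefit helper is hypothetical and not part of familiar, and survival outcomes are not covered.

```r
# Hypothetical helper: net benefit of treating all samples whose predicted
# probability is at or above the threshold probability.
net_benefit <- function(probability, is_event, threshold) {
  n <- length(is_event)
  treated <- probability >= threshold
  true_positive <- sum(treated & is_event)
  false_positive <- sum(treated & !is_event)
  true_positive / n - false_positive / n * threshold / (1 - threshold)
}

# Simulated predictions; events occur according to the predicted risk.
set.seed(3)
probability <- runif(500)
is_event <- runif(500) < probability

thresholds <- seq(0.05, 0.95, by = 0.05)
curve <- data.frame(
  threshold = thresholds,
  model = sapply(thresholds, function(p_t) {
    net_benefit(probability, is_event, p_t)
  }),
  # Reference strategy: every sample is treated as positive.
  treat_all = sapply(thresholds, function(p_t) {
    net_benefit(rep(1, 500), is_event, p_t)
  })
)
head(curve)
```

A model is useful at a given threshold if its net benefit exceeds both the treat-all reference and zero, which corresponds to treating nobody.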
Value
A list of data.tables (if dir_path is not provided), or nothing, as
all data is exported to csv files.
Extract and export feature expressions.
Description
Extract and export feature expressions for the features in a
familiarCollection.
Usage
export_feature_expressions(
  object,
  dir_path = NULL,
  evaluation_time = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_feature_expressions(
  object,
  dir_path = NULL,
  evaluation_time = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_feature_expressions(
  object,
  dir_path = NULL,
  evaluation_time = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| evaluation_time | One or more time points that are used to create the
outcome columns in expression plots. If not provided explicitly, this
parameter is read from settings used at creation of the underlying
familiarData objects. Only used for survival outcomes. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_feature_expression, as_familiar_collection 
feature_similarity: Table containing pairwise distance between
samples. This is used to determine cluster information, and indicate which
samples are similar. The table is created by the
extract_sample_similarity method.
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
feature_cluster_method: The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_metric: Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, spearman, kendall and pearson. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
sample_cluster_method: The method used to perform clustering based on
distance between samples. These are the same methods as for the
cluster_method configuration parameter: hclust, agnes, diana and pam. none cannot be used when extracting data for feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
sample_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
sample_similarity_metric: Metric to determine pairwise similarity
between samples. Similarity is computed in the same manner as for
clustering, but sample_similarity_metric has different options that are
better suited to computing distance between samples instead of between
features: gower, euclidean. The underlying feature data is scaled to the [0, 1] range (for
numerical features) using the feature values across the samples. The
normalisation parameters required can optionally be computed from feature
data with the outer 5% (on both sides) of feature values trimmed or
winsorised. To do so append _trim (trimming) or _winsor (winsorising) to
the metric name. This reduces the effect of outliers somewhat. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which results are computed prior to export.
Paths to files containing these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Feature expressions are computed by standardising each feature, i.e. sample
mean is 0 and standard deviation is 1.
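The standardisation described above corresponds to a z-transformation of each feature column. A minimal base R sketch, using made-up feature names and simulated values, could look as follows.

```r
# Simulated feature data; the feature names are made up for illustration.
set.seed(11)
features <- data.frame(
  age = rnorm(50, mean = 60, sd = 10),
  tumour_volume = rexp(50, rate = 0.1)
)

# Subtract the sample mean and divide by the sample standard deviation,
# separately for each feature.
expression <- as.data.frame(scale(features, center = TRUE, scale = TRUE))

round(colMeans(expression), 10)
sapply(expression, sd)
```

After the transformation, features measured on very different scales can be compared in the same expression heatmap.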
Value
A data.table (if dir_path is not provided), or nothing, as all data
is exported to csv files.
Extract and export mutual correlation between features.
Description
Extract and export mutual correlation between features in a
familiarCollection.
Usage
export_feature_similarity(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  export_dendrogram = FALSE,
  export_ordered_data = FALSE,
  export_clustering = FALSE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_feature_similarity(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  export_dendrogram = FALSE,
  export_ordered_data = FALSE,
  export_clustering = FALSE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_feature_similarity(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  export_dendrogram = FALSE,
  export_ordered_data = FALSE,
  export_clustering = FALSE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| feature_cluster_method | The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_cluster_cut_method | The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut. silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_similarity_threshold | The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut method. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| export_dendrogram | Add dendrogram in the data element objects. | 
| export_ordered_data | Add feature label ordering to data in the data
element objects. | 
| export_clustering | Add clustering information to data. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection 
 familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
 collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which data is computed prior to export.
Paths to the previous files can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Value
A list containing a data.table (if dir_path is not provided), or
nothing, as all data is exported to csv files.
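The interplay between the clustering, linkage and fixed-cut arguments can be illustrated with a minimal base R sketch. This is not familiar's internal implementation: the simulated features, the use of absolute Pearson correlation as similarity, and the threshold value are all assumptions made for illustration.

```r
# Illustrative only: base R (stats), not familiar's internal code.
set.seed(1)
n <- 200
x1 <- rnorm(n)
features <- data.frame(
  a = x1,
  b = x1 + rnorm(n, sd = 0.1),  # nearly a copy of feature a
  c = rnorm(n),
  d = rnorm(n)
)

# Pair-wise similarity: absolute Pearson correlation (assumed metric).
similarity <- abs(cor(features))

# Agglomerative clustering (cf. hclust with average linkage) on 1 - similarity.
tree <- hclust(as.dist(1 - similarity), method = "average")

# Fixed cut: features whose similarity exceeds the threshold share a cluster.
similarity_threshold <- 0.9  # hypothetical value
clusters <- cutree(tree, h = 1 - similarity_threshold)
print(clusters)
```

Features a and b end up in the same cluster because their correlation is far above the threshold, whereas c and d each form their own cluster.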
Extract and export feature selection variable importance.
Description
Extract and export feature selection variable importance from a
familiarCollection.
Usage
export_fs_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_fs_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_fs_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| aggregation_method | (optional) The method used to aggregate variable
importances over different data subsets, e.g. bootstraps. The following
methods can be selected:
 
 mean (default): Use the mean rank of a feature over the subsets to
determine the aggregated feature rank.
 median: Use the median rank of a feature over the subsets to determine
the aggregated feature rank.
 best: Use the best rank the feature obtained in any subset to determine
the aggregated feature rank.
 worst: Use the worst rank the feature obtained in any subset to
determine the aggregated feature rank.
 stability: Use the frequency of the feature being in the subset of
highly ranked features as measure for the aggregated feature rank
(Meinshausen and Buehlmann, 2010).
 exponential: Use a rank-weighted frequency of occurrence in the subset
of highly ranked features as measure for the aggregated feature rank (Haury
et al., 2011).
 borda: Use the Borda count as measure for the aggregated feature rank
(Wald et al., 2012).
 enhanced_borda: Use an occurrence frequency-weighted Borda count as
measure for the aggregated feature rank (Wald et al., 2012).
 truncated_borda: Use the Borda count computed only on features within the
subset of highly ranked features.
 enhanced_truncated_borda: Apply both the enhanced Borda method and the
truncated Borda method and use the resulting Borda count as the aggregated
feature rank.
 | 
| rank_threshold | (optional) The threshold used to define the subset of
highly important features. If not set, this threshold is determined by
maximising the variance in the occurrence value over all features over the
subset size.
 This parameter is only relevant for the stability, exponential, enhanced_borda, truncated_borda and enhanced_truncated_borda methods. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection 
 familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
 collection_name: Name of the collection. | 
Details
Data, such as model performance and calibration information, is
usually collected from a familiarCollection object. However, you can also
provide one or more familiarData objects, that will be internally
converted to a familiarCollection object. Paths to the previous files can
also be provided.
Unlike other export functions, export using familiarEnsemble or
familiarModel objects is not possible. This is because feature selection
variable importance is not stored within familiarModel objects.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Variable importance is based on the ranking produced by feature selection
routines. In case feature selection was performed repeatedly, e.g. using
bootstraps, feature ranks are first aggregated using the method defined by
the aggregation_method, some of which require a rank_threshold to
indicate a subset of most important features.
Information concerning highly similar features that form clusters is
provided as well. This information is based on consensus clustering of the
features. This clustering information is also used during aggregation to
ensure that co-clustered features are only taken into account once.
Value
A data.table (if dir_path is not provided), or nothing, as all data
is exported to csv files.
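The simpler aggregation methods can be sketched in base R. The rank matrix below is made up, the rank_threshold value is hypothetical, and the code only illustrates the ideas behind mean, median, best, worst and stability; it does not reproduce familiar's own computation (for example, it ignores feature clusters).

```r
# Rows are features, columns are ranks (1 = best) from three hypothetical bootstraps.
ranks <- matrix(
  c(1, 2, 1,
    2, 1, 3,
    3, 4, 2,
    4, 3, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3", "f4"), c("boot1", "boot2", "boot3"))
)

rank_threshold <- 2  # ranks <= 2 count as "highly ranked" (hypothetical value)

scores <- data.frame(
  mean = rowMeans(ranks),
  median = apply(ranks, 1, median),
  best = apply(ranks, 1, min),
  worst = apply(ranks, 1, max),
  # stability: fraction of subsets in which the feature is highly ranked
  stability = rowMeans(ranks <= rank_threshold),
  row.names = rownames(ranks)
)
print(scores)

# Aggregated rank based on the mean rank: a lower mean rank is better.
aggregated_rank <- rank(scores$mean, ties.method = "min")
names(aggregated_rank) <- rownames(ranks)
print(aggregated_rank)
```

Note that stability is a frequency, so a higher value indicates a more important feature, whereas for the rank-based scores a lower value is better.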
Extract and export model hyperparameters.
Description
Extract and export model hyperparameters from models in a
familiarCollection.
Usage
export_hyperparameters(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_hyperparameters(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_hyperparameters(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection 
 familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
 collection_name: Name of the collection. | 
Details
Data, such as model performance and calibration information, is
usually collected from a familiarCollection object. However, you can also
provide one or more familiarData objects, that will be internally
converted to a familiarCollection object. It is also possible to provide a
familiarEnsemble or one or more familiarModel objects together with the
data from which data is computed prior to export. Paths to the previous
files can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Many model hyperparameters are optimised using sequential model-based
optimisation. The extracted hyperparameters are those that were selected to
construct the underlying models (familiarModel objects).
Value
A data.table (if dir_path is not provided), or nothing, as all data
is exported to csv files. In case of the latter, hyperparameters are
summarised.
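A minimal usage sketch follows, assuming the familiar package is installed and that collection is a familiarCollection object (or a path to one) from an earlier run. The helper name get_hyperparameters and its out_dir argument are hypothetical; only export_hyperparameters and its documented arguments come from the usage above.

```r
# Hypothetical helper around export_hyperparameters().
# `collection`: assumed to be a familiarCollection object or a path to one.
# `out_dir`: NULL returns the data in the R session; a folder path writes csv files.
get_hyperparameters <- function(collection, out_dir = NULL) {
  familiar::export_hyperparameters(
    object = collection,
    dir_path = out_dir,
    aggregate_results = TRUE,
    export_collection = FALSE
  )
}
```

Calling get_hyperparameters(collection) would return the hyperparameter data directly, whereas passing a folder as out_dir would write the summarised csv files instead.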
Extract and export individual conditional expectation data.
Description
Extract and export individual conditional expectation data.
Usage
export_ice_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_ice_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_ice_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_ice, as_familiar_collection 
 features: Names of the feature or features (2) assessed simultaneously.
By default NULL, which means that all features are assessed one-by-one.
 feature_x_range: When one or two features are defined using features,
feature_x_range can be used to set the range of values for the first
feature. For numeric features, a vector of two values is assumed to indicate
a range from which n_sample_points are uniformly sampled. A vector of more
than two values is interpreted as is, i.e. these represent the values to be
sampled. For categorical features, values should represent a (sub)set of
available levels.
 feature_y_range: As feature_x_range, but for the second feature in
case two features are defined.
 n_sample_points: Number of points used to sample continuous features.
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 sample_limit: (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
  This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can be set for the following data elements:
sample_similarity and ice_data.
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
  ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
  hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
  model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
  Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
  hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
  A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
 estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
  point: Point estimates.
  bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
  bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
  As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
 confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. In case bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
 bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
  Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
 familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
 collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which data is computed prior to export.
Paths to the previous files can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Value
A list of data.tables (if dir_path is not provided), or nothing, as
all data is exported to csv files.
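What individual conditional expectation (ICE) data represent can be sketched in base R. The linear model and simulated data below are stand-ins chosen for illustration, and evenly spaced grid points are assumed as one reading of how n_sample_points are drawn from a two-value feature_x_range; none of this is familiar's internal code.

```r
set.seed(42)
n <- 50
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 * d$x1 + d$x2 + rnorm(n, sd = 0.1)
model <- lm(y ~ x1 + x2, data = d)

# Sample points for x1: a two-value range sampled at evenly spaced points.
x_range <- c(0, 1)
n_sample_points <- 5
grid <- seq(x_range[1], x_range[2], length.out = n_sample_points)

# One ICE curve per sample: vary x1 over the grid, keep all other features fixed.
ice <- sapply(grid, function(value) {
  d_mod <- d
  d_mod$x1 <- value
  predict(model, newdata = d_mod)
})

# ice has one row per sample and one column per grid point.
# Averaging the ICE curves over samples gives the partial dependence curve.
partial_dependence <- colMeans(ice)
print(round(partial_dependence, 3))
```

For this linear model all ICE curves are parallel lines, so the partial dependence curve increases by the x1 coefficient times the grid spacing at each step.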
Extract and export metrics for model performance.
Description
Extract and export metrics for model performance of models in a
familiarCollection.
Usage
export_model_performance(
  object,
  dir_path = NULL,
  aggregate_results = FALSE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_model_performance(
  object,
  dir_path = NULL,
  aggregate_results = FALSE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_model_performance(
  object,
  dir_path = NULL,
  aggregate_results = FALSE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_performance, as_familiar_collection 
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 metric: One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects.
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
  ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
  hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
  model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
  Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
  hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
  A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
 estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
  point: Point estimates.
  bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
  bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
  As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
 confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. In case bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
 bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
  Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
 familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
 collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which data is computed prior to export.
Paths to the previous files can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Performance of individual and ensemble models is exported. For ensemble
models, a confidence interval is determined using bootstrapping for each
metric.
Value
A list of data.tables (if dir_path is not provided), or nothing, as
all data is exported to csv files.
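How a bootstrap confidence interval for a performance metric can be obtained is sketched below in base R. The simulated predictions, the RMSE metric, the number of bootstraps and the use of a plain percentile interval are all assumptions for illustration; the sketch does not claim to match the bootstrap method or bias correction used by familiar.

```r
set.seed(7)
n <- 100
observed <- rnorm(n)
predicted <- observed + rnorm(n, sd = 0.5)

# Root mean squared error as an example performance metric.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Point estimate on the full data.
point_estimate <- rmse(observed, predicted)

# Bootstrap: resample samples with replacement and recompute the metric.
n_bootstraps <- 400  # hypothetical choice
boot_estimates <- replicate(n_bootstraps, {
  idx <- sample.int(n, size = n, replace = TRUE)
  rmse(observed[idx], predicted[idx])
})

# Percentile interval at the requested confidence level.
confidence_level <- 0.95
alpha <- 1 - confidence_level
ci <- quantile(boot_estimates, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)

print(c(estimate = point_estimate, lower = ci[1], upper = ci[2]))
```

With estimation_type set to point only the first value would be of interest, whereas the interval bounds correspond to what a bootstrap confidence interval adds.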
Extract and export model-based variable importance.
Description
Extract and export model-based variable importance from a
familiarCollection.
Usage
export_model_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_model_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_model_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| aggregation_method | (optional) The method used to aggregate variable
importances over different data subsets, e.g. bootstraps. The following
methods can be selected:
 
 mean (default): Use the mean rank of a feature over the subsets to
determine the aggregated feature rank.
 median: Use the median rank of a feature over the subsets to determine
the aggregated feature rank.
 best: Use the best rank the feature obtained in any subset to determine
the aggregated feature rank.
 worst: Use the worst rank the feature obtained in any subset to
determine the aggregated feature rank.
 stability: Use the frequency of the feature being in the subset of
highly ranked features as measure for the aggregated feature rank
(Meinshausen and Buehlmann, 2010).
 exponential: Use a rank-weighted frequency of occurrence in the subset
of highly ranked features as measure for the aggregated feature rank (Haury
et al., 2011).
 borda: Use the Borda count as measure for the aggregated feature rank
(Wald et al., 2012).
 enhanced_borda: Use an occurrence frequency-weighted Borda count as
measure for the aggregated feature rank (Wald et al., 2012).
 truncated_borda: Use the Borda count computed only on features within the
subset of highly ranked features.
 enhanced_truncated_borda: Apply both the enhanced Borda method and the
truncated Borda method and use the resulting Borda count as the aggregated
feature rank.
 | 
| rank_threshold | (optional) The threshold used to define the subset of
highly important features. If not set, this threshold is determined by
maximising the variance in the occurrence value over all features over the
subset size.
 This parameter is only relevant for the stability, exponential, enhanced_borda, truncated_borda and enhanced_truncated_borda methods. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection 
 familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
 collection_name: Name of the collection. | 
Details
Data, such as model performance and calibration information, is
usually collected from a familiarCollection object. However, you can also
provide one or more familiarData objects, that will be internally
converted to a familiarCollection object. It is also possible to provide a
familiarEnsemble or one or more familiarModel objects together with the
data from which data is computed prior to export. Paths to the previous
files can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Variable importance is based on the ranking produced by model-specific
variable importance routines, e.g. permutation for random forests. If such a
routine is absent, variable importance is based on the feature selection
method that led to the features included in the model. In case multiple
models (familiarModel objects) are combined, feature ranks are first
aggregated using the method defined by the aggregation_method, some of
which require a rank_threshold to indicate a subset of most important
features.
Information concerning highly similar features that form clusters is
provided as well. This information is based on consensus clustering of the
features that were used in the signatures of the underlying models. This
clustering information is also used during aggregation to ensure that
co-clustered features are only taken into account once.
Value
A data.table (if dir_path is not provided), or nothing, as all data
is exported to csv files.
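The Borda-based aggregation methods can be illustrated with a textbook Borda count in base R. The rank matrix and rank_threshold are made up, and the scoring (a feature at rank r among N features earns N - r + 1 points) is the classic definition, assumed here for illustration; familiar may normalise or weight the count differently.

```r
# Rows are features, columns are ranks (1 = best) from three hypothetical models.
ranks <- matrix(
  c(1, 2, 1,
    2, 1, 3,
    3, 4, 2,
    4, 3, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3", "f4"), c("model1", "model2", "model3"))
)
n_features <- nrow(ranks)

# Classic Borda count: rank r earns n_features - r + 1 points, summed over models.
borda <- rowSums(n_features - ranks + 1)

# Truncated variant: only ranks within the threshold earn points.
rank_threshold <- 2  # hypothetical value
truncated_borda <- rowSums((rank_threshold - ranks + 1) * (ranks <= rank_threshold))

print(rbind(borda, truncated_borda))

# A higher count is better, so the aggregated rank follows the descending count.
aggregated_rank <- rank(-borda, ties.method = "min")
print(aggregated_rank)
```

In the truncated variant, features that never reach the highly ranked subset receive a count of zero, which is what sets it apart from the plain count.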
Extract and export partial dependence data.
Description
Extract and export partial dependence data.
Usage
export_partial_dependence_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_partial_dependence_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_partial_dependence_data(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_ice, as_familiar_collection 
features: Names of the feature or features (2) assessed simultaneously.
By default NULL, which means that all features are assessed one-by-one.
feature_x_range: When one or two features are defined using features, feature_x_range can be used to set the range of values for the first
feature. For numeric features, a vector of two values is assumed to indicate
a range from which n_sample_points are uniformly sampled. A vector of more
than two values is interpreted as is, i.e. these represent the values to be
sampled. For categorical features, values should represent a (sub)set of
available levels.
feature_y_range: As feature_x_range, but for the second feature in
case two features are defined.
n_sample_points: Number of points used to sample continuous features.
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
sample_limit: (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can be set for the following data elements:
sample_similarity and ice_data.
detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. In case bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, which will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which results are computed prior to export.
Paths to files containing any of these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Value
A list of data.tables (if dir_path is not provided), or nothing, as
all data is exported to csv files.
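The way feature_x_range is interpreted for a single feature, as described under Arguments, can be sketched in base R. This is only an illustration and not the code used by familiar: the helper name resolve_feature_range is hypothetical, and "uniformly sampled" is assumed here to mean evenly spaced points between the two bounds.

```r
# Hypothetical helper (not part of familiar): turn a user-supplied range into
# the values at which a feature would be sampled.
resolve_feature_range <- function(feature_range, n_sample_points) {
  # Assumption: two numeric values define a range that is sampled at
  # n_sample_points evenly spaced positions.
  if (is.numeric(feature_range) && length(feature_range) == 2L) {
    return(seq(
      from = min(feature_range),
      to = max(feature_range),
      length.out = n_sample_points
    ))
  }

  # More than two numeric values, or categorical levels, are used as is.
  feature_range
}

numeric_points <- resolve_feature_range(c(0, 10), n_sample_points = 5L)
explicit_points <- resolve_feature_range(c(1, 2, 4), n_sample_points = 5L)
level_subset <- resolve_feature_range(c("a", "b"), n_sample_points = 5L)
```

Here numeric_points holds five evenly spaced values between 0 and 10, while explicit_points and level_subset are returned unchanged.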
Extract and export permutation variable importance.
Description
Extract and export permutation variable importance from a
familiarCollection.
Usage
export_permutation_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_permutation_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_permutation_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_permutation_vimp, as_familiar_collection 
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
metric: One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects.
feature_cluster_method: The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam.
 none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_cluster_cut_method: The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut.
 silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_threshold: The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut method.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_metric: Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, spearman, kendall and pearson.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. In case bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data, such as permutation variable importance and calibration
information, is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, which will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which results are computed prior to export.
Paths to files containing any of these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Permutation variable importance assesses the improvement in model
performance due to a feature. For this purpose, the performance of the model
is measured as normal, and is measured again with a dataset where the values
of the feature in question have been randomly permuted. The difference
between both performance measurements is the permutation variable
importance.
In familiar, this basic concept is extended in several ways:
-  Point estimates of variable importance are based on multiple (21) random
permutations. The difference between model performance on the normal dataset
and the median performance measurement of the randomly permuted datasets is
used as permutation variable importance.
 
-  Confidence intervals for the ensemble model are determined using bootstrap
methods.
 
-  Permutation variable importance is assessed for any metric specified using
the metric argument.
 
-  Permutation variable importance can take into account similarity between
features and permute similar features simultaneously.
 
Value
A data.table (if dir_path is not provided), or nothing, as all data
is exported to csv files.
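The basic idea behind permutation variable importance described in the Details can be sketched in base R. This is a minimal illustration, not the implementation in familiar: it uses a linear model on the mtcars dataset, root mean squared error as the metric, and hypothetical helper names (rmse, permutation_vimp). Because a lower error is better, importance is expressed here as the increase in error after permutation.

```r
set.seed(123)

# Small example dataset and model (assumptions for illustration only).
data <- mtcars[, c("mpg", "wt", "hp", "qsec")]
model <- stats::lm(mpg ~ wt + hp + qsec, data = data)

# Performance metric: root mean squared error (lower is better).
rmse <- function(model, data) {
  sqrt(mean((data$mpg - stats::predict(model, newdata = data))^2))
}

# Hypothetical helper: importance of one feature, based on the median
# performance over 21 random permutations of that feature.
permutation_vimp <- function(model, data, feature, n_permutations = 21L) {
  baseline <- rmse(model, data)

  permuted <- vapply(
    seq_len(n_permutations),
    function(ii) {
      shuffled <- data
      shuffled[[feature]] <- sample(shuffled[[feature]])
      rmse(model, shuffled)
    },
    numeric(1)
  )

  # Increase in error relative to the unpermuted dataset.
  stats::median(permuted) - baseline
}

vimp <- vapply(
  c("wt", "hp", "qsec"),
  function(feature) permutation_vimp(model, data, feature),
  numeric(1)
)
```

The resulting vimp vector contains one value per feature. Larger values indicate that the model relies more strongly on that feature.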
Extract and export predicted values.
Description
Extract and export the values predicted by single and ensemble
models in a familiarCollection.
Usage
export_prediction_data(object, dir_path = NULL, export_collection = FALSE, ...)
## S4 method for signature 'familiarCollection'
export_prediction_data(object, dir_path = NULL, export_collection = FALSE, ...)
## S4 method for signature 'ANY'
export_prediction_data(object, dir_path = NULL, export_collection = FALSE, ...)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_predictions, as_familiar_collection 
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
aggregate_results: (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point.
 The default value is equal to TRUE except when assessing metrics to assess
model performance, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. In case bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data, such as model performance and calibration information, is
usually collected from a familiarCollection object. However, you can also
provide one or more familiarData objects, which will be internally
converted to a familiarCollection object. It is also possible to provide a
familiarEnsemble or one or more familiarModel objects together with the
data from which results are computed prior to export. Paths to files
containing any of these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Both single and ensemble predictions are exported.
Value
A list of data.tables (if dir_path is not provided), or nothing, as
all data is exported to csv files.
Extract and export sample risk group stratification and associated
tests.
Description
Extract and export sample risk group stratification and
associated tests for data in a familiarCollection.
Usage
export_risk_stratification_data(
  object,
  dir_path = NULL,
  export_strata = TRUE,
  time_range = NULL,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_risk_stratification_data(
  object,
  dir_path = NULL,
  export_strata = TRUE,
  time_range = NULL,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_risk_stratification_data(
  object,
  dir_path = NULL,
  export_strata = TRUE,
  time_range = NULL,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| export_strata | Flag that determines whether the raw data or strata are
exported. | 
| time_range | Time range for which strata should be created. If NULL,
the full time range is used. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_risk_stratification_data, as_familiar_collection 
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. In case bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, which will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which results are computed prior to export.
Paths to files containing any of these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Three tables are exported in a list:
-  data: Contains the assigned risk group for a given sample, along with
its reported survival time and censoring status.
 
-  hr_ratio: Contains the hazard ratio between different risk groups.
 
-  logrank: Contains the results from the logrank test between different
risk groups.
 
Value
A list of data.tables (if dir_path is not provided), or nothing, as
all data is exported to csv files.
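The kind of information in the three exported tables can be illustrated with the survival package. The sketch below is not how familiar computes these tables: it uses simulated data, treats a simulated numeric score as the model prediction, and stratifies at the median. All variable names are assumptions made for this example.

```r
library(survival)

set.seed(1)

# Simulated predictions and survival data (assumption for illustration).
n <- 200L
predicted <- rnorm(n)
event_time <- rexp(n, rate = 0.1 * exp(0.8 * predicted))
censor_time <- rexp(n, rate = 0.05)

data <- data.frame(
  time = pmin(event_time, censor_time),
  status = as.integer(event_time <= censor_time),
  predicted = predicted
)

# Assign risk groups using the median predicted value as cut-off.
cutoff <- stats::median(data$predicted)
data$risk_group <- factor(
  ifelse(data$predicted > cutoff, "high", "low"),
  levels = c("low", "high")
)

# Logrank test between the risk groups.
logrank <- survival::survdiff(Surv(time, status) ~ risk_group, data = data)
logrank_p <- stats::pchisq(
  logrank$chisq,
  df = length(logrank$n) - 1L,
  lower.tail = FALSE
)

# Hazard ratio of the high-risk group relative to the low-risk group.
cox_fit <- survival::coxph(Surv(time, status) ~ risk_group, data = data)
hazard_ratio <- unname(exp(stats::coef(cox_fit)))
```

In this sketch, data corresponds to the per-sample table with assigned risk groups, while hazard_ratio and logrank_p correspond to the type of content reported in the hr_ratio and logrank tables.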
Extract and export cut-off values for risk group stratification.
Description
Extract and export cut-off values for risk group stratification
by models in a familiarCollection.
Usage
export_risk_stratification_info(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_risk_stratification_info(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_risk_stratification_info(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection 
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, which will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which results are computed prior to export.
Paths to files containing any of these objects can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Stratification cut-off values are determined when creating a model, using
one of several methods set by the stratification_method parameter. These
values are then used to stratify samples in any new dataset. The available
methods are:
-  median (default): The median predicted value in the development cohort
is used to stratify the samples into two risk groups.
 
-  fixed: Samples are stratified based on the sample quantiles of the
predicted values. These quantiles are defined using the stratification_threshold parameter.
 
-  optimised: Use maximally selected rank statistics to determine the
optimal threshold (Lausen and Schumacher, 1992; Hothorn and Lausen, 2003) to
stratify samples into two optimally separated risk groups.
 
Value
A data.table (if dir_path is not provided), or nothing, as all data
is exported to csv files.
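How cut-off values from the median and fixed methods translate into risk groups can be sketched in base R. This is an illustration under stated assumptions, not the implementation in familiar: the predictions are simulated, the quantiles chosen for stratification_threshold are arbitrary, and assign_risk_group is a hypothetical helper. The optimised method is not shown.

```r
set.seed(42)

# Simulated predicted values for the development cohort (assumption).
predicted_development <- rnorm(100)

# median: a single cut-off at the median predicted value.
median_cutoff <- stats::median(predicted_development)

# fixed: cut-offs at sample quantiles of the predicted values. The quantiles
# used here are arbitrary example values.
stratification_threshold <- c(1 / 3, 2 / 3)
fixed_cutoffs <- stats::quantile(
  predicted_development,
  probs = stratification_threshold,
  names = FALSE
)

# Hypothetical helper: apply cut-offs determined during development to
# predictions for new samples. Group 1 has the lowest predicted values.
assign_risk_group <- function(predicted, cutoffs) {
  findInterval(predicted, cutoffs) + 1L
}

predicted_new <- c(-10, 10)
groups_median <- assign_risk_group(predicted_new, median_cutoff)
groups_fixed <- assign_risk_group(predicted_new, fixed_cutoffs)
```

The cut-offs are derived once from the development cohort and are then reused unchanged for any new dataset, which is why assign_risk_group only receives the stored cut-off values.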
References
-  Lausen, B. & Schumacher, M. Maximally Selected Rank Statistics.
Biometrics 48, 73 (1992).
 
-  Hothorn, T. & Lausen, B. On the exact distribution of maximally selected
rank statistics. Comput. Stat. Data Anal. 43, 121–137 (2003).
 
Extract and export similarity between samples.
Description
Extract and export similarity between samples in a
familiarCollection.
Usage
export_sample_similarity(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  sample_limit = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  export_dendrogram = FALSE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_sample_similarity(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  sample_limit = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  export_dendrogram = FALSE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_sample_similarity(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  sample_limit = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  export_dendrogram = FALSE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| aggregate_results | Flag that signifies whether results should be
aggregated for export. | 
| sample_limit | (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can be set for the following data elements:
sample_similarity and ice_data. | 
| sample_cluster_method | The method used to perform clustering based on
distance between samples. These are the same methods as for the
cluster_method configuration parameter: hclust, agnes, diana and pam. none cannot be used when extracting data for feature expressions.
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| sample_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| export_dendrogram | Add dendrogram in the data element objects. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection:
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which data is computed prior to export.
Paths to the previous files can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Value
A list containing a data.table (if dir_path is not provided), or
nothing, as all data is exported to csv files.
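
The clustering step controlled by sample_cluster_method and sample_linkage_method can be pictured with the base R sketch below, which clusters samples using hclust with average linkage. The simulated feature matrix and the Euclidean distance are assumptions for illustration only. familiar computes sample similarity with its own metrics, so this is a sketch of the general technique rather than of the package internals.

```r
# Illustrative sketch only: simulated data and Euclidean distance are assumptions.
set.seed(42)

# 30 samples described by 5 numeric features.
features <- matrix(
  rnorm(150), nrow = 30, ncol = 5,
  dimnames = list(paste0("sample_", 1:30), paste0("feature_", 1:5)))

# Pairwise distance between samples (rows).
sample_distance <- stats::dist(features, method = "euclidean")

# Agglomerative clustering, comparable to hclust with average linkage.
sample_tree <- stats::hclust(sample_distance, method = "average")

# Sample order along the dendrogram, e.g. for ordering heatmap rows.
sample_order <- sample_tree$labels[sample_tree$order]

# A dendrogram object, similar in spirit to what export_dendrogram adds.
dendrogram <- stats::as.dendrogram(sample_tree)
```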
Extract and export univariate analysis data of features.
Description
Extract and export univariate analysis data of features for data
in a familiarCollection.
Usage
export_univariate_analysis_data(
  object,
  dir_path = NULL,
  p_adjustment_method = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
export_univariate_analysis_data(
  object,
  dir_path = NULL,
  p_adjustment_method = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
export_univariate_analysis_data(
  object,
  dir_path = NULL,
  p_adjustment_method = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| dir_path | Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables. | 
| p_adjustment_method | (optional) Indicates type of p-value that is
shown. One of holm, hochberg, hommel, bonferroni, BH, BY, fdr, none, p_value or q_value for adjusted p-values, uncorrected
p-values and q-values. q-values may not be available. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_univariate_analysis, as_familiar_collection:
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
feature_cluster_method: The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_cluster_cut_method: The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut. silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_threshold: The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut method. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_metric: Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, spearman, kendall and pearson. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
icc_type: String indicating the type of intraclass correlation
coefficient (1, 2 or 3) that should be used to compute robustness for
features in repeated measurements during the evaluation of univariate
importance. These types correspond to the types in Shrout and Fleiss (1979).
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection. | 
Details
Data is usually collected from a familiarCollection object.
However, you can also provide one or more familiarData objects, that will
be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel
objects together with the data from which data is computed prior to export.
Paths to the previous files can also be provided.
All parameters aside from object and dir_path are only used if object
is not a familiarCollection object, or a path to one.
Univariate analysis includes the computation of p and q-values, as well as
robustness (in case of repeated measurements). p-values are derived from
Wald's test.
Value
A data.table (if dir_path is not provided), or nothing, as
all data is exported to csv files.
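
Most p_adjustment_method options (holm, hochberg, hommel, bonferroni, BH, BY, fdr and none) share their names with the methods of stats::p.adjust in base R. The sketch below shows what such an adjustment does to a small set of uncorrected p-values. The p-values and feature names are hypothetical, and the sketch does not show how familiar derives or stores these values internally.

```r
# Illustrative sketch only: hypothetical uncorrected p-values for five features.
p_values <- c(
  feature_a = 0.001, feature_b = 0.012, feature_c = 0.030,
  feature_d = 0.041, feature_e = 0.500)

# Adjust for multiple testing with two of the available methods.
adjusted <- data.frame(
  feature = names(p_values),
  p_value = unname(p_values),
  holm = unname(stats::p.adjust(p_values, method = "holm")),
  bh = unname(stats::p.adjust(p_values, method = "BH")),
  row.names = NULL
)

print(adjusted)
```

Adjusted values are never smaller than the uncorrected p-values, so a feature that looks significant before correction may no longer be so afterwards.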
Description
Computes the ROC curve from a familiarEnsemble.
Usage
extract_auc_data(
  object,
  data,
  cl = NULL,
  ensemble_method = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  is_pre_processed = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| ensemble_method | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
-  ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
-  hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
-  model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 
Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix. | 
| estimation_type | (optional) Sets the type of estimation that should be
possible. This has the following options:
-  point: Point estimates.
-  bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
-  bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 
As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data. | 
| aggregate_results | (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point.
The default value is TRUE, except when assessing metrics to assess
model performance, as the default violin plot requires underlying data.
As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type. | 
| confidence_level | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95. | 
| bootstrap_ci_method | (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
| is_pre_processed | Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Details
This function also computes credibility intervals for the ROC curve
for the ensemble model, at the level of confidence_level. In the case of
multinomial outcomes, a ROC curve is computed per class in a
one-against-all fashion.
To allow plotting of multiple ROC curves in the same plot and the use of
ensemble models, the ROC curve is evaluated at intervals of 0.01 in 1-specificity.
Value
A list with data.tables for single and ensemble model ROC curve data.
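
Evaluating a ROC curve on a fixed grid of 1-specificity values, as described in the details, can be sketched in base R as follows. The simulated outcome and predictions, the use of linear interpolation, and the handling of tied 1-specificity values by taking the highest sensitivity are assumptions for illustration. The interpolation that familiar applies internally may differ.

```r
# Illustrative sketch only: simulated data and linear interpolation are assumptions.
set.seed(7)

# Simulated binary outcome and predicted probabilities.
outcome <- rbinom(200, size = 1, prob = 0.4)
predicted <- plogis(2 * outcome - 1 + rnorm(200))

# Empirical ROC points over all thresholds, from strict to lenient.
thresholds <- sort(unique(predicted), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(predicted[outcome == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(predicted[outcome == 0] >= t))

# Evaluate sensitivity on a common grid of 1-specificity values.
grid <- seq(0, 1, by = 0.01)
roc_on_grid <- stats::approx(
  x = c(0, fpr, 1), y = c(0, tpr, 1),
  xout = grid, method = "linear", ties = max
)$y

roc_data <- data.frame(one_minus_specificity = grid, sensitivity = roc_on_grid)
```

Because every curve is sampled at the same 101 grid points, curves from different models or bootstraps can be averaged or plotted together point by point.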
Description
Computes calibration data from a familiarEnsemble object.
Calibration tests are performed based on expected (predicted) and observed
outcomes. For all outcomes, calibration-in-the-large and calibration slopes
are determined. Furthermore, for all but survival outcomes, a repeated,
randomised grouping Hosmer-Lemeshow test is performed. For survival
outcomes, the Nam-D'Agostino and Greenwood-Nam-D'Agostino tests are
performed.
Usage
extract_calibration_data(
  object,
  data,
  cl = NULL,
  ensemble_method = waiver(),
  evaluation_times = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  is_pre_processed = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| ensemble_method | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
| evaluation_times | One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes. | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
-  ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
-  hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
-  model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 
Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix. | 
| estimation_type | (optional) Sets the type of estimation that should be
possible. This has the following options:
-  point: Point estimates.
-  bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
-  bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 
As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data. | 
| aggregate_results | (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point.
The default value is TRUE, except when assessing metrics to assess
model performance, as the default violin plot requires underlying data.
As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type. | 
| confidence_level | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95. | 
| bootstrap_ci_method | (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
| is_pre_processed | Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list with data.tables containing calibration test information for
the ensemble model.
Description
Collects .
Usage
extract_calibration_info(
  object,
  detail_level = waiver(),
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
-  ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
-  hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
-  model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 
Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list of familiarDataElements with hyperparameters.
Description
Computes and extracts the confusion matrix for predicted and
observed categorical outcomes used in a familiarEnsemble object.
Usage
extract_confusion_matrix(
  object,
  data,
  cl = NULL,
  ensemble_method = waiver(),
  detail_level = waiver(),
  is_pre_processed = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| ensemble_method | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
-  ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
-  hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
-  model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 
Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix. | 
| is_pre_processed | Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A data.table containing predicted and observed outcome data together
with a co-occurrence count.
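
The shape of such a result, one row per combination of observed and predicted class with a co-occurrence count, can be sketched in base R as shown below. The class labels, the column names, and the use of a data.frame instead of a data.table are assumptions for illustration. The columns returned by familiar may be named differently.

```r
# Illustrative sketch only: hypothetical outcomes and column names.
class_levels <- c("a", "b", "c")
observed <- factor(
  c("a", "a", "b", "b", "b", "c", "c", "a"), levels = class_levels)
predicted <- factor(
  c("a", "b", "b", "b", "c", "c", "a", "a"), levels = class_levels)

# Cross-tabulate and convert to long format with one row per class pair.
confusion <- as.data.frame(
  table(observed = observed, predicted = predicted),
  responseName = "count"
)

print(confusion)
```

Setting the factor levels explicitly keeps class pairs that never occur in the table, with a count of zero.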
Description
Compute various data related to model performance and calibration
from the provided dataset and familiarEnsemble object and store it as a
familiarData object.
Usage
extract_data(
  object,
  data,
  data_element = waiver(),
  is_pre_processed = FALSE,
  cl = NULL,
  time_max = waiver(),
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  ensemble_method = waiver(),
  stratification_method = waiver(),
  evaluation_times = waiver(),
  metric = waiver(),
  feature_cluster_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_linkage_method = waiver(),
  feature_similarity_metric = waiver(),
  feature_similarity_threshold = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_similarity_metric = waiver(),
  sample_limit = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  icc_type = waiver(),
  dynamic_model_loading = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
|  | A familiarEnsembleobject, which is an ensemble of one or morefamiliarModelobjects. | 
|  | A dataObjectobject,data.tableordata.framethat
constitutes the data that are assessed. | 
|  | String indicating which data elements are to be extracted.
Default is all, but specific elements can be specified to speed up
computations if not all elements are to be computed. This is an internal
parameter that is set by, e.g. theexport_model_vimpmethod. | 
|  | Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
dataargument is adata.tableordata.frame. | 
|  | Cluster created using the parallelpackage. This cluster is then
used to speed up computation through parallellisation. | 
|  | Time point which is used as the benchmark for e.g. cumulative
risks generated by random forest, or the cut-off value for Uno's concordance
index. If not provided explicitly, this parameter is read from settings used
at creation of the underlying familiarModelobjects. Only used forsurvivaloutcomes. | 
|  | Method for aggregating variable importances for the
purpose of evaluation. Variable importances are determined during feature
selection steps and after training the model. Both types are evaluated, but
feature selection variable importance is only evaluated at run-time.
 See the documentation for the vimp_aggregation_methodargument insummon_familiarfor information concerning the different available
methods. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModelobjects. | 
|  | The threshold used to  define the subset of highly
important features during evaluation.
 See the documentation for the vimp_aggregation_rank_thresholdargument insummon_familiarfor more information. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModelobjects. | 
|  | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
|  | (optional) Method for determining the
stratification threshold for creating survival groups. The actual,
model-dependent, threshold value is obtained from the development data, and
can afterwards be used to perform stratification on validation data.
 The following stratification methods are available:
 
 median (default): The median predicted value in the development cohort
is used to stratify the samples into two risk groups. For predicted outcome
values that form a continuous spectrum, the two risk groups in the
development cohort will be roughly equal in size.
 mean: The mean predicted value in the development cohort is used to
stratify the samples into two risk groups.
 mean_trim: As mean, but based on the set of predicted values
where the 5% lowest and 5% highest values are discarded. This reduces the
effect of outliers.
 mean_winsor: As mean, but based on the set of predicted values where
the 5% lowest and 5% highest values are winsorised. This reduces the effect
of outliers.
 fixed: Samples are stratified based on the sample quantiles of the
predicted values. These quantiles are defined using the stratification_threshold parameter.
 optimised: Use maximally selected rank statistics to determine the
optimal threshold (Lausen and Schumacher, 1992; Hothorn et al., 2003) to
stratify samples into two optimally separated risk groups.
 One or more stratification methods can be selected simultaneously.
 This parameter is only relevant for survival outcomes. | 
|  | One or more time points that are used in the analysis of
survival problems when data have to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes. | 
|  | One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects. | 
|  | The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
|  | The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut. silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
|  | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
|  | Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, spearman, kendall and pearson. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
|  | The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut method. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
|  | The method used to perform clustering based on
distance between samples. These are the same methods as for the
cluster_method configuration parameter: hclust, agnes, diana and pam. none cannot be used when extracting data for feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
|  | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
|  | Metric to determine pairwise similarity
between samples. Similarity is computed in the same manner as for
clustering, but sample_similarity_metric has different options that are
better suited to computing distance between samples instead of between
features: gower, euclidean. The underlying feature data is scaled to the [0, 1] range (for
numerical features) using the feature values across the samples. The
normalisation parameters required can optionally be computed from feature
data with the outer 5% (on both sides) of feature values trimmed or
winsorised. To do so, append _trim (trimming) or _winsor (winsorising) to
the metric name. This reduces the effect of outliers somewhat. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
|  | (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can be set for the following data elements:
sample_similarity and ice_data. | 
|  | (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix. | 
|  | (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data. | 
|  | (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point. The default value is TRUE, except when assessing metrics of
model performance, as the default violin plot requires underlying data. As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type. | 
|  | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / (1 - confidence_level) to determine the number of required
bootstraps. The default value is 0.95. | 
|  | (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
|  | String indicating the type of intraclass correlation
coefficient (1, 2 or 3) that should be used to compute robustness for
features in repeated measurements during the evaluation of univariate
importance. These types correspond to the types in Shrout and Fleiss (1979).
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
|  | (optional) Enables dynamic loading of models
during the evaluation process, if TRUE. Defaults to FALSE. Dynamic
loading of models may reduce the overall memory footprint, at the cost of
increased disk or network IO. Models can only be dynamically loaded if they
are found at an accessible disk or network location. Setting this parameter
to TRUE may help if parallel processing causes out-of-memory issues during
evaluation. | 
|  | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
|  | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
|  | Unused arguments. | 
Value
A familiarData object.
References
-  Shrout, P. E. & Fleiss, J. L. Intraclass correlations: uses in
assessing rater reliability. Psychol. Bull. 86, 420–428 (1979).
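
As a rough illustration of the threshold-based stratification methods described for the stratification method argument above, the following base R sketch derives a threshold from development-cohort predictions and reuses it on validation predictions. The function name compute_stratification_threshold and the exact trimming and winsorising rules (5% and 95% sample quantiles) are assumptions made for this example; the internal implementation in familiar may differ.

```r
# Hypothetical helper, not part of familiar: derives a stratification
# threshold from predicted values in the development cohort.
compute_stratification_threshold <- function(
    predicted,
    method = c("median", "mean", "mean_trim", "mean_winsor")) {
  method <- match.arg(method)

  # Assumed cut-offs for trimming and winsorising: 5% and 95% quantiles.
  lower <- stats::quantile(predicted, probs = 0.05, names = FALSE)
  upper <- stats::quantile(predicted, probs = 0.95, names = FALSE)

  switch(
    method,
    median = stats::median(predicted),
    mean = mean(predicted),
    # Discard values outside the central 90% before averaging.
    mean_trim = mean(predicted[predicted >= lower & predicted <= upper]),
    # Clamp values to the central 90% before averaging.
    mean_winsor = mean(pmin(pmax(predicted, lower), upper))
  )
}

# Predicted risk scores in the development cohort, with one outlier.
development <- c(1:19, 1000)
threshold <- compute_stratification_threshold(development, "mean_winsor")

# The threshold is fixed on development data and reused on validation data.
validation <- c(3, 8, 15, 40)
risk_group <- ifelse(validation > threshold, "high", "low")
```

With a single large outlier, the plain mean is pulled far above most predictions, whereas the trimmed and winsorised variants stay close to the median.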
 
Description
Computes decision curve analysis data from a familiarEnsemble
object. Calibration tests are performed based on expected (predicted) and
observed outcomes. For all outcomes, calibration-in-the-large and
calibration slopes are determined. Furthermore, for all but survival
outcomes, a repeated, randomised grouping Hosmer-Lemeshow test is performed.
For survival outcomes, the Nam-D'Agostino and Greenwood-Nam-D'Agostino tests
are performed.
Usage
extract_decision_curve_data(
  object,
  data,
  cl = NULL,
  ensemble_method = waiver(),
  evaluation_times = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  is_pre_processed = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| ensemble_method | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
| evaluation_times | One or more time points that are used in the analysis of
survival problems when data have to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes. | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix. | 
| estimation_type | (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data. | 
| aggregate_results | (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point. The default value is TRUE, except when assessing metrics of
model performance, as the default violin plot requires underlying data. As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type. | 
| confidence_level | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / (1 - confidence_level) to determine the number of required
bootstraps. The default value is 0.95. | 
| bootstrap_ci_method | (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
| is_pre_processed | Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list with data.tables containing calibration test information for
the ensemble model.
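
Several arguments above (detail_level, estimation_type, aggregate_results) accept either a single value or a named list with one value per data element. The sketch below shows one way such a setting could be resolved for a given data element. The helper resolve_element_setting and its fallback behaviour are assumptions for illustration only and do not reproduce how familiar handles these lists internally.

```r
# Hypothetical helper, not part of familiar: picks the value that applies to
# one data element from either a single value or a named list of values.
resolve_element_setting <- function(setting, element, default) {
  if (is.list(setting)) {
    # Named list: use the element-specific entry, or fall back to the default.
    value <- setting[[element]]
    if (is.null(value)) value <- default
    return(value)
  }
  # Single value: applies to every data element.
  setting
}

# Per-element settings in the same form as the examples in the text.
estimation_type <- list("auc_data" = "bci", "model_performance" = "point")

resolve_element_setting(estimation_type, "model_performance", default = "bci")
resolve_element_setting(estimation_type, "ice_data", default = "bci")
resolve_element_setting("point", "auc_data", default = "bci")
```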
Description
This function provides a unified access point to extraction
functions. Some of these functions require bootstrapping and result
aggregation, which are handled here.
Usage
## S4 method for signature 'familiarEnsemble,familiarDataElement'
extract_dispatcher(
  cl = NULL,
  FUN,
  object,
  proto_data_element,
  aggregate_results,
  has_internal_bootstrap,
  ...,
  message_indent = 0L,
  verbose = TRUE
)
Arguments
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| FUN | Extraction function or method to which data and parameters are
dispatched. | 
| object | A familiarEnsemble object. | 
| proto_data_element | A familiarDataElement object, or an object that
inherits from it. | 
| aggregate_results | A logical flag indicating whether results should be
aggregated. | 
| has_internal_bootstrap | A logical flag that indicates whether FUN has
internal bootstrapping capabilities. | 
| ... | Unused arguments. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
Details
This function first determines how many data points need to be
evaluated to complete the desired estimation, i.e. 1 for point estimates, 20
for bias-corrected estimates, and 20 / (1 - confidence level) for bootstrap
confidence intervals.
Subsequently, we determine the number of models. This number is used to set
external or internal clusters, the number of bootstraps, and to evaluate
whether the estimation can be done in case FUN does not support
bootstrapping.
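
The counting rule in the first paragraph of these Details can be sketched in base R as follows. The helper n_required_estimates is hypothetical, and rounding up to a whole number of estimates is an assumption; only the three quantities (1, 20, and 20 / (1 - confidence level)) come from the text.

```r
# Hypothetical helper, not part of familiar: number of point estimates
# needed for each estimation type, following the rule stated above.
n_required_estimates <- function(
    estimation_type = c("point", "bias_correction",
                        "bootstrap_confidence_interval"),
    confidence_level = 0.95) {
  estimation_type <- match.arg(estimation_type)

  switch(
    estimation_type,
    point = 1,
    bias_correction = 20,
    # signif() removes floating point noise before rounding up (assumption).
    bootstrap_confidence_interval =
      ceiling(signif(20 / (1 - confidence_level), 10))
  )
}

n_required_estimates("point")
n_required_estimates("bias_correction")
n_required_estimates("bootstrap_confidence_interval", confidence_level = 0.95)
```

Under this rule, the default confidence level of 0.95 requires 400 estimates, which is why bootstrap confidence intervals are considerably more expensive to compute than bias-corrected estimates.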
Value
A list of familiarDataElement objects.
Description
Parse experimental design
Usage
extract_experimental_setup(
  experimental_design,
  file_dir,
  message_indent = 0L,
  verbose = TRUE
)
Arguments
| experimental_design | (required) Defines what the experiment looks
like, e.g. cv(bt(fs,20)+mb,3,2)+ev for 2 times repeated 3-fold
cross-validation with nested feature selection on 20 bootstraps and
model-building, and external validation. The basic workflow components are:
 fs: (required) feature selection step.
 mb: (required) model building step.
 ev: (optional) external validation. Note that internal validation due
to subsampling will always be conducted if the subsampling methods create
any validation data sets.
 The different components are linked using +.
 Different subsampling methods can be used in conjunction with the basic
workflow components:
 
 bs(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. In contrast to bt, feature pre-processing parameters and
hyperparameter optimisation are conducted on individual bootstraps.
 bt(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. Unlike bs and other subsampling methods, no separate
pre-processing parameters or optimised hyperparameters will be determined
for each bootstrap.
 cv(x,n,p): (stratified) n-fold cross-validation, repeated p times.
Pre-processing parameters are determined for each iteration.
 lv(x): leave-one-out-cross-validation. Pre-processing parameters are
determined for each iteration.
 ip(x): imbalance partitioning for addressing class imbalances on the
data set. Pre-processing parameters are determined for each partition. The
number of partitions generated depends on the imbalance correction method
(see the imbalance_correction_method parameter). Imbalance partitioning
does not generate validation sets.
 As shown in the example above, sampling algorithms can be nested.
 The simplest valid experimental design is fs+mb, which corresponds to a
TRIPOD type 1a analysis. Type 1b analyses are only possible using
bootstraps, e.g. bt(fs+mb,100). Type 2a analyses can be conducted using
cross-validation, e.g. cv(bt(fs,100)+mb,10,1). Depending on the origin of
the external validation data, designs such as fs+mb+ev or cv(bt(fs,100)+mb,10,1)+ev constitute type 2b or type 3 analyses. Type 4
analyses can be done by obtaining one or more familiarModel objects from
others and applying them to your own data set.
 Alternatively, the experimental_design parameter may be used to provide a
path to a file containing iterations, which is named ####_iterations.RDS by convention. This path can be relative to the directory of the current
experiment (experiment_dir), or an absolute path. The absolute path may
thus also point to a file from a different experiment. | 
| message_indent | Spacing inserted before messages. | 
| verbose | Sets verbosity. | 
Details
This function converts the experimental_design string
Value
data.table with subsampler information at different levels of the
experimental design.
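
Because the experimental design is passed as a plain string, it can be assembled programmatically. The sketch below builds the nested cross-validation designs used as examples in the argument description. The helper nested_cv_design is hypothetical and only covers this one pattern; it does not validate the design the way familiar does when parsing it.

```r
# Hypothetical helper, not part of familiar: builds a design string for
# repeated cross-validation with feature selection nested on bootstraps.
nested_cv_design <- function(n_bootstraps, n_folds, n_repeats,
                             external_validation = FALSE) {
  design <- sprintf(
    "cv(bt(fs,%d)+mb,%d,%d)",
    as.integer(n_bootstraps), as.integer(n_folds), as.integer(n_repeats)
  )

  # Components are linked using "+"; external validation is optional.
  if (external_validation) design <- paste0(design, "+ev")

  design
}

# 2 times repeated 3-fold cross-validation, feature selection on 20
# bootstraps, followed by external validation.
nested_cv_design(20, 3, 2, external_validation = TRUE)

# 10-fold cross-validation with feature selection on 100 bootstraps.
nested_cv_design(100, 10, 1)
```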
Description
Computes and extracts feature expressions for features
used in a familiarEnsemble object.
Usage
extract_feature_expression(
  object,
  data,
  feature_similarity,
  sample_similarity,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_similarity_metric = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_similarity_metric = waiver(),
  evaluation_times = waiver(),
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| sample_similarity | Table containing pairwise distance between
samples. This is used to determine cluster information, and indicate which
samples are similar. The table is created by the
extract_sample_similarity method. | 
| feature_cluster_method | The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_similarity_metric | Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, spearman, kendall and pearson. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| sample_cluster_method | The method used to perform clustering based on
distance between samples. These are the same methods as for the
cluster_method configuration parameter: hclust, agnes, diana and pam. none cannot be used when extracting data for feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| sample_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| sample_similarity_metric | Metric to determine pairwise similarity
between samples. Similarity is computed in the same manner as for
clustering, but sample_similarity_metric has different options that are
better suited to computing distance between samples instead of between
features: gower, euclidean. The underlying feature data is scaled to the [0, 1] range (for
numerical features) using the feature values across the samples. The
normalisation parameters required can optionally be computed from feature
data with the outer 5% (on both sides) of feature values trimmed or
winsorised. To do so, append _trim (trimming) or _winsor (winsorising) to
the metric name. This reduces the effect of outliers somewhat. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| evaluation_times | One or more time points that are used in the analysis of
survival problems when data have to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list with a data.table containing feature expressions.
Description
Computes and extracts the feature distance table for features
used in a familiarEnsemble object. This table can be used to cluster
features, and is exported directly by export_feature_similarity.
Usage
extract_feature_similarity(
  object,
  data,
  cl = NULL,
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  is_pre_processed = FALSE,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  feature_similarity_metric = waiver(),
  verbose = FALSE,
  message_indent = 0L,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| estimation_type | (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data. | 
| aggregate_results | (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point. The default value is TRUE, except when assessing metrics of
model performance, as the default violin plot requires underlying data. As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type. | 
| confidence_level | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / (1 - confidence_level) to determine the number of required
bootstraps. The default value is 0.95. | 
| bootstrap_ci_method | (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
| is_pre_processed | Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| feature_cluster_method | The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_cluster_cut_method | The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut. silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_similarity_threshold | The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut method. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
|  | Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metrictherefore has the same options
ascluster_similarity_metric:mcfadden_r2,cox_snell_r2,nagelkerke_r2,spearman,kendallandpearson. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModelobjects. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A data.table containing pairwise distances between features. Only the
upper triangle of the complete matrix is stored (i.e. the sparse
unitriangular representation). Diagonal elements are always 0.0, and the
lower triangle mirrors the upper triangle.
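Because only the upper triangle is returned, a full symmetric matrix has to be rebuilt if one is needed. The following is a minimal base R sketch of that step. The column names feature_1, feature_2 and value are assumptions made for illustration and may not match the columns of the returned data.table; a plain data.frame is used so that the example is self-contained.

```r
# Hypothetical long-format table of pairwise distances (upper triangle only).
# The column names are assumptions made for this sketch.
pairs <- data.frame(
  feature_1 = c("a", "a", "b"),
  feature_2 = c("b", "c", "c"),
  value = c(0.2, 0.7, 0.5),
  stringsAsFactors = FALSE
)

features <- sort(unique(c(pairs$feature_1, pairs$feature_2)))

# Start from a zero matrix, so that the diagonal is 0.0.
distance_matrix <- matrix(
  0.0,
  nrow = length(features),
  ncol = length(features),
  dimnames = list(features, features)
)

# Fill the upper triangle, then mirror it into the lower triangle.
index <- cbind(pairs$feature_1, pairs$feature_2)
distance_matrix[index] <- pairs$value
distance_matrix[index[, 2:1, drop = FALSE]] <- pairs$value
```

The resulting matrix can be passed to as.dist() if a dist object is required.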
Description
Aggregate variable importance obtained during feature selection.
This information can only be obtained as part of the main summon_familiar
process.
Usage
extract_fs_vimp(
  object,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| aggregation_method | Method for aggregating variable importances for the
purpose of evaluation. Variable importances are determined during feature
selection steps and after training the model. Both types are evaluated, but
feature selection variable importance is only evaluated at run-time.
 See the documentation for the vimp_aggregation_method argument in summon_familiar for information concerning the different available
methods. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| rank_threshold | The threshold used to define the subset of highly
important features during evaluation.
 See the documentation for the vimp_aggregation_rank_threshold argument in summon_familiar for more information. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list containing feature selection variable importance information.
Description
Collects hyperparameters from models in a familiarEnsemble.
Usage
extract_hyperparameters(object, message_indent = 0L, verbose = FALSE, ...)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list of familiarDataElements with hyperparameters.
Description
Computes data for individual conditional expectation plots and
partial dependence plots for the model(s) in a familiarEnsemble object.
Usage
extract_ice(
  object,
  data,
  cl = NULL,
  features = NULL,
  feature_x_range = NULL,
  feature_y_range = NULL,
  n_sample_points = 50L,
  ensemble_method = waiver(),
  evaluation_times = waiver(),
  sample_limit = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  is_pre_processed = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| features | Names of the feature or features (2) assessed simultaneously.
By default NULL, which means that all features are assessed one-by-one. | 
| feature_x_range | When one or two features are defined using features, feature_x_range can be used to set the range of values for the first
feature. For numeric features, a vector of two values is assumed to indicate
a range from which n_sample_points are uniformly sampled. A vector of more
than two values is interpreted as is, i.e. these represent the values to be
sampled. For categorical features, values should represent a (sub)set of
available levels. | 
| feature_y_range | As feature_x_range, but for the second feature in
case two features are defined. | 
| n_sample_points | Number of points used to sample continuous features. | 
| ensemble_method | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
| evaluation_times | One or more time points that are used in the analysis of
survival problems when data have to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes. | 
| sample_limit | (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can be set for the following data elements:
sample_similarity and ice_data. | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix. | 
| estimation_type | (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data. | 
| aggregate_results | (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point.
 The default value is TRUE except when assessing metrics to assess
model performance, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type. | 
| confidence_level | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95. | 
| bootstrap_ci_method | (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
| is_pre_processed | Flag that indicates whether the data were already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A data.table containing individual conditional expectation plot data.
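The idea behind individual conditional expectation (ICE) and partial dependence data can be illustrated in base R without familiar. The sketch below is an illustration of the general technique, not the package's implementation: a linear model stands in for a trained model, and the two-value feature_x_range is assumed to expand into an evenly spaced grid of n_sample_points values.

```r
set.seed(1)

# Simulated data and a simple linear model stand in for a trained model.
data <- data.frame(x1 = runif(50), x2 = runif(50))
data$y <- 2 * data$x1 + data$x2 + rnorm(50, sd = 0.1)
model <- lm(y ~ x1 + x2, data = data)

# Assumption: a two-value range is expanded into evenly spaced sample points.
feature_x_range <- c(0, 1)
n_sample_points <- 5L
grid <- seq(feature_x_range[1], feature_x_range[2], length.out = n_sample_points)

# ICE: for every sample, replace x1 by each grid value and predict.
# The result has one row per sample and one column per grid value.
ice <- sapply(grid, function(value) {
  modified <- data
  modified$x1 <- value
  predict(model, newdata = modified)
})

# Partial dependence: average the ICE curves over all samples.
partial_dependence <- colMeans(ice)
```

Each row of ice is one ICE curve; partial_dependence is the pointwise mean of those curves.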
Description
Aggregate variable importance from models in a
familiarEnsemble.
Usage
extract_model_vimp(
  object,
  data,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| aggregation_method | Method for aggregating variable importances for the
purpose of evaluation. Variable importances are determined during feature
selection steps and after training the model. Both types are evaluated, but
feature selection variable importance is only evaluated at run-time.
 See the documentation for the vimp_aggregation_method argument in summon_familiar for information concerning the different available
methods. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| rank_threshold | The threshold used to define the subset of highly
important features during evaluation.
 See the documentation for the vimp_aggregation_rank_threshold argument in summon_familiar for more information. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list containing variable importance information.
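As a rough illustration of what aggregating variable importance across models can look like, the sketch below averages per-model ranks and then selects features within a rank threshold. This is one simple scheme chosen for illustration; the methods actually offered are those documented for vimp_aggregation_method in summon_familiar, and the data, column names and threshold here are hypothetical.

```r
# Hypothetical ranks (1 = most important) of three features in three models.
ranks <- data.frame(
  model = rep(c("m1", "m2", "m3"), each = 3),
  feature = rep(c("age", "stage", "grade"), times = 3),
  rank = c(1, 2, 3, 2, 1, 3, 1, 3, 2),
  stringsAsFactors = FALSE
)

# Aggregate by taking the mean rank of each feature across models.
aggregated <- aggregate(rank ~ feature, data = ranks, FUN = mean)
aggregated <- aggregated[order(aggregated$rank), ]

# Keep features whose aggregated rank falls within a hypothetical threshold.
rank_threshold <- 2
important_features <- aggregated$feature[aggregated$rank <= rank_threshold]
```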
Description
Computes and collects discriminative performance metrics from a
familiarEnsemble.
Usage
extract_performance(
  object,
  data,
  cl = NULL,
  metric = waiver(),
  ensemble_method = waiver(),
  evaluation_times = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  is_pre_processed = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| metric | One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects. | 
| ensemble_method | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
| evaluation_times | One or more time points that are used in the analysis of
survival problems when data have to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes. | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix. | 
| estimation_type | (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data. | 
| aggregate_results | (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point.
 The default value is TRUE except when assessing metrics to assess
model performance, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type. | 
| confidence_level | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95. | 
| bootstrap_ci_method | (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
| is_pre_processed | Flag that indicates whether the data were already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Details
This method computes credibility intervals for the ensemble model, at
the level of confidence_level. This is a general method: metrics with
known, theoretically derived confidence intervals nevertheless have a
credibility interval computed.
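A generic way to obtain an interval for an arbitrary metric, without relying on theoretical results, is a percentile bootstrap. The base R sketch below shows that general idea under simulated data; it is not necessarily the computation familiar performs, and the metric, the number of bootstraps and all variable names are choices made for illustration.

```r
set.seed(42)

# Simulated observed and predicted values stand in for model output.
observed <- rnorm(100)
predicted <- observed + rnorm(100, sd = 0.5)

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))

# Recompute the metric on bootstrap resamples of the samples.
n_bootstraps <- 400L  # arbitrary choice for this sketch
bootstrap_values <- replicate(n_bootstraps, {
  index <- sample.int(length(observed), replace = TRUE)
  rmse(observed[index], predicted[index])
})

# Percentile interval at the requested confidence level.
confidence_level <- 0.95
alpha <- 1 - confidence_level
interval <- quantile(bootstrap_values, probs = c(alpha / 2, 1 - alpha / 2))
```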
Value
A list with data.tables for single and ensemble model assessments.
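The detail_level, estimation_type and aggregate_results arguments accept either a single value or a named list keyed by data element. The sketch below shows how such a list is written in R and how a per-element lookup with a fallback could work. The helper resolve_setting is hypothetical and only illustrates the concept; it is not a function of familiar.

```r
# Hypothetical helper: look up the value for one data element, falling back
# to a default when the element is not named in the list.
resolve_setting <- function(setting, data_element, default) {
  if (!is.list(setting)) return(setting)
  if (data_element %in% names(setting)) return(setting[[data_element]])
  default
}

# Named lists as they would be passed to the arguments.
detail_level <- list("auc_data" = "ensemble", "model_performance" = "hybrid")
aggregate_results <- list("auc_data" = TRUE, "model_performance" = FALSE)

# Element named in the list: its own value is used.
auc_detail <- resolve_setting(detail_level, "auc_data", default = "hybrid")

# Element not named in the list: the default is used.
ice_detail <- resolve_setting(detail_level, "ice_data", default = "hybrid")

# A single value applies to every data element.
all_detail <- resolve_setting("model", "auc_data", default = "hybrid")
```

Note that every element of the list must be named and separated by a single comma; an empty argument between two commas is an error in list().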
Description
Computes and collects permutation variable importance from a
familiarEnsemble.
Usage
extract_permutation_vimp(
  object,
  data,
  cl = NULL,
  ensemble_method = waiver(),
  feature_similarity,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_metric = waiver(),
  feature_similarity_threshold = waiver(),
  metric = waiver(),
  evaluation_times = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  is_pre_processed = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| ensemble_method | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
| feature_cluster_method | The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_cluster_cut_method | The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut. silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_similarity_metric | Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, spearman, kendall and pearson. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_similarity_threshold | The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut method. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| metric | One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects. | 
| evaluation_times | One or more time points that are used in the analysis of
survival problems when data have to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes. | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix. | 
| estimation_type | (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data. | 
| aggregate_results | (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point.
 The default value is TRUE except when assessing metrics to assess
model performance, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type. | 
| confidence_level | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95. | 
| bootstrap_ci_method | (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
| is_pre_processed | Flag that indicates whether the data were already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Details
This function also computes credibility intervals for the ensemble
model, at the level of confidence_level.
Value
A list with data.tables for single and ensemble model assessments.
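Permutation variable importance measures how much a performance metric deteriorates when the values of a feature are shuffled. The base R sketch below illustrates the general technique on simulated data with a linear model. It permutes features one at a time and does not reproduce the similarity-based clustering that the feature_* arguments configure; the model, metric and variable names are assumptions made for illustration.

```r
set.seed(7)

# Simulated data: x1 drives the outcome, x2 is pure noise.
data <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
data$y <- 3 * data$x1 + rnorm(200, sd = 0.5)
model <- lm(y ~ x1 + x2, data = data)

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
baseline <- rmse(data$y, predict(model, newdata = data))

# Permute one feature at a time and record the increase in error.
permutation_importance <- sapply(c("x1", "x2"), function(feature) {
  permuted <- data
  permuted[[feature]] <- sample(permuted[[feature]])
  rmse(permuted$y, predict(model, newdata = permuted)) - baseline
})
```

A large increase in error for x1 and a value near zero for x2 reflect that only x1 carries information about the outcome.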
Description
Collects predicted values from models in a familiarEnsemble.
Usage
extract_predictions(
  object,
  data,
  cl = NULL,
  is_pre_processed = FALSE,
  ensemble_method = waiver(),
  evaluation_times = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| is_pre_processed | Flag that indicates whether the data were already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| ensemble_method | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
| evaluation_times | One or more time points that are used in the analysis of
survival problems when data have to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes. | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix. | 
| estimation_type | (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g.
list("auc_data"="bci", "model_performance"="point"). This parameter can be
set for the following data elements: auc_data, decision_curve_analyis,
model_performance, permutation_vimp, ice_data, and prediction_data. | 
| aggregate_results | (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is
bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if
estimation_type is point.
 The default value is TRUE, except when computing model performance
metrics, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default
aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g.
list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type. | 
| confidence_level | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to determine
the confidence intervals, familiar uses the rule of thumb
n = 20 / ci.level to determine the number of required bootstraps. The
default value is 0.95. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list with single-model and ensemble predictions.
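The per-element settings described for detail_level, estimation_type and aggregate_results are ordinary named lists. The following is a minimal sketch in base R; the helper unknown_elements is hypothetical and not part of familiar, and it only checks list names against the data elements listed in the detail_level description.

```r
# Data elements for which detail_level can be set, copied verbatim from the
# argument description.
detail_level_elements <- c(
  "auc_data", "decision_curve_analyis", "model_performance",
  "permutation_vimp", "ice_data", "prediction_data", "confusion_matrix"
)

# Hypothetical helper (not part of familiar): returns the names in `x`
# that are not known data elements.
unknown_elements <- function(x, allowed) {
  setdiff(names(x), allowed)
}

# Per-element settings, using the examples from the argument descriptions.
detail_level <- list("auc_data" = "ensemble", "model_performance" = "hybrid")
estimation_type <- list("auc_data" = "bci", "model_performance" = "point")
aggregate_results <- list("auc_data" = TRUE, "model_performance" = FALSE)

# Names that do not match a listed data element (none are expected here).
unknown <- unknown_elements(detail_level, detail_level_elements)
```

Lists built this way would then be supplied as the corresponding arguments; data elements that are not named in a list presumably keep their default setting.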
Description
Computes and extracts stratification data from a
familiarEnsemble object. This includes the data required to draw
Kaplan-Meier plots, as well as logrank and hazard-ratio tests between the
respective risk groups.
Usage
extract_risk_stratification_data(
  object,
  data,
  cl = NULL,
  is_pre_processed = FALSE,
  ensemble_method = waiver(),
  detail_level = waiver(),
  confidence_level = waiver(),
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| is_pre_processed | Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| ensemble_method | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble, where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix. | 
| confidence_level | (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to determine
the confidence intervals, familiar uses the rule of thumb
n = 20 / ci.level to determine the number of required bootstraps. The
default value is 0.95. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list with data.tables containing information concerning risk group
stratification.
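To illustrate the kind of information involved, the sketch below computes Kaplan-Meier estimates, a logrank test and a hazard ratio between two risk groups using the survival package directly. It does not call familiar. The lung dataset and the median split on age are assumptions made purely for illustration, with age standing in for a model-predicted risk score.

```r
library(survival)

# Illustration data: the lung cancer dataset shipped with survival.
# Assumption: age stands in for a model-predicted risk score.
d <- survival::lung
d$risk_group <- factor(
  ifelse(d$age > median(d$age), "high", "low"),
  levels = c("low", "high")
)

# Kaplan-Meier estimates per risk group (the data needed to draw the curves).
km_fit <- survfit(Surv(time, status) ~ risk_group, data = d)

# Logrank test between the two risk groups.
logrank <- survdiff(Surv(time, status) ~ risk_group, data = d)
logrank_p <- pchisq(logrank$chisq, df = 1, lower.tail = FALSE)

# Hazard ratio of the high-risk group relative to the low-risk group.
cox_fit <- coxph(Surv(time, status) ~ risk_group, data = d)
hazard_ratio <- exp(coef(cox_fit))
```

The cut-off used here (the median) is only one possible choice; how familiar determines its risk-stratification cut-off values is not shown in this sketch.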
Description
Collects risk stratification information.
Usage
extract_risk_stratification_info(
  object,
  detail_level = waiver(),
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| detail_level | (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble, where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list of familiarDataElements with risk stratification information.
Description
Computes and extracts the sample distance table for samples
analysed using a familiarEnsemble object to form a familiarData object.
This table can be used to cluster samples, and is exported directly by
extract_feature_expression.
Usage
extract_sample_similarity(
  object,
  data,
  cl = NULL,
  is_pre_processed = FALSE,
  sample_limit = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_similarity_metric = waiver(),
  verbose = FALSE,
  message_indent = 0L,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| is_pre_processed | Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame. | 
| sample_limit | (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can
be set for the following data elements: sample_similarity and ice_data. | 
| sample_cluster_method | The method used to perform clustering based on
distance between samples. These are the same methods as for the
cluster_method configuration parameter: hclust, agnes, diana and pam.
 none cannot be used when extracting data for feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| sample_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the
cluster_linkage_method configuration parameter: average, single, complete,
weighted, and ward.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| sample_similarity_metric | Metric to determine pairwise similarity
between samples. Similarity is computed in the same manner as for
clustering, but sample_similarity_metric has different options that are
better suited to computing distance between samples instead of between
features: gower, euclidean.
 The underlying feature data is scaled to the [0, 1] range (for
numerical features) using the feature values across the samples. The
normalisation parameters required can optionally be computed from feature
data with the outer 5% (on both sides) of feature values trimmed or
winsorised. To do so, append _trim (trimming) or _winsor (winsorising) to
the metric name. This reduces the effect of outliers somewhat.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A data.table containing pairwise distance between samples. These data
cover only the upper triangle of the complete matrix (i.e. the sparse
unitriangular representation). Diagonal elements are always 0.0 and the
lower triangle mirrors the upper triangle.
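The idea of a pairwise sample distance table can be sketched with base R alone. The example below scales toy features to the [0, 1] range, computes Euclidean distances, and keeps only the upper triangle in long format. The toy data, the column names of the resulting table and the use of a data.frame rather than a data.table are assumptions for illustration; this is not familiar's implementation.

```r
# Toy feature data: three numeric features for five samples.
x <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(10, 20, 15, 30, 25),
  c = c(0.1, 0.4, 0.2, 0.9, 0.5),
  row.names = paste0("sample_", 1:5)
)

# Scale every feature to the [0, 1] range using its values across samples.
scale_01 <- function(v) (v - min(v)) / (max(v) - min(v))
x_scaled <- as.data.frame(lapply(x, scale_01), row.names = rownames(x))

# Pairwise Euclidean distance between samples.
d <- dist(x_scaled, method = "euclidean")
d_mat <- as.matrix(d)

# Long table holding only the upper triangle of the distance matrix.
idx <- which(upper.tri(d_mat), arr.ind = TRUE)
distance_table <- data.frame(
  sample_1 = rownames(d_mat)[idx[, "row"]],
  sample_2 = colnames(d_mat)[idx[, "col"]],
  value = d_mat[idx]
)

# Agglomerative clustering of samples with average linkage.
tree <- hclust(d, method = "average")
```

For five samples the long table has 5 * 4 / 2 = 10 rows, since the zero diagonal and the mirrored lower triangle are omitted.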
Description
Computes and extracts univariate analysis for the features used
in a familiarEnsemble object. This assessment includes the computation of
p and q-values, as well as robustness (in case of repeated measurements).
Usage
extract_univariate_analysis(
  object,
  data,
  cl = NULL,
  icc_type = waiver(),
  feature_similarity = NULL,
  feature_cluster_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_linkage_method = waiver(),
  feature_similarity_threshold = waiver(),
  feature_similarity_metric = waiver(),
  message_indent = 0L,
  verbose = FALSE,
  ...
)
Arguments
| object | A familiarEnsemble object, which is an ensemble of one or more familiarModel objects. | 
| data | A dataObject object, data.table or data.frame that
constitutes the data that are assessed. | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. | 
| icc_type | String indicating the type of intraclass correlation
coefficient (1, 2 or 3) that should be used to compute robustness for
features in repeated measurements during the evaluation of univariate
importance. These types correspond to the types in Shrout and Fleiss (1979).
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_cluster_method | The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none,
hclust, agnes, diana and pam.
 none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_cluster_cut_method | The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and
dynamic_cut.
 silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and
diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the
cluster_linkage_method configuration parameter: average, single, complete,
weighted, and ward.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_similarity_threshold | The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut
method.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_similarity_metric | Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2,
spearman, kendall and pearson.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| message_indent | Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
| verbose | Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements. | 
| ... | Unused arguments. | 
Value
A list with a data.table containing information concerning the
univariate analysis of important features.
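As a rough illustration of what a univariate analysis with p-values and multiplicity-adjusted values looks like, the sketch below fits one single-feature linear model per feature on the mtcars data from base R. The choice of test, the toy data and the Benjamini-Hochberg adjustment (used here as a stand-in for q-values) are assumptions; the tests and q-value computation that familiar actually uses may differ.

```r
# Toy data: mtcars ships with base R; mpg is treated as the outcome.
features <- c("cyl", "disp", "hp", "drat", "wt", "qsec")

# Univariate p-value per feature from a single-feature linear model.
p_values <- vapply(features, function(feature) {
  fit <- lm(reformulate(feature, response = "mpg"), data = mtcars)
  summary(fit)$coefficients[feature, "Pr(>|t|)"]
}, numeric(1))

# Benjamini-Hochberg adjustment as a stand-in for q-values.
adjusted <- p.adjust(p_values, method = "BH")

# One row per feature, analogous to a univariate analysis table.
univariate_table <- data.frame(
  feature = features,
  p_value = p_values,
  q_value_bh = adjusted,
  row.names = NULL
)
```

Adjusted values are never smaller than the raw p-values, which is why features that look significant in isolation can lose significance after adjustment.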
Collection of familiar data.
Description
A familiarCollection object aggregates data from one or more familiarData
objects.
Slots
- name
- Name of the collection. 
- data_sets
- Name of the individual underlying datasets. 
- outcome_type
- Outcome type for which the collection was created. 
- outcome_info
- Outcome information object, which contains information
concerning the outcome, such as class levels. 
- fs_vimp
- Variable importance data collected by feature selection
methods. 
- model_vimp
- Variable importance data collected from model-specific
algorithms implemented by models created by familiar. 
- permutation_vimp
- Data collected for permutation variable importance. 
- hyperparameters
- Hyperparameters collected from created models. 
- hyperparameter_data
- Additional data concerning hyperparameters. This is
not used yet. 
- required_features
- The set of features required for complete
reproduction, i.e. with imputation. 
- model_features
- The set of features that are required for using the
model, but without imputation. 
- learner
- Learning algorithm(s) used for data in the collection. 
- fs_method
- Feature selection method(s) used for data in the collection. 
- prediction_data
- Model predictions for the data in the collection. 
- confusion_matrix
- Confusion matrix information for the data in the
collection. 
- decision_curve_data
- Decision curve analysis data for the data in the
collection. 
- calibration_info
- Calibration information, e.g. baseline survival in the
development cohort. 
- calibration_data
- Model calibration data collected from data in the
collection. 
- model_performance
- Collection of model performance data for data in the
collection. 
- km_info
- Information concerning risk-stratification cut-off values for
data in the collection. 
- km_data
- Kaplan-Meier survival data for data in the collection. 
- auc_data
- AUC-ROC and AUC-PR data for data in the collection. 
- ice_data
- Individual conditional expectation data for data in the
collection. Partial dependence data are computed on the fly from these
data. 
- univariate_analysis
- Univariate analysis results of data in the
collection. 
- feature_expressions
- Feature expression values for data in the
collection. 
- feature_similarity
- Feature similarity information for data in the
collection. 
- sample_similarity
- Sample similarity information for data in the
collection. 
- data_set_labels
- Labels for the different datasets in the collection.
See get_data_set_names and set_data_set_names.
 
- learner_labels
- Labels for the different learning algorithms used to
create the collection. See get_learner_names and set_learner_names.
 
- fs_method_labels
- Labels for the different feature selection methods
used to create the collection. See get_fs_method_names and
set_fs_method_names.
 
- feature_labels
- Labels for the features in this collection. See
get_feature_names and set_feature_names.
 
- km_group_labels
- Labels for the risk strata in this collection. See
get_risk_group_names and set_risk_group_names.
 
- class_labels
- Labels of the response variable. See get_class_names and
set_class_names.
 
- project_id
- Identifier of the project that generated this collection. 
- familiar_version
- Version of the familiar package.
 familiarCollection objects collect data from one or more familiarData
objects. These objects are important, as all plotting and export functions
use them. One can supply familiarModel, familiarEnsemble and familiarData
objects as arguments for these methods because familiar internally converts
these into familiarCollection objects prior to executing the method. 
Dataset obtained after evaluating models on a dataset.
Description
A familiarData object is created by evaluating familiarEnsemble or
familiarModel objects on a dataset. Multiple familiarData objects are
aggregated in a familiarCollection object.
Slots
- name
- Name of the dataset, e.g. training or internal validation. 
- outcome_type
- Outcome type of the data used to create the object. 
- outcome_info
- Outcome information object, which contains additional
information concerning the outcome, such as class levels. 
- fs_vimp
- Variable importance data collected from feature selection
methods. 
- model_vimp
- Variable importance data collected from model-specific
algorithms implemented by models created by familiar. 
- permutation_vimp
- Data collected for permutation variable importance. 
- hyperparameters
- Hyperparameters collected from created models. 
- hyperparameter_data
- Additional data concerning hyperparameters. This is
not used yet. 
- required_features
- The set of features required for complete
reproduction, i.e. with imputation. 
- model_features
- The set of features that are required for using the
model or ensemble of models, but without imputation. 
- learner
- Learning algorithm used to create the model or ensemble of
models. 
- fs_method
- Feature selection method used to determine variable
importance for the model or ensemble of models. 
- pooling_table
- Run table for the data underlying the familiarData
object. Used internally. 
- prediction_data
- Model predictions for a model or ensemble of models for
the underlying dataset. 
- confusion_matrix
- Confusion matrix for a model or ensemble of models,
based on the underlying dataset. 
- decision_curve_data
- Decision curve analysis data for a model or
ensemble of models, based on the underlying dataset. 
- calibration_info
- Calibration information, e.g. baseline survival in the
development cohort. 
- calibration_data
- Calibration data for a model or ensemble of models,
based on the underlying dataset. 
- model_performance
- Model performance data for a model or ensemble of
models, based on the underlying dataset. 
- km_info
- Information concerning risk-stratification cut-off values. 
- km_data
- Kaplan-Meier survival data for a model or ensemble of models,
based on the underlying dataset. 
- auc_data
- AUC-ROC and AUC-PR data for a model or ensemble of models,
based on the underlying dataset. 
- ice_data
- Individual conditional expectation data for features included
in a model or ensemble of models, based on the underlying dataset. Partial
dependence data are computed on the fly from these data. 
- univariate_analysis
- Univariate analysis of the underlying dataset. 
- feature_expressions
- Feature expression values of the underlying
dataset. 
- feature_similarity
- Feature similarity information of the underlying
dataset. 
- sample_similarity
- Sample similarity information of the underlying
dataset. 
- is_validation
- Signifies whether the underlying data forms a validation
dataset. Used internally. 
- generating_ensemble
- Name of the ensemble that was used to generate the
familiarData object. 
- project_id
- Identifier of the project that generated the familiarData
object. 
- familiar_version
- Version of the familiar package.
 familiarData objects contain information obtained by evaluating a single
model or single ensemble of models on a dataset. 
Data container for evaluation data.
Description
Most attributes of the familiarData object are objects of the
familiarDataElement class. This (super-)class is used to allow for
standardised aggregation and processing of evaluation data.
Slots
- data
- Evaluation data, typically a data.table or list. 
- identifiers
- Identifiers of the data, e.g. the generating model name,
learner, etc. 
- detail_level
- Sets the level at which results are computed and
aggregated.
-  ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 
-  hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If
there are at least 20 trained models in the ensemble, performance is
computed for each model, in contrast to ensemble, where performance is
computed for the ensemble of models. If there are fewer than 20 trained
models in the ensemble, bootstraps are created so that at least 20 point
estimates can be made.
 
-  model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the
model for each bootstrap.
 
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 Some child classes do not use this parameter. 
- estimation_type
- Sets the type of estimation that should be possible.
This has the following options:
 
-  point: Point estimates.
 
-  bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 
-  bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
 
 Some child classes do not use this parameter. 
- confidence_level
- (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to determine
the confidence intervals, familiar uses the rule of thumb
n = 20 / ci.level to determine the number of required bootstraps.
 
- bootstrap_ci_method
- Method used to determine bootstrap confidence
intervals (Efron and Hastie, 2016). The following methods are implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. 
- value_column
- Identifies column(s) in the data attribute presenting
values.
 
- grouping_column
- Identifies column(s) in the data attribute presenting
identifier columns for grouping during aggregation. Familiar will
automatically assign items from the identifiers attribute to the data and
this attribute when combining multiple familiarDataElements of the same
(child) class.
 
- is_aggregated
- Defines whether the object was aggregated. 
References
-  Efron, B. & Hastie, T. Computer Age Statistical Inference.
(Cambridge University Press, 2016).
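
A bootstrap confidence interval of the kind discussed by Efron and Hastie (2016) can be sketched in base R as follows. This is a generic percentile interval on toy data and is not claimed to be the method familiar applies; the number of bootstraps (400) is an arbitrary choice for the example.

```r
set.seed(1)

# Toy data: 50 values whose mean is the statistic of interest.
x <- rnorm(50, mean = 10, sd = 2)
point_estimate <- mean(x)

# Draw bootstrap samples with replacement and compute the statistic for each.
n_bootstraps <- 400
boot_estimates <- replicate(
  n_bootstraps,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Percentile confidence interval at the chosen confidence level.
confidence_level <- 0.95
alpha <- 1 - confidence_level
ci <- quantile(
  boot_estimates,
  probs = c(alpha / 2, 1 - alpha / 2),
  names = FALSE
)
```

The interval bounds are simply the lower and upper quantiles of the bootstrap estimates, so the stability of the bounds depends on how many bootstrap estimates are available.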
 
Ensemble of familiar models.
Description
A familiarEnsemble object contains one or more familiarModel objects.
Slots
- name
- Name of the familiarEnsemble object. 
- model_list
- List of attached familiarModel objects, or paths to these
objects. Familiar attaches familiarModel objects when required. 
- outcome_type
- Outcome type of the data used to create the object. 
- outcome_info
- Outcome information object, which contains additional
information concerning the outcome, such as class levels. 
- data_column_info
- Data information object containing information
regarding identifier column names and outcome column names. 
- learner
- Learning algorithm used to create the models in the ensemble. 
- fs_method
- Feature selection method used to determine variable
importance for the models in the ensemble. 
- feature_info
- List of objects containing feature information, e.g.,
name, class levels, transformation, normalisation and clustering
parameters. 
- required_features
- The set of features required for complete
reproduction, i.e. with imputation. 
- model_features
- The combined set of features that is used to train the
models in the ensemble. 
- novelty_features
- The combined set of features that is used to train all
novelty detectors in the ensemble. 
- run_table
- Run table for the data used to train the ensemble. Used
internally. 
- calibration_info
- Calibration information, e.g. baseline survival in the
development cohort. 
- model_dir_path
- Path to folder containing the familiarModel objects. Can
be updated using the update_model_dir_path method.
 
- auto_detach
- Flag used to determine whether models should be detached
from the model after use, or not. Used internally. 
- settings
- A copy of the evaluation configuration parameters used at
model creation. These are used as default parameters when evaluating the
ensemble to create a familiarData object. 
- project_id
- Identifier of the project that generated the underlying
familiarModel object(s). 
- familiar_version
- Version of the familiar package. 
Hyperparameter learner.
Description
A familiarHyperparameterLearner object is a self-contained model that can be
applied to predict optimisation scores for a set of hyperparameters.
Details
Hyperparameter learners are used to infer the optimisation score for
sets of hyperparameters. These are then used to either infer utility using
acquisition functions or to generate summary scores to identify the optimal
model.
Slots
- name
- Name of the familiarHyperparameterLearner object. 
- learner
- Algorithm used to create the hyperparameter learner. 
- target_learner
- Algorithm for which the hyperparameters are being
learned. 
- target_outcome_type
- Outcome type of the learner for which
hyperparameters are being modeled. Used to determine the target
hyperparameters. 
- optimisation_metric
- One or more metrics used to generate the optimisation
score. 
- optimisation_function
- Function used to generate the optimisation score. 
- model
- The actual model trained using the specific algorithm, e.g. an
isolation forest from the isotree package.
 
- target_hyperparameters
- The names of the hyperparameters that are used
to train the hyperparameter learner. 
- project_id
- Identifier of the project that generated the
familiarHyperparameterLearner object. 
- familiar_version
- Version of the familiar package. 
- package
- Name of package(s) required to execute the hyperparameter
learner itself, e.g. laGP. 
- package_version
- Version of the packages mentioned in the package attribute. 
Model performance metric.
Description
Superclass for model performance objects.
Slots
- metric
- Performance metric. 
- outcome_type
- Type of outcome being predicted. 
- name
- Name of the performance metric. 
- value_range
- Range of the performance metric. Can be half-open. 
- baseline_value
- Value of the metric for trivial models, e.g. models that
always predict the median value, the majority class, or the mean hazard,
etc. 
- higher_better
- States whether higher metric values correspond to better
predictive model performance (e.g. accuracy) or not (e.g. root mean squared
error). 
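The baseline_value and higher_better slots can be illustrated with a small base R sketch. The data and the beats_baseline helper below are hypothetical and are not part of familiar; the sketch only shows how a trivial model yields a baseline and how the direction of a metric affects the comparison.

```r
# Hypothetical outcome data (not from familiar).
y_class <- factor(c("a", "a", "a", "b", "b"))
y_num <- c(1, 2, 3, 4, 10)

# Trivial classifier: always predict the majority class.
majority <- names(which.max(table(y_class)))
baseline_accuracy <- mean(y_class == majority)

# Trivial regressor: always predict the median value.
baseline_rmse <- sqrt(mean((y_num - median(y_num))^2))

# Hypothetical helper: compare a metric value with its baseline, taking
# into account whether higher values are better.
beats_baseline <- function(value, baseline, higher_better) {
  if (higher_better) value > baseline else value < baseline
}

beats_baseline(0.8, baseline_accuracy, higher_better = TRUE)
beats_baseline(2.0, baseline_rmse, higher_better = FALSE)
```

Accuracy is an example of a metric where higher is better, whereas root mean squared error improves as it decreases, so the comparison flips.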
Familiar model.
Description
A familiarModel object is a self-contained model that can be applied to
generate predictions for a dataset. familiarModel objects form the parent
class of learner-specific child classes.
Slots
- name
- Name of the familiarModel object. 
- model
- The actual model trained using a specific algorithm, e.g. a
random forest from the ranger package, or a LASSO model from glmnet. 
- outcome_type
- Outcome type of the data used to create the object. 
- outcome_info
- Outcome information object, which contains additional
information concerning the outcome, such as class levels. 
- feature_info
- List of objects containing feature information, e.g.,
name, class levels, transformation, normalisation and clustering
parameters. 
- data_column_info
- Data information object containing information
regarding identifier column names and outcome column names. 
- hyperparameters
- Set of hyperparameters used to train the model. 
- hyperparameter_data
- Information generated during hyperparameter
optimisation. 
- calibration_model
- One or more models used to recalibrate the model
output. Currently only used by some models. 
- novelty_detector
- A familiarNoveltyDetector object that can be used to
detect out-of-distribution samples. 
- learner
- Learning algorithm used to create the model. 
- fs_method
- Feature selection method used to determine variable
importance for the model. 
- required_features
- The set of features required for complete
reproduction, i.e. with imputation. 
- model_features
- The set of features that is used to train the model. 
- novelty_features
- The set of features that is used to train all novelty
detectors in the ensemble. 
- calibration_info
- Calibration information, e.g. baseline survival in the
development cohort. 
- km_info
- Data concerning stratification into risk groups. 
- run_table
- Run table for the data used to train the model. Used
internally. 
- settings
- A copy of the evaluation configuration parameters used at
model creation. These are used as default parameters when evaluating the
model (technically, familiarEnsemble) to create a familiarData object. 
- is_trimmed
- Flag that indicates whether the model, stored in the model slot, has been trimmed. 
- trimmed_function
- List of functions whose output has been captured prior
to trimming the model. 
- messages
- List of warning and error messages generated during training. 
- project_id
- Identifier of the project that generated the familiarModel
object. 
- familiar_version
- Version of the familiar package. 
- package
- Name of package(s) required to execute the model itself, e.g.
ranger or glmnet. 
- package_version
- Version of the packages mentioned in the package attribute. 
Novelty detector.
Description
A familiarNoveltyDetector object is a self-contained model that can be
applied to generate out-of-distribution predictions for instances in a
dataset.
Details
Note that these objects do not contain any data concerning outcome,
as this is not relevant for (prospective) out-of-distribution detection.
Slots
- name
- Name of the familiarNoveltyDetector object. 
- learner
- Learning algorithm used to create the novelty detector. 
- model
- The actual novelty detector trained using a specific algorithm,
e.g. an isolation forest from the isotree package. 
- feature_info
- List of objects containing feature information, e.g.,
name, class levels, transformation, normalisation and clustering
parameters. 
- data_column_info
- Data information object containing information
regarding identifier column names. 
- conversion_parameters
- Parameters used to convert raw output to
statistical probability of being out-of-distribution. Currently unused. 
- hyperparameters
- Set of hyperparameters used to train the detector. 
- required_features
- The set of features required for complete
reproduction, i.e. with imputation. 
- model_features
- The set of features that is used to train the detector. 
- run_table
- Run table for the data used to train the detector. Used
internally. 
- is_trimmed
- Flag that indicates whether the detector, stored in the
model slot, has been trimmed. 
- trimmed_function
- List of functions whose output has been captured prior
to trimming the model. 
- project_id
- Identifier of the project that generated the
familiarNoveltyDetector object. 
- familiar_version
- Version of the familiar package. 
- package
- Name of package(s) required to execute the detector itself,
e.g. isotree. 
- package_version
- Version of the packages mentioned in the package attribute. 
Variable importance method object.
Description
The familiarVimpMethod class is the parent class for all variable importance
methods in familiar.
Slots
- outcome_type
- Outcome type of the data to be evaluated using the object. 
- hyperparameters
- Set of hyperparameters for the variable importance
method. 
- vimp_method
- The character string indicating the variable importance
method. 
- multivariate
- Flags whether the variable importance method is
multivariate vs. univariate. 
- outcome_info
- Outcome information object, which contains additional
information concerning the outcome, such as class levels. 
- feature_info
- List of objects containing feature information, e.g.,
name, class levels, transformation, normalisation and clustering
parameters. 
- required_features
- The set of features to be assessed by the variable
importance method. 
- package
- Name of the package(s) required to execute the variable
importance method. 
- run_table
- Run table for the data to be assessed by the variable
importance method. Used internally. 
- project_id
- Identifier of the project that generated the
familiarVimpMethod object. 
Feature information object.
Description
A featureInfo object contains information for a single feature. This
information is used to check data prospectively for consistency and for data
preparation. These objects are, for instance, attached to a familiarModel
object so that data can be pre-processed in the same way as the development
data.
Slots
- name
- Name of the feature, which by default is the column name of the
feature. 
- set_descriptor
- Character string describing the set to which the feature
belongs. Currently not used. 
- feature_type
- Describes the feature type, i.e. factor or numeric. 
- levels
- The class levels of categorical features. This is used to check
prospective datasets. 
- ordered
- Specifies whether the levels of a categorical feature are ordered. 
- distribution
- Five-number summary (numeric) or class frequency
(categorical). 
- data_id
- Internal identifier for the dataset used to derive the feature
information. 
- run_id
- Internal identifier for the specific subset of the dataset used
to derive the feature information. 
- in_signature
- Specifies whether the feature is included in the model
signature. 
- in_novelty
- Specifies whether the feature is included in the novelty
detector. 
- removed
- Specifies whether the feature was removed during
pre-processing. 
- removed_unknown_type
- Specifies whether the feature was removed during
pre-processing because the type was neither factor nor numeric. 
- removed_missing_values
- Specifies whether the feature was removed during
pre-processing because it contained too many missing values. 
- removed_no_variance
- Specifies whether the feature was removed during
pre-processing because it did not contain more than 1 unique value. 
- removed_low_variance
- Specifies whether the feature was removed during
pre-processing because the variance was too low. Requires applying
low_variance as a filter_method. 
- removed_low_robustness
- Specifies whether the feature was removed during
pre-processing because it lacks robustness. Requires applying
robustness as a filter_method, as well as repeated measurement. 
- removed_low_importance
- Specifies whether the feature was removed during
pre-processing because it lacks relevance. Requires applying
univariate_test as a filter_method. 
- fraction_missing
- Specifies the fraction of missing values. 
- robustness
- Specifies robustness of the feature, if measured. 
- univariate_importance
- Specifies the univariate p-value of the feature,
if measured. 
- transformation_parameters
- Details parameters for power transformation
of numeric features. 
- normalisation_parameters
- Details parameters for (global) normalisation
of numeric features. 
- batch_normalisation_parameters
- Details parameters for batch
normalisation of numeric features. 
- imputation_parameters
- Details parameters or models for imputation of
missing values. 
- cluster_parameters
- Details parameters for forming clusters with other
features. 
- required_features
- Details features required for clustering or
imputation. 
- familiar_version
- Version of the familiar package. 
Feature information parameters object.
Description
A featureInfo object contains information for a single feature. Some
information, for example concerning clustering and transformation, consists
of various parameters that allow the data transformation to be applied
correctly. These are stored in featureInfoParameters objects.
Details
featureInfoParameters is normally a parent class for specific
classes, such as featureInfoParametersTransformation.
Slots
- name
- Name of the feature, which by default is the column name of the
feature. Typically used to correctly assign the data. 
- complete
- Flags whether the parameters have been completely set. 
- familiar_version
- Version of the familiar package. 
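The parent and child relationship described above follows the usual S4 inheritance pattern. The sketch below uses hypothetical class names (demoInfoParameters and demoInfoParametersTransformation) and a hypothetical method slot so as not to clash with, or misrepresent, the actual familiar class definitions. Only the name, complete and familiar_version slots mirror the slots listed above.

```r
library(methods)

# Hypothetical parent class with the three slots listed above.
setClass(
  "demoInfoParameters",
  slots = c(name = "character", complete = "logical", familiar_version = "ANY")
)

# Hypothetical child class that adds one slot of its own.
setClass(
  "demoInfoParametersTransformation",
  contains = "demoInfoParameters",
  slots = c(method = "character")
)

params <- new(
  "demoInfoParametersTransformation",
  name = "tumour_volume",
  complete = FALSE,
  method = "none"
)

# The child object is also an instance of the parent class, and inherited
# slots are accessed with @.
is(params, "demoInfoParameters")
params@name
params@complete
```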
Get outcome class labels
Description
Outcome classes in familiarCollection objects can have custom
names for export and plotting. This function retrieves the currently
assigned names.
Usage
## S4 method for signature 'familiarCollection'
get_class_names(x)
Arguments
| x | A familiarCollection object. | 
Details
Labels convert internal class names to the requested label at export
or when plotting. Labels can be changed using the set_class_names
method.
Value
An ordered array of class labels.
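Conceptually, a label maps an internal name to a display name while preserving order. The base R sketch below illustrates that idea with hypothetical class names and labels; it is not how familiar implements labelling internally.

```r
# Hypothetical internal class names and the labels requested for display.
internal_classes <- c("0", "1")
display_labels <- c("benign", "malignant")

# Hypothetical predictions expressed with internal class names.
predicted <- c("1", "0", "1")

# Relabel while keeping the level order of the internal names.
relabelled <- factor(
  predicted,
  levels = internal_classes,
  labels = display_labels
)

levels(relabelled)
as.character(relabelled)
```

The same idea applies to the other label getters for datasets, features, feature selection methods, learners and risk groups.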
Get current name of datasets
Description
Datasets in familiarCollection objects can have custom names for
export and plotting. This function retrieves the currently assigned names.
Usage
## S4 method for signature 'familiarCollection'
get_data_set_names(x)
Arguments
| x | A familiarCollection object. | 
Details
Labels convert internal naming of data sets to the requested label
at export or when plotting. Labels can be changed using the
set_data_set_names method.
Value
An ordered array of dataset name labels.
Get current feature labels
Description
Features in familiarCollection objects can have custom names for
export and plotting. This function retrieves the currently assigned names.
Usage
## S4 method for signature 'familiarCollection'
get_feature_names(x)
Arguments
| x | A familiarCollection object. | 
Details
Labels convert internal naming of features to the requested label at
export or when plotting. Labels can be changed using the
set_feature_names method.
Value
An ordered array of feature labels.
Get current feature selection method name labels
Description
Feature selection methods in familiarCollection objects can have
custom names for export and plotting. This function retrieves the currently
assigned names.
Usage
## S4 method for signature 'familiarCollection'
get_fs_method_names(x)
Arguments
| x | A familiarCollection object. | 
Details
Labels convert internal naming of feature selection methods to the
requested label at export or when plotting. Labels can be changed using the
set_fs_method_names method.
Value
An ordered array of feature selection method name labels.
Get current learner name labels
Description
Learners in familiarCollection objects can have custom names for
export and plotting. This function retrieves the currently assigned names.
Usage
## S4 method for signature 'familiarCollection'
get_learner_names(x)
Arguments
| x | A familiarCollection object. | 
Details
Labels convert internal naming of learners to the requested label at
export or when plotting. Labels can be changed using the
set_learner_names method.
Value
An ordered array of learner name labels.
Get current risk group labels
Description
Risk groups in familiarCollection objects can have custom names
for export and plotting. This function retrieves the currently assigned
names.
Usage
## S4 method for signature 'familiarCollection'
get_risk_group_names(x)
Arguments
| x | A familiarCollection object. | 
Details
Labels convert internal naming of risk groups to the requested label
at export or when plotting. Labels can be changed using the
set_risk_group_names method.
Value
An ordered array of risk group labels.
Extract variable importance table.
Description
This method retrieves and parses variable importance tables from
their respective vimpTable objects.
Usage
get_vimp_table(x, state = "ranked", ...)
## S4 method for signature 'list'
get_vimp_table(x, state = "ranked", ...)
## S4 method for signature 'character'
get_vimp_table(x, state = "ranked", ...)
## S4 method for signature 'vimpTable'
get_vimp_table(x, state = "ranked", ...)
## S4 method for signature 'NULL'
get_vimp_table(x, state = "ranked", ...)
## S4 method for signature 'experimentData'
get_vimp_table(x, state = "ranked", ...)
## S4 method for signature 'familiarModel'
get_vimp_table(x, state = "ranked", data = NULL, as_object = FALSE, ...)
Arguments
| x | Variable importance (vimpTable) object, a list thereof, or one or
more paths to these objects. This method extracts the variable importance
table from such objects. | 
| state | State of the returned variable importance table. This affects
what contents are shown, and in which format. The variable importance table
can be returned with the following states:
- initial: initial state, directly after the variable importance table is
filled. The returned variable importance table shows the raw, unprocessed
data.
- decoded: depending on the variable importance method, the initial
variable importance table may contain the scores of individual contrasts
for categorical variables. When decoded, scores from all contrasts are
aggregated to a single score for each feature.
- declustered: variable importance is determined from fully processed
features, which includes clustering. This means that a single feature in
the variable importance table may represent multiple original features.
When a variable importance table has been declustered, all clusters have
been turned into their constituent features.
- ranked (default): The scores have been used to create ranks, with lower
ranks indicating better features.
Internally, the variable importance table will go through each state, i.e.
a variable importance table in the initial state will be decoded,
declustered and then ranked prior to returning the variable importance
table. | 
| ... | Unused arguments. | 
| data | Internally used argument for use with familiarModel objects. | 
| as_object | Internally used argument for use with familiarModel objects. | 
Value
A data.table with variable importance scores and, with
state="ranked", the respective ranks.
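The final ranking step can be pictured with a small base R sketch. The feature names and scores are hypothetical, and the sketch assumes that a higher score means a more important feature, which in practice depends on the variable importance method. It is not the ranking code used by familiar.

```r
# Hypothetical variable importance scores; assume higher = more important.
vimp <- data.frame(
  name = c("age", "stage", "volume"),
  score = c(0.12, 0.87, 0.45)
)

# Convert scores to ranks so that rank 1 is the best feature.
vimp$rank <- rank(-vimp$score, ties.method = "min")

# Order the table by rank, best feature first.
vimp <- vimp[order(vimp$rank), ]
vimp
```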
Create an empty xml configuration file
Description
This function creates an empty configuration xml file in the directory
specified by dir_path. This provides an alternative to the use of input
arguments for familiar.
Usage
get_xml_config(dir_path)
Arguments
| dir_path | Path to the directory where the configuration file should be
created. The directory should exist, and no file named config.xml should
be present. | 
Value
Nothing. A file named config.xml is created in the directory
indicated by dir_path.
Examples
## Not run: 
# Creates a config.xml file in the working directory
get_xml_config(dir_path=getwd())
## End(Not run)
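Because the target directory must exist and must not already contain config.xml, it can help to check both conditions first. The sketch below uses a hypothetical helper, can_create_config, which is not part of familiar, and only calls get_xml_config when the familiar package is installed.

```r
# Hypothetical helper (not part of familiar): checks the two preconditions
# stated for dir_path.
can_create_config <- function(dir_path) {
  dir.exists(dir_path) && !file.exists(file.path(dir_path, "config.xml"))
}

# Use a scratch directory instead of the working directory.
target_dir <- file.path(tempdir(), "familiar_config_demo")
dir.create(target_dir, showWarnings = FALSE)
unlink(file.path(target_dir, "config.xml"))

ready <- can_create_config(target_dir)

# Create the configuration file only if familiar is available.
if (ready && requireNamespace("familiar", quietly = TRUE)) {
  familiar::get_xml_config(dir_path = target_dir)
}
```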
Internal test for encapsulated_path
Description
This function tests if the object is an encapsulated_path object.
Usage
is.encapsulated_path(x)
Arguments
| x | Object to test. | 
Value
TRUE for objects that are encapsulated_path, FALSE otherwise.
Internal test to see if an object is a waiver
Description
This function tests if the object was created by the waiver function. This
function is functionally identical to ggplot2:::is.waive().
Usage
is.waive(x)
Arguments
| x | Object to test. | 
Value
TRUE for objects that are waivers, FALSE otherwise.
Outcome information object.
Description
An outcome information object stores data concerning an outcome. This is used
to prospectively check data.
Slots
- name
- Name of the outcome, inherited from the original column name by
default. 
- outcome_type
- Type of outcome. 
- outcome_column
- Name of the outcome column in data. 
- levels
- Specifies class levels of categorical outcomes. 
- ordered
- Specifies whether categorical outcomes are ordered. 
- reference
- Class level used as reference. 
- time
- Maximum time, as set by the time_max configuration parameter. 
- censored
- Censoring indicators for survival outcomes. 
- event
- Event indicators for survival outcomes. 
- competing_risk
- Indicators for competing risks in survival outcomes. 
- distribution
- Five-number summary (numeric outcomes), class frequency
(categorical outcomes), or survival distributions. 
- data_id
- Internal identifier for the dataset used to derive the outcome
information. 
- run_id
- Internal identifier for the specific subset of the dataset used
to derive the outcome information. 
- transformation_parameters
- Parameters used for transforming numeric
outcomes. Currently unused. 
- normalisation_parameters
- Parameters used for normalising numeric
outcomes. Currently unused. 
Plot the precision-recall curve.
Description
This method creates precision-recall curves based on data in a
familiarCollection object.
Usage
plot_auc_precision_recall_curve(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_auc_precision_recall_curve(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_auc_precision_recall_curve(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData objects, that will be internally converted to a familiarCollection object. It is also possible to provide a familiarEnsemble or one or more familiarModel objects together with the data from which data is computed
prior to export. Paths to such files can also be provided. | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where the plots of
precision-recall curves are saved to. Output is saved in the
performance subdirectory. If NULL, no figures are saved, but are returned
instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette to use to color the different
plot elements in case a value was provided to the color_by argument. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| conf_int_style | (optional) Confidence interval style. See details for
allowed styles. | 
| conf_int_alpha | (optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave:
- familiar_data_names: Names of the dataset(s). Only used if the object
parameter is one or more familiarData objects.
- collection_name: Name of the collection.
- device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
- scale: Multiplicative scaling factor.
- dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
- limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
- bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
- create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary. | 
Details
This function generates area under the precision-recall curve plots.
Available splitting variables are: fs_method, learner, data_set and
positive_class. By default, the data is split by fs_method and learner,
with faceting by data_set and colouring by positive_class.
Available palettes for discrete_palette are those listed by
grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals()
(requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors,
topo.colors and cm.colors, which correspond to the palettes of the same
name in grDevices. If not specified, a default palette based on palettes
in Tableau is used. You may also specify your own palette by using colour
names listed by grDevices::colors() or through hexadecimal RGB strings.
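The palette sources named above can be explored directly in base R. The following sketch lists the available palette names, draws a few colours from one palette of each family, and builds a custom palette. The specific palette choices ("Okabe-Ito" and "Viridis") and the variable names are illustrative assumptions.

```r
# List the palette names provided by grDevices.
qualitative_names <- grDevices::palette.pals()  # requires R >= 4.0.0
hcl_names <- grDevices::hcl.pals()              # requires R >= 3.6.0

# Draw three colours from one named palette of each family.
okabe <- grDevices::palette.colors(n = 3, palette = "Okabe-Ito")
viridis <- grDevices::hcl.colors(n = 3, palette = "Viridis")

# A custom palette mixing a named colour with hexadecimal RGB strings.
custom <- c("firebrick", "#1B9E77", "#7570B3")

# Named colours should appear in grDevices::colors().
"firebrick" %in% grDevices::colors()
```

A character vector such as custom is the kind of value the text describes for a user-specified discrete_palette.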
Bootstrap confidence intervals of the precision-recall curve (if present) can
be shown using various styles set by conf_int_style:
- ribbon (default): confidence intervals are shown as a ribbon with an
opacity of conf_int_alpha around the point estimate of the curve.
- step: confidence intervals are shown as a step function around
the point estimate of the curve.
- none: confidence intervals are not shown. The point estimate of the
curve is shown as usual.
Labelling methods such as set_fs_method_names or set_data_set_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
NULL or list of plot objects, if dir_path is NULL.
Plot the receiver operating characteristic curve.
Description
This method creates receiver operating characteristic curves
based on data in a familiarCollection object.
Usage
plot_auc_roc_curve(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_auc_roc_curve(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_auc_roc_curve(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData objects, that will be internally converted to a familiarCollection object. It is also possible to provide a familiarEnsemble or one or more familiarModel objects together with the data from which data is computed
prior to export. Paths to such files can also be provided. | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where the plots of receiver
operating characteristic curves are saved to. Output is saved in the
performance subdirectory. If NULL, no figures are saved, but are returned
instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette to use to color the different
plot elements in case a value was provided to the color_by argument. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| conf_int_style | (optional) Confidence interval style. See details for
allowed styles. | 
| conf_int_alpha | (optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave, extract_auc_data
 
 familiar_data_names: Names of the dataset(s). Only used if the object
parameter is one or more familiarData objects.
 collection_name: Name of the collection.
 device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
 create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
   ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
   hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
   model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix.
 estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
   point: Point estimates.
   bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
   bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g.
list("auc_data"="bci", "model_performance"="point"). This parameter can be
set for the following data elements: auc_data, decision_curve_analyis,
model_performance, permutation_vimp, ice_data, and prediction_data.
 aggregate_results: (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is
bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if
estimation_type is point. The default value is TRUE, except when assessing
model performance metrics, as the default violin plot requires underlying
data. As with detail_level and estimation_type, a non-default
aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g.
list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
 confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to determine
the confidence intervals, familiar uses the rule of thumb
n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
 bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
Details
This function generates area under the ROC curve plots.
Available splitting variables are: fs_method, learner, data_set and
positive_class. By default, the data is split by fs_method and learner,
with faceting by data_set and colouring by positive_class.
Available palettes for discrete_palette are those listed by
grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals()
(requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors,
topo.colors and cm.colors, which correspond to the palettes of the same
name in grDevices. If not specified, a default palette based on palettes
in Tableau is used. You may also specify your own palette by using colour
names listed by grDevices::colors() or through hexadecimal RGB strings.
Bootstrap confidence intervals of the ROC curve (if present) can be shown
using various styles set by conf_int_style:
-  ribbon (default): confidence intervals are shown as a ribbon with an
opacity of conf_int_alpha around the point estimate of the ROC curve.
 
-  step: confidence intervals are shown as a step function around
the point estimate of the ROC curve.
 
-  none: confidence intervals are not shown. The point estimate of the ROC
curve is shown as usual.
 
Labelling methods such as set_fs_method_names or set_data_set_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
NULL, or a list of plot objects if dir_path is NULL.
Plot calibration figures.
Description
This method creates calibration plots from calibration data
stored in a familiarCollection object. For these figures, the expected
(predicted) values are plotted against the observed values. A
well-calibrated model should be close to the identity line.
Usage
plot_calibration_data(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  show_density = TRUE,
  show_calibration_fit = TRUE,
  show_goodness_of_fit = TRUE,
  density_plot_height = grid::unit(1, "cm"),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_calibration_data(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  show_density = TRUE,
  show_calibration_fit = TRUE,
  show_goodness_of_fit = TRUE,
  density_plot_height = grid::unit(1, "cm"),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_calibration_data(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  show_density = TRUE,
  show_calibration_fit = TRUE,
  show_goodness_of_fit = TRUE,
  density_plot_height = grid::unit(1, "cm"),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData objects,
that will be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel objects
together with the data from which data is computed
prior to export. Paths to such files can also be provided.
 | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created calibration
plots are saved to. Output is saved in the calibration subdirectory. If
NULL, no figures are saved, but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette to use to color the different
data points and fit lines in case a non-singular variable was provided to
the color_by argument. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| x_label_shared | (optional) Sharing of x-axis labels between facets.
One of three values:
 
 overall: A single label is placed at the bottom of the figure. Tick
text (but not the ticks themselves) is removed for all but the bottom facet
plot(s).
 column: A label is placed at the bottom of each column. Tick text (but
not the ticks themselves) is removed for all but the bottom facet plot(s).
 individual: A label is placed below each facet plot. Tick text is kept.
 | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| y_label_shared | (optional) Sharing of y-axis labels between facets.
One of three values:
 
 overall: A single label is placed to the left of the figure. Tick text
(but not the ticks themselves) is removed for all but the left-most facet
plot(s).
 row: A label is placed to the left of each row. Tick text (but not the
ticks themselves) is removed for all but the left-most facet plot(s).
 individual: A label is placed to the left of each facet plot. Tick text is kept.
 | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_range | (optional) Value range for the x-axis. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| y_range | (optional) Value range for the y-axis. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| conf_int_style | (optional) Confidence interval style. See details for
allowed styles. | 
| conf_int_alpha | (optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed. | 
| show_density | (optional) Show point density in top margin of the
figure. If color_by is set, this information will not be shown. | 
| show_calibration_fit | (optional) Specifies whether the calibration in
the large and calibration slope are annotated in the plot. If color_by is
set, this information will not be shown. | 
| show_goodness_of_fit | (optional) Specifies whether the results of
goodness of fit tests are annotated in the plot. If color_by is set, this
information will not be shown. | 
| density_plot_height | (optional) Height of the density plot. The height
is 1 cm by default. Height is expected to be a grid unit (see grid::unit),
which also allows for specifying relative heights. Will be ignored if
show_density is FALSE. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave, extract_calibration_data
 
 familiar_data_names: Names of the dataset(s). Only used if the object
parameter is one or more familiarData objects.
 collection_name: Name of the collection.
 device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
 create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
   ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
   hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
   model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix.
 estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
   point: Point estimates.
   bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
   bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g.
list("auc_data"="bci", "model_performance"="point"). This parameter can be
set for the following data elements: auc_data, decision_curve_analyis,
model_performance, permutation_vimp, ice_data, and prediction_data.
 aggregate_results: (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is
bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if
estimation_type is point. The default value is TRUE, except when assessing
model performance metrics, as the default violin plot requires underlying
data. As with detail_level and estimation_type, a non-default
aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g.
list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
 confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to determine
the confidence intervals, familiar uses the rule of thumb
n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
 bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
Details
This function generates a calibration plot for each model in each
dataset. Any data used for calibration (e.g. baseline survival) is obtained
during model creation.
Available splitting variables are: fs_method, learner, data_set and
evaluation_time (survival analysis only) and positive_class (multinomial
endpoints only). By default, separate figures are created for each
combination of fs_method and learner, with faceting by data_set.
Calibration in survival analysis is performed at set time points so that
survival probabilities can be computed from the model, and compared with
observed survival probabilities. This is done differently depending on the
underlying model. For Cox proportional hazards regression models, the baseline
survival (of the development samples) is used, whereas accelerated failure
time models (e.g. Weibull) and survival random forests can be used to
directly predict survival probabilities at a given time point. For survival
analysis, evaluation_time is an additional facet variable (by default).
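As a rough illustration of the Cox case described above, the sketch below predicts survival probabilities at one evaluation time with the survival package and compares them with observed Kaplan-Meier estimates within groups of samples. It is not the procedure familiar uses internally; the lung dataset, the 365-day time point and the split into three groups are assumptions chosen only for the example.

```r
library(survival)

# Assumed example data: the lung dataset that ships with the survival package.
# Recode status to 0/1 (1 = event).
d <- survival::lung[, c("time", "status", "age", "sex")]
d$status <- d$status - 1

fit <- coxph(Surv(time, status) ~ age + sex, data = d)

# Assumed evaluation time (in days).
eval_time <- 365

# survfit() combines the baseline survival of the development samples with
# the linear predictor of each sample to give one survival curve per sample.
pred_curves <- survfit(fit, newdata = d)
expected <- as.numeric(
  summary(pred_curves, times = eval_time, extend = TRUE)$surv)

# Group samples by expected survival, then compute the observed
# (Kaplan-Meier) survival at the same time point within each group.
group <- cut(expected,
  breaks = quantile(expected, probs = c(0, 1 / 3, 2 / 3, 1)),
  include.lowest = TRUE, labels = c("low", "mid", "high"))

observed <- sapply(levels(group), function(g) {
  km <- survfit(Surv(time, status) ~ 1, data = d[group == g, ])
  summary(km, times = eval_time, extend = TRUE)$surv
})

calibration <- data.frame(
  group = levels(group),
  expected = as.numeric(tapply(expected, group, mean)),
  observed = as.numeric(observed)
)
print(calibration)
```

For a well-calibrated model, the expected and observed columns should be close to each other in every group.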
Calibration for multinomial endpoints is performed in a one-against-all
manner. This yields calibration information for each individual class of the
endpoint. For such endpoints, positive_class is an additional facet variable
(by default).
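The one-against-all approach can be sketched in base R as follows. The toy classes, the randomly generated probabilities and the 0.2-wide bins are assumptions for illustration only, and the code is not taken from familiar.

```r
# Assumed toy data: observed classes and predicted class probabilities.
set.seed(1)
classes <- c("a", "b", "c")
n_samples <- 200
observed <- factor(sample(classes, n_samples, replace = TRUE), levels = classes)
raw <- matrix(runif(n_samples * 3), ncol = 3, dimnames = list(NULL, classes))
probs <- raw / rowSums(raw)  # each row sums to 1

# One-against-all: treat each class in turn as the positive class and all
# other classes as negative.
one_vs_all <- lapply(classes, function(positive_class) {
  data.frame(
    positive_class = positive_class,
    expected = probs[, positive_class],
    observed = as.integer(observed == positive_class)
  )
})
calibration_data <- do.call(rbind, one_vs_all)

# Bin expected probabilities per class and compare the mean expected
# probability with the observed frequency in each bin.
calibration_data$bin <- cut(calibration_data$expected,
  breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)
summary_table <- aggregate(
  cbind(expected, observed) ~ positive_class + bin,
  data = calibration_data, FUN = mean)
print(summary_table)
```

Each value of positive_class then yields its own set of calibration points, which is why it can serve as a facet variable.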
Calibration plots have a density plot in the margin, which shows the density
of the plotted points, ordered by the expected probability or value. For
binomial and multinomial outcomes, the densities for positive and negative
classes are shown separately. Note that this information is only provided
when color_by is not used as a splitting variable (i.e. one calibration
plot per facet).
Calibration plots are annotated with the intercept and the slope of a linear
model fitted to the sample points. A well-calibrated model has an intercept
close to 0.0 and a slope of 1.0. Intercept and slope are shown with their
respective 95% confidence intervals. In addition, goodness-of-fit tests may
be shown. For most endpoints these are based on the Hosmer-Lemeshow (HL)
test, but for survival endpoints both the Nam-D'Agostino (ND) and the
Greenwood-Nam-D'Agostino (GND) tests are shown. Note that this information
is only annotated when color_by is not used as a splitting variable (i.e.
one calibration plot per facet).
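The annotations described above can be approximated with a short base R sketch for a binomial outcome. The simulated data, the use of ten quantile bins as sample points and the weighting by bin size are assumptions for illustration; familiar may compute these quantities differently.

```r
set.seed(42)
# Assumed toy data: expected probabilities, with outcomes drawn so that the
# model is well calibrated by construction.
n <- 1000
expected <- runif(n, 0.05, 0.95)
observed <- rbinom(n, size = 1, prob = expected)

# Group samples into 10 bins by expected probability.
n_groups <- 10
bin <- cut(expected,
  breaks = quantile(expected, probs = seq(0, 1, length.out = n_groups + 1)),
  include.lowest = TRUE)
points <- data.frame(
  expected = as.numeric(tapply(expected, bin, mean)),
  observed = as.numeric(tapply(observed, bin, mean)),
  n = as.numeric(table(bin))
)

# Linear fit of observed against expected values: intercept (ideally 0.0)
# and slope (ideally 1.0), with 95% confidence intervals.
fit <- lm(observed ~ expected, data = points, weights = n)
estimates <- cbind(estimate = coef(fit), confint(fit, level = 0.95))
print(estimates)

# Hosmer-Lemeshow statistic: sum over groups of (O - E)^2 / (n * p * (1 - p)),
# compared with a chi-squared distribution with (groups - 2) degrees of freedom.
o <- points$observed * points$n
e <- points$expected * points$n
hl_stat <- sum((o - e)^2 / (points$n * points$expected * (1 - points$expected)))
hl_p <- pchisq(hl_stat, df = n_groups - 2, lower.tail = FALSE)
print(c(statistic = hl_stat, p_value = hl_p))
```

A small p-value from the goodness-of-fit test would indicate a deviation from good calibration.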
Available palettes for discrete_palette are those listed by
grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals()
(requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors,
topo.colors and cm.colors, which correspond to the palettes of the same
name in grDevices. If not specified, a default palette based on palettes
in Tableau is used. You may also specify your own palette by using colour
names listed by grDevices::colors() or through hexadecimal RGB strings.
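The palette sources mentioned above can be explored directly in R. The sketch below only shows how to list palettes and obtain colours with grDevices; the choice of four colours and of the specific palettes is an assumption for the example, and how the result is handed to discrete_palette is not shown.

```r
n_colours <- 4

# Names of the available palettes (first few shown).
print(head(grDevices::palette.pals()))
print(head(grDevices::hcl.pals()))

# Colours from a palette listed by grDevices::palette.pals() (R >= 4.0.0).
okabe_ito <- grDevices::palette.colors(n_colours, palette = "Okabe-Ito")

# Colours from a palette listed by grDevices::hcl.pals() (R >= 3.6.0).
viridis <- grDevices::hcl.colors(n_colours, palette = "viridis")

# Classic grDevices palette function.
rainbow_colours <- grDevices::rainbow(n_colours)

# A custom palette that mixes colour names listed by grDevices::colors()
# with hexadecimal RGB strings.
custom <- c("steelblue", "#E69F00", "firebrick", "#009E73")

print(list(
  okabe_ito = okabe_ito,
  viridis = viridis,
  rainbow = rainbow_colours,
  custom = custom
))
```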
Labeling methods such as set_risk_group_names or set_data_set_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
NULL, or a list of plot objects if dir_path is NULL.
References
-  Hosmer, D. W., Hosmer, T., Le Cessie, S. & Lemeshow, S. A
comparison of goodness-of-fit tests for the logistic regression model. Stat.
Med. 16, 965–980 (1997).
 
-  D’Agostino, R. B. & Nam, B.-H. Evaluation of the Performance of Survival
Analysis Models: Discrimination and Calibration Measures. in Handbook of
Statistics vol. 23 1–25 (Elsevier, 2003).
 
-  Demler, O. V., Paynter, N. P. & Cook, N. R. Tests of calibration and
goodness-of-fit in the survival setting. Stat. Med. 34, 1659–1680 (2015).
 
Plot confusion matrix.
Description
This method creates confusion matrices based on data in a
familiarCollection object.
Usage
plot_confusion_matrix(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  rotate_x_tick_labels = waiver(),
  show_alpha = TRUE,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_confusion_matrix(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  rotate_x_tick_labels = waiver(),
  show_alpha = TRUE,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_confusion_matrix(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  rotate_x_tick_labels = waiver(),
  show_alpha = TRUE,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData objects,
that will be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel objects
together with the data from which data is computed
prior to export. Paths to such files can also be provided.
 | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created confusion
matrices are saved to. Output is saved in the performance subdirectory.
If NULL, no figures are saved, but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette used to colour the confusion
matrix. The colour depends on whether each cell of the confusion matrix is
on the diagonal (observed outcome matched expected outcome) or not. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| rotate_x_tick_labels | (optional) Rotate tick labels on the x-axis by
90 degrees. Defaults to TRUE. Rotation of x-axis tick labels may also be
controlled through the ggtheme. In this case, FALSE should be provided
explicitly. | 
| show_alpha | (optional) Interpreting confusion matrices is made easier
by setting the opacity of the cells. show_alpha takes the following
values: 
 none: Cell opacity is not altered. Diagonal and off-diagonal cells are
completely opaque and transparent, respectively. Same as show_alpha=FALSE.
 by_class: Cell opacity is normalised by the number of instances for each
observed outcome class in each confusion matrix.
 by_matrix (default): Cell opacity is normalised by the number of
instances in the largest observed outcome class in each confusion matrix.
Same as show_alpha=TRUE.
 by_figure: Cell opacity is normalised by the number of instances in the
largest observed outcome class across confusion matrices in different
facets.
 by_all: Cell opacity is normalised by the number of instances in the
largest observed outcome class across all confusion matrices.
 | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave, extract_confusion_matrix 
 familiar_data_names: Names of the dataset(s). Only used if the object
parameter is one or more familiarData objects.
 collection_name: Name of the collection.
 device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
 create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix. | 
Details
This function generates confusion matrix plots.
Available splitting variables are: fs_method, learner and data_set.
By default, the data is split by fs_method and learner, with faceting
by data_set.
Available palettes for discrete_palette are those listed by
grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals()
(requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors,
topo.colors and cm.colors, which correspond to the palettes of the same
name in grDevices. If not specified, a default palette based on palettes
in Tableau is used. You may also specify your own palette by using colour
names listed by grDevices::colors() or through hexadecimal RGB strings.
Labelling methods such as set_fs_method_names or set_data_set_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
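The show_alpha normalisation options can be illustrated with a small base R sketch. This is not familiar's internal implementation: the toy matrix, the variable names, and the assumption that rows hold the observed outcome classes are all illustrative. It only shows how by_class and by_matrix would scale the same cell counts differently.

```r
# Toy confusion matrix. Assumption: rows are observed classes,
# columns are expected (predicted) classes.
cm <- matrix(
  c(40, 10,
     5, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(observed = c("a", "b"), expected = c("a", "b"))
)

# Number of instances per observed outcome class (a: 50, b: 25).
class_totals <- rowSums(cm)

# by_class: each cell is divided by the total of its own observed class.
alpha_by_class <- cm / class_totals

# by_matrix: each cell is divided by the total of the largest observed class.
alpha_by_matrix <- cm / max(class_totals)
```

With by_class, the diagonal cell of the smaller class "b" is as opaque as that of class "a" (both 0.8 here), whereas by_matrix keeps it fainter (0.4) because class "b" has fewer instances.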
Value
NULL, or a list of plot objects if dir_path is NULL.
Plot decision curves.
Description
This method creates decision curves based on data in a
familiarCollection object.
Usage
plot_decision_curve(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_decision_curve(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_decision_curve(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData objects,
that will be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel objects
together with the data from which data is computed prior to export. Paths to
such files can also be provided.
 | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created decision
curve plots are saved to. Output is saved in the decision_curve_analysis
subdirectory. If NULL, no figures are saved, but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette to use to color the different
plot elements in case a value was provided to the color_by argument. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_range | (optional) Value range for the x-axis. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| y_range | (optional) Value range for the y-axis. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| conf_int_style | (optional) Confidence interval style. See details for
allowed styles. | 
| conf_int_alpha | (optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave, extract_decision_curve_data 
 familiar_data_names: Names of the dataset(s). Only used if the object
parameter is one or more familiarData objects.
 collection_name: Name of the collection.
 device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
 create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix.
 estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci",
"model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance,
permutation_vimp, ice_data, and prediction_data.
 aggregate_results: (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is
bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if
estimation_type is point.
 The default value is TRUE except when assessing metrics to assess
model performance, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default
aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g.
list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
 confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to determine
the confidence intervals, familiar uses the rule of thumb
n = 20 / ci.level to determine the number of required bootstraps. The
default value is 0.95.
 bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
Details
This function generates plots for decision curves.
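As background, a decision curve plots net benefit against the threshold probability. The following is a minimal base R sketch of the net benefit as defined by Vickers and Elkin (2006), using made-up outcomes and predicted risks. The function name net_benefit and the toy data are illustrative assumptions, and the sketch is not necessarily how familiar computes decision curve data internally.

```r
# Net benefit at threshold probability pt:
#   TP / n - FP / n * pt / (1 - pt)
# where samples with predicted risk >= pt are considered for intervention.
net_benefit <- function(outcome, predicted_risk, thresholds) {
  n <- length(outcome)
  vapply(thresholds, function(pt) {
    treated <- predicted_risk >= pt
    true_positives <- sum(treated & outcome == 1)
    false_positives <- sum(treated & outcome == 0)
    true_positives / n - false_positives / n * pt / (1 - pt)
  }, numeric(1))
}

# Toy data (hypothetical): 1 marks the positive class.
outcome <- c(1, 1, 0, 0, 1, 0, 0, 0)
predicted_risk <- c(0.9, 0.7, 0.6, 0.2, 0.4, 0.1, 0.3, 0.05)
thresholds <- c(0.1, 0.25, 0.5)

# Net benefit of the model and of the "intervention for all" reference.
nb_model <- net_benefit(outcome, predicted_risk, thresholds)
nb_treat_all <- net_benefit(outcome, rep(1, length(outcome)), thresholds)
```

At a threshold of 0.5 the toy model has a net benefit of 0.125, compared to -0.25 for intervening on every sample. A model is useful at a given threshold when its curve lies above both this reference and the zero line of no intervention.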
Available splitting variables are: fs_method, learner, data_set and
positive_class (categorical outcomes) or evaluation_time (survival
outcomes). By default, the data is split by fs_method and learner, with
faceting by data_set and colouring by positive_class or
evaluation_time.
Available palettes for discrete_palette are those listed by
grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals()
(requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors,
topo.colors and cm.colors, which correspond to the palettes of the same
name in grDevices. If not specified, a default palette based on palettes
in Tableau is used. You may also specify your own palette by using colour
names listed by grDevices::colors() or through hexadecimal RGB strings.
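The palette sources named above can be inspected in base R before choosing a value for discrete_palette. The sketch below only uses grDevices. The specific palettes ("Okabe-Ito", "Viridis") and the custom colours are arbitrary illustrative choices, not defaults of familiar.

```r
# Three colours from a palette listed by grDevices::palette.pals().
okabe_ito <- grDevices::palette.colors(n = 3, palette = "Okabe-Ito")

# Three colours from a palette listed by grDevices::hcl.pals().
viridis <- grDevices::hcl.colors(n = 3, palette = "Viridis")

# A custom palette that mixes colour names from grDevices::colors()
# with a hexadecimal RGB string.
custom_palette <- c("steelblue", "#E69F00", "firebrick")
```

A palette name such as "Okabe-Ito", or a vector such as custom_palette, is the kind of value the text above describes for discrete_palette.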
Bootstrap confidence intervals of the decision curve (if present) can be
shown using various styles set by conf_int_style:
-  ribbon (default): confidence intervals are shown as a ribbon with an
opacity of conf_int_alpha around the point estimate of the decision
curve.
 
-  step: confidence intervals are shown as a step function around
the point estimate of the decision curve.
 
-  none: confidence intervals are not shown. The point estimate of the
decision curve is shown as usual.
 
Labelling methods such as set_fs_method_names or set_data_set_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
NULL, or a list of plot objects if dir_path is NULL.
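A call might look like the sketch below. It only uses arguments from the usage section above. The wrapper name plot_step_decision_curves is hypothetical, and collection is assumed to be a familiarCollection object created elsewhere.

```r
# Hypothetical wrapper around plot_decision_curve. `collection` is assumed
# to be an existing familiarCollection object.
plot_step_decision_curves <- function(collection) {
  familiar::plot_decision_curve(
    object = collection,
    draw = FALSE,             # do not draw the plots directly
    dir_path = NULL,          # return plot objects instead of saving them
    facet_by = "data_set",    # one facet per data set
    conf_int_style = "step",  # show confidence intervals as step functions
    conf_int_alpha = 0.4
  )
}
```

Because dir_path is NULL, the wrapper would return the list of plot objects rather than writing figures to disk.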
References
-  Vickers, A. J. & Elkin, E. B. Decision curve analysis: a novel
method for evaluating prediction models. Med. Decis. Making 26, 565–574
(2006).
 
-  Vickers, A. J., Cronin, A. M., Elkin, E. B. & Gonen, M. Extensions to
decision curve analysis, a novel method for evaluating diagnostic tests,
prediction models and molecular markers. BMC Med. Inform. Decis. Mak. 8, 53
(2008).
 
-  Vickers, A. J., van Calster, B. & Steyerberg, E. W. A simple,
step-by-step guide to interpreting decision curve analysis. Diagn Progn Res
3, 18 (2019).
 
Plot heatmaps for pairwise similarity between features.
Description
This method creates a heatmap based on data stored in a
familiarCollection object. Features in the heatmap are ordered so that
more similar features appear together.
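The idea of ordering features by similarity can be sketched in base R with hierarchical clustering. This is an illustration on simulated data under stated assumptions (absolute Spearman correlation as similarity, average linkage, a fixed cut at a similarity of 0.9), not familiar's internal procedure.

```r
set.seed(1)
n <- 200
a <- stats::rnorm(n)
b <- stats::rnorm(n)

# Simulated features: x1 and x2 are near-duplicates, as are x3 and x4.
features <- data.frame(
  x1 = a,
  x3 = b,
  x2 = a + stats::rnorm(n, sd = 0.1),
  x4 = b + stats::rnorm(n, sd = 0.1)
)

# Pairwise similarity, converted to a distance for clustering.
similarity <- abs(stats::cor(features, method = "spearman"))
tree <- stats::hclust(stats::as.dist(1 - similarity), method = "average")

# Dendrogram order places similar features next to each other.
ordered_features <- tree$labels[tree$order]

# Fixed cut: features with similarity above 0.9 end up in the same cluster.
clusters <- stats::cutree(tree, h = 1 - 0.9)
```

Here x1 and x2 become neighbours in ordered_features and share a cluster, as do x3 and x4, which mirrors how the heatmap groups similar features.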
Usage
plot_feature_similarity(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  y_range = NULL,
  y_n_breaks = 3,
  y_breaks = NULL,
  rotate_x_tick_labels = waiver(),
  show_dendrogram = c("top", "right"),
  dendrogram_height = grid::unit(1.5, "cm"),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_feature_similarity(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  y_range = NULL,
  y_n_breaks = 3,
  y_breaks = NULL,
  rotate_x_tick_labels = waiver(),
  show_dendrogram = c("top", "right"),
  dendrogram_height = grid::unit(1.5, "cm"),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_feature_similarity(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  y_range = NULL,
  y_n_breaks = 3,
  y_breaks = NULL,
  rotate_x_tick_labels = waiver(),
  show_dendrogram = c("top", "right"),
  dendrogram_height = grid::unit(1.5, "cm"),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| feature_cluster_method | The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none,
hclust, agnes, diana and pam.
 none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the
cluster_linkage_method configuration parameter: average, single, complete,
weighted, and ward.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_cluster_cut_method | The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and
dynamic_cut.
 silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and
diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_similarity_threshold | The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut
method.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created feature
similarity plots are saved to. Output is saved in the feature_similarity
subdirectory. If NULL, no figures are saved, but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| gradient_palette | (optional) Sequential or divergent palette used to
colour the similarity or distance between features in a heatmap. | 
| gradient_palette_range | (optional) Numerical range used to span the
gradient. This should be a range of two values, e.g. c(0, 1). Lower or
upper boundary can be unset by using NA. If not set, the full
metric-specific range is used. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| x_label_shared | (optional) Sharing of x-axis labels between facets.
One of three values:
 
 overall: A single label is placed at the bottom of the figure. Tick
text (but not the ticks themselves) is removed for all but the bottom facet
plot(s).
 column: A label is placed at the bottom of each column. Tick text (but
not the ticks themselves) is removed for all but the bottom facet plot(s).
 individual: A label is placed below each facet plot. Tick text is kept.
 | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| y_label_shared | (optional) Sharing of y-axis labels between facets.
One of three values:
 
 overall: A single label is placed to the left of the figure. Tick text
(but not the ticks themselves) is removed for all but the left-most facet
plot(s).
 row: A label is placed to the left of each row. Tick text (but not the
ticks themselves) is removed for all but the left-most facet plot(s).
 individual: A label is placed to the left of each facet plot. Tick text is
kept.
 | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| y_range | (optional) Value range for the y-axis. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| rotate_x_tick_labels | (optional) Rotate tick labels on the x-axis by
90 degrees. Defaults to TRUE. Rotation of x-axis tick labels may also be
controlled through the ggtheme. In this case, FALSE should be provided
explicitly. | 
| show_dendrogram | (optional) Show dendrogram around the main panel.
Can be TRUE, FALSE, NULL, or a position, i.e. top, bottom, left and
right. Up to two positions may be provided, but only as long as the
dendrograms are not on opposite sides of the heatmap: top and bottom,
and left and right cannot be used together.
 A dendrogram can only be drawn from cluster methods that produce
dendrograms, such as hclust. A dendrogram can for example not be
constructed using the partitioning around medoids method (pam).
 By default, a dendrogram is drawn to the top and right of the panel. | 
| dendrogram_height | (optional) Height of the dendrogram. The height is
1.5 cm by default. Height is expected to be a grid unit (see grid::unit),
which also allows for specifying relative heights. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave, extract_feature_similarity 
 familiar_data_names: Names of the dataset(s). Only used if the object
parameter is one or more familiarData objects.
 collection_name: Name of the collection.
 device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
 create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 feature_similarity_metric: Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2,
spearman, kendall and pearson.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci",
"model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance,
permutation_vimp, ice_data, and prediction_data.
 aggregate_results: (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is
bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if
estimation_type is point.
 The default value is TRUE except when assessing metrics to assess
model performance, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default
aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g.
list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
 confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to determine
the confidence intervals, familiar uses the rule of thumb
n = 20 / ci.level to determine the number of required bootstraps. The
default value is 0.95.
 bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
Details
This function generates area under the ROC curve plots.
Available splitting variables are: fs_method, learner, and data_set.
By default, the data is split by fs_method and learner, with facetting
by data_set.
Note that similarity is determined based on the underlying data. Hence the
ordering of features may differ between facets, and tick labels are
maintained for each panel.
Available palettes for gradient_palette are those listed by
grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals()
(requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors,
topo.colors and cm.colors, which correspond to the palettes of the same
name in grDevices. If not specified, a default palette based on palettes
in Tableau is used. You may also specify your own palette by using colour
names listed by grDevices::colors() or through hexadecimal RGB strings.
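The palette options described above can be explored with base R alone. The following is a minimal sketch, not part of familiar: `is_custom_palette` is a hypothetical helper that only checks whether a vector consists of colour names known to grDevices::colors() or hexadecimal RGB strings, which is the form of custom palette the text describes.

```r
# Palette names known to grDevices (palette.pals() requires R >= 4.0.0,
# hcl.pals() requires R >= 3.6.0).
named_palettes <- c(grDevices::palette.pals(), grDevices::hcl.pals())

# Hypothetical helper (an assumption for illustration, not a familiar
# function): TRUE if every entry is a named colour or a hex RGB(A) string.
is_custom_palette <- function(x) {
  is_named <- x %in% grDevices::colors()
  is_hex <- grepl("^#[0-9A-Fa-f]{6}([0-9A-Fa-f]{2})?$", x)
  all(is_named | is_hex)
}

# A custom palette mixing colour names and hexadecimal RGB strings.
custom_palette <- c("steelblue", "#FFFFFF", "firebrick")

# Five colours drawn from one of the named sequential HCL palettes.
viridis_colours <- grDevices::hcl.colors(n = 5, palette = "Viridis")
```

Either a name from `named_palettes` or a vector such as `custom_palette` is the kind of value the gradient_palette argument is described as accepting.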
Labelling methods such as set_fs_method_names or set_data_set_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
NULL, or a list of plot objects if dir_path is NULL.
Plot individual conditional expectation plots.
Description
This method creates individual conditional expectation plots
based on data in a familiarCollection object.
Usage
plot_ice(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = NULL,
  plot_sub_title = NULL,
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  novelty_range = NULL,
  value_scales = waiver(),
  novelty_scales = waiver(),
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  ice_default_alpha = 0.6,
  n_max_samples_shown = 50L,
  show_ice = TRUE,
  show_pd = TRUE,
  show_novelty = TRUE,
  anchor_values = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_ice(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = NULL,
  plot_sub_title = NULL,
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  novelty_range = NULL,
  value_scales = waiver(),
  novelty_scales = waiver(),
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  ice_default_alpha = 0.6,
  n_max_samples_shown = 50L,
  show_ice = TRUE,
  show_pd = TRUE,
  show_novelty = TRUE,
  anchor_values = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_ice(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  novelty_range = NULL,
  value_scales = waiver(),
  novelty_scales = waiver(),
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  ice_default_alpha = 0.6,
  n_max_samples_shown = 50L,
  show_ice = TRUE,
  show_pd = TRUE,
  show_novelty = TRUE,
  anchor_values = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData objects, that will be internally converted to a familiarCollection object. It is also possible to provide a familiarEnsemble or one or more familiarModel objects together with the data from which data is computed
prior to export. Paths to such files can also be provided.
 | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created individual
conditional expectation plots are saved to. Output is saved in the
explanation subdirectory. If NULL, figures are not written to the folder,
but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette to use to colour the different
plot elements in case a value was provided to the color_by argument. For
2D individual conditional expectation plots without novelty, the initial
colour determines the colour of the points indicating sample values. | 
| gradient_palette | (optional) Sequential or divergent palette used to
colour the raster in 2D individual conditional expectation or partial
dependence plots. This argument is not used for 1D plots. | 
| gradient_palette_range | (optional) Numerical range used to span the
gradient for 2D plots. This should be a range of two values, e.g. c(0, 1).
By default, values are determined from the data, dependent on the
value_scales parameter. This parameter is ignored for 1D plots. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_range | (optional) Value range for the x-axis. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| y_range | (optional) Value range for the y-axis. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| novelty_range | (optional) Numerical range used to span the range of
novelty values. This determines the size of the bubbles in 2D, and
transparency of lines in 1D. This should be a range of two values, e.g.
c(0, 1). By default, values are determined from the data, dependent on
the value_scales parameter. This parameter is ignored if
show_novelty=FALSE. | 
| value_scales | (optional) Sets scaling of predicted values. This
parameter has several options:
 
 fixed (default): The value axis for all features will have the same
range.
 feature: The value axis for each feature will have the same range. This
option is unavailable for 2D plots.
 figure: The value axis for all facets in a figure will have the same
range.
 facet: Each facet has its own range. This option is unavailable for 2D
plots.
 For 1D plots, this option is ignored if the y_range is provided, whereas
for 2D it is ignored if the gradient_palette_range is provided. | 
| novelty_scales | (optional) Sets scaling of novelty values, similar to
the value_scales parameter, but with more limited options: | 
| conf_int_style | (optional) Confidence interval style. See details for
allowed styles. | 
| conf_int_alpha | (optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed. | 
| ice_default_alpha | (optional) Default transparency (value) of sample
lines in a 1D plot. When novelty is shown, this is the transparency
corresponding to the least novel points. The confidence interval alpha
value is scaled by this value. | 
| n_max_samples_shown | (optional) Maximum number of samples shown in an
individual conditional expectation plot. Defaults to 50. These samples are
randomly picked from the samples present in the ICE data, but the same
samples are consistently picked. Partial dependence is nonetheless computed
from all available samples. | 
| show_ice | (optional) Sets whether individual conditional expectation
plots should be created. | 
| show_pd | (optional) Sets whether partial dependence plots should be
created. Note that if an anchor is set for a particular feature, its
partial dependence cannot be shown. | 
| show_novelty | (optional) Sets whether novelty is shown in plots. | 
| anchor_values | (optional) A single value or a named list or array of
values that are used to centre the individual conditional expectation plot.
A single value is valid if and only if only a single feature is assessed.
Otherwise, values should be provided as a named list or array. Has no
effect if the plot is not shown, i.e. show_ice=FALSE. A partial dependence
plot cannot be shown for features with an anchor value. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to export_ice_data, ggplot2::ggsave, extract_ice
aggregate_results: Flag that signifies whether results should be
aggregated for export.
device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
scale: Multiplicative scaling factor.
dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
features: Names of the feature or features (2) assessed simultaneously.
By default NULL, which means that all features are assessed one-by-one.
feature_x_range: When one or two features are defined using features,
feature_x_range can be used to set the range of values for the first
feature. For numeric features, a vector of two values is assumed to indicate
a range from which n_sample_points are uniformly sampled. A vector of more
than two values is interpreted as is, i.e. these represent the values to be
sampled. For categorical features, values should represent a (sub)set of
available levels.
feature_y_range: As feature_x_range, but for the second feature in
case two features are defined.
n_sample_points: Number of points used to sample continuous features.
data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
sample_limit: (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can
be set for the following data elements: sample_similarity and ice_data.
detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble,
which in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix.
estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g.
list("auc_data"="bci", "model_performance"="point"). This parameter can be
set for the following data elements: auc_data, decision_curve_analyis,
model_performance, permutation_vimp, ice_data, and prediction_data.
confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. When bootstraps are used to determine
the confidence intervals, familiar uses the rule of thumb
n = 20 / ci.level to determine the number of required bootstraps. The
default value is 0.95.
bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
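The named-list convention described for detail_level, estimation_type and sample_limit can be sketched in base R as follows. `resolve_setting` is a hypothetical helper written for illustration only; it is an assumption about how such a lookup could work and not a function in familiar.

```r
# Data elements that the text lists as accepting a per-element
# estimation_type (identifiers copied verbatim from the documentation).
estimation_elements <- c(
  "auc_data", "decision_curve_analyis", "model_performance",
  "permutation_vimp", "ice_data", "prediction_data"
)

# Hypothetical helper (not part of familiar): return the value that applies
# to one data element, falling back to a default if it is not listed.
resolve_setting <- function(setting, element, default) {
  # A single, unnamed value applies to every data element.
  if (!is.list(setting)) return(setting)

  # Reject names that are not known data elements.
  unknown <- setdiff(names(setting), estimation_elements)
  if (length(unknown) > 0) {
    stop("unknown data element: ", paste(unknown, collapse = ", "))
  }

  if (is.null(setting[[element]])) default else setting[[element]]
}

# Per-element specification, as in the example from the text.
estimation_type <- list("auc_data" = "bci", "model_performance" = "point")

ice_type <- resolve_setting(estimation_type, "ice_data", default = "bci")
performance_type <- resolve_setting(estimation_type, "model_performance", default = "bci")
```

Elements that are absent from the list, such as ice_data here, keep the default, so only the evaluation steps that need a different setting have to be named.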
Details
This function generates individual conditional expectation plots.
These plots come in two varieties, namely 1D and 2D. 1D plots show the
predicted value as function of a single feature, whereas 2D plots show the
predicted value as a function of two features.
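To make the relation between individual conditional expectation and partial dependence concrete, the following is a minimal base R sketch of a 1D case. It uses a linear model on the built-in mtcars data as a stand-in for a trained model; it illustrates the general technique and is not how familiar computes these values internally.

```r
# Stand-in model; any model with a predict() method would serve.
model <- stats::lm(mpg ~ wt + hp, data = mtcars)

# Grid of values for the feature of interest (the role played by feature_x).
wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 10)

# One ICE curve per sample: vary wt over the grid while keeping all other
# features of that sample fixed. Rows are samples, columns are grid points.
ice <- sapply(wt_grid, function(value) {
  data <- mtcars
  data$wt <- value
  stats::predict(model, newdata = data)
})

# Partial dependence is the average of the ICE curves at each grid point.
partial_dependence <- colMeans(ice)

# Anchoring centres each curve on its own prediction at a chosen value,
# here the first grid point, so that all curves start at zero.
anchored <- ice - ice[, 1]
```

Plotting each row of `ice` against `wt_grid` gives the sample lines of a 1D plot, and `partial_dependence` gives the averaged curve drawn on top of them.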
Available splitting variables are: feature_x, feature_y (2D only),
fs_method, learner, data_set and positive_class (categorical
outcomes) or evaluation_time (survival outcomes). By default, for 1D ICE
plots the data are split by feature_x, fs_method and learner, with
faceting by data_set, positive_class or evaluation_time. If only
partial dependence is shown, positive_class and evaluation_time are
used to set colours instead. For 2D plots, by default the data are split by
feature_x, fs_method and learner, with faceting by data_set,
positive_class or evaluation_time. The color_by argument cannot be
used with 2D plots, and attempting to do so causes an error. Attempting to
specify feature_x or feature_y for color_by will likewise result in
an error, as multiple features cannot be shown in the same facet.
The splitting variables indicated by color_by are coloured according to
the discrete_palette parameter. This parameter is therefore only used for
1D plots. Available palettes for discrete_palette and gradient_palette
are those listed by grDevices::palette.pals() (requires R >= 4.0.0),
grDevices::hcl.pals() (requires R >= 3.6.0) and rainbow, heat.colors,
terrain.colors, topo.colors and cm.colors, which correspond to the
palettes of the same name in grDevices. If not specified, a default
palette based on palettes in Tableau is used. You may also specify your
own palette by using colour names listed by grDevices::colors() or
through hexadecimal RGB strings.
Bootstrap confidence intervals of the partial dependence plots can be shown
using various styles set by conf_int_style:
-  ribbon (default): confidence intervals are shown as a ribbon with an
opacity of conf_int_alpha around the point estimate of the partial
dependence.
 
-  step: confidence intervals are shown as a step function around
the point estimate of the partial dependence.
 
-  none: confidence intervals are not shown. The point estimate of the
partial dependence is shown as usual.
 
Note that when bootstrap confidence intervals were computed, they were also
computed for individual samples in individual conditional expectation
plots. To avoid clutter, only point estimates for individual samples are
shown.
Labelling methods such as set_fs_method_names or set_data_set_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
NULL, or a list of plot objects if dir_path is NULL.
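A call to plot_ice might look like the sketch below. It only uses arguments listed in the usage and arguments sections above. The wrapper `draw_ice_for_feature` and its `collection` and `feature` arguments are hypothetical, and `collection` is assumed to be a familiarCollection object that was created elsewhere.

```r
# Hypothetical wrapper for illustration; `collection` is assumed to be an
# existing familiarCollection object and `feature` the name of a feature.
draw_ice_for_feature <- function(collection, feature) {
  familiar::plot_ice(
    object = collection,
    features = feature,          # passed on through the ... argument
    draw = TRUE,                 # draw the plot
    dir_path = NULL,             # do not write files; return plot objects
    show_novelty = FALSE,        # hide novelty in the plot
    n_max_samples_shown = 25L,   # show at most 25 sample lines
    conf_int_style = "ribbon"    # ribbon-style confidence interval
  )
}
```

Because dir_path is NULL in this sketch, the result would be a list of plot objects rather than files in the explanation subdirectory.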
Plot Kaplan-Meier survival curves.
Description
This function creates Kaplan-Meier survival curves from
stratification data stored in a familiarCollection object.
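As background for what this function plots, the sketch below computes Kaplan-Meier estimates, a log-rank test and a survival table directly with the survival package. It uses the lung dataset shipped with survival, with sex standing in for a risk group. This is an illustration of the underlying quantities under those assumptions, not familiar's own stratification code.

```r
library(survival)

# Example data shipped with the survival package; sex stands in for a
# risk group here.
lung_data <- survival::lung

# Kaplan-Meier estimate of the survival function for each stratum.
km_fit <- survfit(Surv(time, status) ~ sex, data = lung_data)

# Log-rank test for a difference in survival between the strata.
logrank <- survdiff(Surv(time, status) ~ sex, data = lung_data)
p_value <- stats::pchisq(
  logrank$chisq,
  df = length(logrank$n) - 1,
  lower.tail = FALSE
)

# Survival in each stratum at a set of time points, comparable to a
# survival table evaluated at the x-axis breaks.
survival_table <- summary(km_fit, times = c(0, 250, 500, 750))
```

Calling `plot(km_fit)` draws the stepwise curves, and `survival_table` holds the number at risk and survival probability per stratum at each requested time.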
Usage
plot_kaplan_meier(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  linetype_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  combine_legend = TRUE,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = "time",
  x_label_shared = "column",
  y_label = "survival probability",
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = c(0, 1),
  y_n_breaks = 5,
  y_breaks = NULL,
  confidence_level = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  censoring = TRUE,
  censor_shape = "plus",
  show_logrank = TRUE,
  show_survival_table = TRUE,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_kaplan_meier(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  linetype_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  combine_legend = TRUE,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = "time",
  x_label_shared = "column",
  y_label = "survival probability",
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = c(0, 1),
  y_n_breaks = 5,
  y_breaks = NULL,
  confidence_level = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  censoring = TRUE,
  censor_shape = "plus",
  show_logrank = TRUE,
  show_survival_table = TRUE,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_kaplan_meier(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  linetype_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  combine_legend = TRUE,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = "time",
  x_label_shared = "column",
  y_label = "survival probability",
  y_label_shared = "row",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = c(0, 1),
  y_n_breaks = 5,
  y_breaks = NULL,
  confidence_level = NULL,
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  censoring = TRUE,
  censor_shape = "plus",
  show_logrank = TRUE,
  show_survival_table = TRUE,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData objects, that will be internally converted to a familiarCollection object. It is also possible to provide a familiarEnsemble or one or more familiarModel objects together with the data from which data is computed
prior to export. Paths to such files can also be provided.
 | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created figures are
saved to. Output is saved in the stratification subdirectory. If NULL,
no figures are saved, but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables. | 
| linetype_by | (optional) Variables that are used to determine the
linetype of lines in a plot. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| combine_legend | (optional) Flag to indicate whether the same legend
is to be shared by multiple aesthetics, such as those specified by
color_by and linetype_by arguments. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette to use to colour the different
risk strata in case a non-singular variable was provided to the color_by
argument. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| x_label_shared | (optional) Sharing of x-axis labels between facets.
One of three values:
 
 overall: A single label is placed at the bottom of the figure. Tick
text (but not the ticks themselves) is removed for all but the bottom facet
plot(s).
 column: A label is placed at the bottom of each column. Tick text (but
not the ticks themselves) is removed for all but the bottom facet plot(s).
 individual: A label is placed below each facet plot. Tick text is kept.
 | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| y_label_shared | (optional) Sharing of y-axis labels between facets.
One of three values:
 
 overall: A single label is placed to the left of the figure. Tick text
(but not the ticks themselves) is removed for all but the left-most facet
plot(s).
 row: A label is placed to the left of each row. Tick text (but not the
ticks themselves) is removed for all but the left-most facet plot(s).
 individual: A label is placed to the left of each facet plot. Tick text is kept.
 | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_range | (optional) Value range for the x-axis. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| y_range | (optional) Value range for the y-axis. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| confidence_level | (optional) Confidence level for the strata in the
plot. | 
| conf_int_style | (optional) Confidence interval style. See details for
allowed styles. | 
| conf_int_alpha | (optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed. | 
| censoring | (optional) Flag to indicate whether censored samples
should be indicated on the survival curve. | 
| censor_shape | (optional) Shape used to indicate censored samples on
the survival curve. Available shapes are documented in the ggplot2
vignette Aesthetic specifications. By default a plus shape is used. | 
| show_logrank | (optional) Specifies whether the results of a log-rank
test to assess differences between the risk strata are annotated in the
plot. A log-rank test can only be shown when color_by and linetype_by
are either unset, or only contain risk_group. | 
| show_survival_table | (optional) Specifies whether a survival table is
shown below the Kaplan-Meier survival curves. Survival in the risk strata
is assessed for each of the breaks in x_breaks. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from number of facets and the inclusion of survival tables. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave, extract_risk_stratification_data
 familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
 collection_name: Name of the collection.
 device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
 create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix. | 
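A per-element detail_level specification, as described for the detail_level argument, can be pictured with a small base R sketch. The helper resolve_detail_level below is hypothetical and not part of familiar; it only illustrates how a named list might be expanded into one setting per data element, with hybrid as the fallback. The element and level names are taken from the argument description.

```r
# Hypothetical helper (not part of familiar): expand a detail_level
# specification into one setting per data element.
resolve_detail_level <- function(detail_level = "hybrid") {
  # Data elements for which detail_level can be set, as documented.
  elements <- c(
    "auc_data", "decision_curve_analyis", "model_performance",
    "permutation_vimp", "ice_data", "prediction_data", "confusion_matrix"
  )
  allowed <- c("ensemble", "hybrid", "model")

  # Start from the documented default for every element.
  resolved <- stats::setNames(rep("hybrid", length(elements)), elements)

  if (is.list(detail_level)) {
    # Named list: override only the elements that are mentioned.
    unknown <- setdiff(names(detail_level), elements)
    if (length(unknown) > 0) stop("Unknown data element: ", unknown[1])
    resolved[names(detail_level)] <- unlist(detail_level)
  } else {
    # Single value: apply it to all elements.
    resolved[] <- detail_level
  }

  if (!all(resolved %in% allowed)) stop("Unknown detail level.")
  resolved
}

detail_levels <- resolve_detail_level(
  list("auc_data" = "ensemble", "model_performance" = "hybrid")
)
print(detail_levels)
```

Elements that are not named in the list keep the default, which matches the idea of a "non-default" value being set only for separate evaluation steps.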
Details
This function generates a Kaplan-Meier survival plot based on risk
group stratification by the learners.
familiar does not determine what units the x-axis has or what kind of
survival the y-axis represents. It is therefore recommended to provide
x_label and y_label arguments.
Available splitting variables are: fs_method, learner, data_set,
risk_group and stratification_method. By default, separate figures are
created for each combination of fs_method and learner, with faceting by
data_set and colouring of the strata in each individual plot by
risk_group.
Available palettes for discrete_palette are those listed by
grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals()
(requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors,
topo.colors and cm.colors, which correspond to the palettes of the same
name in grDevices. If not specified, a default palette based on palettes
in Tableau is used. You may also specify your own palette by using colour
names listed by grDevices::colors() or through hexadecimal RGB strings.
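The palette sources named above can be inspected directly in base R. The sketch below lists the available palette names and builds a custom palette from colour names and hexadecimal RGB strings. That such a character vector can be passed unchanged as discrete_palette is an assumption based on the description above, and the variable names are illustrative only.

```r
# Names of built-in palettes that could be passed as discrete_palette.
head(grDevices::palette.pals())  # requires R >= 4.0.0
head(grDevices::hcl.pals())      # requires R >= 3.6.0

# Colours behind two of these palettes, for inspection only.
okabe_ito <- grDevices::palette.colors(n = 4, palette = "Okabe-Ito")
viridis <- grDevices::hcl.colors(n = 4, palette = "Viridis")

# A custom palette mixing named colours and hexadecimal RGB strings.
custom_palette <- c("steelblue", "firebrick", "#E69F00", "#009E73")

# col2rgb() raises an error for anything it cannot interpret as a
# colour, so this doubles as a validity check of the custom palette.
rgb_values <- grDevices::col2rgb(custom_palette)
print(rgb_values)
```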
Greenwood confidence intervals of the Kaplan-Meier curve can be shown using
various styles set by conf_int_style:
-  ribbon (default): confidence intervals are shown as a ribbon with an
opacity of conf_int_alpha around the point estimate of the Kaplan-Meier
curve.
 
-  step: confidence intervals are shown as a step function around
the point estimate of the Kaplan-Meier curve.
 
-  none: confidence intervals are not shown. The point estimate of the
Kaplan-Meier curve is shown as usual.
 
Labelling methods such as set_risk_group_names or set_data_set_names
can be applied to the familiarCollection object to update labels, and
order the output in the figure.
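To make the Greenwood confidence intervals mentioned in the details more concrete, the following base R sketch computes a Kaplan-Meier estimate and its Greenwood standard error by hand for a small made-up data set. It uses a plain, untransformed normal-approximation interval; whether familiar applies a transformation (e.g. log) before computing the interval is not stated here, so treat this as an illustration of the formula rather than of the package internals.

```r
# Toy survival data: follow-up time and event indicator
# (1 = event, 0 = censored). The values are made up for illustration.
time  <- c(2, 3, 3, 5, 6, 8, 9, 12)
event <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Distinct event times, in increasing order.
event_times <- sort(unique(time[event == 1]))

# Number at risk and number of events at each event time.
n_at_risk <- sapply(event_times, function(u) sum(time >= u))
n_events  <- sapply(event_times, function(u) sum(time == u & event == 1))

# Kaplan-Meier point estimate of the survival probability.
survival <- cumprod(1 - n_events / n_at_risk)

# Greenwood's formula for the standard error of the estimate.
std_error <- survival *
  sqrt(cumsum(n_events / (n_at_risk * (n_at_risk - n_events))))

# Plain (untransformed) 95% confidence interval, clipped to [0, 1].
z <- stats::qnorm(0.975)
km_table <- data.frame(
  time = event_times,
  n_at_risk = n_at_risk,
  n_events = n_events,
  survival = survival,
  lower = pmax(0, survival - z * std_error),
  upper = pmin(1, survival + z * std_error)
)
print(km_table)
```

The lower and upper columns correspond to what the ribbon and step styles would draw around the point estimate.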
Value
NULL, or a list of plot objects if dir_path is NULL.
Description
This method creates plots that show model performance from the
data stored in a familiarCollection object. This method may create several
types of plots, as determined by plot_type.
Usage
plot_model_performance(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  x_axis_by = NULL,
  y_axis_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  plot_type = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = NULL,
  gradient_palette_range = waiver(),
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  rotate_x_tick_labels = waiver(),
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  annotate_performance = NULL,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_model_performance(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  x_axis_by = NULL,
  y_axis_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  plot_type = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = NULL,
  gradient_palette_range = waiver(),
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  rotate_x_tick_labels = waiver(),
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  annotate_performance = NULL,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_model_performance(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  x_axis_by = NULL,
  y_axis_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  plot_type = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = NULL,
  gradient_palette_range = waiver(),
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  rotate_x_tick_labels = waiver(),
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  annotate_performance = NULL,
  export_collection = FALSE,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData objects, that will be internally converted to a familiarCollection object. It is also possible to provide a familiarEnsemble or one or more familiarModel objects together with the data from which data is computed
prior to export. Paths to such files can also be provided.
 | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created performance
plots are saved to. Output is saved in the performance subdirectory. If NULL, no figures are saved, but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| x_axis_by | (optional) Variable plotted along the x-axis of a plot.
The variable cannot overlap with variables provided to the split_by and y_axis_by arguments (if used), but may overlap with other arguments. Only
one variable is allowed for this argument. See details for available
variables. | 
| y_axis_by | (optional) Variable plotted along the y-axis of a plot.
The variable cannot overlap with variables provided to the split_by and x_axis_by arguments (if used), but may overlap with other arguments. Only
one variable is allowed for this argument. See details for available
variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| plot_type | (optional) Type of plot to draw. This is one of heatmap (draws a heatmap), barplot (draws a barplot with confidence intervals), boxplot (draws a boxplot) and violinplot (draws a violin plot).
Defaults to violinplot. The choice for plot_type affects several other arguments, e.g. color_by is not used for heatmap and y_axis_by is only used by heatmap. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette to use to color the different
plot elements in case a value was provided to the color_by argument. Only
used when plot_type is not heatmap. | 
| gradient_palette | (optional) Sequential or divergent palette used to
color the raster in heatmap plots. This argument is not used for other plot_type values. | 
| gradient_palette_range | (optional) Numerical range used to span the
gradient. This should be a range of two values, e.g. c(0, 1). Lower or
upper boundary can be unset by using NA. If not set, the full
metric-specific range is used. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| rotate_x_tick_labels | (optional) Rotate tick labels on the x-axis by
90 degrees. Defaults to TRUE. Rotation of x-axis tick labels may also be
controlled through the ggtheme. In this case, FALSE should be provided
explicitly. | 
| y_range | (optional) Value range for the y-axis. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| annotate_performance | (optional) Indicates whether performance in
heatmaps should be annotated with text. Can be none, value (default),
or value_ci (median value plus 95% credibility intervals). | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to extract_performance, as_familiar_collection, ggplot2::ggsave
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 metric: One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects.
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
 estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
 aggregate_results: (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point. The default value is TRUE, except when assessing metrics to assess
model performance, as the default violin plot requires underlying data. As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
 confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
 bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
 familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
 collection_name: Name of the collection.
 device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
 create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary. | 
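The rule of thumb n = 20 / ci.level given for confidence_level can be worked through in a few lines of base R. The sketch assumes that ci.level denotes the tail mass 1 - confidence_level; this reading is an assumption, not something the argument description states, and the helper n_required_bootstraps is hypothetical rather than a familiar function.

```r
# Hypothetical helper (not part of familiar): number of bootstraps
# suggested by the rule of thumb n = 20 / ci.level.
# Assumption: ci.level is the tail mass 1 - confidence_level.
n_required_bootstraps <- function(confidence_level = 0.95) {
  stopifnot(confidence_level > 0, confidence_level < 1)
  ci_level <- 1 - confidence_level

  # Round before taking the ceiling, so that floating point noise
  # (e.g. 200.00000000000003) does not add a spurious bootstrap.
  ceiling(round(20 / ci_level, digits = 6))
}

# Number of bootstraps for a few common confidence levels.
levels <- c(0.90, 0.95, 0.99)
n_bootstraps <- sapply(levels, n_required_bootstraps)
print(data.frame(confidence_level = levels, n_bootstraps = n_bootstraps))
```

Under this reading, narrower tails require more point estimates, which is consistent with the statement that the number of point estimates depends on confidence_level.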
Details
This function plots model performance based on empirical bootstraps,
using various plot representations.
Available splitting variables are: fs_method, learner, data_set,
evaluation_time (survival outcome only) and metric. The default for
heatmap is to split by metric, facet by data_set and
evaluation_time, position learner along the x-axis and fs_method
along the y-axis. The color_by argument is not used. The only valid
options for x_axis_by and y_axis_by are learner and fs_method.
For other plot types (barplot, boxplot and violinplot), the default depends on
the number of learners and feature selection methods:
-  one feature selection method and one learner: the default is to split by
metric, and have data_set along the x-axis.
 
-  one feature selection method and multiple learners: the default is to split by
metric, facet by data_set and have learner along the x-axis.
 
-  multiple feature selection methods and one learner: the default is to
split by metric, facet by data_set and have fs_method along the
x-axis.
 
-  multiple feature selection methods and learners: the default is to split
by metric, facet by data_set, colour by fs_method and have learner along the x-axis.
 
If applicable, additional faceting is performed for evaluation_time.
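The default layout rules for barplot, boxplot and violinplot listed above can be summarised as a small decision function. The helper default_performance_layout is hypothetical and not part of familiar; it simply encodes the four cases in base R and leaves out the additional faceting by evaluation_time.

```r
# Hypothetical helper (not part of familiar): default layout for
# barplot, boxplot and violinplot, following the four documented cases.
default_performance_layout <- function(n_fs_methods, n_learners) {
  stopifnot(n_fs_methods >= 1, n_learners >= 1)

  if (n_fs_methods == 1 && n_learners == 1) {
    # One feature selection method and one learner.
    list(split_by = "metric", facet_by = NULL,
         color_by = NULL, x_axis_by = "data_set")
  } else if (n_fs_methods == 1) {
    # One feature selection method and multiple learners.
    list(split_by = "metric", facet_by = "data_set",
         color_by = NULL, x_axis_by = "learner")
  } else if (n_learners == 1) {
    # Multiple feature selection methods and one learner.
    list(split_by = "metric", facet_by = "data_set",
         color_by = NULL, x_axis_by = "fs_method")
  } else {
    # Multiple feature selection methods and multiple learners.
    list(split_by = "metric", facet_by = "data_set",
         color_by = "fs_method", x_axis_by = "learner")
  }
}

layout <- default_performance_layout(n_fs_methods = 2, n_learners = 3)
str(layout)
```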
Available palettes for discrete_palette and gradient_palette are those
listed by grDevices::palette.pals() (requires R >= 4.0.0),
grDevices::hcl.pals() (requires R >= 3.6.0) and rainbow, heat.colors,
terrain.colors, topo.colors and cm.colors, which correspond to the
palettes of the same name in grDevices. If not specified, a default
palette based on palettes in Tableau is used. You may also specify your
own palette by using colour names listed by grDevices::colors() or
through hexadecimal RGB strings.
Labeling methods such as set_fs_method_names or set_data_set_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
NULL, or a list of plot objects if dir_path is NULL.
Plot partial dependence.
Description
This method creates partial dependence plots based on data in a
familiarCollection object.
Usage
plot_pd(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  novelty_range = NULL,
  value_scales = waiver(),
  novelty_scales = waiver(),
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  show_novelty = TRUE,
  anchor_values = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_pd(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = NULL,
  gradient_palette_range = NULL,
  x_label = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  novelty_range = NULL,
  value_scales = waiver(),
  novelty_scales = waiver(),
  conf_int_style = c("ribbon", "step", "none"),
  conf_int_alpha = 0.4,
  show_novelty = TRUE,
  anchor_values = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData objects, that will be internally converted to a familiarCollection object. It is also possible to provide a familiarEnsemble or one or more familiarModel objects together with the data from which data is computed
prior to export. Paths to such files can also be provided.
 | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created individual
conditional expectation plots are saved to. Output is saved in the
explanation subdirectory. If NULL, figures are not written to the folder,
but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette to use to colour the different
plot elements in case a value was provided to the color_by argument. For
2D individual conditional expectation plots without novelty, the initial
colour determines the colour of the points indicating sample values. | 
| gradient_palette | (optional) Sequential or divergent palette used to
colour the raster in 2D individual conditional expectation or partial
dependence plots. This argument is not used for 1D plots. | 
| gradient_palette_range | (optional) Numerical range used to span the
gradient for 2D plots. This should be a range of two values, e.g. c(0, 1). By default, values are determined from the data, dependent on the value_scales parameter. This parameter is ignored for 1D plots. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_range | (optional) Value range for the x-axis. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| y_range | (optional) Value range for the y-axis. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| novelty_range | (optional) Numerical range used to span the range of
novelty values. This determines the size of the bubbles in 2D, and
transparency of lines in 1D. This should be a range of two values, e.g.
c(0, 1). By default, values are determined from the data, dependent on
the value_scales parameter. This parameter is ignored if show_novelty=FALSE. | 
| value_scales | (optional) Sets scaling of predicted values. This
parameter has several options:
 
 fixed (default): The value axis for all features will have the same
range.
 feature: The value axis for each feature will have the same range. This
option is unavailable for 2D plots.
 figure: The value axis for all facets in a figure will have the same
range.
 facet: Each facet has its own range. This option is unavailable for 2D
plots.
 For 1D plots, this option is ignored if the y_range is provided, whereas
for 2D it is ignored if the gradient_palette_range is provided. | 
| novelty_scales | (optional) Sets scaling of novelty values, similar to
the value_scales parameter, but with more limited options: | 
| conf_int_style | (optional) Confidence interval style. See details for
allowed styles. | 
| conf_int_alpha | (optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed. | 
| show_novelty | (optional) Sets whether novelty is shown in plots. | 
| anchor_values | (optional) A single value or a named list or array of
values that are used to centre the individual conditional expectation plot.
A single value is valid if and only if only a single feature is assessed.
Otherwise, values Has no effect if the plot is not shown, i.e.
show_ice=FALSE. A partial dependence plot cannot be shown for those
features. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to export_ice_data, ggplot2::ggsave, extract_ice
 aggregate_results: Flag that signifies whether results should be
aggregated for export.
 device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
 create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
 features: Names of the feature or features (2) assessed simultaneously.
By default NULL, which means that all features are assessed one-by-one.
 feature_x_range: When one or two features are defined using features, feature_x_range can be used to set the range of values for the first
feature. For numeric features, a vector of two values is assumed to indicate
a range from which n_sample_points are uniformly sampled. A vector of more
than two values is interpreted as is, i.e. these represent the values to be
sampled. For categorical features, values should represent a (sub)set of
available levels.
 feature_y_range: As feature_x_range, but for the second feature in
case two features are defined.
 n_sample_points: Number of points used to sample continuous features.
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 sample_limit: (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can be set for the following data elements:
sample_similarity and ice_data.
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid(default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast toensemblewhere performance is computed for the
ensemble of models. If there are less than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensembleandmodelthese are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. Forhybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained usinghybridare at least as
wide as those forensemble.hybridoffers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybridis generally computationally less expensive thenensemble, which
in turn is somewhat less expensive thanmodel.
 A non-default detail_levelparameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g.list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements:auc_data,decision_curve_analyis,model_performance,permutation_vimp,ice_data,prediction_dataandconfusion_matrix.estimation_type(optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correctionorbc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, andfamiliarmay
bootstrap the data to create them.
 bootstrap_confidence_intervalorbci(default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on theconfidence_levelparameter, andfamiliarmay bootstrap the data to create them.
 As with detail_level, a non-defaultestimation_typeparameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g.list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements:auc_data,decision_curve_analyis,model_performance,permutation_vimp,ice_data, andprediction_data.confidence_level(optional) Numeric value for the level at which
confidence intervals are determined. In the case bootstraps are used to
determine the confidence intervals bootstrap estimation, familiaruses the
rule of thumbn = 20 / ci.levelto determine the number of required
bootstraps. The default value is 0.95.bootstrap_ci_method(optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
Details
This function generates partial dependence plots. These plots come
in two varieties, namely 1D and 2D. 1D plots show the predicted value as
a function of a single feature, whereas 2D plots show the predicted value as
a function of two features.
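To make the idea behind a 1D partial dependence curve concrete, the following is a minimal base R sketch and not familiar's implementation. The helpers make_grid and partial_dependence are hypothetical names, the model is an ordinary lm fit on the built-in mtcars data, and "uniformly sampled" is assumed here to mean an evenly spaced grid between the two range values.

```r
# Hypothetical helper: turn a feature range into sample points.
# Two values are treated as a range; more than two are used as is.
make_grid <- function(x_range, n_sample_points = 50L) {
  if (length(x_range) == 2L) {
    seq(from = x_range[1], to = x_range[2], length.out = n_sample_points)
  } else {
    x_range
  }
}

# Hypothetical helper: for each grid value, set the feature to that value
# for every sample and average the resulting predictions.
partial_dependence <- function(model, data, feature, grid) {
  vapply(grid, function(value) {
    data[[feature]] <- value
    mean(predict(model, newdata = data))
  }, numeric(1))
}

model <- lm(mpg ~ wt + hp, data = mtcars)
grid <- make_grid(c(2, 5), n_sample_points = 7L)
pd <- partial_dependence(model, mtcars, "wt", grid)

# One predicted value per sample point of the feature.
print(data.frame(wt = grid, predicted_mpg = pd))
```

A 2D plot follows the same pattern with a grid over two features. Individual conditional expectation curves correspond to keeping the per-sample predictions instead of averaging them.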
Available splitting variables are: feature_x, feature_y (2D only),
fs_method, learner, data_set and positive_class (categorical
outcomes) or evaluation_time (survival outcomes). By default, for 1D ICE
plots the data are split by feature_x, fs_method and learner, with
faceting by data_set, positive_class or evaluation_time. If only
partial dependence is shown, positive_class and evaluation_time are
used to set colours instead. For 2D plots, by default the data are split by
feature_x, fs_method and learner, with faceting by data_set,
positive_class or evaluation_time. The color_by argument cannot be
used with 2D plots, and attempting to do so causes an error. Attempting to
specify feature_x or feature_y for color_by will likewise result in
an error, as multiple features cannot be shown in the same facet.
The splitting variables indicated by color_by are coloured according to
the discrete_palette parameter. This parameter is therefore only used for
1D plots. Available palettes for discrete_palette and gradient_palette
are those listed by grDevices::palette.pals() (requires R >= 4.0.0),
grDevices::hcl.pals() (requires R >= 3.6.0) and rainbow, heat.colors,
terrain.colors, topo.colors and cm.colors, which correspond to the
palettes of the same name in grDevices. If not specified, a default
palette based on palettes in Tableau is used. You may also specify your
own palette by using colour names listed by grDevices::colors() or
through hexadecimal RGB strings.
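The palette sources named above can be inspected directly in base R. The sketch below only uses grDevices functions; the choice of the "Okabe-Ito" and "Viridis" palettes and the custom colour vector are arbitrary examples, not defaults of this package.

```r
# Names of available palettes (palette.pals() requires R >= 4.0.0,
# hcl.pals() requires R >= 3.6.0).
print(head(grDevices::palette.pals()))
print(head(grDevices::hcl.pals()))

# Three colours drawn from a named qualitative and a named sequential palette.
qualitative <- grDevices::palette.colors(n = 3, palette = "Okabe-Ito")
sequential <- grDevices::hcl.colors(n = 3, palette = "Viridis")

# A custom palette mixing colour names and hexadecimal RGB strings.
custom <- c("steelblue", "#E69F00", "firebrick")

# Colour names must be among those listed by grDevices::colors().
print(custom[!grepl("^#", custom)] %in% grDevices::colors())
```

Based on the description above, a palette name or a custom vector such as custom would then be supplied through the discrete_palette or gradient_palette argument.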
Bootstrap confidence intervals of the partial dependence plots can be shown
using various styles set by conf_int_style:
-  ribbon (default): confidence intervals are shown as a ribbon with an
opacity of conf_int_alpha around the point estimate of the partial
dependence.
 
-  step: confidence intervals are shown as a step function around
the point estimate of the partial dependence.
 
-  none: confidence intervals are not shown. The point estimate of the
partial dependence is shown as usual.
 
Labelling methods such as set_fs_method_names or set_data_set_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
NULL, or a list of plot objects if dir_path is NULL.
Plot permutation variable importance.
Description
This function plots the data on permutation variable importance
stored in a familiarCollection object.
Usage
plot_permutation_variable_importance(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = "feature",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  conf_int_style = c("point_line", "line", "bar_line", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_permutation_variable_importance(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = "feature",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  conf_int_style = c("point_line", "line", "bar_line", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_permutation_variable_importance(
  object,
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  discrete_palette = NULL,
  x_label = waiver(),
  y_label = "feature",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  conf_int_style = c("point_line", "line", "bar_line", "none"),
  conf_int_alpha = 0.4,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
Arguments
| object | familiarCollection object, or one or more familiarData objects,
that will be internally converted to a familiarCollection object. It is also
possible to provide a familiarEnsemble or one or more familiarModel objects
together with the data from which data is computed
prior to export. Paths to such files can also be provided.
 | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created figures are
saved to. Output is saved in the variable_importance subdirectory. If
NULL no figures are saved, but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette used to fill the bars in case a
non-singular variable was provided to the color_by argument. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_range | (optional) Value range for the x-axis. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| conf_int_style | (optional) Confidence interval style. See details for
allowed styles. | 
| conf_int_alpha | (optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave,
extract_permutation_vimp 
 familiar_data_names: Names of the dataset(s). Only used if the object
parameter is one or more familiarData objects.
 collection_name: Name of the collection.
 device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
 create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
 data: A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
 is_pre_processed: Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
 cl: Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
 evaluation_times: One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
 ensemble_method: Method for ensembling predictions from models for the
same sample. Available methods are:
 metric: One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects.
 feature_cluster_method: The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none,
hclust, agnes, diana and pam.
 none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 feature_linkage_method: The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the
cluster_linkage_method configuration parameter: average, single, complete,
weighted, and ward.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 feature_cluster_cut_method: The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and
dynamic_cut.
 silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and
diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 feature_similarity_threshold: The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut
method.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 feature_similarity_metric: Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2,
spearman, kendall and pearson.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
 verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
 message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements.
 detail_level: (optional) Sets the level at which results are computed
and aggregated.
 
-  ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
-  hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
-  model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
 hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp, ice_data,
prediction_data and confusion_matrix.
 estimation_type: (optional) Sets the type of estimation that should be
possible. This has the following options:
 
-  point: Point estimates.
-  bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
-  bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
 
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g.
list("auc_data"="bci", "model_performance"="point"). This parameter can be
set for the following data elements: auc_data, decision_curve_analyis,
model_performance, permutation_vimp, ice_data, and prediction_data.
 aggregate_results: (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is
bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if
estimation_type is point.
 The default value is equal to TRUE except when assessing metrics to assess
model performance, as the default violin plot requires underlying data.
 As with detail_level and estimation_type, a non-default
aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g.
list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
 confidence_level: (optional) Numeric value for the level at which
confidence intervals are determined. In case bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
 bootstrap_ci_method: (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet. | 
Details
This function generates a horizontal barplot that lists features by
the estimated model improvement over that of a dataset where the respective
feature is randomly permuted.
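The quantity described above can be illustrated with a short base R sketch. This is a generic permutation importance calculation under stated assumptions and not familiar's implementation: the model is a plain lm fit on mtcars, the metric is R squared (higher is better), r_squared and permutation_vimp are hypothetical helper names, and features are permuted one at a time rather than in groups of similar features.

```r
set.seed(42)

model <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical metric helper: R squared of the model on a dataset.
r_squared <- function(data) {
  predicted <- predict(model, newdata = data)
  1 - sum((data$mpg - predicted)^2) / sum((data$mpg - mean(data$mpg))^2)
}

# Performance on the unpermuted data.
baseline <- r_squared(mtcars)

# Hypothetical helper: improvement of the unpermuted data over data in which
# a single feature is randomly permuted, averaged over several permutations.
permutation_vimp <- function(feature, n_repeats = 20L) {
  permuted_scores <- vapply(seq_len(n_repeats), function(i) {
    permuted <- mtcars
    permuted[[feature]] <- sample(permuted[[feature]])
    r_squared(permuted)
  }, numeric(1))
  baseline - mean(permuted_scores)
}

vimp <- vapply(c("wt", "hp"), permutation_vimp, numeric(1))

# Larger values indicate features the model relies on more strongly.
print(sort(vimp, decreasing = TRUE))
```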
The following splitting variables are available for split_by, color_by
and facet_by:
-  fs_method: feature selection methods.
 
-  learner: learners.
 
-  data_set: data sets.
 
-  metric: the model performance metrics.
 
-  evaluation_time: the evaluation times (survival outcomes only).
 
-  similarity_threshold: the similarity threshold used to identify groups
of features to permute simultaneously.
 
By default, the data is split by fs_method, learner and metric,
faceted by data_set and evaluation_time, and coloured by
similarity_threshold.
Available palettes for discrete_palette are those listed by
grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals()
(requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors,
topo.colors and cm.colors, which correspond to the palettes of the same
name in grDevices. If not specified, a default palette based on palettes
in Tableau is used. You may also specify your own palette by using colour
names listed by grDevices::colors() or through hexadecimal RGB strings.
Labelling methods such as set_fs_method_names or set_feature_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Bootstrap confidence intervals (if present) can be shown using various
styles set by conf_int_style:
-  point_line (default): confidence intervals are shown as lines, on which
the point estimate is likewise shown.
 
-  line: confidence intervals are shown as lines, but the point
estimate is not shown.
 
-  bar_line: confidence intervals are shown as lines, with the point
estimate shown as a bar plot with the opacity of conf_int_alpha.
 
-  none: confidence intervals are not shown. The point estimate is shown as
a bar plot.
 
For metrics where lower values indicate better model performance, more
negative permutation variable importance values indicate features that are
more important. Because this may cause confusion, values obtained for these
metrics are mirrored around 0.0 for plotting (but not any tabular data
export).
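The mirroring step can be pictured with a small base R sketch. The table below contains made-up values purely for illustration, and the column names and the lower_is_better vector are assumptions rather than the structure familiar uses internally.

```r
# Hypothetical importance values: improvement of unpermuted over permuted data.
# For rmse (lower is better), important features yield negative values.
vimp <- data.frame(
  feature = c("age", "stage", "age", "stage"),
  metric = c("auc", "auc", "rmse", "rmse"),
  value = c(0.05, 0.12, -0.8, -2.1),
  stringsAsFactors = FALSE
)

# Assumed list of metrics for which lower values indicate better performance.
lower_is_better <- c("rmse")

# Mirror values around 0.0 for plotting only; the tabular values stay as is.
vimp$plot_value <- ifelse(
  vimp$metric %in% lower_is_better,
  -vimp$value,
  vimp$value
)

print(vimp)
```

After mirroring, larger plotted values indicate more important features for every metric, while the value column keeps the sign that would appear in a tabular export.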
Value
NULL, or a list of plot objects if dir_path is NULL.
Plot heatmaps for pairwise similarity between features.
Description
This method creates a heatmap based on data stored in a
familiarCollection object. Features in the heatmap are ordered so that
more similar features appear together.
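Ordering features so that similar ones sit next to each other is commonly done through hierarchical clustering. The sketch below shows that general idea in base R and is not familiar's implementation: it assumes absolute Spearman correlation as the similarity, average linkage, and a handful of mtcars columns as example features.

```r
# Example features from a built-in dataset.
features <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Pairwise similarity between features (assumed: absolute Spearman correlation).
similarity <- abs(cor(features, method = "spearman"))

# Hierarchical clustering on the corresponding distance.
tree <- hclust(as.dist(1 - similarity), method = "average")

# Reorder rows and columns by the dendrogram order, so that similar
# features become adjacent in the heatmap.
ordered <- similarity[tree$order, tree$order]

print(round(ordered, 2))
```

The same tree can also be drawn as a dendrogram next to the heatmap, which corresponds to what the show_feature_dendrogram argument controls.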
Usage
plot_sample_clustering(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_limit = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  x_axis_by = NULL,
  y_axis_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  gradient_palette = NULL,
  gradient_palette_range = waiver(),
  outcome_palette = NULL,
  outcome_palette_range = waiver(),
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  outcome_legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 3,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 3,
  y_breaks = NULL,
  rotate_x_tick_labels = waiver(),
  show_feature_dendrogram = TRUE,
  show_sample_dendrogram = TRUE,
  show_normalised_data = TRUE,
  show_outcome = TRUE,
  dendrogram_height = grid::unit(1.5, "cm"),
  outcome_height = grid::unit(0.3, "cm"),
  evaluation_times = waiver(),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  verbose = TRUE,
  ...
)
## S4 method for signature 'ANY'
plot_sample_clustering(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_limit = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  x_axis_by = NULL,
  y_axis_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  gradient_palette = NULL,
  gradient_palette_range = waiver(),
  outcome_palette = NULL,
  outcome_palette_range = waiver(),
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  outcome_legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 3,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 3,
  y_breaks = NULL,
  rotate_x_tick_labels = waiver(),
  show_feature_dendrogram = TRUE,
  show_sample_dendrogram = TRUE,
  show_normalised_data = TRUE,
  show_outcome = TRUE,
  dendrogram_height = grid::unit(1.5, "cm"),
  outcome_height = grid::unit(0.3, "cm"),
  evaluation_times = waiver(),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  verbose = TRUE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_sample_clustering(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_limit = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  x_axis_by = NULL,
  y_axis_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  ggtheme = NULL,
  gradient_palette = NULL,
  gradient_palette_range = waiver(),
  outcome_palette = NULL,
  outcome_palette_range = waiver(),
  x_label = waiver(),
  x_label_shared = "column",
  y_label = waiver(),
  y_label_shared = "row",
  legend_label = waiver(),
  outcome_legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 3,
  x_breaks = NULL,
  y_range = NULL,
  y_n_breaks = 3,
  y_breaks = NULL,
  rotate_x_tick_labels = waiver(),
  show_feature_dendrogram = TRUE,
  show_sample_dendrogram = TRUE,
  show_normalised_data = TRUE,
  show_outcome = TRUE,
  dendrogram_height = grid::unit(1.5, "cm"),
  outcome_height = grid::unit(0.3, "cm"),
  evaluation_times = waiver(),
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  verbose = TRUE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| feature_cluster_method | The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none,
hclust, agnes, diana and pam.
 none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the
cluster_linkage_method configuration parameter: average, single, complete,
weighted, and ward.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| sample_cluster_method | The method used to perform clustering based on
distance between samples. These are the same methods as for the
cluster_method configuration parameter: hclust, agnes, diana and pam.
 none cannot be used when extracting data for feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| sample_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the
cluster_linkage_method configuration parameter: average, single, complete,
weighted, and ward.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| sample_limit | (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can be
set for the following data elements: sample_similarity and ice_data. | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created performance
plots are saved to. Output is saved in the feature_similarity
subdirectory. If NULL no figures are saved, but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| x_axis_by | (optional) Variable plotted along the x-axis of a plot.
The variable cannot overlap with variables provided to the split_by and
y_axis_by arguments (if used), but may overlap with other arguments. Only
one variable is allowed for this argument. See details for available
variables. | 
| y_axis_by | (optional) Variable plotted along the y-axis of a plot.
The variable cannot overlap with variables provided to the split_by and
x_axis_by arguments (if used), but may overlap with other arguments. Only
one variable is allowed for this argument. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| gradient_palette | (optional) Sequential or divergent palette used to
colour the similarity or distance between features in a heatmap. | 
| gradient_palette_range | (optional) Numerical range used to span the
gradient. This should be a range of two values, e.g. c(0, 1). Lower or
upper boundary can be unset by using NA. If not set, the full
metric-specific range is used. | 
| outcome_palette | (optional) Sequential (continuous, count outcomes) or
qualitative (other outcome types) palette used to show outcome
values. This argument is ignored if the outcome is not shown. | 
| outcome_palette_range | (optional) Numerical range used to span the
gradient of numeric (continuous, count) outcome values. This argument
is ignored for other outcome types or if the outcome is not shown. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| x_label_shared | (optional) Sharing of x-axis labels between facets.
One of three values:
 
 overall: A single label is placed at the bottom of the figure. Tick
text (but not the ticks themselves) is removed for all but the bottom facet
plot(s).
 column: A label is placed at the bottom of each column. Tick text (but
not the ticks themselves) is removed for all but the bottom facet plot(s).
 individual: A label is placed below each facet plot. Tick text is kept.
 | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| y_label_shared | (optional) Sharing of y-axis labels between facets.
One of three values:
 
 overall: A single label is placed to the left of the figure. Tick text
(but not the ticks themselves) is removed for all but the left-most facet
plot(s).
 row: A label is placed to the left of each row. Tick text (but not the
ticks themselves) is removed for all but the left-most facet plot(s).
 individual: A label is placed to the left of each facet plot. Tick text is kept.
 | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| outcome_legend_label | (optional) Label to provide to the legend for
outcome data. If NULL, the legend will not have a name. By default,
class, value and event are used for binomial and multinomial, continuous and count, and survival outcome types, respectively. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_range | (optional) Value range for the x-axis. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| y_range | (optional) Value range for the y-axis. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| rotate_x_tick_labels | (optional) Rotate tick labels on the x-axis by
90 degrees. Defaults to TRUE. Rotation of x-axis tick labels may also be
controlled through the ggtheme. In this case, FALSE should be provided
explicitly. | 
| show_feature_dendrogram | (optional) Show feature dendrogram around
the main panel. Can be TRUE, FALSE, NULL, or a position, i.e. top, bottom, left and right. If a position is specified, it should be appropriate with regard to the
x_axis_by or y_axis_by argument. If x_axis_by is sample (default),
the only valid positions are top (default) and bottom. Alternatively,
if y_axis_by is feature, the only valid positions are right (default)
and left. A dendrogram can only be drawn from cluster methods that produce
dendrograms, such as hclust. For example, a dendrogram cannot be
constructed using the partitioning around medoids method (pam). | 
| show_sample_dendrogram | (optional) Show sample dendrogram around the
main panel. Can be TRUE, FALSE, NULL, or a position, i.e. top, bottom, left and right. If a position is specified, it should be appropriate with regard to the
x_axis_by or y_axis_by argument. If y_axis_by is sample (default),
the only valid positions are right (default) and left. Alternatively,
if x_axis_by is sample, the only valid positions are top (default)
and bottom. A dendrogram can only be drawn from cluster methods that produce
dendrograms, such as hclust. For example, a dendrogram cannot be
constructed using the partitioning around medoids method (pam). | 
| show_normalised_data | (optional) Flag that determines whether the
data shown in the main heatmap is normalised using the same settings as
within the analysis (fixed; default), using a standardisation method
(set_normalisation) that is applied separately to each dataset, or not at
all (none), which shows the data at the original scale, albeit with
batch-corrections. Categorical variables are plotted to span 90% of the entire numerical value
range, i.e. the levels of categorical variables with 2 levels are
represented at 5% and 95% of the range, with 3 levels at 5%, 50%, and 95%,
etc. | 
| show_outcome | (optional) Show outcome column(s) or row(s) in the
graph. Can be TRUE, FALSE, NULL or a position, i.e. top, bottom, left and right. If a position is specified, it should be appropriate with regard to the
x_axis_by or y_axis_by argument. If y_axis_by is sample (default),
the only valid positions are left (default) and right. Alternatively,
if x_axis_by is sample, the only valid positions are top (default)
and bottom. The outcome data will be drawn between the main panel and the sample
dendrogram (if any). | 
| dendrogram_height | (optional) Height of the dendrogram. The height is
1.5 cm by default. Height is expected to be a grid unit (see grid::unit),
which also allows for specifying relative heights. | 
| outcome_height | (optional) Height of an outcome data column/row. The
height is 0.3 cm by default. Height is expected to be a grid unit (see
grid::unit), which also allows for specifying relative heights. In case
of survival outcome data with multiple evaluation_times, this height is
multiplied by the number of time points. | 
| evaluation_times | (optional) Times at which the event status of
time-to-event survival outcomes is determined. Only used for survival outcomes. If not specified, the values used when creating the underlying familiarData objects are used. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| verbose | Flag to indicate whether feedback should be provided for the
plotting. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave, extract_feature_expression
 
 familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
 collection_name: Name of the collection.
 device: Device to use. Can either be a device function (e.g. png), or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320), "print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not save images larger than 50x50 inches, to prevent the common error of specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value from the plot theme.
 create.dir: Whether to create new directories if a non-existing directory is specified in the filename or path (TRUE) or return an error (FALSE, default). If FALSE and run in an interactive session, a prompt will appear asking to create a new directory when necessary.
 feature_similarity: Table containing pairwise distance between samples. This is used to determine cluster information, and indicate which samples are similar. The table is created by the extract_sample_similarity method.
 data: A dataObject object, data.table or data.frame that constitutes the data that are assessed.
 feature_similarity_metric: Metric to determine pairwise similarity between features. Similarity is computed in the same manner as for clustering, and feature_similarity_metric therefore has the same options as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, spearman, kendall and pearson. If not provided explicitly, this parameter is read from settings used at creation of the underlying familiarModel objects.
 sample_similarity_metric: Metric to determine pairwise similarity between samples. Similarity is computed in the same manner as for clustering, but sample_similarity_metric has different options that are better suited to computing distance between samples instead of between features: gower, euclidean. The underlying feature data is scaled to the [0, 1] range (for numerical features) using the feature values across the samples. The normalisation parameters required can optionally be computed from feature data with the outer 5% (on both sides) of feature values trimmed or winsorised. To do so, append _trim (trimming) or _winsor (winsorising) to the metric name. This reduces the effect of outliers somewhat. If not provided explicitly, this parameter is read from settings used at creation of the underlying familiarModel objects.
 message_indent: Number of indentation steps for messages shown during computation and extraction of various data elements.
 | 
Details
This function generates heatmaps of feature values across samples, optionally with feature and sample dendrograms and outcome data.
Available splitting variables are: fs_method, learner, and data_set.
By default, the data is split by fs_method, learner and data_set,
since the number of samples will typically differ between data sets, even
for the same feature selection method and learner.
The x_axis_by and y_axis_by arguments determine what data are shown
along which axis. Each argument takes one of feature and sample, and
both arguments should be unique. By default, features are shown along the
x-axis and samples along the y-axis.
Note that similarity is determined based on the underlying data. Hence the
ordering of features may differ between facets, and tick labels are
maintained for each panel.
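The link between hierarchical clustering and dendrograms can be illustrated with base R alone. The sketch below is not the routine familiar uses internally: the toy data, the distance definition (1 minus the absolute Spearman correlation) and all variable names are assumptions made for illustration. It only shows that hclust produces a tree that can be drawn as a dendrogram, and that the leaf order of that tree determines the order of features along an axis.

```r
# Toy data: 6 samples (rows) by 4 features (columns); the values are made up.
set.seed(1)
x <- matrix(
  rnorm(24),
  nrow = 6,
  dimnames = list(paste0("sample_", 1:6), paste0("feature_", 1:4))
)

# Distance between features (columns), assumed here to be
# 1 - |Spearman correlation|.
feature_distance <- stats::as.dist(1 - abs(stats::cor(x, method = "spearman")))

# Hierarchical clustering yields a tree, and therefore a dendrogram.
feature_tree <- stats::hclust(feature_distance, method = "average")
feature_dendrogram <- stats::as.dendrogram(feature_tree)

# The leaf order of the tree sets the order of features along the axis.
ordered_features <- colnames(x)[feature_tree$order]
print(ordered_features)
```

Because the tree is computed from the data in each facet, a different dataset can yield a different leaf order, which is why tick labels are kept for every panel.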
Available palettes for gradient_palette are those listed by
grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals()
(requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors,
topo.colors and cm.colors, which correspond to the palettes of the same
name in grDevices. If not specified, a default palette based on palettes
in Tableau is used. You may also specify your own palette by using colour
names listed by grDevices::colors() or through hexadecimal RGB strings.
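The palette names mentioned above can be inspected directly in base R before passing one to a palette argument. The following is a minimal sketch that assumes R >= 4.0.0; the variable names and the custom palette are illustrative only, and how familiar interpolates a custom palette is not shown.

```r
# List the palette names that base R makes available (R >= 4.0.0 assumed).
hcl_names <- grDevices::hcl.pals()
palette_names <- grDevices::palette.pals()

# Draw five colours from a sequential HCL palette and inspect them.
viridis_colours <- grDevices::hcl.colors(n = 5, palette = "Viridis")
print(viridis_colours)

# A custom palette can mix named colours and hexadecimal RGB strings.
custom_palette <- c("steelblue", "#FFFFFF", "firebrick")

# Check that every named colour is known to grDevices::colors().
named_colours <- custom_palette[!grepl("^#", custom_palette)]
all_known <- all(named_colours %in% grDevices::colors())
```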
Labelling methods such as set_fs_method_names or set_data_set_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
A list of plot objects if dir_path is NULL; otherwise NULL.
Plot univariate importance.
Description
This function plots the univariate analysis data stored in a
familiarCollection object.
Usage
plot_univariate_importance(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  p_adjustment_method = waiver(),
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  show_cluster = TRUE,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = waiver(),
  x_label = waiver(),
  y_label = "feature",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  significance_level_shown = 0.05,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  verbose = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_univariate_importance(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  p_adjustment_method = waiver(),
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  show_cluster = TRUE,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = waiver(),
  x_label = waiver(),
  y_label = "feature",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  significance_level_shown = 0.05,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  verbose = TRUE,
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_univariate_importance(
  object,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  p_adjustment_method = waiver(),
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  show_cluster = TRUE,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = waiver(),
  x_label = waiver(),
  y_label = "feature",
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  x_range = NULL,
  x_n_breaks = 5,
  x_breaks = NULL,
  significance_level_shown = 0.05,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  verbose = TRUE,
  export_collection = FALSE,
  ...
)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| feature_cluster_method | The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_cluster_cut_method | The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut. silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_similarity_threshold | The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut method. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created figures are
saved to. Output is saved in the variable_importance subdirectory. If
NULL, no figures are saved, but are returned instead. | 
| p_adjustment_method | (optional) Indicates type of p-value that is
shown. One of holm, hochberg, hommel, bonferroni, BH, BY, fdr, none, p_value or q_value for adjusted p-values, uncorrected
p-values and q-values. q-values may not be available. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| show_cluster | (optional) Show which features were clustered together. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette used to fill the bars in case a
non-singular variable was provided to the color_by argument. | 
| gradient_palette | (optional) Palette to use for filling the bars in
case the color_by argument is not set. The bars are then coloured
according to their importance. By default, no gradient is used, and the
bars are not filled according to importance. Use NULL to fill the bars
using the default palette in familiar. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| x_range | (optional) Value range for the x-axis. | 
| x_n_breaks | (optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset. | 
| x_breaks | (optional) Break points on the x-axis of the plot. | 
| significance_level_shown | Position(s) to draw vertical lines indicating
a significance level, e.g. 0.05. Can be NULL to not draw anything. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets. | 
| height | (optional) Height of the plot. A default value is derived
from the number of features and the number of facets. | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| verbose | Flag to indicate whether feedback should be provided for the
plotting. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave, extract_univariate_analysis
 
 familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
 collection_name: Name of the collection.
 device: Device to use. Can either be a device function (e.g. png), or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If NULL (default), the device is guessed based on the filename extension.
 scale: Multiplicative scaling factor.
 dpi: Plot resolution. Also accepts a string input: "retina" (320), "print" (300), or "screen" (72). Applies only to raster output types.
 limitsize: When TRUE (the default), ggsave() will not save images larger than 50x50 inches, to prevent the common error of specifying dimensions in pixels.
 bg: Background colour. If NULL, uses the plot.background fill value from the plot theme.
 create.dir: Whether to create new directories if a non-existing directory is specified in the filename or path (TRUE) or return an error (FALSE, default). If FALSE and run in an interactive session, a prompt will appear asking to create a new directory when necessary.
 data: A dataObject object, data.table or data.frame that constitutes the data that are assessed.
 cl: Cluster created using the parallel package. This cluster is then used to speed up computation through parallelisation.
 feature_similarity_metric: Metric to determine pairwise similarity between features. Similarity is computed in the same manner as for clustering, and feature_similarity_metric therefore has the same options as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, spearman, kendall and pearson. If not provided explicitly, this parameter is read from settings used at creation of the underlying familiarModel objects.
 icc_type: String indicating the type of intraclass correlation coefficient (1, 2 or 3) that should be used to compute robustness for features in repeated measurements during the evaluation of univariate importance. These types correspond to the types in Shrout and Fleiss (1979). If not provided explicitly, this parameter is read from settings used at creation of the underlying familiarModel objects.
 message_indent: Number of indentation steps for messages shown during computation and extraction of various data elements.
 | 
Details
This function generates a horizontal barplot with the length of the
bars corresponding to the 10-logarithm of the (multiple-testing corrected)
p-value or q-value.
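The relation between adjusted p-values and bar length can be sketched with stats::p.adjust. The p-values and feature names below are made up, and the negative 10-logarithm is assumed so that smaller p-values give longer bars; the exact transformation familiar applies is not confirmed here. The method names passed to p.adjust match several of the options listed for p_adjustment_method.

```r
# Hypothetical uncorrected p-values for three features.
p_values <- c(feature_a = 0.01, feature_b = 0.04, feature_c = 0.03)

# Multiple-testing correction with two of the available methods.
holm_adjusted <- stats::p.adjust(p_values, method = "holm")
bh_adjusted <- stats::p.adjust(p_values, method = "BH")

# Bar length, assumed to be the negative 10-logarithm of the adjusted p-value.
bar_length <- -log10(holm_adjusted)

# Position of a vertical line for a significance level of 0.05.
significance_line <- -log10(0.05)

# Features whose bar extends beyond the significance line.
beyond_line <- names(bar_length)[bar_length > significance_line]
print(beyond_line)
```

With the Holm correction only feature_a stays below 0.05 in this toy example, so only its bar crosses the line.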
Features are assessed univariately using one-sample location t-tests after
fitting a suitable regression model. The fitted model coefficient and the
covariance matrix are then used to compute a p-value.
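How a p-value follows from a fitted coefficient and the covariance matrix can be shown with an ordinary linear model. This is a sketch under assumptions: it uses stats::lm on the built-in mtcars data with a t-test on the coefficient, whereas the model type and test that familiar selects depend on the outcome type and are not reproduced here.

```r
# Fit a simple regression model of the outcome on a single feature.
fit <- stats::lm(mpg ~ wt, data = datasets::mtcars)

# Coefficient of the feature and its standard error from the covariance matrix.
beta <- stats::coef(fit)[["wt"]]
standard_error <- sqrt(stats::vcov(fit)["wt", "wt"])

# Test statistic for the null hypothesis that the coefficient equals zero.
t_statistic <- beta / standard_error

# Two-sided p-value using the residual degrees of freedom.
p_value <- 2 * stats::pt(
  abs(t_statistic),
  df = stats::df.residual(fit),
  lower.tail = FALSE
)
print(p_value)
```

The result agrees with the p-value reported by summary(fit) for the same coefficient.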
The following splitting variables are available for split_by, color_by
and facet_by:
Unlike for plots of feature ranking in feature selection and after
modelling (as assessed by model-specific routines), clusters of features
are now found during creation of underlying familiarData objects, instead
of through consensus clustering. Hence, clustering results may differ due
to differences in the underlying datasets.
Available palettes for discrete_palette and gradient_palette are those
listed by grDevices::palette.pals() (requires R >= 4.0.0),
grDevices::hcl.pals() (requires R >= 3.6.0) and rainbow, heat.colors,
terrain.colors, topo.colors and cm.colors, which correspond to the
palettes of the same name in grDevices. If not specified, a default
palette based on palettes in Tableau is used. You may also specify your
own palette by using colour names listed by grDevices::colors() or
through hexadecimal RGB strings.
Labelling methods such as set_fs_method_names or set_feature_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
A list of plot objects if dir_path is NULL; otherwise NULL.
Plot variable importance scores of features during feature selection
or after training a model.
Description
This function plots variable importance based data obtained
during feature selection or after training a model, which are stored in a
familiarCollection object.
Usage
plot_variable_importance(
  object,
  type,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  show_cluster = TRUE,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = waiver(),
  x_label = "feature",
  rotate_x_tick_labels = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'ANY'
plot_variable_importance(
  object,
  type,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  show_cluster = TRUE,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = waiver(),
  x_label = "feature",
  rotate_x_tick_labels = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
## S4 method for signature 'familiarCollection'
plot_variable_importance(
  object,
  type,
  feature_cluster_method = waiver(),
  feature_linkage_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_similarity_threshold = waiver(),
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  draw = FALSE,
  dir_path = NULL,
  split_by = NULL,
  color_by = NULL,
  facet_by = NULL,
  facet_wrap_cols = NULL,
  show_cluster = TRUE,
  ggtheme = NULL,
  discrete_palette = NULL,
  gradient_palette = waiver(),
  x_label = "feature",
  rotate_x_tick_labels = waiver(),
  y_label = waiver(),
  legend_label = waiver(),
  plot_title = waiver(),
  plot_sub_title = waiver(),
  caption = NULL,
  y_range = NULL,
  y_n_breaks = 5,
  y_breaks = NULL,
  width = waiver(),
  height = waiver(),
  units = waiver(),
  export_collection = FALSE,
  ...
)
plot_feature_selection_occurrence(...)
plot_feature_selection_variable_importance(...)
plot_model_signature_occurrence(...)
plot_model_signature_variable_importance(...)
Arguments
| object | A familiarCollection object, or other objects from which
a familiarCollection can be extracted. See details for more information. | 
| type | Determine what variable importance should be shown. Can be
feature_selection or model for the variable importance after the
feature selection step and after the model training step, respectively. | 
| feature_cluster_method | The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter: none, hclust, agnes, diana and pam. none cannot be used when extracting data regarding mutual correlation or
feature expressions.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_linkage_method | The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_cluster_cut_method | The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut. silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| feature_similarity_threshold | The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut method. If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects. | 
| aggregation_method | (optional) The method used to aggregate variable
importances over different data subsets, e.g. bootstraps. The following
methods can be selected:
 
 mean (default): Use the mean rank of a feature over the subsets to
determine the aggregated feature rank.
 median: Use the median rank of a feature over the subsets to determine
the aggregated feature rank.
 best: Use the best rank the feature obtained in any subset to determine
the aggregated feature rank.
 worst: Use the worst rank the feature obtained in any subset to
determine the aggregated feature rank.
 stability: Use the frequency of the feature being in the subset of
highly ranked features as measure for the aggregated feature rank
(Meinshausen and Buehlmann, 2010).
 exponential: Use a rank-weighted frequency of occurrence in the subset
of highly ranked features as measure for the aggregated feature rank (Haury
et al., 2011).
 borda: Use the borda count as measure for the aggregated feature rank
(Wald et al., 2012).
 enhanced_borda: Use an occurrence frequency-weighted borda count as
measure for the aggregated feature rank (Wald et al., 2012).
 truncated_borda: Use borda count computed only on features within the
subset of highly ranked features.
 enhanced_truncated_borda: Apply both the enhanced borda method and the
truncated borda method and use the resulting borda count as the aggregated
feature rank.
 | 
| rank_threshold | (optional) The threshold used to define the subset of
highly important features. If not set, this threshold is determined by
maximising the variance in the occurrence value over all features over the
subset size.
 This parameter is only relevant for the stability, exponential, enhanced_borda, truncated_borda and enhanced_truncated_borda methods. | 
| draw | (optional) Draws the plot if TRUE. | 
| dir_path | (optional) Path to the directory where created figures are
saved to. Output is saved in the variable_importance subdirectory. If NULL, no figures are saved, but are returned instead. | 
| split_by | (optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables. | 
| color_by | (optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by argument, but may overlap with other arguments. See details for available
variables. | 
| facet_by | (optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables. | 
| facet_wrap_cols | (optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead. | 
| show_cluster | (optional) Show which features were clustered together.
Currently not available in combination with variable importance obtained
during feature selection. | 
| ggtheme | (optional) ggplot theme to use for plotting. | 
| discrete_palette | (optional) Palette to use for coloring bar plots,
in case a non-singular variable was provided to the color_by argument. | 
| gradient_palette | (optional) Palette to use for filling the bars in
case the color_by argument is not set. The bars are then coloured
according to the occurrence of features. By default, no gradient is used,
and the bars are not filled according to occurrence. Use NULL to fill the
bars using the default palette in familiar. | 
| x_label | (optional) Label to provide to the x-axis. If NULL, no label
is shown. | 
| rotate_x_tick_labels | (optional) Rotate tick labels on the x-axis by
90 degrees. Defaults to TRUE. Rotation of x-axis tick labels may also be
controlled through the ggtheme. In this case, FALSE should be provided
explicitly. | 
| y_label | (optional) Label to provide to the y-axis. If NULL, no label
is shown. | 
| legend_label | (optional) Label to provide to the legend. If NULL, the
legend will not have a name. | 
| plot_title | (optional) Label to provide as figure title. If NULL, no
title is shown. | 
| plot_sub_title | (optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown. | 
| caption | (optional) Label to provide as figure caption. If NULL, no
caption is shown. | 
| y_range | (optional) Value range for the y-axis. | 
| y_n_breaks | (optional) Number of breaks to show on the y-axis of the
plot. y_n_breaks is used to determine the y_breaks argument in case it
is unset. | 
| y_breaks | (optional) Break points on the y-axis of the plot. | 
| width | (optional) Width of the plot. A default value is derived from
the number of facets and the number of features. | 
| height | (optional) Height of the plot. A default value is derived
from number of facets, and the length of the longest feature name (if
rotate_x_tick_labels is TRUE). | 
| units | (optional) Plot size unit. Either cm (default), mm or in. | 
| export_collection | (optional) Exports the collection if TRUE. | 
| ... | Arguments passed on to as_familiar_collection, ggplot2::ggsave, extract_fs_vimp 
familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection.
device: Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
scale: Multiplicative scaling factor.
dpi: Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
limitsize: When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
bg: Background colour. If NULL, uses the plot.background fill value
from the plot theme.
create.dir: Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
verbose: Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent: Number of indentation steps for messages shown during
computation and extraction of various data elements. | 
Details
This function generates a barplot based on variable importance of
features.
The only allowed values for split_by, color_by or facet_by are
fs_method and learner, but note that learner has no effect when
plotting variable importance of features acquired during feature selection.
Available palettes for discrete_palette and gradient_palette are those
listed by grDevices::palette.pals() (requires R >= 4.0.0),
grDevices::hcl.pals() (requires R >= 3.6.0) and rainbow, heat.colors,
terrain.colors, topo.colors and cm.colors, which correspond to the
palettes of the same name in grDevices. If not specified, a default
palette based on palettes in Tableau is used. You may also specify your
own palette by using colour names listed by grDevices::colors() or
through hexadecimal RGB strings.
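The palette sources named above can be inspected directly in base R. The following is a minimal sketch that uses only grDevices. The palette names ("Okabe-Ito", "viridis"), the number of colours and the variable names are illustrative assumptions, and the snippet does not show how familiar resolves palette names internally.

```r
# List the palette names offered by base R (grDevices).
qualitative_names <- grDevices::palette.pals()  # requires R >= 4.0.0
hcl_names <- grDevices::hcl.pals()              # requires R >= 3.6.0

# Draw three colours from a few of these sources.
okabe_ito <- grDevices::palette.colors(n = 3, palette = "Okabe-Ito")
viridis <- grDevices::hcl.colors(n = 3, palette = "viridis")
rainbow_cols <- grDevices::rainbow(3)

# A custom palette that mixes a named colour with hexadecimal RGB strings.
# Named colours should appear in grDevices::colors().
custom <- c("steelblue", "#E69F00", "#009E73")
```

According to the description above, a palette name or a vector of colours such as custom may then be supplied to discrete_palette or gradient_palette.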
Labeling methods such as set_feature_names or set_fs_method_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Value
NULL or list of plot objects, if dir_path is NULL.
Pre-compute data assignment
Description
Creates data assignment.
Usage
precompute_data_assignment(
  formula = NULL,
  data = NULL,
  experiment_data = NULL,
  cl = NULL,
  experimental_design = "fs+mb",
  verbose = TRUE,
  ...
)
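The experimental_design argument describes how samples are divided into training and validation sets. As a rough illustration of what a design such as cv(fs+mb,3,2) implies at the sample level, the base R sketch below creates 3 folds, repeated 2 times, for 12 hypothetical samples. It is not familiar's implementation: it omits stratification, batches and series, and all names in it are assumptions made for the example.

```r
# Illustrative only: base R, not familiar's internal implementation.
set.seed(1)
sample_id <- paste0("sample_", 1:12)  # hypothetical sample identifiers
n_folds <- 3L
n_rep <- 2L

assignment <- lapply(seq_len(n_rep), function(rep_id) {
  # Shuffle fold labels so that each fold receives a (near-)equal share.
  fold <- sample(rep(seq_len(n_folds), length.out = length(sample_id)))
  lapply(seq_len(n_folds), function(k) {
    list(
      training = sample_id[fold != k],
      validation = sample_id[fold == k]
    )
  })
})

# Flatten to one list entry per train/validation split:
# 2 repetitions x 3 folds = 6 splits.
splits <- unlist(assignment, recursive = FALSE)
```

Each element of splits holds the training and validation sample identifiers for one fold of one repetition. Assignment happens on sample identifiers rather than rows, mirroring the statement that subsets are created at the sample level.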
Arguments
| formula | An R formula. The formula can only contain feature names and
dot (.). The * and +1 operators are not supported as these refer to
columns that are not present in the data set.
 Use of the formula interface is optional. | 
| data | A data.table object, a data.frame object, list containing
multiple data.table or data.frame objects, or paths to data files.
 data should be provided if no file paths are provided to the data_files argument. If both are provided, only data will be used.
 All data is expected to be in wide format, and ideally has a sample
identifier (see sample_id_column), batch identifier (see cohort_column)
and outcome columns (see outcome_column).
 In case paths are provided, the data should be stored as csv, rds or RData files. See documentation for the data_files argument for more
information. | 
| experiment_data | Experimental data may be provided in the form of | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. When a cluster is not
provided, parallelisation is performed by setting up a cluster on the local
machine.
 This parameter has no effect if the parallel argument is set to FALSE. | 
| experimental_design | (required) Defines what the experiment looks
like, e.g. cv(bt(fs,20)+mb,3,2) for 2 times repeated 3-fold
cross-validation with nested feature selection on 20 bootstraps and
model-building. The basic workflow components are: 
 fs: (required) feature selection step.
 mb: (required) model building step.
 ev: (optional) external validation. If validation batches or cohorts
are present in the dataset (data), these should be indicated in the validation_batch_id argument.
 The different components are linked using +.
 Different subsampling methods can be used in conjunction with the basic
workflow components:
 
 bs(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. In contrast to bt, feature pre-processing parameters and
hyperparameter optimisation are conducted on individual bootstraps.
 bt(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. Unlike bs and other subsampling methods, no separate
pre-processing parameters or optimised hyperparameters will be determined
for each bootstrap.
 cv(x,n,p): (stratified) n-fold cross-validation, repeated p times.
Pre-processing parameters are determined for each iteration.
 lv(x): leave-one-out-cross-validation. Pre-processing parameters are
determined for each iteration.
 ip(x): imbalance partitioning for addressing class imbalances on the
data set. Pre-processing parameters are determined for each partition. The
number of partitions generated depends on the imbalance correction method
(see the imbalance_correction_method parameter).
 As shown in the example above, sampling algorithms can be nested.
 Though neither variable importance is determined nor models are learned
within precompute_data_assignment, the corresponding elements are still
required to prevent issues when using the resulting experimentData object
to warm-start the experiments.
 The simplest valid experimental design is fs+mb. This is the default in precompute_data_assignment, and will simply assign all instances to the
training set. | 
| verbose | Indicates verbosity of the results. Default is TRUE, and all
messages and warnings are returned. | 
| ... | Arguments passed on to .parse_experiment_settings, .parse_setup_settings, .parse_preprocessing_settings 
batch_id_column: (recommended) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.
 In familiar any row of data is organised by four identifiers:
 
 The batch identifier batch_id_column: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
 The sample identifier sample_id_column: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
 The series identifier series_id_column: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
 The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
sample_id_column: (recommended) Name of the column containing
sample or subject identifiers. See batch_id_column above for more
details.
 If unset, every row will be identified as a single sample.
series_id_column: (optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See batch_id_column above for more details.
 If unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers.
development_batch_id: (optional) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all, or
all minus the identifiers in validation_batch_id for external validation.
Required if external validation is performed and validation_batch_id is
not provided.
validation_batch_id: (optional) One or more batch or cohort
identifiers to constitute data sets for external validation. Defaults to
all data sets except those in development_batch_id for external
validation, or none if not. Required if development_batch_id is not
provided.
outcome_name: (optional) Name of the modelled outcome. This name will
be used in figures created by familiar.
 If not set, the column name in outcome_column will be used for binomial, multinomial, count and continuous outcomes. For other
outcomes (survival and competing_risk) no default is used.
outcome_column: (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that survival
and competing_risk outcome types require two columns that
indicate the time-to-event or the time of last follow-up and the event
status.
outcome_type: (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from
contents of the outcome column. This may lead to unexpected results, and we
therefore advise to provide this information manually.
 Note that competing_risk survival analysis is not fully supported, and
is currently not a valid choice for outcome_type.
class_levels: (optional) Class levels for binomial or multinomial
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column.
event_indicator: (recommended) Indicator for events in survival and
competing_risk analyses. familiar will automatically recognise 1, true, t, y and yes as event indicators, including different
capitalisations. If this parameter is set, it replaces the default values.
censoring_indicator: (recommended) Indicator for right-censoring in
survival and competing_risk analyses. familiar will automatically
recognise 0, false, f, n, no as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values.
competing_risk_indicator: (recommended) Indicator for competing
risks in competing_risk analyses. There are no default values, and if
unset, all values other than those specified by the event_indicator and
censoring_indicator parameters are considered to indicate competing
risks.
signature: (optional) One or more names of feature columns that are
considered part of a specific signature. Features specified here will
always be used for modelling. Ranking from feature selection has no effect
for these features.
novelty_features: (optional) One or more names of feature columns
that should be included for the purpose of novelty detection.
exclude_features: (optional) Feature columns that will be removed
from the data set. Cannot overlap with features in signature, novelty_features or include_features.
include_features: (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap signature. Features in
signature and novelty_features are always included. If both
exclude_features and include_features are provided, include_features
takes precedence, provided that there is no overlap between the two.
reference_method: (optional) Method used to set reference levels for
categorical features. There are several options:
 
 auto (default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
 always: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
 never: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
imbalance_correction_method: (optional) Type of method used to
address class imbalances. Available options are:
 
 full_undersampling (default): All data will be used in an ensemble
fashion. The full minority class will appear in each partition, but
majority classes are undersampled until all data have been used.
 random_undersampling: Randomly undersamples majority classes. This is
useful in cases where full undersampling would lead to the formation of
many models due to major overrepresentation of the largest class.
 This parameter is only used in combination with imbalance partitioning in
the experimental design, and ip should therefore appear in the string
that defines the design.
imbalance_n_partitions: (optional) Number of times random
undersampling should be repeated. 10 undersampled subsets with balanced
classes are formed by default.
parallel: (optional) Enable parallel processing. Defaults to TRUE.
When set to FALSE, this disables all parallel processing, regardless of
specific parameters such as parallel_preprocessing. However, when
parallel is TRUE, parallel processing of different parts of the
workflow can be disabled by setting respective flags to FALSE.
parallel_nr_cores: (optional) Number of cores available for
parallelisation. Defaults to 2. This setting does nothing if
parallelisation is disabled.
restart_cluster: (optional) Restart nodes used for parallel computing
to free up memory prior to starting a parallel process. Note that it does
take time to set up the clusters. Therefore setting this argument to TRUE
may impact processing speed. This argument is ignored if parallel is
FALSE or the cluster was initialised outside of familiar. Default is
FALSE, which causes the clusters to be initialised only once.
cluster_type: (optional) Selection of the cluster type for parallel
processing. Available types are the ones supported by the parallel package
that is part of the base R distribution: psock (default), fork, mpi,
nws, sock. In addition, none is available, which also disables
parallel processing.
backend_type: (optional) Selection of the backend for distributing
copies of the data. This backend ensures that only a single master copy is
kept in memory. This limits memory usage during parallel processing.
 Several backend options are available, notably socket_server, and none
(default). socket_server is based on the callr package and R sockets,
comes with familiar and is available for any OS. none uses the package
environment of familiar to store data, and is available for any OS.
However, none requires copying of data to any parallel process, and has a
larger memory footprint.
server_port: (optional) Integer indicating the port on which the
socket server or RServe process should communicate. Defaults to port 6311.
Note that ports 0 to 1024 and 49152 to 65535 cannot be used.
feature_max_fraction_missing: (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a feature to be included in the data set. All features with a
missing value fraction over this threshold are not processed further. The
default value is 0.30.
sample_max_fraction_missing: (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a sample to be included in the data set. All samples with a
missing value fraction over this threshold are excluded and not processed
further. The default value is 0.30.
filter_method: (optional) One or more methods used to reduce
dimensionality of the data set by removing irrelevant or poorly
reproducible features.
 Several methods are available:
 
 none (default): None of the features will be filtered.
 low_variance: Features with a variance below the
low_var_minimum_variance_threshold are filtered. This can be useful to
filter, for example, genes that are not differentially expressed.
 univariate_test: Features undergo a univariate regression using an
outcome-appropriate regression model. The p-value of the model coefficient
is collected. Features with coefficient p or q-value above the
univariate_test_threshold are subsequently filtered.
 robustness: Features that are not sufficiently robust according to the
intraclass correlation coefficient are filtered. Use of this method
requires that repeated measurements are present in the data set, i.e. there
should be entries for which the sample and cohort identifiers are the same.
 More than one method can be used simultaneously. Features with singular
values are always filtered, as these do not contain information.
univariate_test_threshold: (optional) Numeric value between 1.0 and
0.0 that determines which features are irrelevant and will be filtered by
the univariate_test. The p or q-values are compared to this threshold.
All features with values above the threshold are filtered. The default
value is 0.20.
univariate_test_threshold_metric: (optional) Metric that is compared
against the univariate_test_threshold. The following metrics can
be chosen: 
 p_value (default): The unadjusted p-value of each feature is used
to filter features.
 q_value: The q-value (Storey, 2002) is used to filter features. Some
data sets may have insufficient samples to compute the q-value. The
qvalue package must be installed from Bioconductor to use this method.
univariate_test_max_feature_set_size: (optional) Maximum size of the
feature set after the univariate test. P or q values of features are
compared against the threshold, but if the resulting data set would be
larger than this setting, only the most relevant features up to the desired
feature set size are selected.
 The default value is NULL, which causes features to be filtered based on
their relevance only.
low_var_minimum_variance_threshold: (required, if used) Numeric value
that determines which features will be filtered by the low_variance
method. The variance of each feature is computed and compared to the
threshold. If it is below the threshold, the feature is removed.
 This parameter has no default value and should be set if low_variance is
used.
low_var_max_feature_set_size: (optional) Maximum size of the feature
set after filtering features with a low variance. All features are first
compared against low_var_minimum_variance_threshold. If the resulting
feature set would be larger than specified, only the most strongly varying
features will be selected, up to the desired size of the feature set.
 The default value is NULL, which causes features to be filtered based on
their variance only.
robustness_icc_type: (optional) String indicating the type of
intraclass correlation coefficient (1, 2 or 3) that should be used to
compute robustness for features in repeated measurements. These types
correspond to the types in Shrout and Fleiss (1979). The default value is 1.
robustness_threshold_metric: (optional) String indicating which
specific intraclass correlation coefficient (ICC) metric should be used to
filter features. This should be one of:
 
 icc: The estimated ICC value itself.
 icc_low (default): The estimated lower limit of the 95% confidence
interval of the ICC, as suggested by Koo and Li (2016).
 icc_panel: The estimated ICC value over the panel average, i.e. the ICC
that would be obtained if all repeated measurements were averaged.
 icc_panel_low: The estimated lower limit of the 95% confidence interval
of the panel ICC.
robustness_threshold_value: (optional) The intraclass correlation
coefficient value that is used as threshold. The default value is 0.70.
transformation_method: (optional) The transformation method used to
change the distribution of the data to be more normal-like. The following
methods are available:
 
 none: This disables transformation of features.
 yeo_johnson: Transformation using the location and scale invariant
version of the Yeo-Johnson transformation (Yeo and Johnson, 2000;
Zwanenburg and Löck, 2023).
 yeo_johnson_robust (default): A robust version of yeo_johnson.
This method is less sensitive to outliers.
 yeo_johnson_conventional: As yeo_johnson, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Yeo and Johnson (2000).
 box_cox: Transformation using the location and scale invariant version
of the Box-Cox transformation (Box and Cox, 1964; Zwanenburg and Löck,
2023).
 box_cox_robust: A robust version of box_cox. This method is less
sensitive to outliers.
 box_cox_conventional: As box_cox, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Box and Cox (1964). This method requires
strictly positive feature values.
 Transformation requires the power.transform package. Only features that
contain numerical data are transformed. Transformation parameters obtained
in development data are stored within featureInfo objects for later use
with validation data sets.
transformation_optimisation_criterion: (optional) Transformation
parameters are optimised using a criterion, conventionally
maximum-likelihood-estimation. power.transform implements multiple
optimisation criteria, of which the following are available: 
 mle (default): Optimisation using maximum likelihood estimation.
 cramer_von_mises: Optimisation using the Cramér-von Mises
criterion. Zwanenburg and Löck (2023) found that this criterion was
relatively robust against outliers.
transformation_gof_test_p_value: (optional) Not all transformations
will lead to features that are roughly normally distributed. Zwanenburg and
Löck (2023) established an empirical goodness-of-fit test for central
normality. This parameter sets the significance for rejecting the
null-hypothesis that a feature distribution is centrally normal. When the
null-hypothesis is rejected, no transformation is performed. The default
value is NULL, which disables the test.
normalisation_method: (optional) The normalisation method used to
improve the comparability between numerical features that may have very
different scales. The following normalisation methods can be chosen:
 
 none: This disables feature normalisation.
 standardisation: Features are normalised by subtraction of their mean
values and division by their standard deviations. This causes every feature
to have a center value of 0.0 and standard deviation of 1.0.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust (default): A robust version of standardisation
that relies on computing Huber's M-estimators for location and scale.
 normalisation: Features are normalised by subtraction of their minimum
values and division by their ranges. This maps all feature values to a
[0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features are normalised by subtraction of their median values
and division by their interquartile range.
 mean_centering: Features are centered by subtracting the mean, but do
not undergo rescaling.
 Only features that contain numerical data are normalised. Normalisation
parameters obtained in development data are stored within featureInfo
objects for later use with validation data sets.
batch_normalisation_method: (optional) The method used for batch
normalisation. Available methods are:
 
 none (default): This disables batch normalisation of features.
 standardisation: Features within each batch are normalised by
subtraction of the mean value and division by the standard deviation in
each batch.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust: A robust version of standardisation that
relies on computing Huber's M-estimators for location and scale within each
batch.
 normalisation: Features within each batch are normalised by subtraction
of their minimum values and division by their range in each batch. This
maps all feature values in each batch to a [0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features in each batch are normalised by subtraction of the
median value and division by the interquartile range of each batch.
 mean_centering: Features in each batch are centered on 0.0 by
subtracting the mean value in each batch, but are not rescaled.
 combat_parametric: Batch adjustments using parametric empirical Bayes
(Johnson et al, 2007). combat_p leads to the same method.
 combat_non_parametric: Batch adjustments using non-parametric empirical
Bayes (Johnson et al, 2007). combat_np and combat lead to the same
method. Note that we reduced complexity from O(n^2) to O(n) by
only computing batch adjustment parameters for each feature on a subset of
50 randomly selected features, instead of all features.
 Only features that contain numerical data are normalised using batch
normalisation. Batch normalisation parameters obtained in development data
are stored within featureInfo objects for later use with validation data
sets, in case the validation data is from the same batch. If validation data contains data from unknown batches, normalisation
parameters are separately determined for these batches.
 Note that for both empirical Bayes methods, the batch effect is assumed to
produce similar effects across the features. This is often true for things
such as gene expressions, but the assumption may not hold generally.
 When performing batch normalisation, it is moreover important to check that
differences between batches or cohorts are not related to the studied
endpoint.
imputation_method: (optional) Method used for imputing missing
feature values. Two methods are implemented:
 
 simple: Simple replacement of a missing value by the median value (for
numeric features) or the modal value (for categorical features).
 lasso: Imputation of missing value by lasso regression (using glmnet)
based on information contained in other features.
 simple imputation precedes lasso imputation to ensure that any missing
values in predictors required for lasso regression are resolved. The
lasso estimate is then used to replace the missing value.
 The default value depends on the number of features in the dataset. If the
number is lower than 100, lasso is used by default, and simple
otherwise.
 Only single imputation is performed. Imputation models and parameters are
stored within featureInfo objects for later use with validation data
sets.
cluster_method: (optional) Clustering is performed to identify and
replace redundant features, for example those that are highly correlated.
Such features do not carry much additional information and may be removed
or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011).
 The cluster method determines the algorithm used to form the clusters. The
following cluster methods are implemented:
 
 none: No clustering is performed.
 hclust (default): Hierarchical agglomerative clustering. If the
fastcluster package is installed, fastcluster::hclust is used (Muellner
2013), otherwise stats::hclust is used.
 agnes: Hierarchical clustering using agglomerative nesting (Kaufman and
Rousseeuw, 1990). This algorithm is similar to hclust, but uses the
cluster::agnes implementation.
 diana: Divisive analysis hierarchical clustering. This method uses
divisive instead of agglomerative clustering (Kaufman and Rousseeuw, 1990).
cluster::diana is used.
 pam: Partitioning around medoids. This partitions the data into $k$
clusters around medoids (Kaufman and Rousseeuw, 1990). $k$ is selected
using the silhouette metric. pam is implemented using the cluster::pam
function.
 Clusters and cluster information are stored within featureInfo objects for
later use with validation data sets. This enables reproduction of the same
clusters as formed in the development data set.
cluster_linkage_method: (optional) Linkage method used for
agglomerative clustering in hclust and agnes. The following linkage
methods can be used: 
 average (default): Average linkage.
 single: Single linkage.
 complete: Complete linkage.
 weighted: Weighted linkage, also known as McQuitty linkage.
 ward: Linkage using Ward's minimum variance method.
 diana and pam do not require a linkage method.
cluster_cut_method: (optional) The method used to define the actual
clusters. The following methods can be used:
 
 silhouette: Clusters are formed based on the silhouette score
(Rousseeuw, 1987). The average silhouette score is computed from 2 to
n clusters, with n the number of features. Clusters are only
formed if the average silhouette exceeds 0.50, which indicates reasonable
evidence for structure. This procedure may be slow if the number of
features is large (>100s).
 fixed_cut: Clusters are formed by cutting the hierarchical tree at the
point indicated by the cluster_similarity_threshold, e.g. where features
in a cluster have an average Spearman correlation of 0.90. fixed_cut is
only available for agnes, diana and hclust.
 dynamic_cut: Dynamic cluster formation using the cutting algorithm in
the dynamicTreeCut package. This package should be installed to select
this option. dynamic_cut can only be used with agnes and hclust.
 The default options are silhouette for partitioning around medoids (pam)
and fixed_cut otherwise.
cluster_similarity_metric: (optional) Clusters are formed based on
feature similarity. All features are compared in a pair-wise fashion to
compute similarity, for example correlation. The resulting similarity grid
is converted into a distance matrix that is subsequently used for
clustering. The following metrics are supported to compute pairwise
similarities:
 
 mutual_information (default): normalised mutual information.
 mcfadden_r2: McFadden's pseudo R-squared (McFadden, 1974).
 cox_snell_r2: Cox and Snell's pseudo R-squared (Cox and Snell, 1989).
 nagelkerke_r2: Nagelkerke's pseudo R-squared (Nagelkerke, 1991).
 spearman: Spearman's rank order correlation.
 kendall: Kendall rank correlation.
 pearson: Pearson product-moment correlation.
 The pseudo R-squared metrics can be used to assess similarity between mixed
pairs of numeric and categorical features, as these are based on the
log-likelihood of regression models. In familiar, the more informative
feature is used as the predictor and the other feature as the response
variable. In numeric-categorical pairs, the numeric feature is considered
to be more informative and is thus used as the predictor. In
categorical-categorical pairs, the feature with most levels is used as the
predictor. In case any of the classical correlation coefficients (pearson,
spearman and kendall) are used with (mixed) categorical features, the
categorical features are one-hot encoded and the mean correlation over all
resulting pairs is used as similarity.
cluster_similarity_threshold (optional) The threshold level for
pair-wise similarity that is required to form clusters using fixed_cut.
This should be a numerical value between 0.0 and 1.0. Note however, that a
reasonable threshold value depends strongly on the similarity metric. The
following are the default values used: 
 mcfadden_r2 and mutual_information: 0.30
 cox_snell_r2 and nagelkerke_r2: 0.75
 spearman, kendall and pearson: 0.90
 Alternatively, if the fixed_cut method is not used, this value determines
whether any clustering should be performed, because the data may not
contain highly similar features. The default values in this situation are: 
 mcfadden_r2 and mutual_information: 0.25
 cox_snell_r2 and nagelkerke_r2: 0.40
 spearman, kendall and pearson: 0.70
 The threshold value is converted to a distance (1-similarity) prior to
cutting hierarchical trees.
cluster_representation_method (optional) Method used to determine
how the information of co-clustered features is summarised and used to
represent the cluster. The following methods can be selected:
 
 best_predictor (default): The feature with the highest importance
according to univariate regression with the outcome is used to represent
the cluster.
 medioid: The feature closest to the cluster center, i.e. the feature
that is most similar to the remaining features in the cluster, is used to
represent the cluster.
 mean: A meta-feature is generated by averaging the feature values for
all features in a cluster. This method aligns all features so that all
features will be positively correlated prior to averaging. Should a cluster
contain one or more categorical features, the medioid method will be used
instead, as averaging is not possible. Note that if this method is chosen,
the normalisation_method parameter should be one of standardisation,
standardisation_trim, standardisation_winsor or quantile.
 If the pam cluster method is selected, only the medioid method can be
used. In that case 1 medioid is used by default.
parallel_preprocessing (optional) Enable parallel processing for the
preprocessing workflow. Defaults to TRUE. When set to FALSE, this will
disable the use of parallel processing while preprocessing, regardless of
the settings of the parallel parameter. parallel_preprocessing is
ignored if parallel=FALSE. | 
Details
This is a thin wrapper around summon_familiar, and functions like
it, but automatically skips computation of variable importance, learning
and subsequent evaluation steps.
The function returns an experimentData object, which can be used to
warm-start other experiments by providing it to the experiment_data
argument.
Value
An experimentData object.
Pre-compute feature information
Description
Creates data assignment and subsequently extracts feature
information such as normalisation and clustering parameters.
Usage
precompute_feature_info(
  formula = NULL,
  data = NULL,
  experiment_data = NULL,
  cl = NULL,
  experimental_design = "fs+mb",
  verbose = TRUE,
  ...
)
Arguments
| formula | An R formula. The formula can only contain feature names and
dot (.). The * and +1 operators are not supported as these refer to
columns that are not present in the data set.
 Use of the formula interface is optional. | 
| data | A data.table object, a data.frame object, list containing
multiple data.table or data.frame objects, or paths to data files.
 data should be provided if no file paths are provided to the data_files
argument. If both are provided, only data will be used.
 All data is expected to be in wide format, and ideally has a sample
identifier (see sample_id_column), batch identifier (see cohort_column)
and outcome columns (see outcome_column).
 In case paths are provided, the data should be stored as csv, rds or
RData files. See documentation for the data_files argument for more
information. | 
| experiment_data | Experimental data may be provided in the form of | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. When a cluster is not
provided, parallelisation is performed by setting up a cluster on the local
machine.
 This parameter has no effect if the parallel argument is set to FALSE. | 
| experimental_design | (required) Defines what the experiment looks
like, e.g. cv(bt(fs,20)+mb,3,2) for 2 times repeated 3-fold
cross-validation with nested feature selection on 20 bootstraps and
model-building. The basic workflow components are: 
 fs: (required) feature selection step.
 mb: (required) model building step.
 ev: (optional) external validation. If validation batches or cohorts
are present in the dataset (data), these should be indicated in the
validation_batch_id argument.
 The different components are linked using +.
 Different subsampling methods can be used in conjunction with the basic
workflow components:
 
 bs(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. In contrast to bt, feature pre-processing parameters and
hyperparameter optimisation are conducted on individual bootstraps.
 bt(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. Unlike bs and other subsampling methods, no separate
pre-processing parameters or optimised hyperparameters will be determined
for each bootstrap.
 cv(x,n,p): (stratified) n-fold cross-validation, repeated p times.
Pre-processing parameters are determined for each iteration.
 lv(x): leave-one-out-cross-validation. Pre-processing parameters are
determined for each iteration.
 ip(x): imbalance partitioning for addressing class imbalances on the
data set. Pre-processing parameters are determined for each partition. The
number of partitions generated depends on the imbalance correction method
(see the imbalance_correction_method parameter).
 As shown in the example above, sampling algorithms can be nested.
 Though neither variable importance is determined nor models are learned
within precompute_feature_info, the corresponding elements are still
required to prevent issues when using the resulting experimentData object
to warm-start the experiments.
 The simplest valid experimental design is fs+mb. This is the default in
precompute_feature_info, and will determine feature parameters over the
entire dataset.
 This argument is ignored if the experiment_data argument is set. | 
| verbose | Indicates verbosity of the results. Default is TRUE, and all
messages and warnings are returned. | 
| ... | Arguments passed on to .parse_experiment_settings, .parse_setup_settings, .parse_preprocessing_settings 
batch_id_column (recommended) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.
 In familiar any row of data is organised by four identifiers:
 
 The batch identifier batch_id_column: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
 The sample identifier sample_id_column: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
 The series identifier series_id_column: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
 The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
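The four identifier levels can be pictured with a small wide-format table. The sketch below is only an illustration in base R: the column names and values are hypothetical, and the way the repetition index is derived here is an assumption about the described behaviour, not the internal implementation of familiar.

```r
# Hypothetical wide-format data: one row per measurement.
data <- data.frame(
  batch_id = c("study_A", "study_A", "study_A", "study_B"),
  sample_id = c("p1", "p1", "p1", "p2"),
  series_id = c(1, 1, 2, 1),
  outcome = c(0, 0, 1, 1),
  feature_1 = c(0.8, 0.9, 1.4, 0.2)
)

# Rows that share batch, sample, series and outcome are treated as
# repeated measurements; number them within each such group.
group <- interaction(
  data$batch_id, data$sample_id, data$series_id, data$outcome,
  drop = TRUE
)
data$repetition_id <- ave(seq_len(nrow(data)), group, FUN = seq_along)

print(data)
```

The first two rows belong to the same series of the same sample and share an outcome, so they are numbered as repetitions 1 and 2; the remaining rows each start a new group.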
sample_id_column (recommended) Name of the column containing
sample or subject identifiers. See batch_id_column above for more
details.
 If unset, every row will be identified as a single sample.
series_id_column (optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See batch_id_column above for more details.
 If unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers.
development_batch_id (optional) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all, or
all minus the identifiers in validation_batch_id for external validation.
Required if external validation is performed and validation_batch_id is
not provided.
validation_batch_id (optional) One or more batch or cohort
identifiers to constitute data sets for external validation. Defaults to
all data sets except those in development_batch_id for external
validation, or none if not. Required if development_batch_id is not
provided.
outcome_name (optional) Name of the modelled outcome. This name will
be used in figures created by familiar.
 If not set, the column name in outcome_column will be used for binomial,
multinomial, count and continuous outcomes. For other
outcomes (survival and competing_risk) no default is used.
outcome_column (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that survival
and competing_risk outcomes require two columns that
indicate the time-to-event or the time of last follow-up and the event
status.
outcome_type (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from
contents of the outcome column. This may lead to unexpected results, and we
therefore advise providing this information manually.
 Note that competing_risk survival analysis is not fully supported, and
is currently not a valid choice for outcome_type.
class_levels (optional) Class levels for binomial or multinomial
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column.
event_indicator (recommended) Indicator for events in survival and
competing_risk analyses. familiar will automatically recognise 1,
true, t, y and yes as event indicators, including different
capitalisations. If this parameter is set, it replaces the default values.
censoring_indicator (recommended) Indicator for right-censoring in
survival and competing_risk analyses. familiar will automatically
recognise 0, false, f, n, no as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values.
competing_risk_indicator (recommended) Indicator for competing
risks in competing_risk analyses. There are no default values, and if
unset, all values other than those specified by the event_indicator and
censoring_indicator parameters are considered to indicate competing
risks.
signature (optional) One or more names of feature columns that are
considered part of a specific signature. Features specified here will
always be used for modelling. Ranking from feature selection has no effect
for these features.
novelty_features (optional) One or more names of feature columns
that should be included for the purpose of novelty detection.
exclude_features (optional) Feature columns that will be removed
from the data set. Cannot overlap with features in signature,
novelty_features or include_features.
include_features (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap signature. Features in
signature and novelty_features are always included. If both
exclude_features and include_features are provided, include_features
takes precedence, provided that there is no overlap between the two.
reference_method (optional) Method used to set reference levels for
categorical features. There are several options:
 
 auto (default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
 always: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
 never: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
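The difference between sorted levels and a most-frequent reference level can be sketched in base R as follows. The helper name set_most_frequent_reference and the example values are hypothetical; this illustrates the idea only and is not the code used by familiar.

```r
# Hypothetical helper: make the most frequent level the reference level.
set_most_frequent_reference <- function(x) {
  x <- factor(x)
  counts <- table(x)
  # which.max returns the first maximum, so ties resolve to the first level.
  relevel(x, ref = names(counts)[which.max(counts)])
}

smoking <- c("never", "current", "never", "former", "never", "current")

# Simple sorting (the "never" option for automatically detected features):
# the first level after sorting acts as reference.
sorted_levels <- levels(factor(smoking))

# Most frequent level as reference (the idea behind "auto" and "always").
releveled <- set_most_frequent_reference(smoking)

print(sorted_levels)
print(levels(releveled))
```

With sorted levels the reference is current, whereas the most frequent level in this example is never, which moves to the first position after releveling.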
imbalance_correction_method (optional) Type of method used to
address class imbalances. Available options are:
 
 full_undersampling (default): All data will be used in an ensemble
fashion. The full minority class will appear in each partition, but
majority classes are undersampled until all data have been used.
 random_undersampling: Randomly undersamples majority classes. This is
useful in cases where full undersampling would lead to the formation of
many models due to major overrepresentation of the largest class.
 This parameter is only used in combination with imbalance partitioning in
the experimental design, and ip should therefore appear in the string
that defines the design.
imbalance_n_partitions (optional) Number of times random
undersampling should be repeated. 10 undersampled subsets with balanced
classes are formed by default.
parallel (optional) Enable parallel processing. Defaults to TRUE.
When set to FALSE, this disables all parallel processing, regardless of
specific parameters such as parallel_preprocessing. However, when
parallel is TRUE, parallel processing of different parts of the
workflow can be disabled by setting respective flags to FALSE.
parallel_nr_cores (optional) Number of cores available for
parallelisation. Defaults to 2. This setting does nothing if
parallelisation is disabled.
restart_cluster (optional) Restart nodes used for parallel computing
to free up memory prior to starting a parallel process. Note that it does
take time to set up the clusters. Therefore setting this argument to TRUE
may impact processing speed. This argument is ignored if parallel is
FALSE or the cluster was initialised outside of familiar. Default is
FALSE, which causes the clusters to be initialised only once.
cluster_type (optional) Selection of the cluster type for parallel
processing. Available types are the ones supported by the parallel package
that is part of the base R distribution: psock (default), fork, mpi,
nws, sock. In addition, none is available, which also disables
parallel processing.
backend_type (optional) Selection of the backend for distributing
copies of the data. This backend ensures that only a single master copy is
kept in memory. This limits memory usage during parallel processing.
 Several backend options are available, notably socket_server, and none
(default). socket_server is based on the callr package and R sockets,
comes with familiar and is available for any OS. none uses the package
environment of familiar to store data, and is available for any OS.
However, none requires copying of data to any parallel process, and has a
larger memory footprint.
server_port (optional) Integer indicating the port on which the
socket server or RServe process should communicate. Defaults to port 6311.
Note that ports 0 to 1024 and 49152 to 65535 cannot be used.
feature_max_fraction_missing (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a feature to be included in the data set. All features with a
missing value fraction over this threshold are not processed further. The
default value is 0.30.
sample_max_fraction_missing (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a sample to be included in the data set. All samples with a
missing value fraction over this threshold are excluded and not processed
further. The default value is 0.30.
filter_method (optional) One or more methods used to reduce
dimensionality of the data set by removing irrelevant or poorly
reproducible features.
 Several methods are available:
 
 none (default): None of the features will be filtered.
 low_variance: Features with a variance below the
low_var_minimum_variance_threshold are filtered. This can be useful to
filter, for example, genes that are not differentially expressed.
 univariate_test: Features undergo a univariate regression using an
outcome-appropriate regression model. The p-value of the model coefficient
is collected. Features with coefficient p or q-value above the
univariate_test_threshold are subsequently filtered.
 robustness: Features that are not sufficiently robust according to the
intraclass correlation coefficient are filtered. Use of this method
requires that repeated measurements are present in the data set, i.e. there
should be entries for which the sample and cohort identifiers are the same.
 More than one method can be used simultaneously. Features with singular
values are always filtered, as these do not contain information.
univariate_test_threshold (optional) Numeric value between 0.0 and
1.0 that determines which features are irrelevant and will be filtered by
the univariate_test. The p or q-values are compared to this threshold.
All features with values above the threshold are filtered. The default
value is 0.20.
univariate_test_threshold_metric (optional) Metric that is compared
against the univariate_test_threshold. The following metrics can
be chosen: 
 p_value (default): The unadjusted p-value of each feature is used
to filter features.
 q_value: The q-value (Storey, 2002), is used to filter features. Some
data sets may have insufficient samples to compute the q-value. The
qvalue package must be installed from Bioconductor to use this method.
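The univariate test with an unadjusted p-value threshold can be sketched in base R as follows. This is a minimal illustration that assumes a continuous outcome and ordinary linear regression (stats::lm); familiar selects an outcome-appropriate model itself, and the simulated feature names are hypothetical.

```r
set.seed(1)
n <- 100

# Simulated features: one related to the outcome, two unrelated.
features <- data.frame(
  informative = rnorm(n),
  noise_1 = rnorm(n),
  noise_2 = rnorm(n)
)
outcome <- 2 * features$informative + rnorm(n)

# Fit one univariate regression per feature and collect the p-value of
# the feature coefficient.
p_values <- vapply(features, function(x) {
  fit <- lm(outcome ~ x)
  summary(fit)$coefficients["x", "Pr(>|t|)"]
}, numeric(1))

# Keep features whose p-value does not exceed the threshold (0.20 is the
# documented default of univariate_test_threshold).
threshold <- 0.20
kept <- names(p_values)[p_values <= threshold]

print(round(p_values, 4))
print(kept)
```

Features with p-values above the threshold are dropped. Using q-values instead would replace the vector of p-values by multiplicity-adjusted values before the comparison.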
univariate_test_max_feature_set_size (optional) Maximum size of the
feature set after the univariate test. P or q values of features are
compared against the threshold, but if the resulting data set would be
larger than this setting, only the most relevant features up to the desired
feature set size are selected.
 The default value is NULL, which causes features to be filtered based on
their relevance only.
low_var_minimum_variance_threshold (required, if used) Numeric value
that determines which features will be filtered by the low_variance
method. The variance of each feature is computed and compared to the
threshold. If it is below the threshold, the feature is removed.
 This parameter has no default value and should be set if low_variance is
used.
low_var_max_feature_set_size (optional) Maximum size of the feature
set after filtering features with a low variance. All features are first
compared against low_var_minimum_variance_threshold. If the resulting
feature set would be larger than specified, only the most strongly varying
features will be selected, up to the desired size of the feature set.
 The default value is NULL, which causes features to be filtered based on
their variance only.
robustness_icc_type (optional) String indicating the type of
intraclass correlation coefficient (1, 2 or 3) that should be used to
compute robustness for features in repeated measurements. These types
correspond to the types in Shrout and Fleiss (1979). The default value is
1.
robustness_threshold_metric (optional) String indicating which
specific intraclass correlation coefficient (ICC) metric should be used to
filter features. This should be one of:
 
 icc: The estimated ICC value itself.
 icc_low (default): The estimated lower limit of the 95% confidence
interval of the ICC, as suggested by Koo and Li (2016).
 icc_panel: The estimated ICC value over the panel average, i.e. the ICC
that would be obtained if all repeated measurements were averaged.
 icc_panel_low: The estimated lower limit of the 95% confidence interval
of the panel ICC.
robustness_threshold_value (optional) The intraclass correlation
coefficient value that is used as threshold. The default value is 0.70.
transformation_method (optional) The transformation method used to
change the distribution of the data to be more normal-like. The following
methods are available:
 
 none: This disables transformation of features.
 yeo_johnson: Transformation using the location and scale invariant
version of the Yeo-Johnson transformation (Yeo and Johnson, 2000;
Zwanenburg and Löck, 2023).
 yeo_johnson_robust (default): A robust version of yeo_johnson.
This method is less sensitive to outliers.
 yeo_johnson_conventional: As yeo_johnson, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Yeo and Johnson (2000).
 box_cox: Transformation using the location and scale invariant version
of the Box-Cox transformation (Box and Cox, 1964; Zwanenburg and Löck,
2023).
 box_cox_robust: A robust version of box_cox. This method is less
sensitive to outliers.
 box_cox_conventional: As box_cox, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Box and Cox (1964). This method requires
strictly positive feature values.
 Transformation requires the power.transform package. Only features that
contain numerical data are transformed. Transformation parameters obtained
in development data are stored within featureInfo objects for later use
with validation data sets.
transformation_optimisation_criterion (optional) Transformation
parameters are optimised using a criterion, conventionally
maximum-likelihood-estimation. power.transform implements multiple
optimisation criteria, of which the following are available: 
 mle (default): Optimisation using maximum likelihood estimation.
 cramer_von_mises: Optimisation using the Cramér-von Mises
criterion. Zwanenburg and Löck (2023) found that this criterion was
relatively robust against outliers.
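Maximum likelihood estimation of a transformation parameter can be sketched in base R for the conventional Yeo-Johnson transformation, i.e. without the location and scale parameters or the robust weighting used by the default methods. The function names yeo_johnson and yj_log_likelihood are hypothetical, and this sketch does not use or reproduce the power.transform implementation.

```r
# Conventional Yeo-Johnson transformation for a single parameter lambda.
yeo_johnson <- function(x, lambda) {
  y <- numeric(length(x))
  pos <- x >= 0
  if (abs(lambda) > 1e-8) {
    y[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    y[pos] <- log1p(x[pos])
  }
  if (abs(lambda - 2) > 1e-8) {
    y[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    y[!pos] <- -log1p(-x[!pos])
  }
  y
}

# Profile log-likelihood under normality of the transformed values
# (up to an additive constant).
yj_log_likelihood <- function(lambda, x) {
  y <- yeo_johnson(x, lambda)
  n <- length(x)
  sigma2 <- sum((y - mean(y))^2) / n
  -n / 2 * log(sigma2) + (lambda - 1) * sum(sign(x) * log1p(abs(x)))
}

# Right-skewed example data.
set.seed(42)
x <- rexp(200)

# Search lambda on a bounded interval; the bounds are an arbitrary choice.
fit <- optimise(yj_log_likelihood, interval = c(-2, 4), x = x, maximum = TRUE)
lambda_hat <- fit$maximum
print(lambda_hat)
```

For right-skewed data the estimated parameter falls below 1, which corresponds to a transformation that compresses the upper tail. A value of 1 leaves the data unchanged.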
transformation_gof_test_p_value (optional) Not all transformations
will lead to features that are roughly normally distributed. Zwanenburg and
Löck (2023) established an empirical goodness-of-fit test for central
normality. This parameter sets the significance for rejecting the
null-hypothesis that a feature distribution is centrally normal. When the
null-hypothesis is rejected, no transformation is performed. The default
value is NULL, which disables the test.
normalisation_method (optional) The normalisation method used to
improve the comparability between numerical features that may have very
different scales. The following normalisation methods can be chosen:
 
 none: This disables feature normalisation.
 standardisation: Features are normalised by subtraction of their mean
values and division by their standard deviations. This causes every feature
to have a center value of 0.0 and a standard deviation of 1.0.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust (default): A robust version of standardisation
that relies on computing Huber's M-estimators for location and scale.
 normalisation: Features are normalised by subtraction of their minimum
values and division by their ranges. This maps all feature values to a
[0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features are normalised by subtraction of their median values
and division by their interquartile range.
 mean_centering: Features are centered by subtracting the mean, but do
not undergo rescaling.
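 Several of the normalisation methods listed above can be sketched in a few lines of base R. The helper names are hypothetical, and the winsorised variant assumes that location and scale are estimated on the winsorised values and then applied to the original values; familiar may differ in such details.

```r
standardise <- function(x) (x - mean(x)) / sd(x)
normalise <- function(x) (x - min(x)) / (max(x) - min(x))
quantile_scale <- function(x) (x - median(x)) / IQR(x)

# Clamp the lowest and highest 5% of values to the 5% and 95% quantiles.
winsorise <- function(x, fraction = 0.05) {
  limits <- unname(quantile(x, probs = c(fraction, 1 - fraction)))
  pmin(pmax(x, limits[1]), limits[2])
}

# Standardisation with parameters estimated on winsorised values.
standardise_winsor <- function(x) {
  w <- winsorise(x)
  (x - mean(w)) / sd(w)
}

# Example with one outlier.
x <- c(1, 2, 3, 4, 100)
print(round(standardise(x), 3))
print(round(standardise_winsor(x), 3))
print(normalise(x))
print(quantile_scale(x))
```

Because the outlier inflates the mean and standard deviation less after winsorising, it remains further from the bulk of the data in the winsorised variant than under plain standardisation.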
 Only features that contain numerical data are normalised. Normalisation
parameters obtained in development data are stored within featureInfo
objects for later use with validation data sets.
batch_normalisation_method (optional) The method used for batch
normalisation. Available methods are:
 
 none (default): This disables batch normalisation of features.
 standardisation: Features within each batch are normalised by
subtraction of the mean value and division by the standard deviation in
each batch.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust: A robust version of standardisation that
relies on computing Huber's M-estimators for location and scale within each
batch.
 normalisation: Features within each batch are normalised by subtraction
of their minimum values and division by their range in each batch. This
maps all feature values in each batch to a [0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features in each batch are normalised by subtraction of the
median value and division by the interquartile range of each batch.
 mean_centering: Features in each batch are centered on 0.0 by
subtracting the mean value in each batch, but are not rescaled.
 combat_parametric: Batch adjustments using parametric empirical Bayes
(Johnson et al, 2007). combat_p leads to the same method.
 combat_non_parametric: Batch adjustments using non-parametric empirical
Bayes (Johnson et al, 2007). combat_np and combat lead to the same
method. Note that we reduced complexity from O(n^2) to O(n) by
only computing batch adjustment parameters for each feature on a subset of
50 randomly selected features, instead of all features.
 Only features that contain numerical data are normalised using batch
normalisation. Batch normalisation parameters obtained in development data
are stored within featureInfo objects for later use with validation data
sets, in case the validation data is from the same batch.
 If validation data contains data from unknown batches, normalisation
parameters are separately determined for these batches.
 Note that for both empirical Bayes methods, the batch effect is assumed to
produce results across the features. This is often true for things such as
gene expressions, but the assumption may not hold generally.
 When performing batch normalisation, it is moreover important to check that
differences between batches or cohorts are not related to the studied
endpoint.
imputation_method (optional) Method used for imputing missing
feature values. Two methods are implemented:
 
 simple: Simple replacement of a missing value by the median value (for
numeric features) or the modal value (for categorical features).
 lasso: Imputation of missing value by lasso regression (using glmnet)
based on information contained in other features.
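 The simple method can be sketched in base R as below. The helper name impute_simple is hypothetical and the code only illustrates median and modal replacement; it is not the implementation in familiar, and lasso imputation is not shown.

```r
# Replace missing values by the median (numeric) or the modal value
# (categorical) of the observed values.
impute_simple <- function(x) {
  missing <- is.na(x)
  if (is.numeric(x)) {
    x[missing] <- median(x, na.rm = TRUE)
  } else {
    counts <- table(x)  # NA values are not counted by default
    x[missing] <- names(counts)[which.max(counts)]
  }
  x
}

numeric_feature <- c(1, NA, 3, 10)
categorical_feature <- c("a", "b", NA, "b")

print(impute_simple(numeric_feature))
print(impute_simple(categorical_feature))
```

In practice the replacement values would be determined on development data and stored, so that the same values can be applied to validation data.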
 simple imputation precedes lasso imputation to ensure that any missing
values in predictors required for lasso regression are resolved. The
lasso estimate is then used to replace the missing value.
 The default value depends on the number of features in the dataset. If the
number is lower than 100, lasso is used by default, and simple
otherwise.
 Only single imputation is performed. Imputation models and parameters are
stored within featureInfo objects for later use with validation data
sets.
cluster_method (optional) Clustering is performed to identify and
replace redundant features, for example those that are highly correlated.
Such features do not carry much additional information and may be removed
or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011).
 The cluster method determines the algorithm used to form the clusters. The
following cluster methods are implemented:
 
 none: No clustering is performed.
 hclust (default): Hierarchical agglomerative clustering. If the
fastcluster package is installed, fastcluster::hclust is used (Muellner
2013), otherwise stats::hclust is used.
 agnes: Hierarchical clustering using agglomerative nesting (Kaufman and
Rousseeuw, 1990). This algorithm is similar to hclust, but uses the
cluster::agnes implementation.
 diana: Divisive analysis hierarchical clustering. This method uses
divisive instead of agglomerative clustering (Kaufman and Rousseeuw, 1990).
cluster::diana is used.
 pam: Partitioning around medoids. This partitions the data into $k$
clusters around medoids (Kaufman and Rousseeuw, 1990). $k$ is selected
using the silhouette metric. pam is implemented using the cluster::pam
function.
 Clusters and cluster information are stored within featureInfo objects for
later use with validation data sets. This enables reproduction of the same
clusters as formed in the development data set.
cluster_linkage_method (optional) Linkage method used for
agglomerative clustering in hclust and agnes. The following linkage
methods can be used: 
 average (default): Average linkage.
 single: Single linkage.
 complete: Complete linkage.
 weighted: Weighted linkage, also known as McQuitty linkage.
 ward: Linkage using Ward's minimum variance method.
 diana and pam do not require a linkage method.
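Feature clustering with average linkage and a fixed cut can be sketched with stats::hclust in base R. The sketch assumes Spearman correlation as the similarity metric, takes the absolute correlation as similarity (an assumption made here for illustration), and cuts the tree at a distance of 1 minus the documented 0.90 threshold. The simulated features are hypothetical.

```r
set.seed(7)
n <- 200
base <- rnorm(n)

# f1 and f2 are nearly identical; f3 and f4 are unrelated to the rest.
features <- data.frame(
  f1 = base,
  f2 = base + rnorm(n, sd = 0.05),
  f3 = rnorm(n),
  f4 = rnorm(n)
)

# Pairwise similarity, converted to a distance (1 - similarity).
similarity <- abs(cor(features, method = "spearman"))
distance <- as.dist(1 - similarity)

# Agglomerative clustering with average linkage.
tree <- hclust(distance, method = "average")

# Fixed cut: features join a cluster if their similarity is at least 0.90.
threshold <- 0.90
clusters <- cutree(tree, h = 1 - threshold)
print(clusters)
```

The two redundant features end up in one cluster, which could then be represented by a single feature, while the unrelated features each form their own cluster.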
cluster_cut_method (optional) The method used to define the actual
clusters. The following methods can be used:
 
 silhouette: Clusters are formed based on the silhouette score
(Rousseeuw, 1987). The average silhouette score is computed from 2 to
n clusters, with n the number of features. Clusters are only
formed if the average silhouette exceeds 0.50, which indicates reasonable
evidence for structure. This procedure may be slow if the number of
features is large (>100s).
 fixed_cut: Clusters are formed by cutting the hierarchical tree at the
point indicated by the cluster_similarity_threshold, e.g. where features
in a cluster have an average Spearman correlation of 0.90. fixed_cut is
only available for agnes, diana and hclust.
 dynamic_cut: Dynamic cluster formation using the cutting algorithm in
the dynamicTreeCut package. This package should be installed to select
this option. dynamic_cut can only be used with agnes and hclust.
 The default options are silhouette for partitioning around medoids (pam)
and fixed_cut otherwise.
cluster_similarity_metric (optional) Clusters are formed based on
feature similarity. All features are compared in a pair-wise fashion to
compute similarity, for example correlation. The resulting similarity grid
is converted into a distance matrix that is subsequently used for
clustering. The following metrics are supported to compute pairwise
similarities:
 
 mutual_information (default): normalised mutual information.
 mcfadden_r2: McFadden's pseudo R-squared (McFadden, 1974).
 cox_snell_r2: Cox and Snell's pseudo R-squared (Cox and Snell, 1989).
 nagelkerke_r2: Nagelkerke's pseudo R-squared (Nagelkerke, 1991).
 spearman: Spearman's rank order correlation.
 kendall: Kendall rank correlation.
 pearson: Pearson product-moment correlation.
 The pseudo R-squared metrics can be used to assess similarity between mixed
pairs of numeric and categorical features, as these are based on the
log-likelihood of regression models. In familiar, the more informative
feature is used as the predictor and the other feature as the response
variable. In numeric-categorical pairs, the numeric feature is considered
to be more informative and is thus used as the predictor. In
categorical-categorical pairs, the feature with most levels is used as the
predictor. In case any of the classical correlation coefficients (pearson,
spearman and kendall) are used with (mixed) categorical features, the
categorical features are one-hot encoded and the mean correlation over all
resulting pairs is used as similarity.
cluster_similarity_threshold (optional) The threshold level for
pair-wise similarity that is required to form clusters using fixed_cut.
This should be a numerical value between 0.0 and 1.0. Note however, that a
reasonable threshold value depends strongly on the similarity metric. The
following are the default values used: 
 mcfadden_r2 and mutual_information: 0.30
 cox_snell_r2 and nagelkerke_r2: 0.75
 spearman, kendall and pearson: 0.90
 Alternatively, if the fixed_cut method is not used, this value determines
whether any clustering should be performed, because the data may not
contain highly similar features. The default values in this situation are: 
 mcfadden_r2 and mutual_information: 0.25
 cox_snell_r2 and nagelkerke_r2: 0.40
 spearman, kendall and pearson: 0.70
 The threshold value is converted to a distance (1-similarity) prior to
cutting hierarchical trees.
cluster_representation_method (optional) Method used to determine
how the information of co-clustered features is summarised and used to
represent the cluster. The following methods can be selected:
 
 best_predictor (default): The feature with the highest importance
according to univariate regression with the outcome is used to represent
the cluster.
 medioid: The feature closest to the cluster center, i.e. the feature
that is most similar to the remaining features in the cluster, is used to
represent the cluster.
 mean: A meta-feature is generated by averaging the feature values for
all features in a cluster. This method aligns all features so that all
features will be positively correlated prior to averaging. Should a cluster
contain one or more categorical features, the medioid method will be used
instead, as averaging is not possible. Note that if this method is chosen,
the normalisation_method parameter should be one of standardisation,
standardisation_trim, standardisation_winsor or quantile.
 If the pam cluster method is selected, only the medioid method can be
used. In that case 1 medioid is used by default.
parallel_preprocessing (optional) Enable parallel processing for the
preprocessing workflow. Defaults to TRUE. When set to FALSE, this will
disable the use of parallel processing while preprocessing, regardless of
the settings of the parallel parameter. parallel_preprocessing is
ignored if parallel=FALSE. | 
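As a rough illustration of the fixed_cut procedure described in the arguments above, the following base R sketch converts pairwise similarity into a distance (1 - similarity), builds an agglomerative tree and cuts it at 1 - threshold. The toy data, the variable names and the use of the absolute Spearman correlation as similarity are assumptions made for this example. It is not familiar's internal implementation.

```r
# Toy data (assumed): f1 and f2 are near-duplicates, f3 is unrelated noise.
set.seed(1)
x <- rnorm(100)
features <- data.frame(
  f1 = x,
  f2 = x + rnorm(100, sd = 0.1),
  f3 = rnorm(100)
)

# Pairwise similarity. Using the absolute Spearman correlation is an
# assumption of this sketch.
similarity <- abs(cor(features, method = "spearman"))

# Convert similarity to distance and build an agglomerative tree.
distance <- stats::as.dist(1 - similarity)
tree <- stats::hclust(distance, method = "average")

# Cut the tree at 1 - threshold, using the documented default threshold
# of 0.90 for spearman with fixed_cut.
threshold <- 0.90
clusters <- stats::cutree(tree, h = 1 - threshold)
print(clusters)
```

Features that receive the same cluster number would be summarised by a single representative, for example according to cluster_representation_method.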
Details
This is a thin wrapper around summon_familiar, and functions like
it, but automatically skips computation of variable importance, learning
and subsequent evaluation steps.
The function returns an experimentData object, which can be used to
warm-start other experiments by providing it to the experiment_data
argument.
Value
An experimentData object.
Pre-compute variable importance
Description
Creates data assignment, extracts feature information and
subsequently computes variable importance.
Usage
precompute_vimp(
  formula = NULL,
  data = NULL,
  experiment_data = NULL,
  cl = NULL,
  experimental_design = "fs+mb",
  fs_method = NULL,
  fs_method_parameter = NULL,
  verbose = TRUE,
  ...
)
Arguments
| formula | An R formula. The formula can only contain feature names and
dot (.). The * and +1 operators are not supported as these refer to
columns that are not present in the data set. Use of the formula interface is optional. | 
| data | A data.table object, a data.frame object, list containing
multiple data.table or data.frame objects, or paths to data files. data
should be provided if no file paths are provided to the data_files
argument. If both are provided, only data will be used.
 All data is expected to be in wide format, and ideally has a sample
identifier (see sample_id_column), batch identifier (see cohort_column)
and outcome columns (see outcome_column).
 In case paths are provided, the data should be stored as csv, rds or
RData files. See documentation for the data_files argument for more
information. | 
| experiment_data | Experimental data may be provided in the form of | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. When a cluster is not
provided, parallelisation is performed by setting up a cluster on the local
machine. This parameter has no effect if the parallel argument is set to FALSE. | 
| experimental_design | (required) Defines what the experiment looks
like, e.g. cv(bt(fs,20)+mb,3,2) for 2 times repeated 3-fold
cross-validation with nested feature selection on 20 bootstraps and
model-building. The basic workflow components are: 
 fs: (required) feature selection step.
 mb: (required) model building step. Though models are not learned by
precompute_vimp, this element is still required to prevent issues when
using the resulting experimentData object to warm-start the experiments.
 ev: (optional) external validation. If validation batches or cohorts
are present in the dataset (data), these should be indicated in the
validation_batch_id argument.
 The different components are linked using +.
 Different subsampling methods can be used in conjunction with the basic
workflow components:
 
 bs(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. In contrast to bt, feature pre-processing parameters and
hyperparameter optimisation are conducted on individual bootstraps.
 bt(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. Unlike bs and other subsampling methods, no separate
pre-processing parameters or optimised hyperparameters will be determined
for each bootstrap.
 cv(x,n,p): (stratified) n-fold cross-validation, repeated p times.
Pre-processing parameters are determined for each iteration.
 lv(x): leave-one-out cross-validation. Pre-processing parameters are
determined for each iteration.
 ip(x): imbalance partitioning for addressing class imbalances on the
data set. Pre-processing parameters are determined for each partition. The
number of partitions generated depends on the imbalance correction method
(see the imbalance_correction_method parameter).
 As shown in the example above, sampling algorithms can be nested.
 The simplest valid experimental design is fs+mb. This is the default in
precompute_vimp, and will compute variable importance over the entire
dataset.
 This argument is ignored if the experiment_data argument is set. | 
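To make the nesting in a design such as cv(bt(fs,20)+mb,3,2) more concrete, the base R sketch below generates the corresponding sample assignments: 3-fold cross-validation repeated 2 times, with 20 bootstraps drawn from each training fold for feature selection. It is only an illustration. The sample count and all names are hypothetical, stratification is omitted, and familiar's own subsampling code may differ.

```r
set.seed(42)

# Hypothetical data set of 30 samples, identified by index.
n_samples <- 30
sample_ids <- seq_len(n_samples)

n_folds <- 3        # n in cv(x,n,p)
n_repeats <- 2      # p in cv(x,n,p)
n_bootstraps <- 20  # n in bt(x,n)

design <- list()
for (rep_id in seq_len(n_repeats)) {
  # Randomly assign every sample to one fold (unstratified in this sketch).
  fold_of_sample <- sample(rep(seq_len(n_folds), length.out = n_samples))

  for (fold_id in seq_len(n_folds)) {
    train_ids <- sample_ids[fold_of_sample != fold_id]
    test_ids <- sample_ids[fold_of_sample == fold_id]

    # Bootstraps for feature selection, drawn with replacement from the
    # training part of the fold only.
    bootstraps <- lapply(seq_len(n_bootstraps), function(ii) {
      sample(train_ids, size = length(train_ids), replace = TRUE)
    })

    design[[length(design) + 1]] <- list(
      repetition = rep_id,
      fold = fold_id,
      train = train_ids,
      test = test_ids,
      fs_bootstraps = bootstraps
    )
  }
}

length(design)
```

Each element of design corresponds to one train/test split. In this reading of the design string, feature selection would run on the 20 bootstraps and model building on the full training part.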
| fs_method | (required) Feature selection method to be used for
determining variable importance. familiar implements various feature
selection methods. Please refer to the vignette on feature selection
methods for more details.
 More than one feature selection method can be chosen. The experiment will
then be repeated for each feature selection method.
 Feature selection methods determine the ranking of features. Actual
selection of features is done by optimising the signature size model
hyperparameter during the hyperparameter optimisation step. | 
| fs_method_parameter | (optional) List of lists containing parameters
for feature selection methods. Each sublist should have the name of the
feature selection method it corresponds to.
 Most feature selection methods do not have parameters that can be set.
Please refer to the vignette on feature selection methods for more details.
Note that if the feature selection method is based on a learner (e.g. lasso
regression), hyperparameter optimisation may be performed prior to
assessing variable importance. | 
| verbose | Indicates verbosity of the results. Default is TRUE, and all
messages and warnings are returned. | 
| ... | Arguments passed on to .parse_experiment_settings, .parse_setup_settings, .parse_preprocessing_settings, .parse_feature_selection_settings 
batch_id_column (recommended) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.
 In familiar any row of data is organised by four identifiers:
 
 The batch identifier batch_id_column: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
 The sample identifier sample_id_column: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
 The series identifier series_id_column: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
 The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
sample_id_column (recommended) Name of the column containing
sample or subject identifiers. See batch_id_column above for more
details. If unset, every row will be identified as a single sample.
series_id_column (optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See batch_id_column above for more details.
 If unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers.
development_batch_id (optional) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all, or
all minus the identifiers in validation_batch_id for external validation.
Required if external validation is performed and validation_batch_id is
not provided.
validation_batch_id (optional) One or more batch or cohort
identifiers to constitute data sets for external validation. Defaults to
all data sets except those in development_batch_id for external
validation, or none if not. Required if development_batch_id is not
provided.
outcome_name (optional) Name of the modelled outcome. This name will
be used in figures created by familiar.
 If not set, the column name in outcome_column will be used for binomial,
multinomial, count and continuous outcomes. For other
outcomes (survival and competing_risk) no default is used.
outcome_column (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that survival
and competing_risk outcomes require two columns that
indicate the time-to-event or the time of last follow-up and the event
status.
outcome_type (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from
contents of the outcome column. This may lead to unexpected results, and we
therefore advise to provide this information manually.
 Note that competing_risk survival analysis is not fully supported, and
is currently not a valid choice for outcome_type.
class_levels (optional) Class levels for binomial or multinomial
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column.
event_indicator (recommended) Indicator for events in survival and
competing_risk analyses. familiar will automatically recognise 1, true,
t, y and yes as event indicators, including different
capitalisations. If this parameter is set, it replaces the default values.
censoring_indicator (recommended) Indicator for right-censoring in
survival and competing_risk analyses. familiar will automatically
recognise 0, false, f, n, no as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values.
competing_risk_indicator (recommended) Indicator for competing
risks in competing_risk analyses. There are no default values, and if
unset, all values other than those specified by the event_indicator and
censoring_indicator parameters are considered to indicate competing
risks.
signature (optional) One or more names of feature columns that are
considered part of a specific signature. Features specified here will
always be used for modelling. Ranking from feature selection has no effect
for these features.
novelty_features (optional) One or more names of feature columns
that should be included for the purpose of novelty detection.
exclude_features (optional) Feature columns that will be removed
from the data set. Cannot overlap with features in signature,
novelty_features or include_features.
include_features (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap signature. Features in
signature and novelty_features are always included. If both
exclude_features and include_features are provided, include_features
takes precedence, provided that there is no overlap between the two.
reference_method (optional) Method used to set reference levels for
categorical features. There are several options:
 
 auto (default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
 always: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
 never: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
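The difference between a sorted reference level and a most-frequent reference level can be illustrated in base R. The feature values below are hypothetical, and the sketch only mimics the described behaviour for an automatically detected (character) feature. It does not call familiar.

```r
# Hypothetical character feature that would be detected as categorical.
colour <- c("red", "blue", "red", "green", "red", "blue")

# never-like behaviour: levels are sorted and the first level is the
# reference level.
sorted_reference <- factor(colour)

# auto/always-like behaviour: the most frequent level becomes the
# reference level.
most_frequent <- names(which.max(table(colour)))
frequent_reference <- stats::relevel(factor(colour), ref = most_frequent)

levels(sorted_reference)[1]
levels(frequent_reference)[1]
```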
imbalance_correction_method (optional) Type of method used to
address class imbalances. Available options are:
 
 full_undersampling (default): All data will be used in an ensemble
fashion. The full minority class will appear in each partition, but
majority classes are undersampled until all data have been used.
 random_undersampling: Randomly undersamples majority classes. This is
useful in cases where full undersampling would lead to the formation of
many models due to major overrepresentation of the largest class.
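A minimal sketch of the idea behind full undersampling for a two-class outcome is given below: the minority class is kept whole in every partition, and the majority class is divided over partitions until every majority sample has been used once. The outcome vector, the chunking rule and all names are assumptions for illustration and need not match familiar's implementation.

```r
set.seed(7)

# Hypothetical imbalanced outcome: 10 events, 35 non-events.
outcome <- c(rep("event", 10), rep("no_event", 35))

minority_ids <- which(outcome == "event")
majority_ids <- sample(which(outcome == "no_event"))  # shuffled

# Number of partitions needed to use every majority sample once, with
# roughly as many majority as minority samples per partition.
n_partitions <- ceiling(length(majority_ids) / length(minority_ids))
chunk_of_sample <- rep(seq_len(n_partitions), length.out = length(majority_ids))

partitions <- lapply(seq_len(n_partitions), function(ii) {
  sort(c(minority_ids, majority_ids[chunk_of_sample == ii]))
})

lengths(partitions)
```

In contrast, random undersampling would draw a fixed number of such partitions at random, so that not every majority sample is necessarily used.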
 This parameter is only used in combination with imbalance partitioning in
the experimental design, and ip should therefore appear in the string
that defines the design.
imbalance_n_partitions (optional) Number of times random
undersampling should be repeated. 10 undersampled subsets with balanced
classes are formed by default.
parallel (optional) Enable parallel processing. Defaults to TRUE.
When set to FALSE, this disables all parallel processing, regardless of
specific parameters such as parallel_preprocessing. However, when
parallel is TRUE, parallel processing of different parts of the
workflow can be disabled by setting respective flags to FALSE.
parallel_nr_cores (optional) Number of cores available for
parallelisation. Defaults to 2. This setting does nothing if
parallelisation is disabled.
restart_cluster (optional) Restart nodes used for parallel computing
to free up memory prior to starting a parallel process. Note that it does
take time to set up the clusters. Therefore setting this argument to TRUE
may impact processing speed. This argument is ignored if parallel is
FALSE or the cluster was initialised outside of familiar. Default is
FALSE, which causes the clusters to be initialised only once.
cluster_type (optional) Selection of the cluster type for parallel
processing. Available types are the ones supported by the parallel package
that is part of the base R distribution: psock (default), fork, mpi,
nws, sock. In addition, none is available, which also disables
parallel processing.
backend_type (optional) Selection of the backend for distributing
copies of the data. This backend ensures that only a single master copy is
kept in memory. This limits memory usage during parallel processing.
 Several backend options are available, notably socket_server, and none
(default). socket_server is based on the callr package and R sockets,
comes with familiar and is available for any OS. none uses the package
environment of familiar to store data, and is available for any OS.
However, none requires copying of data to any parallel process, and has a
larger memory footprint.
server_port (optional) Integer indicating the port on which the
socket server or RServe process should communicate. Defaults to port 6311.
Note that ports 0 to 1024 and 49152 to 65535 cannot be used.
feature_max_fraction_missing (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a feature to be included in the data set. All features with a
missing value fraction over this threshold are not processed further. The
default value is 0.30.
sample_max_fraction_missing (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a sample to be included in the data set. All samples with a
missing value fraction over this threshold are excluded and not processed
further. The default value is 0.30.
filter_method (optional) One or more methods used to reduce
dimensionality of the data set by removing irrelevant or poorly
reproducible features.
 Several methods are available:
 
 none (default): None of the features will be filtered.
 low_variance: Features with a variance below the
low_var_minimum_variance_threshold are filtered. This can be useful to
filter, for example, genes that are not differentially expressed.
 univariate_test: Features undergo a univariate regression using an
outcome-appropriate regression model. The p-value of the model coefficient
is collected. Features with coefficient p or q-value above the
univariate_test_threshold are subsequently filtered.
 robustness: Features that are not sufficiently robust according to the
intraclass correlation coefficient are filtered. Use of this method
requires that repeated measurements are present in the data set, i.e. there
should be entries for which the sample and cohort identifiers are the same.
 More than one method can be used simultaneously. Features with singular
values are always filtered, as these do not contain information.
univariate_test_threshold (optional) Numeric value between 1.0 and
0.0 that determines which features are irrelevant and will be filtered by
the univariate_test. The p or q-values are compared to this threshold.
All features with values above the threshold are filtered. The default
value is 0.20.
univariate_test_threshold_metric (optional) Metric that is compared
against the univariate_test_threshold. The following metrics can
be chosen: 
 p_value (default): The unadjusted p-value of each feature is used
to filter features.
 q_value: The q-value (Storey, 2002) is used to filter features. Some
data sets may have insufficient samples to compute the q-value. The
qvalue package must be installed from Bioconductor to use this method.
univariate_test_max_feature_set_size(optional) Maximum size of the
feature set after the univariate test. P or q values of features are
compared against the threshold, but if the resulting data set would be
larger than this setting, only the most relevant features up to the desired
feature set size are selected.
 The default value is NULL, which causes features to be filtered based on
their relevance only.
low_var_minimum_variance_threshold (required, if used) Numeric value
that determines which features will be filtered by the low_variance
method. The variance of each feature is computed and compared to the
threshold. If it is below the threshold, the feature is removed.
 This parameter has no default value and should be set if low_variance is
used.
low_var_max_feature_set_size (optional) Maximum size of the feature
set after filtering features with a low variance. All features are first
compared against low_var_minimum_variance_threshold. If the resulting
feature set would be larger than specified, only the most strongly varying
features will be selected, up to the desired size of the feature set.
 The default value is NULL, which causes features to be filtered based on
their variance only.
robustness_icc_type (optional) String indicating the type of
intraclass correlation coefficient (1, 2 or 3) that should be used to
compute robustness for features in repeated measurements. These types
correspond to the types in Shrout and Fleiss (1979). The default value is 1.
robustness_threshold_metric (optional) String indicating which
specific intraclass correlation coefficient (ICC) metric should be used to
filter features. This should be one of:
 
 icc: The estimated ICC value itself.
 icc_low (default): The estimated lower limit of the 95% confidence
interval of the ICC, as suggested by Koo and Li (2016).
 icc_panel: The estimated ICC value over the panel average, i.e. the ICC
that would be obtained if all repeated measurements were averaged.
 icc_panel_low: The estimated lower limit of the 95% confidence interval
of the panel ICC.
robustness_threshold_value (optional) The intraclass correlation
coefficient value that is used as threshold. The default value is 0.70.
transformation_method (optional) The transformation method used to
change the distribution of the data to be more normal-like. The following
methods are available:
 
 none: This disables transformation of features.
 yeo_johnson: Transformation using the location and scale invariant
version of the Yeo-Johnson transformation (Yeo and Johnson, 2000;
Zwanenburg and Löck, 2023).
 yeo_johnson_robust (default): A robust version of yeo_johnson.
This method is less sensitive to outliers.
 yeo_johnson_conventional: As yeo_johnson, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Yeo and Johnson (2000).
 box_cox: Transformation using the location and scale invariant version
of the Box-Cox transformation (Box and Cox, 1964; Zwanenburg and Löck,
2023).
 box_cox_robust: A robust version of box_cox. This method is less
sensitive to outliers.
 box_cox_conventional: As box_cox, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Box and Cox (1964). This method requires
strictly positive feature values.
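For reference, the conventional Yeo-Johnson transformation listed above (without location and scale optimisation) can be written as a short base R function. This is a sketch of the standard published formula with a fixed, user-supplied lambda. The function name and tolerance are chosen for this example, and it is not the power.transform implementation, which additionally optimises lambda.

```r
# Conventional Yeo-Johnson transformation for a fixed lambda.
# The function name and tolerance are hypothetical.
yeo_johnson_sketch <- function(x, lambda, tol = 1e-8) {
  y <- numeric(length(x))
  pos <- x >= 0

  # Non-negative values.
  if (abs(lambda) > tol) {
    y[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    y[pos] <- log1p(x[pos])
  }

  # Negative values.
  if (abs(lambda - 2) > tol) {
    y[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    y[!pos] <- -log1p(-x[!pos])
  }

  y
}

x <- c(-2, -0.5, 0, 1, 3)
yeo_johnson_sketch(x, lambda = 1)  # lambda = 1 leaves values unchanged
yeo_johnson_sketch(x, lambda = 0)
```

Unlike Box-Cox, this transformation is defined for zero and negative values, which is why only the Box-Cox variants require strictly positive features.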
 Transformation requires the power.transform package. Only features that
contain numerical data are transformed. Transformation parameters obtained
in development data are stored within featureInfo objects for later use
with validation data sets.
transformation_optimisation_criterion (optional) Transformation
parameters are optimised using a criterion, conventionally
maximum-likelihood-estimation. power.transform implements multiple
optimisation criteria, of which the following are available: 
 mle (default): Optimisation using maximum likelihood estimation.
 cramer_von_mises: Optimisation using the Cramér-von Mises
criterion. Zwanenburg and Löck (2023) found that this criterion was
relatively robust against outliers.
transformation_gof_test_p_value (optional) Not all transformations
will lead to features that are roughly normally distributed. Zwanenburg and
Löck (2023) established an empirical goodness-of-fit test for central
normality. This parameter sets the significance for rejecting the
null-hypothesis that a feature distribution is centrally normal. When the
null-hypothesis is rejected, no transformation is performed. The default
value is NULL, which disables the test.
normalisation_method (optional) The normalisation method used to
improve the comparability between numerical features that may have very
different scales. The following normalisation methods can be chosen:
 
 none: This disables feature normalisation.
 standardisation: Features are normalised by subtraction of their mean
values and division by their standard deviations. This causes every feature
to have a center value of 0.0 and standard deviation of 1.0.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust (default): A robust version of standardisation
that relies on computing Huber's M-estimators for location and scale.
 normalisation: Features are normalised by subtraction of their minimum
values and division by their ranges. This maps all feature values to a
[0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features are normalised by subtraction of their median values
and division by their interquartile range.
 mean_centering: Features are centered by subtracting the mean, but do
not undergo rescaling.
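The basic (non-robust, non-trimmed) variants in the list above reduce to simple arithmetic, which the base R sketch below spells out on a hypothetical feature vector. It only illustrates the definitions as stated. The trimmed, winsorised and Huber-based variants are not shown.

```r
# Hypothetical numeric feature with one outlying value.
values <- c(1, 2, 3, 4, 100)

# standardisation: subtract the mean, divide by the standard deviation.
standardised <- (values - mean(values)) / stats::sd(values)

# normalisation: subtract the minimum, divide by the range.
normalised <- (values - min(values)) / (max(values) - min(values))

# quantile: subtract the median, divide by the interquartile range.
quantile_scaled <- (values - stats::median(values)) / stats::IQR(values)

# mean_centering: subtract the mean only, without rescaling.
centred <- values - mean(values)

round(rbind(standardised, normalised, quantile_scaled, centred), 2)
```

Because the median and interquartile range are hardly affected by the outlying value, the quantile variant keeps the first four values on a comparable scale, whereas standardisation and normalisation compress them.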
 Only features that contain numerical data are normalised. Normalisation
parameters obtained in development data are stored within featureInfo
objects for later use with validation data sets.
batch_normalisation_method (optional) The method used for batch
normalisation. Available methods are:
 
 none (default): This disables batch normalisation of features.
 standardisation: Features within each batch are normalised by
subtraction of the mean value and division by the standard deviation in
each batch.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust: A robust version of standardisation that
relies on computing Huber's M-estimators for location and scale within each
batch.
 normalisation: Features within each batch are normalised by subtraction
of their minimum values and division by their range in each batch. This
maps all feature values in each batch to a [0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features in each batch are normalised by subtraction of the
median value and division by the interquartile range of each batch.
 mean_centering: Features in each batch are centered on 0.0 by
subtracting the mean value in each batch, but are not rescaled.
 combat_parametric: Batch adjustments using parametric empirical Bayes
(Johnson et al, 2007). combat_p leads to the same method.
 combat_non_parametric: Batch adjustments using non-parametric empirical
Bayes (Johnson et al, 2007). combat_np and combat lead to the same
method. Note that we reduced complexity from O(n^2) to O(n) by
only computing batch adjustment parameters for each feature on a subset of
50 randomly selected features, instead of all features.
 Only features that contain numerical data are normalised using batch
normalisation. Batch normalisation parameters obtained in development data
are stored within featureInfo objects for later use with validation data
sets, in case the validation data is from the same batch.
 If validation data contains data from unknown batches, normalisation
parameters are separately determined for these batches.
 Note that for both empirical Bayes methods, the batch effect is assumed to
produce results across the features. This is often true for things such as
gene expressions, but the assumption may not hold generally.
 When performing batch normalisation, it is moreover important to check that
differences between batches or cohorts are not related to the studied
endpoint.
imputation_method (optional) Method used for imputing missing
feature values. Two methods are implemented:
 
 simple: Simple replacement of a missing value by the median value (for
numeric features) or the modal value (for categorical features).
 lasso: Imputation of missing value by lasso regression (using glmnet)
based on information contained in other features.
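The simple method can be sketched in a few lines of base R: the median replaces missing numeric values and the most frequent value replaces missing categorical values. The helper name impute_simple is hypothetical. The lasso method, which needs glmnet, is not shown.

```r
# Hypothetical helper illustrating simple imputation for one feature.
impute_simple <- function(x) {
  missing <- is.na(x)

  if (is.numeric(x)) {
    # Numeric feature: replace by the median of the observed values.
    x[missing] <- stats::median(x, na.rm = TRUE)
  } else {
    # Categorical feature: replace by the modal (most frequent) value.
    counts <- table(x)
    x[missing] <- names(counts)[which.max(counts)]
  }

  x
}

impute_simple(c(1, NA, 3, 10))
impute_simple(c("a", "b", NA, "b"))
```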
 simple imputation precedes lasso imputation to ensure that any missing
values in predictors required for lasso regression are resolved. The
lasso estimate is then used to replace the missing value.
 The default value depends on the number of features in the dataset. If the
number is lower than 100, lasso is used by default, and simple otherwise.
 Only single imputation is performed. Imputation models and parameters are
stored within featureInfo objects for later use with validation data
sets.
cluster_method (optional) Clustering is performed to identify and
replace redundant features, for example those that are highly correlated.
Such features do not carry much additional information and may be removed
or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011).
 The cluster method determines the algorithm used to form the clusters. The
following cluster methods are implemented:
 
 none: No clustering is performed.
 hclust (default): Hierarchical agglomerative clustering. If the
fastcluster package is installed, fastcluster::hclust is used (Muellner
2013), otherwise stats::hclust is used.
 agnes: Hierarchical clustering using agglomerative nesting (Kaufman and
Rousseeuw, 1990). This algorithm is similar to hclust, but uses the
cluster::agnes implementation.
 diana: Divisive analysis hierarchical clustering. This method uses
divisive instead of agglomerative clustering (Kaufman and Rousseeuw, 1990).
cluster::diana is used.
 pam: Partitioning around medoids. This partitions the data into $k$
clusters around medoids (Kaufman and Rousseeuw, 1990). $k$ is selected
using the silhouette metric. pam is implemented using the cluster::pam
function.
 Clusters and cluster information are stored within featureInfo objects for
later use with validation data sets. This enables reproduction of the same
clusters as formed in the development data set.
cluster_linkage_method (optional) Linkage method used for
agglomerative clustering in hclust and agnes. The following linkage
methods can be used: 
 average (default): Average linkage.
 single: Single linkage.
 complete: Complete linkage.
 weighted: Weighted linkage, also known as McQuitty linkage.
 ward: Linkage using Ward's minimum variance method.
 diana and pam do not require a linkage method.
cluster_cut_method (optional) The method used to define the actual
clusters. The following methods can be used:
 
 silhouette: Clusters are formed based on the silhouette score
(Rousseeuw, 1987). The average silhouette score is computed from 2 to n
clusters, with n the number of features. Clusters are only
formed if the average silhouette exceeds 0.50, which indicates reasonable
evidence for structure. This procedure may be slow if the number of
features is large (>100s).
 fixed_cut: Clusters are formed by cutting the hierarchical tree at the
point indicated by thecluster_similarity_threshold, e.g. where features
in a cluster have an average Spearman correlation of 0.90.fixed_cutis
only available foragnes,dianaandhclust.
 dynamic_cut: Dynamic cluster formation using the cutting algorithm in
thedynamicTreeCutpackage. This package should be installed to select
this option.dynamic_cutcan only be used withagnesandhclust.
 The default options are silhouettefor partioning around medioids (pam)
andfixed_cutotherwise.cluster_similarity_metric(optional) Clusters are formed based on
feature similarity. All features are compared in a pair-wise fashion to
compute similarity, for example correlation. The resulting similarity grid
is converted into a distance matrix that is subsequently used for
clustering. The following metrics are supported to compute pairwise
similarities:
 
 mutual_information (default): Normalised mutual information.
 mcfadden_r2: McFadden's pseudo R-squared (McFadden, 1974).
 cox_snell_r2: Cox and Snell's pseudo R-squared (Cox and Snell, 1989).
 nagelkerke_r2: Nagelkerke's pseudo R-squared (Nagelkerke, 1991).
 spearman: Spearman's rank order correlation.
 kendall: Kendall rank correlation.
 pearson: Pearson product-moment correlation.
 The pseudo R-squared metrics can be used to assess similarity between mixed
pairs of numeric and categorical features, as these are based on the
log-likelihood of regression models. In familiar, the more informative
feature is used as the predictor and the other feature as the response
variable. In numeric-categorical pairs, the numeric feature is considered
to be more informative and is thus used as the predictor. In
categorical-categorical pairs, the feature with most levels is used as the
predictor. In case any of the classical correlation coefficients (pearson,
spearman and kendall) are used with (mixed) categorical features, the
categorical features are one-hot encoded and the mean correlation over all
resulting pairs is used as similarity.
cluster_similarity_threshold (optional) The threshold level for
pair-wise similarity that is required to form clusters using fixed_cut.
This should be a numerical value between 0.0 and 1.0. Note however, that a
reasonable threshold value depends strongly on the similarity metric. The
following are the default values used: 
 mcfadden_r2 and mutual_information: 0.30
 cox_snell_r2 and nagelkerke_r2: 0.75
 spearman, kendall and pearson: 0.90
 Alternatively, if the fixed_cut method is not used, this value determines
whether any clustering should be performed, because the data may not
contain highly similar features. The default values in this situation are: 
 mcfadden_r2 and mutual_information: 0.25
 cox_snell_r2 and nagelkerke_r2: 0.40
 spearman, kendall and pearson: 0.70
 The threshold value is converted to a distance (1-similarity) prior to
cutting hierarchical trees.
cluster_representation_method (optional) Method used to determine
how the information of co-clustered features is summarised and used to
represent the cluster. The following methods can be selected:
 
 best_predictor (default): The feature with the highest importance
according to univariate regression with the outcome is used to represent
the cluster.
 medioid: The feature closest to the cluster center, i.e. the feature
that is most similar to the remaining features in the cluster, is used to
represent the cluster.
 mean: A meta-feature is generated by averaging the feature values for
all features in a cluster. This method aligns all features so that all
features will be positively correlated prior to averaging. Should a cluster
contain one or more categorical features, the medioid method will be used
instead, as averaging is not possible. Note that if this method is chosen,
the normalisation_method parameter should be one of standardisation,
standardisation_trim, standardisation_winsor or quantile.
 If the pam cluster method is selected, only the medioid method can be
used. In that case 1 medioid is used by default.
parallel_preprocessing (optional) Enable parallel processing for the
preprocessing workflow. Defaults to TRUE. When set to FALSE, this will
disable the use of parallel processing while preprocessing, regardless of
the settings of the parallel parameter. parallel_preprocessing is
ignored if parallel=FALSE.
parallel_feature_selection (optional) Enable parallel processing for
the feature selection workflow. Defaults to TRUE. When set to FALSE,
this will disable the use of parallel processing while performing feature
selection, regardless of the settings of the parallel parameter.
parallel_feature_selection is ignored if parallel=FALSE. | 
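The fixed_cut procedure described for cluster_cut_method and cluster_similarity_threshold can be sketched with base R alone. This is a minimal illustration, not the code familiar runs internally: the toy features a1, a2 and b1 are made up, and taking the absolute Spearman correlation as the similarity is an assumption made here so that strongly negative correlations also count as similar.

```r
# Sketch of fixed_cut-style clustering with base R only (not familiar's code).
set.seed(1)
n <- 200
base_a <- rnorm(n)
base_b <- rnorm(n)

# Hypothetical features: a2 is a noisy copy of a1; b1 is independent of both.
features <- data.frame(
  a1 = base_a,
  a2 = base_a + rnorm(n, sd = 0.05),
  b1 = base_b
)

# Pairwise similarity: absolute Spearman correlation (assumption, see above).
similarity <- abs(cor(features, method = "spearman"))

# Convert similarity to distance (1 - similarity), then build the tree with
# average linkage, the documented default linkage method.
tree <- stats::hclust(stats::as.dist(1 - similarity), method = "average")

# Cut the tree at the distance that corresponds to a similarity of 0.90,
# the documented default threshold for spearman.
similarity_threshold <- 0.90
cluster_id <- stats::cutree(tree, h = 1 - similarity_threshold)
print(cluster_id)
```

Features that share a cluster identifier would then be summarised by a single representative, as controlled by cluster_representation_method.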
Details
This is a thin wrapper around summon_familiar, and functions like
it, but automatically skips learning and subsequent evaluation steps.
The function returns an experimentData object, which can be used to
warm-start other experiments by providing it to the experiment_data
argument. Variable importance may be retrieved from this object using the
get_vimp_table and aggregate_vimp_table methods.
Value
An experimentData object.
See Also
get_vimp_table, aggregate_vimp_table
Model predictions for familiar models and model ensembles
Description
Applies the model or ensemble of models to the data and returns the
predictions.
Usage
predict(object, ...)
## S4 method for signature 'familiarModel'
predict(
  object,
  newdata,
  type = "default",
  time = NULL,
  dir_path = NULL,
  ensemble_method = "median",
  stratification_threshold = NULL,
  stratification_method = NULL,
  percentiles = NULL,
  ...
)
## S4 method for signature 'familiarEnsemble'
predict(
  object,
  newdata,
  type = "default",
  time = NULL,
  dir_path = NULL,
  ensemble_method = "median",
  stratification_threshold = NULL,
  stratification_method = NULL,
  percentiles = NULL,
  ...
)
## S4 method for signature 'familiarNoveltyDetector'
predict(object, newdata, type = "novelty", ...)
## S4 method for signature 'list'
predict(
  object,
  newdata,
  type = "default",
  time = NULL,
  dir_path = NULL,
  ensemble_method = "median",
  stratification_threshold = NULL,
  stratification_method = NULL,
  percentiles = NULL,
  ...
)
## S4 method for signature 'character'
predict(
  object,
  newdata,
  type = "default",
  time = NULL,
  dir_path = NULL,
  ensemble_method = "median",
  stratification_threshold = NULL,
  stratification_method = NULL,
  percentiles = NULL,
  ...
)
Arguments
| object | A familiar model or ensemble of models that should be used for
prediction. This can also be a path to the ensemble model, one or more
paths to models, or a list of models. | 
| ... | to be documented. | 
| newdata | Data to which the models are applied. familiar performs
checks on the data to ensure that all features required by the
model are present, and no additional levels are present in categorical
features. Unlike other predict methods, newdata cannot be missing in
familiar, as training data are not stored with the models. | 
| type | Type of prediction made. The following values are directly
supported:
 
 default: Default prediction, i.e. value estimates for count and
continuous outcomes, predicted class probabilities and class for binomial
and multinomial outcomes, and the model response for survival outcomes.
 survival_probability: Predicts survival probabilities at the time
specified by time. Only applicable to survival outcomes. Some models
may not allow for predicting survival probabilities based on their
response.
 novelty: Predicts novelty of each sample, which can be used for
out-of-distribution detection.
 risk_stratification: Predicts the strata to which the data belongs. Only
for survival outcomes.
 Other values for type are passed to the predict method of the actual
underlying model. For example, for generalised linear models (glm) type
can be link, response or terms as well. Some of these model-specific
prediction types may fail to return results if the model has been trimmed. | 
| time | Time at which the response (default) or survival probability
(survival_probability) should be predicted for survival outcomes. Some
models have a response that does not depend on time, e.g. cox, whereas
others do, e.g. random_forest. | 
| dir_path | Path to the folder containing the models. Ensemble objects
are stored with the models detached. In case the models were moved since
creation, dir_path can be used to specify the current folder.
Alternatively, the update_model_dir_path method can be used to update the
path. | 
| ensemble_method | Method for ensembling predictions from models for the
same sample. Available methods are:
 | 
| stratification_threshold | Threshold value(s) used for stratifying
instances into risk groups. If this parameter is specified,
stratification_method and any threshold values that come with the model
are ignored, and stratification_threshold is used instead. | 
| stratification_method | Selects the stratification method from which the
threshold values should be selected. If the model or ensemble of models
does not contain thresholds for the indicated method, an error is returned.
In addition, this argument is ignored if a stratification_threshold is
set. | 
| percentiles | Currently unused. | 
Details
This method is used to predict values for instances specified by the
newdata using the model or ensemble of models specified by the object
argument.
Value
A data.table with predicted values.
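As a hedged usage sketch, the wrapper below requests survival probabilities at a single time point. It uses only arguments listed in the Usage section above. The helper name predict_survival_at and its parameter names are hypothetical, and the model and data are assumed to come from an earlier familiar experiment.

```r
library(familiar)

# Hypothetical helper: predicts survival probabilities at one time point.
# `model` may be a familiarModel, a familiarEnsemble, a list of models, or
# one or more paths to stored models, as described for `object`.
predict_survival_at <- function(model, new_data, time_point) {
  predict(
    object = model,
    newdata = new_data, # required: training data are not stored with models
    type = "survival_probability",
    time = time_point
  )
}
```

The call returns a data.table with predicted values, as stated above. Leaving type at "default" would return the model response for survival outcomes instead.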
Rename outcome classes for plotting and export
Description
Tabular exports and figures created from a familiarCollection
object can be customised by providing names for outcome classes.
Usage
## S4 method for signature 'familiarCollection'
set_class_names(x, old = NULL, new = NULL, order = NULL)
Arguments
| x | A familiarCollection object. | 
| old | (optional) Set of old labels to replace. | 
| new | Set of replacement labels. The number of replacement labels should
be equal to the number of provided old labels or the full number of labels.
If a subset of labels is to be replaced, both old and new should be provided. | 
| order | (optional) Ordered set of replacement labels. This is used to
provide the order in which the labels should be placed, which affects e.g.
levels in a plot. If the ordering is not explicitly provided, the old
ordering is used. | 
Details
Labels convert the internal naming for class levels to the requested
label at export or when plotting. This enables customisation of class
names. Currently assigned labels can be found using the
get_class_names method.
Value
A familiarCollection object with updated labels.
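The interplay of old, new and order can be illustrated on a plain named character vector. The relabel helper below is hypothetical and written in base R only to mimic the documented semantics. It is not familiar's implementation and does not operate on a familiarCollection object.

```r
# Hypothetical helper mimicking the documented old/new/order semantics.
# `labels` maps internal names (vector names) to display labels (values).
relabel <- function(labels, old = NULL, new, order = NULL) {
  # Without `old`, `new` must cover the full set of labels.
  if (is.null(old)) old <- unname(labels)

  # Locate the labels to replace; every old label must currently exist.
  position <- match(old, labels)
  stopifnot(length(old) == length(new), !anyNA(position))
  labels[position] <- new

  # `order` lists the replacement labels in the desired order; if it is not
  # provided, the existing order is kept.
  if (!is.null(order)) {
    stopifnot(setequal(order, labels))
    labels <- labels[match(order, labels)]
  }
  labels
}

class_labels <- c("0" = "0", "1" = "1")
class_labels <- relabel(
  class_labels,
  old = c("0", "1"),
  new = c("benign", "malignant"),
  order = c("malignant", "benign")
)
print(class_labels)
```

The same pattern of arguments applies to the other set_*_names methods in this manual, which share the old, new and order parameters.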
Name datasets for plotting and export
Description
Tabular exports and figures created from a familiarCollection
object can be customised by setting data labels.
Usage
## S4 method for signature 'familiarCollection'
set_data_set_names(x, old = NULL, new = NULL, order = NULL)
Arguments
| x | A familiarCollection object. | 
| old | (optional) Set of old labels to replace. | 
| new | Set of replacement labels. The number of replacement labels should
be equal to the number of provided old labels or the full number of labels.
If a subset of labels is to be replaced, both old and new should be provided. | 
| order | (optional) Ordered set of replacement labels. This is used to
provide the order in which the labels should be placed, which affects e.g.
levels in a plot. If the ordering is not explicitly provided, the old
ordering is used. | 
Details
Labels convert internal naming of data sets to the requested label
at export or when plotting. Currently assigned labels can be found using
the get_data_set_names method.
Value
A familiarCollection object with custom names for the data sets.
Rename features for plotting and export
Description
Tabular exports and figures created from a familiarCollection
object can be customised by providing names for features.
Usage
## S4 method for signature 'familiarCollection'
set_feature_names(x, old = NULL, new = NULL, order = NULL)
Arguments
| x | A familiarCollection object. | 
| old | (optional) Set of old labels to replace. | 
| new | Set of replacement labels. The number of replacement labels should
be equal to the number of provided old labels or the full number of labels.
If a subset of labels is to be replaced, both old and new should be provided. | 
| order | (optional) Ordered set of replacement labels. This is used to
provide the order in which the labels should be placed, which affects e.g.
levels in a plot. If the ordering is not explicitly provided, the old
ordering is used. | 
Details
Labels convert the internal naming for features to the requested
label at export or when plotting. This enables customisation without
redoing the analysis with renamed input data. Currently assigned labels can
be found using the get_feature_names method.
Value
A familiarCollection object with updated labels.
Rename feature selection methods for plotting and export
Description
Tabular exports and figures created from a familiarCollection
object can be customised by providing names for the feature selection
methods.
Usage
## S4 method for signature 'familiarCollection'
set_fs_method_names(x, old = NULL, new = NULL, order = NULL)
Arguments
| x | A familiarCollection object. | 
| old | (optional) Set of old labels to replace. | 
| new | Set of replacement labels. The number of replacement labels should
be equal to the number of provided old labels or the full number of labels.
If a subset of labels is to be replaced, both old and new should be provided. | 
| order | (optional) Ordered set of replacement labels. This is used to
provide the order in which the labels should be placed, which affects e.g.
levels in a plot. If the ordering is not explicitly provided, the old
ordering is used. | 
Details
Labels convert the internal naming for feature selection methods to
the requested label at export or when plotting. This enables the use of
more specific naming, e.g. changing mim to Mutual Information
  Maximisation. Currently assigned labels can be found using the
get_fs_method_names method.
Value
A familiarCollection object with updated labels.
Rename learners for plotting and export
Description
Tabular exports and figures created from a familiarCollection
object can be customised by providing names for the learners.
Usage
## S4 method for signature 'familiarCollection'
set_learner_names(x, old = NULL, new = NULL, order = NULL)
Arguments
| x | A familiarCollection object. | 
| old | (optional) Set of old labels to replace. | 
| new | Set of replacement labels. The number of replacement labels should
be equal to the number of provided old labels or the full number of labels.
If a subset of labels is to be replaced, both old and new should be provided. | 
| order | (optional) Ordered set of replacement labels. This is used to
provide the order in which the labels should be placed, which affects e.g.
levels in a plot. If the ordering is not explicitly provided, the old
ordering is used. | 
Details
Labels convert the internal naming for learners to the requested
label at export or when plotting. This enables the use of more specific
naming, e.g. changing random_forest_rfsrc to Random Forest.
Currently assigned labels can be found using the get_learner_names
method.
Value
A familiarCollection object with custom labels for the learners.
Set the name of a familiarData object.
Description
Set the name slot using the object name.
Usage
## S4 method for signature 'familiarData'
set_object_name(x, new = NULL)
Arguments
| x | A familiarData object. | 
Value
A familiarData object with a generated or a provided name.
Set the name of a familiarEnsemble object.
Description
Set the name slot using the object name.
Usage
## S4 method for signature 'familiarEnsemble'
set_object_name(x, new = NULL)
Arguments
| x | A familiarEnsemble object. | 
Value
A familiarEnsemble object with a generated or a provided name.
Set the name of a familiarModel object.
Description
Set the name slot using the object name.
Usage
## S4 method for signature 'familiarModel'
set_object_name(x, new = NULL)
Arguments
| x | A familiarModel object. | 
Value
A familiarModel object with a generated or a provided name.
Rename risk groups for plotting and export
Description
Tabular exports and figures created from a familiarCollection
object can be customised by providing names for risk groups in survival
analysis.
Usage
## S4 method for signature 'familiarCollection'
set_risk_group_names(x, old = NULL, new = NULL, order = NULL)
Arguments
| x | A familiarCollection object. | 
| old | (optional) Set of old labels to replace. | 
| new | Set of replacement labels. The number of replacement labels should
be equal to the number of provided old labels or the full number of labels.
If a subset of labels is to be replaced, both old and new should be provided. | 
| order | (optional) Ordered set of replacement labels. This is used to
provide the order in which the labels should be placed, which affects e.g.
levels in a plot. If the ordering is not explicitly provided, the old
ordering is used. | 
Details
Labels convert the internal naming for risk groups to the requested
label at export or when plotting. This enables customisation of risk group
names. Currently assigned labels can be found using the
get_risk_group_names method.
Value
A familiarCollection object with updated labels.
Model summaries
Description
summary produces model summaries.
Usage
summary(object, ...)
## S4 method for signature 'familiarModel'
summary(object, ...)
Arguments
| object | A familiarModel object. | 
| ... | Additional arguments passed to summary methods for the underlying
model, when available. | 
Details
This method extends the summary S3 method. For some models
summary requires information that is trimmed from the model. In this case
a copy of summary data is stored with the model, and returned.
Value
Depends on underlying model. See the documentation for the particular
models.
Perform end-to-end machine learning and data analysis
Description
Perform end-to-end machine learning and data analysis
Usage
summon_familiar(
  formula = NULL,
  data = NULL,
  experiment_data = NULL,
  cl = NULL,
  config = NULL,
  config_id = 1L,
  verbose = TRUE,
  .stop_after = "evaluation",
  ...
)
Arguments
| formula | An R formula. The formula can only contain feature names and
dot (.). The * and +1 operators are not supported as these refer to
columns that are not present in the data set. Use of the formula interface is optional. | 
| data | A data.table object, a data.frame object, a list containing
multiple data.table or data.frame objects, or paths to data files. data
should be provided if no file paths are provided to the data_file
argument. If both are provided, only data will be used.
 All data is expected to be in wide format, and ideally has a sample
identifier (see sample_id_column), batch identifier (see batch_id_column)
and outcome columns (see outcome_column). In case paths are provided, the
data should be stored as csv, rds or RData files. See documentation for
the data_file argument for more information. | 
| experiment_data | Experimental data may be provided in the form of | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. When a cluster is not
provided, parallelisation is performed by setting up a cluster on the local
machine. This parameter has no effect if the parallel argument is set to
FALSE. | 
| config | List containing configuration parameters, or path to an xml
file containing these parameters. An empty configuration file can be obtained
using the get_xml_config function. All parameters can also be set
programmatically. These supersede any arguments derived from the
configuration list. | 
| config_id | Identifier for the configuration in case the list or xml
table indicated by config contains more than one set of configurations. | 
| verbose | Indicates verbosity of the results. Default is TRUE, and all
messages and warnings are returned. | 
| .stop_after | Variable for internal use. | 
| ... | Arguments passed on to .parse_file_paths, .parse_experiment_settings, .parse_setup_settings, .parse_preprocessing_settings, .parse_feature_selection_settings, .parse_model_development_settings, .parse_hyperparameter_optimisation_settings, .parse_evaluation_settings 
project_dir (optional) Path to the project directory. familiar checks if
the directory indicated by experiment_dir and data files in data_file are
relative to the project_dir.
experiment_dir (recommended) Path to the directory where all
intermediate and final results produced by familiar are written to. The
experiment_dir can be a path relative to project_dir or an absolute
path. In case no project directory is provided and the experiment directory is
not on an absolute path, a directory will be created in the temporary R
directory indicated by tempdir(). This directory is deleted after closing
the R session or once data analysis has finished. All information will be
lost afterwards. Hence, it is recommended to provide either experiment_dir
as an absolute path, or provide both project_dir and experiment_dir.
data_file (optional) Path to files containing data that should be
analysed. The paths can be relative to project_dir or absolute paths. An
error will be raised if the file cannot be found. The following types of data are supported.
 
 csv files containing column headers on the first row, and samples per
row. csv files are read using data.table::fread.
 rds files that contain a data.table or data.frame object. rds files are
imported using base::readRDS.
 RData files that contain a single data.table or data.frame object. RData
files are imported using base::load.
 All data are expected in wide format, with sample information organised
row-wise.
 More than one data file can be provided. familiar will try to combine
data files based on column names and identifier columns. Alternatively, data
can be provided using the data argument. These data
are expected to be data.frame or data.table objects or paths to data
files. The latter are handled in the same way as file paths provided to
data_file.
batch_id_column (recommended) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.
 In familiar any row of data is organised by four identifiers:
 
 The batch identifier batch_id_column: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
 The sample identifier sample_id_column: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
 The series identifier series_id_column: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
 The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
sample_id_column (recommended) Name of the column containing
sample or subject identifiers. See batch_id_column above for more
details. If unset, every row will be identified as a single sample.
series_id_column (optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See batch_id_column above for more details. If
unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers.
development_batch_id (optional) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all, or
all minus the identifiers in validation_batch_id for external validation.
Required if external validation is performed and validation_batch_id is
not provided.
validation_batch_id (optional) One or more batch or cohort
identifiers to constitute data sets for external validation. Defaults to
all data sets except those in development_batch_id for external
validation, or none if not. Required if development_batch_id is not
provided.
outcome_name (optional) Name of the modelled outcome. This name will
be used in figures created by familiar. If not set, the column name in
outcome_column will be used for binomial, multinomial, count and
continuous outcomes. For other outcomes (survival and competing_risk) no
default is used.
outcome_column (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that survival
and competing_risk outcomes require two columns that
indicate the time-to-event or the time of last follow-up and the event
status.
outcome_type (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from
contents of the outcome column. This may lead to unexpected results, and we
therefore advise to provide this information manually.
 Note that competing_risk survival analysis is not fully supported, and
is currently not a valid choice for outcome_type.
class_levels (optional) Class levels for binomial or multinomial
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column.
event_indicator (recommended) Indicator for events in survival and
competing_risk analyses. familiar will automatically recognise 1, true,
t, y and yes as event indicators, including different
capitalisations. If this parameter is set, it replaces the default values.
censoring_indicator (recommended) Indicator for right-censoring in
survival and competing_risk analyses. familiar will automatically
recognise 0, false, f, n and no as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values.
competing_risk_indicator (recommended) Indicator for competing
risks in competing_risk analyses. There are no default values, and if
unset, all values other than those specified by the event_indicator and
censoring_indicator parameters are considered to indicate competing
risks.
signature (optional) One or more names of feature columns that are
considered part of a specific signature. Features specified here will
always be used for modelling. Ranking from feature selection has no effect
for these features.
novelty_features (optional) One or more names of feature columns
that should be included for the purpose of novelty detection.
exclude_features (optional) Feature columns that will be removed
from the data set. Cannot overlap with features in signature,
novelty_features or include_features.
include_features (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap signature. Features in
signature and novelty_features are always included. If both
exclude_features and include_features are provided, include_features
takes precedence, provided that there is no overlap between the two.
reference_method (optional) Method used to set reference levels for
categorical features. There are several options:
 
 auto (default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
 always: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
 never: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
experimental_design (required) Defines what the experiment looks
like, e.g. cv(bt(fs,20)+mb,3,2)+ev for 2 times repeated 3-fold
cross-validation with nested feature selection on 20 bootstraps and
model-building, and external validation. The basic workflow components are: 
 fs: (required) feature selection step.
 mb: (required) model building step.
 ev: (optional) external validation. Note that internal validation due
to subsampling will always be conducted if the subsampling methods create
any validation data sets.
 The different components are linked using +.
 Different subsampling methods can be used in conjunction with the basic
workflow components:
 
 bs(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. In contrast to bt, feature pre-processing parameters are
determined and hyperparameter optimisation is conducted on individual
bootstraps.
 bt(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. Unlike bs and other subsampling methods, no separate
pre-processing parameters or optimised hyperparameters will be determined
for each bootstrap.
 cv(x,n,p): (stratified) n-fold cross-validation, repeated p times.
Pre-processing parameters are determined for each iteration.
 lv(x): leave-one-out cross-validation. Pre-processing parameters are
determined for each iteration.
 ip(x): imbalance partitioning for addressing class imbalances on the
data set. Pre-processing parameters are determined for each partition. The
number of partitions generated depends on the imbalance correction method
(see the imbalance_correction_method parameter). Imbalance partitioning
does not generate validation sets.
 As shown in the example above, sampling algorithms can be nested.
 The simplest valid experimental design is fs+mb, which corresponds to a
TRIPOD type 1a analysis. Type 1b analyses are only possible using
bootstraps, e.g. bt(fs+mb,100). Type 2a analyses can be conducted using
cross-validation, e.g. cv(bt(fs,100)+mb,10,1). Depending on the origin of
the external validation data, designs such as fs+mb+ev or
cv(bt(fs,100)+mb,10,1)+ev constitute type 2b or type 3 analyses. Type 4
analyses can be done by obtaining one or more familiarModel objects from
others and applying them to your own data set.
 Alternatively, the experimental_design parameter may be used to provide a
path to a file containing iterations, which is named ####_iterations.RDS
by convention. This path can be relative to the directory of the current
experiment (experiment_dir), or an absolute path. The absolute path may
thus also point to a file from a different experiment.
imbalance_correction_method (optional) Type of method used to
address class imbalances. Available options are:
 
 full_undersampling(default): All data will be used in an ensemble
fashion. The full minority class will appear in each partition, but
majority classes are undersampled until all data have been used.
 random_undersampling: Randomly undersamples majority classes. This is
useful in cases where full undersampling would lead to the formation of
many models due major overrepresentation of the largest class.
 This parameter is only used in combination with imbalance partitioning in
the experimental design, and ip should therefore appear in the string
that defines the design.
imbalance_n_partitions (optional) Number of times random
undersampling should be repeated. 10 undersampled subsets with balanced
classes are formed by default.
parallel (optional) Enable parallel processing. Defaults to TRUE.
When set to FALSE, this disables all parallel processing, regardless of
specific parameters such as parallel_preprocessing. However, when
parallel is TRUE, parallel processing of different parts of the
workflow can be disabled by setting respective flags to FALSE.
parallel_nr_cores (optional) Number of cores available for
parallelisation. Defaults to 2. This setting does nothing if
parallelisation is disabled.
restart_cluster (optional) Restart nodes used for parallel computing
to free up memory prior to starting a parallel process. Note that it does
take time to set up the clusters. Therefore setting this argument to TRUE
may impact processing speed. This argument is ignored if parallel is
FALSE or the cluster was initialised outside of familiar. Default is
FALSE, which causes the clusters to be initialised only once.
cluster_type (optional) Selection of the cluster type for parallel
processing. Available types are the ones supported by the parallel package
that is part of the base R distribution: psock (default), fork, mpi,
nws, sock. In addition, none is available, which also disables
parallel processing.
backend_type (optional) Selection of the backend for distributing
copies of the data. This backend ensures that only a single master copy is
kept in memory. This limits memory usage during parallel processing.
 Several backend options are available, notably socket_server, and none
(default). socket_server is based on the callr package and R sockets,
comes with familiar and is available for any OS. none uses the package
environment of familiar to store data, and is available for any OS.
However, none requires copying of data to any parallel process, and has a
larger memory footprint.
server_port (optional) Integer indicating the port on which the
socket server or RServe process should communicate. Defaults to port 6311.
Note that ports 0 to 1024 and 49152 to 65535 cannot be used.
feature_max_fraction_missing (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a feature to be included in the data set. All features with a
missing value fraction over this threshold are not processed further. The
default value is 0.30.
sample_max_fraction_missing (optional) Numeric value between 0.0 and
0.95 that determines the maximum fraction of missing values that
still allows a sample to be included in the data set. All samples with a
missing value fraction over this threshold are excluded and not processed
further. The default value is 0.30.
filter_method (optional) One or more methods used to reduce
dimensionality of the data set by removing irrelevant or poorly
reproducible features.
 Several methods are available:
 
 none (default): None of the features will be filtered.
 low_variance: Features with a variance below the
low_var_minimum_variance_threshold are filtered. This can be useful to
filter, for example, genes that are not differentially expressed.
 univariate_test: Features undergo a univariate regression using an
outcome-appropriate regression model. The p-value of the model coefficient
is collected. Features with coefficient p or q-value above the
univariate_test_threshold are subsequently filtered.
 robustness: Features that are not sufficiently robust according to the
intraclass correlation coefficient are filtered. Use of this method
requires that repeated measurements are present in the data set, i.e. there
should be entries for which the sample and cohort identifiers are the same.
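 As an illustration of the low_variance filter, the base R sketch below keeps features whose variance is not below a threshold and optionally caps the result at the most strongly varying features. filter_low_variance is a hypothetical helper, and its arguments only mirror the roles of low_var_minimum_variance_threshold and low_var_max_feature_set_size; it is not familiar's internal code.

```r
# Hypothetical helper: keep features whose variance is at least `threshold`,
# optionally capped at the `max_size` most variable features.
filter_low_variance <- function(data, threshold, max_size = NULL) {
  feature_variance <- vapply(data, stats::var, numeric(1), na.rm = TRUE)
  kept <- feature_variance[feature_variance >= threshold]
  if (!is.null(max_size) && length(kept) > max_size) {
    kept <- sort(kept, decreasing = TRUE)[seq_len(max_size)]
  }
  names(kept)
}

data <- data.frame(
  flat = c(1.00, 1.01, 0.99, 1.00),
  moderate = c(1, 2, 3, 4),
  wide = c(-10, 0, 10, 20)
)
filter_low_variance(data, threshold = 0.1)
filter_low_variance(data, threshold = 0.1, max_size = 1)
```

 Note that a sensible variance threshold depends on the scale of the features, which is presumably why the threshold has no default value.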
 More than one method can be used simultaneously. Features with singular
values are always filtered, as these do not contain information.
univariate_test_threshold (optional) Numeric value between 0.0 and
1.0 that determines which features are irrelevant and will be filtered by
the univariate_test. The p or q-values are compared to this threshold.
All features with values above the threshold are filtered. The default
value is 0.20.
univariate_test_threshold_metric (optional) Metric that the
univariate_test_threshold is compared against. The following metrics can
be chosen: 
 p_value (default): The unadjusted p-value of each feature is used
to filter features.
 q_value: The q-value (Storey, 2002) is used to filter features. Some
data sets may have insufficient samples to compute the q-value. The
qvalue package must be installed from Bioconductor to use this method.
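 The univariate test can be sketched as follows for a continuous outcome. This is a simplified illustration using stats::lm, not familiar's implementation, which selects an outcome-appropriate model; filter_univariate is a hypothetical helper. The final line uses the Benjamini-Hochberg adjustment from base R as a stand-in for multiple-testing correction; it is not the Storey q-value computed by the qvalue package.

```r
# Hypothetical helper: univariate linear regression of a continuous outcome
# on each feature, keeping features whose coefficient p-value does not
# exceed `threshold`.
filter_univariate <- function(data, outcome, threshold = 0.20) {
  p_values <- vapply(
    data,
    function(feature) {
      fit <- stats::lm(outcome ~ feature)
      summary(fit)$coefficients["feature", "Pr(>|t|)"]
    },
    numeric(1)
  )
  list(p_value = p_values, kept = names(p_values)[p_values <= threshold])
}

set.seed(42)
n <- 100
data <- data.frame(signal = rnorm(n), noise = rnorm(n))
outcome <- 2 * data$signal + rnorm(n)
result <- filter_univariate(data, outcome)
result$kept

# Stand-in for a multiple-testing adjustment (Benjamini-Hochberg, not the
# Storey q-value).
stats::p.adjust(result$p_value, method = "BH")
```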
univariate_test_max_feature_set_size (optional) Maximum size of the
feature set after the univariate test. P or q values of features are
compared against the threshold, but if the resulting data set would be
larger than this setting, only the most relevant features up to the desired
feature set size are selected.
 The default value is NULL, which causes features to be filtered based on
their relevance only.
low_var_minimum_variance_threshold (required, if used) Numeric value
that determines which features will be filtered by the low_variance
method. The variance of each feature is computed and compared to the
threshold. If it is below the threshold, the feature is removed.
 This parameter has no default value and should be set if low_variance is
used.
low_var_max_feature_set_size (optional) Maximum size of the feature
set after filtering features with a low variance. All features are first
compared against low_var_minimum_variance_threshold. If the resulting
feature set would be larger than specified, only the most strongly varying
features will be selected, up to the desired size of the feature set.
 The default value is NULL, which causes features to be filtered based on
their variance only.
robustness_icc_type (optional) String indicating the type of
intraclass correlation coefficient (1, 2 or 3) that should be used to
compute robustness for features in repeated measurements. These types
correspond to the types in Shrout and Fleiss (1979). The default value is
1.
robustness_threshold_metric (optional) String indicating which
specific intraclass correlation coefficient (ICC) metric should be used to
filter features. This should be one of:
 
 icc: The estimated ICC value itself.
 icc_low (default): The estimated lower limit of the 95% confidence
interval of the ICC, as suggested by Koo and Li (2016).
 icc_panel: The estimated ICC value over the panel average, i.e. the ICC
that would be obtained if all repeated measurements were averaged.
 icc_panel_low: The estimated lower limit of the 95% confidence interval
of the panel ICC.
robustness_threshold_value (optional) The intraclass correlation
coefficient value that is used as threshold. The default value is 0.70.
transformation_method (optional) The transformation method used to
change the distribution of the data to be more normal-like. The following
methods are available:
 
 none: This disables transformation of features.
 yeo_johnson: Transformation using the location and scale invariant
version of the Yeo-Johnson transformation (Yeo and Johnson, 2000;
Zwanenburg and Löck, 2023).
 yeo_johnson_robust (default): A robust version of yeo_johnson.
This method is less sensitive to outliers.
 yeo_johnson_conventional: As yeo_johnson, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Yeo and Johnson (2000).
 box_cox: Transformation using the location and scale invariant version
of the Box-Cox transformation (Box and Cox, 1964; Zwanenburg and Löck,
2023).
 box_cox_robust: A robust version of box_cox. This method is less
sensitive to outliers.
 box_cox_conventional: As box_cox, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Box and Cox (1964). This method requires
strictly positive feature values.
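 For reference, the conventional Yeo-Johnson transformation for a fixed transformation parameter can be written in a few lines of base R. This sketch does not estimate lambda and does not include the location and scale optimisation of the invariant versions; yeo_johnson is an illustrative function name, not a function exported by familiar or power.transform.

```r
# Conventional Yeo-Johnson transformation for a fixed lambda (no location or
# scale optimisation). Illustrative only.
yeo_johnson <- function(x, lambda, tol = 1e-8) {
  y <- numeric(length(x))
  pos <- x >= 0

  # Non-negative values.
  if (abs(lambda) > tol) {
    y[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    y[pos] <- log1p(x[pos])
  }

  # Negative values.
  if (abs(lambda - 2) > tol) {
    y[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    y[!pos] <- -log1p(-x[!pos])
  }
  y
}

x <- c(-2, -0.5, 0, 1, 10)
yeo_johnson(x, lambda = 1)  # lambda = 1 leaves the values unchanged
yeo_johnson(x, lambda = 0)
```

 Unlike the Box-Cox transformation, this transformation is defined for zero and negative values, which is why only box_cox_conventional requires strictly positive feature values.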
 Transformation requires the power.transform package. Only features that
contain numerical data are transformed. Transformation parameters obtained
in development data are stored within featureInfo objects for later use
with validation data sets.
transformation_optimisation_criterion (optional) Transformation
parameters are optimised using a criterion, conventionally
maximum-likelihood-estimation. power.transform implements multiple
optimisation criteria, of which the following are available: 
 mle (default): Optimisation using maximum likelihood estimation.
 cramer_von_mises: Optimisation using the Cramér-von Mises
criterion. Zwanenburg and Löck (2023) found that this criterion was
relatively robust against outliers.
transformation_gof_test_p_value (optional) Not all transformations
will lead to features that are roughly normally distributed. Zwanenburg and
Löck (2023) established an empirical goodness-of-fit test for central
normality. This parameter sets the significance level for rejecting the
null-hypothesis that a feature distribution is centrally normal. When the
null-hypothesis is rejected, no transformation is performed. The default
value is NULL, which disables the test.
normalisation_method (optional) The normalisation method used to
improve the comparability between numerical features that may have very
different scales. The following normalisation methods can be chosen:
 
 none: This disables feature normalisation.
 standardisation: Features are normalised by subtraction of their mean
values and division by their standard deviations. This causes every feature
to have a center value of 0.0 and standard deviation of 1.0.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust (default): A robust version of standardisation
that relies on computing Huber's M-estimators for location and scale.
 normalisation: Features are normalised by subtraction of their minimum
values and division by their ranges. This maps all feature values to a
[0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features are normalised by subtraction of their median values
and division by their interquartile range.
 mean_centering: Features are centered by subtracting the mean, but do
not undergo rescaling.
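 Several of the methods above can be sketched in base R. The helpers below are illustrative and not familiar's internal code; in particular, determining the 5% limits with stats::quantile and its default interpolation is an assumption. The Huber M-estimators of standardisation_robust are left out.

```r
# Illustrative helpers, not familiar's internal code. Trimming and
# winsorising use stats::quantile() with its default interpolation, which is
# an assumption.
standardise <- function(x, reference = x) {
  (x - mean(reference)) / stats::sd(reference)
}

trim_values <- function(x, fraction = 0.05) {
  limits <- stats::quantile(x, c(fraction, 1 - fraction), names = FALSE)
  x[x >= limits[1] & x <= limits[2]]
}

winsorise_values <- function(x, fraction = 0.05) {
  limits <- stats::quantile(x, c(fraction, 1 - fraction), names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

normalise_range <- function(x) (x - min(x)) / (max(x) - min(x))
normalise_quantile <- function(x) (x - stats::median(x)) / stats::IQR(x)

set.seed(7)
x <- c(rnorm(98), 25, -30)  # two gross outliers

z_plain <- standardise(x)                                    # standardisation
z_trim <- standardise(x, reference = trim_values(x))         # standardisation_trim
z_winsor <- standardise(x, reference = winsorise_values(x))  # standardisation_winsor
x_range <- normalise_range(x)                                # normalisation
x_quantile <- normalise_quantile(x)                          # quantile
x_centred <- x - mean(x)                                     # mean_centering

# Spread used for scaling, with and without outlier handling.
c(plain = sd(x), trim = sd(trim_values(x)), winsor = sd(winsorise_values(x)))
```

 The trimmed and winsorised variants estimate location and scale on the cleaned values but still transform every value, so outliers remain in the data while having less influence on the normalisation parameters.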
 Only features that contain numerical data are normalised. Normalisation
parameters obtained in development data are stored within featureInfo
objects for later use with validation data sets.
batch_normalisation_method (optional) The method used for batch
normalisation. Available methods are:
 
 none (default): This disables batch normalisation of features.
 standardisation: Features within each batch are normalised by
subtraction of the mean value and division by the standard deviation in
each batch.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust: A robust version of standardisation that
relies on computing Huber's M-estimators for location and scale within each
batch.
 normalisation: Features within each batch are normalised by subtraction
of their minimum values and division by their range in each batch. This
maps all feature values in each batch to a [0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features in each batch are normalised by subtraction of the
median value and division by the interquartile range of each batch.
 mean_centering: Features in each batch are centered on 0.0 by
subtracting the mean value in each batch, but are not rescaled.
 combat_parametric: Batch adjustments using parametric empirical Bayes
(Johnson et al., 2007). combat_p leads to the same method.
 combat_non_parametric: Batch adjustments using non-parametric empirical
Bayes (Johnson et al., 2007). combat_np and combat lead to the same
method. Note that we reduced complexity from O(n^2) to O(n) by
only computing batch adjustment parameters for each feature on a subset of
50 randomly selected features, instead of all features.
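 Batch-wise standardisation, the simplest of the methods above, can be illustrated with stats::ave. This is a minimal sketch for a single numeric feature; batch_standardise is a hypothetical helper, not familiar's internal code, and the ComBat methods are considerably more involved.

```r
# Per-batch standardisation with stats::ave(); illustrative only.
batch_standardise <- function(x, batch) {
  centre <- stats::ave(x, batch, FUN = mean)
  scale <- stats::ave(x, batch, FUN = stats::sd)
  (x - centre) / scale
}

set.seed(3)
batch <- rep(c("site_a", "site_b"), each = 50)
x <- c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 5, sd = 3))
z <- batch_standardise(x, batch)

# After normalisation each batch has mean 0 and standard deviation 1.
tapply(z, batch, mean)
tapply(z, batch, sd)
```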
 Only features that contain numerical data are normalised using batch
normalisation. Batch normalisation parameters obtained in development data
are stored within featureInfo objects for later use with validation data
sets, in case the validation data is from the same batch.
 If validation data contains data from unknown batches, normalisation
parameters are separately determined for these batches.
 Note that for both empirical Bayes methods, the batch effect is assumed to
act similarly across the features. This is often true for things such as
gene expressions, but the assumption may not hold generally.
 When performing batch normalisation, it is moreover important to check that
differences between batches or cohorts are not related to the studied
endpoint.
imputation_method (optional) Method used for imputing missing
feature values. Two methods are implemented:
 
 simple: Simple replacement of a missing value by the median value (for
numeric features) or the modal value (for categorical features).
 lasso: Imputation of missing values by lasso regression (using glmnet)
based on information contained in other features.
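 The simple method can be illustrated in base R as follows. impute_simple is a hypothetical helper, not familiar's internal code, and the example data are made up.

```r
# Hypothetical helper: median for numeric features, modal value for
# categorical features.
impute_simple <- function(x) {
  if (!anyNA(x)) return(x)
  if (is.numeric(x)) {
    x[is.na(x)] <- stats::median(x, na.rm = TRUE)
  } else {
    counts <- table(x)  # table() ignores missing values by default
    x[is.na(x)] <- names(counts)[which.max(counts)]
  }
  x
}

data <- data.frame(
  age = c(61, NA, 55, 70, 58),
  stage = factor(c("II", "III", NA, "III", "I"))
)
imputed <- as.data.frame(lapply(data, impute_simple))
imputed
```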
 simple imputation precedes lasso imputation to ensure that any missing
values in predictors required for lasso regression are resolved. The
lasso estimate is then used to replace the missing value.
 The default value depends on the number of features in the dataset. If the
number is lower than 100, lasso is used by default, and simple otherwise.
 Only single imputation is performed. Imputation models and parameters are
stored within featureInfo objects for later use with validation data
sets.
cluster_method (optional) Clustering is performed to identify and
replace redundant features, for example those that are highly correlated.
Such features do not carry much additional information and may be removed
or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011).
 The cluster method determines the algorithm used to form the clusters. The
following cluster methods are implemented:
 
 none: No clustering is performed.
 hclust (default): Hierarchical agglomerative clustering. If the
fastcluster package is installed, fastcluster::hclust is used (Muellner
2013), otherwise stats::hclust is used.
 agnes: Hierarchical clustering using agglomerative nesting (Kaufman and
Rousseeuw, 1990). This algorithm is similar to hclust, but uses the
cluster::agnes implementation.
 diana: Divisive analysis hierarchical clustering. This method uses
divisive instead of agglomerative clustering (Kaufman and Rousseeuw, 1990).
cluster::diana is used.
 pam: Partitioning around medoids. This partitions the data into k
clusters around medoids (Kaufman and Rousseeuw, 1990). k is selected
using the silhouette metric. pam is implemented using the cluster::pam
function.
 Clusters and cluster information are stored within featureInfo objects for
later use with validation data sets. This enables reproduction of the same
clusters as formed in the development data set.
cluster_linkage_method (optional) Linkage method used for
agglomerative clustering in hclust and agnes. The following linkage
methods can be used: 
 average (default): Average linkage.
 single: Single linkage.
 complete: Complete linkage.
 weighted: Weighted linkage, also known as McQuitty linkage.
 ward: Linkage using Ward's minimum variance method.
 diana and pam do not require a linkage method.
cluster_cut_method (optional) The method used to define the actual
clusters. The following methods can be used:
 
 silhouette: Clusters are formed based on the silhouette score
(Rousseeuw, 1987). The average silhouette score is computed from 2 to
n clusters, with n the number of features. Clusters are only
formed if the average silhouette exceeds 0.50, which indicates reasonable
evidence for structure. This procedure may be slow if the number of
features is large (>100s).
 fixed_cut: Clusters are formed by cutting the hierarchical tree at the
point indicated by the cluster_similarity_threshold, e.g. where features
in a cluster have an average Spearman correlation of 0.90. fixed_cut is
only available for agnes, diana and hclust.
 dynamic_cut: Dynamic cluster formation using the cutting algorithm in
the dynamicTreeCut package. This package should be installed to select
this option. dynamic_cut can only be used with agnes and hclust.
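 The combination of hclust, average linkage and fixed_cut can be sketched with the stats package. This is an illustration, not familiar's internal code: using the absolute Spearman correlation as similarity is an assumption, and the data are simulated. A similarity threshold of 0.90 corresponds to cutting the tree at a distance of 0.10.

```r
# Illustrative sketch with stats::hclust, not familiar's internal code.
# Using the absolute Spearman correlation as similarity is an assumption.
set.seed(11)
n <- 200
base_feature <- rnorm(n)
data <- data.frame(
  feature_a = base_feature,
  feature_b = base_feature + rnorm(n, sd = 0.01),  # near-duplicate of feature_a
  feature_c = rnorm(n)                             # unrelated feature
)

# Pairwise similarity, converted to a distance as 1 - similarity.
similarity <- abs(stats::cor(data, method = "spearman"))
distance <- stats::as.dist(1 - similarity)

tree <- stats::hclust(distance, method = "average")

# Cut the tree at the height that corresponds to the similarity threshold.
similarity_threshold <- 0.90
clusters <- stats::cutree(tree, h = 1 - similarity_threshold)
clusters
```

 Features that share a cluster label are candidates for replacement by a single representative feature.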
 The default options are silhouette for partitioning around medoids (pam)
and fixed_cut otherwise.
cluster_similarity_metric (optional) Clusters are formed based on
feature similarity. All features are compared in a pair-wise fashion to
compute similarity, for example correlation. The resulting similarity grid
is converted into a distance matrix that is subsequently used for
clustering. The following metrics are supported to compute pairwise
similarities:
 
 mutual_information (default): normalised mutual information.
 mcfadden_r2: McFadden's pseudo R-squared (McFadden, 1974).
 cox_snell_r2: Cox and Snell's pseudo R-squared (Cox and Snell, 1989).
 nagelkerke_r2: Nagelkerke's pseudo R-squared (Nagelkerke, 1991).
 spearman: Spearman's rank order correlation.
 kendall: Kendall rank correlation.
 pearson: Pearson product-moment correlation.
 The pseudo R-squared metrics can be used to assess similarity between mixed
pairs of numeric and categorical features, as these are based on the
log-likelihood of regression models. In familiar, the more informative
feature is used as the predictor and the other feature as the response
variable. In numeric-categorical pairs, the numeric feature is considered
to be more informative and is thus used as the predictor. In
categorical-categorical pairs, the feature with most levels is used as the
predictor. In case any of the classical correlation coefficients (pearson,
spearman and kendall) are used with (mixed) categorical features, the
categorical features are one-hot encoded and the mean correlation over all
resulting pairs is used as similarity.
cluster_similarity_threshold (optional) The threshold level for
pair-wise similarity that is required to form clusters using fixed_cut.
This should be a numerical value between 0.0 and 1.0. Note however, that a
reasonable threshold value depends strongly on the similarity metric. The
following are the default values used: 
 mcfadden_r2 and mutual_information: 0.30
 cox_snell_r2 and nagelkerke_r2: 0.75
 spearman, kendall and pearson: 0.90
 Alternatively, if the fixed_cut method is not used, this value determines
whether any clustering should be performed, because the data may not
contain highly similar features. The default values in this situation are: 
 mcfadden_r2 and mutual_information: 0.25
 cox_snell_r2 and nagelkerke_r2: 0.40
 spearman, kendall and pearson: 0.70
 The threshold value is converted to a distance (1-similarity) prior to
cutting hierarchical trees.
cluster_representation_method (optional) Method used to determine
how the information of co-clustered features is summarised and used to
represent the cluster. The following methods can be selected:
 
 best_predictor (default): The feature with the highest importance
according to univariate regression with the outcome is used to represent
the cluster.
 medioid: The feature closest to the cluster center, i.e. the feature
that is most similar to the remaining features in the cluster, is used to
represent the cluster.
 mean: A meta-feature is generated by averaging the feature values for
all features in a cluster. This method aligns all features so that all
features will be positively correlated prior to averaging. Should a cluster
contain one or more categorical features, the medioid method will be used
instead, as averaging is not possible. Note that if this method is chosen,
the normalisation_method parameter should be one of standardisation,
standardisation_trim, standardisation_winsor or quantile.
 If the pam cluster method is selected, only the medioid method can be
used. In that case 1 medoid is used by default.
parallel_preprocessing (optional) Enable parallel processing for the
preprocessing workflow. Defaults to TRUE. When set to FALSE, this will
disable the use of parallel processing while preprocessing, regardless of
the settings of the parallel parameter. parallel_preprocessing is
ignored if parallel=FALSE.
fs_method (required) Feature selection method to be used for
determining variable importance. familiar implements various feature
selection methods. Please refer to the vignette on feature selection
methods for more details.
 More than one feature selection method can be chosen. The experiment will
then be repeated for each feature selection method.
 Feature selection methods determine the ranking of features. Actual
selection of features is done by optimising the signature size model
hyperparameter during the hyperparameter optimisation step.
fs_method_parameter (optional) List of lists containing parameters
for feature selection methods. Each sublist should have the name of the
feature selection method it corresponds to.
 Most feature selection methods do not have parameters that can be set.
Please refer to the vignette on feature selection methods for more details.
Note that if the feature selection method is based on a learner (e.g. lasso
regression), hyperparameter optimisation may be performed prior to
assessing variable importance.
vimp_aggregation_method (optional) The method used to aggregate
variable importances over different data subsets, e.g. bootstraps. The
following methods can be selected:
 
 none: Don't aggregate ranks, but rather aggregate the variable
importance scores themselves.
 mean: Use the mean rank of a feature over the subsets to
determine the aggregated feature rank.
 median: Use the median rank of a feature over the subsets to determine
the aggregated feature rank.
 best: Use the best rank the feature obtained in any subset to determine
the aggregated feature rank.
 worst: Use the worst rank the feature obtained in any subset to
determine the aggregated feature rank.
 stability: Use the frequency of the feature being in the subset of
highly ranked features as measure for the aggregated feature rank
(Meinshausen and Buehlmann, 2010).
 exponential: Use a rank-weighted frequency of occurrence in the subset
of highly ranked features as measure for the aggregated feature rank (Haury
et al., 2011).
 borda (default): Use the borda count as measure for the aggregated
feature rank (Wald et al., 2012).
 enhanced_borda: Use an occurrence frequency-weighted borda count as
measure for the aggregated feature rank (Wald et al., 2012).
 truncated_borda: Use borda count computed only on features within the
subset of highly ranked features.
 enhanced_truncated_borda: Apply both the enhanced borda method and the
truncated borda method and use the resulting borda count as the aggregated
feature rank.
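 The simpler aggregation methods can be illustrated with a small matrix of per-subset ranks. The sketch below is not familiar's implementation: aggregate_ranks is a hypothetical helper, the rank values are made up, and the Borda score shown is one common plain formulation that may differ in detail from the variant familiar uses.

```r
# Rows are features, columns are data subsets (e.g. bootstraps); rank 1 is
# the most important feature. Values are made up for illustration.
ranks <- rbind(
  feature_a = c(1, 2, 1, 1),
  feature_b = c(2, 1, 3, 2),
  feature_c = c(3, 3, 2, 4),
  feature_d = c(4, 4, 4, 3)
)

aggregate_ranks <- function(ranks,
                            method = c("mean", "median", "best", "worst", "borda")) {
  method <- match.arg(method)
  score <- switch(
    method,
    mean = apply(ranks, 1, mean),
    median = apply(ranks, 1, stats::median),
    best = apply(ranks, 1, min),
    worst = apply(ranks, 1, max),
    # Plain Borda count: a feature ranked r among n features scores
    # (n - r + 1) / n points in each subset. Higher totals are better, so
    # the sign is flipped before ranking.
    borda = -rowSums((nrow(ranks) - ranks + 1) / nrow(ranks))
  )
  # Lower scores receive better (smaller) aggregated ranks.
  rank(score, ties.method = "min")
}

aggregate_ranks(ranks, "mean")
aggregate_ranks(ranks, "best")
aggregate_ranks(ranks, "borda")
```

 With complete rankings such as these, the plain Borda count orders features in the same way as the mean rank; the methods diverge once only a subset of highly ranked features contributes, as in the truncated and enhanced variants.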
 The feature selection methods vignette provides additional information.
vimp_aggregation_rank_threshold (optional) The threshold used to
define the subset of highly important features. If not set, this threshold
is determined by maximising the variance in the occurrence value over all
features over the subset size.
 This parameter is only relevant for stability, exponential,
enhanced_borda, truncated_borda and enhanced_truncated_borda methods.
parallel_feature_selection (optional) Enable parallel processing for
the feature selection workflow. Defaults to TRUE. When set to FALSE,
this will disable the use of parallel processing while performing feature
selection, regardless of the settings of the parallel parameter.
parallel_feature_selection is ignored if parallel=FALSE.
learner (required) One or more algorithms used for model
development. A sizeable number of learners is supported in familiar. Please
see the vignette on learners for more information concerning the available
learners.
hyperparameter (optional) List of lists containing hyperparameters
for learners. Each sublist should have the name of the learner method it
corresponds to, with list elements being named after the intended
hyperparameter, e.g. "glm_logistic"=list("sign_size"=3)
 All learners have hyperparameters. Please refer to the vignette on learners
for more details. If no parameters are provided, sequential model-based
optimisation is used to determine optimal hyperparameters.
 Hyperparameters provided by the user are never optimised. However, if more
than one value is provided for a single hyperparameter, optimisation will
be conducted using these values.
novelty_detector (optional) Specify the algorithm used for training
a novelty detector. This detector can be used to identify
out-of-distribution data prospectively.
detector_parameters (optional) List of lists containing hyperparameters
for novelty detectors. Currently not used.
parallel_model_development (optional) Enable parallel processing for
the model development workflow. Defaults to TRUE. When set to FALSE,
this will disable the use of parallel processing while developing models,
regardless of the settings of the parallel parameter.
parallel_model_development is ignored if parallel=FALSE.
optimisation_bootstraps (optional) Number of bootstraps that should
be generated from the development data set. During the optimisation
procedure one or more of these bootstraps (indicated by
smbo_step_bootstraps) are used for model development using different
combinations of hyperparameters. The effect of the hyperparameters is then
assessed by comparing in-bag and out-of-bag model performance.
 The default number of bootstraps is 50. Hyperparameter optimisation may
finish before exhausting the set of bootstraps.
optimisation_determine_vimp (optional) Logical value that indicates
whether variable importance is determined separately for each of the
bootstraps created during the optimisation process (TRUE) or the
applicable results from the feature selection step are used (FALSE).
 Determining variable importance increases the initial computational
overhead. However, it prevents positive biases for the out-of-bag data due
to overlap of these data with the development data set used for the feature
selection step. In this case, any hyperparameters of the variable
importance method are not determined separately for each bootstrap, but
those obtained during the feature selection step are used instead. In case
multiple of such hyperparameter sets could be applicable, the set that will
be used is randomly selected for each bootstrap.
 This parameter only affects hyperparameter optimisation of learners. The
default is TRUE.
smbo_random_initialisation (optional) String indicating the
initialisation method for the hyperparameter space. Can be one of
fixed_subsample (default), fixed, or random. fixed and
fixed_subsample first create hyperparameter sets from a range of default
values set by familiar. fixed_subsample then randomly draws up to
smbo_n_random_sets from the grid. random does not rely upon a fixed
grid, and randomly draws up to smbo_n_random_sets hyperparameter sets
from the hyperparameter space.
smbo_n_random_sets (optional) Number of random or subsampled
hyperparameters drawn during the initialisation process. Default: 100.
Cannot be smaller than 10. The parameter is not used when
smbo_random_initialisation is fixed, as the entire pre-defined grid
will be explored.
max_smbo_iterations (optional) Maximum number of intensify
iterations of the SMBO algorithm. During an intensify iteration a run-off
occurs between the current best hyperparameter combination and either 10
challenger combinations with the highest expected improvement or a set of 20
random combinations.
 Run-off with random combinations is used to force exploration of the
hyperparameter space, and is performed every second intensify iteration, or
if there is no expected improvement for any challenger combination.
 If a combination of hyperparameters leads to better performance on the same
data than the incumbent best set of hyperparameters, it replaces the
incumbent set at the end of the intensify iteration.
 The default number of intensify iterations is 20. Iterations may be
stopped early if the incumbent set of hyperparameters remains the same for
smbo_stop_convergent_iterations iterations, or performance improvement is
minimal. This behaviour is suppressed during the first 4 iterations to
enable the algorithm to explore the hyperparameter space.
smbo_stop_convergent_iterations (optional) The number of subsequent
convergent SMBO iterations required to stop hyperparameter optimisation
early. An iteration is convergent if the best parameter set has not
changed or the optimisation score over the 4 most recent iterations has not
changed beyond the tolerance level in smbo_stop_tolerance.
 The default value is 3.
smbo_stop_tolerance (optional) Tolerance for early stopping due to
convergent optimisation score.
 The default value depends on the square root of the number of samples (at
the series level), and is 0.01 for 100 samples. This value is computed as
0.1 * 1 / sqrt(n_samples). The tolerance is not lowered below 0.0001,
which is its value for 1M or more samples.
smbo_time_limit (optional) Time limit (in minutes) for the
optimisation process. Optimisation is stopped after this limit is exceeded.
Time taken to determine variable importance for the optimisation process
(see the optimisation_determine_vimp parameter) does not count.
 The default is NULL, indicating that there is no time limit for the
optimisation process. The time limit cannot be less than 1 minute.
smbo_initial_bootstraps (optional) The number of bootstraps taken
from the set of optimisation_bootstraps as the bootstraps assessed
initially.
 The default value is 1. The value cannot be larger than
optimisation_bootstraps.
smbo_step_bootstraps (optional) The number of bootstraps taken from
the set of optimisation_bootstraps bootstraps as the bootstraps assessed
during the steps of each intensify iteration.
 The default value is 3. The value cannot be larger than
optimisation_bootstraps.
smbo_intensify_steps (optional) The number of steps in each SMBO
intensify iteration. Each step a new set of smbo_step_bootstraps
bootstraps is drawn and used in the run-off between the incumbent best
hyperparameter combination and its challengers.
 The default value is 5. Higher numbers allow for a more detailed
comparison, but this comes with added computational cost.
optimisation_metric (optional) One or more metrics used to compute
performance scores. See the vignette on performance metrics for the
available metrics.
 If unset, the following metrics are used by default:
 
 auc_roc: For binomial and multinomial models.
 mse: Mean squared error for continuous models.
 msle: Mean squared logarithmic error for count models.
 concordance_index: For survival models.
 Multiple optimisation metrics can be specified. Actual metric values are
converted to an objective value by comparison with a baseline metric value
that derives from a trivial model, i.e. majority class for binomial and
multinomial outcomes, the median outcome for count and continuous outcomes
and a fixed risk or time for survival outcomes.
optimisation_function (optional) Type of optimisation function used
to quantify the performance of a hyperparameter set. Model performance is
assessed using the metric(s) specified by optimisation_metric on the
in-bag (IB) and out-of-bag (OOB) samples of a bootstrap. These values are
converted to objective scores with a standardised interval of
[-1.0, 1.0]. Each pair of objective scores is subsequently used to compute
an optimisation score. The optimisation score across different bootstraps
is then aggregated to a summary score. This summary score is used to rank
hyperparameter sets, and select the optimal set.
 The combination of optimisation score and summary score is determined by
the optimisation function indicated by this parameter:
 
 validation or max_validation (default): seeks to maximise OOB score.
 balanced: seeks to balance IB and OOB score.
 stronger_balance: similar to balanced, but with stronger penalty for
differences between IB and OOB scores.
 validation_minus_sd: seeks to optimise the average OOB score minus its
standard deviation.
 validation_25th_percentile: seeks to optimise the 25th percentile of
OOB scores, and is conceptually similar to validation_minus_sd.
 model_estimate: seeks to maximise the OOB score estimate predicted by
the hyperparameter learner (not available for random search).
 model_estimate_minus_sd: seeks to maximise the OOB score estimate minus
its estimated standard deviation, as predicted by the hyperparameter
learner (not available for random search).
 model_balanced_estimate: seeks to maximise the estimate of the balanced
IB and OOB score. This is similar to the balanced score, and in fact uses
a hyperparameter learner to predict said score (not available for random
search).
 model_balanced_estimate_minus_sd: seeks to maximise the estimate of the
balanced IB and OOB score, minus its estimated standard deviation. This is
similar to the balanced score, but takes into account its estimated
spread.
 Additional details are provided in the Learning algorithms and
hyperparameter optimisation vignette.
hyperparameter_learner (optional) Any point in the hyperparameter
space has a single, scalar, optimisation score value that is a priori
unknown. During the optimisation process, the algorithm samples from the
hyperparameter space by selecting hyperparameter sets and computing the
optimisation score value for one or more bootstraps. For each
hyperparameter set the resulting values are distributed around the actual
value. The learner indicated by hyperparameter_learner is then used to
infer optimisation score estimates for unsampled parts of the
hyperparameter space. The following models are available:
 
 bayesian_additive_regression_trees or bart: Uses Bayesian Additive
Regression Trees (Sparapani et al., 2021) for inference. Unlike standard
random forests, BART allows for estimating posterior distributions directly
and can extrapolate.
 gaussian_process (default): Creates a localised approximate Gaussian
process for inference (Gramacy, 2016). This allows for better scaling than
deterministic Gaussian Processes.
 random_forest: Creates a random forest for inference. Originally
suggested by Hutter et al. (2011). A weakness of random forests is their
lack of extrapolation beyond observed values, which limits their usefulness
in exploiting promising areas of hyperparameter space.
 random or random_search: Forgoes the use of models to steer
optimisation. Instead, a random search is performed.
acquisition_function (optional) The acquisition function influences
how new hyperparameter sets are selected. The algorithm uses the model
learned by the learner indicated by hyperparameter_learner to search the
hyperparameter space for hyperparameter sets that are either likely better
than the best known set (exploitation) or where there is considerable
uncertainty (exploration). The acquisition function quantifies this
(Shahriari et al., 2016). The following acquisition functions are available, and are described in
more detail in the learner algorithms vignette:
 
 improvement_probability: The probability of improvement quantifies the
probability that the expected optimisation score for a set is better than
the best observed optimisation score.
 improvement_empirical_probability: Similar to improvement_probability, but based directly on optimisation scores
predicted by the individual decision trees.
 expected_improvement (default): Computes expected improvement.
 upper_confidence_bound: This acquisition function is based on the upper
confidence bound of the distribution (Srinivas et al., 2012).
 bayes_upper_confidence_bound: This acquisition function is based on the
upper confidence bound of the distribution (Kaufmann et al., 2012).
exploration_method (optional) Method used to steer exploration in
post-initialisation intensive searching steps. As stated earlier, each SMBO
iteration step compares suggested alternative parameter sets with an
incumbent best set in a series of steps. The exploration method
controls how the set of alternative parameter sets is pruned after each
step in an iteration. Can be one of the following:
 
 single_shot (default): The set of alternative parameter sets is not
pruned, and each intensification iteration contains only a single
intensification step that only uses a single bootstrap. This is the fastest
exploration method, but only superficially tests each parameter set.
 successive_halving: The set of alternative parameter sets is
pruned by removing the worst performing half of the sets after each step
(Jamieson and Talwalkar, 2016).
 stochastic_reject: The set of alternative parameter sets is pruned by
comparing the performance of each parameter set with that of the incumbent
best parameter set using a paired Wilcoxon test based on shared
bootstraps. Parameter sets that perform significantly worse, at an alpha
level indicated by smbo_stochastic_reject_p_value, are pruned.
 none: The set of alternative parameter sets is not pruned.
smbo_stochastic_reject_p_value (optional) The p-value threshold used
for the stochastic_reject exploration method. The default value is 0.05.
parallel_hyperparameter_optimisation (optional) Enable parallel
processing for hyperparameter optimisation. Defaults to TRUE. When set to FALSE, this will disable the use of parallel processing while performing
optimisation, regardless of the settings of the parallel parameter. The
parameter moreover specifies whether parallelisation takes place within the
optimisation algorithm (inner, default), or in an outer loop (outer)
over learners, data subsamples, etc. parallel_hyperparameter_optimisation is ignored if parallel=FALSE.
evaluate_top_level_only (optional) Flag that signals that only
evaluation at the most global experiment level is required. Consider a
cross-validation experiment with additional external validation. The global
experiment level consists of data that are used for development, internal
validation and external validation. The next lower experiment level consists of the
individual cross-validation iterations.
 When the flag is true, evaluations take place on the global level only,
and no results are generated for the next lower experiment levels. In our
example, this means that results from individual cross-validation iterations
are not computed and shown. When the flag is false, results are computed
from both the global level and the next lower level. Setting the flag to true saves computation time.
skip_evaluation_elements (optional) Specifies which evaluation steps,
if any, should be skipped as part of the evaluation process. Defaults to
none, which means that all relevant evaluation steps are performed. It can
have one or more of the following values: 
 none, false: no steps are skipped.
 all, true: all steps are skipped.
 auc_data: data for assessing and plotting the area under the receiver
operating characteristic curve are not computed.
 calibration_data: data for assessing and plotting model calibration are
not computed.
 calibration_info: data required to assess calibration, such as baseline
survival curves, are not collected. These data will still be present in the
models.
 confusion_matrix: data for assessing and plotting a confusion matrix are
not collected.
 decision_curve_analyis: data for performing a decision curve analysis
are not computed.
 feature_expressions: data for assessing and plotting sample clustering
are not computed.
 feature_similarity: data for assessing and plotting feature clusters are
not computed.
 fs_vimp: data for assessing and plotting feature selection-based
variable importance are not collected.
 hyperparameters: data for assessing model hyperparameters are not
collected. These data will still be present in the models.
 ice_data: data for individual conditional expectation and partial
dependence plots are not created.
 model_performance: data for assessing and visualising model performance
are not created.
 model_vimp: data for assessing and plotting model-based variable
importance are not collected.
 permutation_vimp: data for assessing and plotting model-agnostic
permutation variable importance are not computed.
 prediction_data: predictions for each sample are not made and exported.
 risk_stratification_data: data for assessing and plotting Kaplan-Meier
survival curves are not collected.
 risk_stratification_info: data for assessing stratification into risk
groups are not computed.
 univariate_analysis: data for assessing and plotting univariate feature
importance are not computed.
ensemble_method (optional) Method for ensembling predictions from
models for the same sample. Available methods are:
 This parameter is only used if detail_level is ensemble.
evaluation_metric (optional) One or more metrics for assessing model
performance. See the vignette on performance metrics for the available
metrics.
 Confidence intervals (or rather credibility intervals) are computed for each
metric during evaluation. This is done using bootstraps, the number of which
depends on the value of confidence_level (Davison and Hinkley, 1997). If unset, the metric in the optimisation_metric variable is used.
sample_limit (optional) Set the upper limit of the number of samples
that are used during evaluation steps. Cannot be less than 20.
 This setting can be specified per data element by providing a parameter
value in a named list with data elements, e.g.
list("sample_similarity"=100, "permutation_vimp"=1000). This parameter can be set for the following data elements:
sample_similarity and ice_data.
detail_level (optional) Sets the level at which results are computed
and aggregated.
 
 ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
 hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
 model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
 Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model. hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
 A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.
estimation_type (optional) Sets the type of estimation that should be
possible. This has the following options:
 
 point: Point estimates.
 bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
 bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.
 As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following
data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.
aggregate_results (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if estimation_type is point. The default value is TRUE, except when assessing metrics to assess
model performance, as the default violin plot requires underlying data. As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
confidence_level (optional) Numeric value for the level at which
confidence intervals are determined. When bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps. The default value is 0.95.
bootstrap_ci_method (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
 Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
feature_cluster_method (optional) Method used to perform clustering
of features. The same methods as for the cluster_method configuration
parameter are available: none, hclust, agnes, diana and pam. The value for the cluster_method configuration parameter is used by
default. When generating clusters for the purpose of determining mutual
correlation and ordering feature expressions, none is ignored and hclust is used instead.
feature_linkage_method (optional) Method used for agglomerative
clustering with hclust and agnes. Linkage determines how features are
sequentially combined into clusters based on distance. The methods are
shared with the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. The value for the cluster_linkage_method configuration parameter is used
by default.
feature_cluster_cut_method (optional) Method used to divide features
into separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut. silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
 The value for the cluster_cut_method configuration parameter is used by
default.
feature_similarity_metric (optional) Metric to determine pairwise
similarity between features. Similarity is computed in the same manner as
for clustering, and feature_similarity_metric therefore has the same
options as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, mutual_information, spearman, kendall and pearson. The value used for the cluster_similarity_metric configuration parameter
is used by default.
feature_similarity_threshold (optional) The threshold level for
pair-wise similarity that is required to form feature clusters with the
fixed_cut method. This threshold functions in the same manner as the one
defined using the cluster_similarity_threshold parameter. By default, the value for the cluster_similarity_threshold configuration
parameter is used. Unlike for cluster_similarity_threshold, more than one value can be
supplied here.
sample_cluster_method (optional) The method used to perform
clustering based on distance between samples. These are the same methods as
for the cluster_method configuration parameter: hclust, agnes, diana and pam. The value for the cluster_method configuration parameter is used by
default. When generating clusters for the purpose of ordering samples in
feature expressions, none is ignored and hclust is used instead.
sample_linkage_method (optional) The method used for agglomerative
clustering in hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward. The value for the cluster_linkage_method configuration parameter is used
by default.
sample_similarity_metric (optional) Metric to determine pairwise
similarity between samples. Similarity is computed in the same manner as for
clustering, but sample_similarity_metric has different options that are
better suited to computing distance between samples instead of between
features. The following metrics are available. 
 gower (default): compute Gower's distance between samples. By default,
Gower's distance is computed based on winsorised data to reduce the effect
of outliers (see below).
 euclidean: compute the Euclidean distance between samples.
 The underlying feature data for numerical features is scaled to the
[0,1] range using the feature values across the samples. The
normalisation parameters required can optionally be computed from feature
data with the outer 5% (on both sides) of feature values trimmed or
winsorised. To do so, append _trim (trimming) or _winsor (winsorising) to
the metric name. This reduces the effect of outliers somewhat. Regardless of metric, all categorical features are handled as for
Gower's distance: distance is 0 if the values in a pair of samples match,
and 1 if they do not.
eval_aggregation_method (optional) Method for aggregating variable
importances for the purpose of evaluation. Variable importances are
determined during feature selection steps and after training the model. Both
types are evaluated, but feature selection variable importance is only
evaluated at run-time.
 See the documentation for the vimp_aggregation_method argument for
information concerning the different methods available.
eval_aggregation_rank_threshold (optional) The threshold used to
define the subset of highly important features during evaluation.
 See the documentation for the vimp_aggregation_rank_threshold argument for
more information.
eval_icc_type (optional) String indicating the type of intraclass
correlation coefficient (1, 2 or 3) that should be used to compute
robustness for features in repeated measurements during the evaluation of
univariate importance. These types correspond to the types in Shrout and
Fleiss (1979). The default value is 1.
stratification_method (optional) Method for determining the
stratification threshold for creating survival groups. The actual,
model-dependent, threshold value is obtained from the development data, and
can afterwards be used to perform stratification on validation data.
 The following stratification methods are available:
 
 median (default): The median predicted value in the development cohort
is used to stratify the samples into two risk groups. For predicted outcome
values that build a continuous spectrum, the two risk groups in the
development cohort will be roughly equal in size.
 mean: The mean predicted value in the development cohort is used to
stratify the samples into two risk groups.
 mean_trim: As mean, but based on the set of predicted values
where the 5% lowest and 5% highest values are discarded. This reduces the
effect of outliers.
 mean_winsor: As mean, but based on the set of predicted values where
the 5% lowest and 5% highest values are winsorised. This reduces the effect
of outliers.
 fixed: Samples are stratified based on the sample quantiles of the
predicted values. These quantiles are defined using the stratification_threshold parameter.
 optimised: Use maximally selected rank statistics to determine the
optimal threshold (Lausen and Schumacher, 1992; Hothorn et al., 2003) to
stratify samples into two optimally separated risk groups.
 One or more stratification methods can be selected simultaneously.
 This parameter is only relevant for survival outcomes.
stratification_threshold (optional) Numeric value(s) signifying the
sample quantiles for stratification using the fixed method. The number of
risk groups will be the number of values plus 1. The default value is c(1/3, 2/3), which will yield two thresholds that
divide samples into three equally sized groups. If fixed is not among the
selected stratification methods, this parameter is ignored. This parameter is only relevant for survival outcomes.
time_max (optional) Time point which is used as the benchmark for
e.g. cumulative risks generated by random forest, or the cutoff for Uno's
concordance index.
 If time_max is not provided, but evaluation_times is, the largest value
of evaluation_times is used. If neither is provided, time_max is set
to the 98th percentile of the distribution of survival times for samples
with an event in the development data set. This parameter is only relevant for survival outcomes.
evaluation_times (optional) One or more time points that are used for
assessing calibration in survival problems. This is done as expected and
observed survival probabilities depend on time.
 If unset, evaluation_times will be equal to time_max. This parameter is only relevant for survival outcomes.
dynamic_model_loading (optional) Enables dynamic loading of models
during the evaluation process, if TRUE. Defaults to FALSE. Dynamic
loading of models may reduce the overall memory footprint, at the cost of
increased disk or network IO. Models can only be dynamically loaded if they
are found at an accessible disk or network location. Setting this parameter
to TRUE may help if parallel processing causes out-of-memory issues during
evaluation.
parallel_evaluation (optional) Enable parallel processing for
the evaluation process. Defaults to TRUE. When set to FALSE, this
will disable the use of parallel processing while performing evaluation,
regardless of the settings of the parallel parameter. The parameter
moreover specifies whether parallelisation takes place within the evaluation
process steps (inner, default), or in an outer loop (outer) over
learners, data subsamples, etc. parallel_evaluation is ignored if parallel=FALSE.
 | 
Value
Nothing. All output is written to the experiment directory. If the
experiment directory is in a temporary location, a list with all
familiarModel, familiarEnsemble, familiarData and familiarCollection
objects will be returned.
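Example
The hyperparameter optimisation and evaluation settings described in the arguments above are passed by name. The following is a minimal sketch rather than a recommended configuration: the iris data set, the outcome settings, the mrmr feature selection method, the glm learner, and the experiment directory are assumptions chosen for illustration, and the optional packages required by those methods need to be installed. The call itself only runs in an interactive session where familiar is available.

```r
# Illustrative settings only: data set, outcome, feature selection method and
# learner are assumptions, not defaults or recommendations.
settings <- list(
  data = iris,
  experiment_dir = file.path(tempdir(), "familiar_example"),
  outcome_type = "multinomial",
  outcome_column = "Species",
  experimental_design = "fs+mb",
  cluster_method = "none",
  fs_method = "mrmr",
  learner = "glm",
  parallel = FALSE,
  # Hyperparameter optimisation settings.
  optimisation_function = "validation_minus_sd",
  exploration_method = "successive_halving",
  smbo_intensify_steps = 5,
  # Evaluation settings; per-element values are given as named lists.
  skip_evaluation_elements = c("ice_data", "permutation_vimp"),
  detail_level = list("auc_data" = "ensemble", "model_performance" = "hybrid"),
  estimation_type = list("auc_data" = "bci", "model_performance" = "point"),
  confidence_level = 0.95
)

# Only run the (potentially slow) experiment interactively, and only if the
# familiar package is installed.
if (interactive() && requireNamespace("familiar", quietly = TRUE)) {
  results <- do.call(familiar::summon_familiar, settings)
}
```

Because the experiment directory in this sketch lies in a temporary location, the call is expected to return a list of objects, as described under Value.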
References
-  Storey, J. D. A direct approach to false discovery rates. J.
R. Stat. Soc. Series B Stat. Methodol. 64, 479–498 (2002).
 
-  Shrout, P. E. & Fleiss, J. L. Intraclass correlations: uses in assessing
rater reliability. Psychol. Bull. 86, 420–428 (1979).
 
-  Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass
correlation coefficients for reliability research. J. Chiropr. Med. 15,
155–163 (2016).
 
-  Yeo, I. & Johnson, R. A. A new family of power transformations to
improve normality or symmetry. Biometrika 87, 954–959 (2000).
 
-  Box, G. E. P. & Cox, D. R. An analysis of transformations. J. R. Stat.
Soc. Series B Stat. Methodol. 26, 211–252 (1964).
 
-  Raymaekers, J. & Rousseeuw, P. J. Transforming variables to central
normality. Mach Learn. (2021).
 
-  Park, M. Y., Hastie, T. & Tibshirani, R. Averaged gene expressions for
regression. Biostatistics 8, 212–227 (2007).
 
-  Tolosi, L. & Lengauer, T. Classification with correlated features:
unreliability of feature ranking and solutions. Bioinformatics 27,
1986–1994 (2011).
 
-  Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in
microarray expression data using empirical Bayes methods. Biostatistics 8,
118–127 (2007).
 
-  Kaufman, L. & Rousseeuw, P. J. Finding groups in data: an introduction
to cluster analysis. (John Wiley & Sons, 2009).
 
-  Muellner, D. fastcluster: fast hierarchical, agglomerative clustering
routines for R and Python. J. Stat. Softw. 53, 1–18 (2013).
 
-  Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and
validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
 
-  Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a
hierarchical cluster tree: the Dynamic Tree Cut package for R.
Bioinformatics 24, 719–720 (2008).
 
-  McFadden, D. Conditional logit analysis of qualitative choice behavior.
in Frontiers in Econometrics (ed. Zarembka, P.) 105–142 (Academic Press,
1974).
 
-  Cox, D. R. & Snell, E. J. Analysis of binary data. (Chapman and Hall,
1989).
 
-  Nagelkerke, N. J. D. A note on a general definition of the coefficient
of determination. Biometrika 78, 691–692 (1991).
 
-  Meinshausen, N. & Buehlmann, P. Stability selection. J. R. Stat. Soc.
Series B Stat. Methodol. 72, 417–473 (2010).
 
-  Haury, A.-C., Gestraud, P. & Vert, J.-P. The influence of feature
selection methods on accuracy, stability and interpretability of molecular
signatures. PLoS One 6, e28210 (2011).
 
-  Wald, R., Khoshgoftaar, T. M., Dittman, D., Awada, W. & Napolitano, A. An
extensive comparison of feature ranking aggregation techniques in
bioinformatics. in 2012 IEEE 13th International Conference on Information
Reuse Integration (IRI) 377–384 (2012).
 
-  Hutter, F., Hoos, H. H. & Leyton-Brown, K. Sequential model-based
optimization for general algorithm configuration. in Learning and
Intelligent Optimization (ed. Coello, C. A. C.) 6683, 507–523 (Springer
Berlin Heidelberg, 2011).
 
-  Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & de Freitas, N.
Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc.
IEEE 104, 148–175 (2016).
 
-  Srinivas, N., Krause, A., Kakade, S. M. & Seeger, M. W.
Information-Theoretic Regret Bounds for Gaussian Process Optimization in
the Bandit Setting. IEEE Trans. Inf. Theory 58, 3250–3265 (2012).
 
-  Kaufmann, E., Cappé, O. & Garivier, A. On Bayesian upper confidence
bounds for bandit problems. in Artificial intelligence and statistics
592–600 (2012).
 
-  Jamieson, K. & Talwalkar, A. Non-stochastic Best Arm Identification and
Hyperparameter Optimization. in Proceedings of the 19th International
Conference on Artificial Intelligence and Statistics (eds. Gretton, A. &
Robert, C. C.) vol. 51 240–248 (PMLR, 2016).
 
-  Gramacy, R. B. laGP: Large-Scale Spatial Modeling via Local Approximate
Gaussian Processes in R. Journal of Statistical Software 72, 1–46 (2016).
 
-  Sparapani, R., Spanbauer, C. & McCulloch, R. Nonparametric Machine
Learning and Efficient Computation with Bayesian Additive Regression Trees:
The BART R Package. Journal of Statistical Software 97, 1–66 (2021).
 
-  Davison, A. C. & Hinkley, D. V. Bootstrap methods and their application.
(Cambridge University Press, 1997).
 
-  Efron, B. & Hastie, T. Computer Age Statistical Inference. (Cambridge
University Press, 2016).
 
-  Lausen, B. & Schumacher, M. Maximally Selected Rank Statistics.
Biometrics 48, 73 (1992).
 
-  Hothorn, T. & Lausen, B. On the exact distribution of maximally selected
rank statistics. Comput. Stat. Data Anal. 43, 121–137 (2003).
 
Familiar ggplot2 theme
Description
This is the default theme used for plots created by familiar. The theme uses
ggplot2::theme_light as the base template.
Usage
theme_familiar(
  base_size = 10,
  base_family = "",
  base_line_size = 0.5,
  base_rect_size = 0.5
)
Arguments
| base_size | Base font size in points. Size of other plot text elements
is based off this. | 
| base_family | Font family used for text elements. | 
| base_line_size | Base size for line elements, in points. | 
| base_rect_size | Base size for rectangular elements, in points. | 
Value
A complete plotting theme.
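Example
As a hedged illustration of the usage above, the theme can be added to a ggplot2 plot like any other complete theme. The mtcars data set and the plotted columns are assumptions chosen for illustration; the plot is only built if both familiar and ggplot2 are installed.

```r
# Theme arguments as documented above; only base_size deviates from the default.
theme_args <- list(
  base_size = 12,
  base_family = "",
  base_line_size = 0.5,
  base_rect_size = 0.5
)

if (requireNamespace("familiar", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  # Scatter plot of an example data set, styled with the familiar theme.
  p <- ggplot2::ggplot(mtcars, ggplot2::aes(x = wt, y = mpg)) +
    ggplot2::geom_point() +
    do.call(familiar::theme_familiar, theme_args)
  print(p)
}
```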
Create models using end-to-end machine learning
Description
Train models using familiar. Evaluation is not performed.
Usage
train_familiar(
  formula = NULL,
  data = NULL,
  experiment_data = NULL,
  cl = NULL,
  experimental_design = "fs+mb",
  learner = NULL,
  hyperparameter = NULL,
  verbose = TRUE,
  ...
)
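Example
A minimal sketch of a call that matches the signature above. The two-class subset of iris, the outcome settings, the mrmr feature selection method, and the bootstrap design are assumptions chosen for illustration; the learner name and the sign_size hyperparameter follow the example given for the hyperparameter argument. The call only runs in an interactive session where familiar (and the optional packages its methods rely on) is installed.

```r
# Two-class subset of iris, used purely as example data.
binary_data <- droplevels(iris[iris$Species != "setosa", ])

train_args <- list(
  data = binary_data,
  outcome_type = "binomial",
  outcome_column = "Species",
  # 20 bootstraps, each with feature selection and model building.
  experimental_design = "bs(fs+mb,20)",
  fs_method = "mrmr",
  cluster_method = "none",
  learner = "glm_logistic",
  # A fixed hyperparameter value is not optimised.
  hyperparameter = list("glm_logistic" = list("sign_size" = 3)),
  parallel = FALSE,
  verbose = FALSE
)

if (interactive() && requireNamespace("familiar", quietly = TRUE)) {
  models <- do.call(familiar::train_familiar, train_args)
}
```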
Arguments
| formula | An R formula. The formula can only contain feature names and
dot (.). The * and +1 operators are not supported as these refer to
columns that are not present in the data set. Use of the formula interface is optional. | 
| data | A data.table object, a data.frame object, a list containing
multiple data.table or data.frame objects, or paths to data files. data should be provided if no file paths are provided to the data_files argument. If both are provided, only data will be used.
 All data is expected to be in wide format, and ideally has a sample
identifier (see sample_id_column), batch identifier (see cohort_column)
and outcome columns (see outcome_column). In case paths are provided, the data should be stored as csv, rds or RData files. See documentation for the data_files argument for more
information. | 
| experiment_data | Experimental data may be provided in the form of | 
| cl | Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation. When a cluster is not
provided, parallelisation is performed by setting up a cluster on the local
machine. This parameter has no effect if the parallel argument is set to FALSE. | 
| experimental_design | (required) Defines what the experiment looks
like, e.g. cv(bt(fs,20)+mb,3,2) for 2 times repeated 3-fold
cross-validation with nested feature selection on 20 bootstraps and
model-building. The basic workflow components are: 
 fs: (required) feature selection step.
 mb: (required) model building step.
 ev: (optional) external validation. Setting this is not required for train_familiar, but if validation batches or cohorts are present in the
dataset (data), these should be indicated in the validation_batch_id argument.
 The different components are linked using +. Different subsampling methods can be used in conjunction with the basic
workflow components:
 
 bs(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. In contrast to bt, feature pre-processing parameters and
hyperparameter optimisation are conducted on individual bootstraps.
 bt(x,n): (stratified) .632 bootstrap, with n the number of
bootstraps. Unlike bs and other subsampling methods, no separate
pre-processing parameters or optimised hyperparameters will be determined
for each bootstrap.
 cv(x,n,p): (stratified) n-fold cross-validation, repeated p times.
Pre-processing parameters are determined for each iteration.
 lv(x): leave-one-out-cross-validation. Pre-processing parameters are
determined for each iteration.
 ip(x): imbalance partitioning for addressing class imbalances on the
data set. Pre-processing parameters are determined for each partition. The
number of partitions generated depends on the imbalance correction method
(see the imbalance_correction_method parameter).
 As shown in the example above, sampling algorithms can be nested.
 The simplest valid experimental design is fs+mb. This is the default in train_familiar, and will create one model for each feature selection
method in fs_method. To create more models, a subsampling method should
be introduced, e.g. bs(fs+mb,20) to create 20 models based on bootstraps
of the data. This argument is ignored if the experiment_data argument is set. | 
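The nesting in cv(bt(fs,20)+mb,3,2) can be pictured with a short base R sketch. This is not familiar's implementation: the number of samples, the seed and all variable names are assumptions made for illustration, stratification is omitted, and the bootstraps are plain resamples with replacement. The sketch only shows how 2 repetitions of 3-fold cross-validation produce 6 training sets, each carrying 20 bootstraps for feature selection.

```r
# Illustrative sketch only; familiar creates these subsamples internally.
set.seed(42)          # assumed seed, so that the sketch is reproducible
n_samples <- 30L      # hypothetical number of samples
n_folds <- 3L         # n in cv(x,n,p)
n_repeats <- 2L       # p in cv(x,n,p)
n_bootstraps <- 20L   # n in bt(x,n)

design <- list()
for (repetition in seq_len(n_repeats)) {
  # Randomly assign every sample to one of the folds (unstratified here).
  fold_id <- sample(rep(seq_len(n_folds), length.out = n_samples))

  for (fold in seq_len(n_folds)) {
    train_id <- which(fold_id != fold)
    test_id <- which(fold_id == fold)

    # bt(fs,20): bootstraps of the training fold, used for feature selection.
    bootstraps <- lapply(
      seq_len(n_bootstraps),
      function(ii) sample(train_id, size = length(train_id), replace = TRUE)
    )

    design[[length(design) + 1L]] <- list(
      repetition = repetition,
      fold = fold,
      train_id = train_id,
      test_id = test_id,
      bootstraps = bootstraps
    )
  }
}

# Number of training sets: 3 folds times 2 repetitions.
print(length(design))
```

Each element of design corresponds to one model-building (mb) step, with the held-out fold in test_id available for internal validation.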
| learner | (required) Name of the learner used to develop a model. A
sizeable number of learners is supported in familiar. Please see the
vignette on learners for more information concerning the available
learners. Unlike the summon_familiar function, train_familiar only
allows for a single learner. | 
| hyperparameter | (optional) List, or nested list containing
hyperparameters for learners. If a nested list is provided, each sublist
should have the name of the learner method it corresponds to, with list
elements being named after the intended hyperparameter, e.g.
"glm_logistic"=list("sign_size"=3)
 All learners have hyperparameters. Please refer to the vignette on learners
for more details. If no parameters are provided, sequential model-based
optimisation is used to determine optimal hyperparameters.
 Hyperparameters provided by the user are never optimised. However, if more
than one value is provided for a single hyperparameter, optimisation will
be conducted using these values. | 
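The two ways of supplying hyperparameters described above can be sketched as follows. The learner name glm_logistic and the sign_size hyperparameter come from the example in the text; the candidate values 2, 5 and 10 and the variable names are assumptions chosen for illustration only.

```r
# Fixed value: sign_size is set to 3 and is therefore not optimised.
hyperparameter_fixed <- list(
  "glm_logistic" = list("sign_size" = 3)
)

# Several values: optimisation of sign_size is limited to these candidates.
hyperparameter_candidates <- list(
  "glm_logistic" = list("sign_size" = c(2, 5, 10))
)

# A plain (non-nested) list, with elements named after the hyperparameters.
hyperparameter_flat <- list("sign_size" = 3)

# Inspect the nested structure.
str(hyperparameter_fixed)
```

Any of these lists would be passed through the hyperparameter argument of train_familiar.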
| verbose | Indicates verbosity of the results. Default is TRUE, and all
messages and warnings are returned. | 
| ... | Arguments passed on to .parse_experiment_settings, .parse_setup_settings, .parse_preprocessing_settings, .parse_feature_selection_settings, .parse_model_development_settings, .parse_hyperparameter_optimisation_settings 
batch_id_column: (recommended) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.
 In familiar any row of data is organised by four identifiers:
 
 The batch identifier batch_id_column: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
 The sample identifier sample_id_column: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
 The series identifier series_id_column: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
 The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
sample_id_column: (recommended) Name of the column containing
sample or subject identifiers. See batch_id_column above for more
details. If unset, every row will be identified as a single sample.
series_id_column: (optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See batch_id_column above for more details.
 If unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers.
development_batch_id: (optional) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all, or
all minus the identifiers in validation_batch_id for external validation.
Required if external validation is performed and validation_batch_id is
not provided.
validation_batch_id: (optional) One or more batch or cohort
identifiers to constitute data sets for external validation. Defaults to
all data sets except those in development_batch_id for external
validation, or none if no external validation is performed. Required if development_batch_id is not
provided.
outcome_name: (optional) Name of the modelled outcome. This name will
be used in figures created by familiar.
 If not set, the column name in outcome_column will be used for binomial, multinomial, count and continuous outcomes. For other
outcomes (survival and competing_risk) no default is used.
outcome_column: (recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that survival and competing_risk outcomes require two columns that
indicate the time-to-event or the time of last follow-up and the event
status.
outcome_type: (recommended) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
 
 binomial: categorical outcome with 2 levels.
 multinomial: categorical outcome with 2 or more levels.
 count: Poisson-distributed numeric outcomes.
 continuous: general continuous numeric outcomes.
 survival: survival outcome for time-to-event data.
 If not provided, the algorithm will attempt to obtain outcome_type from
the contents of the outcome column. This may lead to unexpected results, and we
therefore advise providing this information manually.
 Note that competing_risk survival analysis is not fully supported, and
is currently not a valid choice for outcome_type.
class_levels: (optional) Class levels for binomial or multinomial outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column.
event_indicator: (recommended) Indicator for events in survival and competing_risk analyses. familiar will automatically recognise 1, true, t, y and yes as event indicators, including different
capitalisations. If this parameter is set, it replaces the default values.
censoring_indicator: (recommended) Indicator for right-censoring in
survival and competing_risk analyses. familiar will automatically
recognise 0, false, f, n and no as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values.
competing_risk_indicator: (recommended) Indicator for competing
risks in competing_risk analyses. There are no default values, and if
unset, all values other than those specified by the event_indicator and censoring_indicator parameters are considered to indicate competing
risks.
signature: (optional) One or more names of feature columns that are
considered part of a specific signature. Features specified here will
always be used for modelling. Ranking from feature selection has no effect
for these features.
novelty_features: (optional) One or more names of feature columns
that should be included for the purpose of novelty detection.
exclude_features: (optional) Feature columns that will be removed
from the data set. Cannot overlap with features in signature, novelty_features or include_features.
include_features: (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap with signature. Features in signature and novelty_features are always included. If both exclude_features and include_features are provided, include_features takes precedence, provided that there is no overlap between the two.
reference_method: (optional) Method used to set reference levels for
categorical features. There are several options:
 
 auto (default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
 always: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
 never: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
imbalance_correction_method: (optional) Type of method used to
address class imbalances. Available options are:
 
 full_undersampling (default): All data will be used in an ensemble
fashion. The full minority class will appear in each partition, but
majority classes are undersampled until all data have been used.
 random_undersampling: Randomly undersamples majority classes. This is
useful in cases where full undersampling would lead to the formation of
many models due to major overrepresentation of the largest class.
 This parameter is only used in combination with imbalance partitioning in
the experimental design, and ip should therefore appear in the string
that defines the design.
imbalance_n_partitions: (optional) Number of times random
undersampling should be repeated. 10 undersampled subsets with balanced
classes are formed by default.
parallel: (optional) Enable parallel processing. Defaults to TRUE.
When set to FALSE, this disables all parallel processing, regardless of
specific parameters such as parallel_preprocessing. However, when parallel is TRUE, parallel processing of different parts of the
workflow can be disabled by setting respective flags to FALSE.
parallel_nr_cores: (optional) Number of cores available for
parallelisation. Defaults to 2. This setting does nothing if
parallelisation is disabled.
restart_cluster: (optional) Restart nodes used for parallel computing
to free up memory prior to starting a parallel process. Note that it does
take time to set up the clusters. Therefore setting this argument to TRUE may impact processing speed. This argument is ignored if parallel is FALSE or the cluster was initialised outside of familiar. Default is FALSE, which causes the clusters to be initialised only once.
cluster_type: (optional) Selection of the cluster type for parallel
processing. Available types are the ones supported by the parallel package
that is part of the base R distribution: psock (default), fork, mpi, nws, sock. In addition, none is available, which also disables
parallel processing.
backend_type: (optional) Selection of the backend for distributing
copies of the data. This backend ensures that only a single master copy is
kept in memory. This limits memory usage during parallel processing.
 Several backend options are available, notably socket_server, and none (default). socket_server is based on the callr package and R sockets,
comes with familiar and is available for any OS. none uses the package
environment of familiar to store data, and is available for any OS.
However, none requires copying of data to any parallel process, and has a
larger memory footprint.
server_port: (optional) Integer indicating the port on which the
socket server or RServe process should communicate. Defaults to port 6311.
Note that ports 0 to 1024 and 49152 to 65535 cannot be used.
feature_max_fraction_missing: (optional) Numeric value between 0.0 and 0.95 that determines the maximum fraction of missing values that
still allows a feature to be included in the data set. All features with a
missing value fraction over this threshold are not processed further. The
default value is 0.30.
sample_max_fraction_missing: (optional) Numeric value between 0.0 and 0.95 that determines the maximum fraction of missing values that
still allows a sample to be included in the data set. All samples with a
missing value fraction over this threshold are excluded and not processed
further. The default value is 0.30.
filter_method: (optional) One or more methods used to reduce
dimensionality of the data set by removing irrelevant or poorly
reproducible features.
 Several methods are available:
 
 none (default): None of the features will be filtered.
 low_variance: Features with a variance below the low_var_minimum_variance_threshold are filtered. This can be useful to
filter, for example, genes that are not differentially expressed.
 univariate_test: Features undergo a univariate regression using an
outcome-appropriate regression model. The p-value of the model coefficient
is collected. Features with coefficient p or q-value above the univariate_test_threshold are subsequently filtered.
 robustness: Features that are not sufficiently robust according to the
intraclass correlation coefficient are filtered. Use of this method
requires that repeated measurements are present in the data set, i.e. there
should be entries for which the sample and cohort identifiers are the same.
 More than one method can be used simultaneously. Features with singular
values are always filtered, as these do not contain information.
univariate_test_threshold: (optional) Numeric value between 1.0 and 0.0 that determines which features are irrelevant and will be filtered by
the univariate_test. The p or q-values are compared to this threshold.
All features with values above the threshold are filtered. The default
value is 0.20.
univariate_test_threshold_metric: (optional) Metric that the
univariate_test_threshold is compared against. The following metrics can
be chosen: 
 p_value (default): The unadjusted p-value of each feature is used
to filter features.
 q_value: The q-value (Storey, 2002) is used to filter features. Some
data sets may have insufficient samples to compute the q-value. The qvalue package must be installed from Bioconductor to use this method.
univariate_test_max_feature_set_size: (optional) Maximum size of the
feature set after the univariate test. P or q values of features are
compared against the threshold, but if the resulting data set would be
larger than this setting, only the most relevant features up to the desired
feature set size are selected.
 The default value is NULL, which causes features to be filtered based on
their relevance only.
low_var_minimum_variance_threshold: (required, if used) Numeric value
that determines which features will be filtered by the low_variance method. The variance of each feature is computed and compared to the
threshold. If it is below the threshold, the feature is removed.
 This parameter has no default value and should be set if low_variance is
used.
low_var_max_feature_set_size: (optional) Maximum size of the feature
set after filtering features with a low variance. All features are first
compared against low_var_minimum_variance_threshold. If the resulting
feature set would be larger than specified, only the most strongly varying
features will be selected, up to the desired size of the feature set.
 The default value is NULL, which causes features to be filtered based on
their variance only.
robustness_icc_type: (optional) String indicating the type of
intraclass correlation coefficient (1, 2 or 3) that should be used to
compute robustness for features in repeated measurements. These types
correspond to the types in Shrout and Fleiss (1979). The default value is 1.
robustness_threshold_metric: (optional) String indicating which
specific intraclass correlation coefficient (ICC) metric should be used to
filter features. This should be one of:
 
 icc: The estimated ICC value itself.
 icc_low (default): The estimated lower limit of the 95% confidence
interval of the ICC, as suggested by Koo and Li (2016).
 icc_panel: The estimated ICC value over the panel average, i.e. the ICC
that would be obtained if all repeated measurements were averaged.
 icc_panel_low: The estimated lower limit of the 95% confidence interval
of the panel ICC.
robustness_threshold_value: (optional) The intraclass correlation
coefficient value that is used as threshold. The default value is 0.70.
transformation_method: (optional) The transformation method used to
change the distribution of the data to be more normal-like. The following
methods are available:
 
 none: This disables transformation of features.
 yeo_johnson: Transformation using the location and scale invariant
version of the Yeo-Johnson transformation (Yeo and Johnson, 2000;
Zwanenburg and Löck, 2023).
 yeo_johnson_robust (default): A robust version of yeo_johnson.
This method is less sensitive to outliers.
 yeo_johnson_conventional: As yeo_johnson, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Yeo and Johnson (2000).
 box_cox: Transformation using the location and scale invariant version
of the Box-Cox transformation (Box and Cox, 1964; Zwanenburg and Löck,
2023).
 box_cox_robust: A robust version of box_cox. This method is less
sensitive to outliers.
 box_cox_conventional: As box_cox, but without optimisation of
location and scale parameters. This method is equivalent to the original
transformation proposed by Box and Cox (1964). This method requires
strictly positive feature values.
 Transformation requires the power.transform package. Only features that
contain numerical data are transformed. Transformation parameters obtained
in development data are stored within featureInfo objects for later use
with validation data sets.
transformation_optimisation_criterion: (optional) Transformation
parameters are optimised using a criterion, conventionally
maximum likelihood estimation. power.transform implements multiple
optimisation criteria, of which the following are available: 
 mle (default): Optimisation using maximum likelihood estimation.
 cramer_von_mises: Optimisation using the Cramér-von Mises
criterion. Zwanenburg and Löck (2023) found that this criterion was
relatively robust against outliers.
transformation_gof_test_p_value: (optional) Not all transformations
will lead to features that are roughly normally distributed. Zwanenburg and
Löck (2023) established an empirical goodness-of-fit test for central
normality. This parameter sets the significance for rejecting the
null-hypothesis that a feature distribution is centrally normal. When the
null-hypothesis is rejected, no transformation is performed. The default
value is NULL, which disables the test.
normalisation_method: (optional) The normalisation method used to
improve the comparability between numerical features that may have very
different scales. The following normalisation methods can be chosen:
 
 none: This disables feature normalisation.
 standardisation: Features are normalised by subtraction of their mean
values and division by their standard deviations. This causes every feature
to have a center value of 0.0 and standard deviation of 1.0.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust (default): A robust version of standardisation that relies on computing Huber's M-estimators for location and scale.
 normalisation: Features are normalised by subtraction of their minimum
values and division by their ranges. This maps all feature values to a [0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features are normalised by subtraction of their median values
and division by their interquartile range.
 mean_centering: Features are centered by subtracting the mean, but do
not undergo rescaling.
 Only features that contain numerical data are normalised. Normalisation
parameters obtained in development data are stored within featureInfo objects for later use with validation data sets.
batch_normalisation_method: (optional) The method used for batch
normalisation. Available methods are:
 
 none (default): This disables batch normalisation of features.
 standardisation: Features within each batch are normalised by
subtraction of the mean value and division by the standard deviation in
each batch.
 standardisation_trim: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are discarded.
This reduces the effect of outliers.
 standardisation_winsor: As standardisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 standardisation_robust: A robust version of standardisation that
relies on computing Huber's M-estimators for location and scale within each
batch.
 normalisation: Features within each batch are normalised by subtraction
of their minimum values and division by their range in each batch. This
maps all feature values in each batch to a [0, 1] interval.
 normalisation_trim: As normalisation, but based on the set of feature
values where the 5% lowest and 5% highest values are discarded. This
reduces the effect of outliers.
 normalisation_winsor: As normalisation, but based on the set of
feature values where the 5% lowest and 5% highest values are winsorised.
This reduces the effect of outliers.
 quantile: Features in each batch are normalised by subtraction of the
median value and division by the interquartile range of each batch.
 mean_centering: Features in each batch are centered on 0.0 by
subtracting the mean value in each batch, but are not rescaled.
 combat_parametric: Batch adjustments using parametric empirical Bayes
(Johnson et al., 2007). combat_p leads to the same method.
 combat_non_parametric: Batch adjustments using non-parametric empirical
Bayes (Johnson et al., 2007). combat_np and combat lead to the same
method. Note that we reduced complexity from O(n^2) to O(n) by
only computing batch adjustment parameters for each feature on a subset of
50 randomly selected features, instead of all features.
 Only features that contain numerical data are normalised using batch
normalisation. Batch normalisation parameters obtained in development data
are stored within featureInfo objects for later use with validation data
sets, in case the validation data is from the same batch.
 If validation data contains data from unknown batches, normalisation
parameters are separately determined for these batches.
 Note that for both empirical Bayes methods, the batch effect is assumed to
produce results across the features. This is often true for things such as
gene expressions, but the assumption may not hold generally.
 When performing batch normalisation, it is moreover important to check that
differences between batches or cohorts are not related to the studied
endpoint.
imputation_method: (optional) Method used for imputing missing
feature values. Two methods are implemented:
 
 simple: Simple replacement of a missing value by the median value (for
numeric features) or the modal value (for categorical features).
 lasso: Imputation of missing values by lasso regression (using glmnet)
based on information contained in other features.
 simple imputation precedes lasso imputation to ensure that any missing
values in predictors required for lasso regression are resolved. The lasso estimate is then used to replace the missing value.
 The default value depends on the number of features in the dataset. If the
number is lower than 100, lasso is used by default, and simple otherwise.
 Only single imputation is performed. Imputation models and parameters are
stored within featureInfo objects for later use with validation data
sets.
cluster_method: (optional) Clustering is performed to identify and
replace redundant features, for example those that are highly correlated.
Such features do not carry much additional information and may be removed
or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011).
 The cluster method determines the algorithm used to form the clusters. The
following cluster methods are implemented:
 
 none: No clustering is performed.
 hclust (default): Hierarchical agglomerative clustering. If the fastcluster package is installed, fastcluster::hclust is used (Muellner
2013), otherwise stats::hclust is used.
 agnes: Hierarchical clustering using agglomerative nesting (Kaufman and
Rousseeuw, 1990). This algorithm is similar to hclust, but uses the cluster::agnes implementation.
 diana: Divisive analysis hierarchical clustering. This method uses
divisive instead of agglomerative clustering (Kaufman and Rousseeuw, 1990). cluster::diana is used.
 pam: Partitioning around medoids. This partitions the data into $k$
clusters around medoids (Kaufman and Rousseeuw, 1990). $k$ is selected
using the silhouette metric. pam is implemented using the cluster::pam function.
 Clusters and cluster information are stored within featureInfo objects for
later use with validation data sets. This enables reproduction of the same
clusters as formed in the development data set.
cluster_linkage_method: (optional) Linkage method used for
agglomerative clustering in hclust and agnes. The following linkage
methods can be used: 
 average (default): Average linkage.
 single: Single linkage.
 complete: Complete linkage.
 weighted: Weighted linkage, also known as McQuitty linkage.
 ward: Linkage using Ward's minimum variance method.
 diana and pam do not require a linkage method.
cluster_cut_method: (optional) The method used to define the actual
clusters. The following methods can be used:
 
 silhouette: Clusters are formed based on the silhouette score
(Rousseeuw, 1987). The average silhouette score is computed from 2 to n clusters, with n the number of features. Clusters are only
formed if the average silhouette exceeds 0.50, which indicates reasonable
evidence for structure. This procedure may be slow if the number of
features is large (>100s).
 fixed_cut: Clusters are formed by cutting the hierarchical tree at the
point indicated by the cluster_similarity_threshold, e.g. where features
in a cluster have an average Spearman correlation of 0.90. fixed_cut is
only available for agnes, diana and hclust.
 dynamic_cut: Dynamic cluster formation using the cutting algorithm in
the dynamicTreeCut package. This package should be installed to select
this option. dynamic_cut can only be used with agnes and hclust.
 The default options are silhouette for partitioning around medoids (pam)
and fixed_cut otherwise.
cluster_similarity_metric: (optional) Clusters are formed based on
feature similarity. All features are compared in a pair-wise fashion to
compute similarity, for example correlation. The resulting similarity grid
is converted into a distance matrix that is subsequently used for
clustering. The following metrics are supported to compute pairwise
similarities:
 
 mutual_information (default): normalised mutual information.
 mcfadden_r2: McFadden's pseudo R-squared (McFadden, 1974).
 cox_snell_r2: Cox and Snell's pseudo R-squared (Cox and Snell, 1989).
 nagelkerke_r2: Nagelkerke's pseudo R-squared (Nagelkerke, 1991).
 spearman: Spearman's rank order correlation.
 kendall: Kendall rank correlation.
 pearson: Pearson product-moment correlation.
 The pseudo R-squared metrics can be used to assess similarity between mixed
pairs of numeric and categorical features, as these are based on the
log-likelihood of regression models. In familiar, the more informative
feature is used as the predictor and the other feature as the response
variable. In numeric-categorical pairs, the numeric feature is considered
to be more informative and is thus used as the predictor. In
categorical-categorical pairs, the feature with most levels is used as the
predictor.
 In case any of the classical correlation coefficients (pearson, spearman and kendall) are used with (mixed) categorical features, the
categorical features are one-hot encoded and the mean correlation over all
resulting pairs is used as similarity.
cluster_similarity_threshold: (optional) The threshold level for
pair-wise similarity that is required to form clusters using fixed_cut.
This should be a numerical value between 0.0 and 1.0. Note however, that a
reasonable threshold value depends strongly on the similarity metric. The
following are the default values used: 
 mcfadden_r2 and mutual_information: 0.30
 cox_snell_r2 and nagelkerke_r2: 0.75
 spearman, kendall and pearson: 0.90
 Alternatively, if the fixed_cut method is not used, this value determines
whether any clustering should be performed, because the data may not
contain highly similar features. The default values in this situation are: 
 mcfadden_r2 and mutual_information: 0.25
 cox_snell_r2 and nagelkerke_r2: 0.40
 spearman, kendall and pearson: 0.70
 The threshold value is converted to a distance (1-similarity) prior to
cutting hierarchical trees.
cluster_representation_method: (optional) Method used to determine
how the information of co-clustered features is summarised and used to
represent the cluster. The following methods can be selected:
 
 best_predictor (default): The feature with the highest importance
according to univariate regression with the outcome is used to represent
the cluster.
 medioid: The feature closest to the cluster center, i.e. the feature
that is most similar to the remaining features in the cluster, is used to
represent the cluster.
 mean: A meta-feature is generated by averaging the feature values for
all features in a cluster. This method aligns all features so that all
features will be positively correlated prior to averaging. Should a cluster
contain one or more categorical features, the medioid method will be used
instead, as averaging is not possible. Note that if this method is chosen,
the normalisation_method parameter should be one of standardisation, standardisation_trim, standardisation_winsor or quantile.
 If the pam cluster method is selected, only the medioid method can be
used. In that case 1 medoid is used by default.
parallel_preprocessing: (optional) Enable parallel processing for the
preprocessing workflow. Defaults to TRUE. When set to FALSE, this will
disable the use of parallel processing while preprocessing, regardless of
the settings of the parallel parameter. parallel_preprocessing is
ignored if parallel=FALSE.
fs_method: (required) Feature selection method to be used for
determining variable importance. familiar implements various feature
selection methods. Please refer to the vignette on feature selection
methods for more details.
 More than one feature selection method can be chosen. The experiment will
then be repeated for each feature selection method.
 Feature selection methods determine the ranking of features. Actual
selection of features is done by optimising the signature size model
hyperparameter during the hyperparameter optimisation step.
fs_method_parameter: (optional) List of lists containing parameters
for feature selection methods. Each sublist should have the name of the
feature selection method it corresponds to.
 Most feature selection methods do not have parameters that can be set.
Please refer to the vignette on feature selection methods for more details.
Note that if the feature selection method is based on a learner (e.g. lasso
regression), hyperparameter optimisation may be performed prior to
assessing variable importance.
vimp_aggregation_method: (optional) The method used to aggregate
variable importances over different data subsets, e.g. bootstraps. The
following methods can be selected:
 
 none: Don't aggregate ranks, but rather aggregate the variable
importance scores themselves.
 mean: Use the mean rank of a feature over the subsets to
determine the aggregated feature rank.
 median: Use the median rank of a feature over the subsets to determine
the aggregated feature rank.
 best: Use the best rank the feature obtained in any subset to determine
the aggregated feature rank.
 worst: Use the worst rank the feature obtained in any subset to
determine the aggregated feature rank.
 stability: Use the frequency of the feature being in the subset of
highly ranked features as measure for the aggregated feature rank
(Meinshausen and Buehlmann, 2010).
 exponential: Use a rank-weighted frequency of occurrence in the subset
of highly ranked features as measure for the aggregated feature rank (Haury
et al., 2011).
 borda (default): Use the borda count as measure for the aggregated
feature rank (Wald et al., 2012).
 enhanced_borda: Use an occurrence frequency-weighted borda count as
measure for the aggregated feature rank (Wald et al., 2012).
 truncated_borda: Use borda count computed only on features within the
subset of highly ranked features.
 enhanced_truncated_borda: Apply both the enhanced borda method and the
truncated borda method and use the resulting borda count as the aggregated
feature rank.
 The feature selection methods vignette provides additional information.
vimp_aggregation_rank_threshold (optional) The threshold used to
define the subset of highly important features. If not set, this threshold
is determined by maximising the variance in the occurrence value over all
features over the subset size.
 This parameter is only relevant for stability, exponential, enhanced_borda, truncated_borda and enhanced_truncated_borda methods.
parallel_feature_selection (optional) Enable parallel processing for
the feature selection workflow. Defaults to TRUE. When set to FALSE,
this will disable the use of parallel processing while performing feature
selection, regardless of the settings of the parallel parameter. parallel_feature_selection is ignored if parallel=FALSE.
novelty_detector (optional) Specify the algorithm used for training
a novelty detector. This detector can be used to identify
out-of-distribution data prospectively.
detector_parameters (optional) List of lists containing hyperparameters
for novelty detectors. Currently not used.
parallel_model_development (optional) Enable parallel processing for
the model development workflow. Defaults to TRUE. When set to FALSE,
this will disable the use of parallel processing while developing models,
regardless of the settings of the parallel parameter. parallel_model_development is ignored if parallel=FALSE.
optimisation_bootstraps (optional) Number of bootstraps that should
be generated from the development data set. During the optimisation
procedure one or more of these bootstraps (indicated by
smbo_step_bootstraps) are used for model development using different
combinations of hyperparameters. The effect of the hyperparameters is then
assessed by comparing in-bag and out-of-bag model performance.
 The default number of bootstraps is 50. Hyperparameter optimisation may
finish before exhausting the set of bootstraps.
optimisation_determine_vimp (optional) Logical value that indicates
whether variable importance is determined separately for each of the
bootstraps created during the optimisation process (TRUE) or the
applicable results from the feature selection step are used (FALSE).
 Determining variable importance increases the initial computational
overhead. However, it prevents positive biases for the out-of-bag data due
to overlap of these data with the development data set used for the feature
selection step. In this case, any hyperparameters of the variable
importance method are not determined separately for each bootstrap, but
those obtained during the feature selection step are used instead. In case
multiple of such hyperparameter sets could be applicable, the set that will
be used is randomly selected for each bootstrap.
 This parameter only affects hyperparameter optimisation of learners. The
default is TRUE.
smbo_random_initialisation (optional) String indicating the
initialisation method for the hyperparameter space. Can be one of
fixed_subsample (default), fixed, or random. fixed and fixed_subsample first create hyperparameter sets from a range of default
values set by familiar. fixed_subsample then randomly draws up to smbo_n_random_sets from the grid. random does not rely upon a fixed
grid, and randomly draws up to smbo_n_random_sets hyperparameter sets
from the hyperparameter space.
smbo_n_random_sets (optional) Number of random or subsampled
hyperparameters drawn during the initialisation process. Default: 100.
Cannot be smaller than 10. The parameter is not used when smbo_random_initialisation is fixed, as the entire pre-defined grid
will be explored.
max_smbo_iterations (optional) Maximum number of intensify
iterations of the SMBO algorithm. During an intensify iteration a run-off
occurs between the current best hyperparameter combination and either the 10
challenger combinations with the highest expected improvement or a set of 20
random combinations.
 Run-off with random combinations is used to force exploration of the
hyperparameter space, and is performed every second intensify iteration, or
if there is no expected improvement for any challenger combination.
 If a combination of hyperparameters leads to better performance on the same
data than the incumbent best set of hyperparameters, it replaces the
incumbent set at the end of the intensify iteration.
 The default number of intensify iterations is 20. Iterations may be
stopped early if the incumbent set of hyperparameters remains the same for smbo_stop_convergent_iterations iterations, or performance improvement is
minimal. This behaviour is suppressed during the first 4 iterations to
enable the algorithm to explore the hyperparameter space.
smbo_stop_convergent_iterations (optional) The number of subsequent
convergent SMBO iterations required to stop hyperparameter optimisation
early. An iteration is convergent if the best parameter set has not
changed or the optimisation score over the 4 most recent iterations has not
changed beyond the tolerance level in smbo_stop_tolerance.
 The default value is 3.
smbo_stop_tolerance (optional) Tolerance for early stopping due to
convergent optimisation score.
 The default value depends on the square root of the number of samples (at
the series level), and is 0.01 for 100 samples. This value is computed as 0.1 * 1 / sqrt(n_samples). The upper limit is 0.0001 for 1M or more
samples.
smbo_time_limit (optional) Time limit (in minutes) for the
optimisation process. Optimisation is stopped after this limit is exceeded.
Time taken to determine variable importance for the optimisation process
(see the optimisation_determine_vimp parameter) does not count.
 The default is NULL, indicating that there is no time limit for the
optimisation process. The time limit cannot be less than 1 minute.
smbo_initial_bootstraps (optional) The number of bootstraps taken
from the set of optimisation_bootstraps as the bootstraps assessed
initially.
 The default value is 1. The value cannot be larger than optimisation_bootstraps.
smbo_step_bootstraps (optional) The number of bootstraps taken from
the set of optimisation_bootstraps bootstraps as the bootstraps assessed
during the steps of each intensify iteration.
 The default value is 3. The value cannot be larger than optimisation_bootstraps.
smbo_intensify_steps (optional) The number of steps in each SMBO
intensify iteration. In each step, a new set of smbo_step_bootstraps bootstraps is drawn and used in the run-off between the incumbent best
hyperparameter combination and its challengers.
 The default value is 5. Higher numbers allow for a more detailed
comparison, but this comes with added computational cost.
optimisation_metric (optional) One or more metrics used to compute
performance scores. See the vignette on performance metrics for the
available metrics.
 If unset, the following metrics are used by default:
 
 auc_roc: For binomial and multinomial models.
 mse: Mean squared error for continuous models.
 msle: Mean squared logarithmic error for count models.
 concordance_index: For survival models.
 Multiple optimisation metrics can be specified. Actual metric values are
converted to an objective value by comparison with a baseline metric value
that derives from a trivial model, i.e. majority class for binomial and
multinomial outcomes, the median outcome for count and continuous outcomes
and a fixed risk or time for survival outcomes.
optimisation_function (optional) Type of optimisation function used
to quantify the performance of a hyperparameter set. Model performance is
assessed using the metric(s) specified by optimisation_metric on the
in-bag (IB) and out-of-bag (OOB) samples of a bootstrap. These values are
converted to objective scores with a standardised interval of [-1.0, 1.0]. Each pair of objective scores is subsequently used to compute an
optimisation score. The optimisation score across different bootstraps is
then aggregated to a summary score. This summary score is used to rank
hyperparameter sets, and select the optimal set.
 The combination of optimisation score and summary score is determined by
the optimisation function indicated by this parameter:
 
 validation or max_validation (default): seeks to maximise OOB score.
 balanced: seeks to balance IB and OOB score.
 stronger_balance: similar to balanced, but with stronger penalty for
differences between IB and OOB scores.
 validation_minus_sd: seeks to optimise the average OOB score minus its
standard deviation.
 validation_25th_percentile: seeks to optimise the 25th percentile of
OOB scores, and is conceptually similar to validation_minus_sd.
 model_estimate: seeks to maximise the OOB score estimate predicted by
the hyperparameter learner (not available for random search).
 model_estimate_minus_sd: seeks to maximise the OOB score estimate minus
its estimated standard deviation, as predicted by the hyperparameter
learner (not available for random search).
 model_balanced_estimate: seeks to maximise the estimate of the balanced
IB and OOB score. This is similar to the balanced score, and in fact uses
a hyperparameter learner to predict said score (not available for random
search).
 model_balanced_estimate_minus_sd: seeks to maximise the estimate of the
balanced IB and OOB score, minus its estimated standard deviation. This is
similar to the balanced score, but takes into account its estimated
spread.
 Additional details are provided in the Learning algorithms and
hyperparameter optimisation vignette.
hyperparameter_learner (optional) Any point in the hyperparameter
space has a single, scalar, optimisation score value that is a priori
unknown. During the optimisation process, the algorithm samples from the
hyperparameter space by selecting hyperparameter sets and computing the
optimisation score value for one or more bootstraps. For each
hyperparameter set the resulting values are distributed around the actual
value. The learner indicated by hyperparameter_learner is then used to
infer optimisation score estimates for unsampled parts of the
hyperparameter space. The following models are available:
 
 bayesian_additive_regression_trees or bart: Uses Bayesian Additive
Regression Trees (Sparapani et al., 2021) for inference. Unlike standard
random forests, BART allows for estimating posterior distributions directly
and can extrapolate.
 gaussian_process (default): Creates a localised approximate Gaussian
process for inference (Gramacy, 2016). This allows for better scaling than
deterministic Gaussian Processes.
 random_forest: Creates a random forest for inference. Originally
suggested by Hutter et al. (2011). A weakness of random forests is their
lack of extrapolation beyond observed values, which limits their usefulness
in exploiting promising areas of hyperparameter space.
 random or random_search: Forgoes the use of models to steer
optimisation. Instead, a random search is performed.
acquisition_function (optional) The acquisition function influences
how new hyperparameter sets are selected. The algorithm uses the model
learned by the learner indicated by hyperparameter_learner to search the
hyperparameter space for hyperparameter sets that are either likely better
than the best known set (exploitation) or where there is considerable
uncertainty (exploration). The acquisition function quantifies this
(Shahriari et al., 2016).
 The following acquisition functions are available, and are described in
more detail in the learner algorithms vignette:
 
 improvement_probability: The probability of improvement quantifies the
probability that the expected optimisation score for a set is better than
the best observed optimisation score.
 improvement_empirical_probability: Similar to improvement_probability, but based directly on optimisation scores
predicted by the individual decision trees.
 expected_improvement (default): Computes expected improvement.
 upper_confidence_bound: This acquisition function is based on the upper
confidence bound of the distribution (Srinivas et al., 2012).
 bayes_upper_confidence_bound: This acquisition function is based on the
upper confidence bound of the distribution (Kaufmann et al., 2012).
exploration_method (optional) Method used to steer exploration in
post-initialisation intensive searching steps. As stated earlier, each SMBO
iteration step compares suggested alternative parameter sets with an
incumbent best set in a series of steps. The exploration method
controls how the set of alternative parameter sets is pruned after each
step in an iteration. Can be one of the following:
 
 single_shot (default): The set of alternative parameter sets is not
pruned, and each intensification iteration contains only a single
intensification step that only uses a single bootstrap. This is the fastest
exploration method, but only superficially tests each parameter set.
 successive_halving: The set of alternative parameter sets is
pruned by removing the worst performing half of the sets after each step
(Jamieson and Talwalkar, 2016).
 stochastic_reject: The set of alternative parameter sets is pruned by
comparing the performance of each parameter set with that of the incumbent
best parameter set using a paired Wilcoxon test based on shared
bootstraps. Parameter sets that perform significantly worse, at an alpha
level indicated by smbo_stochastic_reject_p_value, are pruned.
 none: The set of alternative parameter sets is not pruned.
smbo_stochastic_reject_p_value (optional) The p-value threshold used
for the stochastic_reject exploration method.
 The default value is 0.05.
parallel_hyperparameter_optimisation (optional) Enable parallel
processing for hyperparameter optimisation. Defaults to TRUE. When set to FALSE, this will disable the use of parallel processing while performing
optimisation, regardless of the settings of the parallel parameter. The
parameter moreover specifies whether parallelisation takes place within the
optimisation algorithm (inner, default), or in an outer loop (outer)
over learners, data subsamples, etc. parallel_hyperparameter_optimisation is ignored if parallel=FALSE.
 | 
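The default for smbo_stop_tolerance described above can be reproduced with a few lines of base R. This is only a sketch: default_stop_tolerance is a hypothetical helper, not a familiar function, and treating the documented limit of 0.0001 as a floor on the tolerance is an assumption about how that limit is applied.

```r
# Hypothetical helper (not part of familiar): reproduces the documented
# default tolerance of 0.1 * 1 / sqrt(n_samples).
default_stop_tolerance <- function(n_samples) {
  stopifnot(is.numeric(n_samples), all(n_samples > 0))
  tolerance <- 0.1 / sqrt(n_samples)
  # Assumption: the documented limit of 0.0001 (reached at 1M samples) is
  # applied as a floor, so larger data sets do not shrink the tolerance further.
  pmax(tolerance, 1e-4)
}

# Tolerance for 100, 10 000, 1M and 100M samples.
default_stop_tolerance(c(100, 1e4, 1e6, 1e8))
```

The value for 100 samples matches the 0.01 quoted in the argument description.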
Details
This is a thin wrapper around summon_familiar, and functions like
it, but automatically skips all evaluation steps. Only a single learner is
allowed.
Value
One or more familiarModel objects.
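Several of the hyperparameter optimisation arguments above come with documented defaults and constraints. The sketch below collects those defaults in a plain list and checks the stated constraints before any call is made. check_smbo_settings is a hypothetical helper written for illustration; familiar performs its own argument checks, which may differ.

```r
# Documented defaults for a subset of the optimisation arguments.
smbo_settings <- list(
  optimisation_bootstraps = 50L,
  smbo_random_initialisation = "fixed_subsample",
  smbo_n_random_sets = 100L,
  max_smbo_iterations = 20L,
  smbo_stop_convergent_iterations = 3L,
  smbo_time_limit = NULL,          # NULL: no time limit
  smbo_initial_bootstraps = 1L,
  smbo_step_bootstraps = 3L,
  smbo_intensify_steps = 5L,
  exploration_method = "single_shot",
  smbo_stochastic_reject_p_value = 0.05
)

# Hypothetical helper: returns a character vector describing every violated
# constraint, or an empty vector if the settings are consistent.
check_smbo_settings <- function(settings) {
  problems <- character(0)
  if (settings$smbo_n_random_sets < 10) {
    problems <- c(problems, "smbo_n_random_sets cannot be smaller than 10")
  }
  if (settings$smbo_initial_bootstraps > settings$optimisation_bootstraps) {
    problems <- c(problems, "smbo_initial_bootstraps exceeds optimisation_bootstraps")
  }
  if (settings$smbo_step_bootstraps > settings$optimisation_bootstraps) {
    problems <- c(problems, "smbo_step_bootstraps exceeds optimisation_bootstraps")
  }
  time_limit <- settings$smbo_time_limit
  if (!is.null(time_limit) && time_limit < 1) {
    problems <- c(problems, "smbo_time_limit cannot be less than 1 minute")
  }
  problems
}

check_smbo_settings(smbo_settings)
```

A list that passes these checks could then be combined with the data and learner arguments and forwarded through do.call().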
Updates model directory path for ensemble objects.
Description
Updates the model directory path of a familiarEnsemble object.
Usage
update_model_dir_path(object, dir_path, ...)
## S4 method for signature 'familiarEnsemble'
update_model_dir_path(object, dir_path)
## S4 method for signature 'ANY'
update_model_dir_path(object, dir_path)
Arguments
| object | A familiarEnsemble object, or one or more familiarModel objects that will be internally converted to a familiarEnsemble object.
Paths to such objects can also be provided. | 
| dir_path | Path to the directory where models are stored. | 
| ... | Unused arguments. | 
Details
Ensemble models created by familiar are often written to a directory
on a local drive or network. In such cases, the actual models are detached,
and paths to the models are stored instead. When the models are moved from
their original location, they can no longer be found and attached to the
ensemble. This method allows for pointing to the new directory containing
the models.
Value
A familiarEnsemble object.
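A typical use is re-linking an ensemble after its model files have been moved. The following is a minimal sketch, assuming the ensemble was saved as an RDS file; relink_ensemble and both paths are hypothetical, and only the update_model_dir_path(object, dir_path) call is taken from the usage above.

```r
# Hypothetical wrapper: loads an ensemble and points it to a new model
# directory. Returns NULL if familiar is not installed or the file is missing.
relink_ensemble <- function(ensemble_file, new_model_dir) {
  if (!requireNamespace("familiar", quietly = TRUE) || !file.exists(ensemble_file)) {
    return(NULL)
  }
  # Assumption: the ensemble object was stored with saveRDS().
  ensemble <- readRDS(ensemble_file)
  familiar::update_model_dir_path(ensemble, dir_path = new_model_dir)
}

# Placeholder paths; replace with the actual locations.
ensemble <- relink_ensemble("path/to/ensemble.RDS", "path/to/moved/models")
```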
Update familiar S4 objects to the most recent version.
Description
Provides backward compatibility for familiar objects exported to
a file. This mitigates compatibility issues when working with files that
become outdated as new versions of familiar are released, e.g. because
slots have been removed.
Usage
update_object(object, ...)
## S4 method for signature 'familiarModel'
update_object(object, ...)
## S4 method for signature 'familiarEnsemble'
update_object(object, ...)
## S4 method for signature 'familiarData'
update_object(object, ...)
## S4 method for signature 'familiarCollection'
update_object(object, ...)
## S4 method for signature 'vimpTable'
update_object(object, ...)
## S4 method for signature 'familiarNoveltyDetector'
update_object(object, ...)
## S4 method for signature 'featureInfo'
update_object(object, ...)
## S4 method for signature 'featureInfoParametersTransformationPowerTransform'
update_object(object, ...)
## S4 method for signature 'experimentData'
update_object(object, ...)
## S4 method for signature 'list'
update_object(object, ...)
## S4 method for signature 'ANY'
update_object(object, ...)
Arguments
| object | A familiarModel, a familiarEnsemble, a familiarData or familiarCollection object. | 
| ... | Unused arguments. | 
Value
An up-to-date version of the respective S4 object.
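As a sketch of how this might be used on files written by an older familiar version: the helper below reads an object, updates it, and writes it back. refresh_familiar_file is a hypothetical name, and storage as an RDS file is an assumption; only update_object(object) comes from the usage above.

```r
# Hypothetical helper: updates a stored familiar object in place.
# Returns FALSE if familiar is not installed or the file does not exist.
refresh_familiar_file <- function(path) {
  if (!requireNamespace("familiar", quietly = TRUE) || !file.exists(path)) {
    return(FALSE)
  }
  # Assumption: the object was stored with saveRDS().
  object <- familiar::update_object(readRDS(path))
  saveRDS(object, path)
  TRUE
}
```

Keeping a backup of the original file before overwriting it is a sensible precaution, since the update is not reversible from the new file alone.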
Calculate variance-covariance matrix for a model
Description
Calculate variance-covariance matrix for a model
Usage
vcov(object, ...)
## S4 method for signature 'familiarModel'
vcov(object, ...)
Arguments
| object | a familiarModel object | 
| ... | additional arguments passed to vcov methods for the underlying
model, when available. | 
Details
This method extends the vcov S3 method. For some models vcov
requires information that is trimmed from the model. In this case a copy of
the variance-covariance matrix is stored with the model, and returned.
Value
Variance-covariance matrix of the model in the familiarModel object,
if any.
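To illustrate what the returned matrix contains, the sketch below uses stats::vcov on an ordinary linear model as a stand-in; a familiarModel is not used here. The diagonal of the variance-covariance matrix holds the squared standard errors of the coefficients.

```r
# Stand-in model: a linear regression on a built-in data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Variance-covariance matrix of the coefficient estimates.
vc <- vcov(fit)

# Standard errors are the square roots of the diagonal elements.
se <- sqrt(diag(vc))

# These agree with the standard errors reported by summary().
all.equal(se, coef(summary(fit))[, "Std. Error"])
```

According to the usage above, the familiarModel method is called in the same way, as vcov(object).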
Variable importance table
Description
A vimpTable object contains information concerning variable importance of one
or more features. These objects are created during feature selection.
Details
vimpTable objects exist in various states. These states are
generally incremental, i.e. one cannot turn a declustered table into the
initial version. Some methods such as aggregation internally do some state
reshuffling.
This object replaces the ad-hoc lists with information that were used in
versions prior to familiar 1.2.0.
Slots
- vimp_table
- Table containing features with corresponding scores. 
- vimp_method
- Method used to compute variable importance scores for each
feature. 
- run_table
- Run table for the data used to compute variable importances
from. Used internally. 
- score_aggregation
- Method used to aggregate the score of contrasts for
each categorical feature, if any.
- encoding_table
- Table used to relate categorical features to their
contrasts, if any. Not used for all variable importance methods. 
- cluster_table
- Table used to relate original features with features
after clustering. Variable importance is determined after feature
processing, which includes clustering. 
- invert
- Determines whether increasing score corresponds to increasing
(FALSE) or decreasing rank (TRUE). Used internally to determine how
ranks should be formed.
 
- project_id
- Identifier of the project that generated the vimpTable
object. 
- familiar_version
- Version of the familiar package used to create this
table. 
- state
- State of the variable importance table. The object can have the
following states:
-  initial: initial state, directly after the variable importance table is
filled.
 
-  decoded: depending on the variable importance method, the initial
variable importance table may contain the scores of individual contrasts
for categorical variables. When decoded, data in the encoding_table attribute has been used to aggregate scores from all contrasts into a
single score for each feature.
 
-  declustered: variable importance is determined from fully processed
features, which includes clustering. This means that a single feature in
the variable importance table may represent multiple original features.
When a variable importance table has been declustered, all clusters have
been turned into their constituent features.
 
-  reclustered: When the table is reclustered, features are replaced by
their respective clusters. This is actually used when updating the cluster
table to ensure it fits to a local context. This prevents issues when
attempting to aggregate or apply variable importance tables in data with
different feature preprocessing, and as a result, different clusters.
 
-  ranked: The scores have been used to create ranks, with lower ranks
indicating better features.
 
-  aggregated: Score and ranks from multiple variable importance tables
were aggregated.
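The step from the ranked to the aggregated state can be illustrated with a toy rank matrix. The sketch below is not familiar's implementation: aggregate_ranks is a hypothetical helper, the data are made up, and the borda variant uses one common formulation (each feature earns n_features - rank + 1 points per subset), which may differ from the one familiar uses.

```r
# Toy data: ranks of four features in three bootstraps (lower is better).
ranks <- matrix(
  c(1, 2, 3, 4,
    2, 1, 4, 3,
    1, 3, 2, 4),
  nrow = 4,
  dimnames = list(c("a", "b", "c", "d"), c("boot_1", "boot_2", "boot_3"))
)

# Hypothetical helper: aggregates per-subset ranks into a single rank.
aggregate_ranks <- function(ranks, method, threshold = 2) {
  score <- switch(
    method,
    mean = rowMeans(ranks),
    median = apply(ranks, 1, stats::median),
    best = apply(ranks, 1, min),
    worst = apply(ranks, 1, max),
    # Fraction of subsets in which the feature is among the top `threshold`;
    # negated so that frequent features receive low (good) ranks.
    stability = -rowMeans(ranks <= threshold),
    # Assumed Borda formulation: sum of (n_features - rank + 1) over subsets.
    borda = -rowSums(nrow(ranks) - ranks + 1),
    stop("unknown method: ", method)
  )
  rank(score, ties.method = "min")
}

aggregate_ranks(ranks, "mean")
aggregate_ranks(ranks, "best")
aggregate_ranks(ranks, "borda")
```

With the best method, features a and b tie for first place because each was ranked first in at least one bootstrap, which shows how the choice of method changes the aggregated ranking.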
 
 
See Also
get_vimp_table, aggregate_vimp_table
Create a waiver object
Description
This function is functionally identical to the ggplot2::waiver() function and
creates a waiver object. A waiver object is an otherwise empty object that
serves the same purpose as NULL, i.e. as placeholder for a default value.
Because NULL can sometimes be a valid input argument, it cannot always
be used to switch to an internal default value.
Usage
waiver()
Value
waiver object
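The waiver pattern can be sketched in base R as follows. The helper names make_waiver, is_waiver and resolve_limit are hypothetical and exist only for this illustration; the empty list with class "waiver" mirrors how ggplot2 builds its waiver object, and it is an assumption that familiar's object is built the same way.

```r
# An empty object whose only purpose is to carry the class "waiver".
make_waiver <- function() structure(list(), class = "waiver")
is_waiver <- function(x) inherits(x, "waiver")

# Hypothetical function in which NULL is a meaningful input ("no limit"),
# so a waiver is used to request the internal default instead.
resolve_limit <- function(limit = make_waiver()) {
  if (is_waiver(limit)) {
    return(10)  # internal default
  }
  limit  # NULL and user-supplied values are passed through unchanged
}

resolve_limit()      # internal default
resolve_limit(NULL)  # NULL is kept as a valid value
resolve_limit(3)     # user-supplied value
```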