A central feature of bigPLScox is the ability to
operate on file-backed bigmemory::big.matrix objects. This
vignette demonstrates how to prepare such datasets, fit models with
big_pls_cox() and big_pls_cox_gd(), and
integrate them with cross-validation helpers. The examples complement
the introductory vignette “Getting started with bigPLScox”.
We simulate a moderately large design matrix and persist it to disk
via bigmemory::filebacked.big.matrix(). Using file-backed
storage allows models to train on datasets that exceed the available
RAM.
library(bigPLScox)
library(bigmemory)
set.seed(2024)
n_obs <- 5000
n_pred <- 100
X_dense <- matrix(rnorm(n_obs * n_pred), nrow = n_obs)
time <- rexp(n_obs, rate = 0.2)
status <- rbinom(n_obs, 1, 0.7)
big_dir <- tempfile("bigPLScox-")
dir.create(big_dir)
X_big <- filebacked.big.matrix(
nrow = n_obs,
ncol = n_pred,
backingpath = big_dir,
backingfile = "X.bin",
descriptorfile = "X.desc",
init = X_dense
)
The resulting big.matrix can be reopened in future sessions via its descriptor file. All big-memory modelling functions accept either an in-memory matrix or a big.matrix reference.
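As a minimal sketch, reattaching the matrix in a later session only needs the descriptor file (this assumes `big_dir` from the chunk above still contains `X.bin` and `X.desc`):

```r
library(bigmemory)

## Reopen the file-backed matrix using only its descriptor file,
## which points at the binary backing file written alongside it.
X_big <- attach.big.matrix(file.path(big_dir, "X.desc"))
dim(X_big)
```

Because only a memory map is created, reattaching is essentially instantaneous regardless of the matrix size.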
big_pls_cox() runs the classical PLS-Cox algorithm while
streaming data from disk.
fit_big <- big_pls_cox(
X = X_big,
time = time,
status = status,
ncomp = 5
)
head(fit_big$scores)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [2,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [3,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [4,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [5,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [6,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
str(fit_big)
#> List of 9
#> $ scores : num [1:5000, 1:5] -4.45e-06 -4.45e-06 -4.45e-06 -4.45e-06 -4.45e-06 ...
#> $ loadings: num [1:100, 1:5] -0.1 -0.1 -0.1 -0.1 -0.1 ...
#> $ weights : num [1:100, 1:5] -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 ...
#> $ center : num [1:100] 0.982 0.982 0.982 0.982 0.982 ...
#> $ scale : num [1:100] 1.65e-07 1.65e-07 1.65e-07 1.65e-07 1.65e-07 ...
#> $ cox_fit :List of 10
#> ..$ coefficients : num [1:5] NA NA 1.82e+41 NA NA
#> ..$ var : num [1:5, 1:5] 0 0 0 0 0 0 0 0 0 0 ...
#> ..$ loglik : num [1:2] -26230 -26230
#> ..$ score : num 8.72e-20
#> ..$ iter : int 1
#> ..$ linear.predictors: num [1:5000] 0.000392 0.000392 0.000392 0.000392 0.000392 ...
#> ..$ residuals : Named num [1:5000] 0.8605 0.7913 -0.0129 0.1737 0.9958 ...
#> .. ..- attr(*, "names")= chr [1:5000] "1" "2" "3" "4" ...
#> ..$ means : num [1:5] -4.45e-06 -1.67e-19 -1.67e-32 -1.71e-46 -1.15e-59
#> ..$ method : chr "efron"
#> ..$ class : chr "coxph"
#> $ keepX : int [1:5] 0 0 0 0 0
#> $ time : num [1:5000] 1.024 1.5182 0.1047 5.9911 0.0424 ...
#> $ status : num [1:5000] 1 1 0 1 1 1 1 1 0 1 ...
#> - attr(*, "class")= chr "big_pls_cox"
The gradient-descent variant big_pls_cox_gd() uses stochastic optimisation and is well-suited for very large datasets.
fit_big_gd <- big_pls_cox_gd(
X = X_big,
time = time,
status = status,
ncomp = 5,
max_iter = 100,
tol = 1e-4
)
head(fit_big_gd$scores)
str(fit_big_gd)
Both functions return objects that expose the latent scores and loading vectors, allowing downstream visualisations and diagnostics identical to their in-memory counterparts.
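For instance, the first two latent components of the classical fit can be inspected with base graphics, using the objects created above:

```r
## Score plot of the first two PLS components, coloured by event status.
plot(fit_big$scores[, 1], fit_big$scores[, 2],
     col = status + 1, pch = 20,
     xlab = "Component 1", ylab = "Component 2")
legend("topright", legend = c("censored", "event"), col = 1:2, pch = 20)
```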
Cross-validation for big-memory models is supported through the list interface. This enables streaming each fold directly from disk.
set.seed(2024)
data_big <- list(x = X_big, time = time, status = status)
cv_big <- cv.coxgpls(
data_big,
nt = 5,
ncores = 1,
ind.block.x = c(10, 40)
)
cv_big$opt_nt
For large experiments, consider combining foreach::foreach() with doParallel::registerDoParallel() to parallelise folds.
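A hedged sketch of such a setup, reusing data_big from above (the component grid and cluster size are illustrative choices, not package defaults):

```r
library(foreach)
library(doParallel)

## Register a two-worker backend; %dopar% loops then run concurrently.
cl <- makeCluster(2)
registerDoParallel(cl)

## Illustrative: cross-validate several component counts in parallel.
cv_grid <- foreach(nt = 2:4, .packages = "bigPLScox") %dopar% {
  cv.coxgpls(data_big, nt = nt, ncores = 1, ind.block.x = c(10, 40))
}

stopCluster(cl)
```

Note that a big.matrix is an external pointer, so in practice each worker should reattach the matrix from its descriptor file rather than receive the pointer directly.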
The native C++ solvers substantially reduce wall-clock time compared
to fitting through the R interface alone. The bench package
provides convenient instrumentation; the chunk below only runs when it
is available.
if (requireNamespace("bench", quietly = TRUE)) {
bench::mark(
streaming = big_pls_cox(X_big, time, status, ncomp = 5, keepX = 0),
gd = big_pls_cox_gd(X_big, time, status, ncomp = 5, max_iter = 150),
iterations = 5,
check = FALSE
)
}
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 streaming 30.4ms 30.5ms 32.7 8.81MB 8.19
#> 2 gd 13ms 13.2ms 75.2 6.79MB 18.8
Once a model has been fitted, we can evaluate deviance residuals using the new C++ backend. Supplying the linear predictor avoids recomputing it in R and works with any matrix backend.
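The exact residual helper is documented in the package index; as a hypothetical sketch (the function name big_pls_cox_residuals() is assumed here for illustration, not confirmed), passing the stored linear predictor would look like:

```r
## The stored linear predictor from the classical fit (see str() above).
eta <- fit_big$cox_fit$linear.predictors

## Hypothetical helper name -- check help(package = "bigPLScox") for the
## actual residual interface exposed by the C++ backend.
dev_res <- big_pls_cox_residuals(time, status,
                                 linear_predictor = eta,
                                 type = "deviance")
summary(dev_res)
```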
Temporary backing files can be removed after the analysis. In production pipelines you will typically keep the descriptor file alongside the binary data.
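In this vignette the cleanup amounts to removing the temporary directory holding both X.bin and X.desc:

```r
## Drop the R reference and delete the temporary backing directory.
rm(X_big)
gc()  # release the memory-mapped file before unlinking it
unlink(big_dir, recursive = TRUE)
```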
help(big_pls_cox) and help(big_pls_cox_gd) document all tuning parameters for the big-memory solvers. Fitted models can be cached with saveRDS() to avoid recomputing large models when iterating on analyses.
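A small sketch of caching a fit; note that the big.matrix input itself is not serialised with the model, so keep its descriptor file alongside the RDS:

```r
## Cache the fitted model; scores and loadings are ordinary matrices,
## so the object round-trips through RDS without the backing file.
saveRDS(fit_big, "fit_big.rds")
fit_big_cached <- readRDS("fit_big.rds")
```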