| Title: | Autoencoding Random Forests |
| Version: | 0.1.0 |
| Maintainer: | Binh Duc Vu <vuducbinh2210@gmail.com> |
| Description: | Autoencoding Random Forests ('RFAE') provide a method to autoencode mixed-type tabular data using Random Forests ('RF'), which involves projecting the data to a latent feature space of user-chosen dimensionality (usually a lower dimension), and then decoding the latent representations back into the input space. The encoding stage is useful for feature engineering and data visualisation tasks, akin to how principal component analysis ('PCA') is used, and the decoding stage is useful for compression and denoising tasks. At its core, 'RFAE' is a post-processing pipeline on a trained random forest model. This means that it can accept any trained RF of 'ranger' object type: 'RF', 'URF' or 'ARF'. Because of this, it inherits Random Forests' robust performance and capacity to seamlessly handle mixed-type tabular data. For more details, see Vu et al. (2025) <doi:10.48550/arXiv.2505.21441>. |
| License: | GPL (≥ 3) |
| URL: | https://github.com/bips-hb/RFAE |
| BugReports: | https://github.com/bips-hb/RFAE/issues |
| Depends: | R (≥ 4.4.0) |
| Imports: | caret, data.table, foreach, Matrix, methods, mgcv, ranger, RANN, RSpectra, stats, tibble |
| Suggests: | arf, ggplot2, knitr, rmarkdown, testthat (≥ 3.0.0) |
| VignetteBuilder: | knitr |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2026-01-12 03:28:56 UTC; k21189355 |
| Author: | Binh Duc Vu |
| Repository: | CRAN |
| Date/Publication: | 2026-01-17 11:20:07 UTC |
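A minimal end-to-end sketch of the pipeline on the iris data, using the exported functions documented below (and assuming the package is installed and attached as RFAE):
# Minimal sketch: encode iris into a 2-D latent space with a random forest,
# then decode back to the input space and measure reconstruction error.
library(RFAE)
set.seed(1)
rf <- ranger::ranger(Species ~ ., data = iris, num.trees = 50)
emap <- encode(rf, iris, k = 2)               # learn the spectral embedding
z <- predict(emap, rf, iris)                  # project data into the latent space
xhat <- decode_knn(rf, emap, z, k = 5)$x_hat  # map back to the input space
err <- reconstruction_error(xhat, iris)       # mixed-type reconstruction error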
Decode RF Embeddings
Description
Maps the low-dimensional KPCA embedding of a random forest back to the input space via iterative k-nearest neighbors.
Usage
decode_knn(rf, emap, z, x_tilde = NULL, k = 5, parallel = TRUE)
Arguments
| rf | Pre-trained random forest object of class ranger. |
| emap | Spectral embedding learned via encode. |
| z | Matrix of embedded data to map back to the input space. |
| x_tilde | Training data. If none is supplied, the RF is used to generate synthetic training data according to the eForest scheme. Default is NULL. |
| k | Number of nearest neighbors to evaluate. |
| parallel | Compute in parallel? Must register a backend beforehand, e.g. via doParallel. |
Details
decode_knn decodes the embedded data back to the original input space
using a k-nearest neighbors (kNN) approach (Cover & Hart, 1967). For a given
embedding vector, decoding first finds the k nearest embeddings within the
training set. The table x_tilde serves as a proxy for the training samples
associated with these embeddings: it can either be supplied directly or
generated from the RF using the eForest scheme (Feng & Zhou, 2018), which
avoids having to retain the training data. Finally, the data are
reconstructed by weighted averaging for numerical features and by taking the
most likely value for categorical features.
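The reconstruction step can be sketched as follows. This is an illustration only, with hypothetical inputs nn_idx (indices of the k nearest training embeddings) and nn_wts (their similarity weights); it is not the package's internal implementation.
# Hypothetical helper: reconstruct one sample from its k nearest neighbors in
# x_tilde, using weighted averaging for numeric columns and the weighted modal
# value for categorical columns.
reconstruct_one <- function(x_tilde, nn_idx, nn_wts) {
  nbrs <- x_tilde[nn_idx, , drop = FALSE]
  out <- lapply(names(nbrs), function(j) {
    col <- nbrs[[j]]
    if (is.numeric(col)) {
      sum(col * nn_wts) / sum(nn_wts)   # weighted average
    } else {
      lvls <- unique(as.character(col))
      scores <- vapply(lvls, function(l) sum(nn_wts[as.character(col) == l]), numeric(1))
      lvls[which.max(scores)]           # most likely value
    }
  })
  setNames(out, names(nbrs))
}
With equal weights (nn_wts = rep(1, k)), this reduces to a plain kNN mean/mode.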
Value
Decoded dataset.
References
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
Feng, J., & Zhou, Z.-H. (2018). Autoencoder by forest. In Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
Examples
# Set seed
set.seed(1)
# Split training and test
trn <- sample(1:nrow(iris), 100)
tst <- setdiff(1:nrow(iris), trn)
# Train RF, learn the encodings and project test points.
rf <- ranger::ranger(Species ~ ., data = iris[trn, ], num.trees=50)
emap <- encode(rf, iris[trn, ], k=2)
emb <- predict(emap, rf, iris[tst, ])
# Decode test samples back to the input space
out <- decode_knn(rf, emap, emb, k=5)$x_hat
Encoding with Diffusion Maps
Description
Computes the diffusion map of a random forest kernel, including a spectral decomposition and associated weights.
Usage
encode(rf, x, k = 5L, stepsize = 1L, parallel = TRUE)
Arguments
| rf | Pre-trained random forest object of class ranger. |
| x | Training data for estimating embedding weights. |
| k | Dimensionality of the spectral embedding. |
| stepsize | Number of steps of the random walk for the diffusion process. See Details. |
| parallel | Compute in parallel? Must register a backend beforehand, e.g. via doParallel. |
Details
encode learns a low-dimensional embedding of the data implied by the
adjacency matrix of the rf. Random forests can be understood as an
adaptive nearest neighbors algorithm, where proximity between samples is
determined by how often they are routed to the same leaves. We compute the
spectral decomposition of the model adjacencies over the training data
x, and take the leading k eigenvectors and eigenvalues. The
function returns the resulting diffusion map, eigenvectors, eigenvalues,
and leaf sizes.
Let K be the weighted adjacency matrix of x implied by
rf. This defines a weighted, undirected graph over the training data,
which we can also interpret as the transitions of a Markov process between
data points. Spectral analysis produces the decomposition K = V\lambda V^{-1},
from which we take the leading nonconstant eigenvectors. The diffusion map
Z = \sqrt{n} V \lambda^{t} (Coifman & Lafon, 2006) represents the
long-run connectivity structure of the graph after t time steps of the Markov
process, and has attractive optimization properties (von Luxburg, 2007). New
data can be embedded into this space using the Nyström formula (Bengio et al.,
2004).
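As an illustration of this construction (not the package's internal code), the sketch below builds a leaf co-occurrence adjacency matrix for a small ranger forest, row-normalizes it into a Markov transition matrix, and inspects the leading eigenvalues:
# Terminal node reached by every sample in every tree
rf <- ranger::ranger(Species ~ ., data = iris, num.trees = 50)
leafIDs <- predict(rf, iris, type = "terminalNodes")$predictions
# K[i, j]: proportion of trees routing samples i and j to the same leaf
n <- nrow(leafIDs)
K <- matrix(0, n, n)
for (b in seq_len(ncol(leafIDs))) {
  K <- K + outer(leafIDs[, b], leafIDs[, b], "==")
}
K <- K / ncol(leafIDs)
# Row-normalize into a transition matrix and take the leading eigenpairs
P <- K / rowSums(K)
eig <- eigen(P)
head(Re(eig$values))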
Value
A list with eight elements: (1) Z: a k-dimensional nonlinear
embedding of x implied by rf; (2) A: the normalized
adjacency matrix; (3) v: the leading k eigenvectors;
(4) lambda: the leading k eigenvalues; (5) stepsize: the
number of steps in the random walk; (6) leafIDs: a matrix with
nrow(x) rows and rf$num.trees columns, giving the
terminal node of each training sample in each tree; (7) the number of
samples in each leaf; and (8) metadata about the rf.
References
Bengio, Y., Delalleau, O., Le Roux, N., Paiement, J., Vincent, P., & Ouimet, M. (2004). Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10): 2197-2219.
Coifman, R. R., & Lafon, S. (2006). Diffusion maps. Applied and Computational Harmonic Analysis, 21(1), 5–30.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416.
Examples
# Train ARF
arf <- arf::adversarial_rf(iris)
# Embed the data
emap <- encode(arf, iris)
Post-process data
Description
This function prepares output data.
Usage
post_x(x, meta, round = TRUE)
Arguments
| x | Input data.frame. |
| meta | Metadata. |
| round | Round continuous variables to their respective maximum precision in the real data set? |
Value
A data.frame which follows the structure and ordering of the input dataset.
Predict Spectral Embeddings
Description
Projects test data into the forest embedding space using a pre-trained encoding map.
Usage
## S3 method for class 'encode'
predict(object, rf, x, parallel = TRUE, ...)
Arguments
| object | Spectral embedding for the rf, learned via encode. |
| rf | Pre-trained random forest object of class ranger. |
| x | Data to be embedded. |
| parallel | Compute in parallel? Must register a backend beforehand, e.g. via doParallel. |
| ... | Additional arguments passed to methods. |
Details
This function uses the weights learned via encode to project new
data into the low-dimensional embedding space using the Nyström formula.
For details, see Bengio et al. (2004).
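A minimal sketch of the Nyström extension idea, with hypothetical inputs: k_new is the vector of forest affinities between a new point and the n training points, and v and lambda are the leading eigenvectors and eigenvalues learned on the training kernel (scaling constants from the diffusion-map normalization are omitted for brevity):
# Hypothetical helper: one embedding coordinate per retained eigenpair,
# z_i = <k_new, v_i> / lambda_i.
nystrom_embed <- function(k_new, v, lambda) {
  as.numeric(crossprod(v, k_new)) / lambda
}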
Value
A matrix of embeddings with nrow(x) rows and k columns, where k is the
dimensionality passed to encode when learning the embedding.
References
Bengio, Y., Delalleau, O., Le Roux, N., Paiement, J., Vincent, P., & Ouimet, M. (2004). Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10): 2197-2219.
Examples
# Set seed
set.seed(1)
# Split training and test
trn <- sample(1:nrow(iris), 100)
tst <- setdiff(1:nrow(iris), trn)
# Train RF. You can also use RF variants, such as Adversarial Random
# Forests (ARFs).
rf <- ranger::ranger(Species ~ ., data = iris[trn, ], num.trees=50)
# Learn the encodings, which are found using diffusion maps.
emap <- encode(rf, iris[trn, ], k=2)
# Embed test points
emb <- predict(emap, rf, iris[tst, ])
Preprocess input data
Description
This function prepares input data.
Usage
prep_x(x, to_numeric = NULL, to_factor = NULL, default = 5)
Arguments
| x | Input data.frame. |
| to_numeric | List of variables to force as numeric. |
| to_factor | List of variables to force as factor. |
| default | Threshold for classifying a variable as numeric (more than default unique values) or factor (at most default unique values); see the sketch below. |
Value
Preprocessed data.frame.
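A tiny illustration of the unique-value heuristic described for the default argument (a hypothetical helper, not the package function):
# Hypothetical helper: a column is treated as numeric if it has more than
# `default` unique values, otherwise as a factor.
classify_column <- function(col, default = 5) {
  if (length(unique(col)) > default) "numeric" else "factor"
}
classify_column(iris$Sepal.Length)  # "numeric" (35 unique values)
classify_column(iris$Species)       # "factor" (3 unique values)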
Mixed-type Reconstruction Error
Description
Computes the reconstruction error of a decoded dataset compared to the original.
Usage
reconstruction_error(Xhat, X)
Arguments
| Xhat | Reconstructed dataset. |
| X | Ground truth dataset. |
Details
In standard AEs, reconstruction error is generally estimated via L_2
loss. This is not sensible with a mix of continuous and categorical data, so
we devise a measure that evaluates distortion on continuous variables as
1 - R^2, and on categorical variables as prediction error.
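A minimal sketch of this measure, assuming Xhat and X are aligned data.frames with matching column types (not necessarily the package's exact implementation):
# Illustrative sketch: 1 - R^2 for numeric columns (residual sum of squares
# over total sum of squares), misclassification rate for categorical columns.
mixed_error <- function(Xhat, X) {
  sapply(names(X), function(j) {
    if (is.numeric(X[[j]])) {
      sum((X[[j]] - Xhat[[j]])^2) / sum((X[[j]] - mean(X[[j]]))^2)
    } else {
      mean(as.character(Xhat[[j]]) != as.character(X[[j]]))
    }
  })
}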
Value
A list containing the column-wise reconstruction error, as well as the average reconstruction error for categorical and numeric variables. Values lie between 0 and 1, where 0 represents perfect reconstruction and 1 represents no reconstruction.
Examples
# Set seed
set.seed(1)
# Split training and test
trn <- sample(1:nrow(iris), 100)
tst <- setdiff(1:nrow(iris), trn)
# Train RF, learn the encodings and project test points.
rf <- ranger::ranger(Species ~ ., data = iris[trn, ], num.trees=50)
emap <- encode(rf, iris[trn, ], k=2)
emb <- predict(emap, rf, iris[tst, ])
# Decode test samples back to the input space
out <- decode_knn(rf, emap, emb, k=5)$x_hat
# Compute the reconstruction error
error <- reconstruction_error(out, iris[tst, ])