Essential concepts and setup

Leonardo Ramirez-Lopez

2026-04-20

Think Globally, Fit Locally (Saul and Roweis, 2003)

1 Introduction

Spectroscopic data analysis plays a central role in many environmental, agricultural, and food-related applications. Techniques such as near-infrared (NIR), mid-infrared (MIR), and other forms of diffuse reflectance spectroscopy provide rapid, non-destructive, and cost-efficient measurements that can be used to infer chemical, physical, or biological properties of complex matrices, including soils, plant materials, and food products. In quantitative applications, these measurements are typically linked to reference laboratory values through empirical calibration models.

As spectral databases grow in size and diversity, their effective use becomes increasingly challenging. Large spectral libraries often contain substantial heterogeneity, domain shifts, redundant observations, and samples that are only locally informative for a given prediction problem. Under these conditions, global modelling strategies are often insufficient on their own, and methods based on dimensionality reduction, dissimilarity analysis, neighbour retrieval, local modelling, and targeted sample selection become essential.

The resemble package provides a framework for sample retrieval and local learning in spectral chemometrics. It is designed to support the analysis of large and complex spectral datasets through tools for projection-based representation, dissimilarity computation, neighbourhood search, memory-based learning, evolutionary subset search, and retrieval-based modelling with pre-computed local models. The package therefore supports both classical local modelling workflows and newer strategies for exploiting spectral libraries as structured resources for predictive modelling.

The functions presented here implement the methods described in Ramirez-Lopez et al. (2013), Ramirez-Lopez et al. (2026a), and Ramirez-Lopez et al. (2026b).

The main functionalities of resemble include:

2 Citing the package

To obtain the citation information for the package, run:

citation(package = "resemble")
To cite resemble in publications use:

  Ramirez-Lopez, L., and Stevens, A., and Orellano, C., (2026).
  resemble: Regression and similarity evaluation for memory-based
  learning in spectral chemometrics. R package Vignette R package
  version 3.0.0.

A BibTeX entry for LaTeX users is

  @Manual{resemble-package,
    title = {resemble: Sample Retrieval and Local Learning in Spectral Chemometrics.},
    author = {Leonardo Ramirez-Lopez and Antoine Stevens and Claudio Orellano},
    publication = {R package Vignette},
    year = {2026},
    note = {R package version 3.0.0},
    url = {https://CRAN.R-project.org/package=resemble},
  }

3 Dataset used across the vignettes

The vignettes in resemble use the soil near-infrared (NIR) spectral dataset provided in the prospectr package (Stevens and Ramirez-Lopez, 2024). This dataset is used because soils are among the most complex matrices analyzed by NIR spectroscopy. It was originally used in the Chimiométrie 2006 challenge (Pierna and Dardenne, 2008).

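The dataset contains NIR absorbance spectra for 825 dried and sieved soil samples collected from agricultural fields across the Walloon region of Belgium. In R, the data are stored in a data.frame whose columns include the spectra (spc, a matrix with the wavelengths as column names), the response variables (such as Ciso), and an indicator (train) that marks the observations assigned to the training set.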

Load the necessary packages:

library(resemble)
library(prospectr)

The dataset can be loaded into R as follows:

data(NIRsoil)
dim(NIRsoil)
str(NIRsoil)
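
The code in the following sections accesses the spectra as NIRsoil$spc and reads the wavelengths from its column names, which implies that the spectra are held as a matrix inside a single data.frame column. The minimal sketch below reproduces that layout with a toy object; the object toy and all of its values are hypothetical and are not taken from NIRsoil.

```r
# Toy stand-in for NIRsoil: all values below are hypothetical
toy <- data.frame(Ciso = c(1.2, 0.8, 2.1), train = c(1, 0, 1))

# A matrix can be stored as a single data.frame column;
# its column names hold the wavelengths as character strings
toy$spc <- matrix(
  c(0.31, 0.29, 0.35,
    0.33, 0.30, 0.36,
    0.34, 0.32, 0.38,
    0.36, 0.33, 0.40),
  nrow = 3,
  dimnames = list(NULL, c("1100", "1102", "1104", "1106"))
)

dim(toy)      # the whole matrix counts as one column of the data.frame
dim(toy$spc)  # observations x wavelengths

# Recover the wavelengths as numbers from the column names
toy_wavs <- as.numeric(colnames(toy$spc))
toy_wavs
```

Because the matrix counts as one column, dim() on the data.frame does not reveal the number of wavelengths; dim() on the spc column does.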

4 Spectral preprocessing

Throughout the vignettes, the same preprocessing workflow is used to improve the suitability of the spectra for quantitative analysis. In particular, the goal is to reduce unwanted baseline variation and enhance local spectral features that may be informative for modeling. The preprocessing steps are implemented using the prospectr package (Stevens and Ramirez-Lopez, 2024).

The following steps are applied:

  1. Detrending is applied first to reduce broad baseline shifts and curvature effects across the spectra.

  2. A first-order Savitzky–Golay derivative (Savitzky and Golay, 1964) is then computed to emphasize local spectral features and reduce remaining additive effects.

# numeric vector of the wavelengths at which the spectra were recorded
wavs <- as.numeric(colnames(NIRsoil$spc))

# Savitzky-Golay settings
diff_order <- 1  # m: first-order derivative
poly_order <- 1  # p: first-order polynomial
window <- 7      # w: window size (number of bands)

# preprocess the spectra: detrend first, then take the first derivative
NIRsoil$spc_pr <- savitzkyGolay(
  detrend(NIRsoil$spc, wav = wavs),
  m = diff_order, p = poly_order, w = window
)

Figure 1: Raw spectral absorbance data (top) and first derivative of the absorbance spectra (bottom).

Both the raw absorbance spectra and the preprocessed spectra are shown in Figure 1. The preprocessed spectra, obtained as the first derivative of detrended absorbance, are used as the predictor variables in all examples throughout this document.
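
To make the two preprocessing steps more concrete, the following base R sketch applies them to a single toy spectrum. It assumes that detrending amounts to standard normal variate scaling followed by taking the residuals of a second-order polynomial fitted against wavelength, and that the Savitzky–Golay first derivative with a first-order polynomial and a window of 7 is the least-squares slope over each window, expressed per band step. The toy spectrum and the helper sg_first_derivative() are hypothetical; this is an illustration of the idea, not the prospectr implementation.

```r
# Toy spectrum on a hypothetical 2 nm grid (values are made up)
toy_wav <- seq(1100, 1198, by = 2)
toy_spc <- 0.3 + 0.002 * (toy_wav - 1100) + 0.05 * sin(toy_wav / 15)

# Step 1, detrending (assumed form): standard normal variate scaling,
# then the residuals of a second-order polynomial fitted on wavelength
snv <- (toy_spc - mean(toy_spc)) / sd(toy_spc)
detrended <- unname(residuals(lm(snv ~ poly(toy_wav, 2))))

# Step 2, Savitzky-Golay first derivative with a first-order polynomial:
# the slope of a least-squares line over a sliding window of 7 points,
# which reduces to the fixed weights k / sum(k^2) for k = -3, ..., 3
half <- 3
k <- -half:half
sg_weights <- k / sum(k^2)

sg_first_derivative <- function(x) {
  centres <- (half + 1):(length(x) - half)
  vapply(centres, function(i) sum(sg_weights * x[i + k]), numeric(1))
}

toy_pr <- sg_first_derivative(detrended)

# The window cannot be centred on the first and last 3 points,
# so the output has 6 fewer values than the input
length(toy_spc) - length(toy_pr)
```

The same edge effect applies to windowed filters in general: with a window of 7, the preprocessed spectra cover fewer bands than the raw spectra, which is worth keeping in mind when matching spectra to wavelengths.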

For illustration purposes, the NIRsoil data are divided into training and test subsets. In the examples that require a response variable, Ciso is used to demonstrate the functionality of the package.

train_x <- NIRsoil$spc_pr[NIRsoil$train == 1, ]
train_y <- NIRsoil$Ciso[NIRsoil$train == 1]

test_x  <- NIRsoil$spc_pr[NIRsoil$train == 0, ]
test_y  <- NIRsoil$Ciso[NIRsoil$train == 0]
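
The split above relies on logical indexing with the train indicator, applied in the same way to the predictor matrix and to the response vector. The sketch below shows the same pattern on a toy object, together with two checks that can be useful before modelling: that predictors and response remain aligned, and how many reference values are missing. The object toy and its values are hypothetical; whether Ciso actually contains missing values should be checked on the real data.

```r
# Hypothetical stand-in for the relevant NIRsoil columns
toy <- data.frame(
  Ciso  = c(1.2, NA, 0.8, 2.1, 1.5),
  train = c(1, 1, 0, 1, 0)
)
toy$spc_pr <- matrix(seq_len(5 * 3) / 100, nrow = 5)

is_train <- toy$train == 1

# drop = FALSE keeps a matrix even if a single row is selected
toy_train_x <- toy$spc_pr[is_train, , drop = FALSE]
toy_train_y <- toy$Ciso[is_train]

# Rows of the predictors and entries of the response must stay aligned
stopifnot(nrow(toy_train_x) == length(toy_train_y))

# Count missing reference values before fitting any model
sum(is.na(toy_train_y))
```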

The notation used throughout the resemble package for arguments referring to training and test observations is as follows:

References

Pierna, J.A.F., Dardenne, P., 2008. Soil parameter quantification by NIRS as a chemometric challenge at "Chimiométrie 2006". Chemometrics and Intelligent Laboratory Systems 91, 94–98.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Demattê, J., Scholten, T., 2013. The spectrum-based learner: A new local approach for modeling soil vis–NIR spectra of complex datasets. Geoderma 195, 268–279.
Ramirez-Lopez, L., Metz, M., Lesnoff, M., Orellano, C., Perez-Fernandez, E., Plans, M., Breure, T., Behrens, T., Viscarra Rossel, R., Peng, Y., 2026a. Rethinking local spectral modelling: From per-query refitting to model libraries. Analytica Chimica Acta.
Ramirez-Lopez, L., Viscarra Rossel, R., Behrens, T., Orellano, C., Perez-Fernandez, E., Kooijman, L., Wadoux, A.M.J.-C., Breure, T., Summerauer, L., Safanelli, J.L., Plans, M., 2026b. When spectral libraries are too complex to search: Evolutionary subset selection for domain-adaptive calibration. Analytica Chimica Acta.
Saul, L., Roweis, S., 2003. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4, 119–155.
Savitzky, A., Golay, M., 1964. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry 36, 1627–1639.
Stevens, A., Ramirez-Lopez, L., 2024. An introduction to the prospectr package. R package vignette, R package version 0.2.7.