Neural estimators are neural networks that transform data into parameter point estimates. They are likelihood-free and, once constructed with simulated data, substantially faster than classical methods. They also approximate Bayes estimators and are therefore often referred to as neural Bayes estimators.
The package NeuralEstimators facilitates the development of neural Bayes estimators in a user-friendly manner. It caters for arbitrary statistical models by having the user implicitly define their model via simulated data; this makes the development of neural Bayes estimators particularly straightforward for models with existing implementations in R or other programming languages. This vignette describes the R interface to the Julia version of the package, whose documentation is available here.
Here, we provide an overview of point estimation using neural Bayes estimators. For a more detailed discussion of the framework and its implementation, see Sainsbury-Dale, Zammit-Mangion, and Huser (2024).
The goal of parametric point estimation is to estimate a \(p\)-dimensional parameter \(\boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^p\) from data \(\boldsymbol{Z} \in \mathbb{R}^n\) using an estimator, \(\hat{\boldsymbol{\theta}} : \mathbb{R}^n\to\Theta\). A ubiquitous decision-theoretic approach to the construction of estimators is based on average-risk optimality. Consider a nonnegative loss function \(L: \Theta \times \Theta \to \mathbb{R}^{\geq 0}\). An estimator’s Bayes risk is its loss averaged over all possible data realisations and parameter values, \[ \int_\Theta \int_{\mathbb{R}^n} L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(\boldsymbol{z}))f(\boldsymbol{z} \mid \boldsymbol{\theta}) \rm{d} \boldsymbol{z} \rm{d} \Pi(\boldsymbol{\theta}), \] where \(\Pi(\cdot)\) is a prior measure for \(\boldsymbol{\theta}\). Any minimiser of the Bayes risk is said to be a Bayes estimator with respect to \(L(\cdot, \cdot)\) and \(\Pi(\cdot)\).
Bayes estimators are theoretically attractive. For example, unique Bayes estimators are admissible and, under suitable regularity conditions, they are consistent and asymptotically efficient. They are also highly interpretable: Bayes estimators are functionals of the posterior distribution, and the specific functional is determined by the choice of loss function. For instance, under quadratic loss, the Bayes estimator is the posterior mean.
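To see why, note that for the quadratic loss \(L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = \|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|^2\), differentiating the posterior expected loss with respect to \(\hat{\boldsymbol{\theta}}\) and setting the result to zero gives \[ 2\int_\Theta (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \rm{d}\Pi(\boldsymbol{\theta} \mid \boldsymbol{z}) = \boldsymbol{0} \iff \hat{\boldsymbol{\theta}} = \mathbb{E}(\boldsymbol{\theta} \mid \boldsymbol{z}), \] where \(\Pi(\cdot \mid \boldsymbol{z})\) denotes the posterior distribution of \(\boldsymbol{\theta}\) given data \(\boldsymbol{z}\).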
Despite their attractive properties, Bayes estimators are typically unavailable in closed form. A way forward is to assume a flexible parametric model for the estimator, and to optimise the parameters within that model in order to approximate the Bayes estimator. Neural networks are ideal candidates, since they are universal function approximators and extremely fast to evaluate.
Let \(\hat{\boldsymbol{\theta}}(\cdot;
\boldsymbol{\gamma})\) denote a neural point estimator, where
\(\boldsymbol{\gamma}\) contains the
neural-network parameters. Bayes estimators may be approximated with
\(\hat{\boldsymbol{\theta}}(\cdot;
\boldsymbol{\gamma}^*)\), where \(\boldsymbol{\gamma}^*\) solves the
optimisation task,
\[
\boldsymbol{\gamma}^*
\equiv
\underset{\boldsymbol{\gamma}}{\mathrm{arg\,min}} \; \frac{1}{K}
\sum_{k=1}^K L(\boldsymbol{\theta}^{(k)},
\hat{\boldsymbol{\theta}}(\boldsymbol{Z}^{(k)}; \boldsymbol{\gamma})),
\] whose objective function is a Monte Carlo approximation of the
Bayes risk made using a set \(\{\boldsymbol{\theta}^{(k)} : k = 1, \dots,
K\}\) of parameter vectors sampled from the prior and, for each
\(k\), simulated data \(\boldsymbol{Z}^{(k)} \sim f(\boldsymbol{z}
\mid \boldsymbol{\theta}^{(k)})\). Note that this procedure does
not involve evaluation, or knowledge, of the likelihood function, and
that the optimisation task can be performed straightforwardly using
back-propagation and stochastic gradient descent.
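To make this objective concrete, the following package-free R sketch computes the Monte Carlo approximation of the Bayes risk for a fixed (non-neural) estimator, the sample mean of \(m = 10\) replicates of a \(\rm{Gau}(\theta, 1)\) model with \(\theta \sim \rm{Gau}(0, 1)\), under the absolute-error loss:
K <- 1000
theta <- rnorm(K)                                # theta^(k) sampled from the prior
Z <- lapply(theta, function(t) rnorm(10, t, 1))  # Z^(k) ~ f(z | theta^(k))
theta_hat <- sapply(Z, mean)                     # estimates for each data set
mean(abs(theta_hat - theta))                     # Monte Carlo Bayes risk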
Parameter estimation from replicated data is commonly required in statistical applications. A parsimonious architecture for such estimators is based on the so-called DeepSets representation (Zaheer et al., 2017). Suppressing dependence on neural-network parameters \(\boldsymbol{\gamma}\) for notational convenience, a neural Bayes estimator couched in the DeepSets framework has the form,
\[ \hat{\boldsymbol{\theta}}(\boldsymbol{Z}_1, \dots, \boldsymbol{Z}_m) = \boldsymbol{\phi}\left(\boldsymbol{a}(\{\boldsymbol{\psi}(\boldsymbol{Z}_i) : i = 1, \dots, m\})\right), \] where \(\{\boldsymbol{Z}_i : i = 1, \dots, m\}\) are mutually conditionally independent replicates from the statistical model, \(\boldsymbol{\psi}(\cdot)\) and \(\boldsymbol{\phi}(\cdot)\) are neural networks referred to as the summary and inference networks, respectively, and \(\boldsymbol{a}(\cdot)\) is a permutation-invariant aggregation function (typically the element-wise mean). The architecture of \(\boldsymbol{\psi}(\cdot)\) depends on the structure of each \(\boldsymbol{Z}_i\) with, for example, a convolutional neural network (CNN) used for gridded data and a fully-connected multilayer perceptron (MLP) used for unstructured multivariate data. The architecture of \(\boldsymbol{\phi}(\cdot)\) is always an MLP.
The DeepSets representation has several motivations. First, Bayes estimators are invariant to permutations of independent replicates, satisfying \(\hat{\boldsymbol{\theta}}(\boldsymbol{Z}_1,\dots,\boldsymbol{Z}_m) = \hat{\boldsymbol{\theta}}(\boldsymbol{Z}_{\pi(1)},\dots,\boldsymbol{Z}_{\pi(m)})\) for any permutation \(\pi(\cdot)\); estimators constructed in the DeepSets representation are guaranteed to exhibit this property. Second, the DeepSets architecture is a universal approximator for continuously differentiable permutation-invariant functions and, therefore, any Bayes estimator that is a continuously differentiable function of the data can be approximated arbitrarily well by a neural Bayes estimator in the DeepSets form. Third, estimators constructed using DeepSets may be used with an arbitrary number \(m\) of conditionally independent replicates, therefore amortising the cost of training with respect to this choice. See Sainsbury-Dale, Zammit-Mangion, and Huser (2024) for further details on the use of DeepSets in the context of neural Bayes estimation, and for a discussion on the architecture’s connection to conventional estimators.
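As a toy package-free illustration of the first property, with simple fixed transformations standing in for the summary and inference networks, a DeepSets-style estimator returns the same output under any reordering of the replicates:
deepset <- function(Z) {
  psi <- function(z) c(z, z^2)                             # stand-in summary network
  s <- rowMeans(sapply(Z, psi))                            # permutation-invariant aggregation
  phi <- function(s) c(s[1], sqrt(max(s[2] - s[1]^2, 0)))  # stand-in inference network
  phi(s)
}
Z <- as.list(rnorm(5))                                     # five univariate replicates
all.equal(deepset(Z), deepset(rev(Z)))                     # TRUE: invariant to ordering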
Uncertainty quantification with neural Bayes estimators often proceeds through the bootstrap distribution (e.g., Lenzi et al., 2023; Richards et al., 2023; Sainsbury-Dale et al., 2024). Bootstrap-based approaches are particularly attractive when nonparametric bootstrap is possible (e.g., when the data are independent replicates), or when simulation from the fitted model is fast, in which case parametric bootstrap is also computationally efficient. However, these conditions are not always met. For example, when making inference from a single spatial field, nonparametric bootstrap is not possible without breaking the spatial dependence structure, and the cost of simulation from the fitted model is often non-negligible (e.g., exact simulation from a Gaussian process model requires the factorisation of an \(n \times n\) matrix, where \(n\) is the number of spatial locations, a task that is \(O(n^3)\) in computational complexity). Further, although bootstrap-based methods for uncertainty quantification are often considered fairly accurate and preferable to methods based on asymptotic normality, there are situations where bootstrap procedures are unreliable (see, e.g., Canty et al., 2006, p. 6).
Alternatively, by leveraging ideas from (Bayesian) quantile regression, one may construct a neural Bayes estimator that approximates a set of marginal posterior quantiles (Fisher et al., 2023; Sainsbury-Dale et al., 2023, Sec. 2.2.4), which can then be used to construct univariate credible intervals for each parameter. Inference then remains fully amortised since, once the estimators are trained, both point estimates and credible intervals can be obtained with virtually zero computational cost.
Posterior quantiles can be targeted by employing the quantile loss function which, for a single parameter \(\theta\), is \[ L_\tau(\theta, \hat{\theta}) = (\hat{\theta} - \theta)(\mathbb{I}(\hat{\theta} > \theta) - \tau), \quad \tau \in (0, 1), \] where \(\mathbb{I}(\cdot)\) denotes the indicator function. In particular, the Bayes estimator under the above loss function is the posterior \(\tau\)-quantile. When there are \(p > 1\) parameters, \(\boldsymbol{\theta} = (\theta_1, \dots, \theta_p)'\), the Bayes estimator under the joint loss \({L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = \sum_{k=1}^p L_\tau(\theta_k, \hat{\theta}_k)}\) is the vector of marginal posterior quantiles since, in general, a Bayes estimator under a sum of univariate loss functions is given by the vector of marginal Bayes estimators (Sainsbury-Dale et al., 2023, Thm. 2).
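As a quick package-free check of this fact, the expected quantile loss over draws of \(\theta\) is minimised near the \(\tau\)-quantile:
quantile_loss <- function(theta, theta_hat, tau) {
  (theta_hat - theta) * ((theta_hat > theta) - tau)
}
theta <- rnorm(100000)        # stand-in for draws from a posterior
candidates <- seq(-3, 3, by = 0.01)
risks <- sapply(candidates, function(t) mean(quantile_loss(theta, t, tau = 0.9)))
candidates[which.min(risks)]  # close to qnorm(0.9) = 1.28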
Neural Bayes estimators are conceptually simple and can be used in a wide range of problems where other approaches, such as maximum-likelihood estimation, are computationally infeasible. The estimator also has marked practical appeal, as the general workflow for its construction is only loosely connected to the statistical model being considered. The workflow is as follows:
1. Define the prior, \(\Pi(\cdot)\).
2. Choose a loss function, \(L(\cdot, \cdot)\), typically the absolute-error or squared-error loss.
3. Design a suitable neural-network architecture for the neural point estimator \(\hat{\boldsymbol{\theta}}(\cdot; \boldsymbol{\gamma})\).
4. Sample parameters from the prior to form training and validation parameter sets.
5. Given these parameter sets, simulate data from the model to form training and validation data sets.
6. Train the neural network (i.e., estimate \(\boldsymbol{\gamma}\)) by minimising the loss averaged over the training sets, monitoring performance on the validation sets.
7. Assess the trained neural Bayes estimator with simulation-based methods before applying it to observed data.
The package NeuralEstimators is designed to implement this workflow in a user-friendly manner, as will be illustrated in the following examples.
We now consider several examples, where the data are either univariate, multivariate but unstructured, or multivariate and collected over a regular grid. As we will show, it is the structure of the data that dictates the class of neural-network architecture that one must employ, and these examples therefore serve both to illustrate the functionality of the package and to provide guidelines on the architecture to use for a given application.
Before proceeding, we load the required packages (see here for instructions on installing Julia and the Julia packages NeuralEstimators and Flux):
library("NeuralEstimators")
library("JuliaConnectoR")
library("ggplot2")
library("ggpubr")
juliaEval('using NeuralEstimators, Flux')
#> Starting Julia ...
We first develop a neural Bayes estimator for \(\boldsymbol{\theta} \equiv (\mu, \sigma)'\) from data \(Z_1, \dots, Z_m\) that are independent and identically distributed according to a \(\rm{Gau}(\mu, \sigma^2)\) distribution.
First, we sample parameters from the prior \(\Pi(\cdot)\) to construct parameter sets used for training and validating the estimator. Here, we use the priors \(\mu \sim \rm{Gau}(0, 1)\) and \(\sigma \sim \rm{Gamma}(1, 1)\), and we assume that the parameters are independent a priori. In NeuralEstimators, the sampled parameters are stored as \(p \times K\) matrices, with \(p\) the number of parameters in the model and \(K\) the number of sampled parameter vectors.
prior <- function(K) {
  mu    <- rnorm(K)      # mu ~ Gau(0, 1)
  sigma <- rgamma(K, 1)  # sigma ~ Gamma(1, 1)
  Theta <- matrix(c(mu, sigma), byrow = TRUE, ncol = K)  # p x K matrix
  return(Theta)
}
K <- 10000
theta_train <- prior(K)
theta_val <- prior(K/10)
Next, we implicitly define the statistical model with simulated data. In NeuralEstimators, the simulated data are stored as a list, where each element of the list corresponds to a data set simulated conditional on one parameter vector. Since each replicate of our model is univariate, we take the summary network of the DeepSets framework to be an MLP with a single input neuron, and each simulated data set is stored as a matrix with the independent replicates in the columns (i.e., a \(1 \times m\) matrix).
simulate <- function(Theta, m) {
  # For each parameter vector (column of Theta), simulate a 1 x m matrix
  # of independent replicates from Gau(mu, sigma^2)
  apply(Theta, 2, function(theta) t(rnorm(m, theta[1], theta[2])), simplify = FALSE)
}
m <- 15
Z_train <- simulate(theta_train, m)
Z_val <- simulate(theta_val, m)
We now design architectures for the summary and inference networks of the DeepSets framework, and initialise our neural point estimator. This can be done using the R helper function initialise_estimator(), or using Julia code directly, as follows:
estimator <- juliaEval('
  d = 1  # dimension of each replicate (here, univariate data)
  p = 2  # number of parameters in the statistical model
  w = 32 # number of neurons in each hidden layer

  # Summary network psi and inference network phi of the DeepSets framework
  psi = Chain(Dense(d, w, relu), Dense(w, w, relu), Dense(w, w, relu))
  phi = Chain(Dense(w, w, relu), Dense(w, w, relu), Dense(w, p))
  deepset = DeepSet(psi, phi)
  estimator = PointEstimator(deepset)
')
A note on long-term stability: if constructing a neural Bayes estimator for repeated and long-term use, it is recommended to define the architecture using Julia code directly. This is because initialise_estimator() is subject to change, as methods for designing neural-network architectures improve over time and these improved methods are incorporated into the package.
Once the estimator is initialised, it is fitted using train(), here using the default mean-absolute-error loss. We use the sets of parameters and simulated data constructed earlier; one may alternatively use the arguments sampler and simulator to pass functions for sampling from the prior and simulating from the model, respectively. These functions can be defined in R (as we have done in this example) or in Julia (via the package JuliaConnectoR), and this approach facilitates the technique known as “on-the-fly” simulation.
estimator <- train(
estimator,
theta_train = theta_train,
theta_val = theta_val,
Z_train = Z_train,
Z_val = Z_val,
epochs = 50
)
#> Computing the initial validation risk... Initial validation risk = 0.8859203
#> Computing the initial training risk... Initial training risk = 0.8599103
#> Epoch: 1 Training risk: 0.583 Validation risk: 0.354 Run time of epoch: 18.138 seconds
#> Epoch: 2 Training risk: 0.306 Validation risk: 0.289 Run time of epoch: 0.438 seconds
#> Epoch: 3 Training risk: 0.264 Validation risk: 0.259 Run time of epoch: 0.41 seconds
#> Epoch: 4 Training risk: 0.243 Validation risk: 0.243 Run time of epoch: 0.406 seconds
#> Epoch: 5 Training risk: 0.233 Validation risk: 0.234 Run time of epoch: 0.406 seconds
#> Epoch: 6 Training risk: 0.227 Validation risk: 0.228 Run time of epoch: 0.407 seconds
#> Epoch: 7 Training risk: 0.222 Validation risk: 0.225 Run time of epoch: 0.401 seconds
#> Epoch: 8 Training risk: 0.219 Validation risk: 0.22 Run time of epoch: 0.401 seconds
#> Epoch: 9 Training risk: 0.215 Validation risk: 0.218 Run time of epoch: 0.405 seconds
#> Epoch: 10 Training risk: 0.212 Validation risk: 0.213 Run time of epoch: 0.401 seconds
#> Epoch: 11 Training risk: 0.209 Validation risk: 0.21 Run time of epoch: 0.402 seconds
#> Epoch: 12 Training risk: 0.206 Validation risk: 0.208 Run time of epoch: 0.402 seconds
#> Epoch: 13 Training risk: 0.204 Validation risk: 0.205 Run time of epoch: 0.424 seconds
#> Epoch: 14 Training risk: 0.202 Validation risk: 0.203 Run time of epoch: 0.423 seconds
#> Epoch: 15 Training risk: 0.2 Validation risk: 0.203 Run time of epoch: 0.43 seconds
#> Epoch: 16 Training risk: 0.198 Validation risk: 0.199 Run time of epoch: 0.436 seconds
#> Epoch: 17 Training risk: 0.197 Validation risk: 0.197 Run time of epoch: 0.427 seconds
#> Epoch: 18 Training risk: 0.196 Validation risk: 0.196 Run time of epoch: 0.407 seconds
#> Epoch: 19 Training risk: 0.194 Validation risk: 0.194 Run time of epoch: 0.413 seconds
#> Epoch: 20 Training risk: 0.193 Validation risk: 0.193 Run time of epoch: 0.41 seconds
#> Epoch: 21 Training risk: 0.192 Validation risk: 0.192 Run time of epoch: 0.429 seconds
#> Epoch: 22 Training risk: 0.191 Validation risk: 0.191 Run time of epoch: 0.414 seconds
#> Epoch: 23 Training risk: 0.19 Validation risk: 0.19 Run time of epoch: 0.419 seconds
#> Epoch: 24 Training risk: 0.189 Validation risk: 0.192 Run time of epoch: 0.412 seconds
#> Epoch: 25 Training risk: 0.188 Validation risk: 0.188 Run time of epoch: 0.415 seconds
#> Epoch: 26 Training risk: 0.188 Validation risk: 0.187 Run time of epoch: 0.413 seconds
#> Epoch: 27 Training risk: 0.187 Validation risk: 0.187 Run time of epoch: 0.409 seconds
#> Epoch: 28 Training risk: 0.186 Validation risk: 0.186 Run time of epoch: 0.424 seconds
#> Epoch: 29 Training risk: 0.186 Validation risk: 0.186 Run time of epoch: 0.411 seconds
#> Epoch: 30 Training risk: 0.185 Validation risk: 0.185 Run time of epoch: 0.409 seconds
#> Epoch: 31 Training risk: 0.185 Validation risk: 0.184 Run time of epoch: 0.405 seconds
#> Epoch: 32 Training risk: 0.184 Validation risk: 0.184 Run time of epoch: 0.411 seconds
#> Epoch: 33 Training risk: 0.184 Validation risk: 0.187 Run time of epoch: 0.406 seconds
#> Epoch: 34 Training risk: 0.184 Validation risk: 0.184 Run time of epoch: 0.41 seconds
#> Epoch: 35 Training risk: 0.183 Validation risk: 0.185 Run time of epoch: 0.402 seconds
#> Epoch: 36 Training risk: 0.183 Validation risk: 0.184 Run time of epoch: 0.408 seconds
#> Epoch: 37 Training risk: 0.183 Validation risk: 0.182 Run time of epoch: 0.402 seconds
#> Epoch: 38 Training risk: 0.182 Validation risk: 0.182 Run time of epoch: 0.41 seconds
#> Epoch: 39 Training risk: 0.182 Validation risk: 0.181 Run time of epoch: 0.405 seconds
#> Epoch: 40 Training risk: 0.182 Validation risk: 0.182 Run time of epoch: 0.428 seconds
#> Epoch: 41 Training risk: 0.182 Validation risk: 0.181 Run time of epoch: 0.408 seconds
#> Epoch: 42 Training risk: 0.181 Validation risk: 0.182 Run time of epoch: 0.403 seconds
#> Epoch: 43 Training risk: 0.181 Validation risk: 0.181 Run time of epoch: 0.408 seconds
#> Epoch: 44 Training risk: 0.181 Validation risk: 0.181 Run time of epoch: 0.428 seconds
#> Epoch: 45 Training risk: 0.18 Validation risk: 0.183 Run time of epoch: 0.406 seconds
#> Epoch: 46 Training risk: 0.181 Validation risk: 0.179 Run time of epoch: 0.41 seconds
#> Epoch: 47 Training risk: 0.18 Validation risk: 0.181 Run time of epoch: 0.404 seconds
#> Epoch: 48 Training risk: 0.18 Validation risk: 0.181 Run time of epoch: 0.404 seconds
#> Epoch: 49 Training risk: 0.18 Validation risk: 0.18 Run time of epoch: 0.407 seconds
#> Epoch: 50 Training risk: 0.18 Validation risk: 0.179 Run time of epoch: 0.405 seconds
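As noted above, one may alternatively pass functions to the arguments sampler and simulator for “on-the-fly” simulation, in which case new parameters and data are generated as training proceeds. A hedged sketch, reusing the prior() and simulate() functions defined earlier (the argument m for the number of replicates per data set is an assumption; consult the package documentation for the exact interface):
estimator <- train(
  estimator,
  sampler = prior,       # samples parameter vectors from the prior
  simulator = simulate,  # simulates data conditional on sampled parameters
  m = m,                 # replicates per data set (assumed argument)
  epochs = 50
)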
If the argument savepath in train() is specified, the neural-network state (e.g., the weights and biases) will be saved during training as bson files, and the best state (as measured by the validation risk) will be saved as best_network.bson. To load these saved states into a neural network in later R sessions, one may use the function loadstate(). Note that one may also manually save a trained estimator using savestate().
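For example, a minimal sketch (the file path is illustrative, and the estimator-then-path signatures of savestate() and loadstate() shown here are an assumption; consult the package documentation for the exact interface):
savestate(estimator, "estimator_state.bson")  # assumed signature: estimator, then path
# ... in a later R session, after re-initialising the same architecture:
estimator <- loadstate(estimator, "estimator_state.bson")  # assumed signature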
Once a neural Bayes estimator has been trained, its performance should be assessed. Simulation-based (empirical) methods for evaluating the performance of a neural Bayes estimator are ideal, since simulation is already required for constructing the estimator, and because the estimator can be applied to thousands of simulated data sets at almost no computational cost.
To assess the accuracy of the resulting neural Bayes estimator, one may use the function assess(). The resulting object can then be passed to the functions risk(), bias(), and rmse() to compute a Monte Carlo approximation of the Bayes risk, the bias, and the RMSE, respectively, or passed to the function plotestimates() to visualise the estimates against the corresponding true values:
theta_test <- prior(1000)
Z_test <- simulate(theta_test, m)
assessment <- assess(estimator, theta_test, Z_test, estimator_names = "NBE")
#> Running NBE...
parameter_labels <- c("θ1" = expression(mu), "θ2" = expression(sigma))
plotestimates(assessment, parameter_labels = parameter_labels)
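The assessment object can also be summarised numerically using the functions mentioned above:
risk(assessment)  # Monte Carlo approximation of the Bayes risk
bias(assessment)  # Monte Carlo approximation of the bias
rmse(assessment)  # Monte Carlo approximation of the RMSE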
In addition to assessing the estimator over the entire parameter space \(\Theta\), it is often helpful to visualise the empirical sampling distribution of an estimator for a particular parameter configuration. This can be done by calling assess() with \(J > 1\) data sets simulated under a single parameter configuration, and providing the resulting estimates to plotdistribution(). This function can be used to plot the empirical joint and marginal sampling distribution of the neural Bayes estimator, with the true parameter values demarcated in red.
theta <- as.matrix(c(0, 0.5)) # true parameters
J <- 400 # number of data sets
Z <- lapply(1:J, function(i) simulate(theta, m)[[1]]) # simulate data
assessment <- assess(estimator, theta, Z, estimator_names = "NBE")
#> Running NBE...
joint <- plotdistribution(assessment, type = "scatter",
                          parameter_labels = parameter_labels,
                          return_list = TRUE)
marginal <- plotdistribution(assessment, type = "box",
                             parameter_labels = parameter_labels,
                             return_list = TRUE)
ggpubr::ggarrange(plotlist = c(joint, marginal), nrow = 1, common.legend = TRUE)
Once the neural Bayes estimator has been trained and assessed, it can be applied to observed data using the function estimate(), and nonparametric bootstrap estimates can be computed using the function bootstrap(). Below, we use simulated data as a surrogate for observed data:
theta <- as.matrix(c(0, 0.5)) # true parameters
Z <- simulate(theta, m) # pretend that this is observed data
thetahat <- estimate(estimator, Z) # point estimates from the "observed data"
bs <- bootstrap(estimator, Z) # non-parametric bootstrap estimates
bs[, 1:6]
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0.1313937 0.1203711 0.01909048 0.07940051 0.06820792 0.1044405
#> [2,] 0.2697322 0.1982008 0.26073503 0.27769640 0.35609826 0.3419894
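The bootstrap estimates are returned with one row per parameter (as in the printout above), so percentile-based intervals follow directly; for example, an equal-tailed 95% interval for each parameter:
apply(bs, 1, quantile, probs = c(0.025, 0.975))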
Suppose now that our data consist of \(m\) replicates of a \(d\)-dimensional multivariate distribution. For unstructured \(d\)-dimensional data, the estimator is based on an MLP, and each data set is stored as a two-dimensional array (a matrix), with the second dimension storing the independent replicates. That is, in this case, we store the data as a list of \(d \times m\) matrices (previously they were stored as \(1 \times m\) matrices), and the summary network of the DeepSets representation has \(d\) input neurons (in the previous example, it had one input neuron).
For example, consider the task of estimating \(\boldsymbol{\theta} \equiv (\mu_1, \mu_2, \sigma, \rho)'\) from data \(\boldsymbol{Z}_1, \dots, \boldsymbol{Z}_m\) that are independent and identically distributed according to the bivariate Gaussian distribution, \[ \rm{Gau}\left(\begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}, \sigma^2\begin{bmatrix} 1 & \rho \\ \rho & 1\end{bmatrix}\right). \] Then, to construct a neural Bayes estimator for this simple model, one may use the following code for defining a prior, the data simulator, and the neural-network architecture:
prior <- function(K) {
  mu1   <- rnorm(K)         # mu1 ~ Gau(0, 1)
  mu2   <- rnorm(K)         # mu2 ~ Gau(0, 1)
  sigma <- rgamma(K, 1)     # sigma ~ Gamma(1, 1)
  rho   <- runif(K, -1, 1)  # rho ~ Unif(-1, 1)
  Theta <- matrix(c(mu1, mu2, sigma, rho), byrow = TRUE, ncol = K)  # p x K matrix
  return(Theta)
}
simulate <- function(Theta, m) {
  apply(Theta, 2, function(theta) {
    mu    <- c(theta[1], theta[2])
    sigma <- theta[3]
    rho   <- theta[4]
    Sigma <- sigma^2 * matrix(c(1, rho, rho, 1), 2, 2)
    Z <- MASS::mvrnorm(m, mu, Sigma)  # m x d matrix of replicates
    t(Z)                              # return as a d x m matrix
  }, simplify = FALSE)
}
estimator <- initialise_estimator(p = 4, d = 2, architecture = "MLP")
The training, assessment, and application of the neural Bayes estimator then remain as described above; a condensed sketch is given below.
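For completeness, a condensed sketch of that workflow for this model, reusing \(m\) from above:
theta_train <- prior(10000)
theta_val   <- prior(1000)
Z_train <- simulate(theta_train, m)
Z_val   <- simulate(theta_val, m)
estimator <- train(estimator, theta_train = theta_train, theta_val = theta_val,
                   Z_train = Z_train, Z_val = Z_val, epochs = 50)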
For data collected over a regular grid, the neural Bayes estimator is based on a CNN. We give specific attention to this case in a separate vignette, available here. There, we also outline two techniques for performing neural Bayes estimation with incomplete/missing data (Wang et al., 2024).
Various other methods are implemented in the package, and will be described in forthcoming vignettes. These topics include constructing a neural Bayes estimator for irregular spatial data using graph neural networks (Sainsbury-Dale et al., 2023); dealing with settings in which some data are censored (Richards et al., 2023); constructing posterior credible intervals using a neural Bayes estimator trained under the quantile loss function (e.g., Fisher et al., 2023; Sainsbury-Dale et al., 2023, Sec. 2.2.4); and performing inference using neural likelihood-to-evidence ratios (see, e.g., Hermans et al., 2020; Walchessen et al., 2024; Zammit-Mangion et al., 2024, Sec. 5.2).
If you’d like to implement these methods before these vignettes are made available, please contact the package maintainer.