This vignette is written for baggr users who want to learn about (Bayesian) meta-analysis concepts and analyse data on typically continuous variables. If you are looking for information that is more specific to binary data, please read vignette("baggr_binary"). We continue to welcome feedback and suggestions.

What is baggr?

baggr (pronounced “bagger” or “badger” and short for Bayesian Aggregator) is a package for aggregating evidence on causal effects measured in several separate and different instances. These instances may be different studies, groups, locations or “sites” however conceptualised. We refer to these separate pieces of evidence as “groups” for the remainder of this vignette When each group is a study, the model is that of meta-analysis, but aggregation of evidence is not limited to this case.

One of the most basic objects of interest is the average treatment effect (ATE), the difference in the mean outcome in treatment and control groups; for more information see work by Rubin (1974). In meta-analysis we are often interested in the average of this average effect across groups, estimated using all the evidence from all the groups. Consider the case where the evidence in each study or group is generated by comparing the outcomes of treatment and control samples in a randomized experiment. We will ignore any covariate information at the individual or group level for now.

Consider some outcome of interest $y_{ik}$ such as consumption, income or health outcomes for a household or individual $i = 1,2,...N_k$ in study group $k = 1,2....K$. Let $Y_k$ denote the $N_k$-length vector of observed outcomes from group $k$. Denote treatment status by $T_{ik}$, and denote by $T_k$ the $N_k$-length vector of all treatment status indicators from group $k$. In individual-level models, treatment can be either binary or a multi-level factor (for example, multiple treatment arms plus control).

Suppose that $y_{ik}$ varies randomly around its mean $\mu_k + \tau_k T_i$. In this setting $\tau_k$ is the treatment effect in group $k$. The random variation in $y_{ik}$ may be the result of sampling variation or measurement error, as in the Rubin (1981) model, or it may be the result of unmodelled heterogeneity or uncertainty in outcomes for individuals within the group. Allow the variance of the outcome variable $y_{ik}$ to vary across sites, so $\sigma_{y_k}^2$ may differ across $k$.

Recent features

This vignette has been updated in 2026 for baggr v0.8. Since 2020, baggr has added several user-facing features that are useful in standard applied workflows:

Selection models for Rubin summary-data models: see selection()
Funnel-plot diagnostics via funnel_plot(). Also new plotting via bubble().
Multi-arm treatment handling in individual-level data workflows via baggr().
Cluster-level random effects support through the cluster argument in baggr().
Model comparison and visualization with baggr_compare(), loocv(), and associated plotting methods.

Data inputs: reported effects or full individual-level data sets

For average effects aggregation, baggr detects the input format automatically. The table below summarises the user-facing formats and the corresponding internal category returned by detect_input_type().

Input format	Required columns (minimum)	Internal input type	Typical default model
Effect estimates only (continuous outcomes)	`tau`, `se`	`pool_noctrl_narrow`	`rubin`
Control mean + treatment effect summary (continuous outcomes)	`mu`, `tau`, `se.mu`, `se.tau`	`pool_wide`	`mutau`
Binary summary counts (2x2-style per group)	`a`, `b`, `c`, `d` (plus optional `group`)	`pool_binary`	`rubin` (or `logit` if requested)
Individual-level data (continuous outcomes)	`group`, `treatment`, `outcome`	`individual`	`rubin_full`
Individual-level data (binary outcomes)	`group`, `treatment`, `outcome` with binary `outcome`	`individual_binary`	`logit`

If you supply cluster = ..., baggr will fit cluster random intercepts in models that support clustering for individual-level data.

Examples:

# Auto-detect binary summary input (a, b, c, d columns)
baggr(df_binary_summary)

# Auto-detect individual-level input with a multi-arm treatment factor
baggr(df_multiarm_individual,
      outcome = "outcome",
      treatment = "treatment_arm",
      group = "study")

As an example of an individual-level data set we include in data frames microcredit and microcredit_simplified. The former contains all the microcredit outcome data used in Meager (2019), standardized to USD PPP in 2009 terms per two weeks (a time period is necessary as these are flow variables). It therefore contains NAs and other undesirable features, to allow the user to see how baggr handles these common data issues. The data set microcredit_simplified has these issues cleaned up and contains only one outcome of interest, consumer durables spending.

baggr also has a function that automatically produces summary data from full data sets, in case one wishes to run the comparatively faster summary-data models. The prepare_ma() function applied to a dataframe with columns named "group", "outcome", and "treatment" automatically estimates the control group means, treatment effects, and associated standard errors for each group using an Ordinary Least Squares regression. The resulting output is already formatted as a valid input to the baggr() command itself:

prepare_ma(microcredit_simplified, outcome = "consumption")
#>   group       mu       tau     se.mu    se.tau n.mu n.tau
#> 1     1 303.6065  5.510755  2.559897  4.101140 8298  8262
#> 2     2 280.0887 50.449393 11.141075 22.156317  260   701
#> 3     3 196.4215 -5.171338 14.432604 19.266339  444   551
#> 4     4 276.2791  4.641604  3.730907  5.451094 3248  3579
#> 5     5 327.5246 -2.935731  4.027768  6.022955 2771  2716

ATE aggregation models in baggr

baggr includes summary-data models and individual-level-data models for aggregating average treatment effects.

Rubin model

Consider first the evidence aggregation model from Rubin (1981), discussed extensively in Chapter 5 of Gelman et al. (2013); the model consists of a hierarchical likelihood as follows:

\[\begin{equation} \begin{aligned} \hat{\tau_k} &\sim N(\tau_k, \hat{se_k}^2) \; \forall \; k \\ \tau_k &\sim N(\tau, \sigma_{\tau}^2) \; \forall \; k . \end{aligned} \label{rubin model} \end{equation}\]

The motivation for this model structure is discussed in detail in the sources above and in Meager (2019). To complete the Bayesian model, we now need priors. baggr has a set of default priors for each model (adjusted to data), as well as allowing the user to specify her own priors if desired. In the Rubin model, the defaults are:

For $\tau$, prior_hypermean = normal(0, 10*max(abs(tau))).
For $\sigma_{\tau}$, prior_hypersd = uniform(0, 10*sd(tau)) (for at least two groups).

$\mu$-$\tau$ model

In case you also have data on the control groups’ mean outcomes and the uncertainty on those, you can augment the Rubin (1981) model to incorporate that information. Following Meager (2019), if one has access to the estimated control means $\{\hat{\mu_k}\}^K_{k=1}$ and their standard errors $\{\hat{se}_{\mu k}\}^K_{k=1}$, one can fit a joint Gaussian model on the pairs $\{\hat{\mu_k},\hat{\tau_k}\}_{k=1}^{K}$:

\[\begin{equation} \begin{aligned} \hat{\tau_k} &\sim N(\tau_k, \hat{se_{\tau k}}^2) \; \forall \; k \\ \hat{\mu_k} &\sim N(\mu_k, \hat{se_{\mu k}}^2) \; \forall \; k \\ \left( \begin{array}{c} \mu_{k}\\ \tau_{k} \end{array} \right) &\sim N\left( \left( \begin{array}{c} \mu\\ \tau \end{array} \right), V \right) \; \text{where} \;V = \left[ \begin{array}{cc} \sigma^2_{\mu} & \sigma_{\tau\mu} \\ \sigma_{\tau\mu} & \sigma_{\tau}^2 \end{array} \right]\forall \; k. \\ \end{aligned} \label{full data model} \end{equation}\]

In baggr this model is referred to as "mutau".

If you have only few groups, the priors on $V$ will need to be relatively strong to avoid overfitting. See Meager (2019) for more discussion of this issue in particular, or see the Stan Manual on hierarchical priors. In this model, defaults are:

prior_hypermean = multinormal(c(0, 0), diag(c(100*max(abs(mu)), 100*max(abs(tau))))).
prior_hypersd = uniform(0, 10*sd(tau)).
prior_hypercor = lkj(3).

Individual-level models

For individual-level data (outcome, treatment, group), baggr supports three likelihood models:

"rubin_full": $y_i \sim N(\mu_{k[i]} + \tau_{k[i]} z_i + x_i^\top\beta + u_i, \sigma_{y,k[i]}^2)$.
"mutau_full": $y_i \sim N(\mu_{k[i]} + \tau_{k[i]} z_i + x_i^\top\beta, \sigma_{y,k[i]}^2)$.
"logit": $y_i \sim \text{Bernoulli}(\text{logit}^{-1}(\mu_{k[i]} + \tau_{k[i]} z_i + x_i^\top\beta))$.

Models, their inputs, hierarchy and priors: a summary table

Model (name in `baggr`)	Required input columns	Hierarchical structure	Prior knobs (main arguments)
Rubin summary-data model (`"rubin"`)	`tau`, `se` (optional `group`, optional `covariates`)	With `pooling = "partial"`: $\tau_k \sim N(\tau, \sigma_\tau^2)$. `pooling = "none"` uses independent study effects; `pooling = "full"` uses one common effect.	`prior_hypermean` (for $\tau$), `prior_hypersd` (for $\sigma_\tau$), `prior_beta` (if `covariates` used), `prior_selection` (if `selection` is used).
$\mu$-$\tau$ summary-data model (`"mutau"`)	`mu`, `tau`, `se.mu`, `se.tau` (optional `group`, optional `covariates`)	With `pooling = "partial"`: $\begin{pmatrix}\mu_k\\\tau_k\end{pmatrix} \sim N\left(\begin{pmatrix}\mu\\\tau\end{pmatrix}, V\right)$, where $V$ is parameterized via scales and a correlation matrix. `pooling = "none"` and `"full"` follow the same none/full pooling interpretation as above.	`prior_hypermean` (for $(\mu,\tau)$), `prior_hypersd` (scales in $V$), `prior_hypercor` (correlation in $V$), `prior_beta` (if `covariates` used).
Rubin individual-level model (`"rubin_full"`)	`outcome`, `treatment`, `group` (optional individual-level `covariates`, optional `cluster`)	Treatment effects follow Rubin-style pooling across groups (`none`/`partial`/`full`), with optional baseline pooling via `pooling_control`.	`prior_hypermean` (treatment hypermean), `prior_hypersd` (treatment heterogeneity), `prior_control` and `prior_control_sd` (control arm baseline and its heterogeneity), `prior_sigma` (within-group outcome SD), `prior_beta`, `prior_cluster`.
$\mu$-$\tau$ individual-level model (`"mutau_full"`)	`outcome`, `treatment`, `group` (optional individual-level `covariates`)	Partial-pooling form jointly pools $(\mu_k, \tau_k)$ with a multivariate normal hierarchy and correlation; none/full pooling use the corresponding independent/common parameter special cases.	`prior_hypermean`, `prior_hypersd`, `prior_hypercor`, `prior_control_sd` (for baseline SD under partial baseline pooling), `prior_sigma`, `prior_beta`.
Logistic individual-level model (`"logit"`)	`outcome` (0/1), `treatment`, `group` (optional individual-level `covariates`)	Log-odds treatment effects are pooled across groups (`none`/`partial`/`full`); control-arm log-odds can be pooled separately via `pooling_control`.	`prior_hypermean`, `prior_hypersd`, `prior_control`, `prior_control_sd`, `prior_beta`.

selection = ... (publication selection on $|z| = |\tau / se|$) is currently available only for the summary-data Rubin model (model = "rubin"). Use prior_selection to set priors on relative publication probabilities across the specified $|z|$ intervals.

Where inputs are individual-level data, i.e. outcome, treatment, group, you can specify column names in as arguments to baggr() function.

Prior choice in baggr

In the “Rubin” and “$\mu$ and $\tau$” models, the user can specify custom priors beyond the defaults using the prior arguments. These prior arguments are subdivided into 3 categories:

Priors for the hypermean, the average effect across groups, such as $\tau$ or the vector $(\mu, \tau)$. This is denoted using the prior_hypermean argument in baggr.
Priors for the hyper-standard-deviations $\sigma_{\tau}$, the standard deviation of effects across groups. This also refers to the prior on $\theta$, the scale parameter of the hyper-variance-covariance matrix $V$ in the”$\mu$ and $\tau$” model. This type of parameter is denoted prior_hypersd in baggr.
Priors on hypercorrelations between parameters that may co-vary across the groups. When working with multi-dimensional parameters at the upper level of the model (aka performing multivariate shrinkage), this is the correlation in the distribution of the different parameters across the groups. For example, in the “$\mu$ and $\tau$” model this parameter is the correlation matrix $\Omega$ that governs the hyper-variance-covariance matrix $V$. This type of parameter is denoted prior_hypercor in baggr.

The possible prior distributions we allow for in the current version are:

For prior_hypermean we allow "normal", "uniform", "cauchy", "lognormal", and (generalised) "student_t" with any parameter values which are logically possible given support constraints (e.g. user cannot specify a negative variance on a normal distribution or a negative scale on a Cauchy) When the model has a vector hypermean, “$\mu$ and $\tau$” model, baggr applies the given prior to all the elements of the vector independently. If the user wishes to specify a prior dependence between the components, one can supply prior_hypermean with a multinormal argument (see ?priors for details).
For prior_hypersd we allow "normal" and "uniform" with any parameter values which are logically possible given support constraints (as above).
For prior_hypercor, which is only necessary when there is a multivariate shrinkage operation (as in the “$\mu$ and $\tau$” model) we allow an LKJ correlation prior ("lkj") with any parameter values which are logically possible given support constraints. For further details of this strategy see Meager (2019).

Notation for priors is “plain-text”, in that you can write the distributions as normal(), uniform() etc. See ?priors for details, or continue to the example below with the Rubin model.

Running the Rubin model in baggr

To demonstrate the Rubin model in baggr, consider the 8 schools example from Rubin (1981). In this dataset, 8 schools in the United States performed similar randomized experiments to estimate the causal effect of an SAT tutoring program on learning outcomes. Reported treatment effects and their standard errors are included in baggr as a data frame:

schools
#>      group tau se
#> 1 School A  28 15
#> 2 School B   8 10
#> 3 School C  -3 16
#> 4 School D   7 11
#> 5 School E  -1  9
#> 6 School F   1 11
#> 7 School G  18 10
#> 8 School H  12 18

To fit the model in baggr (having followed the installation instructions and loaded the package):

baggr_schools <- baggr(schools, model = "rubin", pooling = "partial")

This creates a baggr class object, and you can access the underlying stanfit object by calling baggr_schools$fit. If you don’t change the default priors, then baggr will print a message informing you of the priors it has chosen.

Printing baggr_schools returns a summary of the posterior inference. First baggr records the model type and the pooling regime chosen by the user or implemented by default. Second, baggr returns inference on the aggregate treatment effect $\tau$ by reporting its posterior mean and 95% uncertainty interval, and similar inference on the hyper-SD $\sigma_{\tau}$. Lastly, it prints the “updated” inference on each of the groups’ treatment effects, displaying their new posterior means, standard deviations and pooling factors (see below).

print(baggr_schools)
#> Model type: Rubin model with aggregate data 
#> Pooling of effects: partial 
#> 
#> Aggregate treatment effect (on mean), 8 groups:
#> Hypermean (tau) =  8.0 with 95% interval -1.7 to 18.4 
#> Hyper-SD (sigma_tau) = 6.59 with 95% interval 0.23 to 20.10 
#> Posterior predictive effect = 8.2 with 95% interval -11.6 to 29.3 
#> Total pooling (1 - I^2) = 0.76 with 95% interval 0.25 to 1.00 
#> 
#> Group-specific treatment effects:
#>          mean  sd   2.5%  50% 97.5% pooling
#> School A 11.5 8.3  -1.83 10.3    32    0.82
#> School B  8.0 6.3  -4.31  8.0    21    0.72
#> School C  6.3 7.7 -10.34  6.6    21    0.84
#> School D  7.6 6.6  -5.93  7.7    21    0.75
#> School E  5.2 6.4  -8.93  5.6    16    0.69
#> School F  6.2 6.7  -8.75  6.6    18    0.75
#> School G 10.8 6.8  -0.99 10.2    26    0.72
#> School H  8.5 7.7  -6.44  8.3    25    0.86

Choosing priors

It is possible to fit the Rubin model without specifying any priors, in which case the user is notified about the automatic prior choice that baggr performs. But the priors can also be easily customised by using prior_ arguments to baggr(). The Rubin model performs univariate shrinkage so we will not need to specify a hyper-correlation prior, but we can specify some custom priors on the hyper-mean and hyper-sd. If desired, the user can specify only some priors as custom distributions - the rest will be chosen automatically and the user will be notified of this in the output.

Consider changing both as an example, say placing a Normal prior with mean -5 and standard deviation 10 on our hypermean $\tau$, as well as placing a uniform prior with lower bound 0 and upper bound 5 on our hyper-standard-deviation $\sigma_{\tau}$. Thus, what is expressed mathematically as $\tau \sim N(-5,10)$ and $\sigma_{\tau} \sim U[0,5]$ is expressed in baggr as follows:

baggr(schools, "rubin", 
      prior_hypermean = normal(-5, 10),
      prior_hypersd   = uniform(0, 5))

It is also possible to pass your custom priors as a list to baggr, as follows:

custom_priors <- list( hypermean = cauchy(0,25), hypersd = normal(0,30))
baggr(schools, "rubin", pooling = "partial", prior = custom_priors)

Note that the Rubin model assumes Gaussian distribution of effects across groups. This is generally appropriate as a first pass at the problem (see McCulloch and Neuhaus (2011)) except if the distribution is known to be asymmetric for scientific reasons. For example, if you are working with risk ratios or odds ratios these statistics cannot be negative and the chosen hyper-distribution should typically reflect that. However, in many cases it is possible and indeed standard to maintain the Gaussian assumption on a transform of the object: for example, you can safely fit the Rubin model to the logarithm of the risk ratios or odds ratios. While bearing in mind that log transforms obscure inherent dependencies between means and variances in the raw scale, this is still much better than applying a Gaussian to the raw object itself.

Understanding and criticising baggr model performance

Baggr models are run in Stan, and the fit and results need to be checked, understood and criticised as you would any stan model or indeed any MCMC model fitting exercise. You must pay attention to printed warnings about the Rhat criterion: if you see a warning that Rhat statistic exceeds 1.05 for any parameter, you MUST NOT use the results for inference. This warning means the MCMC chains have not converged, and it is exceedingly unlikely that the “posterior inference” printed out corresponds to anything close to the true posterior implied by your model and data. If you use results from which the Rhat statistic exceeds 1.05 YOUR INFERENCE WILL BE WRONG.

If you see this warning, try re-running the model with the option iter set to a large number such as 10,000, as below. It is also good practice to run many chains, such as 8 rather than the default 4, to have a greater chance to detect pathological MCMC behaviour. You do this by passing baggr the stan arguments iter = 10000 and chains = 8, like so:

baggr_schools <- baggr(schools, model = "rubin", pooling = "partial", 
                       iter = 10000, chains = 8)

Other warnings you may see involve “divergent transitions”. While not as serious as high Rhat, this can signal problems with the model. As the stan message that you will see suggests, try adjusting adapt_delta argument above 0.8, e.g. to 0.99; same as with iter and chains, since you can pass Stan arguments within baggr(), you can simply do:

baggr_schools <- baggr(schools, model = "rubin", pooling = "partial", 
                       iter = 10000, chains = 8,
                       control = list(adapt_delta = 0.99)) #see rstan documentation for details

Measuring “pooling”

It is often useful to measure the extent to which the hierarchical model is “pooling” or sharing information across the groups in the process of aggregation. Baggr automatically computes and prints such a metric, as seen above. You can access more details by writing pooling(baggr_schools) or heterogeneity(baggr_schools).

Estimate of pooling in each group

In the output above we can see a “pooling” column next to each group. This is a statistic due to Gelman and Pardoe (2006).

In a partial pooling model (see [baggr]), group $k$ (e.g. study) has a treatment effect estimate, with some SE around the real $\tau_k$, i.e. $\hat{\tau_k} \sim \mathcal{N}(\tau_k, \hat{se_k})$ ( itself is an estimate of the true $se_k$). Each $\tau_k$ itself is distributed with mean $\tau$ and variance $\sigma_{\tau}^2$.

The quantity of interest is ratio of variability in $\tau$ to total variability. By convention, we subtract it from 1, to obtain a pooling metric $p$.

\[p = 1 - \frac{\sigma_{\tau}^2}{\sigma_{\tau}^2 + se_k^2}\]

If $p < 0.5$, that means the variation across studies is higher than variation within studies.
Values close to 1 indicate nearly full pooling. Variation across studies dominates.
Values close to 0 – no pooling. Variation within studies dominates.

Note that, since $\sigma_{\tau}^2$ is a Bayesian parameter (rather than a single fixed value) $p$ is also a parameter. It is typical for $p$ to have very high dispersion, as in many cases we cannot precisely estimate $\sigma_{\tau}$. That is certainly the case in our example:

pooling(baggr_schools)
#> , , 1
#> 
#>            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
#> 2.5%  0.3577221 0.1984207 0.3878907 0.2304851 0.1670173 0.2304851 0.1984207
#> mean  0.8234448 0.7193128 0.8375336 0.7459774 0.6885869 0.7459774 0.7193128
#> 97.5% 0.9997547 0.9994482 0.9997844 0.9995439 0.9993188 0.9995439 0.9994482
#>            [,8]
#> 2.5%  0.4450673
#> mean  0.8613130
#> 97.5% 0.9998296

Overall pooling (in the model)

Sometimes researchers prefer to summarise heterogeneity by using a single measure. One possibility is to provide a “big” estimate of pooling $P$ is analogous to averaging $p$ across groups:

\[P = 1 - \frac{\sigma_{\tau}^2}{\sigma_{\tau}^2 + \text{E}(se_k^2)}\]

where is average over $K$ groups. Note that the denominator in the formula above is an application of the law of total variance to $\hat{\tau_k}$, i.e. $\text{Var}(\hat{\tau_k})$ is a sum of between-study variance ($\text{Var}({\tau_k})$) and average within-study variance ($\text{E}(se_k^2)$); von Hippel (2015) provides more details.

In many contexts, i.e. medical statistics, it is typical to report $1-P$, called $I^2$ (see Higgins et al. (2003) for an overview). Higher values of I-squared indicate higher heterogeneity.

Same as for group-specific estimates, $P$ is a Bayesian parameter and its dispersion can be high.

Plotting and model comparison in baggr

A fundamental step to understanding the model is to plot the posterior distributions. baggr has several automatic plot functions which you can access by calling baggr_plot() or, equivalently, using the default plot function; these visuals are based on bayesplot package. Plotting functions always take baggr class object as their first argument. By default, means and 95% posterior intervals of the effects in each group are shown. Extra options are available, such as whether to order the results by effect size. For the 8 schools Rubin model we have

plot(baggr_schools, order = FALSE)

Similarly important is the plot of potential treatment effects (posterior predictive distribution), obtained by repeatedly drawing hypermean and hyper-SD values from baggr’s Markov Chain model, then drawing new value of treatment effect conditional on these:

effect_plot(baggr_schools)

The values underlying the plot can be obtained by effect_draw, which can be used e.g. to draw 1 new value:

effect_draw(baggr_schools, draws = 1)
#> [1] 13.62787

You can also summarise over the entire distribution to obtain mean, SD, and quantiles (effect_draw(baggr_schools, summary = TRUE)). For meta-regression models (models with covariates), you can also use this function to make predictions for new data, at set values of covariates.

However, from the viewpoint of model building, the most important plots are the ones that directly compare many possible models.

Basic model comparison with baggr_compare

The default Rubin model (which we have selected explicitly above) is that of partial pooling. When using baggr_compare without any extra arguments, full pooling, no pooling and partial pooling versions of the model will be fit:

my_baggr_comparison <- baggr_compare(schools)

The result of the comparison includes all the models, but it also produces an automatic comparison plot. Because the output object of baggr_compare is a ggplot object you can edit it further yourself and build on it as you would any ggplot object by, for example, changing theme or labels:

plot(my_baggr_comparison) + 
  ggtitle("8 schools: model comparison")

Comparing existing models, understanding impact of the prior

The comparison can be made not just for different types of pooling, but for any number of models which include the same groups. A typical example is comparing two models on the same data, but with different priors.

Consider this model with alternative, very strong priors:

baggr_schools_v2 <- baggr(schools, prior_hypermean = normal(10, 2.5))

We can compare it with the previous model by providing both models as arguments to baggr_compare and effects_plot. It’s good to name the arguments, as these names will be used to label them:

effect_plot("Default model" = baggr_schools, "normal(10, 2.5)" = baggr_schools_v2) +
  coord_cartesian(xlim = c(-10, 30)) + theme(legend.position = "top")

baggr_compare("Default model" = baggr_schools, "normal(10, 2.5)" = baggr_schools_v2)
#> 
#> Mean treatment effects:
#>                     2.5%    mean   97.5%  median      sd
#> Default model   -1.72754 7.98685 18.3932 7.92932 5.02494
#> normal(10, 2.5)  5.23982 9.51759 14.0120 9.51188 2.22904
#> 
#> SD for treatment effects:
#>                     2.5%    mean   97.5%  median      sd
#> Default model   0.234978 6.58845 20.0992 5.32198 5.35516
#> normal(10, 2.5) 0.281336 6.00750 18.8549 4.76078 5.06531
#> 
#> Posterior predictive effects:
#>                      2.5%    mean   97.5%  median      sd
#> Default model   -11.49860 8.22281 30.1620 7.93321 9.80367
#> normal(10, 2.5)  -7.73755 9.50167 26.6387 9.46154 8.27914

Forest plots for the models

Forest plots are typical for reporting meta-analysis results. We adapted the default functionality from forestplot to work with baggr objects. You can choose to display 1) input data for groups, and/or 2) posterior distributions for groups. In both cases these are followed by a display mean treatment effect.

The plots can be modified by passing extra arguments – see ?forestplot.

forest_plot(baggr_schools, show = "both")

The show = "both" option displays both the observed group-level inputs and posterior group estimates in a single plot.

You may find this hard to read. For a visual similar to a forest plot but more in the default plotting style, you can also try

plot(baggr_schools, style = "forest")

Cross-validation in baggr

Baggr has built-in, automated leave-one-out cross validation for its models. The values returned by loocv() can be used to understand how any one group affects the overall result, as well as how well the model predicts the omitted group.

This function automatically runs $K$ baggr models, leaving out one group at a time, and then calculates expected log predictive density (ELPD) for that group (see Gelman, Hwang, and Vehtari (2014)). The main output is the cross-validation information criterion, or -2 times the ELPD averaged over $K$ models. This is related to, and often approximated by, the Watanabe-Akaike Information Criterion. A value closer to zero (i.e. a smaller number in magnitude) means a better fit. For more information on cross-validation we recommend reading Gelman, Hwang, and Vehtari (2014).

Therefore, if you have $K$ groups, using the loocv() will run your model of choice $K$ times. Be aware that this may take a while even for simple models. (You should see a progress bar in your terminal window.) The loocv() function takes in the same arguments as baggr(), plus an option (return_models) for whether to return all the models or just the summary statistics. For the 8 schools example we can do

loocv_res <- loocv(schools, return_models = FALSE, 
                   iter = 1000, #just to make it a bit faster -- don't try it at home!
                   model = "rubin", pooling = "partial")

loocv_res
#> LOO estimate based on 8-fold cross-validation
#> 
#>       Estimate Standard Error
#> elpd     -31.8          0.953
#> looic     63.5          1.910

The loocv() output contains more than just the matrix it prints, and this additional information can be accessed via attributes(), e.g. the mean treatment effects, their variability and elpd for each model that are stored in the attribute df:

names(attributes(loocv_res))
#> [1] "names" "class"
attr(loocv_res, "df")
#> NULL

This data frame can then be used to examine or compute the variation in the inference on $\tau$ in the absence of each group. If the user is interested in manually checking the consequences of excluding a particular group or set of groups, this is also possible in baggr using subsetting. For example, suppose that we want to run the Rubin model on school groups 1-7 and predict the effect in the 8th school. The code below shows how you can specify a subset of the dataframe as your “data” argument, and then designate another subset as the “testing” holdout set by assigning it to the argument “test_data” in the baggr command. Here we have done it for both partial and full pooling:

We can compare the performance of the two models using the mean log predictive density. This itself is a density as the name suggests so we will compute the log expected value of this density: here, as before, a number closer to zero is better. In this case the full pooling model actually does slightly better:

fit1$mean_lpd
#> [1] 7.930254
fit2$mean_lpd
#> [1] 7.735453

If we run the full loocv for both models, we can estimate the difference between the ELPD of those models as well as the standard error of that difference. As an example, let’s consider the two models above.

loocv_full <- loocv(data = schools, 
                    model = "rubin", 
                    pooling = "full")

We can compare those fits with loo_compare, which gives use the differences in ELPD for the various models we pass it. Generally it is important to pay attention to both the difference in the predictive density and also the standard error of that difference.

loo_compare(loocv_res, loocv_full)
#> Comparison of cross-validation
#> 
#>                     ELPD ELPD SE
#> Model 1 - Model 2 -0.806   0.322
#> 
#> Positive ELPD indicates the reference group is preferred.

References

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. CRC Press.

Gelman, Andrew, Jessica Hwang, and Aki Vehtari. 2014. “Understanding Predictive Information Criteria for Bayesian Models.” Statistics and Computing 24 (6): 997–1016. https://doi.org/10.1007/s11222-013-9416-2.

Gelman, Andrew, and Iain Pardoe. 2006. “Bayesian Measures of Explained Variance and Pooling in Multilevel (Hierarchical) Models.” Technometrics 48 (2): 241–51. https://doi.org/10.1198/004017005000000517.

Higgins, Julian P T, Simon G Thompson, Jonathan J Deeks, and Douglas G Altman. 2003. “Measuring Inconsistency in Meta-Analyses.” BMJ : British Medical Journal 327 (7414): 557–60.

McCulloch, Charles E., and John M. Neuhaus. 2011. “Misspecifying the Shape of a Random Effects Distribution: Why Getting It Wrong May Not Matter.” Statistical Science 26 (3): 388–402.

Meager, Rachael. 2019. “Aggregating Distributional Treatment Effects: A Bayesian Hierarchical Analysis of the Microcredit Literature.” https://doi.org/10.31222/osf.io/7tkvm.

Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology.

———. 1981. “Estimation in Parallel Randomized Experiments.” Journal of Educational Statistics 6 (4): 377–401.

von Hippel, Paul T. 2015. “The Heterogeneity Statistic I2 Can Be Biased in Small Meta-Analyses.” BMC Medical Research Methodology 15 (April). https://doi.org/10.1186/s12874-015-0024-z.

Aggregating Average Treatment Effects with baggr

Rachael Meager, Witold Wiecek

2026-02-10