The permutation framework is perfectly suited for making inference as it allows one to perform point estimation, confidence regions and hypothesis tests under mild assumptions about the collected data and no distributional assumption. In this article, we briefly illustrate how each of these aspects can be treated from a permutation point of view using the **flipr** package. This package has been written and is intended as a low-level implementation of the permutation framework in the context of statistical inference. The mathematical object behind the scene is the so-called plausibility function, sometimes called p-value function. This article explains what the plausibility function is and shows how it can be easily computed using the permutation framework with **flipr**. We illustrate the shape of the plausibility function using both Gaussian and Gamma distributions.

Let us work on a very classic toy example. Suppose that we are interested in studying the difference in means of two populations. We will generate:

- A first sample of size \(n_1 = 10\) observations from a distribution with mean \(1\) and unit variance;
- A second sample of size \(n_2 = 10\) observations from a distribution with mean \(4\) and unit variance.

```
<- 10
n1 <- 10
n2 <- 1
mu1 <- 4
mu2 <- 1
sd1 <- 1
sd2 <- 100000 B
```

We will consider three different scenari:

- Scenario A
Both populations follow Gaussian distributions.

```
set.seed(1234)
<- rnorm(n1, mean = mu1, sd = sd1)
a1 <- rnorm(n2, mean = mu2, sd = sd2) a2
```

- Scenario B
Both populations follow Gamma distributions.

```
set.seed(1234)
<- rgamma(n1, shape = 1, rate = 1)
b1 <- rgamma(n2, shape = 16, rate = 4) b2
```

- Scenario C
The first population follows a Gaussian distribution while the second population follows a Gamma distribution.

```
set.seed(1234)
<- rnorm(n1, mean = mu1, sd = sd1)
c1 <- rgamma(n2, shape = 16, rate = 4) c2
```

For each scenario, the goal is to make inference on the difference between the means of the two distributions. In other words, the parameter of interest is \(\delta = \mu_2 - \mu_1\).

The plausibility function for the mean difference is a curve that represents the variation of the \(p\)-value of a test in which the null hypothesis is \(\delta = \delta_0\) as a function of \(\delta_0\) (Martin 2017; Fraser 2019; Infanger and Schmidt-Trucksäss 2019).

With **flipr**, it is easy to trace such a plausibility function. Three ingredients are required alongside the data to instantiate such a function:

- a null specification function that tells
**flipr**how the second sample should be transformed in order to make it exchangeable with the first sample under the null hypothesis; - a list of test statistics to use for detecting differences between the distributions that generated the two observed samples;
- a list of index assignments that tells
**flipr**which test statistics to use for each parameter under investigation.

In the context of null hypothesis testing, we consider the null hypothesis \(H_0: \mu_y - \mu_x = \delta\). We can use a permutation scheme to approach the \(p\)-value if the two samples are *exchangeable* under \(H_0\). This means that we need to transform for example the second sample to *make* it exchangeable with the first sample under \(H_0\). In this simple example, this can be achieved as follows. Let \(X_1, \dots, X_{n_x} \sim \mathcal{N}(\mu_x, 1)\) and \(Y_1, \dots, Y_{n_y} \sim \mathcal{N}(\mu_y, 1)\). We can then transform the second sample as \(Y_i \longleftarrow Y_i - \delta\).

We can define a proper function to do this, termed the *null specification* function, which takes two input arguments:

`y`

which is a list storing the data points in the second sample;`parameters`

which is a numeric vector of values for the parameters under investigation (here only \(\delta\) and thus`parameters`

is of length \(1\) with`parameters[1] = delta`

).

In our simple example, it boils down to:

```
<- function(y, parameters) {
null_spec ::map(y, ~ .x - parameters[1])
purrr }
```

Next, we need to decide which test statistic(s) we are going to use for performing the test. Here, we are only interested in one parameter, namely the mean difference \(\delta\). Since the two samples share the same variance, we can use for example the \(t\)-statistic with a pooled estimate of the common variance.

This statistic can be easily computed using `stats::t.test(x, y, var.equal = TRUE)$statistic`

. However, we want to extend its evaluation to any permuted version of the data. Test statistic functions compatible with **flipr** should have at least two mandatory input arguments:

`data`

which is either a concatenated list of size \(n_x + n_y\) regrouping the data points of both samples or a distance matrix of size \((n_x + n_y) \times (n_x + n_y)\) stored as an object of class`dist`

.`indices1`

which is an integer vector of size \(n_x\) storing the indices of the data points belonging to the first sample in the current permuted version of the data.

A **flipr**-compatible version of the t-statistic is already implemented in **flipr** and ready to use as `stat_student`

or its alias `stat_t`

. Here, we are only going to use the \(t\)-statistic for this example, but we might be willing to use more than one statistic for a parameter or we might have several parameters under investigation, each one of them requiring a different test statistic. We therefore group all the test statistics that we need into a single list:

`<- list(stat_t) stat_functions `

Finally we need to define a named list that tells **flipr** which test statistics among the ones declared in the `stat_functions`

list should be used for each parameter under investigation. This is used to determine bounds on each parameter for the plausibility function. This list, often termed `stat_assignments`

, should therefore have as many elements as there are parameters under investigation. Each element should be named after a parameter under investigation and should list the indices corresponding to the test statistics that should be used for that parameter in `stat_functions`

. In our example, it boils down to:

`<- list(delta = 1) stat_assignments `

In **flipr**, the plausibility function is implemented as an R6Class object. Assume we observed two samples stored in lists `x`

and `y`

, we therefore instantiate a plausibility function for this data as follows:

```
<- PlausibilityFunction$new(
pf null_spec = null_spec,
stat_functions = stat_functions,
stat_assignments = stat_assignments,
x, y )
```

Now, assume we want to test the following hypotheses:

\[ H_0: \delta = 0 \quad \mbox{v.s.} \quad H_1: \delta \ne 0. \]

We use the `$get_value()`

method for this purpose, which essentially evaluates the permutation \(p\)-value of a two-sided test by default:

`$get_value(0) pf`

It is possible to evaluate one-sided p-values as well by changing the `$alternative`

field through the `$set_alternative()`

method, which accepts `two_tail`

, `left_tail`

and `right_tail`

values.`

By default, the number of sampled permutations is `1000L`

. It is accessible through the `$nperms`

field and can be changed via the `$set_nperms()`

method.

Let us instantiate the plausibility for the data simulated under scenario A:

```
<- PlausibilityFunction$new(
pfa null_spec = null_spec,
stat_functions = stat_functions,
stat_assignments = stat_assignments,
a1, a2
)$set_nperms(B) pfa
```

We can compute a point estimate of the mean difference and store it inside the plausibility function class via the `$set_point_estimate()`

method:

`$set_point_estimate(mean(a2) - mean(a1)) pfa`

The computed value can then be accessed via the `$point_estimate`

field:

```
$point_estimate
pfa#> delta
#> 3.264987
```

or by displaying the list of parameters under investigation which is stored in the `$parameters`

field.

`$parameters pfa`

```
#> $delta
#> Delta (quantitative)
#> Range: [?, ?]
#> Point estimate: 3.26
```

In this list, one can see that parameters come with an unknown range by default. We can however compute their bounds by defining a maximum confidence level through the `$set_max_conf_level()`

method of the `PlausibilityFunction`

class. When a plausibility function is instantiated, the default value for the `$max_conf_level`

field is \(0.99\). To set parameter bounds automatically, use the `$set_parameter_bounds()`

method:

```
$set_parameter_bounds(
pfapoint_estimate = pfa$point_estimate,
conf_level = pfa$max_conf_level
)
```

We can now inspect again the list of parameters under investigation to see the updated bounds:

```
$parameters
pfa#> $delta
#> Delta (quantitative)
#> Range: [2.02, 4.7]
#> Point estimate: 3.26
```

Once bounds are known for each parameter, it becomes possible to generate a grid for later evaluating the plausibility function. This is done through the `$set_grid()`

method as follows:

```
$set_grid(
pfaparameters = pfa$parameters,
npoints = 50L
)
```

We can then take a look at the newly created grid:

`$grid pfa`

```
#> # A tibble: 51 × 1
#> delta
#> <dbl>
#> 1 2.02
#> 2 2.07
#> 3 2.12
#> 4 2.17
#> 5 2.22
#> 6 2.27
#> 7 2.32
#> 8 2.37
#> 9 2.42
#> 10 2.47
#> # … with 41 more rows
```

We can go a step further and evaluate the plausibility function on that grid using the `$evaluate_grid()`

method as follows:

`$evaluate_grid(grid = pfa$grid) pfa`

Again, we can then take a look at the updated grid:

```
$grid
pfa#> # A tibble: 51 × 2
#> delta pvalue
#> <dbl> <dbl>
#> 1 2.02 0.0102
#> 2 2.07 0.0137
#> 3 2.12 0.0185
#> 4 2.17 0.0240
#> 5 2.22 0.0317
#> 6 2.27 0.0405
#> 7 2.32 0.0520
#> 8 2.37 0.0659
#> 9 2.42 0.0834
#> 10 2.47 0.104
#> # … with 41 more rows
```

We can add to this grid the p-value computed from the t-test assuming normality of the data:

```
<- pfa$grid %>%
dfa mutate(
pvalue_alt = delta %>%
map_dbl(~ {
t.test(
x = a2,
y = a1,
alternative = "two.sided",
mu = .x,
var.equal = TRUE
$p.value
)
})%>%
) select(
delta, `Parametric Approach` = pvalue_alt,
`Permutation Approach` = pvalue
%>%
) pivot_longer(-delta)
```

```
%>%
dfa filter(value > 1e-3) %>%
ggplot(aes(delta, value, color = name)) +
geom_line() +
geom_hline(
yintercept = 0.05,
color = "black",
linetype = "dashed"
+
) labs(
title = "Scenario A: P-value function for the mean difference",
subtitle = "Using Student's t-statistic and two-tailed p-values",
x = expression(delta),
y = "p-value",
color = "Type of test"
+
) theme_bw() +
theme(legend.position = "top") +
scale_y_log10()
```

However, keep in mind that the parametric Student’s \(t\)-test is only valid for Gaussian data or for large samples invoking the central limit theorem. When sample sizes are low and samples are not drawn from Gaussian distributions, only the permutation approach is valid. When both samples come from non-Gaussian data (here Gamma-distributed samples) and sample sizes are low (here \(10\) for each sample), we get the following \(p\)-value functions (cutting below \(0.001\) since the usual range of useful \(p\)-values is from \(0.01\) and above):

```
<- PlausibilityFunction$new(
pfb null_spec = null_spec,
stat_functions = stat_functions,
stat_assignments = stat_assignments,
b1, b2
)$set_nperms(B)
pfb$set_point_estimate(mean(b2) - mean(b1))
pfb$set_parameter_bounds(
pfbpoint_estimate = pfb$point_estimate,
conf_level = pfb$max_conf_level
)$set_grid(
pfbparameters = pfb$parameters,
npoints = 50L
)$evaluate_grid(grid = pfb$grid) pfb
```

```
<- pfb$grid %>%
dfb mutate(
pvalue_alt = delta %>%
map_dbl(~ {
t.test(
x = b2,
y = b1,
alternative = "two.sided",
mu = .x,
var.equal = TRUE
$p.value
)
})%>%
) select(
delta, `Parametric Approach` = pvalue_alt,
`Permutation Approach` = pvalue
%>%
) pivot_longer(-delta)
```

```
%>%
dfb filter(value > 1e-3) %>%
ggplot(aes(delta, value, color = name)) +
geom_line() +
geom_hline(
yintercept = 0.05,
color = "black",
linetype = "dashed"
+
) labs(
title = "Scenario B: P-value function for the mean difference",
subtitle = "Using Student's t-statistic and two-tailed p-values",
x = expression(delta),
y = "p-value",
color = "Type of test"
+
) theme_bw() +
theme(legend.position = "top") +
scale_y_log10()
```

The parametric \(p\)-value function is clearly shifted in the decreasing part on the right of the plot.

When only one of the samples come from non-Gaussian data (here the second sample is Gamma-distributed and sample sizes are low (here \(10\) for each sample), we get the following \(p\)-value functions (cutting below \(0.001\) since the usual range of useful \(p\)-values is from \(0.01\) and above):

```
<- PlausibilityFunction$new(
pfc null_spec = null_spec,
stat_functions = stat_functions,
stat_assignments = stat_assignments,
c1, c2
)$set_nperms(B)
pfc$set_point_estimate(mean(c2) - mean(c1))
pfc$set_parameter_bounds(
pfcpoint_estimate = pfc$point_estimate,
conf_level = pfc$max_conf_level
)$set_grid(
pfcparameters = pfc$parameters,
npoints = 50L
)$evaluate_grid(grid = pfc$grid) pfc
```

```
<- pfc$grid %>%
dfc mutate(
pvalue_alt = delta %>%
map_dbl(~ {
t.test(
x = c2,
y = c1,
alternative = "two.sided",
mu = .x,
var.equal = TRUE
$p.value
)
})%>%
) select(
delta, `Parametric Approach` = pvalue_alt,
`Permutation Approach` = pvalue
%>%
) pivot_longer(-delta)
```

```
%>%
dfc filter(value > 1e-3) %>%
ggplot(aes(delta, value, color = name)) +
geom_line() +
geom_hline(
yintercept = 0.05,
color = "black",
linetype = "dashed"
+
) labs(
title = "Scenario C: P-value function for the mean difference",
subtitle = "Using Student's t-statistic and two-tailed p-values",
x = expression(delta),
y = "p-value",
color = "Type of test"
+
) theme_bw() +
theme(legend.position = "top") +
scale_y_log10()
```

The situation seems less critical here when only one sample is non-Gaussian but this is only one simulation using a Gamma-distributed second sample. The bias could be more important for other distributions.

Another issue with the parametric approach is that it is very limited in the type of data that one can use. The parametric \(t\)-test for instance is only meant for real-valued data. It will not be applicable to multivariate data or functional data or topological data or network-valued data or any other type of data that is not representable as real-valued data. This further motivates inference via permutation.

Getting access to the \(p\)-value function is key to making inference as point estimation, confidence regions and hypothesis tests all come as a by-product as we will show in the next sections.

One can obtain a point estimate of the parameter under investigation by searching which value of the parameter reaches the maximum of the \(p\)-value function (which is \(1\)). One can use the `$set_point_estimate()`

method to do that:

`$set_point_estimate(overwrite = TRUE) pfa`

The computed point estimate is then stored in the `$point_estimate`

field and can be retrieved as:

```
$point_estimate
pfa#> delta
#> 3.264987
```

One can obtain a confidence interval for the parameter under investigation by searching for which values of the parameter the \(p\)-value function remains above a pre-specified significance level \(\alpha\). This is achieved via the `$set_parameter_bounds()`

method:

```
$set_parameter_bounds(
pfapoint_estimate = pfa$point_estimate,
conf_level = 0.95
)
```

```
$parameters
pfa#> $delta
#> Delta (quantitative)
#> Range: [2.02, 4.7]
#> Point estimate: 3.26
```

Extracting the \(p\)-value for a single hypothesis test in which the null hypothesis is \(H_0: \delta = \delta_0\) is immediate from the \(p\)-value function as it boils down to evaluating the \(p\)-value function in \(\delta_0\). Hence we can for instance test \(H_0: \delta = 3\) against the alternative \(H_1: \delta \ne 3\) using the following piece of code:

```
$set_nperms(1000)
pfa$get_value(3)
pfa#> [1] 0.5714259
```

Fraser, D. A. S. 2019. “The p-Value Function and Statistical Inference.” *The American Statistician* 73 (sup1): 135–47. https://doi.org/10.1080/00031305.2018.1556735.

Infanger, Denis, and Arno Schmidt-Trucksäss. 2019. “P Value Functions: An Underused Method to Present Research Results and to Promote Quantitative Reasoning.” *Statistics in Medicine* 38 (21): 4189–97. https://doi.org/10.1002/sim.8293.

Martin, Ryan. 2017. “A Statistical Inference Course Based on p-Values.” *The American Statistician* 71 (2): 128–36. https://doi.org/10.1080/00031305.2016.1208629.