R package estimability: Support for determining estimability of linear functions

Website:

https://rvlenth.github.io/estimability/

Overview

When a model is rank-deficient, predictions from that model may not be unique. The estimability R package provides a set of tools to assess whether predictions from a rank-deficient model are unique (i.e., are estimable). These tools apply to a rich class of models that includes linear models, generalized linear models, mixed models, and models with more general correlated-error structures (e.g., spatial models, time series models). For further details, see doi:10.32614/RJ-2015-016.

Features

A nonest.basis() function is provided that determines a basis for the null space of a matrix. This may be used in conjunction with is.estble() to determine the estimability (within a tolerance) of a given linear function of the regression coefficients in a linear model.
A set of epredict() methods are provided for lm, glm, and mlm objects. These work just like predict(), except an NA is returned for any cases that are not estimable. This is a useful alternative to the generic warning that “predictions from rank-deficient models are unreliable.”
A function estble.subspace() that projects a set of linear functions onto an
estimable subspace (possibly of smaller dimension). This can be useful in creating a set of estimable contrasts for joint testing.
Package developers may wish to import this package and incorporate estimability checks for their predict methods.

Installation

To install latest version from CRAN, run

install.packages("estimability")

Release notes for the latest CRAN version are found at https://cran.r-project.org/package=estimability/NEWS – or do news(package = "estimability") for notes on the version you have installed.

To install the latest development version from GitHub, have the newest remotes package installed, then run

remotes::install_github("rvlenth/estimability", dependencies = TRUE)

For latest release notes on this GitHub version, see the NEWS file

Example

Suppose we have four predictors x1, x2, x3, and x4, and a response variable y:

x1 <- -4:4
x2 <- c(-2, 1, -1, 2, 0, 2, -1, 1,-2)
x3 <- 3 * x1 - 2 * x2
x4 <- x2 - x1 + 4
y <- 1 + x1 + x2 + x3 + x4 + c(-.5, .5, .5, -.5, 0, .5, -.5, -.5, .5)
dat <- data.frame(x1, x2, x3, x4, y)
head(dat)
#>   x1 x2  x3 x4    y
#> 1 -4 -2  -8  6 -7.5
#> 2 -3  1 -11  8 -3.5
#> 3 -2 -1  -4  5 -0.5
#> 4 -1  2  -7  7  1.5
#> 5  0  0   0  4  5.0
#> 6  1  2  -1  5  8.5

If we fit the following two models, different slope coefficients are estimated as NA depending on the order they are provided to the model formula:

mod1234 <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
mod4321 <- lm(y ~ x4 + x3 + x2 + x1, data = dat)
zapsmall(rbind(
    b1234 = coef(mod1234),
    b4321 = coef(mod4321)[c(1, 5:2)]
))
#>       (Intercept) x1 x2 x3 x4
#> b1234           5  3  0 NA NA
#> b4321         -19 NA NA  3  6

Now suppose our goal is to make predictions of y for the new observations in testset:

testset <- data.frame( 
    x1 = c(3,  6,  6,  0,  0,  1), 
    x2 = c(1,  2,  2,  0,  0,  2), 
    x3 = c(7, 14, 14,  0,  0,  3), 
    x4 = c(2,  4,  0,  4,  0,  4)
)
testset
#>   x1 x2 x3 x4
#> 1  3  1  7  2
#> 2  6  2 14  4
#> 3  6  2 14  0
#> 4  0  0  0  4
#> 5  0  0  0  0
#> 6  1  2  3  4

cbind(
    testset, 
    pred1234 = predict(mod1234, newdata = testset),
    pred4321 = predict(mod4321, newdata = testset)
)
#>   x1 x2 x3 x4 pred1234 pred4321
#> 1  3  1  7  2       14       14
#> 2  6  2 14  4       23       47
#> 3  6  2 14  0       23       23
#> 4  0  0  0  4        5        5
#> 5  0  0  0  0        5      -19
#> 6  1  2  3  4        8       14

Predictions for the first, third, and fourth row are the same for both models, but predictions for the second, fifth, and sixth row are different. This is a problem, as the underlying data provided to the models are the same! Note that R does provide a warning about predictions from the rank-deficient fit (the warning itself is omitted here).

The estimability package remedies these inconsistencies via epredict(), which indicates the rows of testdata that are estimable (with their predictions) or not estimable (with NA):

library(estimability)
cbind(
    testset, 
    pred1234 = epredict(mod1234, newdata = testset),
    pred4321 = epredict(mod4321, newdata = testset)
)
#>   x1 x2 x3 x4 pred1234 pred4321
#> 1  3  1  7  2       14       14
#> 2  6  2 14  4       NA       NA
#> 3  6  2 14  0       23       23
#> 4  0  0  0  4        5        5
#> 5  0  0  0  0       NA       NA
#> 6  1  2  3  4       NA       NA

Other predict() arguments can be passed to epredict(), e.g.,:

epredict(mod1234, newdata = testset, interval = "prediction", level = 0.9)
#>   fit       lwr       upr
#> 1  14 12.715387 15.284613
#> 2  NA        NA        NA
#> 3  23 21.449058 24.550942
#> 4   5  3.817418  6.182582
#> 5  NA        NA        NA
#> 6  NA        NA        NA