---
title: "Visualizing regression model predictions"
author: "Jacob Long"
date: "`r Sys.Date()`"
output:
# html_document:
# toc: true
# toc_float: true
# theme: "spacelab"
html_vignette
vignette: >
%\VignetteIndexEntry{Visualizing regression model predictions}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r echo=FALSE}
required <- c("MASS")
if (!all(sapply(required, requireNamespace, quietly = TRUE)))
knitr::opts_chunk$set(eval = FALSE)
knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.width = 6,
                      fig.height = 4, dpi = 125, render = knitr::normal_print)
library(jtools)
```
One great way to understand what your regression model is telling you is to
look at what kinds of predictions it generates. The most straightforward way
to do so is to pick a predictor in the model and calculate predicted values
across values of that predictor, holding everything else in the model equal.
This is what Fox and Weisberg (2018) call a "predictor effect display."
# Linear model example
To illustrate, let's create a model using the `mpg` data from the `ggplot2`
package. These data comprise information about 234 cars over several years.
We will be predicting the gas mileage in cities (`cty`) using several variables,
including engine displacement (`displ`), model year (`year`), number of engine
cylinders (`cyl`), class of car (`class`), and fuel type (`fl`).
Here's a model summary courtesy of `summ`:
```{r}
library(ggplot2)
data(mpg)
fit <- lm(cty ~ displ + year + cyl + class + fl, data = mpg[mpg$fl != "c",])
summ(fit)
```
Let's explore the effect of engine displacement on gas mileage:
```{r}
effect_plot(fit, pred = displ)
```
To be clear, these predictions set all the continuous variables other than
`displ` to their mean value. This can be tweaked via the `centered` argument
("none" or a vector of variables to center are options). Factor variables are
set to their base level and logical variables are set to `FALSE`.
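To make the "holding everything else equal" step concrete, here is a minimal base R sketch of roughly the same calculation. It is only an illustration under stated assumptions, not a description of how `effect_plot` works internally. The names `dat`, `fit_manual`, `displ_grid`, and `new_data` are made up for this example, and it assumes `ggplot2` is installed so the `mpg` data are available.

```{r}
# Illustration only: refit the model from above under a different name
data(mpg, package = "ggplot2")
dat <- as.data.frame(mpg[mpg$fl != "c", ])
fit_manual <- lm(cty ~ displ + year + cyl + class + fl, data = dat)

# 100 evenly spaced values of the focal predictor
displ_grid <- seq(min(dat$displ), max(dat$displ), length.out = 100)

# Continuous controls at their means; character controls at their base
# level (the first level after sorting, which is what lm() uses)
new_data <- data.frame(
  displ = displ_grid,
  year  = mean(dat$year),
  cyl   = mean(dat$cyl),
  class = sort(unique(dat$class))[1],
  fl    = sort(unique(dat$fl))[1]
)

# Predicted city mileage along the grid
new_data$cty_hat <- predict(fit_manual, newdata = new_data)
head(new_data[, c("displ", "cty_hat")])
```

Plotting `cty_hat` against `displ` should give a straight line, since only `displ` varies across the rows of `new_data`.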
So this plot, in this case, is not *super* illuminating. Let's see the
uncertainty around this line.
```{r}
effect_plot(fit, pred = displ, interval = TRUE)
```
Now we're getting somewhere.
If you want to get a feel for how the data are distributed,
you can add what is known as a rug plot.
```{r}
effect_plot(fit, pred = displ, interval = TRUE, rug = TRUE)
```
For a more direct look at how the model relates to the observed data, you can
use the `plot.points = TRUE` argument.
```{r}
effect_plot(fit, pred = displ, interval = TRUE, plot.points = TRUE)
```
Now we're really learning something about our model---and things aren't looking
great. It seems like a simple linear model may not even be appropriate. Let's
try fitting a polynomial for the `displ` term to capture that curvature.
```{r}
fit_poly <- lm(cty ~ poly(displ, 2) + year + cyl + class + fl, data = mpg)
effect_plot(fit_poly, pred = displ, interval = TRUE, plot.points = TRUE)
```
Okay, now we're getting closer even though the predicted line curiously
grazes over the top of most of the observed data. Before we panic, let's
introduce another feature that might clear things up.
# Partial residuals plots
In complex regressions like the one in this running example, plotting the
observed data can sometimes be relatively uninformative because the points seem
to be all over the place. While the typical effects plot shows predicted values
of `cty` across different values of `displ`, I included a lot of predictors
besides `displ` in this model and they may be accounting for some of this
variation. This is what
*partial residual plots* are designed to help with. Using the argument
`partial.residuals = TRUE`, what is plotted instead is the
observed data *with the effects of all the control variables accounted for*.
In other words, the value of `cty` for the observed data is based only on the
values of `displ` and the model error. Let's take a look.
```{r}
effect_plot(fit_poly, pred = displ, interval = TRUE, partial.residuals = TRUE)
```
There we go! Our polynomial term for `displ` is looking much better now.
You could tell in the previous plot without partial residuals that the *shape*
of the predictions was about right, but the predicted line was just too high.
The partial residuals set all those controls to the same value, which shifted
the observations up (in this case) to where the predictions are. That means
the model does a good job of explaining away that discrepancy and we can see
more clearly that the polynomial term for `displ` works better than a linear
main effect.
You can learn more about the technique and theory in Fox and Weisberg (2018).
Another place to generate partial residual plots is in Fox's `effects` package.
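If you want to see the idea in code, here is a rough base R sketch of one way to compute partial residuals for `displ`: predict each car's mileage at its own `displ` with the controls fixed at reference values, then add back that car's residual. This is a sketch under my own assumptions (controls at their means or base levels), and `jtools` may differ in the details. The names `fit_pr`, `ref`, `pred_ref`, and `partial_cty` are invented for this example.

```{r}
data(mpg, package = "ggplot2")
dat <- as.data.frame(mpg)
fit_pr <- lm(cty ~ poly(displ, 2) + year + cyl + class + fl, data = dat)

# Copy of the data with every control fixed at a single reference value
ref <- dat
ref$year  <- mean(dat$year)
ref$cyl   <- mean(dat$cyl)
ref$class <- sort(unique(dat$class))[1]
ref$fl    <- sort(unique(dat$fl))[1]

# Prediction at each car's own displ with controls held fixed, plus that
# car's residual from the full model
pred_ref    <- predict(fit_pr, newdata = ref)
partial_cty <- pred_ref + residuals(fit_pr)

# Partial residuals as points, reference-value predictions as a line
plot(dat$displ, partial_cty, xlab = "displ", ylab = "cty (partial residual)")
ord <- order(dat$displ)
lines(dat$displ[ord], pred_ref[ord])
```

By construction, the points scatter around the line by exactly the model's residuals, which is why the fit looks so much tighter than it does against the raw data.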
# Generalized linear models
Plotting can be even more essential to understanding models like
GLMs (e.g., logit, probit, Poisson).
## Logit and probit
We'll use the `bacteria` data from the `MASS` package to explore binary
dependent variable models. These data come from a study in which children
with a bacterial illness were provided with either an active drug or placebo
and some were given extra encouragement to take the medicine by the
doctor. These conditions are represented by the `trt` variable. Patients
were checked for presence or absence of the bacteria (`y`) every few weeks
(`week`).
```{r}
library(MASS)
data(bacteria)
l_mod <- glm(y ~ trt + week, data = bacteria, family = binomial)
summ(l_mod)
```
Let's check out the effect of time.
```{r}
effect_plot(l_mod, pred = week, interval = TRUE, y.label = "% testing positive")
```
As time goes on, fewer patients test positive for the bacteria.
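For the curious, here is a base R sketch of how predictions like these can be put on the probability scale: predict on the link (log-odds) scale, build the interval there, and then transform with `plogis()`. This is a common approach and I offer it as an illustration, not as a statement about `effect_plot`'s internals. The names `l_sketch` and `grid` are made up for this example.

```{r}
library(MASS)
data(bacteria)
l_sketch <- glm(y ~ trt + week, data = bacteria, family = binomial)

# Weeks spanning the observed range, treatment at its base level
grid <- data.frame(
  week = seq(min(bacteria$week), max(bacteria$week), length.out = 50),
  trt  = factor(levels(bacteria$trt)[1], levels = levels(bacteria$trt))
)

# Predict on the link (log-odds) scale, with standard errors
link <- predict(l_sketch, newdata = grid, type = "link", se.fit = TRUE)
z <- qnorm(0.975)

# Transform the estimate and the 95% interval bounds to probabilities
grid$prob <- plogis(link$fit)
grid$lwr  <- plogis(link$fit - z * link$se.fit)
grid$upr  <- plogis(link$fit + z * link$se.fit)
head(grid)
```

Building the interval on the link scale and transforming afterwards keeps the bounds between 0 and 1, which a symmetric interval on the probability scale would not guarantee.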
## Poisson
For a Poisson example, we'll use the `Insurance` data from the `MASS` package.
We're predicting the number of car insurance claims for people with different
combinations of car type, region, and age. While `Age` is an ordered factor,
I'll convert it to a continuous variable for the sake of demonstration.
`Claims` is a count variable, so the Poisson distribution is an appropriate
modeling approach.
```{r}
library(MASS)
data(Insurance)
Insurance$age_n <- as.numeric(Insurance$Age)
p_mod <- glm(Claims ~ District + Group + age_n, data = Insurance,
offset = log(Holders), family = poisson)
summ(p_mod)
```
Okay, age is a significant predictor of the number of claims. Note that
we have an offset term, so the count we're predicting is more like a *rate*.
That is, we are modeling how many claims there are adjusting for the number
of policyholders.
```{r}
effect_plot(p_mod, pred = age_n, interval = TRUE)
```
So what does this mean? The key is understanding the scale of the outcome
variable. Because of the offset, we must pick a value of the offset at which to
generate predictions. `effect_plot`, by default, sets the offset at 1.
That means the predictions you see can be interpreted as a proportion: for
every policyholder, there are between 0.10 and 0.16 claims. We can also see
that as age goes up, the proportion of policyholders with claims goes down.
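To see what the choice of offset does, here is a base R sketch that predicts the expected number of claims for 1 policyholder and for 100 policyholders. It assumes the other predictors sit at their base levels. `p_sketch`, `base_level()`, and `make_grid()` are hypothetical names created just for this illustration.

```{r}
library(MASS)
data(Insurance)
Insurance$age_n <- as.numeric(Insurance$Age)
p_sketch <- glm(Claims ~ District + Group + age_n, data = Insurance,
                offset = log(Holders), family = poisson)

# Hypothetical helper: a one-value factor at the base level of x
base_level <- function(x) {
  factor(levels(x)[1], levels = levels(x), ordered = is.ordered(x))
}

# Hypothetical helper: prediction grid for a given number of policyholders
make_grid <- function(holders) {
  data.frame(
    District = base_level(Insurance$District),
    Group    = base_level(Insurance$Group),
    age_n    = 1:4,
    Holders  = holders
  )
}

# Expected claims when the offset variable is 1 versus 100 policyholders
per_holder  <- predict(p_sketch, newdata = make_grid(1), type = "response")
per_hundred <- predict(p_sketch, newdata = make_grid(100), type = "response")
cbind(age_n = 1:4, per_holder, per_hundred)
```

The second column is just the first multiplied by 100: with a log link, the offset only rescales the predicted counts, so setting it to 1 gives claims per policyholder.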
Now let's take a look at the observed data...
```{r}
effect_plot(p_mod, pred = age_n, interval = TRUE, plot.points = TRUE)
```
Oops! That doesn't look right, does it? The problem here is the offset. Some
age groups have many more policyholders than others and they all have more
than 1, which is what we set the offset to for the predictions. This is a more
extreme version of the problem we had with the linear model previously, so
let's use the same solution: partial residuals.
```{r}
effect_plot(p_mod, pred = age_n, interval = TRUE, partial.residuals = TRUE)
```
Now we're getting somewhere. The only difficulty in interpreting this is the
overlapping points. Let's use the `jitter` argument to add a random bit of
noise to each observation so we can see more clearly just how many points
there are.
```{r}
effect_plot(p_mod, pred = age_n, interval = TRUE, partial.residuals = TRUE,
jitter = c(0.1, 0))
```
Because I didn't want to alter the height of the points, I provided a vector
with both 0.1 (referring to the horizontal position) and 0 (referring to the
vertical position) to the `jitter` argument.
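If you are more used to plain `ggplot2`, the same kind of horizontal-only jitter can be sketched with `position_jitter()`. This example plots raw claim rates, not partial residuals, and makes no claim about how `effect_plot` applies jitter internally. The `claims` data frame and the `p_jit` object are made up for this illustration.

```{r}
library(ggplot2)
library(MASS)
data(Insurance)

# Raw claims per policyholder by (numeric) age group
claims <- data.frame(
  age_n = as.numeric(Insurance$Age),
  rate  = Insurance$Claims / Insurance$Holders
)

# width moves points left/right by up to 0.1; height = 0 leaves y untouched
p_jit <- ggplot(claims, aes(x = age_n, y = rate)) +
  geom_point(position = position_jitter(width = 0.1, height = 0, seed = 1))
p_jit
```

Setting `seed` makes the jitter reproducible, which is handy if you want the plot to look the same every time the document is rebuilt.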
# Categorical predictors
These methods don't work as clearly when the predictor isn't continuous.
Luckily, `effect_plot` automatically handles such cases and offers a number of
options for visualizing effects of categorical predictors.
Using our first example, predicting gas mileage, let's focus on fuel type
as the predictor.
```{r}
effect_plot(fit, pred = fl, interval = TRUE)
```
We can clearly see how diesel ("d") is associated with the best mileage by far
and ethanol ("e") the worst by a little bit.
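The numbers behind a plot like this can be approximated by hand. Here is a base R sketch that predicts `cty` once per fuel type, with the continuous controls at their means and `class` at its base level, along with 95% confidence intervals. The names `fit_cat`, `fl_grid`, and `fl_pred` are invented for this example, and the sketch is not meant to reproduce `effect_plot`'s internals exactly.

```{r}
data(mpg, package = "ggplot2")
dat <- as.data.frame(mpg[mpg$fl != "c", ])
fit_cat <- lm(cty ~ displ + year + cyl + class + fl, data = dat)

# One row per fuel type; continuous controls at their means and `class`
# at its base level
fl_grid <- data.frame(
  fl    = sort(unique(dat$fl)),
  displ = mean(dat$displ),
  year  = mean(dat$year),
  cyl   = mean(dat$cyl),
  class = sort(unique(dat$class))[1]
)

# Point predictions (fit) with 95% confidence bounds (lwr, upr)
fl_pred <- cbind(
  fl_grid["fl"],
  predict(fit_cat, newdata = fl_grid, interval = "confidence")
)
fl_pred
```

Because only `fl` changes from row to row, the gaps between the predictions are just the `fl` coefficients from the model summary.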
You can plot the observed data in these types of plots as well:
```{r}
effect_plot(fit, pred = fl, interval = TRUE, plot.points = TRUE,
jitter = .2)
```
These seem a bit far off from the predictions.
Let's see if the partial residuals are a little more in line with expectations.
```{r}
effect_plot(fit, pred = fl, interval = TRUE, partial.residuals = TRUE,
jitter = .2)
```
Now things make a little more sense and you can see the range of possibilities
within each category after accounting for model year and so on. Diesel in
particular seems to have too few and too spaced out observations to take overly
seriously.
Let's also look at the bacteria example, using treatment type as the
predictor of interest.
```{r}
effect_plot(l_mod, pred = trt, interval = TRUE, y.label = "% testing positive")
```
Now we can see that receiving the drug is clearly superior to placebo, but
the drug plus encouragement is not only no better than the drug alone, it's
hardly better than placebo. Of course, we can also tell that the confidence
intervals are fairly wide, so I won't say that these data tell us anything
definitively besides the superiority of the drug over placebo.
You may also want to use lines to convey the ordered nature of this predictor.
```{r}
effect_plot(l_mod, pred = trt, interval = TRUE, y.label = "% testing positive",
cat.geom = "line")
```