ProSGPV in GLM and Cox models

Introduction

ProSGPV is a package that performs variable selection with Second-Generation P-Values (SGPV). This document illustrates how ProSGPV performs variable selection with data from generalized linear models (GLM) and Cox proportional hazards models. Technical details about this algorithm can be found at Zuo, Stewart, and Blume (2020).

We will use Logistic regression to illustrate how ProSGPV selects variables with a low-dimensional (\(n>p\)) real-world data example and how to choose models when each run gives you slightly different results. We will use Poisson regression to show how ProSGPV works with simulated high-dimensional Poisson data and a visualization tool for the selection process. We will use Cox proportional hazards model to show how a fast one-stage algorithm works with simulated data together with a visualization tool for the one-stage ProSGPV.

To install the ProSGPV pacKakge from CRAN, you can do

install("ProSGPV")

Alternatively, you can install a development version of ProSGPV by doing

devtools::install_github("zuoyi93/ProSGPV")

Once the package is installed, we can load the package to the current environment.

library(ProSGPV)
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loaded glmnet 4.1-2
#> Loading required package: brglm2
#> Package 'ProSGPV' version 1.0.0

GLM examples

Logistic regression

Traditionally, maximum likelihood (ML) has been used to fit a Logistic model. However, when sample size is small or effect sizes are strong, complete/quasi-complete separation may happen, which may inflate the parameter estimation bias by a lot (Albert and Anderson 1984; Rahman and Sultana 2017). Recently, Kosmidis and Firth (2021) proposed to use a Jeffreys prior to address the separation issue in the Logistic regression, and that is implemented in ProSGPV package.

In this section, we will use a real-world data example to showcase how ProSGPV performs variable selection when the outcome is binary. Lower back pain can be caused by a variety of problems with any parts of the complex, interconnected network of spinal muscles, nerves, bones, discs or tendons in the lumbar spine. Dr. Henrique da Mota collected 12 biomechanical attributes from 310 patients, of whom 100 are normal and 210 are abnormal (Disk Hernia or Spondylolisthesis). The goal is to differentiate the normal patients from the abnormal using those 12 variables. The biomechanical attributes include pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius, degree of spondylolisthesis, pelvic slope, direct tilt, thoracic slope cervical tilt, sacrum angle, and scoliosis slope. The data set spine can be accessed in the ProSGPV package.

x <- spine[,-ncol(spine)]
y <- spine[,ncol(spine)]

sgpv.2s.l <- pro.sgpv(x,y,family="binomial") 
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred

sgpv.2s.l
#> Selected variables are sacral_slope pelvic_radius degree_spondylolisthesis

ProSGPV selects three out of 12 variables and they are acral slope, pelvic radius, and degree of spondylolisthesis.

We can get a summary of the model that we selected.

summary(sgpv.2s.l)
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
#> 
#> Call:
#> glm(formula = Response ~ ., family = "binomial", data = data.d, 
#>     method = "brglmFit", type = "MPL_Jeffreys")
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.78904  -0.48146   0.04248   0.33705   2.26768  
#> 
#> Coefficients:
#>                          Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)              17.42056    3.15274   5.526 3.28e-08 ***
#> sacral_slope             -0.11300    0.02214  -5.105 3.31e-07 ***
#> pelvic_radius            -0.11716    0.02254  -5.197 2.03e-07 ***
#> degree_spondylolisthesis  0.15797    0.02162   7.306 2.75e-13 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 389.86  on 309  degrees of freedom
#> Residual deviance: 184.42  on 306  degrees of freedom
#> AIC: 192.42
#> 
#> Number of Fisher Scoring iterations: 5

Coefficient estimates can be extracted by using the coef function.

coef(sgpv.2s.l)
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
#>  [1]  0.0000000  0.0000000  0.0000000 -0.1130012 -0.1171616  0.1579717
#>  [7]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000

Prediction in probability can be made by using the predict function.

head(predict(sgpv.2s.l))
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
#>         1         2         3         4         5         6 
#> 0.7765972 0.8116983 0.3053596 0.9012994 0.8132581 0.3842353

Poisson regression

gen.sim.data function is used to generate some high-dimensional (\(p>n\)) simulation data from Poisson distribution in this section. There are many arguments in the gen.sim.data function, as it can generate data with continuous, binary, count, and survival outcomes.
n is the number of observations, p is the number of variables, s is the number of true signals, family can take the value of "gaussian", "binomial", "poisson", and "cox", beta.min is the smallest effect size in absolute value, beta.max is the largest effect size, rho is the autocorrelation coefficient in the design matrix, nu is the signal-to-noise ratio in the continuous outcome case, sig is the standard deviation in the covariance matrix of the design matrix, intercept is used in the linear mean vector of GLM, scale and shape are parameters in Weibull survival time, and rateC is the rate of censoring.

In this section, we simulate 80 observations with 100 explanatory variables and one count outcome. The number of true signals is four. The smallest log rate ratio is 0.5 and the largest log rate ratio is 1.5. When \(p>n\), only the two-stage ProSGPV will be used, even if stage is set to be 1.

set.seed(1)
data.log <- gen.sim.data(n=80,p=100,s=4,beta.min=0.5,beta.max=1.5,family="poisson")

x <- data.log[[1]]
y <- data.log[[2]]
(true.index <- data.log[[3]])
#> [1]   3  43  89 100
(true.beta <- data.log[[4]])
#>   [1]  0.0000000  0.0000000  0.5000000  0.0000000  0.0000000  0.0000000
#>   [7]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [13]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [19]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [25]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [31]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [37]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [43]  0.8333333  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [49]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [55]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [61]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [67]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [73]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [79]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [85]  0.0000000  0.0000000  0.0000000  0.0000000 -1.1666667  0.0000000
#>  [91]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [97]  0.0000000  0.0000000  0.0000000 -1.5000000

sgpv.2s.p <- pro.sgpv(x,y,family="poisson")
sgpv.2s.p
#> Selected variables are V3 V43 V89 V100

We see that the two-stage ProSGPV recovered the true support.

Similarly, the summary of the final model is available by calling summary.

summary(sgpv.2s.p)
#> 
#> Call:
#> glm(formula = Response ~ ., family = "poisson", data = data.d)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.7468  -0.6263  -0.2324   0.3456   1.7695  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.15872    0.12192  -1.302    0.193    
#> V3           0.59830    0.04679  12.786   <2e-16 ***
#> V43          0.89750    0.06422  13.975   <2e-16 ***
#> V89         -1.15970    0.05135 -22.583   <2e-16 ***
#> V100        -1.57528    0.08177 -19.265   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for poisson family taken to be 1)
#> 
#>     Null deviance: 1018.60  on 79  degrees of freedom
#> Residual deviance:   47.37  on 75  degrees of freedom
#> AIC: 243.92
#> 
#> Number of Fisher Scoring iterations: 5

Coefficients can be extracted by coef. We see the estimates are pretty close to the truth.

coef(sgpv.2s.p)
#>   [1]  0.0000000  0.0000000  0.5982981  0.0000000  0.0000000  0.0000000
#>   [7]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [13]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [19]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [25]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [31]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [37]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [43]  0.8974952  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [49]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [55]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [61]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [67]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [73]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [79]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [85]  0.0000000  0.0000000  0.0000000  0.0000000 -1.1597040  0.0000000
#>  [91]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
#>  [97]  0.0000000  0.0000000  0.0000000 -1.5752756

In-sample prediction can be done as follows.

head(predict(sgpv.2s.p))
#>          1          2          3          4          5          6 
#> 0.09056540 1.52559015 0.09082897 2.36900015 3.64998025 2.35678653

Out-of-sample prediction is also available by feeding data into the newdata argument.

set.seed(1)
data.log <- gen.sim.data(n=10,p=100,s=4,beta.min=0.5,beta.max=1.5,family="poisson")

x.new <- data.log[[1]]
y.new <- data.log[[2]]

data.frame(Observed=y.new,Predicted=predict(sgpv.2s.p,newdata=x.new))
#>    Observed  Predicted
#> 1         1  0.2511760
#> 2         0  0.7257742
#> 3         0  0.2668769
#> 4         7  6.4782805
#> 5         9 10.2552082
#> 6         0  0.2627837
#> 7         1  1.0470441
#> 8         0  0.4297897
#> 9         4  1.2498434
#> 10        7  5.5053366

The prediction agrees with the truth well.

The selection process can be visualized below. lambda.max sets the limit of X-axis. Only effects whose lower bound doesn’t reach the null regions are kept. Selected variables are labeled blue on the Y-axis. The vertical dotted line is the \(\lambda\) selected by generalized information criterion (Fan and Tang (2013)).

plot(sgpv.2s.p,lambda.max = 0.5)

You can also chooses to view only one line per variable by setting lpv to be 1. That line is the 95% confidence bound that is closer to the null.

plot(sgpv.2s.p,lambda.max = 0.5,lpv=1)