1 Introduction

ProSGPV is a package that performs variable selection with Second-Generation P-Values (SGPV). This document illustrates how ProSGPV works with continuous outcomes in linear regression. Technical details about this algorithm can be found at Zuo, Stewart, and Blume (2020).

To install the ProSGPV pacKakge from CRAN, you can do

install("ProSGPV")

Alternatively, you can install a development version of ProSGPV by doing

devtools::install_github("zuoyi93/ProSGPV")

Once the package is installed, we can load the package to the current environment.

library(ProSGPV)

The data set is stored in the ProSGPV package with the name t.housing. The goal is to find important variables associated with the sale prices of real estate units and then build a prediction model. More details about data collection are available in Rafiei and Adeli (2016). There are 26 explanatory variables and one outcome, and variable description is shown below.

Category	Label	Description
Outcome	V9	Actual sales price
Project physical and financial features	V2 V3 V4 V5 V6 V7 V8	Total floor area of the building Lot area Total preliminary estimated construction cost Preliminary estimated construction cost Equivalent preliminary estimated construction cost in a selected base year Duration of construction Price of the unit at the beginning of the project
Economic variables and indices	V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29	The number of building permits issued Building services index for a pre-selected base year Wholesale price index of building materials for the base year Total floor areas of building permits issued by the city/municipality Cumulative liquidity Private sector investment in new buildings Land price index for the base year The number of loans extended by banks in a time resolution The amount of loans extended by banks in a time resolution The interest rate for loan in a time resolution The average construction cost by private sector when completed The average cost of buildings by private sector at the beginning Official exchange rate with respect to dollars Nonofficial (street market) exchange rate with respect to dollars Consumer price index (CPI) in the base year CPI of housing, water, fuel & power in the base year Stock market index Population of the city Gold price per ounce

2 Two-stage algorithm

We can load the data and feed into pro.sgpv function. By default, a two-stage algorithm is run and prints the indices of the selected variables.

x <- t.housing[, -ncol(t.housing)]
y <- t.housing$V9

sgpv.2s <- pro.sgpv(x,y)
sgpv.2s
#> Selected variables are V8 V12 V13 V15 V17 V26

We can print the summary of the linear regression with selected variables with the S3 method summary.

summary(sgpv.2s)
#> 
#> Call:
#> lm(formula = Response ~ ., data = data.d)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1276.35   -75.59    -9.58    59.46  1426.22 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  1.708e+02  3.471e+01   4.920 1.31e-06 ***
#> V8           1.211e+00  1.326e-02  91.277  < 2e-16 ***
#> V12         -2.737e+01  2.470e+00 -11.079  < 2e-16 ***
#> V13          2.185e+01  2.105e+00  10.381  < 2e-16 ***
#> V15          2.041e-03  1.484e-04  13.756  < 2e-16 ***
#> V17         -3.459e+00  8.795e-01  -3.934  0.00010 ***
#> V26         -4.683e+00  1.780e+00  -2.630  0.00889 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 194.8 on 365 degrees of freedom
#> Multiple R-squared:  0.9743, Adjusted R-squared:  0.9739 
#> F-statistic:  2310 on 6 and 365 DF,  p-value: < 2.2e-16

Coefficient estimates can be extracted by use of S3 method coef. Note that it returns a vector of length \(p\).

coef(sgpv.2s)
#>  [1]   0.000000000   0.000000000   0.000000000   0.000000000   0.000000000
#>  [6]   0.000000000   1.210755031   0.000000000 -27.367601037  21.853920174
#> [11]   0.000000000   0.002040784   0.000000000  -3.459496972   0.000000000
#> [16]   0.000000000   0.000000000   0.000000000   0.000000000   0.000000000
#> [21]   0.000000000   0.000000000  -4.683172725   0.000000000   0.000000000
#> [26]   0.000000000

In-sample prediction can be made using S3 method predict and an external sample can be provided to make out-of-sample prediction with an argument of newdata in the predict function.

head(predict(sgpv.2s))
#>         1         2         3         4         5         6 
#> 1565.7505 3573.7793  741.7576  212.1297 5966.1682 5724.0172

The ProSGPV selection path can be extracted by use of S3 method plot. lambda.max argument controls the range of \(\lambda\). The black vertical dotted line is the \(\lambda\) selected by generalized information criterion (Fan and Tang (2013)). The null zone is the grey shaded region near 0. The blue labels on the Y-axis are the selected variables.

plot(sgpv.2s,lambda.max = 0.005)

By default, three lines per variables are provided. You can also choose to view only one bound per variable by setting lpv argument to 1, where the one bound is the confidence bound that is closer to 0.

plot(sgpv.2s, lambda.max=0.005, lpv=1)

3 One-stage algorithm

One-stage algorithm is available when \(n>p\) but may have reduced support recovery rate and higher parameter estimation bias. Its advantage is its fast computation speed and its result being fixed for a given data set.

sgpv.1s <- pro.sgpv(x,y,stage=1)
sgpv.1s
#> Selected variables are V8 V12 V13 V15 V17 V25 V26

Note that the one-stage algorithm selects one more variable than the two-stage algorithm.

S3 methods summary, coef, predict and plot are available for the one-stage algorithm. Particularly, plot(sgpv.1s) would presents the variable selection results in the full model. Point estimates and 95% confidence intervals are shown for each variable, and the null bounds are shown in green vertical bars. Selected variables are colored in blue.

plot(sgpv.1s)

4 High dimensional case where \(p>n\)

ProSGPV also works when \(p>n\). This section will show you an example in linear regression. Simulated data are generated by use of function gen.sim.data. gen.sim.data can generate Gaussian, Binomial, Poisson, and survival data. Details can be found in the corresponding help mannual. In this section, we generated 100 observations with 200 variables in the design matrix, where only four are signals. The goal is to recover the support of those four variables.

set.seed(30)
data.linear <- gen.sim.data(n=100, p=200, s=4)

# explanatory variables
x <- data.linear[[1]]

# outcome
y <- data.linear[[2]]

# true support
(true.index <- data.linear[[3]])
#> [1]  26 114 118 190

# true coefficients
true.beta <- data.linear[[4]]

ProSGPV can identify the true support.

h.sgpv <- pro.sgpv(x,y)
h.sgpv
#> Selected variables are V26 V114 V118 V190

Similar to the low-dimensional case, S3 methods summary, coef, predict, and plot are available. Below we show the selection path.

png("vignettes/assets/linear.fig.4.png", units="in", width=7, height=7, res=300)
plot(h.sgpv)
dev.off()

5 Way to address high correlation or dense signals

When the design matrix has high within correlation, or signals are known to be dense in the data, the ProSGPV algorithm yields a null bound that is slightly higher than the noise level, which subsequently affects the support recovery performance by missing some small true effects. One way to address this issue is to replace the constant null bound in the ProSGPV with a generalized variance inflation factor (GVIF)-adjusted null bound. Please see Fox and Monette (1992) for more details on how to calculate the GVIF for each variable in the design matrix. Essentially, we deflate the coefficient estimate standard error of each variable by its GVIF, and thus produce a smaller null bound, which includes more variables in the final selection set. This adjustment is found to be helpful when signals are dense, too.

References

Fan, Yingying, and Cheng Yong Tang. 2013. “Tuning Parameter Selection in High Dimensional Penalized Likelihood.” Journal of the Royal Statistical Society: SERIES B: Statistical Methodology, 531–52.

Fox, John, and Georges Monette. 1992. “Generalized Collinearity Diagnostics.” Journal of the American Statistical Association 87 (417): 178–83.

Rafiei, Mohammad Hossein, and Hojjat Adeli. 2016. “A Novel Machine Learning Model for Estimation of Sale Prices of Real Estate Units.” Journal of Construction Engineering and Management 142 (2): 04015066.

Zuo, Yi, Thomas G Stewart, and Jeffrey D Blume. 2020. “Variable Selection with Second-Generation p-Values.” arXiv Preprint arXiv:2012.07941.