Lucas County Housing Example

Houjian Hou

Scope

This vignette walks through the full SS-SVCQR workflow on the small Lucas County housing subset shipped with the package (lucas_housing_sample, \(n = 150\)). The aim is to make every step of the case study reproducible without having to download external data: the model is the same one used in the JSS software paper but on a subset rather than the full \(n = 25{,}357\) panel. The replication script at the project root reproduces the full-sample case study from data/lucas_housing_clean.csv.

The hedonic interpretation is conventional: log sale price is regressed on global controls (log total living area, log lot size, sale-year indicators) and on candidate spatially varying age effects. Because the Lucas County region is geographically compact, longitude and latitude serve as a pragmatic spatial index for graph construction; for larger study regions, projected coordinates should be used.

Loading the package subset

library("sssvcqr")
data("lucas_housing_sample")
housing <- lucas_housing_sample
str(housing)
#> 'data.frame':    150 obs. of  8 variables:
#>  $ log_price  : num  11.9 11 11.4 11.1 11.3 ...
#>  $ log_TLA    : num  7.92 6.51 7.45 7.29 7.86 ...
#>  $ log_lotsize: num  10.69 8.48 9.9 8.78 8.82 ...
#>  $ age_scaled : num  0.09 0.56 0.76 0.63 0.9 0.71 0.99 0.63 1.07 0.49 ...
#>  $ age2_scaled: num  0.0081 0.3136 0.5776 0.3969 0.81 ...
#>  $ longitude  : num  -83.8 -83.5 -83.7 -83.6 -83.6 ...
#>  $ latitude   : num  41.7 41.7 41.6 41.6 41.7 ...
#>  $ sale_year  : int  1994 1997 1997 1997 1996 1995 1996 1996 1997 1998 ...
summary(housing[, c("log_price", "log_TLA", "log_lotsize", "age_scaled")])
#>    log_price         log_TLA       log_lotsize       age_scaled    
#>  Min.   : 8.434   Min.   :6.275   Min.   : 7.601   Min.   :0.0300  
#>  1st Qu.:10.612   1st Qu.:7.020   1st Qu.: 8.434   1st Qu.:0.3700  
#>  Median :11.097   Median :7.207   Median : 8.716   Median :0.5650  
#>  Mean   :10.989   Mean   :7.217   Mean   : 8.887   Mean   :0.5703  
#>  3rd Qu.:11.460   3rd Qu.:7.416   3rd Qu.: 9.113   3rd Qu.:0.7600  
#>  Max.   :12.801   Max.   :8.595   Max.   :12.257   Max.   :1.1700

Assembling y, Z, X, and coordinates

The matrix interface separates always-global controls (Z) from candidate spatially varying covariates (X). The coordinates (u) drive both the proximity graph and the location-indexed deviation fields.

y <- housing$log_price
Z <- model.matrix(~ log_TLA + log_lotsize + sale_year, data = housing)
X <- as.matrix(housing[, c("age_scaled", "age2_scaled")])
u <- scale(as.matrix(housing[, c("longitude", "latitude")]))
c(n = nrow(housing), q = ncol(Z), p = ncol(X))
#>   n   q   p 
#> 150   4   2

Inspecting graph choices

It is good practice to query the graph before fitting, both to check that the proximity structure is connected and to inspect the degree distribution. build_graph_laplacian() returns the sparse adjacency, the degree vector, the chosen Laplacian, and component membership.

graph <- build_graph_laplacian(u, k = 8L)
length(graph$components_list)
#> [1] 1
range(graph$D_vec)
#> [1] 1.678900e-21 8.446148e+00

Fitting SS-SVCQR

The fit below uses fixed penalties for speed; the JSS software paper tunes \((\lambda_1, \lambda_2)\) by spatially blocked cross-validation on the full sample. Tighter ADMM tolerances are appropriate for the moderate sample sizes encountered in this vignette.

fit <- ss_svcqr(
  y = y, Z = Z, X = X, u = u,
  tau = 0.5,
  lambda1 = 3, lambda2 = 1, k_nn = 8,
  control = list(max_iter = 100, warn_nonconvergence = FALSE)
)
summary(fit)
#> Sparse-smooth SVC quantile regression summary
#>   n = 150  q = 4  p = 2  tau = 0.5 
#>   lambda1 = 3  lambda2 = 1 
#>   iterations = 70  converged = TRUE 
#> 
#> alpha:
#> [1] -67.52811362   0.92689808   0.11572866   0.03569393
#> beta_G:
#> [1]  0.2598358 -1.2339241
#> delta L2 norms:
#> [1] 0.5186381 0.0000000

The deviation L2 norms make the global-versus-local decision explicit: a norm at exact zero means the group penalty has classified the corresponding candidate as global.

Visualizing the fitted local coefficient

For a single candidate effect, the package’s plot() method renders the local total coefficient surface over the first two coordinate columns. Inverse-distance-weighted interpolation gives a smooth visual summary; the observed locations are overlaid as small reference marks.

plot(fit, type = "coefficient", index = 1)

Spatially blocked cross-validation

For real analyses, the penalties should be tuned rather than fixed by hand. The example below evaluates a small grid with three spatial folds on this subset; empirical applications should use a broader grid and more iterations.

cv <- cv_ss_svcqr(
  y = y, Z = Z, X = X, u = u,
  tau = 0.5,
  lambda1_seq = c(2, 3),
  lambda2_seq = c(0.5, 1),
  K_folds = 3, adaptive_weights = FALSE,
  control = list(max_iter = 25, warn_nonconvergence = FALSE)
)
cv
#> Spatially blocked CV for SS-SVCQR
#>   tau = 0.5 
#>   best lambda1 = 2 
#>   best lambda2 = 0.5 
#>   best mean check loss = 0.1643256

KKT diagnostics

kkt_sssvcqr() returns first-order optimality summaries and the maximum violation of the per-component degree-weighted centering constraints. Both should be small after a converged fit.

kkt <- kkt_sssvcqr(y, Z, X, fit)
signif(kkt$max_violation, 3)
#> [1] 2000
signif(kkt$max_centering_violation, 3)
#> [1] 4.16e-16

Predicting at new locations

The predict() method returns fitted conditional quantiles when given new \(Z\), \(X\), and \(u\) (type = "response", the default), or local coefficient surfaces when called with type = "coefficients". New-location deviations are extrapolated by inverse-distance-weighted averaging of the \(k\) nearest training deviations.

unew <- u[1:3, , drop = FALSE]
round(predict(fit,
  Znew = Z[1:3, , drop = FALSE],
  Xnew = X[1:3, , drop = FALSE],
  unew = unew), 3)
#> [1] 12.235 10.549 11.320
round(predict(fit, type = "coefficients")[1:3, ], 3)
#>       [,1]   [,2]
#> [1,] 0.254 -1.234
#> [2,] 0.300 -1.234
#> [3,] 0.306 -1.234

Where to look next