---
title: "Choosing K and the denoising parameters"
output:
  rmarkdown::html_vignette: default
vignette: >
  %\VignetteIndexEntry{Choosing K and the denoising parameters}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r cdk-knit-opts, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6, fig.height = 4,
  fig.align = "center"
)
```

```{r setup}
library(MetaHunt)
set.seed(1)
```

This vignette focuses on two practical knobs in the MetaHunt
pipeline: the latent rank `K` and the d-fSPA denoising parameters
`(N, Delta)`. For the broader setup — the four assumptions, the
three-step pipeline, and the running notation — see
`vignette("metahunt-intro", package = "MetaHunt")`.

## Why this matters

Choosing `K` is the single most consequential decision in a MetaHunt
fit. Picking `K` too small underfits: real cross-study heterogeneity
gets squashed into a low-rank approximation that cannot represent the
data, and downstream predictions are biased. Picking `K` too large
inflates variance and risks recovering spurious "bases" that fit
noise. The denoising step in d-fSPA controls finite-sample variance
in a complementary way: averaging each study with its near
neighbours before basis hunting smooths over per-study estimation
error, at the cost of a small smoothing bias.

## A small standalone simulation

```{r cdk-simulate}
m <- 30; G <- 20; K_true <- 3                     # studies, grid points, true rank
x <- seq(0, 1, length.out = G)                    # common evaluation grid
basis <- rbind(sin(pi * x), cos(pi * x), x)       # K_true x G matrix of true bases
W <- data.frame(w1 = rnorm(m), w2 = rnorm(m))     # study-level metadata
beta <- cbind(c(1, -0.8), c(-0.5, 1.2), c(0, 0))  # 2 x K_true metadata-to-weight map
pi_true <- exp(as.matrix(W) %*% beta)             # softmax mixing weights per study
pi_true <- pi_true / rowSums(pi_true)
F_hat <- pi_true %*% basis + matrix(rnorm(m * G, sd = 0.05), m, G)  # noisy observations
```
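
Before any diagnostics, it is worth eyeballing the simulated studies
against the three true bases. This is a quick base-graphics check,
independent of the package:

```{r cdk-sim-peek}
# Grey lines: the m noisy study-level functions (rows of F_hat).
# Coloured lines: the three true bases they mix.
matplot(x, t(F_hat), type = "l", lty = 1, col = grey(0.7),
        xlab = "x", ylab = "value", main = "Simulated study functions")
matlines(x, t(basis), lwd = 2)
```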

## Unsupervised diagnostic: reconstruction error vs K

The **elbow** plot tracks how well the recovered bases reconstruct
the observed `F_hat` as a function of `K`. It is unsupervised — it
does not use `W` — and is fast.

```{r cdk-elbow}
elbow <- reconstruction_error_curve(F_hat, K_range = 2:6,
                                    dfspa_args = list(denoise = FALSE))
plot(elbow$K, elbow$error, type = "b",
     xlab = "K", ylab = "reconstruction error",
     main = "Reconstruction error vs K",
     ylim = c(0, max(elbow$error, na.rm = TRUE) * 1.05))
```
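
There is no built-in elbow picker; one simple heuristic (illustrative
only, with an arbitrary 10% tolerance) is to take the smallest `K`
whose marginal improvement falls below the tolerance:

```{r cdk-elbow-pick}
# Relative drop in error from each K to the next; pick the smallest K
# whose next step improves by less than 10%. The cutoff is arbitrary
# and this helper logic is not part of MetaHunt.
drops <- -diff(elbow$error) / head(elbow$error, -1)
K_elbow <- elbow$K[which(drops < 0.10)[1]]
K_elbow
```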

## Supervised diagnostic: cross-validated prediction error vs K

The **CV prediction-error** curve uses the metadata `W` to predict
held-out studies' functions and reports the average prediction
error. This is supervised and tends to identify a tighter elbow when
the metadata is informative.

```{r cdk-cv}
cv <- cv_error_curve(F_hat, W, K_range = 2:6, n_folds = 4,
                     dfspa_args = list(denoise = FALSE), seed = 1)
plot(cv$K, cv$cv_error, type = "b",
     xlab = "K", ylab = "CV prediction error",
     main = "CV prediction error vs K",
     ylim = c(0, max(cv$cv_error, na.rm = TRUE) * 1.05))
```

Both curves should dip near `K = 3`, the true rank in this
simulation.
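
To turn the CV curve into a concrete choice, take its minimiser
(`K_hat` below is just a local variable name, not a package
convention):

```{r cdk-pick-k}
# K minimising the cross-validated prediction error.
K_hat <- cv$K[which.min(cv$cv_error)]
K_hat
```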

## The d-fSPA denoising knobs (N, Delta)

`dfspa()` averages each study with its near neighbours before
running the projection algorithm. Two parameters control this:
`N` (the neighbourhood size, in number of studies) and `Delta` (a
distance threshold). Larger `N` and `Delta` smooth more aggressively.
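
To fix intuition, here is a schematic of the kind of neighbour
averaging involved. This sketch illustrates the idea only, not
`dfspa()`'s actual implementation, and `denoise_sketch` is a
hypothetical name:

```{r cdk-denoise-sketch}
# Schematic only, not the code inside dfspa(). Replace each study's
# row by the mean of its (at most) N nearest rows that lie within
# distance Delta; the study itself always survives the threshold.
denoise_sketch <- function(F_mat, N, Delta) {
  D <- as.matrix(dist(F_mat))                 # pairwise study distances
  out <- F_mat
  for (i in seq_len(nrow(F_mat))) {
    nb <- order(D[i, ])[seq_len(min(N, nrow(F_mat)))]  # N nearest, incl. self
    nb <- nb[D[i, nb] <= Delta]                        # enforce the threshold
    out[i, ] <- colMeans(F_mat[nb, , drop = FALSE])
  }
  out
}
F_smooth <- denoise_sketch(F_hat, N = 5, Delta = 2)   # illustrative values
```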

### Bypassing denoising

In clean simulations or with small `m`, the simplest choice is to
bypass denoising entirely. This avoids the small-sample failure mode
where aggressive denoising prunes too many studies.

```{r cdk-bypass}
fit_no <- metahunt(F_hat, W, K = K_true,
                   dfspa_args = list(denoise = FALSE))
fit_no
```

### Setting (N, Delta) by hand

If you have a sense of the scale of the within-study estimation
error, pass `N` and `Delta` directly. The chunk below repeats the
bypass fit as a baseline, then adds a hand-tuned configuration on the
same data.

```{r cdk-dfspa-knobs}
fit_no <- metahunt(F_hat, W, K = K_true,          # baseline: denoising bypassed
                   dfspa_args = list(denoise = FALSE))

fit_manual <- metahunt(F_hat, W, K = K_true,      # hand-tuned (N, Delta)
                       dfspa_args = list(
                         N = ceiling(0.5 * log(nrow(F_hat))),  # N counts studies, so round up
                         Delta = 0.4))
```
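
For a quick qualitative comparison, plot the recovered bases from both
fits with the `plot()` method (the same check recommended in the
recipe below):

```{r cdk-compare-fits}
plot(fit_no)      # bases recovered without denoising
plot(fit_manual)  # bases recovered with hand-tuned (N, Delta)
```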

### Tuning (N, Delta) by CV

`select_denoising_params()` cross-validates over a grid of
`(N, Delta)` combinations at fixed `K`. With small `m`, the search
will frequently warn that some combinations prune everything ("Only
0 studies survive denoising but K = 3..."). These warnings are
expected: an aggressive `(N, Delta)` can prune every study in a small
training fold. The function records those folds as failures and returns
the best surviving combination.

```{r cdk-select-denoising, warning = FALSE}
tune <- select_denoising_params(F_hat, W, K = K_true, n_folds = 4, seed = 1)
tune$best
```
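
Once tuned, refit at the selected values. The call below assumes
`tune$best` stores the chosen parameters under the names `N` and
`Delta`; inspect `str(tune$best)` in your session to confirm:

```{r cdk-refit-tuned}
# Assumes tune$best carries elements named N and Delta.
fit_tuned <- metahunt(F_hat, W, K = K_true,
                      dfspa_args = list(N = tune$best$N,
                                        Delta = tune$best$Delta))
```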

## Practical recipe

- Start with the elbow plot to get a rough range for `K`. Refine
  with the CV curve if `W` is informative.
- For very small `m` (say `m < 30`), bypass denoising
  (`denoise = FALSE`) and pick `K` from the CV curve.
- For larger `m`, leave the d-fSPA defaults on or tune
  `(N, Delta)` with `select_denoising_params()`.
- Treat warnings from `select_denoising_params()` as informative,
  not fatal. The reported `best` is the best surviving combination.
- Sanity-check the recovered bases visually with `plot(fit)`.
  Bases that look like noise are a sign of `K` set too high.

## See also

- `vignette("metahunt-intro", package = "MetaHunt")` — the full
  pipeline and key assumptions.
- `?metahunt` — the wrapper around the three pipeline steps.
- `?dfspa` — d-fSPA basis hunting and its denoising arguments.
- `?reconstruction_error_curve` — the unsupervised elbow diagnostic.
- `?cv_error_curve` — the supervised CV diagnostic.
- `?select_denoising_params` — cross-validating `(N, Delta)` at
  fixed `K`.
