Single-cell RNA-seq integration methods aim to remove technical batch effects while preserving biological variation. CIDER provides a ground-truth-free approach to:
This vignette focuses how showing the process using the example data of dendritic cells.
Apart from CIDER, the following packages also need to be loaded:
The example data can be downloaded from https://figshare.com/s/d5474749ca8c711cc205. This dataset contains 26593 genes and 564 cells, and comprises four dendritic cell subtypes (CD141, CD1C, DoubleNeg, and pDC) from two batches. The raw count matrix and sample information were downloaded from a curated set, and cells with fewer than 500 detected genes have been removed.
# Download the data
data_url <- "https://figshare.com/ndownloader/files/52116197"
data_file <- file.path(tempdir(), "dendritic.rds")
if (!file.exists(data_file)) {
message("Downloading data...")
download.file(data_url, destfile = data_file, mode = "wb")
}
dendritic_reduced <- load(data_file)
First an integration method\(^1\) is
applied on the dendritic data. You can apply other integration methods
to the your data, as long as the correct PCs are stored in your Seurat
object, i.e. Reductions(seu.integrated, "pca")
or
seu.integrated@reductions$pca
.
seu.list <- SplitObject(dendritic, split.by = "Batch")
for (i in 1:length(seu.list)) {
seu.list[[i]] <- NormalizeData(seu.list[[i]], verbose = FALSE)
seu.list[[i]] <- FindVariableFeatures(seu.list[[i]],
selection.method = "vst",
nfeatures = 1000, verbose = FALSE)
}
seu.anchors <- FindIntegrationAnchors(object.list = seu.list,
dims = 1:15, verbose = FALSE)
seu.integrated <- IntegrateData(anchorset = seu.anchors,
dims = 1:15, verbose = FALSE)
DefaultAssay(seu.integrated) <- "integrated"
seu.integrated <- ScaleData(seu.integrated, verbose = FALSE)
seu.integrated <- RunPCA(seu.integrated, verbose = FALSE)
seu.integrated <- RunTSNE(seu.integrated, reduction = "pca", dims = 1:5)
Clear the intermediate outcome.
CIDER evaluates integration results in three steps.
Clustering based on the corrected PCs (hdbscan.seurat
).
This step uses HDBSCAN, which is a density-based clustering
algorithm\(^2\). The clustering results
are stored in seu.integrated$dbscan_cluster
. Clusters are
further divided into batch-specific clusters by concatenating
dbscan_cluster and batch, stored in
seu.integrated$initial_cluster
.
Compute IDER-based similarity matrix (getIDEr
) among the
batch-specific initial clusters. If multiple CPUs are availble, you can
set use.parallel = TRUE
and n.cores
to the
number of available cores to speed it up.
Assign the similarity and estimate empirical p values
(estimateProb
) for the correctness of integration. High
similarity values and low p values indicate that the cell are similar to
the surrounding cells and likely integrated correctly.
The evaluation scores can be viewed by the scatterPlot
as below. As shown cells with dbscan_cluster of 2 and 3 have low
regional similarity and high empirical p values, suggesting that they
can be incorrectly integrated.
p1 <- scatterPlot(seu.integrated, "tsne", "dbscan_cluster")
p2 <- scatterPlot(seu.integrated, "tsne", colour.by = "similarity") + labs(fill = "Similarity")
p3 <- scatterPlot(seu.integrated, "tsne", colour.by = "pvalue") + labs(fill = "Prob of \nrejection")
plot_grid(p1, p2, p3, ncol = 3)
Interpretation Guide:
✅ High similarity + Low p-value: Well-integrated regions
❌ Low similarity + High p-value: Potential integration errors
To have more insight, we can view the IDER-based similarity matrix by
functions plotNetwork
or plotHeatmap
. Both of
them require the input of a Seurat object and the output of
getIDEr
. In this example, 1_Batch1 and 1_Batch2 as well as
4_Batch1 and 4_Batch2 have high similarity.
plotNetwork
generates a graph where vertexes are initial
clusters and edge widths are similarity values. The parameter
weight.factor
controls the scale of edge widths; larger
weight.factor
will give bolder edges proportionally.
#> IGRAPH 656dbb4 UNW- 10 12 --
#> + attr: name (v/c), frame.color (v/c), size (v/n), label.family (v/c),
#> | weight (e/n), width (e/n)
#> + edges from 656dbb4 (vertex names):
#> [1] 4_Batch1--4_Batch2 4_Batch1--3_Batch2 3_Batch1--4_Batch2 3_Batch1--3_Batch2
#> [5] 3_Batch1--2_Batch2 2_Batch1--3_Batch2 2_Batch1--2_Batch2 0_Batch1--0_Batch2
#> [9] 0_Batch1--2_Batch2 1_Batch1--0_Batch2 1_Batch1--1_Batch2 1_Batch1--2_Batch2
So far the evaluation have completed and CIDER has not used the ground truth at all!
Let’s peep at the ground truth before the closure of this vignette. As shown in the figure below, the clusters having low IDER-based similarity and high p values actually have at least two populations (CD1C and CD141), verifying that CIDER spots the wrongly integrated cells.
hdbscan.seurat
parameters if initial clustering
is too granularcutree.h
in estimateProb
to change
confidence thresholdsuse.parallel=TRUE
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Monterey 12.5.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/London
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] viridis_0.6.5 viridisLite_0.4.2 cowplot_1.1.3 ggplot2_3.5.1
#> [5] pheatmap_1.0.12 Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-4
#> [9] CIDER_0.99.4
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 rstudioapi_0.16.0 jsonlite_1.8.8
#> [4] magrittr_2.0.3 spatstat.utils_3.1-0 farver_2.1.2
#> [7] rmarkdown_2.27 vctrs_0.6.5 ROCR_1.0-11
#> [10] spatstat.explore_3.3-2 htmltools_0.5.8.1 sass_0.4.9
#> [13] sctransform_0.4.1 parallelly_1.38.0 KernSmooth_2.23-24
#> [16] bslib_0.7.0 htmlwidgets_1.6.4 ica_1.0-3
#> [19] plyr_1.8.9 plotly_4.10.4 zoo_1.8-12
#> [22] cachem_1.1.0 igraph_2.0.3 mime_0.12
#> [25] lifecycle_1.0.4 iterators_1.0.14 pkgconfig_2.0.3
#> [28] Matrix_1.7-0 R6_2.5.1 fastmap_1.2.0
#> [31] fitdistrplus_1.2-1 future_1.34.0 shiny_1.9.1
#> [34] digest_0.6.37 colorspace_2.1-1 patchwork_1.2.0
#> [37] tensor_1.5 RSpectra_0.16-2 irlba_2.3.5.1
#> [40] labeling_0.4.3 progressr_0.14.0 fansi_1.0.6
#> [43] spatstat.sparse_3.1-0 httr_1.4.7 polyclip_1.10-7
#> [46] abind_1.4-5 compiler_4.4.1 withr_3.0.1
#> [49] doParallel_1.0.17 fastDummies_1.7.4 highr_0.11
#> [52] MASS_7.3-61 tools_4.4.1 lmtest_0.9-40
#> [55] httpuv_1.6.15 future.apply_1.11.2 goftest_1.2-3
#> [58] glue_1.7.0 dbscan_1.2.2 nlme_3.1-165
#> [61] promises_1.3.0 grid_4.4.1 Rtsne_0.17
#> [64] cluster_2.1.6 reshape2_1.4.4 generics_0.1.3
#> [67] gtable_0.3.5 spatstat.data_3.1-2 tidyr_1.3.1
#> [70] data.table_1.16.0 utf8_1.2.4 spatstat.geom_3.3-2
#> [73] RcppAnnoy_0.0.22 ggrepel_0.9.5 RANN_2.6.2
#> [76] foreach_1.5.2 pillar_1.9.0 stringr_1.5.1
#> [79] spam_2.10-0 RcppHNSW_0.6.0 limma_3.60.6
#> [82] later_1.3.2 splines_4.4.1 dplyr_1.1.4
#> [85] lattice_0.22-6 survival_3.7-0 deldir_2.0-4
#> [88] tidyselect_1.2.1 locfit_1.5-9.10 miniUI_0.1.1.1
#> [91] pbapply_1.7-2 knitr_1.48 gridExtra_2.3
#> [94] edgeR_4.2.2 scattermore_1.2 xfun_0.46
#> [97] statmod_1.5.0 matrixStats_1.4.1 stringi_1.8.4
#> [100] lazyeval_0.2.2 yaml_2.3.10 evaluate_0.24.0
#> [103] codetools_0.2-20 kernlab_0.9-33 tibble_3.2.1
#> [106] cli_3.6.3 uwot_0.2.2 xtable_1.8-4
#> [109] reticulate_1.39.0 munsell_0.5.1 jquerylib_0.1.4
#> [112] Rcpp_1.0.13 globals_0.16.3 spatstat.random_3.3-1
#> [115] png_0.1-8 spatstat.univar_3.0-1 parallel_4.4.1
#> [118] dotCall64_1.1-1 listenv_0.9.1 scales_1.3.0
#> [121] ggridges_0.5.6 leiden_0.4.3.1 purrr_1.0.2
#> [124] rlang_1.1.4