| Type: | Package |
| Title: | Implementation of the DISCO Metric for Internal Clustering Evaluation |
| Version: | 0.1.1 |
| Description: | Implementation of the DISCO (Density-based Internal Score for Clusterings with nOise) metric, a cluster validity index for evaluating density-based clustering results without ground truth labels. DISCO is the first index to explicitly assess the quality of noise point assignments in addition to cluster quality. It uses density-connectivity distance derived from a minimum spanning tree of the mutual-reachability graph, providing interpretable, bounded scores in [-1, 1]. Higher scores indicate better clustering. Based on Beer et al. (2025) <doi:10.48550/arXiv.2503.00127>. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| Language: | en-GB |
| Depends: | R (≥ 3.6.0) |
| Imports: | FNN, stats |
| Suggests: | dbscan, testthat (≥ 3.0.0) |
| RoxygenNote: | 7.3.3 |
| URL: | https://github.com/aminentezari/discoCVI |
| BugReports: | https://github.com/aminentezari/discoCVI/issues |
| NeedsCompilation: | no |
| Packaged: | 2026-05-02 08:55:19 UTC; aminentezari |
| Author: | Amin Entezari |
| Maintainer: | Amin Entezari <amin_entezari@outlook.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-05-05 18:20:13 UTC |
disco: Density-Informed Clustering Scoring and Optimisation
Description
The DISCO metric evaluates the quality of a clustering result using density-connectivity (DC) distances derived from the minimum spanning tree of the mutual-reachability graph. It generalises the silhouette coefficient to density-based partitions and assigns a principled score to noise points.
Author(s)
Maintainer: Amin Entezari amin_entezari@outlook.com (ORCID)
Other contributors:
Davide Chicco davidechicco@davidechicco.it (ORCID) (Supervision and guidance) [contributor]
See Also
Useful links:
Report bugs at https://github.com/aminentezari/discoCVI/issues
Compute mutual-reachability (reachability) distances
Description
For every pair of points (i, j), the mutual-reachability distance is
d_{\text{reach}}(i,j) =
\max\bigl(\text{core}_k(i),\; \text{core}_k(j),\; d(i,j)\bigr)
where \text{core}_k(i) is the distance from point i to its
k-th nearest neighbour (k = \texttt{min\_points} - 1) and
d(i,j) is the Euclidean distance.
Usage
calculate_reachability_distance(points, min_points = 5)
Arguments
points |
A numeric matrix of shape |
min_points |
A positive integer |
Value
A symmetric n \times n numeric matrix of mutual-reachability
distances with zeros on the diagonal.
Compute density-connectivity (DC) distances
Description
The main entry point for constructing the DC-distance matrix used throughout the DISCO metric. The procedure is:
Compute mutual-reachability distances (
calculate_reachability_distance).Build the minimum spanning tree of those distances (
get_mst_edges).Extract pairwise DC distances as minimax-path weights in the MST (
extract_dc_distances_from_mst).
Usage
compute_dc_distances(X, min_points = 5)
Arguments
X |
A numeric matrix of shape |
min_points |
A positive integer for the core-distance neighbourhood
size. Default is |
Value
A symmetric n \times n numeric matrix of DC distances.
Examples
set.seed(1)
X <- matrix(rnorm(40), ncol = 2)
D <- compute_dc_distances(X, min_points = 5)
dim(D) # 20 x 20
Per-sample DISCO scores
Description
Computes a pointwise DISCO score for every observation in X. The
score lies in [-1, 1]:
Values close to
1indicate a well-placed point (either tightly within its cluster or a noise point far from all clusters).Values close to
0indicate borderline placement.Values close to
-1indicate a misplaced point (or a noise point that should belong to a cluster).
Four cases are handled:
-
All noise — every point labelled
-1: all scores are-1. -
Single cluster with no noise — all scores are
0. -
One real cluster + noise — cluster points are scored via
p_cluster; noise points are scored viap_noise. -
Two or more clusters (optional noise) — non-noise points use
p_clusteron the non-noise sub-matrix; noise points usep_noise.
Usage
disco_samples(X, labels, min_points = 5)
Arguments
X |
A numeric matrix ( |
labels |
An integer vector of length |
min_points |
A positive integer for the core-distance neighbourhood
size. Default is |
Value
A numeric vector of length n with per-point DISCO scores.
See Also
disco_score, p_cluster,
p_noise
Examples
set.seed(42)
X <- matrix(rnorm(100), ncol = 2)
labels <- rep(c(0L, 1L), each = 25)
s <- disco_samples(X, labels)
hist(s, main = "Per-sample DISCO scores")
Overall DISCO score
Description
Returns the mean of the per-sample DISCO scores produced by
disco_samples. This is the single-number summary of
clustering quality used for model selection and comparison.
Usage
disco_score(X, labels, min_points = 5)
Arguments
X |
A numeric matrix ( |
labels |
An integer vector of cluster labels ( |
min_points |
Positive integer; core-distance neighbourhood size.
Default is |
Value
A single numeric value in [-1, 1].
See Also
Examples
set.seed(42)
X <- matrix(rnorm(100), ncol = 2)
labels <- rep(c(0L, 1L), each = 25)
disco_score(X, labels)
Extract DC distances from an MST via BFS minimax-path traversal
Description
Given the edges of a minimum spanning tree, computes the density-connectivity (DC) distance matrix. The DC distance between two points is the maximum edge weight on the unique path connecting them in the MST (minimax path). BFS is used to propagate these path maxima from every source node.
Usage
extract_dc_distances_from_mst(mst_edges, n)
Arguments
mst_edges |
A |
n |
Integer, total number of points. |
Value
A symmetric n \times n numeric matrix of DC distances.
Compute a minimum spanning tree via Prim's algorithm
Description
Implements Prim's algorithm exactly as in the reference Python code
(dctree.py, lines 401–438), operating on a precomputed distance
matrix.
Usage
get_mst_edges(dist_matrix)
Arguments
dist_matrix |
A symmetric |
Value
A data.frame with n-1 rows and three columns:
- i
Integer index of the first endpoint (1-based).
- j
Integer index of the second endpoint (1-based).
- dist
Edge weight.
Silhouette-like cluster scores using DC distances
Description
Computes a silhouette coefficient for each non-noise point using the
precomputed (or internally derived) DC-distance matrix rather than raw
Euclidean distances. This matches the formula used by
sklearn.metrics.silhouette_samples with metric="precomputed"
(Python reference: disco.py, line 280):
s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}
where a(i) is the mean DC distance from i to all other points
in the same cluster and b(i) is the minimum over other clusters of
the mean DC distance from i to that cluster.
Usage
p_cluster(X, labels, min_points = 5, precomputed_dc_dists = FALSE)
Arguments
X |
Either (a) a numeric matrix of raw data ( |
labels |
An integer vector of cluster labels of length |
min_points |
Positive integer; used only when
|
precomputed_dc_dists |
Logical. If |
Value
A numeric vector of length n with per-point silhouette-style
scores in [-1, 1].
See Also
disco_samples, compute_dc_distances
Examples
set.seed(7)
X <- matrix(rnorm(60), ncol = 2)
labels <- rep(c(0L, 1L, 2L), each = 10)
p_cluster(X, labels)
Noise-point scores
Description
Assigns a quality score to each point labelled as noise (-1). Two
complementary sub-scores are computed and the element-wise minimum is taken
as the final score:
p_{\text{sparse}}Measures whether the noise point is sparser (more peripheral in density) than the densest part of every real cluster. For cluster
c, letM_cbe the maximum core distance over all points inc. Thenp_{\text{sparse}}(i) = \min_c \frac{\text{core}_k(i) - M_c}{\max(\text{core}_k(i),\, M_c)}.p_{\text{far}}Measures whether the noise point is far from every cluster in DC-distance space. For cluster
c, letd_{\min,c}(i)be the minimum DC distance fromito any point inc. Thenp_{\text{far}}(i) = \min_c \frac{d_{\min,c}(i) - M_c}{\max(d_{\min,c}(i),\, M_c)}.
Both sub-scores lie in [-1, 1]: positive means the noise point is
legitimately sparse/far; negative means it is denser/closer than the cluster
boundary and might be a misclassified inlier.
Usage
p_noise(X, labels, min_points = 5, dc_dists = NULL)
Arguments
X |
A numeric matrix ( |
labels |
An integer vector of length |
min_points |
Positive integer |
dc_dists |
Optional precomputed |
Value
A named list with two numeric vectors, each of length equal to the number of noise points:
- p_sparse
Sparsity-based noise scores.
- p_far
DC-distance-based noise scores.
See Also
Examples
set.seed(3)
X <- matrix(rnorm(60), ncol = 2) # 30 rows x 2 cols
labels <- c(rep(0L, 10), rep(1L, 10), rep(-1L, 10)) # 30 labels
nr <- p_noise(X, labels, min_points = 3)
str(nr)