Mercator for Continuous Data

C.E. Coombes, Zachary B. Abrams, Kevin R. Coombes

2024-04-28

Introduction

The Mercator package is intended to facilitate the exploratory analysis of data sets. It consists of two main parts, one devoted to tools for binary matrices, and the other focused on visualization. These visualization tools can be used with binary, continuous, categorical, or mixed data, since they only depend on a distance matrix. Each distance matrix can be visualized with multiple techniques, providing a consistent interface to thoroughly explore the data set. In this vignette, we illustrate the visualization of a continuous data set.

The Mercator Class

First we load the package.

suppressMessages( suppressWarnings( library(Mercator) ) )

Now we load a “fake” set of synthetic continuous data that comes with the Mercator package. We will use this data set to illustrate the visualization methods.

set.seed(36766)
data(fakedata)
ls()
## [1] "fakeclin" "fakedata"
dim(fakedata)
## [1] 776 300
dim(fakeclin)
## [1] 300   4

Visualization

The Mercator package currently supports visualization of data with methods that include standard techniques (hierarchical clustering) and large-scale visualizations: multidimensional scaling (MDS), t-distributed Stochastic Neighbor Embedding (t-SNE), and iGraph. In order to create a Mercator object, we must provide a distance matrix, the name of the metric used to compute it, the name of an initial visualization method, and the number of clusters.

We are going to start with hierarchical clustering, with an arbitrarily assigned number of 4 groups.

mercury <- Mercator(dist(t(fakedata)), "euclid", "hclust", 4)
summary(mercury)
## An object of the 'Mercator' class, using the ' euclid ' metric, of size
## [1] 300 300
## Contains these visualizations:  hclust
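The call above transposes fakedata before computing distances. That is needed because dist() measures distances between the rows of its input, while this data set evidently stores its 300 samples in columns. A small sketch with a made-up matrix (the names toy and d are illustrative and not part of the package) shows the effect:

```r
# Toy matrix laid out like fakedata: features in rows, samples in columns.
set.seed(1)
toy <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3,
              dimnames = list(paste0("feature", 1:5), paste0("sample", 1:3)))

# dist() compares rows, so transpose to get sample-to-sample distances.
d <- dist(t(toy))        # Euclidean distance is the default
dim(as.matrix(d))        # one row and one column per sample
```

Without the transpose, the same call would return a 5 x 5 matrix of distances between features.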

Hierarchical Clustering

Here is a “view” of the dendrogram produced by hierarchical clustering. Note that view is an argument to the plot function for Mercator objects. If omitted, the first view in the list is used.

plot(mercury, view = "hclust")

Hierarchical clustering.

The dendrogram suggests that there might actually be more than 4 subtypes in the data, but we’re going to wait until we see some other views of the data before doing anything about that.
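For readers who want to see the underlying idea outside of Mercator, here is a minimal sketch using base R on simulated points. The linkage method ("ward.D2") is our assumption for illustration; we have not confirmed which linkage Mercator uses internally.

```r
set.seed(2)
# Two well-separated groups of 10 points each in 5 dimensions.
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 5),
           matrix(rnorm(50, mean = 6), ncol = 5))

hc <- hclust(dist(x), method = "ward.D2")  # linkage choice is an assumption
groups <- cutree(hc, k = 2)                # cut the tree into two branches
table(groups)
plot(hc, main = "Toy dendrogram")
```

Changing k in cutree() cuts the same tree at a different height, which is why a dendrogram can hint at more groups than were originally requested.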

t-Distributed Stochastic Neighbor Embedding

Mercator can use t-distributed Stochastic Neighbor Embedding (t-SNE) plots for visualizing large-scale, high-dimensional data in a 2-dimensional space.

mercury <- addVisualization(mercury, "tsne")
plot(mercury, view = "tsne", main="t-SNE; Euclidean Distance")

A t-SNE Plot.

The t-SNE plot also suggests more than four subtypes; perhaps as many as seven or eight.

Optional t-SNE parameters, such as perplexity, can be used to fine-tune the plot when the visualization is created. Using addVisualization to create a new, tuned plot of an existing type overwrites the existing plot of that type.

mercury <- addVisualization(mercury, "tsne", perplexity = 15)
## Warning in addVisualization(mercury, "tsne", perplexity = 15): Overwriting an
## existing visualization:tsne
plot(mercury, view = "tsne",  main="t-SNE; Euclidean Distance; perplexity = 15")

A t-SNE plot with smaller perplexity.
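To get a feel for what perplexity does, it can help to run t-SNE directly on a small simulated data set. The sketch below calls the Rtsne package, which appears in the session information at the end of this vignette; we assume, without having checked the Mercator source, that this is where the perplexity argument ends up. Note that Rtsne refuses to run when three times the perplexity exceeds the number of samples minus one, so small data sets need small perplexities.

```r
set.seed(3)
# 60 points in 5 dimensions, in two groups of 30.
x <- rbind(matrix(rnorm(150, mean = 0), ncol = 5),
           matrix(rnorm(150, mean = 5), ncol = 5))
d <- dist(x)

if (requireNamespace("Rtsne", quietly = TRUE)) {
  # is_distance = TRUE tells Rtsne that the input is a distance matrix.
  fit_15 <- Rtsne::Rtsne(d, is_distance = TRUE, perplexity = 15)
  fit_5  <- Rtsne::Rtsne(d, is_distance = TRUE, perplexity = 5)
  op <- par(mfrow = c(1, 2))
  plot(fit_15$Y, main = "perplexity = 15", xlab = "", ylab = "")
  plot(fit_5$Y,  main = "perplexity = 5",  xlab = "", ylab = "")
  par(op)
}
```

Smaller perplexities emphasize very local neighborhoods, so the picture tends to break into more and tighter clumps.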

Multi-Dimensional Scaling

Mercator also supports multi-dimensional scaling (MDS) plots.

mercury <- addVisualization(mercury, "mds")
plot(mercury, view = "mds", main="MDS; Euclidean Distance")

Multi-dimensional scaling.

Interestingly, the MDS plot (which is equivalent to principal components analysis, PCA, when used with Euclidean distances) doesn’t provide clear evidence of more than three or four subtypes. That’s not surprising, since groups separated in high dimensions can easily be flattened by linear projections.
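The equivalence between classical MDS and PCA for Euclidean distances can be checked numerically with base R. This sketch uses simulated data and the base functions cmdscale() and prcomp(); it does not involve Mercator itself.

```r
set.seed(4)
x <- matrix(rnorm(40 * 6), nrow = 40)   # 40 samples, 6 features

mds <- cmdscale(dist(x), k = 2)         # classical MDS on Euclidean distances
pca <- prcomp(x)$x[, 1:2]               # first two principal component scores

# The two sets of coordinates agree up to the sign of each axis,
# so we compare absolute values.
all.equal(abs(unname(mds)), abs(unname(pca)))
```

Since both methods are linear projections, neither can unfold structure that only separates in directions beyond the first two components.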

iGraph

Mercator can visualize complex networks using iGraph. In the next chunk of code, we add an iGraph visualization. We then look at the resulting graph, using three different “layouts”. The Q parameter is a cutoff (qutoff?) on the distance used to include edges; if omitted, it defaults to the 10th percentile. We arrived at the value Q=24 shown here by trial-and-error, though one could plot a histogram of the distances (via hist()) to make a more informed choice.

set.seed(73633)
mercury <- addVisualization(mercury, "graph", Q = 24)
## Warning in layout_nicely(myg): Non-positive edge weight found, ignoring all
## weights during graph layout.
plot(mercury, view = "graph", layout = "tsne", main="T-SNE Layout")

iGraph views.

plot(mercury, view = "graph", layout = "mds", main = "MDS Layout")

iGraph views.

plot(mercury, view = "graph", layout = "nicely", main = "'Nicely' Layout")

iGraph views.

The last layout, in this case, is possibly not so nice.
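The choice of Q discussed above can be informed by looking at the distribution of pairwise distances. The following sketch uses simulated data and base R only; the assumption that an edge is drawn for every pair at or below the cutoff is ours, based on the description of Q.

```r
set.seed(5)
x <- matrix(rnorm(50 * 4), nrow = 50)
dvec <- as.vector(dist(x))              # all pairwise distances as a plain vector

q10 <- quantile(dvec, probs = 0.10)     # the 10th percentile described above
hist(dvec, breaks = 30, main = "Pairwise distances", xlab = "Distance")
abline(v = q10, lty = 2)                # candidate cutoff

mean(dvec <= q10)                       # share of pairs at or below the cutoff
```

A cutoff placed in a valley of the histogram, if one exists, would be a natural alternative to a fixed percentile.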

Cluster Identities

We can use the getClusters function to determine the cluster assignments and use these for further manipulation. For example, we can easily determine cluster size.

my.clust <- getClusters(mercury)
table(my.clust)
## my.clust
##  1  2  3  4 
## 82 68 74 76

We might also compare the cluster labels to the “true” subtypes in our “fake” data set.

table(my.clust, fakeclin$Type)
##         
## my.clust  1  2  3  4  5  6  7  8
##        1  0 40  0  1  0 41  0  0
##        2 30  0 36  0  0  0  2  0
##        3  4  0  0 34  0  0 36  0
##        4  0  0  0  0 41  0  0 35
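The cross-tabulation shows that each of the four clusters is essentially the union of two true subtypes. As a sketch, we can extract that pattern programmatically from the counts printed above. The matrix is typed in by hand from the output, and the threshold of 10 samples is an arbitrary choice for illustration.

```r
# Counts copied from the table above: rows are clusters, columns are subtypes.
tab <- matrix(c( 0, 40,  0,  1,  0, 41,  0,  0,
                30,  0, 36,  0,  0,  0,  2,  0,
                 4,  0,  0, 34,  0,  0, 36,  0,
                 0,  0,  0,  0, 41,  0,  0, 35),
              nrow = 4, byrow = TRUE,
              dimnames = list(cluster = as.character(1:4),
                              type = as.character(1:8)))

# For each cluster, list the subtypes that contribute more than 10 samples.
merged <- lapply(split(tab, row(tab)), function(r) colnames(tab)[r > 10])
merged
```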

Silhouette-Width Barplots

The barplot method produces a version of the “silhouette width” plot from Kaufman and Rousseeuw (and borrowed from the cluster package).

barplot(mercury)

Silhouette widths.

For each observation in the data set, the silhouette width is a measure of how much we believe that it is placed in the correct cluster. Here we see that about 10% to 20% of the observations in each cluster may be incorrectly classified, since their silhouette widths are negative.
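To make the definition concrete: for an observation with mean distance a to the other members of its own cluster and mean distance b to the members of the nearest other cluster, the silhouette width is (b - a) / max(a, b), so negative values mean the observation is, on average, closer to a neighboring cluster. The sketch below computes this with the cluster package on simulated data; it is independent of Mercator.

```r
library(cluster)
set.seed(7)
# Two overlapping groups of 50 points each in two dimensions.
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 2), ncol = 2))
d <- dist(x)

fit <- pam(d, k = 2)       # partitioning around medoids
sil <- silhouette(fit)     # one row per observation

# Fraction of observations with negative silhouette width, per cluster.
tapply(sil[, "sil_width"] < 0, sil[, "cluster"], mean)
```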

Reclustering

We can “recluster” by specifying a different number of clusters.

mercury <- recluster(mercury, K = 8)
plot(mercury, view = "tsne")

A t-SNE plot after reclustering.

The silhouette-width barplot changes with the number of clusters. In this case, it suggests that eight clusters may not describe the data as well as four. However, the previous t-SNE plot also shows that the algorithmically derived cluster labels don’t seem to match the visible clusters very well.

barplot(mercury)

Silhouette widths with eight clusters.
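One common way to compare different numbers of clusters is to track the average silhouette width as K varies. Here is a minimal sketch on simulated data with three groups, using pam() from the cluster package directly rather than a Mercator object.

```r
library(cluster)
set.seed(8)
# Three groups of 30 points each in two dimensions.
centers <- rbind(c(0, 0), c(5, 0), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(i)
  cbind(rnorm(30, centers[i, 1]), rnorm(30, centers[i, 2]))))
d <- dist(x)

ks <- 2:6
avg_width <- sapply(ks, function(k) pam(d, k = k)$silinfo$avg.width)
names(avg_width) <- ks
round(avg_width, 2)
ks[which.max(avg_width)]   # K with the largest average silhouette width
```

As the vignette's example shows, this criterion can favor a coarser grouping than the one that generated the data, so it is best read alongside the plots rather than instead of them.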

Hierarchical Clusters

The clustering algorithm used within Mercator is partitioning around medoids (PAM). You can run any clustering algorithm of your choice and assign the resulting cluster labels to the Mercator object. As part of our visualizations, we have already performed hierarchical clustering. So, we can assign cluster labels by cutting the branches of the dendrogram. We can use the cutree function after extracting the dendrogram from the view. (Note that we use the remapColors function here to try to keep the same color assignments for the PAM-defined clusters and the hierarchical clusters.)

hclass <- cutree(mercury@view[["hclust"]], k = 8)
neptune <- setClusters(mercury, hclass)
neptune <- remapColors(mercury, neptune)
plot(neptune, view = "tsne")

A t-SNE plot colored by hierarchical clustering.

The assignments by hierarchical clustering appear to be more consistent than the PAM clusters with the t-SNE plot, though one suspects that the assignments among the pink, red, and orchid groups may be difficult. The silhouette width barplot (below) confirms that hierarchical clustering works better than PAM on this data set. Only the “red” group #4 contains a large number of apparently misclassified samples.

barplot(neptune)
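The same kind of comparison between PAM and hierarchical labels can be made numerically. The sketch below works on simulated data and uses complete linkage (the default in hclust(), chosen here only for illustration); the helper avg_sil is a hypothetical name of our own.

```r
library(cluster)
set.seed(9)
# Three groups of 40 points each in two dimensions.
x <- rbind(matrix(rnorm(80, mean = 0), ncol = 2),
           matrix(rnorm(80, mean = 3), ncol = 2),
           matrix(rnorm(80, mean = 6), ncol = 2))
d <- dist(x)

pam_labels <- pam(d, k = 3)$clustering
hc_labels  <- cutree(hclust(d), k = 3)   # default (complete) linkage

# Average silhouette width for an arbitrary vector of cluster labels.
avg_sil <- function(labels) mean(silhouette(labels, d)[, "sil_width"])
c(pam = avg_sil(pam_labels), hclust = avg_sil(hc_labels))

table(pam_labels, hc_labels)             # where the two labelings disagree
```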

True Clusters

For our fake data set, since we simulated it, we know the “true” labels. So, we can “recluster” using the true assignments.

venus <- setClusters(neptune, fakeclin$Type)
venus <- remapColors(neptune, venus)
plot(venus, view = "tsne")

A t-SNE plot with true cluster labels.

barplot(venus)

Silhouette widths with true clusters.

We can also see how the hierarchical clustering compares to the true cluster assignments.

table(getClusters(neptune), getClusters(venus))
##    
##      1  2  3  4  5  6  7  8
##   1 41  0  2  1  0  0  0  0
##   2  0 21  0  0  0  0  0  1
##   3  0  0 38  0  0  0  0  0
##   4  0  4  0 34  0  0  0  0
##   5  0  0  0  0 35  2  0  0
##   6  0  0  0  0  0 39  0  0
##   7  0  0  0  0  0  0 36  0
##   8  0  9  0  0  0  0  2 35
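Because the table is nearly diagonal, a rough summary of the agreement is the fraction of samples on the diagonal. This sketch retypes the counts from the output above and assumes that the row and column labels have been aligned, as the near-diagonal pattern suggests.

```r
# Counts copied from the table above: rows are hierarchical clusters,
# columns are the true clusters as reported by getClusters(venus).
tab <- matrix(c(41,  0,  2,  1,  0,  0,  0,  0,
                 0, 21,  0,  0,  0,  0,  0,  1,
                 0,  0, 38,  0,  0,  0,  0,  0,
                 0,  4,  0, 34,  0,  0,  0,  0,
                 0,  0,  0,  0, 35,  2,  0,  0,
                 0,  0,  0,  0,  0, 39,  0,  0,
                 0,  0,  0,  0,  0,  0, 36,  0,
                 0,  9,  0,  0,  0,  0,  2, 35),
              nrow = 8, byrow = TRUE)

agreement <- sum(diag(tab)) / sum(tab)   # 279 of 300 samples
agreement

# Fraction of each true cluster that lands in its matching hierarchical cluster.
recovered <- diag(tab) / colSums(tab)
round(recovered, 2)
```

By this measure, the second true cluster is recovered least completely, with 13 of its 34 members assigned to other hierarchical clusters.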

Appendix

This analysis was performed in the following environment:

sessionInfo()
## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Mercator_1.1.4       Thresher_1.1.4       PCDimension_1.1.13  
## [4] ClassDiscovery_3.4.5 oompaBase_3.2.9      cluster_2.1.6       
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5         xfun_0.43            bslib_0.7.0         
##  [4] ggplot2_3.5.1        lattice_0.22-6       vctrs_0.6.5         
##  [7] tools_4.4.0          generics_0.1.3       stats4_4.4.0        
## [10] flexmix_2.3-19       Polychrome_1.5.1     tibble_3.2.1        
## [13] fansi_1.0.6          highr_0.10           pkgconfig_2.0.3     
## [16] Matrix_1.7-0         KernSmooth_2.23-22   scatterplot3d_0.3-44
## [19] lifecycle_1.0.4      kohonen_3.0.12       compiler_4.4.0      
## [22] munsell_0.5.1        movMF_0.2-8          htmltools_0.5.8.1   
## [25] sass_0.4.9           yaml_2.3.8           pillar_1.9.0        
## [28] jquerylib_0.1.4      MASS_7.3-60.2        openssl_2.1.2       
## [31] cachem_1.0.8         viridis_0.6.5        mclust_6.1          
## [34] RSpectra_0.16-1      cpm_2.3              tidyselect_1.2.1    
## [37] digest_0.6.35        Rtsne_0.17           slam_0.1-50         
## [40] dplyr_1.1.4          kernlab_0.9-32       changepoint_2.2.4   
## [43] ade4_1.7-22          fastmap_1.1.1        grid_4.4.0          
## [46] oompaData_3.1.3      colorspace_2.1-0     cli_3.6.2           
## [49] magrittr_2.0.3       utf8_1.2.4           scales_1.3.0        
## [52] rmarkdown_2.26       umap_0.2.10.0        igraph_2.0.3        
## [55] nnet_7.3-19          reticulate_1.36.1    gridExtra_2.3       
## [58] png_0.1-8            askpass_1.2.0        zoo_1.8-12          
## [61] modeltools_0.2-23    evaluate_0.23        knitr_1.46          
## [64] viridisLite_0.4.2    rlang_1.1.3          Rcpp_1.0.12         
## [67] dendextend_1.17.1    glue_1.7.0           jsonlite_1.8.8      
## [70] R6_2.5.1