Clustering algorithm on the Wireless data

Giovanni Saraceno

The Wireless Indoor Localization Data

We consider the Wireless Indoor Localization Data Set, publicly available on the UCI Machine Learning Repository website. This data set is used to study the performance of different indoor localization algorithms. It is available within the QuadratiK package as wireless.

library(QuadratiK)
head(wireless)
##    V1  V2  V3  V4  V5  V6  V7 V8
## 1 -64 -56 -61 -66 -71 -82 -81  1
## 2 -68 -57 -61 -65 -71 -85 -85  1
## 3 -63 -60 -60 -67 -76 -85 -84  1
## 4 -61 -60 -68 -62 -77 -90 -80  1
## 5 -63 -65 -60 -63 -77 -81 -87  1
## 6 -64 -55 -63 -66 -76 -88 -83  1

The Wireless Indoor Localization data set contains measurements of the Wi-Fi signal strength in different indoor rooms. It consists of a data frame with 2000 rows and 8 columns. The first 7 variables report the Wi-Fi signal strength received from 7 different Wi-Fi routers in an office location in Pittsburgh (USA). The last column contains the class labels, from 1 to 4, identifying the different rooms. Note that the Wi-Fi signal strength is measured in dBm (decibel-milliwatts) and is expressed as a negative value ranging from -100 to 0. In total, we have 500 observations for each room.

Given that the Wi-Fi signal strength takes values in a limited range, it is appropriate to transform the observations onto the sphere by \(L_2\) normalization, that is, by dividing each observation by its Euclidean norm, and then to perform the clustering algorithm on the unit sphere in \(\mathbb{R}^7\).
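The row-wise \(L_2\) normalization can be illustrated on a small example. The following is a minimal sketch in base R; the matrix x and its values are hypothetical and are not taken from the wireless data.

```r
# Hypothetical toy matrix of signal strengths (dBm); illustration only
x <- matrix(c(-64, -56, -61,
              -68, -57, -61,
              -63, -60, -60),
            nrow = 3, byrow = TRUE)

# Divide each row by its Euclidean (L2) norm. The vector of row norms is
# recycled down the columns, so element [i, j] is divided by the norm of row i.
x_norm <- x / sqrt(rowSums(x^2))

# After normalization every row should have unit length
row_norms <- sqrt(rowSums(x_norm^2))
print(row_norms)
```

Since all the original values are negative, the normalized points lie in the negative orthant of the sphere, which is why the summary statistics of the normalized variables are all negative.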

We perform the clustering algorithm on the wireless data set, considering \(K = 3, 4, 5, 6\) as possible values for the number of clusters.

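# Separate the seven signal-strength variables from the room labels
wire <- wireless[,-8]
labels <- wireless[,8]
# Project each observation onto the unit sphere (L2 normalization)
wire_norm <- wire/sqrt(rowSums(wire^2))
# Fix the seed for reproducibility, then fit the clustering for K = 3, ..., 6
set.seed(2468)
res_pk <- pkbc(as.matrix(wire_norm),3:6)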

To guide the choice of the number of clusters, the function validation provides cluster validation measures and graphical tools. Specifically, it displays the Elbow plot of the computed within-cluster sum of squares values and returns an object with two elements: IGP, the In-Group Proportion, and metrics, a table of computed evaluation measures. The table includes the \(k\)-sample test statistics with the corresponding critical values and, if the true labels are provided, the adjusted Rand index (ARI), Macro-Precision and Macro-Recall.

set.seed(2468)
res_validation <- validation(res_pk, true_label = labels)
res_validation$IGP
round(res_validation$metrics, 8)
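To make the ARI reported in the metrics table more concrete, the following base R sketch computes it from the contingency table of two labelings. It is an illustration under stated assumptions, not the implementation used inside validation; the function adjusted_rand_index and the label vectors are hypothetical.

```r
# Adjusted Rand index computed from the contingency table of two labelings.
# Hypothetical helper for illustration; not part of the QuadratiK package.
adjusted_rand_index <- function(truth, pred) {
  tab <- table(truth, pred)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))            # pairs placed together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs together in truth
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs together in pred
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Hypothetical labels for six observations
truth        <- c(1, 1, 1, 2, 2, 2)
pred_perfect <- c(2, 2, 2, 1, 1, 1)  # same partition, cluster names swapped
pred_noisy   <- c(1, 1, 2, 2, 2, 2)  # one observation assigned differently

ari_perfect <- adjusted_rand_index(truth, pred_perfect)
ari_noisy   <- adjusted_rand_index(truth, pred_noisy)
print(c(perfect = ari_perfect, noisy = ari_noisy))
```

The first comparison shows that the ARI depends only on the partition and not on the names given to the clusters, which is why it can be used directly with cluster memberships whose numbering does not match the room labels.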

The Elbow plots and the reported metrics suggest \(K=4\) as the number of clusters, in accordance with the known ground truth. Once the number of clusters is selected, the function summary_stat in the QuadratiK package can be used to obtain additional summary information about the clustering results. In particular, the function provides the mean, standard deviation, median, interquartile range, minimum and maximum computed for each variable, overall and by assigned membership, as well as a graphical representation of the data points. For \(d=2\) and \(d=3\), observations are displayed on the unit circle and unit sphere, respectively. If \(d>3\), spherical PCA is applied to the data set, and the first 3 principal components are used to visualize the data points on the sphere, after application of \(L_2\) normalization.

summary_clust <- summary_stat(res_pk,4,true_label=labels)
summary_clust$metrics
## [[1]]
##            Group 1     Group 2     Group 3     Group 4     Overall
## mean   -0.33976246 -0.23423419 -0.29470203 -0.34377323 -0.30363137
## sd      0.01283014  0.05324085  0.01581506  0.01298448  0.05251967
## median -0.33952107 -0.24604104 -0.29625176 -0.34242063 -0.31847692
## IQR     0.01674475  0.03014941  0.02182651  0.01835030  0.06412536
## min    -0.37455424 -0.30643954 -0.34994496 -0.39357081 -0.39357081
## max    -0.29339739 -0.06308050 -0.23847076 -0.31226867 -0.06308050
## 
## [[2]]
##            Group 1     Group 2     Group 3     Group 4     Overall
## mean   -0.30636324 -0.35918738 -0.32636022 -0.31540450 -0.32655905
## sd      0.01392354  0.02052666  0.01842337  0.01608549  0.02635941
## median -0.30600005 -0.35809184 -0.32487137 -0.31538238 -0.32221067
## IQR     0.01958541  0.02460137  0.02426854  0.02263003  0.03624468
## min    -0.35600342 -0.45578394 -0.40026324 -0.36631517 -0.45578394
## max    -0.26903743 -0.30605235 -0.27775661 -0.27586207 -0.26903743
## 
## [[3]]
##            Group 1     Group 2     Group 3     Group 4     Overall
## mean   -0.32905221 -0.35786158 -0.31417147 -0.28929074 -0.32231793
## sd      0.01563476  0.02459888  0.01710971  0.02094437  0.03163402
## median -0.32915518 -0.35575396 -0.31237754 -0.29072716 -0.32209343
## IQR     0.01957377  0.03437142  0.02422092  0.02851547  0.03973413
## min    -0.38248214 -0.43623816 -0.38676345 -0.33639128 -0.43623816
## max    -0.28256746 -0.28237524 -0.26460827 -0.23485570 -0.23485570
## 
## [[4]]
##            Group 1     Group 2     Group 3     Group 4     Overall
## mean   -0.34918727 -0.24111635 -0.30079691 -0.35013402 -0.31082318
## sd      0.01475930  0.04839750  0.01973759  0.01679759  0.05251032
## median -0.34858664 -0.24898566 -0.30107248 -0.35004522 -0.32519736
## IQR     0.01898212  0.03356955  0.02672633  0.02286452  0.06907237
## min    -0.40129910 -0.31690470 -0.35542814 -0.41273961 -0.41273961
## max    -0.30779911 -0.07399736 -0.23119131 -0.30906423 -0.07399736
## 
## [[5]]
##            Group 1     Group 2     Group 3     Group 4     Overall
## mean   -0.38226045 -0.43319403 -0.37583631 -0.28257085 -0.36807675
## sd      0.01770626  0.02787579  0.01814674  0.02110273  0.05820777
## median -0.37991233 -0.43009295 -0.37549563 -0.28337199 -0.37738801
## IQR     0.02531565  0.04108696  0.02541214  0.02710382  0.06966093
## min    -0.45385132 -0.52580164 -0.42808192 -0.34353824 -0.52580164
## max    -0.34027255 -0.36916034 -0.33333333 -0.21464345 -0.21464345
## 
## [[6]]
##            Group 1     Group 2     Group 3     Group 4     Overall
## mean   -0.45125600 -0.46327257 -0.48366352 -0.49713921 -0.47391075
## sd      0.01446455  0.02195263  0.01814601  0.01412842  0.02488735
## median -0.45081811 -0.46519225 -0.48319020 -0.49649227 -0.47411000
## IQR     0.01833254  0.02770315  0.02503768  0.01895624  0.03825696
## min    -0.51475369 -0.53938031 -0.53130759 -0.53840040 -0.53938031
## max    -0.41652096 -0.40313011 -0.44107352 -0.45551482 -0.40313011
## 
## [[7]]
##            Group 1     Group 2     Group 3     Group 4     Overall
## mean   -0.45728346 -0.46863418 -0.48984740 -0.49701329 -0.47827795
## sd      0.01710219  0.02396734  0.01993870  0.01587742  0.02515570
## median -0.45703017 -0.46920446 -0.49013407 -0.49711582 -0.47914650
## IQR     0.02311965  0.03050579  0.02784124  0.02030659  0.03643335
## min    -0.50728615 -0.56835661 -0.54571977 -0.53648350 -0.56835661
## max    -0.41326597 -0.38995633 -0.43815927 -0.44423198 -0.38995633

The figure shows the data points according to the first three spherical principal components, colored by the true labels and by the assigned membership. The clusters identified with \(K=4\) achieve high performance in terms of ARI, Macro-Precision and Macro-Recall. The plot of the points using the principal components also shows that the identified clusters follow the initial labels.
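The projection behind this kind of plot for \(d>3\) can be sketched as follows. This is only an approximation of the idea: it uses ordinary PCA from base R (prcomp) as a stand-in for the spherical PCA applied by the package, and the simulated data are hypothetical.

```r
set.seed(1)

# Hypothetical data: 50 points in 5 dimensions, normalized to the unit sphere
x <- matrix(rnorm(50 * 5), ncol = 5)
x <- x / sqrt(rowSums(x^2))

# Ordinary PCA, used here as a stand-in for spherical PCA;
# keep the scores on the first three principal components
scores <- prcomp(x)$x[, 1:3]

# L2-normalize the scores so each projected point lies on the unit sphere in 3D
proj <- scores / sqrt(rowSums(scores^2))

# Norms of the projected points (all equal to 1 up to rounding)
proj_norms <- sqrt(rowSums(proj^2))
print(range(proj_norms))
```

The final normalization step is what allows the points to be drawn on a sphere, at the cost of discarding the radial information carried by the three retained components.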