As we learned in Chapters 6-8, classification could help us make predictions on new observations. However, classification requires (human supervised) predefined label classes. What if we are in the early phases of a study and/or don’t have the required resources to manually define, derive or generate these class labels?
Clustering can help us explore the dataset and separate cases into groups representing similar traits or characteristics. Each group could be a potential candidate for a class. Clustering is used for exploratory data analytics, i.e., as unsupervised learning, rather than for confirmatory analytics or for predicting specific outcomes.
In this chapter, we will present (1) clustering as a machine learning task, (2) silhouette plots for evaluating clustering quality, (3) the k-means clustering algorithm and how to tune it, (4) several interesting case-studies, including Divorce and Consequences on Young Adults, Pediatric Trauma, and Youth Development, (5) hierarchical clustering, and (6) Gaussian mixture modeling.
As we mentioned before, clustering represents classification of unlabeled cases. Scatter plots depict a simple illustration of the clustering process. Assume we don’t know much about the ingredients of frankfurter hot dogs and we have the following graph.
In terms of calories and sodium, these hot dogs are clearly separated into three different clusters. Cluster 1 has hot dogs with low calories and medium sodium content; cluster 2 has both calories and sodium at medium levels; cluster 3 has both sodium and calories at high levels. We can make a bold guess about the meats used in these three clusters of hot dogs. Cluster 1 could be mostly chicken, since it has low calories. The second cluster might be beef, and the third is likely pork, because beef hot dogs have considerably fewer calories and less salt than pork hot dogs. However, this is just guessing; some hot dogs contain a mixture of two or three types of meat. The real situation is similar to what we guessed, but with some random noise, especially in cluster 2.
The following 2 plots show the primary type of meat used for each hot dog labeled by name (top) and color-coded (bottom).
Silhouette plots are useful for interpretation and validation of consistency of all clustering algorithms. The silhouette value, \(\in [-1,1]\), measures the similarity (cohesion) of a data point to its cluster relative to other clusters (separation). Silhouette plots rely on a distance metric, e.g., the Euclidean distance, Manhattan distance, Minkowski distance, etc.
Suppose a clustering method groups all data points (objects), \(\{X_i\}_i\), into \(k\) clusters and define:
- \(d_i\): the average distance (dissimilarity) between \(X_i\) and all other points in its own cluster (a measure of cohesion), and
- \(l_i\): the lowest average distance between \(X_i\) and the points of any other cluster, i.e., the distance to its best-fitting neighboring cluster (a measure of separation).
Then, the silhouette of \(X_i\) is defined by:
\[-1 \leq s_i = \frac{l_i - d_i}{\max\{l_i, d_i\}} \equiv \begin{cases} 1-\frac{d_i}{l_i}, & \mbox{if } d_i < l_i \\ 0, & \mbox{if } d_i = l_i \\ \frac{l_i}{d_i}-1, & \mbox{if } d_i > l_i \\ \end{cases} \leq 1\]
Note that:
- \(s_i\) close to \(1\) indicates that \(X_i\) is well matched to its own cluster and well separated from neighboring clusters,
- \(s_i\) near \(0\) suggests that \(X_i\) lies on the boundary between two clusters, and
- \(s_i\) close to \(-1\) suggests that \(X_i\) may have been assigned to the wrong cluster.
The average silhouette value over all points provides an overall measure of how appropriately the data have been clustered.
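To make the definition concrete, here is a minimal sketch (using a small simulated toy dataset, purely for illustration) that computes the silhouette value of a single point "by hand" and checks it against silhouette() from the cluster package.
library(cluster)
set.seed(1)
X <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),    # 10 points near (0, 0)
           matrix(rnorm(20, mean = 4), ncol = 2))    # 10 points near (4, 4)
labels <- rep(1:2, each = 10)
D <- as.matrix(dist(X))                              # Euclidean distance matrix
i <- 1                                               # inspect the first point
own <- setdiff(which(labels == labels[i]), i)        # other members of its own cluster
d_i <- mean(D[i, own])                               # average within-cluster distance
l_i <- min(sapply(setdiff(unique(labels), labels[i]),
                  function(g) mean(D[i, labels == g])))  # best "other" cluster
s_i <- (l_i - d_i) / max(l_i, d_i)
c(by_hand = s_i, via_cluster_pkg = silhouette(labels, dist(X))[i, "sil_width"])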
The K-means algorithm is one of the most commonly used algorithms for clustering.
This algorithm is similar to k-nearest neighbors (KNN) presented in Chapter 6. In clustering, we don’t have a priori pre-determined labels, and the algorithm is trying to deduce intrinsic groupings in the data.
Similar to KNN, k-means most often uses the Euclidean distance (\(||\cdot||_2\) norm); however, the Manhattan distance (\(||\cdot||_1\) norm) or the more general Minkowski distance (\(\left(\sum_{i=1}^n{|p_i - q_i|^c}\right)^{\frac{1}{c}}\)) may also be used. For \(c=2\), the Minkowski distance reduces to the classical Euclidean distance: \[dist(x, y)=\sqrt{\sum_{i=1}^n(x_i-y_i)^2}.\]
How can we separate clusters using this distance formula? The k-means protocol is as follows:
1. Choose the number of clusters, \(k\), and select \(k\) initial cluster centers (e.g., \(k\) randomly chosen observations).
2. Assign each observation to the cluster whose center is nearest, according to the chosen distance metric.
3. Recompute each cluster center as the mean (centroid) of the observations currently assigned to it.
4. Repeat steps 2-3 until the cluster assignments no longer change (or a maximum number of iterations is reached).
Although there is no guarantee that the k-means algorithm converges to a global optimum, in practice, the algorithm tends to converge, i.e., the assignments no longer change, to a local minimum as there are only a finite number of such Voronoi partitionings.
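The following minimal sketch implements the Lloyd iteration described above. It is purely illustrative (random center initialization, no empty-cluster handling, and the built-in iris data used as example input); in practice one would simply call stats::kmeans().
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]          # step 1: random initial centers
  assignment <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # step 2: assign each point to its nearest center (squared Euclidean distance)
    d2 <- sapply(seq_len(k), function(j) colSums((t(X) - centers[j, ])^2))
    new_assignment <- max.col(-d2)
    if (all(new_assignment == assignment)) break            # converged: labels unchanged
    assignment <- new_assignment
    # step 3: recompute each center as the centroid of its assigned points
    for (j in seq_len(k))
      centers[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centers, iterations = iter)
}
set.seed(321)
fit <- simple_kmeans(scale(iris[, 1:4]), k = 3)             # built-in iris data as an example
table(fit$cluster)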
We don’t want our number of clusters to be either too large or too small. If it is too large, the groups are too specific to be meaningful. On the other hand, too few groups might be too broadly general to be useful. As we mentioned in Chapter 6, \(k=\sqrt{\frac{n}{2}}\) is a good place to start. However, it might generate a large number of groups.
Also, the elbow method may be used to determine the relationship between \(k\) and the homogeneity of the observations within each cluster.
When we graph within-group homogeneity against \(k\), we can find an “elbow point” that suggests a minimum \(k\) corresponding to relatively large within-group homogeneity.
This graph shows that homogeneity barely increases above the “elbow point”. There are various ways to measure homogeneity within a cluster; for detailed explanations, see M. Halkidi, Y. Batistakis, and M. Vazirgiannis (2001), On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17:107-145.
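As a quick sketch of the elbow method, one can plot the total within-cluster sum of squares returned by kmeans() against a range of \(k\) values (here using the built-in iris measurements, standardized, purely as example data) and look for the bend in the curve.
set.seed(123)
X <- scale(iris[, 1:4])                 # example data, standardized
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares", main = "Elbow plot")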
The dataset we will be using is the Divorce and Consequences on Young Adults dataset. This is a longitudinal study focused on examining the consequences of recent parental divorce for young adults (initially ages 18-23) whose parents had divorced within 15 months of the study’s first wave (1990-91). The sample consisted of 257 White respondents with newly divorced parents. Here we have a subset of this dataset with 47 respondents in our case-studies folder, CaseStudy01_Divorce_YoungAdults_Data.csv.
Let’s load the dataset and pull out a summary of all variables.
divorce<-read.csv("https://umich.instructure.com/files/399118/download?download_frd=1")
summary(divorce)
## DIVYEAR momint dadint momclose
## Min. :89.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:89.00 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000
## Median :90.00 Median :1.000 Median :2.000 Median :2.000
## Mean :89.68 Mean :1.809 Mean :2.489 Mean :1.809
## 3rd Qu.:90.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :90.00 Max. :4.000 Max. :4.000 Max. :4.000
## depression livewithmom gethitched
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :1.000 Median :2.000
## Mean :2.851 Mean :1.489 Mean :2.213
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :9.000 Max. :8.000
According to the summary, DIVYEAR takes only two values (89 or 90), so it behaves like a dummy variable. We can recode (binarize) DIVYEAR using the ifelse() function (mentioned in Chapter 7). The following line of code generates a new indicator variable for divorce year = 1990.
divorce$DIVYEAR<-ifelse(divorce$DIVYEAR==89, 0, 1)
We also need another preprocessing step to deal with livewithmom, which has missing values coded as livewithmom=9. We can impute these using the momint and dadint variables for each specific participant.
table(divorce$livewithmom)
##
## 1 2 9
## 31 15 1
divorce[divorce$livewithmom==9, ]
## DIVYEAR momint dadint momclose depression livewithmom gethitched
## 45 1 3 1 3 3 9 2
For instance, respondents who feel much closer to their dads may be assigned livewithmom=2, suggesting they most likely live with their fathers. Of course, alternative imputation strategies are also possible.
divorce[45, 6]<-2
divorce[45, ]
## DIVYEAR momint dadint momclose depression livewithmom gethitched
## 45 1 3 1 3 3 2 2
We are only using base R functionality, so there is no need to install any additional packages, although loading library(stats) may be necessary. The kmeans() function will then provide the k-means clustering of the data.
myclusters<-kmeans(mydata, k)
The output consists of several components, including cluster (a vector of cluster assignments), centers (a matrix of cluster centroids), size (the number of observations in each cluster), and withinss/tot.withinss (within-cluster sum-of-squares measures), as illustrated below.
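As a quick, runnable sketch (using the built-in iris data as a hypothetical stand-in for mydata), the main components of a kmeans() result can be inspected as follows.
set.seed(123)
myclusters <- kmeans(scale(iris[, 1:4]), centers = 3)
myclusters$size            # number of observations assigned to each cluster
head(myclusters$cluster)   # cluster assignment for each observation
myclusters$centers         # matrix of cluster centroids (in standardized units)
myclusters$tot.withinss    # total within-cluster sum of squares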
Before we perform clustering, we need to standardize the features to avoid biasing the clustering toward features measured on larger scales; note that distance calculations are sensitive to measurement units. as.data.frame() will convert our dataset into a data frame, allowing us to use the lapply() function. Next, we use a combination of lapply() and scale() to standardize our data.
di_z<- as.data.frame(lapply(divorce, scale))
str(di_z)
## 'data.frame': 47 obs. of 7 variables:
## $ DIVYEAR : num 0.677 0.677 -1.445 0.677 -1.445 ...
## $ momint : num 1.258 1.258 -0.854 1.258 -0.854 ...
## $ dadint : num -0.514 -0.514 0.536 1.586 0.536 ...
## $ momclose : num 0.225 1.401 -0.951 1.401 0.225 ...
## $ depression : num 0.164 -0.937 1.265 0.164 -2.038 ...
## $ livewithmom: num -0.711 1.377 -0.711 -0.711 -0.711 ...
## $ gethitched : num 0.846 -0.229 -0.229 0.846 -0.229 ...
The resulting dataset, di_z, is standardized so that all features are unitless, centered at zero, and have unit variance.
Next, we need to think about selecting a proper \(k\). We have a relatively small dataset with only 47 observations, so obviously we cannot use a \(k\) as large as 10. The rule of thumb suggests \(k=\sqrt{47/2}\approx 4.8\). This is still relatively large, since it would leave fewer than 10 observations per cluster, and it is very likely that some clusters would contain only one observation. A better choice may be \(k=3\). Let’s see if this works.
library(stats)
set.seed(321)
diz_clussters<-kmeans(di_z, 3)
Let’s look at the clusters created by the k-means model.
diz_clussters$size
## [1] 12 24 11
At first glance, it seems that \(k=3\) worked well as the number of clusters. None of the clusters contains only a handful of observations, and the three clusters have relatively equal numbers of respondents.
Silhouette plots represent an appropriate evaluation strategy to assess the quality of the clustering. Silhouette values lie between -1 and 1. In our case, two data points have negative silhouette values, suggesting these cases may be “mis-clustered”, or perhaps are ambiguous, as their silhouette values are close to 0. We can observe that the average silhouette is reasonable, about \(0.2\).
require(cluster)
## Loading required package: cluster
dis = dist(di_z)
sil = silhouette(diz_clussters$cluster, dis)
summary(sil)
## Silhouette of 47 units in 3 clusters from silhouette.default(x = diz_clussters$cluster, dist = dis) :
## Cluster sizes and average silhouette widths:
## 12 24 11
## 0.16444649 0.27684356 0.07921684
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.08466 0.11760 0.20080 0.20190 0.30450 0.39820
plot(sil)
The next step would be to interpret the clusters in the context of this social study.
diz_clussters$centers
## DIVYEAR momint dadint momclose depression livewithmom
## 1 0.5004720 1.1698438 -0.07631029 1.2049200 -0.1112567 0.1591755
## 2 -0.2953914 -0.5016290 0.36107795 -0.5096937 0.1180883 -0.7107373
## 3 0.0985208 -0.1817299 -0.70455885 -0.2023993 -0.1362761 1.3770536
## gethitched
## 1 -0.1390230
## 2 -0.1390230
## 3 0.4549845
This result shows:
Cluster 1: divyear=mostly 90, momint=very close, dadint=not close, livewithmom=mostly mother, depression=not often, (gethitched) marry=will likely not get married. Cluster 1 represents adolescents who are closer to their mom than to their dad. These young adults do not often feel depressed and may avoid getting married; they tend to be more emotional and do not appear to place a high value on family.
Cluster 2: divyear=mostly 89, momint=not close, dadint=very close, livewithmom=father, depression=mild, marry=do not know/not inclined. Cluster 2 includes children who mostly live with their dad and only feel close to their dad. These respondents did not feel severely depressed and are not inclined to marry; they may prefer freedom and tend to be more naive.
Cluster 3: divyear=mix of 89 and 90, momint=not close, dadint=not close at all, livewithmom=mother, depression=sometimes, marry=tend to get married. Cluster 3 contains children who did not feel close to either parent. They sometimes felt depressed and are willing to build their own families; these young adults appear to be more independent.
We can see that the three clusters capture three distinct types of young adults.
Bar plots provide an alternative strategy to visualize the difference between clusters.
par(mfrow=c(1, 1), mar=c(4, 4, 4, 2))
myColors <- c("darkblue", "red", "green", "brown", "pink", "purple", "yellow")
barplot(t(diz_clussters$centers), beside = TRUE, xlab="cluster",
ylab="value", col = myColors)
legend("top", ncol=2, legend = c("DIVYEAR", "momint", "dadint", "momclose", "depression", "livewithmom", "gethitched"), fill = myColors)
For each of the three clusters, the bars in the plot above represent the features in the following order: DIVYEAR, momint, dadint, momclose, depression, livewithmom, gethitched.
Clustering results can be utilized as new information augmenting the original dataset. For instance, we can add a cluster label in our divorce dataset:
divorce$clusters<-diz_clussters$cluster
divorce[1:5, ]
## DIVYEAR momint dadint momclose depression livewithmom gethitched
## 1 1 3 2 2 3 1 3
## 2 1 3 2 3 2 2 2
## 3 0 1 3 1 4 1 2
## 4 1 3 4 3 3 1 3
## 5 0 1 3 2 1 1 2
## clusters
## 1 1
## 2 1
## 3 2
## 4 1
## 5 2
We can also examine the relationship between living with mom (livewithmom) and feeling close to mom (momint) by displaying a scatter plot of these two variables. If we suspect that young adults’ personality might affect this relationship, we can include the potential personality (cluster type) in the plot. The cluster labels associated with each participant are printed at different positions relative to each pair of observations, (livewithmom, momint).
require(ggplot2)
ggplot(divorce, aes(livewithmom, momint), main="Scatterplot Live with mom vs feel close to mom") +
geom_point(aes(colour = factor(clusters), shape=factor(clusters), stroke = 8), alpha=1) +
theme_bw(base_size=25) +
geom_text(aes(label=ifelse(clusters%in%1, as.character(clusters), ''), hjust=2, vjust=2, colour = factor(clusters)))+
geom_text(aes(label=ifelse(clusters%in%2, as.character(clusters), ''), hjust=-2, vjust=2, colour = factor(clusters)))+
geom_text(aes(label=ifelse(clusters%in%3, as.character(clusters), ''), hjust=2, vjust=-1, colour = factor(clusters))) +
guides(colour = guide_legend(override.aes = list(size=8))) +
theme(legend.position="top")
We used the ggplot() function from the ggplot2 package to label points with their cluster types. ggplot(divorce, aes(livewithmom, momint)) + geom_point() gives us the scatter plot, and the three geom_text() calls label the points with the corresponding cluster identifiers.
This plot shows that living with mom does not necessarily mean young adults will feel close to her. The “emotional” (cluster 1) young adults felt close to their mom whether or not they lived with her. The “naive” (cluster 2) young adults feel closer to mom when they live with her, although overall they tend to be estranged from their mothers. The “independent” (cluster 3) young adults are the opposite of cluster 1; they felt closer to mom when they did not live with her.
Let’s continue with the divorce data to illustrate a model improvement using k-means++.
(Appropriate) initialization of the k-means algorithm is of paramount importance. The k-means++ extension provides a practical strategy for obtaining a good initialization for k-means clustering; below, we define a kpp_init function that implements this seeding method.
# install.packages("matrixStats")
require(matrixStats)
## Loading required package: matrixStats
kpp_init = function(dat, K) {
  x = as.matrix(dat)
  n = nrow(x)
  # Randomly choose a first center
  centers = matrix(NA, nrow=K, ncol=ncol(x))
  set.seed(123)
  centers[1,] = as.matrix(x[sample(1:n, 1),])
  for (k in 2:K) {
    # Calculate dist^2 to closest center for each point
    dists = matrix(NA, nrow=n, ncol=k-1)
    for (j in 1:(k-1)) {
      temp = sweep(x, 2, centers[j,], '-')
      dists[,j] = rowSums(temp^2)
    }
    dists = rowMins(dists)
    # Draw next center with probability proportional to dist^2
    cumdists = cumsum(dists)
    prop = runif(1, min=0, max=cumdists[n])
    centers[k,] = as.matrix(x[min(which(cumdists > prop)),])
  }
  return(centers)
}
clust_kpp = kmeans(di_z, kpp_init(di_z, 3), iter.max=100, algorithm='Lloyd')
We can observe some differences.
clust_kpp$centers
## DIVYEAR momint dadint momclose depression livewithmom
## 1 0.3741445 1.2578161 -0.6636602 0.5610651 -0.1505730 -0.4124815
## 2 -0.2659149 -0.5798266 0.3805174 -0.2538624 0.1639572 -0.5560862
## 3 0.3508225 0.5269697 -0.4329499 0.2251408 -0.2594488 1.3770536
## gethitched
## 1 0.9990071
## 2 -0.1489684
## 3 -0.2285310
This improvement is not substantial; the new overall average silhouette value remains about \(0.2\) for k-means++, compared with the value of \(0.2\) reported for the earlier k-means clustering, albeit the 3 groups generated by each method are quite distinct. In addition, the number of “mis-clustered” instances remains 2, although their silhouette values are somewhat smaller than before, and the average silhouette value for cluster 1 is now quite low (\(0.006\)).
sil2 = silhouette(clust_kpp$cluster, dis)
summary(sil2)
## Silhouette of 47 units in 3 clusters from silhouette.default(x = clust_kpp$cluster, dist = dis) :
## Cluster sizes and average silhouette widths:
## 7 27 13
## 0.00644352 0.24933847 0.19476785
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.12750 0.08781 0.22950 0.19810 0.29050 0.38120
plot(sil2)
Similar to what we did for KNN and SVM, we can tune the k-means parameters, including the center initialization and \(k\).
n_rows <- 21
mat = matrix(0,nrow = n_rows)
for (i in 2:n_rows){
set.seed(321)
clust_kpp = kmeans(di_z, kpp_init(di_z, i), iter.max=100, algorithm='Lloyd')
sil = silhouette(clust_kpp$cluster, dis)
mat[i] = mean(as.matrix(sil)[,3])
}
colnames(mat) <- c("Avg_Silhouette_Value")
mat
## Avg_Silhouette_Value
## [1,] 0.0000000
## [2,] 0.1948335
## [3,] 0.1980686
## [4,] 0.1789654
## [5,] 0.1716270
## [6,] 0.1546357
## [7,] 0.1622488
## [8,] 0.1767659
## [9,] 0.1928883
## [10,] 0.2026559
## [11,] 0.2006313
## [12,] 0.1586044
## [13,] 0.1735035
## [14,] 0.1707446
## [15,] 0.1626367
## [16,] 0.1609723
## [17,] 0.1785733
## [18,] 0.1839546
## [19,] 0.1660019
## [20,] 0.1573574
## [21,] 0.1561791
ggplot(data.frame(k=2:n_rows,sil=mat[2:n_rows]),aes(x=k,y=sil))+
geom_line()+
scale_x_continuous(breaks = 2:n_rows)
This suggests that \(k\sim 3\) may be an appropriate number of clusters to use in this case.
Next, let’s set the maximal number of iterations of the algorithm and rerun the model with k=2, k=3, or k=10. Below, we only demonstrate the results for \(k=3\). There are still a few mis-clustered observations, so this is not a significant improvement over the prior model according to the average silhouette measure.
k <- 3
set.seed(31)
clust_kpp = kmeans(di_z, kpp_init(di_z, k), iter.max=200, algorithm="MacQueen")
sil3 = silhouette(clust_kpp$cluster, dis)
summary(sil3)
## Silhouette of 47 units in 3 clusters from silhouette.default(x = clust_kpp$cluster, dist = dis) :
## Cluster sizes and average silhouette widths:
## 10 22 15
## 0.02096194 0.30414984 0.15474729
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1365 0.1032 0.1971 0.1962 0.3122 0.4113
plot(sil3)
Note that we now see 3 cases of group 1 that have negative silhouette values (previously we had only 2), albeit the overall average silhouette remains \(0.2\).
Let’s go through another example demonstrating the k-means clustering method using a larger dataset.
The dataset we will interrogate now includes Services Utilization by Trauma-Exposed Children in the US data, which is located in our case-studies folder. This case study examines associations between post-traumatic psychopathology and service utilization by trauma-exposed children.
Variables: the data (Case_04_ChildTrauma._Data.csv) are tab-delimited and include id, sex, age, ses, race, traumatype, ptsd, dissoc, and service.
First, we need to load the dataset into R and report its summary and dimensions.
trauma<-read.csv("https://umich.instructure.com/files/399129/download?download_frd=1", sep = " ")
summary(trauma); dim(trauma)
## id sex age ses
## Min. : 1.0 Min. :0.000 Min. : 2.000 Min. :0.00
## 1st Qu.: 250.8 1st Qu.:0.000 1st Qu.: 7.000 1st Qu.:0.00
## Median : 500.5 Median :1.000 Median : 9.000 Median :0.00
## Mean : 500.5 Mean :0.506 Mean : 8.982 Mean :0.18
## 3rd Qu.: 750.2 3rd Qu.:1.000 3rd Qu.:11.000 3rd Qu.:0.00
## Max. :1000.0 Max. :1.000 Max. :25.000 Max. :1.00
## race traumatype ptsd dissoc
## black :200 dvexp :250 Min. :0.00 Min. :0.000
## hispanic:100 neglect :350 1st Qu.:0.00 1st Qu.:0.000
## other :100 physabuse :100 Median :0.00 Median :1.000
## white :600 psychabuse:200 Mean :0.29 Mean :0.598
## sexabuse :100 3rd Qu.:1.00 3rd Qu.:1.000
## Max. :1.00 Max. :1.000
## service
## Min. : 0.000
## 1st Qu.: 8.000
## Median :10.000
## Mean : 9.926
## 3rd Qu.:12.000
## Max. :20.000
## [1] 1000 9
In the summary we see two factors, race and traumatype. The traumatype variable codes the real classes we are interested in. If the clusters created by the model are similar to the trauma types, the model may have a reasonable interpretation. Let’s also create a dummy variable for each racial category.
trauma$black<-ifelse(trauma$race=="black", 1, 0)
trauma$hispanic<-ifelse(trauma$race=="hispanic", 1, 0)
trauma$other<-ifelse(trauma$race=="other", 1, 0)
trauma$white<-ifelse(trauma$race=="white", 1, 0)
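As an alternative sketch, all four race indicator columns could also be generated at once with model.matrix(); this is equivalent to the four ifelse() statements above.
race_dummies <- model.matrix(~ race - 1, data = trauma)   # one indicator column per race level
head(race_dummies)
# trauma <- cbind(trauma, race_dummies)                   # optional: attach to the dataset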
Then, we will remove the class variable, traumatype, from the dataset to avoid biasing the clustering algorithm. In this way, we simulate a real biomedical case-study where we do not necessarily have the actual class information available, i.e., the classes are latent features.
trauma_notype<-trauma[, -c(1, 5, 6)]
Similar to case-study 1, let’s standardize the dataset and fit a k-means model.
tr_z<- as.data.frame(lapply(trauma_notype, scale))
str(tr_z)
## 'data.frame': 1000 obs. of 10 variables:
## $ sex : num 0.988 0.988 -1.012 -1.012 0.988 ...
## $ age : num -0.997 1.677 -0.997 0.674 -0.662 ...
## $ ses : num -0.468 -0.468 -0.468 -0.468 -0.468 ...
## $ ptsd : num 1.564 -0.639 -0.639 -0.639 1.564 ...
## $ dissoc : num 0.819 -1.219 0.819 0.819 0.819 ...
## $ service : num 2.314 0.678 -0.303 0.351 1.66 ...
## $ black : num 2 2 2 2 2 ...
## $ hispanic: num -0.333 -0.333 -0.333 -0.333 -0.333 ...
## $ other : num -0.333 -0.333 -0.333 -0.333 -0.333 ...
## $ white : num -1.22 -1.22 -1.22 -1.22 -1.22 ...
set.seed(1234)
trauma_clusters<-kmeans(tr_z, 6)
Here we use k=6, hoping that most of the resulting clusters align with the 5 trauma types. In this case study we have 1,000 observations, so k=6 is a reasonable option.
To assess the clustering model results, we can examine the resulting clusters.
trauma_clusters$centers
## sex age ses ptsd dissoc
## 1 -0.001999144 0.061154336 -0.091055799 -0.077094361 0.02446247
## 2 -1.011566709 -0.006361734 -0.000275301 0.002214351 0.81949287
## 3 0.026286613 -0.043755817 0.029890657 0.064206246 -1.21904661
## 4 0.067970886 0.046116384 -0.078047828 0.044053921 -0.07746450
## 5 0.047979449 -0.104263129 0.156095655 0.022026960 -0.09784989
## 6 0.987576985 0.028799955 0.019511957 -0.038046568 0.81949287
## service black hispanic other white
## 1 0.001308569 1.9989997 -0.3331666 -0.3331666 -1.2241323
## 2 0.126332303 -0.4997499 -0.3331666 -0.3331666 0.8160882
## 3 -0.030083167 -0.4997499 -0.3331666 -0.3331666 0.8160882
## 4 0.128894052 -0.4997499 -0.3331666 2.9984996 -1.2241323
## 5 -0.103376956 -0.4997499 2.9984996 -0.3331666 -1.2241323
## 6 -0.111481162 -0.4997499 -0.3331666 -0.3331666 0.8160882
myColors <- c("darkblue", "red", "green", "brown", "pink", "purple", "lightblue", "orange", "grey", "yellow")
barplot(t(trauma_clusters$centers), beside = TRUE, xlab="cluster",
ylab="value", col = myColors)
legend("topleft", ncol=4, legend = c("sex", "age", "ses", "ptsd", "dissoc", "service", "black", "hispanic", "other", "white"), fill = myColors)
On this barplot, the bars in each cluster represent sex, age, ses, ptsd, dissoc, service, black, hispanic, other, and white, respectively. It is quite obvious that each cluster has some unique features.
Next, we can compare the k-means computed cluster labels to the original labels. Let’s evaluate the similarities between the automated cluster labels and their real class counterparts using a confusion matrix table, where rows represent the k-means clusters, columns show the actual labels, and the cell values include the frequencies of the corresponding pairings.
trauma$clusters<-trauma_clusters$cluster
table(trauma$clusters, trauma$traumatype)
##
## dvexp neglect physabuse psychabuse sexabuse
## 1 0 0 100 0 100
## 2 10 118 0 61 0
## 3 23 133 0 79 0
## 4 100 0 0 0 0
## 5 100 0 0 0 0
## 6 17 99 0 60 0
We can see that all of the children in clusters 4 and 5 belong to dvexp (exposure to domestic violence or intimate partner violence), cluster 1 contains all of the physabuse and sexabuse cases (split evenly between the two), and clusters 2, 3, and 6 are dominated by neglect cases. If we label each cluster with its most frequent trauma type (its mode), we correctly classify 100 + 118 + 133 + 100 + 100 + 99 = 650 of the 1,000 cases. The model has some trouble distinguishing neglect from psychabuse (and, to a lesser extent, dvexp), but otherwise achieves reasonable accuracy.
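The “modal class” agreement described above can also be computed directly from the confusion table, as in this short sketch.
conf <- table(trauma$clusters, trauma$traumatype)
# label each cluster by its most frequent trauma type and compute the agreement rate
sum(apply(conf, 1, max)) / sum(conf)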
Let’s review the silhouette value summary. The clustering works reasonably well, as only a small portion of the samples appear to be mis-clustered.
dis_tra = dist(tr_z)
sil_tra = silhouette(trauma_clusters$cluster, dis_tra)
summary(sil_tra)
## Silhouette of 1000 units in 6 clusters from silhouette.default(x = trauma_clusters$cluster, dist = dis_tra) :
## Cluster sizes and average silhouette widths:
## 200 189 235 100 100 176
## 0.2595725 0.2185706 0.1039559 0.3223076 0.3199830 0.2423110
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.008893 0.139100 0.234400 0.224500 0.303300 0.388200
#plot(sil_tra)
# report the overall mean silhouette value
mean(sil_tra[,"sil_width"])
## [1] 0.2245298
# The sil object colnames are ("cluster", "neighbor", "sil_width")
Next, let’s try to tune \(k\) with k-means++ and see if \(k=6\) appears to be optimal.
mat = matrix(0,nrow = 11)
for (i in 2:11){
set.seed(321)
clust_kpp = kmeans(tr_z, kpp_init(tr_z, i), iter.max=100, algorithm='Lloyd')
sil = silhouette(clust_kpp$cluster, dis_tra)
mat[i] = mean(as.matrix(sil)[,3])
}
mat
## [,1]
## [1,] 0.0000000
## [2,] 0.2433222
## [3,] 0.1675486
## [4,] 0.1997315
## [5,] 0.2116534
## [6,] 0.2400086
## [7,] 0.2251367
## [8,] 0.2199859
## [9,] 0.2249569
## [10,] 0.2347122
## [11,] 0.2304451
ggplot(data.frame(k=2:11,sil=mat[2:11]),aes(x=k,y=sil))+geom_line()+scale_x_continuous(breaks = 2:11)
Finally, let’s use k-means++ with \(k=6\) and set the algorithm’s maximal iteration before rerunning the experiment:
set.seed(1234)
clust_kpp = kmeans(tr_z, kpp_init(tr_z, 6), iter.max=100, algorithm='Lloyd')
sil = silhouette(clust_kpp$cluster, dis_tra)
summary(sil)
## Silhouette of 1000 units in 6 clusters from silhouette.default(x = clust_kpp$cluster, dist = dis_tra) :
## Cluster sizes and average silhouette widths:
## 422 100 178 85 15 200
## 0.2166778 0.3353976 0.1898492 0.2478090 0.2294502 0.2836607
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03672 0.19730 0.23080 0.24000 0.27710 0.40650
# plot(sil)
# report the overall mean silhouette value
mean(sil[,"sil_width"])
## [1] 0.2400086
Let’s use the Boys Town Study of Youth Development data (the second case study, CaseStudy02_Boystown_Data.csv, used in Chapter 6) to find clusters based on GPA, alcohol abuse, attitudes on drinking, social status, parental closeness, and delinquency (all variables other than ID and gender).
First, we must load the data and transform sex, dadjob, and momjob into dummy variables.
boystown<-read.csv("https://umich.instructure.com/files/399119/download?download_frd=1", sep=" ")
boystown$sex<-boystown$sex-1
boystown$dadjob <- (-1)*(boystown$dadjob-2)
boystown$momjob <- (-1)*(boystown$momjob-2)
str(boystown)
## 'data.frame': 200 obs. of 11 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : num 0 0 0 0 1 1 0 0 1 1 ...
## $ gpa : int 5 0 3 2 3 3 1 5 1 3 ...
## $ Alcoholuse: int 2 4 2 2 6 3 2 6 5 2 ...
## $ alcatt : int 3 2 3 1 2 0 0 3 0 1 ...
## $ dadjob : num 1 1 1 1 1 1 1 1 1 1 ...
## $ momjob : num 0 0 0 0 1 0 0 0 1 1 ...
## $ dadclose : int 1 3 2 1 2 1 3 6 3 1 ...
## $ momclose : int 1 4 2 2 1 2 1 2 3 2 ...
## $ larceny : int 1 0 0 3 1 0 0 0 1 1 ...
## $ vandalism : int 3 0 2 2 2 0 5 1 4 0 ...
Then, extract all the variables except the first two columns (subject identifiers and gender).
boystown_sub<-boystown[, -c(1, 2)]
Next, we need to standardize the data and cluster it with k=3; a sketch of these steps appears below. You may obtain centers similar to the following (the exact numbers could be a little different).
## gpa Alcoholuse alcatt dadjob momjob dadclose
## 1 -0.5101243 -0.08555163 -0.30098866 0.1939577 0.04868109 1.1914502
## 2 -0.2753631 0.49998217 0.13804858 -0.2421906 -0.30151766 -0.4521484
## 3 0.6590193 -0.51256447 0.04599325 0.1451756 0.31107377 -0.2896562
## momclose larceny vandalism
## 1 0.65647213 -0.1755012 -0.4453044
## 2 -0.33341358 -0.4017282 0.5252308
## 3 -0.06343891 0.5769583 -0.2981561
Add the k-means cluster labels as a new (last) column back into the original dataset, as sketched below.
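A minimal sketch of these omitted steps is shown below; the seed and cluster numbering are arbitrary assumptions, so your centers may be permuted relative to those shown above.
bt_z <- as.data.frame(lapply(boystown_sub, scale))   # standardize the features
set.seed(1234)                                       # arbitrary seed (assumption)
bt_clusters <- kmeans(bt_z, 3)                       # k-means with k = 3
bt_clusters$centers
boystown$clusters <- bt_clusters$cluster             # append the labels as the last column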
To investigate the gender distribution within the different clusters, we may use aggregate().
# Compute the averages for the variable 'sex', grouped by cluster
aggregate(data=boystown, sex~clusters, mean)
## clusters sex
## 1 1 0.6875000
## 2 2 0.5802469
## 3 3 0.6760563
Here, clusters is the new vector indicating the cluster labels. The gender distribution does not vary much between the different clusters.
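Instead of cluster-wise means, one could also cross-tabulate the raw counts of gender by cluster, as in this brief sketch.
table(sex = boystown$sex, cluster = boystown$clusters)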
There are a number of R hierarchical clustering packages, including:
- hclust in base R.
- agnes in the cluster package.
Alternative distance measures (or linkages) can be used in all hierarchical clustering methods, e.g., single, complete, and Ward.
We will demonstrate hierarchical clustering using case-study 1 (Divorce and Consequences on Young Adults). We pre-set \(k=3\) (for cutting the resulting dendrogram) and note that we again have to use the normalized data for hierarchical clustering.
require(cluster)
pitch_sing = agnes(di_z, diss=FALSE, method='single')
pitch_comp = agnes(di_z, diss=FALSE, method='complete')
pitch_ward = agnes(di_z, diss=FALSE, method='ward')
sil_sing = silhouette(cutree(pitch_sing, k=3), dis)
sil_comp = silhouette(cutree(pitch_comp, k=3), dis)
# try 10 clusters, see plot above
sil_ward = silhouette(cutree(pitch_ward, k=10), dis)
You can generate the hierarchical plot with ggdendrogram() from the ggdendro package.
# install.packages("ggdendro")
require(ggdendro)
## Loading required package: ggdendro
ggdendrogram(as.dendrogram(pitch_ward), leaf_labels=FALSE, labels=FALSE)
mean(sil_ward[,"sil_width"])
## [1] 0.2398738
ggdendrogram(as.dendrogram(pitch_ward), leaf_labels=TRUE, labels=T, size=10)
Generally speaking, the best result should come from Ward linkage, but you should also try complete linkage (method='complete'). We can see that the hierarchical clustering result (average silhouette value \(\sim 0.24\)) mostly agrees with the prior k-means (\(0.2\)) and k-means++ (\(0.2\)) results.
summary(sil_ward)
## Silhouette of 47 units in 10 clusters from silhouette.default(x = cutree(pitch_ward, k = 10), dist = dis) :
## Cluster sizes and average silhouette widths:
## 4 5 6 3 6 12
## 0.25905454 0.29195989 0.29305926 -0.02079056 0.19263836 0.26268274
## 5 2 3 1
## 0.32594365 0.44074717 0.08760990 0.00000000
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1477 0.1231 0.2577 0.2399 0.3524 0.5176
plot(sil_ward)
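To compare the three linkages directly, one can also report their average silhouette widths side by side (single and complete were cut at k=3 and Ward at k=10 above); this is a short sketch using the objects already computed.
c(single   = mean(sil_sing[, "sil_width"]),
  complete = mean(sil_comp[, "sil_width"]),
  ward     = mean(sil_ward[, "sil_width"]))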
More details about Gaussian mixture models (GMM) are provided here. Below is a brief introduction to GMM using the Mclust function in the R package mclust.
For multivariate mixtures, there are 14 possible models in total, corresponding to different covariance parameterizations; the three-letter model names (e.g., EII, VVI, EEE, VVV) indicate whether the component volume, shape, and orientation are equal (E), varying (V), or identity/axis-aligned (I) across clusters, as listed in the sketch below.
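As a quick sketch (assuming the mclust package is installed), the 14 available multivariate model names can be listed directly.
library(mclust)
mclust.options("emModelNames")   # the 14 covariance parameterizations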
For more practical details, you may refer to the Mclust documentation. For more theoretical details, see C. Fraley and A. E. Raftery (2002), Model-Based Clustering, Discriminant Analysis, and Density Estimation, Journal of the American Statistical Association, 97:611-631.
Let’s use the Divorce and Consequences on Young Adults dataset for a demonstration.
library(mclust)
## Package 'mclust' version 5.2.3
## Type 'citation("mclust")' for citing this R package in publications.
set.seed(1234)
gmm_clust = Mclust(di_z)
gmm_clust$modelName
## [1] "EEE"
Thus, the optimal model here is “EEE”, i.e., ellipsoidal components with equal volume, shape, and orientation.
plot(gmm_clust$BIC, legendArgs = list(x = "bottom", ncol = 2, cex = 1))
plot(gmm_clust,what = "density")
plot(gmm_clust,what = "classification")
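As a brief sketch, we can also cross-tabulate the GMM classification against the earlier k-means labels (both objects were created above; the cluster numbering is arbitrary, so only the pattern of agreement matters).
table(GMM = gmm_clust$classification, kmeans = diz_clussters$cluster)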
Try to replicate these results with other data from the list of our Case-Studies.