# Hierarchical Agglomerative Clustering
# Load data
Review the numeric heart dataset from 08-PCA.Rmd
```{r}
load("data/unsupervised.RData")
str(ml_num)
```
# Overview
Hierarchical agglomerative clustering is a "bottom-up" method of clustering. Each observation begins as its own cluster, and the most similar clusters are fused step by step as we move up the hierarchy. That is, all leaves start as singleton clusters and are progressively merged as grouping moves up the trunk and the various branches form.
The distance measure and clustering method are usually noted at the bottom of the plot, while the vertical axis shows the height, i.e., the dissimilarity at which two clusters are fused. We can also "cut" the dendrogram to obtain a chosen number of clusters, much like defining _k_ in k-means clustering (and choosing that number can be just as problematic).
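As a minimal sketch of the idea on simulated data (the toy objects below are illustrative and not part of the heart analysis), five points produce a dendrogram with four fusions, and cutting it yields cluster labels:
```{r}
# Toy example: five random points in two dimensions
set.seed(1)
toy = matrix(rnorm(10), nrow = 5)
toy_hc = hclust(dist(toy), method = "complete")
plot(toy_hc, hang = -1, main = "Toy dendrogram")
# Cutting at k = 2 assigns each of the five points to one of two clusters
cutree(toy_hc, k = 2)
```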
```{r}
library(ape)
library(pvclust)
library(mclust)
```
Start with the built-in `hclust` function, which requires a dissimilarity matrix produced by the `dist` function. Note that this clusters rows (observations), whereas the `pvclust` methods further below cluster columns (variables).
```{r}
?hclust
# Create distance matrix
heart_dist = dist(ml_num[, -6], method = "euclidean")
# Fit hclust_model
system.time({
hclust_model = hclust(heart_dist, method = "complete")
})
# Plot hclust_model dendrogram
plot(hclust_model, hang = -1)
```
Results are visualized as dendrograms: branching, tree-like structures similar to decision trees, albeit with less information displayed at each node. The most similar items sit lowest in the dendrogram and fuse first, leaving $n-1$ clusters; the next fusion leaves $n-2$ clusters, and so on up the tree until only one overarching cluster remains. Thus, clusters become more inclusive as we move up the hierarchy.
Dissimilarity must be defined not just between single observations, but between groups of observations as well; the rule used for this is called the linkage method.
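The fitted `hclust` object records these fusions explicitly, which makes the step-by-step structure easy to inspect:
```{r}
# Each row of $merge is one fusion: negative entries are single observations,
# positive entries refer to clusters formed at earlier steps
head(hclust_model$merge)
# $height is the dissimilarity (under the chosen linkage) at which each fusion
# occurs; with 303 observations there are 302 fusions
length(hclust_model$height)
tail(hclust_model$height)
```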
You can also cut the tree to assign each observation to one of a chosen number of clusters:
```{r}
# Cut into 5 clusters, for example; k must be a number between 1 and 303, since ml_num has 303 observations
cutree(hclust_model, 5)
```
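The vector returned by `cutree` can be tabulated to see the cluster sizes, or cross-tabulated against a known label. Here I assume, as in the `mclust` section below, that `h.target` in `ml_num` is the heart disease outcome:
```{r}
clusters5 = cutree(hclust_model, 5)
# Cluster sizes
table(clusters5)
# Cluster membership versus the outcome column
table(clusters5, ml_num$h.target)
```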
# The `ape` package
The [`ape` package](https://cran.r-project.org/web/packages/ape/index.html) provides some great functionality for constructing and plotting clusters:
```{r}
# Various plots
plot(as.phylo(hclust_model))
plot(as.phylo(hclust_model), type = "cladogram")
plot(as.phylo(hclust_model), type = "unrooted")
# Radial plot
colors = c("red", "orange", "blue", "green", "purple")
clus5 = cutree(hclust_model, 5)
plot(as.phylo(hclust_model), type = "fan", tip.color = colors[clus5], lwd = 2, cex = 1)
# These color settings apply to the other ape plots as well
```
# The `pvclust` package
The [pvclust](http://stat.sys.i.kyoto-u.ac.jp/prog/pvclust/) package offers a straightforward way to perform hierarchical agglomerative clustering of columns with two types of p-values at each split: approximately unbiased **(AU)** and bootstrap probability **(BP)**.
## Compare different dissimilarity measures
### Ward's method: minimum increase in within-cluster variance
```{r}
system.time({
pvclust_model_ward = pvclust(ml_num[,-6],
method.hclust = "ward.D",
method.dist = "euclidean",
nboot = 1000, parallel = T)
})
plot(pvclust_model_ward)
# pvrect draws rectangles around clusters strongly supported by the data (AU p-value >= alpha)
pvrect(pvclust_model_ward, alpha = 0.95)
```
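Beyond highlighting clusters on the plot, `pvclust` can list them directly: `pvpick` returns the groups of columns whose AU p-values meet the chosen threshold.
```{r}
# List the clusters of columns with AU p-values >= 0.95
pvpick(pvclust_model_ward, alpha = 0.95)
```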
### Complete linkage: largest intercluster difference
```{r}
pvclust_model_complete = pvclust(ml_num[,-6],
method.hclust = "complete",
method.dist = "euclidean",
nboot = 1000, parallel = T)
plot(pvclust_model_complete)
pvrect(pvclust_model_complete, alpha = 0.95)
```
### Single linkage: smallest intercluster difference
```{r}
pvclust_model_single = pvclust(ml_num[,-6],
method.hclust = "single",
method.dist = "euclidean",
nboot = 1000, parallel = T)
plot(pvclust_model_single)
pvrect(pvclust_model_single, alpha = 0.95)
```
### Average linkage: mean intercluster difference
```{r}
pvclust_model_average = pvclust(ml_num[,-6],
method.hclust = "average",
method.dist = "euclidean",
nboot = 1000, parallel = T)
plot(pvclust_model_average)
pvrect(pvclust_model_average, alpha = 0.95)
```
### View summaries
```{r}
(clust_sum = list("Ward" = pvclust_model_ward$edges,
"Complete" = pvclust_model_complete$edges,
"Single" = pvclust_model_single$edges,
"Average" = pvclust_model_average$edges))
```
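Each `edges` table has one row per split, with the AU and BP p-values in the `au` and `bp` columns. For example, the splits in the Ward solution with strong AU support can be pulled out directly:
```{r}
# Splits in the Ward solution with AU p-values of at least 0.95
subset(pvclust_model_ward$edges, au >= 0.95)
```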
### Plot Euclidean distance linkages
```{r}
par(mfrow = c(2,2))
plot(pvclust_model_ward, main = "Ward", xlab = "", sub = "")
pvrect(pvclust_model_ward, alpha = 0.95)
plot(pvclust_model_complete, main = "Complete", xlab = "", sub = "")
pvrect(pvclust_model_complete, alpha = 0.95)
plot(pvclust_model_single, main = "Single", xlab = "", sub = "")
pvrect(pvclust_model_single, alpha = 0.95)
plot(pvclust_model_average, main = "Average", xlab = "", sub = "")
pvrect(pvclust_model_average, alpha = 0.95)
par(mfrow = c(1,1))
```
### View standard error plots
```{r}
par(mfrow=c(2,2))
seplot(pvclust_model_ward, main = "Ward")
seplot(pvclust_model_complete, main = "Complete")
seplot(pvclust_model_single, main = "Single")
seplot(pvclust_model_average, main = "Average")
par(mfrow=c(1,1))
```
# Going further - the `mclust` package
The [`mclust`](https://cran.r-project.org/web/packages/mclust/index.html) package provides "Gaussian finite mixture models fitted via EM algorithm for model-based clustering, classification, and density estimation, including Bayesian regularization, dimension reduction for visualisation, and resampling-based inference."
```{r}
# Fit model
mclust_model = Mclust(ml_num)
# View various plots
plot(mclust_model, what = "BIC")
plot(mclust_model, what = "classification")
plot(mclust_model, what = "uncertainty")
plot(mclust_model, what = "density")
```
### Return best performing model
```{r}
summary(mclust_model)
```
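As a rough (and admittedly imperfect, since `hclust_model` was fit without column 6 while `Mclust` used all of `ml_num`) way to relate the model-based clusters back to the earlier hierarchical solution, the adjusted Rand index measures agreement between two partitions: values near 1 indicate near-identical clusterings, values near 0 agreement no better than chance.
```{r}
# Cut the hierarchical tree into the same number of clusters mclust selected
hclust_cut = cutree(hclust_model, k = mclust_model$G)
# Adjusted Rand index between the two partitions (mclust provides this helper)
adjustedRandIndex(mclust_model$classification, hclust_cut)
```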
### Cross-validated mclust
```{r}
# Sort rows by age in decreasing order
ml_num = ml_num[order(-ml_num$age),]
head(ml_num)
# Create a binary factor variable from age: "less than 0" and "greater than/equal to 0"
# (include.lowest keeps the observation at the minimum age from being coded NA)
ml_num$class = cut(ml_num$age,
                   breaks = c(min(ml_num$age),
                              0,
                              max(ml_num$age)),
                   include.lowest = TRUE,
                   labels = c("less than 0", "greater than/equal to 0"))
head(ml_num)  # check the new class column
# Define our predictors (X), dropping the outcome (h.target) and the derived class column; use h.target as the class labels
X = subset(ml_num, select = -c(h.target, class))
class = ml_num$h.target
# Fit the model (EEE covariance structure, basically the same as linear discriminant analysis)
mclust_model2 = MclustDA(X, class = class, modelType = "EDDA", modelNames = "EEE")
# Cross-validate!
set.seed(1)
cv_mclust = cvMclustDA(mclust_model2, nfold = 20)
# View cross-validation error and standard error of the cv error
cv_mclust[c("error", "se")]
```
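As an added sanity check, the cross-validated error can be compared with the no-information rate, i.e., the error from always predicting the majority class:
```{r}
# Error rate of a majority-class classifier, for comparison with the CV error
1 - max(prop.table(table(class)))
```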