# Hierarchical Agglomerative Clustering

# Load data

Review the numeric heart dataset from 08-PCA.Rmd:

```{r}
load("data/unsupervised.RData")

str(ml_num)
```

# Overview

Hierarchical agglomerative clustering is a "bottom-up" method of clustering: each observation begins as its own cluster and is merged with similar items as we move up the hierarchy. That is, every leaf starts as its own cluster, and larger clusters form as the grouping moves up the trunk and the various branches. Distance and clustering method information are usually displayed at the bottom of the graph, while the vertical axis displays the height, which refers to the distance between two clusters. We can also "cut" the dendrogram to specify a number of clusters, which is similar to choosing _k_ in k-means clustering (and can be just as problematic).

```{r}
library(ape)
library(pvclust)
library(mclust)
```

Start with the built-in `hclust` function, which expects a distance matrix created with the `dist` function. Note that `hclust` clusters the rows (observations), as opposed to the columns, which the methods further below cluster.

```{r}
?hclust

# Create the distance matrix
heart_dist = dist(ml_num[, -6], method = "euclidean")

# Fit hclust_model
system.time({
  hclust_model = hclust(heart_dist, method = "complete")
})

# Plot the hclust_model dendrogram
plot(hclust_model, hang = -1)
```

Data are visualized in dendrograms: branching tree-like structures similar to decision trees, albeit with less information displayed at each node. The most similar items are found lower in the dendrogram; the first two items to fuse leave $n-1$ clusters, the next fusion produces $n-2$ clusters, and so on as we move up the tree until there is just one overarching cluster. Thus, clusters become more inclusive as we move up the hierarchy. Dissimilarity is applied not just to single observations but to groups as well (linkage).

You can also cut the tree to see how the cluster assignments vary:

```{r}
# If we want only 5 clusters, for example. The number must be between 1 and 303,
# since ml_num has 303 observations:
cutree(hclust_model, 5)
```

# The `ape` package

The [`ape` package](https://cran.r-project.org/web/packages/ape/index.html) provides some great functionality for constructing and plotting clusters:

```{r}
# Various plots
plot(as.phylo(hclust_model))
plot(as.phylo(hclust_model), type = "cladogram")
plot(as.phylo(hclust_model), type = "unrooted")

# Radial plot
colors = c("red", "orange", "blue", "green", "purple")
clus5 = cutree(hclust_model, 5)
plot(as.phylo(hclust_model), type = "fan",
     tip.color = colors[clus5],
     lwd = 2, cex = 1)

# These color settings apply to the other ape plots as well
```

# The `pvclust` package

The [pvclust](http://stat.sys.i.kyoto-u.ac.jp/prog/pvclust/) package offers a straightforward way to perform hierarchical agglomerative clustering of columns with two types of p-values at each split: approximately unbiased **(AU)** and bootstrap probability **(BP)**.
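In the linkage comparisons below, clusters with high AU p-values are highlighted graphically with `pvrect`. Its companion function `pvpick` returns those same clusters as a list, which can be handy for downstream use. Here is a minimal sketch (to be run after `pvclust_model_ward` has been fit in the next section; the `alpha = 0.95` threshold is simply chosen to match the plots below):

```{r, eval = FALSE}
# Extract the clusters whose approximately unbiased (AU) p-value is at least 0.95
# (run this after pvclust_model_ward has been fit below)
sig_clusters = pvpick(pvclust_model_ward, alpha = 0.95, pv = "au")

sig_clusters$clusters # column names grouped into each significant cluster
sig_clusters$edges    # indices of the corresponding dendrogram edges
```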
## Compare different linkage methods

### Ward's method: minimum within-cluster variance

```{r}
system.time({
  pvclust_model_ward = pvclust(ml_num[, -6],
                               method.hclust = "ward.D",
                               method.dist = "euclidean",
                               nboot = 1000, parallel = TRUE)
})

plot(pvclust_model_ward)

# pvrect draws rectangles around clusters with AU p-values above alpha
pvrect(pvclust_model_ward, alpha = 0.95)
```

### Complete linkage: largest intercluster dissimilarity

```{r}
pvclust_model_complete = pvclust(ml_num[, -6],
                                 method.hclust = "complete",
                                 method.dist = "euclidean",
                                 nboot = 1000, parallel = TRUE)

plot(pvclust_model_complete)
pvrect(pvclust_model_complete, alpha = 0.95)
```

### Single linkage: smallest intercluster dissimilarity

```{r}
pvclust_model_single = pvclust(ml_num[, -6],
                               method.hclust = "single",
                               method.dist = "euclidean",
                               nboot = 1000, parallel = TRUE)

plot(pvclust_model_single)
pvrect(pvclust_model_single, alpha = 0.95)
```

### Average linkage: mean intercluster dissimilarity

```{r}
pvclust_model_average = pvclust(ml_num[, -6],
                                method.hclust = "average",
                                method.dist = "euclidean",
                                nboot = 1000, parallel = TRUE)

plot(pvclust_model_average)
pvrect(pvclust_model_average, alpha = 0.95)
```

### View summaries

```{r}
(clust_sum = list("Ward" = pvclust_model_ward$edges,
                  "Complete" = pvclust_model_complete$edges,
                  "Single" = pvclust_model_single$edges,
                  "Average" = pvclust_model_average$edges))
```

### Plot Euclidean distance linkages

```{r}
par(mfrow = c(2, 2))

plot(pvclust_model_ward, main = "Ward", xlab = "", sub = "")
pvrect(pvclust_model_ward, alpha = 0.95)

plot(pvclust_model_complete, main = "Complete", xlab = "", sub = "")
pvrect(pvclust_model_complete, alpha = 0.95)

plot(pvclust_model_single, main = "Single", xlab = "", sub = "")
pvrect(pvclust_model_single, alpha = 0.95)

plot(pvclust_model_average, main = "Average", xlab = "", sub = "")
pvrect(pvclust_model_average, alpha = 0.95)

par(mfrow = c(1, 1))
```

### View standard error plots

```{r}
par(mfrow = c(2, 2))
seplot(pvclust_model_ward, main = "Ward")
seplot(pvclust_model_complete, main = "Complete")
seplot(pvclust_model_single, main = "Single")
seplot(pvclust_model_average, main = "Average")
par(mfrow = c(1, 1))
```

# Going further - the `mclust` package

The [`mclust`](https://cran.r-project.org/web/packages/mclust/index.html) package provides "Gaussian finite mixture models fitted via EM algorithm for model-based clustering, classification, and density estimation, including Bayesian regularization, dimension reduction for visualisation, and resampling-based inference."

```{r}
# Fit the model
mclust_model = Mclust(ml_num)

# View various plots
plot(mclust_model, what = "BIC")
plot(mclust_model, what = "classification")
plot(mclust_model, what = "uncertainty")
plot(mclust_model, what = "density")
```

### Return the best-performing model

```{r}
summary(mclust_model)
```

### Cross-validated mclust

```{r}
# Sort age in decreasing order
ml_num = ml_num[order(-ml_num$age), ]
head(ml_num)

# Create a binary factor from age: "less than 0" and "greater than/equal to 0"
ml_num$class = cut(ml_num$age,
                   breaks = c(min(ml_num$age), 0, max(ml_num$age)),
                   right = FALSE, include.lowest = TRUE,
                   labels = c("less than 0", "greater than/equal to 0"))
ml_num

# Define our predictors (X), dropping the outcome and the age-based factor,
# and use the heart-disease outcome (h.target) as the class label
X = subset(ml_num, select = -c(h.target, class))
class = ml_num$h.target

# Fit the model (EEE covariance structure, essentially the same as linear discriminant analysis)
mclust_model2 = MclustDA(X, class = class, modelType = "EDDA", modelNames = "EEE")
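
# Optional sanity check before cross-validating: summary() describes the
# Gaussian model fitted to each class
summary(mclust_model2)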

# Cross-validate!
set.seed(1)
cv_mclust = cvMclustDA(mclust_model2, nfold = 20)

# View the cross-validation error and its standard error
cv_mclust[c("error", "se")]
```
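
As an optional wrap-up, we can ask how much the model-based clustering from `Mclust` agrees with the hierarchical clustering from `hclust`. The sketch below assumes `hclust_model` and `mclust_model` are still in memory and cuts the dendrogram at the number of clusters that `Mclust` selected; both label vectors were computed before `ml_num` was re-sorted above, so they are aligned row by row.

```{r}
# Compare hierarchical and model-based cluster assignments
# (both were computed on ml_num in its original row order, before the sorting above)
k = mclust_model$G                      # number of clusters chosen by Mclust
hclust_labels = cutree(hclust_model, k) # cut the dendrogram at the same k

# Cross-tabulate the two assignments and compute the adjusted Rand index
# (1 = identical partitions; values near 0 = agreement no better than chance)
table(hclust_labels, mclust_model$classification)
adjustedRandIndex(hclust_labels, mclust_model$classification)
```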