Lab5-Clustering.Rmd

---
title: "Lab 5: Clustering"
subtitle: "High Dimensional Data Analysis practicals"
author: "Milan Malfait"
date: "24 Feb 2022 <br/> (Last updated: 2022-02-22)"
---

```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.show = "hold"
)

options(width = 80)
```

### [Change log](https://github.com/statOmics/HDDA/commits/master/Lab6-Clustering.Rmd) {-}

***

```{r libraries, message=FALSE, warning=FALSE}
## Install necessary packages with:

# install.packages(c("mclust", "gclus", "GGally", "tidyverse"))

# if (!requireNamespace("remotes", quietly = TRUE)) {
#     install.packages("remotes")
# }
# remotes::install_github("vqv/ggbiplot")

library(mclust)
library(gclus)  # contains the 'wine' data
library(ggbiplot)
library(GGally)
library(tidyverse)

theme_set(theme_minimal())
```


# The wine data

In this lab session, we will explore the [`wine`][wine] data, following the
example analysis from [Scrucca *et al.* (2016)][scrucca2016].

This dataset provides 13 measurements obtained from a chemical analysis of 178
wines grown in the same region in Italy but derived from three different
cultivars (Barolo, Grignolino, Barbera). The original cultivar labels are
provided in the dataset.

We will apply different clustering algorithms and validate them by comparing how
well the clusters capture the original classes.

```{r}
data("wine", package = "gclus")
class <- factor(wine$Class, levels = 1:3, labels = c("Barolo", "Grignolino", "Barbera"))
table(class)

X <- as.matrix(wine[, -1])
summary(X)
```


# Hierarchical clustering

### Tasks {-}

#### 1. Perform hierarhical clustering of the wine data, using a Euclidean distance matrix and the complete-linkage algorithm (see `?hclust`). Plot the clustering *dendrogram*. {-}

<details><summary>Solution</summary>

```{r}
## Calculate distance matrix and perform hierarchical clustering
wine_dist <- dist(X, method = "euclidean")
hc <- hclust(wine_dist, method = "complete")

plot(hc, labels = FALSE)
```

</details>

#### 2. Select an appropriate number of clusters from the hierarchical clustering (see `?cutree`). Visualize the clusters on a PCA biplot and compare with the original labels. {-}

```{r}
hc_clusters <- cutree(hc, k = 3)
table(class, hc_clusters)
```

<details><summary>Solution</summary>

```{r, fig.asp=1}
wine_pca <- prcomp(X, scale. = TRUE)

ggbiplot(wine_pca, groups = class) +
  scale_color_brewer(palette = "Set2") +
  labs(color = "Original labels") +
  theme(aspect.ratio = 0.8, legend.position = "top")

ggbiplot(wine_pca, groups = factor(hc_clusters)) +
  scale_color_brewer(palette = "Set2") +
  labs(color = "HC clusters") +
  theme(aspect.ratio = 0.8, legend.position = "top")
```

</details>

#### Bonus: can you improve the results by using different distance metrics or linkages? {-}


# Model-based clustering

### Tasks {-}

#### 1. Perform model-based clustering on the `wine` data (use [`mclust::Mclust()`][mclust]). Plot the BIC values and interpret the results. Compare the identified clusters with the original (true) labels. {-}

<details><summary>Solution</summary>

```{r}
mod <- Mclust(X)
summary(mod)
summary(mod$BIC)
```

```{r}
table(class, mod$classification)

## Annotate clusters
mc_clusters <- factor(mod$classification)
```

```{r}
plot(mod, what = "BIC", ylim = range(mod$BIC[, -(1:2)], na.rm = TRUE),
  legendArgs = list(x = "bottomleft")
)
plot(mod, what = "classification")
```

There is a clear indication of a three-component mixture with covariances having
different shapes and volumes but the same orientation (VVE). See
`?mclustModelNames` for a description of the different `mclust` models.

</details>

#### 2. Visualize the clusters found by `Mclust()` on the PCA biplot. Compare with the original labels. {-}

<details><summary>Solution</summary>

```{r, fig.asp=1}
ggbiplot(wine_pca, groups = class) +
  scale_color_brewer(palette = "Set2") +
  labs(color = "Original labels") +
  theme(aspect.ratio = 0.8, legend.position = "top")

## PCA plot annotated with clusters
ggbiplot(wine_pca, groups = mc_clusters) +
  labs(color = "mclust clusters") +
  scale_color_brewer(palette = "Set2") +
  theme(aspect.ratio = 0.8, legend.position = "top")
```

</details>

#### 3. Perform a dimensionality reduction of the wine data using the PCA. Select an appropriate number of PC's. Redo the clustering on this reduced dimension representation and make the same figures as before. How do the results differ? {-}

<details><summary>Solution</summary>

```{r}
## Calculate total variance by summing the PC variances (sdev's squared)
tot_var <- sum(wine_pca$sdev^2)

## Create data.frame of the proportion of variance explained by each PC
wine_prop_var <- data.frame(
  PC = 1:ncol(wine_pca$x),
  var = wine_pca$sdev^2
) %>%
  ## Using `mutate` to calculate prop. var and cum. prop. var
  mutate(
    prop_var = var / tot_var,
    cum_prop_var = cumsum(var / tot_var)
  )

wine_prop_var

## Plot the proportion of variance explained by each PC
p1 <- ggplot(wine_prop_var, aes(PC, prop_var)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 6.5, col = "firebrick") +
  scale_x_continuous(breaks = 1:ncol(wine_pca$x)) +
  labs(y = "Proportion of variance")

## Plot the cumulative proportion of variance explained by each PC
p2 <- ggplot(wine_prop_var, aes(PC, cum_prop_var)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 6.5, col = "firebrick") +
  scale_x_continuous(breaks = 1:ncol(wine_pca$x)) +
  labs(y = "Cumulative proportion of variance")

gridExtra::grid.arrange(p1, p2, ncol = 2)
```

Selecting first 6 PC's, keeping 85% of the variance.

```{r}
k <- 6
pca_X <- wine_pca$x[, 1:k]
head(pca_X)
```

```{r}
mod2 <- Mclust(pca_X)
summary(mod2)
summary(mod2$BIC)
```

```{r}
table(class, mod2$classification)

## Annotate clusters
mc_pca_clusters <- factor(mod2$classification)
```

```{r, fig.asp=0.8}
plot(mod2, what = "BIC", ylim = range(mod2$BIC[, -(1:2)], na.rm = TRUE),
  legendArgs = list(x = "bottomleft")
)
```

```{r, fig.asp=1}
df <- as.data.frame(pca_X)
df$clusters <- mc_pca_clusters

## Using ggscatmat() from GGally package to plot all pairwise PCs
ggscatmat(df, columns = 1:k, color = "clusters") +
  theme(legend.position = "bottom", aspect.ratio = 0.6) +
  scale_color_brewer(palette = "Set2", name = "mclust-PCA clusters")
```

*Note: you can ignore the upper-right panels of this figure. These give the correlations between each pair of variables (PC's here) for each group, but are not relevant here.*

</details>


```{r, child="_session-info.Rmd"}
```

[wine]: https://rdrr.io/cran/gclus/man/wine.html
[scrucca2016]: https://svn.r-project.org/Rjournal/html/archive/2016/RJ-2016-021/RJ-2016-021.pdf
[mclust]: https://rdrr.io/cran/mclust/man/Mclust.html