
Lesson supplementary materials

Marc Galland edited this page Jun 28, 2021 · 1 revision

4. Sample clustering

To assess whether samples from the same condition are grouped together, we are going to perform a clustering analysis. It has three main steps:

  • Calculate the distance between samples based on their normalized gene counts.
  • Perform a hierarchical cluster analysis using hclust.
  • Convert the result to a dendrogram and plot the resulting sample tree.

4.1 Distance calculation

There are many ways to define a distance between two points. Here, the Euclidean distance will be calculated between each pair of samples based on their normalized gene counts.

Simple example first!

# two samples with identical counts for three genes
sample_a = c(1,2,3)
sample_b = c(1,2,3)

# Source: https://stackoverflow.com/questions/5559384/euclidean-distance-of-two-vectors
euc.dist <- function(x1, x2) sqrt(sum((x1 - x2) ^ 2))
euc.dist(sample_a, sample_b)

{: .language-r}

The distance is 0 because the two samples have the same coordinates in their 3-dimensional space (x = 1, y = 2, z = 3).

sample_a = c(1,2,6)
sample_b = c(1,2,3)
euc.dist(sample_a, sample_b) # equal to 3

# if you do it manually
sqrt((1 - 1)^2 + (2 - 2)^2 + (6 - 3)^2)

{: .language-r}

This should give you the same result.

Now, imagine you had to calculate the distance between each pair of samples in an N-dimensional space (here N = 33,768 genes). Normally only Neo can see through the Matrix, but maybe you do too...

Let's use R instead.

We will work on the complete counts_normalised matrix, as we only have to compute distances between 48 samples. If you were to compute distances between the ~33,000 genes instead, it would take much, much longer and you might have to stop your R session.

# calculate the sample Euclidean distances 
distances_between_samples <- dist(
	t(counts_normalised),               # notice the t() for transpose. Otherwise you get the distance between genes
	method = "euclidean")  
as.matrix(distances_between_samples)[1:5,1:5]

{: .language-r}

           ERR1406259 ERR1406271 ERR1406282 ERR1406294 ERR1406305
ERR1406259        0.0   207590.4   167892.5   543575.1   823124.6
ERR1406271   207590.4        0.0   192718.5   417620.6   677362.6
ERR1406282   167892.5   192718.5        0.0   413844.6   692401.0
ERR1406294   543575.1   417620.6   413844.6        0.0   333604.2
ERR1406305   823124.6   677362.6   692401.0   333604.2        0.0

{: .output}
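To convince yourself that `dist()` on the transposed matrix computes the same sample-to-sample distances as the `euc.dist()` function above, you can check both on a tiny toy matrix (`toy_counts` is made up here purely for illustration):

```r
# toy check: 3 "genes" (rows) x 2 "samples" (columns)
toy_counts <- matrix(c(1, 2, 6,
                       1, 2, 3),
                     nrow = 3,
                     dimnames = list(paste0("gene", 1:3),
                                     c("sample_a", "sample_b")))

euc.dist <- function(x1, x2) sqrt(sum((x1 - x2)^2))

# dist() computes distances between rows, hence the t() to put samples in rows
d <- as.matrix(dist(t(toy_counts), method = "euclidean"))
d["sample_a", "sample_b"]                                      # 3
euc.dist(toy_counts[, "sample_a"], toy_counts[, "sample_b"])   # also 3
```

{: .language-r}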

Question

How can you check that you have indeed calculated a distance between samples?

Solution

You can notice that the distance between a sample and itself (e.g. ERR1406259 and ERR1406259) is equal to 0. {: .solution} {: .challenge}

4.2 Hierarchical clustering

This is the step where you will define your clusters. We will use the distance matrix computed before.

There are different methods to perform clustering; a full treatment goes well beyond the scope of this course, but here is a note on the most common linkage methods (see the original source).

Methods

Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.

Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.

Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.

Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.

Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step the pair of clusters with minimum between-cluster distance are merged. {: .callout}
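All of these linkage criteria are available through the `method` argument of `hclust()`. A small self-contained sketch (using made-up points, not the lesson data) shows how to select each one and that the merge heights differ between methods even on identical input:

```r
# toy data: 10 points in 2 dimensions (made up for illustration)
set.seed(42)
toy_points <- matrix(rnorm(20), nrow = 10)
toy_dist   <- dist(toy_points)

# the `method` argument of hclust() selects the linkage criterion
hc_complete <- hclust(toy_dist, method = "complete")   # maximum linkage
hc_single   <- hclust(toy_dist, method = "single")     # minimum linkage
hc_average  <- hclust(toy_dist, method = "average")    # mean linkage
hc_ward     <- hclust(toy_dist, method = "ward.D2")    # Ward's minimum variance

# compare the height of the final merge across methods
sapply(list(complete = hc_complete, single = hc_single,
            average = hc_average, ward = hc_ward),
       function(h) max(h$height))
```

{: .language-r}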

Here, we are going to use Ward's clustering method.

clustering_of_samples <- hclust(distances_between_samples, method = "ward.D2")

{: .language-r}

Finally, let's plot your results.

plot(clustering_of_samples)

{: .language-r}
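The third step listed at the top was converting the result to a dendrogram; `as.dendrogram()` does exactly this and gives access to dendrogram-specific plotting options. A toy clustering is built below so the snippet runs on its own; with the lesson data you would convert `clustering_of_samples` instead:

```r
# build a small hclust object so this snippet is self-contained
toy_clustering <- hclust(dist(matrix(rnorm(20), nrow = 5)), method = "ward.D2")

# convert to a dendrogram and plot the sample tree sideways
sample_tree <- as.dendrogram(toy_clustering)
plot(sample_tree, horiz = TRUE)   # horizontal layout, one leaf per sample
```

{: .language-r}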

4.3 Heatmap of sample-to-sample correlations

We can also build a prettier figure: a heatmap of sample-to-sample correlations, annotated with the experimental conditions of each sample.

# read the experimental design to annotate the heatmap rows and columns
xp_design_for_heatmap <- read.delim("experimental_design_modified.txt",
                                    header = TRUE,
                                    sep = "\t",
                                    colClasses = rep("factor",4))
                                    
    

row.names(xp_design_for_heatmap) <- xp_design_for_heatmap$sample
xp_design_for_heatmap$sample <- NULL

anno_info_colors = list(
  seed = c(MgCl2 = "#d8b365", 
           Methylobacterium_extorquens_PA1 = "#f5f5f5", 
           Sphingomonas_melonis_Fr1 = "#5ab4ac"),
  infected = c(mock = "gray", 
               Pseudomonas_syringae_DC3000 = "black"),
  dpi = c("2" = "dodgerblue", 
          "7" = "dodgerblue4")
)


correlation_between_samples <- cor(counts_normalised)

my_custom_breaks <- quantile(
  correlation_between_samples, # correlations
  seq(0,1,0.1))


# load library and plot heatmap
library(pheatmap)
library(RColorBrewer)

pheatmap(mat = correlation_between_samples, 
         color = rev(brewer.pal(n = length(my_custom_breaks) - 1,"RdYlBu")), 
         breaks = my_custom_breaks,
         annotation_row = xp_design_for_heatmap,
         annotation_col = xp_design_for_heatmap,
         annotation_colors = anno_info_colors
         )

{: .language-r}
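A note on the quantile-based breaks used above: sample-to-sample correlations typically sit in a narrow range close to 1, so evenly spaced breaks over [0, 1] would colour almost every cell the same. Quantile breaks spread the colour scale over the range actually occupied by the data. A self-contained illustration (random values standing in for real correlations):

```r
set.seed(1)
# made-up "correlations" crammed into a narrow high range, as in real data
fake_correlations <- matrix(runif(25, min = 0.90, max = 1), nrow = 5)

# evenly spaced breaks: nearly everything lands in the last colour bin
even_breaks <- seq(0, 1, 0.1)
table(cut(fake_correlations, breaks = even_breaks))

# quantile-based breaks: each colour bin holds roughly the same number of cells
quantile_breaks <- quantile(fake_correlations, seq(0, 1, 0.1))
table(cut(fake_correlations, breaks = quantile_breaks, include.lowest = TRUE))
```

{: .language-r}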

While the heatmap is colorful and pleasant to look at (arguably), it is not easy to distinguish between conditions. A Principal Component Analysis will help us decide whether the experimental design is reflected in the count results.