To assess whether samples from the same condition are grouped together, we are going to perform a clustering analysis. It has three main steps:
- Calculate the distance between samples based on their normalized gene counts.
- Perform a hierarchical cluster analysis using `hclust`.
- Convert the result to a dendrogram and plot the resulting sample tree.
There are many ways to define a distance between two points. Here, the Euclidean distance will be calculated between each pair of samples based on their normalized gene counts.
Simple example first!
# sample 1 and 2
sample_a = c(1,2,3)
sample_b = c(1,2,3)
# Source: https://stackoverflow.com/questions/5559384/euclidean-distance-of-two-vectors
euc.dist <- function(x1, x2) sqrt(sum((x1 - x2) ^ 2))
euc.dist(sample_a, sample_b)
{: .language-r}
The distance is 0 because the two samples have the same coordinates in their 3-dimensional space (x = 1, y = 2, z = 3).
sample_a = c(1,2,6)
sample_b = c(1,2,3)
euc.dist(sample_a, sample_b) # equal to 3
# if you do it manually
sqrt((1 - 1)^2 + (2 - 2)^2 + (6 - 3)^2)
{: .language-r}
This should give you the same result.
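As a sanity check, you can get the same result with R's built-in `dist()` function (which we will use on the full count matrix below); by default it computes Euclidean distances between the rows of a matrix.
# dist() treats each row of the matrix as one point
dist(rbind(sample_a, sample_b)) # also returns 3
{: .language-r}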
Now, imagine you had to calculate the distance between each pair of samples in an N-dimensional space (here N = 33,768 genes). Normally only Neo can see through the Matrix, but maybe you can too...
Let's use R instead.
We will work on the complete `counts_normalised` matrix, as we only have to compute the distances between 48 samples. If you were to do this for the ~33,000 genes instead, it would take much more time and you might have to stop your R session.
# calculate the sample Euclidean distances
distances_between_samples <- dist(
t(counts_normalised), # notice the t() for transpose. Otherwise you get the distance between genes
method = "euclidean")
as.matrix(distances_between_samples)[1:5,1:5]
{: .language-r}
ERR1406259 ERR1406271 ERR1406282 ERR1406294 ERR1406305
ERR1406259 0.0 207590.4 167892.5 543575.1 823124.6
ERR1406271 207590.4 0.0 192718.5 417620.6 677362.6
ERR1406282 167892.5 192718.5 0.0 413844.6 692401.0
ERR1406294 543575.1 417620.6 413844.6 0.0 333604.2
ERR1406305 823124.6 677362.6 692401.0 333604.2 0.0
{: .output}
How can you check that you have indeed calculated a distance between samples?
You can notice that the distance between a sample and itself (e.g. ERR1406259 and ERR1406259) is equal to 0. {: .solution} {: .challenge}
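Beyond the zero diagonal, a genuine distance matrix should also be symmetric: the distance from sample a to sample b equals the distance from b to a. Both properties are quick to verify; a small sketch:
# a distance matrix has a zero diagonal and is symmetric
dist_matrix <- as.matrix(distances_between_samples)
all(diag(dist_matrix) == 0) # TRUE: each sample is at distance 0 from itself
isSymmetric(dist_matrix)    # TRUE: d(a, b) equals d(b, a)
{: .language-r}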
This is the step where you will define your clusters. We will use the distance matrix computed before.
There are different methods to perform the clustering; a full treatment goes well beyond this course, but here is a short note on the main linkage methods (see the original source).
- Maximum or complete linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
- Minimum or single linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as the linkage criterion. It tends to produce long, “loose” clusters.
- Mean or average linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.
- Centroid linkage clustering: computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
- Ward’s minimum variance method: minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged. {: .callout}
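If you are curious how the linkage choice affects the tree, one quick option is to fit the same distance matrix with two different methods and plot the dendrograms side by side. A sketch using base R:
# compare two linkage methods on the same distance matrix
clustering_complete <- hclust(distances_between_samples, method = "complete")
clustering_ward <- hclust(distances_between_samples, method = "ward.D2")
par(mfrow = c(1, 2)) # two plots side by side
plot(clustering_complete, main = "Complete linkage")
plot(clustering_ward, main = "Ward's method")
par(mfrow = c(1, 1)) # reset the plotting layout
{: .language-r}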
Here, we are going to use Ward's clustering method.
clustering_of_samples <- hclust(distances_between_samples, method = "ward.D2")
{: .language-r}
Finally, let's plot your results.
plot(clustering_of_samples)
{: .language-r}
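If you want to turn the dendrogram into discrete groups of samples, `cutree()` cuts the tree at a chosen number of clusters. A minimal sketch, where k = 2 is an arbitrary choice for illustration:
# cut the tree into k groups; k = 2 is arbitrary, for illustration only
sample_clusters <- cutree(clustering_of_samples, k = 2)
table(sample_clusters) # how many samples fall into each group
{: .language-r}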
We can also build a prettier figure in which samples are annotated with their experimental conditions (the seed, infected and dpi factors defined below).
# build a sample-to-condition table used to annotate the heatmap
xp_design_for_heatmap <- read.delim("experimental_design_modified.txt",
header = TRUE,
sep = "\t",
colClasses = rep("factor",4))
row.names(xp_design_for_heatmap) <- xp_design_for_heatmap$sample
xp_design_for_heatmap$sample <- NULL
anno_info_colors = list(
seed = c(MgCl2 = "#d8b365",
Methylobacterium_extorquens_PA1 = "#f5f5f5",
Sphingomonas_melonis_Fr1 = "#5ab4ac"),
infected = c(mock = "gray",
Pseudomonas_syringae_DC3000 = "black"),
dpi = c("2" = "dodgerblue",
"7" = "dodgerblue4")
)
correlation_between_samples <- cor(counts_normalised)
my_custom_breaks <- quantile(
correlation_between_samples, # correlations
seq(0,1,0.1))
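# optional check: because the breaks are quantiles, each of the 10 color
# bins will contain roughly 10% of the correlation values
round(my_custom_breaks, 3)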
# load library and plot heatmap
library(pheatmap)
library(RColorBrewer)
pheatmap(mat = correlation_between_samples,
color = rev(brewer.pal(n = length(my_custom_breaks) - 1,"RdYlBu")),
breaks = my_custom_breaks,
annotation_row = xp_design_for_heatmap,
annotation_col = xp_design_for_heatmap,
annotation_colors = anno_info_colors
)
{: .language-r}
While the heatmap is full of colors and pleasant to look at (arguably), it does not make it easy to distinguish between conditions. A Principal Component Analysis will help us decide whether the experimental design is reflected in the count results.