scRNA-seq/exercise_02-scrna_clustering.Rmd

---
title: "Single Cell Exercise: Clustering cells"
output:
  html_notebook:
    toc: true
    toc_float: true
editor_options:
  chunk_output_type: inline
---

**CCDL 2021**

In this exercise notebook we will be using the output file from yesterday's single cell exercise, `mammary_gland_norm_sce.rds`.
In that notebook, we used the [*Tabula Muris* project](https://www.nature.com/articles/s41586-018-0590-4) data from the cells of a mouse mammary gland sample, `10X_P7_12`.

We filtered and normalized the data to produce the aforementioned output file.

In this exercise we will be following some of the steps for the downstream analysis of tag-based scRNA-seq data as we did in the `04-dimension_reduction_scRNA.Rmd` and `05-clustering_markers_scRNA.Rmd` instruction notebooks.


## Setup

First, load the libraries we need and set the file path to the output plots directory.

```{r setup}
# load the libraries
library(ggplot2)
library(scater)
library(scran)

# clustering tools
library(bluster)

# File path to plots directory
plots_dir <- "plots"

# create the plots directory
fs::dir_create(plots_dir)
```

Set the seed as there will be some randomization involved in some of the steps in this notebook.

```{r set_seed, solution = TRUE}
# Set the seed for reproducibility

```

## Import and set up the data and metadata

Set the file path to the RDS file containing a normalized `SingleCellExperiment` object that we saved in `exercise_01-scrna_quant.Rmd`.
Hint: it should be named `mammary_gland_norm_sce.rds`, and may be a few levels deep in a subdirectory of the `data` directory

```{r sce_file, solution = TRUE}
# store the path in a variable named `sce_file`

```

Import the `SingleCellExperiment` object using `readr`'s function for reading RDS files.
Remember that you can use `::` to call any functions from packages that you haven't imported.

```{r readrds, solution = TRUE}
# read the RDS file and save it in the variable `normalized_sce`

```

What data matrices are present in `normalized_sce`?
How many cells are included in the object?
Which dimension reductions are present, if any?

Use the empty code chunk below to explore `normalized_sce` and answer the above questions.

```{r explore-sce, solution = TRUE}
# use this chunk to explore the questions above

```

## Calculating and storing PCA results

If there are not already dimensionality reduction results stored with the `normalized_sce `object (or if we want to replace them), we will want to start with calculating PCA, and storing the results in our object for later use.

To do this, use the `runPCA()` function from `scater`, which performs the PCA calculations and returns a new object with the results stored in the `reducedDim` slot.
Specify the number of genes to use for the PCA calculation using the function's `ntop` argument.
Recall that this will pick the genes with the greatest variance.
The default is 500 genes, but you will probably want to use more!

```{r runPCA, solution = TRUE}
# calculate and store PCA results

```

Use the `reducedDimNames()` function to check which reduced dimensionality matrices are stored in the object.

```{r reduced_dim_names, solution = TRUE}
# check for which reduced dimensionality matrices are stored in your SCE object

```

### Plotting PCA results

Now that we have the PCA results stored in the `SingleCellExperiment` object, we can use the `scater::plotReducedDim()` function to plot them with some nice defaults easily.

```{r plotPCA, solution = TRUE}
# plot PCA results (PC1 vs. PC2)

```

As we noted before, the `plotReducedDim()` function uses `ggplot2` under the hood, so we can save our plots using `ggsave()`!

Use the chunk below to save the last plot you produced to the `plots_dir`.
By default, `ggsave()` will save the last plot that you made, so we will only need to supply a file path with the location and name of the file we want to write.

```{r save_plot, solution = TRUE}
# save the last plot using `ggsave()`

```

### UMAP

**UMAP** (Uniform Manifold Approximation and Projection) is a machine learning technique designed to provide more detail in highly dimensional data than a typical principal components analysis.
While PCA assumes that the variation we care about has a particular distribution (normal, broadly speaking), UMAP allows more complicated distributions that it learns from the data.
The underlying mathematics are beyond me, but if you are more ambitious than I am, you can look at the paper by [McInnes, Healy, & Melville (2018)](https://arxiv.org/abs/1802.03426).
The main advantage of this change in underlying assumptions is that UMAP can do a better job separating clusters, especially when some of those clusters may be more similar to each other than others.

Like `runPCA()`, `scater::runUMAP()` returns a matrix of results, with one row for each sample, and a column for each of the UMAP dimensions returned, and stores these results in a SingleCellExperiment object.

Run `scater::runUMAP()` in the chunk below, specifying that we should use the stored PCA results with the `dimred` argument.

```{r calculate_umap, solution = TRUE}
# run UMAP with the stored PCA results

```

Now using the same `plotReducedDim()` function as above, plot the UMAP results.
Also try adding some color with the `color_by` argument, using the number of genes detected in each cell to assign a hue.

```{r plot_umap, solution = TRUE}
# make a UMAP plot with `plotReducedDim()`

```

## Cell clustering


When we performed dimensionality reduction on our data above, we could see visually that the cells tended cluster together into groups.
Let's try identifying distinct groups of cells with similar patterns of gene expression that we can assign labels to.

As noted in the `05-clustering_markers_scRNA.Rmd` instruction notebook, there are a number of methods to identify clusters and assign cells to those in multidimensional data like the single cell data we have.

We can use the function `clusterRows()` from the `bluster` package to facilitate applying many of those algorithms.

To specify the clustering algorithm and the parameters specific to that algorithm to `clusterRows()`, we will provide it an argument with one of a set of `*Params()` functions, which are outlined below.
Note that there are other algorithms that we will not discuss, but can be explored in the [Flexible clustering for Bioconductor vignette](https://bioconductor.org/packages/devel/bioc/vignettes/bluster/inst/doc/clusterRows.html).

- `KmeansParam()` will apply the k-means clustering algorithm (see also `?kmeans`).
Some of the more important parameters are:
  - `centers`: the number of clusters to be assigned
  - `nstart`: Since the final k-means clusters may depend on the random clusters chosen at the start, we might want to repeat the procedure a number of times and choose the best output. This argument tells how many times to repeat the clustering (the default is 1).

- `NNGraphParam()` will apply community detection algorithms on a nearest-neighbor (NN) graph.
Some of the more important parameters are:
  - `k`: the number of neighbors for each cell to use when constructing the graph
  - `type`: how the neighbor graph is weighted.
  The options are `rank` (default), `number` and `jaccard`, with the last of those being the default in `Seurat`.
  For more details see `?makeSNNGraph`
  - `cluster.fun`: Which cluster detection algorithm to use.
  The options are many, but common choices are `walktrap` (default),  `louvain` (used by `Seurat`), and `leiden`.

Take some time below to explore the k-means and graph-based clustering algorithms and parameters!
You might also note how the results can change from one run to the next with no changes in parameters.
Do different clustering algorithms tend to agree with what you "see" in the dimensionality reduction plots?
Is that dependent on which set of dimensions you use?

### k-means clustering

First, try exploring the k-means clustering method.
Remember that the `k` here refers to the number of clusters that we will create, and must be chosen before we start.

You might wonder: How many clusters should we use?
That is a hard question and for now the answer is up to you!

However, for an intuitive visualization of the general k-means method, you might find [this StatQuest video](https://www.youtube.com/watch?v=4b5d3muPQmA) useful.
For more discussion of the method in a single-cell context, including some tips on choosing `k`, the [Orchestrating Single-Cell Analysis book chapter on k-means](https://bioconductor.org/books/3.14/OSCA.basic/clustering.html#vector-quantization-with-k-means) is a good reference.

We already computed *and stored* a matrix with reduced dimensions with the `runPCA()` function above.
We will extract that from the `SingleCellExperiment` object with the `reducedDim()` function, which returns a matrix with the cells as rows that we can directly supply to the `clusterRows()` function.

When implementing `clusterRows()` below, you can use `KmeansParam()` to specify that we want k-means clustering, with the `centers` parameter to set how many clusters we will assign (`k`).

```{r kmeans, solution = TRUE}
# set the number of clusters

# extract the principal components matrix

# perform the clustering using `clusterRows()`

```

The `clusterRows()` function returned a vector of cluster assignments as integers, but the numerical values have no inherent meaning.
For plotting, convert those integers into factors, so R is not tempted to treat them as a continuous variable.

Store the new factor values back into the cell information table of the original `normalized_sce` object for convenient storage and later use.
You can do this with the `$` notation and a column name of your choosing.
If you are going to try different numbers of clusters, you might find it useful to include that in the column name so you can keep track of the various results.
For example, if you used two clusters, you might use  `normalized_sce$kcluster_2` to store the results.

```{r store_kclusters, solution = TRUE}
# save clusters in the SCE object as a factor

```

Now use the `scater` function `plotReducedDim()` as we have in the `05-clustering_markers_scRNA.Rmd` instruction notebook, to plot the results and see how the clustering looks.

Start by using the UMAP coordinates for the plot and don't forget to color the points by our clustering results!
As a shortcut, you can also use the function `plotUMAP()`.

```{r plot_k, solution = TRUE}
# plot clustering results

```

Do those clusters line up with what you might have expected if you were doing this by eye?

If we repeat the above steps, do we get the same cluster assignments?
What happens if we change the number of clusters?

Remember that cluster numbers here are assigned arbitrarily, so even if we got exactly the same logical clusters across runs, we wouldn't expect the label numbers to be the same or stable.

Use the chunk below to explore the questions above!

```{r explore_kclusters_n, solution = TRUE}
# try re-running the above steps with a different number of clusters

# perform the clustering using `clusterRows()`

# save clusters in the SCE object as a factor

# plot new clustering results

```

What do the results look like if you plot with the `PCA` or `TSNE` coordinates?

```{r explore_k_plot, solution = TRUE}
# try plotting with different dimensionality reduction coordinates

```

### Graph-based clustering

The other common type of clustering method for single cell data is graph-based clustering.

To apply this clustering algorithm, use the same `bluster::clusterRows()` function as before, but specify `NNGraphParam()` as the second argument to tell it that we want to use a nearest-neighbor graph-based method.

Also remember to specify `k` and the cluster detection algorithm using the `cluster.fun`.

In the example below, you may use the default values or specify your own for the aforementioned arguments.

```{r nnclust, solution = TRUE}
# apply a graph-based clustering algorithm

# store cluster results in the `normalized_sce` object

```

Now you can plot the results of our graph-based `UMAP` clustering using `plotReducedDim()`.
This time, use the `text_by` argument to include the cluster ids directly on the plot.

```{r plot_nnclust, solution = TRUE}
# perform graph-based clustering with the UMAP results using `plotReducedDim()`

```

How do these results compare to the k-means clustering result?

How sensitive is this to the parameters we choose?
How do the numbers of clusters change with different parameters?

Use the chunk below to try a different set of parameters, and make more chunks to explore more.
Play around with the clustering and plotting dimensions as much as you like!

```{r explore_nclust, solution = TRUE}
# perform neighbor clustering with different parameters

# store cluster results in the `normalized_sce` object

# plot clustering results

```

## Session Info

```{r sessioninfo}
sessionInfo()
```