---
title: "Normalizing scRNA-seq data"
author: CCDL for ALSF
date: 2021
output:
  html_notebook:
    toc: true
    toc_float: true
---
## Objectives

This notebook will demonstrate how to:

- Normalize alevin results to better compare expression among cells
- Identify genes with the greatest biological variation among cells

---
In this notebook, we'll normalize the scRNA-seq count data that we performed quality-control analyses on in the previous notebook.
For this tutorial, we will be using a pair of R packages built for single-cell analysis, `scater` and `scran`, to work with our data.
This tutorial is in part based on the [scran tutorial](https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html).
![](diagrams/tag-based_4.png)
## Set Up
Load the libraries we will be using, and set the random number generation seed value for reproducibility.
```{r setup}
# Set seed for reproducibility
set.seed(1234)
# Magrittr for the pipe %>%
library(magrittr)
# GGPlot2 for the plots
library(ggplot2)
# Packages for single cell processing
library(scater)
library(scran)
```
Now let's set up the files we will be using:
```{r filepaths}
# main data directory
data_dir <- file.path("data", "tabula-muris")
# Filtered count matrix file from previous notebook
filtered_sce_file <- file.path(data_dir, "filtered", "filtered_sce.rds")
# Metadata file location
metadata_file <- file.path(data_dir, "TM_droplet_metadata.csv")
# Output directory for normalized data
norm_dir <- file.path(data_dir, "normalized")
if (!dir.exists(norm_dir)) {
  dir.create(norm_dir, recursive = TRUE)
}
```
## Read in the filtered count matrix and metadata
```{r read_data}
bladder_sce <- readr::read_rds(filtered_sce_file)
sc_metadata <- readr::read_csv(metadata_file, guess_max = 10000)
```
### Adding more metadata to the SCE object
Because the Tabula Muris project is a well-studied data set, we actually have some cell type information for this data set that we can refer to.
Note that we would normally **NOT** have this information until later in the analysis pipeline!
Nonetheless, adding it here will be useful for visualizing the results of our normalization (and demonstrating how one might add metadata to the SingleCellExperiment object).
```{r sample_info}
# get the column (cell) metadata (this includes earlier QC stats!)
# and convert to a data frame
cell_info <- data.frame(colData(bladder_sce)) %>%
  # convert the row names of this data frame to a separate column
  tibble::rownames_to_column("barcode")

cell_metadata <- sc_metadata %>%
  # filter to just the sample we are working with
  dplyr::filter(channel == "10X_P4_3") %>%
  # extract the 16 nt cell barcodes from the `cell` column
  dplyr::mutate(barcode = stringr::str_sub(cell, start = -16)) %>%
  # choose only the columns we want to add
  dplyr::select(c("barcode", "cell_ontology_class", "free_annotation"))

# Join the tables together, using `left_join()` to preserve all rows in cell_info
cell_info <- cell_info %>%
  dplyr::left_join(cell_metadata)
```
Check that the barcodes in the sample info are still the same as the column names of our data.
```{r check_sampleinfo, live = TRUE}
all.equal(cell_info$barcode, colnames(bladder_sce))
```
Now we can add that data back to the `SingleCellExperiment` object.
To keep with the format of that object, we have to convert our table to a `DataFrame` object in order for this to work.
Just to keep things confusing, a `DataFrame` is not the same as the `data.frame` we have been using throughout.
We also need to be sure to include the `row.names` argument to keep those properly attached.
Note that this will replace all of the previous column (cell) metadata, which is part of the reason that we pulled out all previous column data content first.
```{r replace_coldata, live = TRUE}
# add new metadata data back to `bladder_sce`
colData(bladder_sce) <- DataFrame(cell_info, row.names = cell_info$barcode)
```
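As an optional sanity check, we can peek at the updated cell metadata; this is just a sketch, and the last two column names below come from the join we performed above:

```{r peek_coldata}
# Look at the first few rows of the updated cell metadata to confirm
# that the cell type annotation columns were added
head(colData(bladder_sce)[, c("barcode", "cell_ontology_class", "free_annotation")])
```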
## Normalization of count data
In whatever data we are working with, we are always looking to maximize biological variance and minimize technical variance.
A primary source of technical variation we are concerned with is the variation in library sizes among our samples.
While different cells may have different total transcript counts, it seems more likely that the primary source of variation that we see is due to library construction, amplification, and sequencing.
This is where normalization methods usually come into the workflow.
The distribution of the counts that we saw in the previous notebook, and in particular the fact that the count data is noisy with many zero counts, makes normalization particularly tricky.
To handle this noise, we normalize cells in groups with other cells like them, a method introduced in [Lun *et al.* (2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7).
Briefly, we first cluster the cells to find groups of similar cells, then compute normalization factors based on the sums of expression in those groups.
The group normalization is then applied back to the individual cells within the group to create a normalized count matrix.
In this case, we will also log-transform the normalized counts to get a less skewed distribution of expression measures.
Note that because of the zero counts, the `logNormCounts()` function will add a pseudocount of 1 to each value before computing the log.
```{r sce_normalize}
# Step 1) Group cells with other like cells by clustering.
qclust <- scran::quickCluster(bladder_sce)
# Step 2) Compute sum factors for each cell cluster grouping.
bladder_sce <- scran::computeSumFactors(bladder_sce, clusters = qclust)
# Step 3) Normalize using these pooled sum factors and log transform.
bladder_sce <- scater::logNormCounts(bladder_sce)
```
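If you are curious about the scaling this produced, `computeSumFactors()` stored a per-cell size factor in the object; here is an optional peek (a sketch for exploration, assuming normalization ran without warnings):

```{r size_factors}
# Summarize the per-cell size factors computed by computeSumFactors();
# logNormCounts() divided each cell's counts by its size factor before the log
summary(sizeFactors(bladder_sce))
```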
### Compare normalized data to count data
One way to determine whether our normalization yields biologically relevant results is to plot it and see if similarly labeled samples and cells end up together.
Because plotting expression for thousands of genes together isn't practical, we will reduce the dimensions of our data using Principal Components Analysis (PCA).
We will also make the same plot with our *unnormalized* data, to visualize the effect of normalization on our sample.
Before plotting the unnormalized data, we will log transform the raw counts to make their scaling more comparable to the normalized data.
To do this we will use the `log1p()` function, which is specifically designed for the case where we want to add 1 to all of our values before taking the log, as we do here.
(We could do something like `log(counts + 1)`, but this is both more efficient and more accurate.)
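To see the precision difference for yourself, here is a tiny illustration (the value of `x` is arbitrary, chosen only to make the rounding visible):

```{r log1p_demo}
# For x very close to zero, 1 + x loses digits to floating point rounding
# before log() ever sees it; log1p() avoids that intermediate step
x <- 1e-10
log(1 + x) # slightly off due to rounding in 1 + x
log1p(x)   # accurate: very close to 1e-10
```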
```{r pca}
# Use PCA for dimension reduction of cells' scran normalized data
norm_pca <- scater::calculatePCA(bladder_sce)
# PCA on the raw counts, log transformed
log_pca <- counts(bladder_sce) %>% # get the raw counts
  log1p() %>% # log transform to make these more comparable to the normalized values
  scater::calculatePCA() # calculate PCA scores
```
Note that we are using `scater::calculatePCA()` two different ways here: once on the full `bladder_sce` object, and once on just the `counts` matrix.
When we use `calculatePCA()` on the object, it automatically uses the log normalized matrix from inside the object.
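In other words, the first call above should be essentially equivalent to extracting the log normalized matrix ourselves and passing it in. Here is a quick sketch to convince ourselves; since principal components are only defined up to a sign flip, we check correlation rather than exact equality:

```{r pca_equivalence}
# Passing the logcounts matrix directly should reproduce the scores from
# calculatePCA(bladder_sce); correlate the first PC from each as a check
norm_pca_explicit <- scater::calculatePCA(logcounts(bladder_sce))
cor(norm_pca[, 1], norm_pca_explicit[, 1]) # expect approximately 1 or -1
```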
Next we will arrange the PCA scores for plotting, adding a column with the cell-type data so we can color each point of the plot.
```{r pca_df}
# Set up the PCA scores for plotting
norm_pca_scores <- data.frame(norm_pca,
                              geo_accession = rownames(norm_pca),
                              cell_type = bladder_sce$cell_ontology_class)
log_pca_scores <- data.frame(log_pca,
                             geo_accession = rownames(log_pca),
                             cell_type = bladder_sce$cell_ontology_class)
```
Now we will plot the unnormalized PCA scores with their cell labels:
```{r pca_plot}
# Now plot counts pca
ggplot(log_pca_scores, aes(x = PC1, y = PC2, color = cell_type)) +
  geom_point() +
  labs(title = "Log counts (unnormalized) PCA scores",
       color = "Cell Type") +
  scale_color_brewer(palette = "Dark2", na.value = "grey70") + # add a visually distinct color palette
  theme_bw()
```
We've plotted the unnormalized data for you.
Knowing that we want the same graph, but different data, use the above template to plot the normalized data.
Feel free to customize the plot with a different theme or color scheme!
Let's plot the `norm_pca_scores` data:
```{r norm_pca_plot, live = TRUE}
ggplot(norm_pca_scores, aes(x = PC1, y = PC2, color = cell_type)) +
  geom_point() +
  labs(title = "Normalized log counts PCA scores",
       color = "Cell Type") +
  scale_color_brewer(palette = "Dark2", na.value = "grey70") + # add a visually distinct color palette
  theme_bw()
```
Do you see an effect from the normalization in the comparison between these plots?
## Save the normalized data to tsv file
In case we want to return to this data later, let's save the normalized data to a tsv file.
In order to do this we need to extract our normalized counts from `bladder_sce`.
Refer back to the `SingleCellExperiment` figure above to determine why we are using this `logcounts()` function.
Recall that `readr::write_tsv` requires a data frame so we need to convert the `logcounts` matrix to a data frame.
We will actually have to do this in two steps: first converting the sparse matrix to a standard R matrix, then converting that to a data frame.
```{r save_tsv}
# Save this gene matrix to a tsv file
logcounts(bladder_sce) %>%
  as.matrix() %>%
  as.data.frame() %>%
  readr::write_tsv(file.path(norm_dir, "scran_norm_gene_matrix.tsv"))
```
We may want to return to our normalized `bladder_sce` object in the future, so we will also save it as an RDS file, which we can re-load into our R environment as a `SingleCellExperiment` object.
```{r save_rds}
# Save the data as an RDS
readr::write_rds(bladder_sce, file.path(norm_dir, "normalized_bladder_sce.rds"))
```
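In a later session, re-loading the object is a one-liner; a usage sketch, assuming the same file path as above:

```{r reload_sketch, eval = FALSE}
# Restore the normalized SingleCellExperiment object in a future session
bladder_sce <- readr::read_rds(file.path(norm_dir, "normalized_bladder_sce.rds"))
```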
### Print session info
```{r sessioninfo}
sessionInfo()
```