-
-
Notifications
You must be signed in to change notification settings - Fork 28
/
06-overrepresentation_analysis.Rmd
430 lines (311 loc) · 20.3 KB
/
06-overrepresentation_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
---
title: "Pathway analysis: Over-representation analysis (ORA)"
output:
html_notebook:
toc: true
toc_float: true
author: CCDL for ALSF
date: 2020
---
## Objectives
This notebook will demonstrate how to:
- Perform gene identifier conversion with [`AnnotationDBI` annotation packages](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf)
- Prepare gene sets for over-representation analysis, including an appropriate background set
- Perform over-representation analysis with the `clusterProfiler` package
- Introduce resources for ORA of scRNA-seq data, such as Gene Ontology
---
In this notebook, we'll cover a type of pathway or gene set analysis called over-representation analysis (ORA).
The idea behind ORA is relatively straightforward: given a set of genes, do these genes overlap with a pathway more than we expect by chance?
The simplicity of only requiring an input gene set (sort of, more on that below) can be attractive.
ORA has some limitations, outlined nicely (and more extensively!) in [Khatri _et al._ (2012)]( https://doi.org/10.1371/journal.pcbi.1002375).
One of the main issues with ORA is that typically all genes are treated as equal -- the context of the magnitude of a change we may be measuring is removed and each gene is treated as independent, which can sometimes result in an incorrect estimate of significance.
We will use the [`clusterProfiler` package](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) ([Yu *et al.* 2012](https://doi.org/10.1089/omi.2011.0118.)) to perform ORA.
`clusterProfiler` has many built-in functions that will run a specific type of analysis using a specific source of pathways/gene sets automatically, but for our purposes we're going to keep things as general as possible.
See the [`clusterProfiler` book](https://yulab-smu.github.io/clusterProfiler-book/index.html) for more information about the package's full suite of functionality.
Because different bioinformatics tools often require different types of gene identifiers, we'll also cover how to convert between gene identifiers using [`AnnotationDbi`](https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html) Bioconductor packages in this notebook.
Check out the [_AnnotationDbi: Introduction To Bioconductor Annotation Packages_ (Carlson 2020.) vignette](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) for more information.
#### Other resources
* For another example using `clusterProfiler`, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
* [`WebGestaltR`](https://cran.r-project.org/web/packages/WebGestaltR/) is another R package that can be used for ORA.
* OSCA has [a section on using Gene Ontology sets to annotate clusters](https://bioconductor.org/books/3.14/OSCA.basic/cell-type-annotation.html#assigning-cluster-labels-from-markers) using a different Bioconductor package.
## Set up
### Libraries
```{r libraries}
# Pipes
library(magrittr)
# Package we'll use to
library(clusterProfiler)
# Package that contains MSigDB gene sets in tidy format
library(msigdbr)
# Homo sapiens annotation package we'll use for gene identifier conversion
library(org.Hs.eg.db)
```
### Directories and files
#### Directories
```{r create_ora_directory, live = TRUE}
# We'll use the output of the clustering notebook for ORA
hodgkins_analysis_dir <- file.path("analysis", "hodgkins")
# We'll create a directory to specifically hold the pathway results if it doesn't
# exist yet
results_dir <- file.path(hodgkins_analysis_dir, "pathway-analysis")
if (!dir.exists(results_dir)) {
dir.create(results_dir, recursive = TRUE)
}
```
#### Input files
We're going to perform ORA on the marker genes from cluster 1 we looked at in the previous notebook.
```{r cluster_input}
input_file <- file.path(hodgkins_analysis_dir,
"markers",
"cluster01_markers.tsv")
```
We're going to use a file that is a cleaned version of the [CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)) human markers resources.
You can see how we obtained and cleaned this file in [this notebook](https://github.com/AlexsLemonade/training-modules/blob/master/scRNA-seq/setup/cellmarker_genes.Rmd).
```{r cm_input}
cellmarker_file <- file.path("gene-sets",
"CellMarker_cleaned_human_markers.tsv")
```
#### Output files
In the next notebook, we'll use the cluster 1 results.
Saving the post-gene identifier conversion results in this notebook will prevent us from duplicating effort (and code)!
```{r gene_id_file}
cluster1_output_file <- file.path(hodgkins_analysis_dir,
"markers",
"cluster01_markers_with_gene_symbols.tsv")
```
We'll save the table of ORA results (e.g., p-values).
```{r output_files, live = TRUE}
go_results_file <- file.path(results_dir, "cluster01_go_ora_results.tsv")
cm_results_file <- file.path(results_dir,
"cluster01_CellMarker_ora_results.tsv")
```
## Read in marker genes
Let's read in our marker genes from cluster 1.
```{r read_in_markers, live = TRUE}
markers_df <- readr::read_tsv(input_file)
```
And let's take a peek!
```{r marker_head, live = TRUE}
head(markers_df)
```
### Gene identifier conversion
Our data frame of marker genes contains Ensembl gene identifiers.
We will need to convert from these identifiers into gene symbols or Entrez IDs for our use with cell marker data.
We're going to convert our identifiers to gene symbols because they are a bit more human readable, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!
The annotation package `org.Hs.eg.db` contains information for different identifiers.
`org.Hs.eg.db` is specific to _Homo sapiens_ -- this is what the `Hs` in the package name is referencing.
To perform gene identifier conversion in mouse (_Mus musculus_) we could use `org.Mm.eg.db`;
we would use `org.Dr.eg.db` for zebrafish (_Danio rerio_).
We can see what types of IDs are available to us in an annotation package with `keytypes()`.
```{r keytypes, live = TRUE}
keytypes(org.Hs.eg.db)
```
Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to gene symbols (`SYMBOL`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to Entrez IDs (`ENTREZID`).
The function we will use to map from Ensembl gene IDs to gene symbols is called `mapIds()`.
```{r map_to_symbol}
# This returns a named vector which we can convert to a data frame, where
# the keys (Ensembl IDs) are the names
symbols_vector <- mapIds(org.Hs.eg.db, # Specify the annotation package
# The vector of gene identifiers we want to
# map
keys = markers_df$gene,
# What type of gene identifiers we're starting
# with
keytype = "ENSEMBL",
# The type of gene identifier we want returned
column = "SYMBOL",
# In the case of 1:many mappings, return the
# first one. This is default behavior!
multiVals = "first")
# We would like a data frame we can join to the DGE results
symbols_df <- data.frame(
ensembl_id = names(symbols_vector),
gene_symbol = symbols_vector
) %>%
# Drop genes that don't have a gene symbol!
tidyr::drop_na()
```
This message is letting us know that sometimes Ensembl gene identifiers will map to multiple gene symbols.
In this case, it's also possible that a gene symbol will map to multiple Ensembl IDs.
Now we are ready to add the gene symbols to our data frame with the marker genes.
We can use a _join_ function from the `dplyr` package to do this, which will use the Ensembl gene IDs in both data frames to determine how to join the rows.
```{r add_symbols, live = TRUE}
markers_df <- symbols_df %>%
# An *inner* join will only return rows that are in both data frames
dplyr::inner_join(markers_df,
# The name of the column that contains the Ensembl gene IDs
# in the left data frame and right data frame, effectively
# dropping genes that don't have gene symbols
by = c("ensembl_id" = "gene"))
```
We're going to use this table again, with the gene symbols, in the next notebook so let's write it to a file.
```{r save_gene_symbols, live = TRUE}
readr::write_tsv(markers_df, file = cluster1_output_file)
```
## Over-representation Analysis (ORA)
To test for over-representation, we can calculate a p-value with a hypergeometric test ([ref](https://yulab-smu.github.io/clusterProfiler-book/chapter2.html#over-representation-analysis)).
\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} }\)
Where `N` is the number of genes in the background distribution, `M` is the number of genes in a pathway or gene set, `n` is the number of genes we are interested in (our marker genes), and `k` is the number of genes that overlap between the pathway or gene set and our marker genes.
Borrowing an example from [_clusterProfiler: universal enrichment tool for functional and comparative study_ (Yu )](http://yulab-smu.top/clusterProfiler-book/chapter2.html#over-representation-analysis):
> **Example**: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differential expressed genes, 28 are annotated to a gene set.
We'll call genes that are differentially expressed `gene_in_interest` and genes that are in the gene set `in_gene_set`.
```{r gene_table}
gene_table <- data.frame(
gene_not_interest = c(2613, 15310),
gene_in_interest = c(28, 29)
)
rownames(gene_table) <- c("in_gene_set", "not_in_gene_set")
gene_table
```
We can assess if the 28 overlapping genes mean that the differentially expressed genes are over-represented in the gene set with the hypergeometric distribution.
This corresponds to a one-sided Fisher's exact test.
```{r fisher_test}
fisher.test(gene_table, alternative = "greater")
```
When we test **multiple pathways or gene sets**, the p-values then need to be **adjusted** for multiple hypothesis testing.
### Marker gene preparation
#### Top 100 cluster 1 genes
We're interested in what pathways are over-represented in genes that mark, or are more highly expressed in, cluster 1.
We'll select the top 100 genes for this comparison.
```{r cluster01_genes}
cluster01_markers <- markers_df %>%
# Positive fold changes
dplyr::filter(summary.logFC > 0) %>%
# Take the "top 100" when ordering by FDR (smallest FDR values)
dplyr::slice_min(order_by = FDR, n = 100) %>%
# Extract a vector of the gene symbols
dplyr::pull(gene_symbol)
```
You may think that selecting the top 100 genes is pretty arbitrary and you are not wrong!
This also may seem strict, but let's think about our goal for this particular analysis: we hope that using ORA with resources that encode some knowledge about cell types will help us annotate cluster 1.
We're using the very top ranked genes in service of this goal.
An alternative approach would be to have filtered to genes that are below a specific FDR threshold, but that would also be an arbitrary threshold that we chose.
This is a weakness of over-representation analysis; we generally need to pick a threshold for the genes that we include in our tests.
As we mentioned in the clustering notebook, our statistical calculations are circular here.
We identified clusters based on gene expression and then identified genes that are differentially expressed between clusters.
We can use the FDR values for ranking, like we are here, but we shouldn't use them to define "significance."
You can read a little more about this in the [_Invalidity of p-values_ section of the OSCA book](https://bioconductor.org/books/3.14/OSCA.advanced/marker-detection-redux.html#p-value-invalidity).
#### Background set
As we saw above, calculating the p-value for ORA relies on the number of genes in the background distribution.
Sometimes folks consider genes from the entire genome to comprise the background, but in the example borrowed from the `clusterProfiler` authors, they state:
> 17,980 genes detected in a Microarray study
Where the key phrase is **genes detected**.
If we were unable to include a gene in our marker gene comparisons because, for example, we could not reliably measure it, we shouldn't include in our background set!
Our markers data frame has all of the genes that we did not filter out and have gene symbols corresponding to the Ensembl gene IDs, so we can use the `gene_symbol` column as our background set.
```{r get_background_set, live = TRUE}
background_set <- markers_df$gene_symbol
```
### Gene Ontology ORA
#### A note on GO
The Gene Ontology (GO) ([Ashburner *et al.* 2000](https://dx.doi.org/10.1038/75556); [The Gene Ontology Consortium. 2021](https://doi.org/10.1093/nar/gkaa1113))is [an ontology](https://en.wikipedia.org/wiki/Ontology_(information_science)) that documents or describes how genes function in the biological domain and is comprised of 3 parts ([GO ontology documentation](http://geneontology.org/docs/ontology-documentation/)):
* **Molecular function (MF):** activities that occur on the molecular level, often can be performed by individual genes
* **Biological process (BP):** programs carried out by multiple molecular entities; similar to what is sometimes thought of as a pathway, but with no notion of _dynamics_
* **Cellular component (CC):** cellular structures and compartments
GO is "loosely hierarchical" ([GO ontology documentation](http://geneontology.org/docs/ontology-documentation/)).
There are parent and child terms, where the "higher up" a term is, the more general it is.
This is somewhat abstract, so let's take a look at a GO term that's likely to be relevant to our Hodgkin's lymphoma data to get an idea of what this is like in practice: [`leukocyte activation`](http://www.informatics.jax.org/vocab/gene_ontology/GO:0045321).
#### Run `enrichGO()`
Now that we have our background set and our genes of interest, we're ready to run ORA using the `enrichGO()` function.
```{r go_ora}
go_ora_results <- enrichGO(gene = cluster01_markers, # Top 100 genes
universe = background_set, # Background genes
keyType = "SYMBOL", # Our genes are gene symbols
OrgDb = org.Hs.eg.db, # Supports conversion, if needed
ont = "BP", # Biological process
pAdjustMethod = "BH", # FDR
pvalueCutoff = 0.00001) # Very stringent for viz
```
What is returned by `enrichGO()`?
```{r view_go_ora, eval = FALSE}
View(go_ora_results)
```
The information we're most likely interested in is in the `result` slot.
Let's convert this into a data frame that we can write to file.
```{r go_df}
go_result_df <- data.frame(go_ora_results@result)
```
Let's take a look at the GO sets with an adjusted p-value less than 0.00001.
```{r filter_padj, live = TRUE}
go_result_df %>%
dplyr::filter(p.adjust < 0.00001)
```
Some of these columns have names that are somewhat opaque and are used for visualization, so let's briefly talk about them here!
| Column name | Numerator | Denominator |
|-------------|-----------|-------------|
| `GeneRatio` | Number of genes in the top 100 marker genes and the specific gene set (also in `Count` column) | Number of genes in the top 100 marker genes and _any_ gene set |
| `BgRatio` | Number of genes in the background set and the specific gene set | Number of genes in the background set and _any_ gene set |
#### Visualizing results
We can use a dot plot to visualize our significant enrichment results.
```{r dotplot, live = TRUE}
enrichplot::dotplot(go_ora_results)
```
**What cell type do you think the cells from cluster 1 are?**
We can use an [UpSet plot](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720993/) to visualize the **overlap** between the gene sets that were returned as significant.
```{r upsetplot, live = TRUE}
enrichplot::upsetplot(go_ora_results)
```
It's important to keep in mind that gene sets or pathways aren't independent, either!
This is particularly true of GO terms, which are hierarchical in nature.
### CellMarker ORA
Next, we'll use another resource, CellMarker, as additional support for our conclusions based on GO.
#### About CellMarker
[CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) is a manually curated database of cell marker data for human and mouse ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)).
Studies/evidence of the following nature were included in the manual curation process ([Zhang *et al.* 2018](https://doi.org/10.1093/nar/gky900)):
* scRNA-seq
* Flow cytometry
* Immunostaining
And there are both gene and protein markers.
Let's read in the CellMarker data we've already cleaned:
```{r read_in_cm, live = TRUE}
cellmarker_df <- readr::read_tsv(cellmarker_file)
```
#### Run `enricher()`
`clusterProfiler::enricher()` is a more general way to perform ORA -- we specify the pathways or gene sets ourselves.
We can use the it to run ORA with the CellMarker data, which contains gene symbols.
That's why we needed to perform gene identifier conversion earlier!
```{r cm_ora, live = TRUE}
cellmarker_ora_results <- enricher(
gene = cluster01_markers, # Genes of interest
pvalueCutoff = 0.05, # More permissive cut off
pAdjustMethod = "BH", # FDR
universe = background_set, # Background set
# The pathway information should be a data frame with a term name or
# identifier and the gene identifiers
TERM2GENE = cellmarker_df
)
```
#### Visualizing results
Let's look at a bar plot of the significant results.
```{r cm_barplot}
barplot(cellmarker_ora_results) +
# This is a ggplot, so we can use the following to label the x-axis!
ggplot2::labs(x = "count")
```
One thing that the [web version of CellMarker](http://bio-bigdata.hrbmu.edu.cn/CellMarker/) allows us to do is to get a better idea of _marker prevalence_.
Some of the genes that show up in our top 100 may not be very specific.
It's also likely that kidney is not very relevant to our analysis of Hodgkin's lymphoma, but lymph node is highly relevant.
However, blood cells in kidney are not _unexpected_ and, if we view the [CellMarker tissue type browser](http://bio-bigdata.hrbmu.edu.cn/CellMarker/browse.jsp), we can see that there are _a lot_ of genes annotated to kidney.
Every resource will have some limitations "baked in."
We could do additional filtering of the gene sets to remove irrelevant tissues and to remove marker genes that are not particularly specific.
(Fewer gene sets would mean a lower multiple hypothesis testing burden.)
If some of the CellMarker marker genes come from immunostaining of terminally differentiated cells, we might not expect their transcripts to show up in scRNA-seq data, for example.
All things to keep in mind if you use this resource for your own work!
### Write results to file
Write the results to file for both GO and CellMarker gene sets.
```{r write_results, live = TRUE}
readr::write_tsv(go_result_df, file = go_results_file)
data.frame(cellmarker_ora_results@result) %>%
readr::write_tsv(file = cm_results_file)
```
## Parting thoughts
The goal of analyzing marker genes with ORA, like we did in this notebook, is to help us interpret our clustering results or annotate clusters.
We shouldn't use marker genes or any downstream analysis we perform with them to _justify_ cluster assignments; we generally want our cluster assignments to be data-driven.
Just like it's important to be skeptical about p-values returned from the marker gene analysis itself, we should not get too attached to the idea of statistical significance here.
If we were to tweak the cutoffs (e.g., take the top 2500 genes), we might expect to get completely different significant results.
This is a limitation of ORA: we have to make some choices that are somewhat arbitrary.
It's a good idea to think about the goal of your analysis _upfront_ and pick cutoffs based on your goals.
That can help you avoid the temptation of trying a bunch of cutoffs until you get results that "look good."
## Session Info
```{r session_info}
sessionInfo()
```