# Online Tools
Functional enrichment analysis can be performed using various web-based tools, each of which is designed to meet specific analytical needs. These tools often vary in the databases they use, their statistical approaches, and their capabilities to perform different types of analysis, such as Over-Representation Analysis (ORA) or Gene Set Enrichment Analysis (GSEA).
In this workshop, we will explore several popular tools for functional enrichment analysis, including gProfiler, STRING, Reactome, and MSigDB GSEA. Each tool offers unique features and insights, providing flexibility in selecting the right method for diverse datasets and research questions.
<!-- ## FEA in [g:Profiler](https://biit.cs.ut.ee/gprofiler/gost) -->
## FEA in gProfiler <a href="https://biit.cs.ut.ee/gprofiler/gost" target="_blank"><img src="images/gp2-logo-new.png" alt="g:Profiler Logo" style="height:35px; vertical-align:middle;"></a>
g:Profiler is known for its integration of numerous species and databases. It supports both ORA and GSEA (the latter through its ordered query mode), enabling users to assess Gene Ontology (GO) terms, biological pathways, regulatory motifs and protein databases.
### Steps to perform ORA in g:Profiler:
<span style="color:orange;">- Prepare Input List:</span> Ensure your input is formatted as one gene per line or in a suitable format for g:Profiler.
<span style="color:orange;">- Input Gene List:</span> Paste your prepared gene list directly into the input box on the g:Profiler web page or upload a file containing your list.
<span style="color:orange;">- Select Organism:</span> Choose the appropriate organism from the `Organism` dropdown menu (e.g., *Homo sapiens* for human data).
<span style="color:orange;">- Choose Statistical Domain Scope:</span> Under `Advanced options`, select your preferred statistical background from the `Statistical domain scope` menu. If you choose the "Custom" background, provide your custom background list by pasting it or uploading the relevant file.
<span style="color:orange;">- Set Significance Threshold:</span> Select the desired significance threshold method, such as *g:SCS*, *Bonferroni*, or *Benjamini-Hochberg*.
- Specify the threshold value (e.g., 0.05, 0.1, etc.).
<span style="color:orange;">- Select Functional Annotation Databases: </span> Navigate to the `Data sources` tab and choose one or more databases for analysis. Available options include:
- *Gene Ontology (GO)*: Biological Process, Molecular Function, and Cellular Component.
- *KEGG Pathways*
- *Reactome Pathways*
- *WikiPathways*
- *TRANSFAC*
- *miRTarBase*
- *Human Protein Atlas*
- *CORUM*
- *Human Phenotype Ontology (HP)*
<span style="color:orange;">- Run Query:</span> Run the analysis and review the enriched terms, pathways, and visual outputs. Download the results as needed for further exploration.
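The effect of the significance threshold method can be illustrated with a minimal sketch in base R. The p-values and term names below are made up for illustration, and only Bonferroni and Benjamini-Hochberg are shown because g:SCS is specific to g:Profiler and is not available through `p.adjust()`.

```r
# Toy raw p-values for five hypothetical terms (illustrative numbers only)
p_raw <- c(term_a = 0.001, term_b = 0.008, term_c = 0.02,
           term_d = 0.03, term_e = 0.2)

# Bonferroni: multiplies each p-value by the number of tests (capped at 1)
p_bonferroni <- p.adjust(p_raw, method = "bonferroni")

# Benjamini-Hochberg: controls the false discovery rate, less conservative
p_bh <- p.adjust(p_raw, method = "BH")

# Count how many terms pass the chosen threshold under each method
threshold <- 0.05
n_sig_bonferroni <- sum(p_bonferroni <= threshold)
n_sig_bh <- sum(p_bh <= threshold)

print(data.frame(p_raw, p_bonferroni, p_bh))
```

With these toy values fewer terms survive Bonferroni than Benjamini-Hochberg, which is why the choice of method should be fixed before looking at the results.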
#### Browse the gProfiler Results
- **Overview**:
The analysis provides a comprehensive list of enriched terms across the selected databases, highlighting significant GO terms. The results give a high-level summary of the pathways or terms most relevant to the input data.
- **Detailed Results**:
The detailed results section includes a tabulated format with enriched terms, adjusted p-values, and relevant statistics. Each entry provides information such as the enrichment score, associated genes, and functional annotations, allowing for an in-depth understanding of biological significance.
- **GO Context**:
The Gene Ontology (GO) context is divided into three main categories: Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). The analysis identifies which GO terms are significantly enriched, offering insights into the broader biological implications of the gene set. This helps in pinpointing processes such as cellular responses, metabolic pathways, and molecular interactions.
- **Query Info**:
This section includes specifics about the input data, including the total number of queried genes and any identifiers not recognised or mapped. It also details the statistical background used, the chosen organism, and other analysis settings, ensuring transparency and reproducibility of the results.
#### {-}
#### Different Backgrounds
#### **Challenge:** How do different backgrounds impact the output? {- .challenge}
Use 'All known genes' in one analysis and a 'Custom' background in another. Download the results by clicking on the CSV button. Browse the results in the spreadsheets and find the differences between the two.
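To see why the background matters, here is a minimal sketch of the over-representation p-value computed with the hypergeometric distribution in base R. The helper `ora_p()` and all of the counts are hypothetical; they are not taken from g:Profiler.

```r
# Upper-tail hypergeometric p-value for over-representation:
# probability of seeing at least `overlap` annotated genes in the query.
ora_p <- function(overlap, query_size, term_size, background_size) {
  phyper(overlap - 1, term_size, background_size - term_size,
         query_size, lower.tail = FALSE)
}

# Same query (100 genes, 10 annotated to the term), two hypothetical backgrounds
p_all_known <- ora_p(overlap = 10, query_size = 100,
                     term_size = 200, background_size = 20000)
p_custom <- ora_p(overlap = 10, query_size = 100,
                  term_size = 180, background_size = 12000)

print(c(all_known = p_all_known, custom = p_custom))
```

In this toy case the smaller custom background raises the expected overlap, so the same observation becomes less surprising and the p-value increases.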
#### **Questions** {- .rationale}
Which background would you use in your analysis?
How is multi-query support implemented in gProfiler?
How can one perform Under Representation Analysis in gProfiler?
### Steps to perform GSEA in g:Profiler:
<span style="color:orange;">- Prepare Your Pre-ranked List:</span> Steps to provide a ranked gene list are given [here](degust.html).
<span style="color:orange;">- Input Gene List:</span> Paste your prepared gene list directly into the input box on the g:Profiler web page or upload a file containing your list.
<span style="color:orange;">- Select Organism:</span> Same as above.
<span style="color:orange;">- Select Ordered query:</span> The "Ordered query" option in g:Profiler is designed to work with pre-ranked gene lists.
<span style="color:orange;">- Set Significance Threshold:</span> Same as above.
<span style="color:orange;">- Provide a Custom GMT:</span> This GMT file can be downloaded from [MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp).
<span style="color:orange;">- Run Query:</span> Same as above.
#### **Challenge:** GSEA with gProfiler {- .challenge}
Download the Hallmark gene sets ([h.all.v2024.1.Hs.symbols.gmt](https://www.gsea-msigdb.org/)) from MSigDB and use it as Custom GMT.
<!-- Full path to h.all.v2024.1.Hs.symbols.gmt -->
<!-- [h.all.v2024.1.Hs.symbols.gmt](https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/2024.1.Hs/h.all.v2024.1.Hs.symbols.gmt) -->
#### **Question** {- .rationale}
How can one use multi-GMT as custom background?
<!-- ## [STRING](https://string-db.org) -->
## FEA in STRING <a href="https://string-db.org" target="_blank"><img src="images/string-logo.png" alt="STRING Logo" style="height:35px; vertical-align:middle;"></a>
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a resource for exploring protein-protein interaction (PPI) networks. It combines experimental data, predictions, and curated information to build networks that highlight functional relationships, helping to reveal shared pathways or biological processes within gene or protein lists.
### Steps to Perform ORA in STRING:
<span style="color:orange;">- Select Multiple proteins tab.</span>
<span style="color:orange;">- Input Gene List:</span> Paste your prepared gene list directly into the input box on the STRING web page or upload a file containing your list.
<span style="color:orange;">- Select Organism:</span> Choose the appropriate organism from the `Organisms` dropdown menu (e.g., *Homo sapiens* for human data). STRING can auto-detect the organism if Ensembl IDs are provided.
<span style="color:orange;">- Modify Settings:</span> Under `Advanced Settings`, you can modify the `Required score` from low (0.15) to highest (0.9) confidence. Similarly, `FDR stringency` and `Network type` can be selected.
NOTE: When a long list of features is provided, STRING may change some of its settings so that:
- the nodes will have a simplified (not 3D) design
- previews of protein structures are not shown
- the network edges show interaction confidence only
### Browse the STRING ORA Results
STRING generates multiple tabs as output, shown here:
<!-- ![](images/string-results-tabs.png){ width=100% } -->
```{r, echo=FALSE, eval=TRUE, out.width="100%", fig.align = "center", fig.cap="Results tabs in STRING"}
knitr::include_graphics("images/string-results-tabs.png")
```
#### Viewers
Under the `Viewers` tab, various visualisation layouts are available, with the Network option being the most notable and widely used.
#### Legend
The `Legend` tab offers a guide to the colors of nodes and edges, along with annotations for each individual query in the input list.
<!-- ![Nodes and edges colour-coded](images/string-legend.png){ width=100% } -->
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Nodes and edges colour-coded"}
knitr::include_graphics("images/string-legend.png")
```
#### Settings
In the `Settings` tab of the STRING results, users can adjust existing settings and apply new filters to customise their data view and analysis. This tab allows you to switch between network types, strengths, data sources, interaction scores and more.
#### Analysis
One of the most essential tabs is the `Analysis` tab, which offers comprehensive functional enrichment analysis from a range of databases. These include Gene Ontology (GO) for biological processes, molecular functions, and cellular components; Pathway enrichment from sources such as KEGG, Reactome, and WikiPathways; and other significant data sources such as Human Phenotype annotations and UniProt for protein function and structure.
The columns of the STRING enrichment table are explained as follows:
<!-- ![Enrichment Table Columns](images/Enrichment-Table-Columns.png){ width=40% } -->
<span style="color:orange;">- Count In Network:</span>
The first number indicates how many proteins in your network are annotated with a particular term. The second number indicates how many proteins in total (in your network and in the background) have this term assigned. You can click on the numbers to see the network view of the gene sets behind them.
<span style="color:orange;">- Strength:</span>
Log10(observed / expected). This measure describes how large the enrichment effect is. It’s the ratio between i) the number of proteins in your network that are annotated with a term and ii) the number of proteins that we expect to be annotated with this term in a random network of the same size.
<span style="color:orange;">- Signal:</span>
The signal is defined as a weighted harmonic mean between the observed/expected ratio and -log(FDR). FDR tends to emphasise larger terms due to their potential for achieving lower p-values, while the observed/expected ratio highlights smaller terms, which have a high foreground to background ratio but cannot achieve low FDR values due to their size. The signal measure seeks to balance both metrics for a more intuitive ordering of enriched terms.
<span style="color:orange;">- False Discovery Rate:</span>
This measure describes how significant the enrichment is. Shown are p-values corrected for multiple testing within each category using the Benjamini–Hochberg procedure.
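As a worked example of the strength column, the sketch below computes Log10(observed / expected) for one hypothetical term. All counts are invented, and the expected value assumes that a random network of the same size is annotated at the background rate.

```r
network_size <- 100        # proteins in the submitted network
observed <- 10             # of those, annotated with the term
background_size <- 20000   # proteins in the statistical background
term_in_background <- 200  # background proteins annotated with the term

# Expected number of annotated proteins in a random network of the same size
expected <- network_size * term_in_background / background_size

# Strength as described above: log10 of the observed/expected ratio
strength <- log10(observed / expected)
print(strength)
```

Here one annotated protein is expected and ten are observed, so the strength is 1, i.e. a tenfold enrichment.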
STRING visualises terms within each category using a bubble plot, effectively showcasing the significance and size of enriched terms. Additionally, it renders groups of related terms based on a user-defined similarity level, allowing users to identify clusters of functionally related terms within the data. This helps in interpreting complex enrichment results and highlighting key biological processes or pathways that are closely associated.
<!-- ![Functional enrichment visualisation with STRING](images/string-enrichment_KEGG_sim0.7_graph_plus.png){ width=100% } -->
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Functional enrichment visualisation with STRING"}
knitr::include_graphics("images/string-enrichment_KEGG_sim0.7_graph_plus.png")
```
Towards the bottom of the `Analysis` page, one can change the statistical background, including supplying a custom one.
<!-- ![Statistical background](images/string-statistical-background.png){ width=100% } -->
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Statistical background"}
knitr::include_graphics("images/string-statistical-background.png")
```
Finally the enriched terms can be downloaded at the end of the `Analysis` page, either individually per category or all enriched terms together.
#### Exports
The network data can be exported from the `Exports` tab. Network data can also be sent directly to [Cytoscape](https://cytoscape.org/) ![](images/network-to-Cytoscape.png) for further network analysis. Cytoscape needs to be installed before exporting to it.
#### Clusters
The `Clusters` tab essentially provides three different types of clustering algorithms:
- k-means clustering: initialises *k* centroids randomly, assigns each data point to the nearest centroid, and recomputes each centroid as the mean of all points in its cluster, repeating until the centroids no longer change significantly.
- MCL clustering (Markov clustering): a graph-based algorithm that uses flow simulation to detect clusters in a network by modelling random walks.
- DBSCAN clustering: a density-based algorithm that groups closely packed points together while marking points in low-density regions as outliers or noise.
<!-- ![](images/string-clusters.png) -->
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Network clustering in STRING"}
knitr::include_graphics("images/string-clusters.png")
```
Clusters can be downloaded in `.tsv` format.
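The k-means idea described above can be tried on toy data with the base R `kmeans()` function. This is only an illustration of the algorithm on made-up 2-D coordinates; it does not reproduce how STRING derives distances from its network, and the protein names are placeholders.

```r
set.seed(1)  # k-means starts from random centroids, so fix the seed

# Toy 2-D coordinates for ten hypothetical proteins forming two groups
coords <- rbind(
  matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(10, mean = 5, sd = 0.3), ncol = 2)
)
rownames(coords) <- paste0("protein_", 1:10)

# Partition into k = 2 clusters; nstart repeats the random initialisation
fit <- kmeans(coords, centers = 2, nstart = 10)

# List the members of each cluster
print(split(rownames(coords), fit$cluster))
```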
#### **Question** {- .rationale}
What was the overlap in enrichment terms between gProfiler and STRING at FDR ≤ 0.05?
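One way to approach this question is to compare the term identifiers of the two downloaded tables. The sketch below uses two small hypothetical tables; the column names `term_id` and `fdr` and all FDR values are assumptions, so adapt them to the headers of your own exports.

```r
# Hypothetical result tables; column names and values are assumptions
gprofiler_res <- data.frame(
  term_id = c("GO:0006915", "GO:0008283", "GO:0007049"),
  fdr = c(0.001, 0.03, 0.20),
  stringsAsFactors = FALSE
)
string_res <- data.frame(
  term_id = c("GO:0006915", "GO:0007049", "GO:0016477"),
  fdr = c(0.004, 0.01, 0.02),
  stringsAsFactors = FALSE
)

# Keep the terms passing the cut-off in each tool
cutoff <- 0.05
sig_gp <- gprofiler_res$term_id[gprofiler_res$fdr <= cutoff]
sig_st <- string_res$term_id[string_res$fdr <= cutoff]

# Shared terms and a simple Jaccard index of the two significant sets
shared <- intersect(sig_gp, sig_st)
jaccard <- length(shared) / length(union(sig_gp, sig_st))
print(shared)
print(jaccard)
```

For a fair comparison, the same multiple testing correction (for example Benjamini-Hochberg) and a comparable background should be used in both tools.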
### Steps to Perform GSEA in STRING:
<span style="color:orange;">- Select Proteins with Values/Ranks.</span>
<span style="color:orange;">- Input Gene List:</span> Paste your gene list with a meaningful value for ranking (fold-change, log-pvalue, abundance, ...) directly into the input box on the STRING web page or upload a file containing your list of features and their corresponding values.
<span style="color:orange;">- Select Organism:</span> Same as above.
<span style="color:orange;">- Advanced Settings:</span> Set the FDR stringency and the initial sort order in advance, then hit Search.
### Browse the STRING GSEA Results
The output differs from ORA. For each gene set, the results include the enrichment score, its direction within the ranked list, the number of overlapping features with the gene set, and the associated FDR.
When a user selects a gene set from the enrichment table (an example table of WikiPathways gene sets is shown in the figure below), the associated genes are displayed within the ranked list. A table showing these genes along with their original ranking values is also provided.
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="An example table of WikiPathway gene sets"}
knitr::include_graphics("images/string-gsea-wikiPathways.png")
```
```{r, echo=FALSE, fig.align = "center", fig.cap="List of genes in the term (WP197) and their positions on the ranked list"}
knitr::include_graphics("images/string-gsea-ranking-n-table.png")
```
Additionally, the locations of the corresponding proteins are highlighted in the proteome network:
```{r, echo=FALSE, fig.align = "center", fig.cap="Proteome network"}
knitr::include_graphics("images/string-gsea-proteome-network.png")
```
A functional enrichment visualisation (similar to that of ORA) is provided below the enrichment tables.
Adjust the settings in the `Enrichment display settings` tab before downloading the enrichment tables. It is recommended to merge terms with a certain level of similarity to reduce redundancy, especially if there are many overlapping terms.
```{r, echo=FALSE, fig.align = "center", fig.cap="Enrichment display settings"}
knitr::include_graphics("images/string-enrichement-display-settings.png")
```
Here is an example output of [GSEA on STRING](https://version-12-0.string-db.org/cgi/globalenrichment?networkId=bKhJ4fXp6sna) from a previous run (the link will expire in the future).
<!-- ## FEA in [GenePattern](https://www.genepattern.org/#gsc.tab=0) -->
## FEA in GenePattern <a href="https://www.genepattern.org/#gsc.tab=0" target="_blank"><img src="images/GenePattern-logo.png" alt="GenePattern Logo" style="height:35px; vertical-align:middle;"></a>
GenePattern, an online platform developed by the Broad Institute, offers a suite of tools for analyzing and visualizing genomic data, making bioinformatics accessible to researchers through a user-friendly, no-programming interface. Among its supported tools is Gene Set Enrichment Analysis (GSEA), which implements [MSigDB GSEA](https://www.gsea-msigdb.org/gsea/index.jsp) analysis for identifying enriched gene sets in genomic data.
MSigDB (Molecular Signatures Database) is a collection of gene sets for Gene Set Enrichment Analysis, representing pathways and gene signatures linked to biological states or diseases. It helps identify enriched gene sets, aiding the analysis of gene expression changes and key pathways in experimental data.
### Steps to Locate GSEA Module in GenePattern:
- Click on the Run button and then the Public Server
<!-- ![](images/GenePattern-Run.png) -->
```{r, echo=FALSE, fig.align = "center", fig.cap="Navigate to Public Server"}
knitr::include_graphics("images/GenePattern-Run.png")
```
- Sign in to GenePattern or Enter as Guest
- Under `Modules` tab hit `Browse Modules`
- Find gsea in the Browse Modules by Category page and hit GSEA
<!-- ![](images/Browse_Modules_gsea.png){ width=100% } -->
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Browse GSEA module in GenePattern"}
knitr::include_graphics("images/Browse_Modules_gsea.png")
```
### Steps to Perform GSEA:
<!-- https://cloud.genepattern.org/gp/pages/index.jsf?jobid=613752&openVisualizers=true&openNewWindow=false -->
1. Basic Parameters
\- Create both `.gct` and `.cls` files following [this script in R](degust.html)
\- Load the `.gct` input file in the `expression dataset` tab and `.cls` file in the `phenotype labels` tab
\- Select a `.gmt` file (Gene Matrix Transposed) from the `gene sets database` tab
\- Set the number of permutations in the `number of permutations` tab
\- Set the type of permutation in the `permutation type` tab
\- Select an appropriate DNA Chip annotation file from `chip platform file` tab
\- Name the output file in `output file name` tab
2. Advanced Parameters
\- Scoring Scheme:
- K-S (classic Kolmogorov-Smirnov): the score increment is the same for all genes in *S*, regardless of their ranking or correlation strength.
- Weighted: the score increment for each gene in *S* is weighted by its correlation with the phenotype, typically the absolute value of the correlation or ranking metric.
\- Metric for ranking genes: The ranking metric of interest can be chosen from the drop-down menu. A detailed description of the metrics is given in the [GSEA-MSigDB Documentation](https://docs.gsea-msigdb.org/#GSEA/GSEA_User_Guide/#metrics-for-ranking-genes).
- Categorical Phenotypes: Signal-to-Noise Ratio, t-Test, Ratio of Classes, Log2 Ratio of Classes
- Continuous Phenotypes: Pearson Correlation, Spearman Correlation
\- The minimum and maximum sizes of gene sets can be set using the `min gene set size` and `max gene set size` tabs
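The difference between the two scoring schemes can be sketched in a few lines of base R. This is a simplified illustration of the running-sum statistic, not the GenePattern implementation; the function name `enrichment_score()`, the exponent argument `p` and the toy values are assumptions made for the example.

```r
# Running-sum enrichment score for a gene set S over a ranked gene list.
# p = 0 gives equal increments for every gene in S (K-S style);
# p = 1 weights each increment by the absolute ranking metric (weighted).
enrichment_score <- function(ranked_stats, in_set, p = 1) {
  hit_weight <- ifelse(in_set, abs(ranked_stats)^p, 0)
  p_hit <- cumsum(hit_weight) / sum(hit_weight)  # increments at genes in S
  p_miss <- cumsum(!in_set) / sum(!in_set)       # decrements at other genes
  running <- p_hit - p_miss
  running[which.max(abs(running))]               # maximum deviation from zero
}

# Toy ranked list of eight genes (sorted by decreasing metric), two of them in S
ranked_stats <- c(5, 4, 3, 2, 1, -1, -2, -3)
in_set <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

es_classic <- enrichment_score(ranked_stats, in_set, p = 0)
es_weighted <- enrichment_score(ranked_stats, in_set, p = 1)
print(c(classic = es_classic, weighted = es_weighted))
```

In this toy list the weighted scheme gives a higher score because the gene of *S* at the top of the list carries a larger ranking metric than the one further down.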
<!-- <GeneSetName> <Description> <Gene1> <Gene2> <Gene3> ... -->
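A `.gmt` file stores one gene set per line as tab-separated fields: the gene set name, a description, then the member genes. The sketch below writes two hypothetical gene sets to a temporary file, reads them back and applies a size filter; the helper `read_gmt()`, the set names and the size limits are assumptions for illustration.

```r
# Two hypothetical gene sets in GMT layout (tab-separated fields)
gmt_lines <- c(
  "SET_A\tdescription of set A\tGENE1\tGENE2\tGENE3",
  "SET_B\tdescription of set B\tGENE2\tGENE4"
)
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Read a GMT file into a named list of character vectors
read_gmt <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  sets <- lapply(fields, function(x) x[-(1:2)])  # drop name and description
  names(sets) <- vapply(fields, function(x) x[1], character(1))
  sets
}
gene_sets <- read_gmt(gmt_file)

# Keep only gene sets within the chosen size limits
min_size <- 3
max_size <- 500
set_sizes <- lengths(gene_sets)
kept <- gene_sets[set_sizes >= min_size & set_sizes <= max_size]
print(names(kept))
```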
#### Browse the GSEA results
Once the job has been queued and successfully run, the output will be listed on the left panel under the `Jobs` tab:
```{r, echo=FALSE, fig.align = "center", fig.cap="Job status in GenePattern"}
knitr::include_graphics("images/GenePattern-Jobs.png")
```
One of the most important files is the `.zip` file, named earlier in the `output file name` tab of the Basic Parameters section, which includes all the results. The results can also be browsed through the individual files listed under the job ID.
For the Pezzini experiment, two `html` files are generated, one each for the up- and down-regulated gene sets, for example:
- gsea_report_for_Diff_1731388275794.html
- gsea_report_for_Nodiff_1731388275794.html
The tabulated versions of the results are given in `.tsv` format:
- gsea_report_for_Diff_1731388275794.tsv
- gsea_report_for_Nodiff_1731388275794.tsv
<!-- ```{css, echo=FALSE} -->
<!-- table { -->
<!-- width: 75%; -->
<!-- margin: auto; -->
<!-- border-collapse: collapse; -->
<!-- } -->
<!-- th, td { -->
<!-- padding: 5px; -->
<!-- text-align: left; -->
<!-- border: 1px solid #ddd; -->
<!-- } -->
<!-- ``` -->
The GSEA result tables have the following columns; the details of one gene set are given below:
```{r, echo=FALSE, message=FALSE}
# One example gene set from the GSEA report, shown as a parameter/value table
data <- data.frame(
  Parameter = c("GS (follow link to MSigDB)", "GS DETAILS", "SIZE", "ES", "NES", "NOM p-val", "FDR q-val", "FWER p-val", "RANK AT MAX", "LEADING EDGE"),
  Value = c(
    # Gene set name, formatted as a Markdown link to MSigDB
    "[REACTOME_FRS_MEDIATED_FGFR2_SIGNALING](https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_FRS_MEDIATED_FGFR2_SIGNALING)",
    "Details ...",
    "16",
    "0.83905387",
    "1.7128055",
    "0",
    "0.03902518",
    "0.648",
    "995",
    "tags=38%, list=7%, signal=40%"
  )
)
# Display the data as a table (knitr:: prefix so the chunk does not rely on library(knitr))
knitr::kable(data, caption = "Summary of GSEA Results for REACTOME_FRS_MEDIATED_FGFR2_SIGNALING Gene Set")
```
<!-- HALLMARK_CHOLESTEROL_HOMEOSTASIS -->
<!-- 68 0.59 1.49 0.012 0.059 0.201 4015 tags=60%, list=28%, signal=83% -->
The leading edge column has three values:
- tags: 38% of the genes in the gene set occur in the ranked list before the peak of the running enrichment score (the ES is positive here); these genes form the leading edge that contributes to the enrichment result.
- list: the peak is reached 7% of the way into the ranked gene list being analysed, which corresponds to the rank at max of 995.
- signal: the enrichment signal strength, which combines the two previous statistics, is 40%.
#### **Challenge:** How do different ranking metrics impact the output? {- .challenge}
Run GSEA analysis using Hallmark gene sets with two metrics (tTest and Ratio_of_Classes). What are the upregulated terms (FDR < 0.1) in the `Diff` class, based on the t-test and Ratio of Classes metrics?
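Before running the challenge, it can help to see how the ranking metrics differ on a toy example. The sketch below computes simplified versions of the categorical metrics in base R; the gene names and expression values are made up, and the formulas omit adjustments that the GSEA software may apply (such as a minimum standard deviation).

```r
# Toy expression matrix: 3 hypothetical genes x 6 samples (3 Diff, 3 Nodiff)
expr <- rbind(
  gene_up = c(10, 12, 14, 4, 6, 8),
  gene_noisy = c(1, 20, 39, 2, 4, 6),
  gene_down = c(2, 3, 4, 8, 9, 10)
)
classes <- c("Diff", "Diff", "Diff", "Nodiff", "Nodiff", "Nodiff")

# Split samples by class and summarise each gene
a <- expr[, classes == "Diff", drop = FALSE]
b <- expr[, classes == "Nodiff", drop = FALSE]
mean_a <- rowMeans(a)
mean_b <- rowMeans(b)
sd_a <- apply(a, 1, sd)
sd_b <- apply(b, 1, sd)

# Simplified ranking metrics for categorical phenotypes
metrics <- data.frame(
  signal_to_noise = (mean_a - mean_b) / (sd_a + sd_b),
  t_test = (mean_a - mean_b) / sqrt(sd_a^2 / ncol(a) + sd_b^2 / ncol(b)),
  ratio_of_classes = mean_a / mean_b,
  log2_ratio = log2(mean_a / mean_b),
  row.names = rownames(expr)
)
print(round(metrics, 3))

# Top-ranked gene under two of the metrics
top_by_t <- rownames(metrics)[which.max(metrics$t_test)]
top_by_ratio <- rownames(metrics)[which.max(metrics$ratio_of_classes)]
```

In this toy data the ratio of classes ranks the highly variable gene first, whereas the t-test, which accounts for within-class variance, ranks the consistently changed gene first.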
#### **Question** {- .rationale}
Why might the HALLMARK_CHOLESTEROL_HOMEOSTASIS gene set be upregulated specifically in the differentiation condition of SH-SY5Y cells in the [Pezzini et al. 2016](https://pubmed.ncbi.nlm.nih.gov/27422411/) experiment?
<details>
<summary>Show</summary>
- Relevance: Cholesterol is essential for neuronal function and membrane fluidity, particularly in processes like axonal growth and synapse formation. Neurons have a high demand for cholesterol, especially during differentiation when they extend axons and dendrites.
- Possible Insight: Upregulation of genes in this set could signify that differentiating cells are actively producing or transporting cholesterol to support membrane synthesis and cellular remodeling required for mature neuronal structures.
</details>
#### {-}
<!-- ## FEA in [Reactome](https://reactome.org/) -->
## FEA in Reactome <a href="https://reactome.org/" target="_blank"><img src="images/reactome-logo.png" alt="Reactome Logo" style="height:35px; vertical-align:middle;"></a>
Reactome is an open-source database of curated biological pathways across species, offering pathway maps and enrichment tools to analyse gene lists in a pathway-focused context. It’s ideal for visualising data within established biochemical and cellular processes.
### Steps to perform ORA in Reactome:
- Hit the `Analysis Tools` tab
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Analysis in Reactome"}
knitr::include_graphics("images/reactome-tabs.png")
```
- Choose `Analyse gene list` from the left panel
```{r, echo=FALSE, out.width="20%", fig.align = "center", fig.cap="Analysis Tools in Reactome"}
knitr::include_graphics("images/reactome-analysis-tools.png")
```
- Paste the list of features into the box or choose a file, then hit Continue
- Select preferred options:
\- Project to Human: This option will convert identifiers from non-human species into human equivalents, allowing you to analyse data across species.
\- Include interactors: This option integrates interactors from IntAct, a protein interaction database. Including interactors broadens the background network, potentially offering deeper insights.
- Hit Analyse!
### Steps to perform GSA in Reactome:
- If `Analyse gene expression` was chosen instead, Reactome offers the following gene set analysis:
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Reactome GSA"}
knitr::include_graphics("images/reactome-gsea-methods.png")
```
- Let's try CAMERA, which corresponds to the `camera()` function of the `limma` package in `R` for gene set analysis.
- Choose TMM normalisation to ensure consistency with the input data used in other tools within our workshop.
- Select data type and provide input data
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Reactome GSA - data type"}
knitr::include_graphics("images/reactome-gsea-select-data.png")
```
- Annotate columns by adding extra info as follows:
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Pathway diagram "}
knitr::include_graphics("images/reactome-gsea-add-column.png")
```
- Save dataset and Continue
- You can now browse the results
### Browse the Reactome results
Results can be browsed interactively using the Reactome pathway or Voronoi visualisation modes: <a target="_blank"><img src="images/reactome-modes.png" alt="Reactome Modes" style="height:35px; vertical-align:middle;"></a>
Users can explore the pathway names listed in the table within the `Analysis` tab; these names are also displayed as popups on the pathway diagrams.
One can also select a pathway of interest by navigating through the left panel or by simply searching for the term in the search box.
The enriched table can be downloaded as shown below:
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Table of ORA with Reactome"}
knitr::include_graphics("images/reactome-download-table.png")
```
The diagram can be downloaded using these icons: <a target="_blank"><img src="images/reactome-download-diagram.png" alt="Download Diagram" style="height:35px; vertical-align:middle;"></a>
Here is a sample pathway diagram from Reactome GSA.
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="Pathway diagram - GSA"}
knitr::include_graphics("images/reactome-PathwaysOverview-expression-data.png")
```
If `ssGSEA` was selected, the overall output looks like the following:
```{r, echo=FALSE, out.width="70%", fig.align = "center", fig.cap="Expression of top 30 pathways with ssGSEA"}
knitr::include_graphics("images/reactome-ssGSEA.png")
```
#### {-}
#### **Question** {- .rationale}
When running FEA in Reactome, which of the following analysis methods would you prefer?
\- PADOG (Pathway Analysis with Down-weighting of Overlapping Genes)
\- CAMERA (Correlation Adjusted Mean Rank)
\- ssGSEA (Single Sample Gene Set Enrichment Analysis)
#### {-}
## Uncertainties of a functional enrichment analysis
This section provides a summary of the paper by [Wünsch et al. (2023)](https://wires.onlinelibrary.wiley.com/doi/full/10.1002/wics.1643), which explores uncertainties inherent in functional enrichment analysis. The study critically examines the sources of variability and challenges in this analytical approach, offering insights into improving its reliability and robustness.
```{r, echo=FALSE, out.width="100%", fig.align = "center", fig.cap="From RNA sequencing measurements to the final results: A practical guide to navigating the choices and uncertainties of gene set analysis"}
knitr::include_graphics("images/Wunsch_et_al_2023.png")
```
### <span style="color:red;">Types of FEA</span>
Functional enrichment analysis (FEA) typically involves one of three approaches: over-representation analysis (ORA), functional class scoring (FCS), which includes gene set enrichment analysis (GSEA), and pathway topology (PT).
1. **ORA**
\- ORA methods are the least complex among the three approaches of FEA.
\- ORA methods require a list of differentially expressed genes, obtained beforehand from a differential expression analysis.
\- The background population (the universe) can be a general set of genes, such as all genes in the human genome, or a more specific set, such as the genes observed in the experiment.
\- A contingency table is created and the null distribution is modeled using the hypergeometric distribution.
2. **FCS**
\- FCS methods aim to aggregate the values of the gene-level statistics (ranks) into gene set-level statistic (enrichment score, ES).
\- FCS methods can be classified as FCS I, which take the expression data as input, or FCS II, which take a pre-ranked list of genes as input. With the latter, the information on the conditions (phenotypes) of the samples is lost; phenotype permutation therefore cannot be performed, leaving gene set permutation as the only option for generating the null distribution.
3. **PT**
\- PT additionally models interactions between the genes. This approach generally scores considerably lower in terms of popularity in the reference database.
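The ORA contingency table and its hypergeometric null distribution can be written out in base R. The counts below are hypothetical; the sketch only shows that a one-sided Fisher's exact test on the 2 x 2 table gives the same p-value as the upper tail of the hypergeometric distribution.

```r
universe_size <- 15000  # genes in the background (universe)
de_genes <- 300         # differentially expressed genes
set_size <- 150         # genes of the gene set within the universe
de_in_set <- 12         # differentially expressed genes that are in the set

# 2 x 2 contingency table: rows = DE status, columns = gene set membership
contingency <- matrix(
  c(de_in_set, de_genes - de_in_set,
    set_size - de_in_set, universe_size - de_genes - set_size + de_in_set),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("DE", "not DE"), c("in set", "not in set"))
)
print(contingency)

# One-sided Fisher's exact test for over-representation
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

# Equivalent upper-tail hypergeometric probability
p_hyper <- phyper(de_in_set - 1, set_size, universe_size - set_size,
                  de_genes, lower.tail = FALSE)
print(c(fisher = p_fisher, hypergeometric = p_hyper))
```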
### <span style="color:red;">Considerations</span>
\- Pre-filter expression data: Exclude lowly expressed genes to improve statistical power.
\- Handle gene IDs carefully: Convert gene IDs to the required format and remove any duplicates.
\- Normalise expression data: Address sample-specific biases to enable fair comparisons between samples.
\- Use appropriate methods for differential expression analysis: Recommended methods include limma (voom), DESeq2, and edgeR.
\- Select suitable gene-level statistics: For FCS II, choose a metric such as the moderated t-statistic to rank genes meaningfully.
\- Adjust for multiple testing: Ensure your analysis includes a correction for multiple hypothesis testing. Some methods require manual adjustments.
\- Choose gene set databases based on biological context: Ensure that the database aligns with the research question and the experimental system.
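The first three considerations can be illustrated on a toy count matrix in base R. This is a minimal sketch under stated assumptions: the gene names, counts and filtering thresholds are invented, and counts per million is used only as a simple stand-in for the normalisation and filtering offered by the dedicated packages named above.

```r
# Toy count matrix: 5 rows (one duplicated gene ID) x 4 samples
counts <- matrix(
  c(500, 520, 480, 510,
      0,   1,   0,   2,
    100, 120,  90, 110,
    100, 120,  90, 110,
     30,  25,  40,  35),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("gene_A", "gene_B", "gene_C", "gene_C", "gene_D"),
                  paste0("sample_", 1:4))
)

# 1. Remove duplicated gene IDs (keeps the first occurrence)
counts_unique <- counts[!duplicated(rownames(counts)), , drop = FALSE]

# 2. Keep genes with at least `min_count` reads in at least `min_samples` samples
min_count <- 10
min_samples <- 2
keep <- rowSums(counts_unique >= min_count) >= min_samples
counts_filtered <- counts_unique[keep, , drop = FALSE]

# 3. Library-size normalisation to counts per million (CPM)
cpm <- t(t(counts_filtered) / colSums(counts_filtered)) * 1e6
print(round(cpm))
```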
### <span style="color:red;">Recommendation</span>
- Awareness of Uncertainties:
\- Recognise uncertainties in methods, parameter choices, and data preprocessing when conducting Gene Set Analysis (GSA).
\- Understand that the method's name alone does not capture the full analysis pipeline.
- Clearly document all analysis choices, including methods, parameters, and preprocessing steps.
- Select methods, parameters, and preprocessing steps before starting the analysis to minimise bias.
- Set Technical Parameters:
\- Fix technical parameters like the random seed and number of permutations before running the analysis to ensure reproducibility.
\- Avoid adjusting these parameters to obtain favorable results.
- Avoid Cherry-picking:
\- Refrain from selectively reporting results based on favorable outcomes, as this can lead to over-optimistic and non-reproducible findings.
\- Avoid excessive tweaking of the analysis strategy to fit the data post hoc.
- Use different pipelines or parameter configurations as part of sensitivity analysis to check the consistency of results.
- Share complete analysis workflows, including code and documentation, to allow others to replicate the findings accurately.
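Fixing and recording technical parameters can be as simple as the following sketch. The parameter names and values in `analysis_params` are hypothetical; the point is that they are set once, before the analysis, and stored together with the session information.

```r
# Decide technical parameters up front and keep them in one place
analysis_params <- list(
  seed = 2024,
  n_permutations = 1000,
  fdr_cutoff = 0.05
)

# The same seed reproduces the same random permutation
set.seed(analysis_params$seed)
perm_1 <- sample(10)
set.seed(analysis_params$seed)
perm_2 <- sample(10)
print(identical(perm_1, perm_2))

# Record the R version and attached packages alongside the results
session <- sessionInfo()
print(session$R.version$version.string)
```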