GHA: Automated live (rendered) versions of the notebooks #441

Merged
merged 1 commit into from
Mar 16, 2021

591 changes: 591 additions & 0 deletions RNA-seq/01-qc_trim_quant.nb.html

Large diffs are not rendered by default.

12 changes: 11 additions & 1 deletion RNA-seq/02-gastric_cancer_tximeta-live.Rmd
@@ -1,12 +1,22 @@
---
title: "Gastric cancer: gene-level summarization with `tximeta`"
author: CCDL for ALSF
date: 2021
output:
html_notebook:
toc: true
toc_float: true
---

**CCDL 2021**
## Objectives

This notebook will demonstrate how to:

- Import RNA-seq expression quantification output using `tximeta`
- Summarize transcript-level expression to the gene level
- Interrogate and extract data from a `SummarizedExperiment` object

---

In this notebook, we'll import the transcript expression quantification output from `salmon quant` using the [`tximeta`](https://bioconductor.org/packages/release/bioc/html/tximeta.html) package.
`tximeta` is in part a wrapper around another package, [`tximport`](https://bioconductor.org/packages/release/bioc/html/tximport.html), which imports transcript expression data and summarizes it to the gene level.
1,039 changes: 1,039 additions & 0 deletions RNA-seq/02-gastric_cancer_tximeta.nb.html

Large diffs are not rendered by default.

55 changes: 34 additions & 21 deletions RNA-seq/03-gastric_cancer_exploratory-live.Rmd
@@ -1,23 +1,31 @@
---
title: "Gastric cancer: exploratory analysis"
author: CCDL for ALSF
date: 2021
output:
html_notebook:
toc: true
toc_float: true
---

**CCDL 2021**
## Objectives

This notebook will demonstrate how to:

- Create a `DESeq2` data set from a `SummarizedExperiment`
- Transform RNA-seq count data with a Variance Stabilizing Transformation
- Create PCA plots to explore structure among RNA-seq samples

---

In this notebook, we'll import the gastric cancer data and do some exploratory
analyses and visual inspection.
We'll use the [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) package for this.

![](diagrams/rna-seq_6.png)

`DESeq2` also has an
[excellent vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)
from Love, Anders, and Huber from which this is adapted
(see also: [Love, Anders, and Huber. _Genome Biology_. 2014.](https://doi.org/10.1186/s13059-014-0550-8)).
`DESeq2` also has an [excellent vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)
from Love, Anders, and Huber from which this is adapted (see also: [Love, Anders, and Huber. _Genome Biology_. 2014.](https://doi.org/10.1186/s13059-014-0550-8)).

## Libraries and functions

@@ -75,8 +83,7 @@ First, let's read in the data we processed with `tximeta`.

### Set up DESeq2 object

We use the tissue of origin in the design formula because that will allow us
to model this variable of interest.
We use the tissue of origin in the design formula because that will allow us to model this variable of interest.

```{r ddset}
ddset <- DESeqDataSet(gene_summarized,
@@ -85,27 +92,33 @@ ddset <- DESeqDataSet(gene_summarized,

### Variance stabilizing transformation

Before visualizing the data, we'll transform it such that it is on a `log2`
scale for large counts and library size is taken into account with the `DESeq2`
function for variance stabilizing transformation.
See [this section of
the `DESeq2` vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization)
for more on this topic.
Raw count data is not usually suitable for the algorithms we use for dimensionality reduction, clustering, or heatmaps.
To improve this, we will transform the count data to create an expression measure that is better suited for these analyses.
The core transformation will map the expression to a log2 scale, while accounting for some of the expected variation among samples and genes.

Since different samples are usually sequenced to different depths, we want to transform our RNA-seq count data to make different samples more directly comparable.
We also want to account for the fact that genes with low counts tend to have higher variance on the log2 scale, which could bias our clustering.
To handle both of these considerations, we can calculate a Variance Stabilizing Transformation of the count data, and work with that transformed data for our analysis.

See [this section of the `DESeq2` vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization) for more on this topic.

```{r vst}
vst_data <- vst(ddset)
```
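
To see why low-count genes are a concern on the log2 scale, here is a minimal sketch using simulated counts in base R.
It is not what `vst()` computes internally; it only illustrates the problem that the transformation is meant to address.
The Poisson means, the pseudocount of 1, and all variable names are assumptions made for illustration.

```r
set.seed(2021)

# Simulated Poisson counts for a low-count and a high-count gene
# (the means 5 and 1000 are hypothetical, chosen only for contrast)
low_counts <- rpois(1000, lambda = 5)
high_counts <- rpois(1000, lambda = 1000)

# Variance on a plain log2 scale; the pseudocount of 1 avoids log2(0)
var_low <- var(log2(low_counts + 1))
var_high <- var(log2(high_counts + 1))

# Compare the two variances side by side
c(low = var_low, high = var_high)
```

With a simple log2 transformation, the low-count gene shows much more variance than the high-count gene, even though neither differs between conditions.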

### Principal components analysis
### Principal component analysis

Principal component analysis (PCA) is a dimensionality reduction technique that allows us to identify the largest components of variation in a complex dataset.
Our expression data can be thought of as mapping each sample in a multidimensional space defined by the expression level of each gene.
The expression of many of those genes is correlated, so we can often get a better, simpler picture of the data by combining the information from those correlated genes.

PCA rotates and transforms this space so that each axis is now a combination of multiple correlated genes, ordered so the first axes capture the most variation from the data.
These new axes are the "principal components."
If we look at the first few components, we can often get a nice overview of relationships among the samples in the data.
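
The rotation described above can be sketched with base R's `prcomp()` on a small invented matrix.
This is an illustration under stated assumptions, not the notebook's analysis: the matrix `toy_expr`, its gene and sample names, and the simulated group effect are all hypothetical.

```r
set.seed(2021)

# Invented expression values: 6 samples (rows) x 4 genes (columns).
# gene_a and gene_b share a group effect, so they are correlated.
group_effect <- rep(c(0, 4), each = 3)
toy_expr <- cbind(
  gene_a = group_effect + rnorm(6, sd = 0.5),
  gene_b = group_effect + rnorm(6, sd = 0.5),
  gene_c = rnorm(6, sd = 0.5),
  gene_d = rnorm(6, sd = 0.5)
)
rownames(toy_expr) <- paste0("sample_", 1:6)

# prcomp() centers the data and rotates it onto the principal components
toy_pca <- prcomp(toy_expr, center = TRUE, scale. = FALSE)

# Proportion of variance captured by each component, largest first
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(var_explained, 3)

# Sample coordinates on the first two components
toy_pca$x[, 1:2]
```

Because the two correlated genes carry most of the variation, the first component should account for most of the variance in this toy example.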

Principal components analysis (PCA) is a dimensionality reduction technique
that captures the main sources of variation in our data in the first two
principal components (PC1 and PC2).
Visualizing PC1 and PC2 can give us insight into how different variables (e.g.,
tissue source) affect our dataset and help us spot any technical effects
(more on that below).
The `plotPCA()` function we will use from the `DESeq2` package calculates and plots the first two principal components (PC1 and PC2).
Visualizing PC1 and PC2 can give us insight into how different variables (e.g., tissue source) affect our dataset and help us spot any technical effects (more on that below).

`DESeq2` has built-in functionality for performing PCA.

```{r plotPCA, live = TRUE}
# DESeq2 built in function is called plotPCA and we want to color points by
828 changes: 828 additions & 0 deletions RNA-seq/03-gastric_cancer_exploratory.nb.html

Large diffs are not rendered by default.

424 changes: 424 additions & 0 deletions RNA-seq/04-nb_cell_line_tximeta.nb.html

Large diffs are not rendered by default.

23 changes: 17 additions & 6 deletions RNA-seq/05-nb_cell_line_DESeq2-live.Rmd
@@ -1,12 +1,23 @@
---
title: "Neuroblastoma Cell Line: Differential expression analysis with DESeq2"
author: CCDL for ALSF
date: 2021
output:
html_notebook:
toc: true
toc_float: true
---

**CCDL 2021**

## Objectives

This notebook will demonstrate how to:

- Perform differential expression analysis with `DESeq2`
- Apply a shrinkage algorithm to improve estimates of expression changes
- Draw a volcano plot with the `EnhancedVolcano` package

---

In this notebook, we'll perform an analysis to identify the genes that are differentially expressed in _MYCN_ amplified vs. nonamplified neuroblastoma cell lines.

@@ -135,9 +146,9 @@ gene_summarized$status <- relevel(gene_summarized$status, ref = "Nonamplified")

```

### Differential expression
## Differential expression analysis

#### Filtering low-expressed genes
### Filtering low-expressed genes

Genes that have very low counts are not likely to yield reliable differential expression results, so we will do some light [pre-filtering](http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering).
We will keep only genes with total counts of at least 10 across all samples.
@@ -148,7 +159,7 @@ ddset <- ddset[genes_to_keep, ]
```
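
The pre-filtering rule described above (a total count of at least 10 across all samples) can be sketched on a plain matrix.
The notebook's own filtering chunk is collapsed in this diff view, so the following is only an illustration; `toy_counts`, `toy_keep`, and `toy_filtered` are hypothetical names and the counts are invented.

```r
# Invented counts: 4 genes (rows) x 3 samples (columns)
toy_counts <- matrix(
  c(  0,  1,   2,
      5,  3,   4,
    120, 98, 143,
      0,  0,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene_", 1:4), paste0("sample_", 1:3))
)

# Keep genes whose counts summed across all samples are at least 10
toy_keep <- rowSums(toy_counts) >= 10

# Subset rows; drop = FALSE keeps the result a matrix even if one gene remains
toy_filtered <- toy_counts[toy_keep, , drop = FALSE]
rownames(toy_filtered)
```

The same logical-vector subsetting pattern applies to the rows of a `DESeqDataSet`, as in the `ddset[genes_to_keep, ]` line shown in the diff.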


#### Differential expression analysis
### The `DESeq()` function

We'll now use the wrapper function `DESeq()` to perform our differential expression analysis.
As mentioned earlier, this performs a number of steps, including an [outlier removal procedure](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#approach-to-count-outliers).
@@ -179,7 +190,7 @@ summary(deseq_results, alpha = 0.05)
```


#### Shrinking log2 fold change estimates
### Shrinking log2 fold change estimates

The estimates of log2 fold change calculated by `DESeq()` are not corrected for expression level.
This means that when counts are small, we are likely to end up with some large fold change values that overestimate the true extent of the change between conditions.
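
A small arithmetic sketch shows why fold changes from small counts are unstable.
It does not implement any shrinkage algorithm; the helper `toy_lfc()` and the counts are hypothetical and serve only to show how one extra read changes the estimate at low versus high expression.

```r
# Hypothetical helper: log2 fold change from two raw counts
toy_lfc <- function(treated, control) log2(treated / control)

# Two invented genes with the same 4-fold ratio at very different count levels
lfc_low <- toy_lfc(treated = 4, control = 1)
lfc_high <- toy_lfc(treated = 4000, control = 1000)

# Add a single read to the control count of each gene
lfc_low_shift <- toy_lfc(treated = 4, control = 2)
lfc_high_shift <- toy_lfc(treated = 4000, control = 1001)

# Change in the estimate caused by that single read
c(low = lfc_low - lfc_low_shift, high = lfc_high - lfc_high_shift)
```

One additional read halves the low-count estimate while barely moving the high-count one, which is the kind of noise that shrinkage is meant to dampen.
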
@@ -262,7 +273,7 @@ readr::write_tsv(deseq_df, file = deseq_df_file)
```


#### Volcano Plot
## Making a Volcano Plot

With these shrunken effect sizes, we will draw a volcano plot, using the [`EnhancedVolcano` package](https://github.com/kevinblighe/EnhancedVolcano) to make it a bit easier.
This package automatically color codes the points by cutoffs for both significance and fold change and labels many of the significant genes (subject to spacing).
1,159 changes: 1,159 additions & 0 deletions RNA-seq/05-nb_cell_line_DESeq2.nb.html

Large diffs are not rendered by default.
