AlexsLemonade · jashapiro · Mar 16, 2021 · Mar 16, 2021
diff --git a/RNA-seq/01-qc_trim_quant.nb.html b/RNA-seq/01-qc_trim_quant.nb.html
diff --git a/RNA-seq/02-gastric_cancer_tximeta-live.Rmd b/RNA-seq/02-gastric_cancer_tximeta-live.Rmd
@@ -1,12 +1,22 @@
 ---
 title: "Gastric cancer: gene-level summarization with `tximeta`"
+author: CCDL for ALSF
+date: 2021
 output:   
   html_notebook: 
     toc: true
     toc_float: true
 ---
 
-**CCDL 2021**
+## Objectives
+
+This notebook will demonstrate how to:
+
+- Import RNA-seq expression quantification output using `tximeta`
+- Summarize transcript-level expression to the gene level
+- Interrogate and extract data from a `SummarizedExperiment` object 
+
+---
 
 In this notebook, we'll import the transcript expression quantification output from `salmon quant` using the [`tximeta`](https://bioconductor.org/packages/release/bioc/html/tximeta.html) package.
 `tximeta` is in part a wrapper around another package, [`tximport`](https://bioconductor.org/packages/release/bioc/html/tximport.html), which imports transcript expression data and summarizes it to the gene level.

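Reviewer sketch (not part of this diff): the new objective "Summarize transcript-level expression to the gene level" could be illustrated with a toy example. The base R code below only shows the core idea of summing transcript counts within each gene. The count values and the `tx2gene` mapping are made up for illustration, and `tximeta`/`tximport` do considerably more than this (for example, handling transcript lengths and annotation metadata).

```r
# Toy transcript-level counts: 5 transcripts x 2 samples (made-up values)
tx_counts <- matrix(
  c(10, 0, 5, 20, 1,    # sample_a
    12, 3, 4, 18, 0),   # sample_b
  ncol = 2,
  dimnames = list(
    c("tx1", "tx2", "tx3", "tx4", "tx5"),
    c("sample_a", "sample_b")
  )
)

# Hypothetical transcript-to-gene mapping (an assumption for this sketch)
tx2gene <- data.frame(
  tx_id = c("tx1", "tx2", "tx3", "tx4", "tx5"),
  gene_id = c("geneA", "geneA", "geneB", "geneB", "geneC")
)

# Look up the gene for each row of the matrix, then sum counts within genes
gene_ids <- tx2gene$gene_id[match(rownames(tx_counts), tx2gene$tx_id)]
gene_counts <- rowsum(tx_counts, group = gene_ids)

print(gene_counts)
```

Each row of `gene_counts` is the sum of the rows of `tx_counts` that map to that gene, which is the simplest form of gene-level summarization.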
diff --git a/RNA-seq/02-gastric_cancer_tximeta.nb.html b/RNA-seq/02-gastric_cancer_tximeta.nb.html
diff --git a/RNA-seq/03-gastric_cancer_exploratory-live.Rmd b/RNA-seq/03-gastric_cancer_exploratory-live.Rmd
@@ -1,23 +1,31 @@
 ---
 title: "Gastric cancer: exploratory analysis"
+author: CCDL for ALSF
+date: 2021
 output:   
   html_notebook: 
     toc: true
     toc_float: true
 ---
 
-**CCDL 2021**
+## Objectives
+
+This notebook will demonstrate how to:
+
+- Create a `DESeq2` data set from a `SummarizedExperiment`
+- Transform RNA-seq count data with a Variance Stabilizing Transformation 
+- Create PCA plots to explore structure among RNA-seq samples
+
+---
 
 In this notebook, we'll import the gastric cancer data and do some exploratory
 analyses and visual inspection.
 We'll use the [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) package for this.
 
 ![](diagrams/rna-seq_6.png)
 
-`DESeq2` also has an 
-[excellent vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) 
-from Love, Anders, and Huber from which this is adapted 
-(see also: [Love, Anders, and Huber. _Genome Biology_. 2014.](https://doi.org/10.1186/s13059-014-0550-8)).
+`DESeq2` also has an [excellent vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) by Love, Anders, and Huber, 
+from which this notebook is adapted (see also: [Love, Anders, and Huber. _Genome Biology_. 2014.](https://doi.org/10.1186/s13059-014-0550-8)).
 
 ## Libraries and functions
 
@@ -75,8 +83,7 @@ First, let's read in the data we processed with `tximeta`.
 
 ### Set up DESeq2 object
 
-We use the tissue of origin in the design formula because that will allow us
-to model this variable of interest.
+We use the tissue of origin in the design formula because that will allow us to model this variable of interest.
 
 ```{r ddset}
 ddset <- DESeqDataSet(gene_summarized,
@@ -85,27 +92,33 @@ ddset <- DESeqDataSet(gene_summarized,
 
 ### Variance stabilizing transformation
 
-Before visualizing the data, we'll transform it such that it is on a `log2` 
-scale for large counts and library size is taken into account with the `DESeq2` 
-function for variance stabilizing transformation.
-See [this section of
-the `DESeq2` vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization) 
-for more on this topic. 
+Raw count data is not usually suitable for the algorithms we use for dimensionality reduction, clustering, or heatmaps. 
+To improve this, we will transform the count data to create an expression measure that is better suited for these analyses. 
+The core transformation will map the expression to a log2 scale, while accounting for some of the expected variation among samples and genes.
+
+Since different samples are usually sequenced to different depths, we want to transform our RNA-seq count data to make different samples more directly comparable. 
+We also want to account for the fact that genes with low counts tend to have higher variance (on the log2 scale), which could bias our clustering.
+To handle both of these considerations, we can calculate a Variance Stabilizing Transformation of the count data, and work with that transformed data for our analysis.
+
+See [this section of the `DESeq2` vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization) for more on this topic. 
 
 ```{r vst}
 vst_data <- vst(ddset)
 ```
 
-### Principal components analysis
+### Principal component analysis
+
+Principal component analysis (PCA) is a dimensionality reduction technique that allows us to identify the largest components of variation in a complex dataset.
+Our expression data can be thought of as placing each sample in a multidimensional space, with one dimension for the expression level of each gene.
+The expression levels of many of those genes are correlated, so we can often get a better, simpler picture of the data by combining the information from those correlated genes.
+
+PCA rotates and transforms this space so that each axis is now a combination of multiple correlated genes, ordered so that the first axes capture the most variation in the data. 
+These new axes are the "principal components."
+If we look at the first few components, we can often get a nice overview of relationships among the samples in the data.
 
-Principal components analysis (PCA) is a dimensionality reduction technique
-that captures the main sources of variation in our data in the first two 
-principal components (PC1 and PC2).
-Visualizing PC1 and PC2 can give us insight into how different variables (e.g.,
-tissue source) affect our dataset and help us spot any technical effects 
-(more on that below).
+The `plotPCA()` function we will use from the `DESeq2` package calculates and plots the first two principal components (PC1 and PC2).
+Visualizing PC1 and PC2 can give us insight into how different variables (e.g., tissue source) affect our dataset and help us spot any technical effects (more on that below).
 
-`DESeq2` has built-in functionality for performing PCA.
 
 ```{r plotPCA, live = TRUE}
 # DESeq2 built in function is called plotPCA and we want to color points by

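Reviewer sketch (not part of this diff): the new PCA explanation could be backed by a small, package-free example. The code below simulates counts for two hypothetical tissues, applies `log2(counts + 1)` as a crude stand-in for `vst()` (it is not the variance stabilizing transformation itself), and runs base R's `prcomp()`. All names and simulation settings here are assumptions made for illustration; the notebook itself uses `vst()` and `plotPCA()` from `DESeq2`.

```r
set.seed(2021)

n_genes <- 200
n_samples <- 6
tissue <- rep(c("tissue_1", "tissue_2"), each = 3)

# Simulated expected counts: one baseline mean per gene, shared by all samples
base_mean <- rexp(n_genes, rate = 1 / 100) + 1
mu <- matrix(base_mean, nrow = n_genes, ncol = n_samples)

# Make the first 40 genes 4-fold higher in the second tissue
mu[1:40, tissue == "tissue_2"] <- mu[1:40, tissue == "tissue_2"] * 4

# Draw Poisson counts around those means
counts <- matrix(
  rpois(n_genes * n_samples, lambda = mu),
  nrow = n_genes,
  dimnames = list(paste0("gene", 1:n_genes), paste0("s", 1:n_samples))
)

# Crude stand-in for vst(): a simple log2 transform with a pseudocount
log_counts <- log2(counts + 1)

# prcomp() expects samples in rows, so transpose the gene x sample matrix
pca <- prcomp(t(log_counts))

# Percent of total variance captured by each principal component
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Sample coordinates on the first two components, with the tissue label
pc_df <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  tissue = tissue
)
print(pc_df)
print(round(pct_var, 1))
```

Plotting `PC1` against `PC2` from `pc_df`, colored by `tissue`, gives the same kind of picture that `plotPCA()` draws, and `pct_var` corresponds to the percent variance shown on its axis labels.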
diff --git a/RNA-seq/03-gastric_cancer_exploratory.nb.html b/RNA-seq/03-gastric_cancer_exploratory.nb.html
diff --git a/RNA-seq/04-nb_cell_line_tximeta.nb.html b/RNA-seq/04-nb_cell_line_tximeta.nb.html
diff --git a/RNA-seq/05-nb_cell_line_DESeq2-live.Rmd b/RNA-seq/05-nb_cell_line_DESeq2-live.Rmd
@@ -1,12 +1,23 @@
 ---
 title: "Neuroblastoma Cell Line: Differential expression analysis with DESeq2"
+author: CCDL for ALSF
+date: 2021
 output:   
   html_notebook: 
     toc: true
     toc_float: true
 ---
 
-**CCDL 2021**
+
+## Objectives
+
+This notebook will demonstrate how to:
+
+- Perform differential expression analysis with `DESeq2`
+- Apply a shrinkage algorithm to improve estimates of expression changes
+- Draw a volcano plot with the `EnhancedVolcano` package
+
+---
 
 In this notebook, we'll perform an analysis to identify the genes that are differentially expressed in _MYCN_ amplified vs. nonamplified neuroblastoma cell lines. 
 
@@ -135,9 +146,9 @@ gene_summarized$status <- relevel(gene_summarized$status, ref = "Nonamplified")
 
 ```
 
-### Differential expression
+## Differential expression analysis
 
-#### Filtering low-expressed genes
+### Filtering low-expressed genes
 
 Genes that have very low counts are not likely to yield reliable differential expression results, so we will do some light [pre-filtering](http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering). 
 We will keep only genes with total counts of at least 10 across all samples.
@@ -148,7 +159,7 @@ ddset <- ddset[genes_to_keep, ]
 ```
 
 
-#### Differential expression analysis
+### The `DESeq()` function
 
 We'll now use the wrapper function `DESeq()` to perform our differential expression analysis.
 As mentioned earlier, this performs a number of steps, including an [outlier removal procedure](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#approach-to-count-outliers). 
@@ -179,7 +190,7 @@ summary(deseq_results, alpha = 0.05)
 ```
 
 
-#### Shrinking log2 fold change estimates
+### Shrinking log2 fold change estimates
 
 The estimates of log2 fold change calculated by `DESeq()` are not corrected for expression level.
 This means that when counts are small, we are likely to end up with some large fold change values that overestimate the true extent of the change between conditions.
@@ -262,7 +273,7 @@ readr::write_tsv(deseq_df, file = deseq_df_file)
 ```
 
 
-#### Volcano Plot 
+## Making a volcano plot
 
 With these shrunken effect sizes, we will draw a volcano plot, using the [`EnhancedVolcano` package](https://github.com/kevinblighe/EnhancedVolcano) to make it a bit easier.
 This package automatically color codes the points by cutoffs for both significance and fold change and labels many of the significant genes (subject to spacing).

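Reviewer sketch (not part of this diff): the pre-filtering rule in the retitled "Filtering low-expressed genes" section (keep genes with total counts of at least 10 across all samples) can be shown on a plain matrix. The counts below are made up, and in the notebook the counts would come from the `DESeqDataSet` rather than a hand-built matrix. The name `genes_to_keep` follows the notebook; everything else is illustrative.

```r
# Toy gene x sample count matrix (made-up values)
counts <- matrix(
  c(0,   1,  0,   2,
    3,   4,  2,   5,
    150, 90, 210, 120),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(
    c("gene_low", "gene_mid", "gene_high"),
    c("s1", "s2", "s3", "s4")
  )
)

# Keep genes whose total count across all samples is at least 10
genes_to_keep <- rowSums(counts) >= 10

# drop = FALSE keeps the result a matrix even if only one gene passes
filtered_counts <- counts[genes_to_keep, , drop = FALSE]

print(rowSums(counts))
print(filtered_counts)
```

Here `gene_low` has a total of 3 counts and is removed, while the other two genes pass the threshold.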
diff --git a/RNA-seq/05-nb_cell_line_DESeq2.nb.html b/RNA-seq/05-nb_cell_line_DESeq2.nb.html