diff --git a/RNA-seq/01-qc_trim_quant.nb.html b/RNA-seq/01-qc_trim_quant.nb.html new file mode 100644 index 00000000..238a7cfd --- /dev/null +++ b/RNA-seq/01-qc_trim_quant.nb.html @@ -0,0 +1,591 @@ + + + + +
+This notebook will demonstrate how to:
+We will first learn how to process RNA-seq data at the command line using samples that were assayed with paired-end sequencing.
+These samples come from a project (PRJNA178120
) that includes 8 samples from normal gastric tissue, gastric cancer cell lines and primary gastric tumor cell cultures.
Here we will perform quality control checks and trimming, and then estimate the transcript abundances for a single sample, SRR585570.
+Later, we will use the full dataset (n = 8) to explore how to summarize estimates to the gene level and do some exploratory data analyses with data the course directors have processed ahead of time.
+We’ll first want to set our working directory to the top level of the RNA-seq folder.
+Copy and paste the text in the code blocks below into your Terminal
window in RStudio. It should be in the lower left hand corner as a tab next to Console
.
Set the current directory to the top level of the RNA-seq module:
+cd ~/training-modules/RNA-seq
+Here ~/
refers to your home directory on the RStudio Server, which is the base folder in which your files live, including most of the materials for training. This is also the default working directory when you open a new RStudio session. A home directory is specific to you as a user on the RStudio Server; each user has their own folder to store their files.
Because these steps are computationally intensive and take a while to run, we’ve prepared a script that runs them for us. Once the script is running, we will give a short lecture to introduce this module and then walk through and explain each of the individual steps that the script executes.
+Enter the following in the Terminal to start running the script:
+bash scripts/run_SRR585570.sh
+Note: Don’t worry if the Salmon step does not complete by the time we move on to the next notebook. This is a time and resource intensive step, so we have prepared the required output in case we need it.
+The raw data FASTQ files (fastq.gz
) for this sample, SRR585570, are in data/gastric-cancer/fastq/SRR585570
. The first two directories, data/
and gastric-cancer/
, tell us that these files are data and which experiment or dataset these data are from. We’ll be working with an additional dataset later in the module, so this latter distinction will become more important. The third directory, fastq/
, tells us that this is where we will be storing fastq files.
The final directory, SRR585570
, is specific to the sample we are working with. The use of the SRR585570
folder might seem unnecessary because we are only processing a single sample here, but keeping files for individual samples in their own folder helps keep things organized for multi-sample workflows. (You can peek ahead and look at the data/NB-cell/quant
folder for such an example.)
There is no “one size fits all” approach for project organization. What matters most is that the organization is consistent, makes it easy for you and others to find the files you need quickly, and minimizes the likelihood of errors (e.g., accidentally overwriting files).
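+As a rough sketch of the layout described above, the snippet below creates the per-sample folders for one dataset inside a scratch directory. The helper name `make_sample_dirs` and the scratch location are our own assumptions for illustration; they are not part of the course script.

```shell
set -eu

# Hypothetical helper: create the per-sample folders used in this module.
# Arguments: project root, dataset name, sample ID
make_sample_dirs() {
  mkdir -p \
    "$1/data/$2/fastq/$3" \
    "$1/data/$2/fastq-trimmed/$3" \
    "$1/QC/$2/fastqc/$3" \
    "$1/QC/$2/fastp/$3"
}

# Work in a scratch directory so nothing in the real project is touched
project_root="$(mktemp -d)"

for sample in SRR585570 SRR585574; do
  make_sample_dirs "${project_root}" gastric-cancer "${sample}"
done

# List the sample-level directories that were created
find "${project_root}" -mindepth 4 -type d | sort
```

+Keeping the dataset name and sample ID as separate path components means the same helper would work unchanged for a second dataset.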
+The first thing our script does is use FastQC for quality control in command line mode. Here’s a link to the FastQC documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/
+Let’s take a look at some example reports from the authors of FastQC:
+FastQC runs a series of quality checks on sequencing data and provides an HTML report. As the authors point out in the docs:
+++It is important to stress that although the analysis results appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library.
+
The documentation for individual modules/analyses in FastQC is a great resource!
+To save time, our script only runs one FASTQ file for SRR585570 with the following commands:
+mkdir -p QC/gastric-cancer/fastqc/SRR585570
+mkdir
allows us to create a new folder in the QC/gastric-cancer/fastqc
directory specifically to hold the report information that will be generated by FastQC for this sample. The -p
allows us to create parent directories and will prevent an error if the directory we specify already exists.
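+To see the effect of -p for yourself, here is a small sketch that runs in a temporary directory (the scratch location is an assumption made so the example does not touch the real QC folder):

```shell
set -eu

scratch="$(mktemp -d)"
target="${scratch}/QC/gastric-cancer/fastqc/SRR585570"

# -p creates QC/, gastric-cancer/ and fastqc/ on the way to SRR585570/
mkdir -p "${target}"

# Running the same command again is harmless because of -p
mkdir -p "${target}"

# Without -p, mkdir reports an error because the directory already exists
if mkdir "${target}" 2>/dev/null; then
  plain_mkdir_status="succeeded"
else
  plain_mkdir_status="failed"
fi
echo "mkdir without -p on an existing directory: ${plain_mkdir_status}"
```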
# In the interest of time, we'll run one of the fastq files through FastQC
+fastqc data/gastric-cancer/fastq/SRR585570/SRR585570_1.fastq.gz \
+ -o QC/gastric-cancer/fastqc/SRR585570
+-o
The -o
flag allows us to specify where the output of FastQC is saved. Note that this is saved in a separate place from the raw data files, in a directory specifically for quality control information.
For comparison to the report for SRR585570_1.fastq.gz
we generate with our script, we’ve prepared a FastQC report for one of the sets of reads for another sample in the experiment. It can be found at QC/gastric-cancer/fastqc/SRR585574/SRR585574_1_fastqc.html
.
Let’s look at the reports for both samples.
+We use fastp to preprocess the FASTQ files (Chen et al. Bioinformatics. 2018.). Note that fastp has quality control functionality and many different options for preprocessing (see all options on GitHub), most of which we will not cover. Here, we focus on adapter trimming, quality filtering, and length filtering.
+Below, we discuss the commands we used in the script.
+# Create a directory to hold the trimmed fastq files
+mkdir -p data/gastric-cancer/fastq-trimmed/SRR585570
+# Create a directory to hold the QC output from fastp
+mkdir -p QC/gastric-cancer/fastp/SRR585570
+As we’ll cover below, fastp essentially has two kinds of output: trimmed and filtered FASTQ files (data) and reports (quality control).
+# Run the adapter and quality trimming step -- also produces QC report
+fastp -i data/gastric-cancer/fastq/SRR585570/SRR585570_1.fastq.gz \
+ -I data/gastric-cancer/fastq/SRR585570/SRR585570_2.fastq.gz \
+ -o data/gastric-cancer/fastq-trimmed/SRR585570/SRR585570_fastp_1.fastq.gz \
+ -O data/gastric-cancer/fastq-trimmed/SRR585570/SRR585570_fastp_2.fastq.gz \
+ --qualified_quality_phred 15 \
+ --length_required 20 \
+ --report_title "SRR585570" \
+ --json QC/gastric-cancer/fastp/SRR585570/SRR585570_fastp.json \
+ --html QC/gastric-cancer/fastp/SRR585570/SRR585570_fastp.html
+Below, we’ll walk through the arguments/options we used to run fastp
. By default, fastp performs adapter trimming, which you can read more about here. For paired-end data like the data we have for SRR585570, adapters can be detected automatically without specifying an adapter sequence.
-i
and -I
These arguments specify the read1 input and read2 (sometimes called left and right) input, respectively.
+-o
and -O
These arguments specify the read1 output and read2 output, respectively. Note that the output is being placed in data/gastric-cancer/fastq-trimmed/SRR585570/
, so the processed FASTQ files will be kept separate from the original files. It is generally good practice to treat your “raw” data and its directories as fixed and separate from any processing and analysis that you do, to prevent accidental modification of those original files. And in the event that you accidentally do modify the originals, you know exactly which files and directories to reset.
--qualified_quality_phred
Phred scores are the quality information included in a FASTQ file and the values indicate the chances that a base is called incorrectly. Let’s look at a screenshot of the Per Base Sequence Quality module from the FastQC bad Illumina example we linked to above.
+Anything below 20, where a Phred score of 20 represents a 1 in 100 chance that the call is incorrect, is considered poor quality by FastQC. Using --qualified_quality_phred 15
(which is the default), means scores >= 15 are considered “qualified.” Using the default parameters as we do here, reads will be filtered out if >40% of the bases are unqualified. You can read more about the quality filtering functionality of fastp here. The Salmon documentation notes that, given the way we run salmon quant
, quantification may be more sensitive to calls that are likely to be erroneous (of low quality) and, therefore, quality trimming may be important.
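+To make the Phred scale concrete: a score Q corresponds to an error probability of 10^(-Q/10). The sketch below uses awk to do the arithmetic; the function name `phred_to_error_prob` is ours, chosen for illustration.

```shell
set -eu

# Convert a Phred quality score Q into the probability that the base call
# is wrong, using P = 10^(-Q/10)
phred_to_error_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Print the error probability for a few commonly discussed scores
for q in 10 15 20 30; do
  printf 'Q%s -> %s\n' "${q}" "$(phred_to_error_prob "${q}")"
done
```

+A score of 20 gives 0.01, which is the 1 in 100 chance mentioned above, and the fastp default threshold of 15 corresponds to roughly a 3 in 100 chance.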
Trimming, in contrast to filtering, refers to removing low quality base calls from the (typically 3’) end of reads. A recent paper from the Salmon authors (Srivastava et al. 2020) notes that trimming did not affect mapping rates from random publicly available human bulk (paired-end) RNA-seq samples (they used TrimGalore). fastp does have the functionality to perform trimming using a sliding window, which must be enabled. We are not using it here.
+Note that there are two kinds of encoding for Phred scores: Phred 33 encoding and Phred 64 encoding. FastQC guessed that the file for SRR585570 uses Sanger/Illumina 1.9 encoding (Phred 33). If we had Phred 64 data, we’d use the --phred64
flag. You can read a little bit more about the encoding here.
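+With Phred 33 encoding, each quality character stores its score as the character’s ASCII code minus 33. The sketch below decodes a quality string this way and then applies the filtering rule described earlier (a base is unqualified below Q15, and a read is dropped when more than 40% of its bases are unqualified). This is only an illustration of the rule as we described it, not how fastp is implemented, and the function names are our own.

```shell
set -eu

# Decode one quality character under Phred 33: ASCII code minus 33
qual_char_to_phred() {
  ascii_code=$(printf '%d' "'$1")
  echo $((ascii_code - 33))
}

# Decide whether a read is kept, given its quality string.
# Arguments: quality string, minimum qualified score, max percent unqualified
check_read_quality() {
  remaining="$1"
  min_qual="$2"
  max_unqualified_pct="$3"
  n_bases=${#remaining}
  n_unqualified=0
  # Walk through the quality string one character at a time
  while [ -n "${remaining}" ]; do
    rest="${remaining#?}"
    first_char="${remaining%"${rest}"}"
    if [ "$(qual_char_to_phred "${first_char}")" -lt "${min_qual}" ]; then
      n_unqualified=$((n_unqualified + 1))
    fi
    remaining="${rest}"
  done
  # Compare n_unqualified / n_bases against the limit without using fractions
  if [ $((100 * n_unqualified)) -gt $((max_unqualified_pct * n_bases)) ]; then
    echo "filtered"
  else
    echo "kept"
  fi
}

# 3 of 10 bases are low quality ('#' is Q2), so this read passes
check_read_quality 'IIIIIII###' 15 40
# 6 of 10 bases are low quality, so this read is dropped
check_read_quality 'IIII######' 15 40
```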
--length_required
Trimming reads may result in short reads, which may affect gene expression estimates (Williams et al. BMC Bioinformatics. 2016.). Using --length_required 20
means that reads shorter than 20bp will be discarded (similar to what was used in Srivastava et al. above).
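+If you are curious how many reads in a file fall under a length cutoff, a quick check is possible with awk. The sketch below counts short reads in a FASTQ stream using a two-read toy example; the function name `count_short_reads` is ours, and this is only a way to inspect a file, not a substitute for the filtering fastp performs.

```shell
set -eu

# Count reads shorter than a minimum length in an uncompressed FASTQ stream.
# Each FASTQ record is 4 lines long and the sequence is the 2nd line.
count_short_reads() {
  awk -v min_len="$1" '
    NR % 4 == 2 { total++; if (length($0) < min_len) short++ }
    END { printf "%d of %d reads are shorter than %d bp\n", short, total, min_len }
  '
}

# Toy FASTQ with one 24 bp read and one 8 bp read
printf '%s\n' \
  '@read1' 'ACGTACGTACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
  '@read2' 'ACGTACGT' '+' 'IIIIIIII' |
  count_short_reads 20
```

+For a compressed file, you could pipe the output of `gzip -dc` on the `.fastq.gz` file into the same function.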
--report_title
When we look at the HTML report, it’s helpful to quickly identify what sample the report is for. Using --report_title "SRR585570"
means that the report will be titled “SRR585570” rather than the default (“fastp report”).
--json
and --html
With these options, we’re specifying where the JSON and HTML reports will be saved (in the QC/gastric-cancer/fastp/
directory we created) and what the filenames will be. Including the sample name in the filenames again may help us with project organization.
If we look at QC/gastric-cancer/fastp/SRR585570/SRR585570_fastp.json
or the top of the HTML report, we can see that fastp reports certain metrics before and after filtering, which can be very useful in making analysis decisions.
We’ll use Salmon for quantifying transcript expression (documentation). Salmon (Patro, et al. Nature Methods. 2017.) is fast and requires very little memory, which makes it a great choice for running on your laptop during training. We can use the output for downstream analyses like differential expression analysis and clustering. We use Salmon in mapping mode, with mapping validation enabled, using the following command:
+# We perform quantification on the files that have been trimmed
+# and use the index generated with -k 23, as this may "improve sensitivity"
+# per the Salmon documentation
+salmon quant -i index/Homo_sapiens/short_index \
+ -l A \
+ -1 data/gastric-cancer/fastq-trimmed/SRR585570/SRR585570_fastp_1.fastq.gz \
+ -2 data/gastric-cancer/fastq-trimmed/SRR585570/SRR585570_fastp_2.fastq.gz \
+ -o data/gastric-cancer/salmon_quant/SRR585570 \
+ --validateMappings --rangeFactorizationBins 4 \
+ --gcBias --seqBias \
+ --threads 4
+Below, we’ll walk through the arguments/options we used to run salmon quant
.
-i
Salmon requires a set of transcripts (what we want to quantify) in the form of a transcriptome index built with salmon index
. Building an index can take a while (but you only have to do it once!), so we’ve built the one we use today ahead of time. Before we use it, we’ll take a moment to give a bit of background.
You can see how we obtained this index and others on GitHub. Note that we used Homo sapiens GRCh38, Ensembl release 95. It is important to keep track of what build, resource, and files were used and putting our shell scripts on GitHub allows us to do that.
+The salmon index
command has a parameter -k
which sets the k-mer length. The index we used was built with -k 23
and can be found here:
index/Homo_sapiens/short_index
+Using a smaller value for k than the default (k = 31) is appropriate for shorter reads and may improve sensitivity when using --validateMappings
according to the Salmon documentation.
-l
We use -l A
to allow Salmon to automatically infer the library type based on a subset of reads, but you can also provide the library type to Salmon with this argument.
-1
and -2
Because these data are paired-end, we use -1
and -2
to specify read1 and read2, respectively.
-o
Output directory, salmon quant
should create this for us if it doesn’t exist yet.
--validateMappings
and --rangeFactorizationBins
Using --validateMappings
enables mapping validation, where Salmon checks its mappings using traditional alignment. This helps prevent “spurious mappings” where a read maps to a target but does not arise from it (see documentation for flag and the release notes for v0.10.0
where this was introduced).
When enabling mapping validation with --validateMappings
, setting --rangeFactorizationBins 4
can improve quantification for certain classes of transcripts (docs).
--gcBias
With this option enabled, Salmon will attempt to correct for fragment GC-bias. Regions with high or low GC content tend to be underrepresented in sequencing data.
+Note that this is only appropriate for use with paired-end reads, as fragment length cannot be inferred from single-end reads (see this GitHub issue).
+--seqBias
With this option enabled, Salmon will attempt to correct for the bias that occurs when using random hexamer priming (preferential sequencing of reads when certain motifs appear at the beginning).
+--threads
The --threads
argument controls the number of threads that are available to Salmon during quantification. This in essence controls how much of the mapping can occur in parallel. If you had access to a computer with many cores, you could increase the number of threads to make quantification go faster.
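+Before raising the thread count, it helps to check how many cores the machine reports. The sketch below caps a requested thread count at the number of available cores; the variable names are ours, and it only prints the value rather than running Salmon.

```shell
set -eu

# Number of processor cores currently online
available_cores="$(getconf _NPROCESSORS_ONLN)"

# The value we would like to pass to --threads
requested_threads=4

# Never ask for more threads than there are cores
if [ "${requested_threads}" -gt "${available_cores}" ]; then
  threads="${available_cores}"
else
  threads="${requested_threads}"
fi

echo "Would run salmon quant with --threads ${threads}"
```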
Navigate to data/gastric-cancer/salmon_quant/SRR585571/aux_info
and open meta_info.json
. Look for a field called percent_mapped
– what value does this sample have?
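+If you prefer to stay in the Terminal, the field can also be pulled out with sed. The sketch below assumes the JSON file is written with one field per line, and it runs on a toy stand-in file whose numbers are made up; point the function at the real meta_info.json to answer the question above.

```shell
set -eu

# Print the percent_mapped value from a Salmon meta_info.json file
get_percent_mapped() {
  sed -n 's/^[[:space:]]*"percent_mapped":[[:space:]]*\([0-9.]*\).*$/\1/p' "$1"
}

# Toy stand-in for aux_info/meta_info.json; these numbers are made up
mock_meta="$(mktemp)"
printf '%s\n' \
  '{' \
  '    "num_processed": 1000,' \
  '    "num_mapped": 850,' \
  '    "percent_mapped": 85.0,' \
  '    "call": "quant"' \
  '}' > "${mock_meta}"

get_percent_mapped "${mock_meta}"
```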
tximeta
This notebook will demonstrate how to:
+tximeta
SummarizedExperiment
object
+In this notebook, we’ll import the transcript expression quantification output from salmon quant
using the tximeta
package. tximeta
is in part a wrapper around another package, tximport
, which imports transcript expression data and summarizes it to the gene level. Working at the gene rather than transcript level has a number of potential advantages for interpretability, efficiency, and reduction of false positives (Soneson et al. 2016). tximeta
eases some of the burden of import by automatically identifying the correct set of annotation data to append to many data sets (Love et al. 2020).
For more information about tximeta
, see this excellent vignette from Love et al.
# Load magrittr for the pipe
+library(magrittr)
+
+# Load the tximeta package
+library(tximeta)
+
+# Load the SummarizedExperiment package
+library(SummarizedExperiment)
+
+
+Loading required package: MatrixGenerics
+
+
+Loading required package: matrixStats
+
+
+
+Attaching package: 'MatrixGenerics'
+
+
+The following objects are masked from 'package:matrixStats':
+
+ colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+ colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+ colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+ colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+ colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+ colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+ colWeightedMeans, colWeightedMedians, colWeightedSds,
+ colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+ rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+ rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+ rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+ rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+ rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+ rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+ rowWeightedSds, rowWeightedVars
+
+
+Loading required package: GenomicRanges
+
+
+Loading required package: stats4
+
+
+Loading required package: BiocGenerics
+
+
+Loading required package: parallel
+
+
+
+Attaching package: 'BiocGenerics'
+
+
+The following objects are masked from 'package:parallel':
+
+ clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+ clusterExport, clusterMap, parApply, parCapply, parLapply,
+ parLapplyLB, parRapply, parSapply, parSapplyLB
+
+
+The following objects are masked from 'package:stats':
+
+ IQR, mad, sd, var, xtabs
+
+
+The following objects are masked from 'package:base':
+
+ anyDuplicated, append, as.data.frame, basename, cbind, colnames,
+ dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
+ grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
+ order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+ rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
+ union, unique, unsplit, which.max, which.min
+
+
+Loading required package: S4Vectors
+
+
+
+Attaching package: 'S4Vectors'
+
+
+The following object is masked from 'package:base':
+
+ expand.grid
+
+
+Loading required package: IRanges
+
+
+Loading required package: GenomeInfoDb
+
+
+Loading required package: Biobase
+
+
+Welcome to Bioconductor
+
+ Vignettes contain introductory material; view with
+ 'browseVignettes()'. To cite Bioconductor, see
+ 'citation("Biobase")', and for packages 'citation("pkgname")'.
+
+
+
+Attaching package: 'Biobase'
+
+
+The following object is masked from 'package:MatrixGenerics':
+
+ rowMedians
+
+
+The following objects are masked from 'package:matrixStats':
+
+ anyMissing, rowMedians
+
+
+
+# directory where the data are located
+data_dir <- file.path("data", "gastric-cancer")
+
+# directory where the quant files are located, each sample is its own
+# directory
+quant_dir <- file.path(data_dir, "salmon_quant")
+
+# create a directory to hold the tximeta results if it doesn't exist yet
+txi_dir <- file.path(data_dir, "txi")
+if (!dir.exists(txi_dir)) {
+ dir.create(txi_dir, recursive = TRUE)
+}
+
+
+
+We’ll need the quant.sf
files for all the samples in an experiment, which we have stored in quant_dir
.
# the quant files themselves
+sf_files <- list.files(quant_dir, recursive = TRUE, full.names = TRUE,
+ pattern = "quant.sf")
+
+
+
+
+
+
+# sample metadata file
+meta_file <- file.path(data_dir, "gastric-cancer_metadata.tsv")
+
+
+
+Output
+# Name the output gastric-cancer_tximeta.RDS and use the directory created
+# above as the rest of the path
+txi_out_file <- file.path(txi_dir, "gastric-cancer_tximeta.RDS")
+
+
+
+All output files from salmon quant
we’ll use with tximeta
are named quant.sf
. Unfortunately, this means that the file names themselves do not have any information about the sample they come from!
# Let's look at the full path for the quant.sf files
+sf_files
+
+
+[1] "data/gastric-cancer/salmon_quant/SRR585570/quant.sf"
+[2] "data/gastric-cancer/salmon_quant/SRR585571/quant.sf"
+[3] "data/gastric-cancer/salmon_quant/SRR585572/quant.sf"
+[4] "data/gastric-cancer/salmon_quant/SRR585573/quant.sf"
+[5] "data/gastric-cancer/salmon_quant/SRR585574/quant.sf"
+[6] "data/gastric-cancer/salmon_quant/SRR585575/quant.sf"
+[7] "data/gastric-cancer/salmon_quant/SRR585576/quant.sf"
+[8] "data/gastric-cancer/salmon_quant/SRR585577/quant.sf"
+
+
+
+
+
+Let’s extract the sample names from the file paths using the stringr
package.
Notice how the file path is separated by /
. If we were to split up this character string by /
, the second-to-last item is the sample name (because we used sample names as directory names for the salmon
output). This is exactly what stringr::word()
allows us to do: split up the file paths by /
and extract the sample names.
sample_names <- stringr::word(sf_files, -2, sep = "/")
+sample_names
+
+
+[1] "SRR585570" "SRR585571" "SRR585572" "SRR585573" "SRR585574" "SRR585575"
+[7] "SRR585576" "SRR585577"
+
+
+
+
+
+tximeta
needs a data frame with at least these two columns:
+- a files column with the file paths to the quant.sf files
+- a names column with the sample names
coldata <- data.frame(files = sf_files,
+ names = sample_names)
+
+
+
+We have more information about these samples stored in the metadata file that we will also want stored in coldata
. Let’s read in the sample metadata from the TSV file.
# Read in the sample metadata TSV file and have a look
+sample_meta_df <- readr::read_tsv(meta_file)
+
+
+
+── Column specification ────────────────────────────────────────────────────────
+cols(
+ tissue = col_character(),
+ srr_accession = col_character(),
+ title = col_character()
+)
+
+
+sample_meta_df
+
+We’ll want this information to be added to the coldata
, which we can do by using a join function to match up the rows between the two data frames and combine them.
coldata <- coldata %>%
+ dplyr::inner_join(sample_meta_df, by = c("names" = "srr_accession"))
+
+coldata
+
+tximeta
Using the coldata
data frame that we set up, we can now run tximeta()
to import our expression data while automatically finding and associating the transcript annotations that were used when we performed the quantification.
The first time you run tximeta()
you may get a message about storing downloaded transcriptome data in a cache directory so that it can retrieve the data more quickly the next time. We recommend you use the cache, and accept the default location.
txi_data <- tximeta(coldata)
+
+
+importing quantifications
+
+
+reading in files with read_tsv
+
+
+1 2 3 4 5 6 7 8
+found matching transcriptome:
+[ Ensembl - Homo sapiens - release 95 ]
+useHub=TRUE: checking for EnsDb via 'AnnotationHub'
+using temporary cache /tmp/RtmpCAqF9W/BiocFileCache
+snapshotDate(): 2020-10-27
+found matching EnsDb via 'AnnotationHub'
+downloading 1 resources
+retrieving 1 resource
+loading from cache
+require("ensembldb")
+generating transcript ranges
+
+
+
+tximeta currently works easily for most human and mouse datasets, but requires a few more steps for other species.
+We’ll summarize to the gene level using the summarizeToGene()
function.
# Summarize to the gene level
+gene_summarized <- summarizeToGene(txi_data)
+
+
+loading existing EnsDb created: 2021-03-16 18:26:32
+
+
+obtaining transcript-to-gene mapping from database
+
+
+generating gene ranges
+
+
+summarizing abundance
+
+
+summarizing counts
+
+
+summarizing length
+
+
+
+We can use the class
function to see what type of object gene_summarized
is.
# Check what type of object `gene_summarized` is
+class(gene_summarized)
+
+
+[1] "RangedSummarizedExperiment"
+attr(,"package")
+[1] "SummarizedExperiment"
+
+
+
+
+
+This tells us that gene_summarized
is an object called a SummarizedExperiment
which can be handled by functions from the package of the same name. More precisely, we have a RangedSummarizedExperiment
, which is a specialized type of SummarizedExperiment
.
SummarizedExperiment
objects have this general structure:
This figure is from this handy vignette about SummarizedExperiment
objects.
As shown in the diagram, we can use some of the functions provided by the SummarizedExperiment
package to extract data from our gene_summarized
object. For example, calling rowData()
on our object shows all the gene information that tximeta
set up!
# rowData() shows us our gene annotation
+rowData(gene_summarized)
+
+
+DataFrame with 37788 rows and 9 columns
+ gene_id gene_name gene_biotype
+ <character> <character> <character>
+ENSG00000000003 ENSG00000000003 TSPAN6 protein_coding
+ENSG00000000005 ENSG00000000005 TNMD protein_coding
+ENSG00000000419 ENSG00000000419 DPM1 protein_coding
+ENSG00000000457 ENSG00000000457 SCYL3 protein_coding
+ENSG00000000460 ENSG00000000460 C1orf112 protein_coding
+... ... ... ...
+ENSG00000285978 ENSG00000285978 AC113348.2 protein_coding
+ENSG00000285982 ENSG00000285982 AC012213.5 protein_coding
+ENSG00000285986 ENSG00000285986 BX248415.1 unprocessed_pseudogene
+ENSG00000285990 ENSG00000285990 AL589743.7 transcribed_unproces..
+ seq_coord_system description gene_id_version
+ <character> <character> <character>
+ENSG00000000003 chromosome tetraspanin 6 [Sourc.. ENSG00000000003.14
+ENSG00000000005 chromosome tenomodulin [Source:.. ENSG00000000005.5
+ENSG00000000419 chromosome dolichyl-phosphate m.. ENSG00000000419.12
+ENSG00000000457 chromosome SCY1 like pseudokina.. ENSG00000000457.13
+ENSG00000000460 chromosome chromosome 1 open re.. ENSG00000000460.16
+... ... ... ...
+ENSG00000285978 chromosome novel transcript ENSG00000285978.1
+ENSG00000285982 chromosome novel protein ENSG00000285982.1
+ENSG00000285986 chromosome complement factor H .. ENSG00000285986.1
+ENSG00000285990 chromosome neurobeachin (NBEA) .. ENSG00000285990.1
+ symbol entrezid
+ <character> <list>
+ENSG00000000003 TSPAN6 7105
+ENSG00000000005 TNMD 64102
+ENSG00000000419 DPM1 8813
+ENSG00000000457 SCYL3 57147
+ENSG00000000460 C1orf112 55732
+... ... ...
+ENSG00000285978 AC113348.2 NA
+ENSG00000285982 AC012213.5 NA
+ENSG00000285986 BX248415.1 NA
+ENSG00000285990 AL589743.7 NA
+ tx_ids
+ <CharacterList>
+ENSG00000000003 ENST00000373020,ENST00000494424,ENST00000496771,...
+ENSG00000000005 ENST00000373031,ENST00000485971
+ENSG00000000419 ENST00000371582,ENST00000371584,ENST00000371588,...
+ENSG00000000457 ENST00000367770,ENST00000367771,ENST00000367772,...
+ENSG00000000460 ENST00000286031,ENST00000359326,ENST00000413811,...
+... ...
+ENSG00000285978 ENST00000638723
+ENSG00000285982 ENST00000649416
+ENSG00000285986 ENST00000649395
+ENSG00000285990 ENST00000649331
+ [ reached getOption("max.print") -- omitted 1 rows ]
+
+
+
+The assay
slot in SummarizedExperiment
s holds data from the experiment. In this case, it will include our gene-level expression information stored as a gene x sample matrix.
Multiple assays
can be stored in a SummarizedExperiment
and we can use the assayNames()
function to see what assays are included in gene_summarized
.
assayNames(gene_summarized)
+
+
+[1] "counts" "abundance" "length"
+
+
+
+
+
+If we want to extract an assay
’s data, we can use the assay()
function and specify the name of the assay we want to extract.
counts_mat <- assay(gene_summarized, "counts")
+
+
+
+We can use the class
function to see what type of object the assay()
function returns.
# Check what type of object `counts_mat` is
+class(counts_mat)
+
+
+[1] "matrix" "array"
+
+
+
+
+
+Alternatively, we could extract the TPM data (the assay named abundance
) from gene_summarized
.
# Let's look at the first few rows of the gene-level TPM
+head(assay(gene_summarized, "abundance"))
+
+
+ SRR585570 SRR585571 SRR585572 SRR585573 SRR585574 SRR585575
+ENSG00000000003 25.032603 18.896831 12.288181 26.911045 22.088410 17.168736
+ENSG00000000005 0.121595 0.000000 0.000000 0.000000 0.000000 0.000000
+ENSG00000000419 26.679297 20.771196 103.246348 69.495297 66.335181 77.471536
+ENSG00000000457 5.655858 2.921236 6.511128 5.107480 5.106009 3.845323
+ENSG00000000460 1.757408 2.933740 1.354462 2.195826 6.341131 16.792151
+ENSG00000000938 1.692637 2.807437 0.078625 0.000000 0.028502 0.000000
+ SRR585576 SRR585577
+ENSG00000000003 17.974009 29.513266
+ENSG00000000005 0.000000 0.000000
+ENSG00000000419 44.036543 35.660794
+ENSG00000000457 4.452530 3.346821
+ENSG00000000460 5.089307 7.927036
+ENSG00000000938 0.000000 0.000000
+
+
+
+We could use readr::write_tsv
to save counts_mat
only, but gene_summarized
holds a lot of information beyond the counts, so we may want to save the whole object to an RDS file.
# Write `gene_summarized` to RDS object
+readr::write_rds(gene_summarized, file = txi_out_file)
+
+
+
+We’ll import this with the DESeq2
package in the next notebook.
Record session info for reproducibility & provenance purposes.
+sessionInfo()
+
+
+R version 4.0.3 (2020-10-10)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 20.04 LTS
+
+Matrix products: default
+BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+attached base packages:
+[1] parallel stats4 stats graphics grDevices utils datasets
+[8] methods base
+
+other attached packages:
+ [1] ensembldb_2.14.0 AnnotationFilter_1.14.0
+ [3] GenomicFeatures_1.42.1 AnnotationDbi_1.52.0
+ [5] SummarizedExperiment_1.20.0 Biobase_2.50.0
+ [7] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
+ [9] IRanges_2.24.1 S4Vectors_0.28.1
+[11] BiocGenerics_0.36.0 MatrixGenerics_1.2.0
+[13] matrixStats_0.57.0 tximeta_1.8.4
+[15] magrittr_2.0.1 optparse_1.6.6
+
+loaded via a namespace (and not attached):
+ [1] httr_1.4.2 jsonlite_1.7.2
+ [3] bit64_4.0.5 AnnotationHub_2.22.0
+ [5] shiny_1.6.0 assertthat_0.2.1
+ [7] interactiveDisplayBase_1.28.0 askpass_1.1
+ [9] BiocManager_1.30.10 BiocFileCache_1.14.0
+[11] blob_1.2.1 GenomeInfoDbData_1.2.4
+[13] Rsamtools_2.6.0 yaml_2.2.1
+[15] progress_1.2.2 BiocVersion_3.12.0
+[17] lattice_0.20-41 pillar_1.4.7
+[19] RSQLite_2.2.3 glue_1.4.2
+[21] digest_0.6.27 promises_1.1.1
+[23] XVector_0.30.0 Matrix_1.3-2
+[25] htmltools_0.5.1.1 httpuv_1.5.5
+[27] XML_3.99-0.5 pkgconfig_2.0.3
+[29] biomaRt_2.46.3 zlibbioc_1.36.0
+[31] purrr_0.3.4 xtable_1.8-4
+[33] getopt_1.20.3 later_1.1.0.1
+[35] BiocParallel_1.24.1 tibble_3.0.5
+[37] openssl_1.4.3 generics_0.1.0
+[39] ellipsis_0.3.1 withr_2.4.0
+[41] cachem_1.0.1 lazyeval_0.2.2
+[43] cli_2.2.0 crayon_1.3.4
+[45] mime_0.9 ps_1.5.0
+[47] memoise_1.1.0 evaluate_0.14
+[49] fansi_0.4.2 xml2_1.3.2
+[51] tools_4.0.3 prettyunits_1.1.1
+[53] hms_1.0.0 lifecycle_0.2.0
+[55] stringr_1.4.0 DelayedArray_0.16.2
+[57] Biostrings_2.58.0 compiler_4.0.3
+[59] rlang_0.4.10 grid_4.0.3
+[61] RCurl_1.98-1.2 rstudioapi_0.13
+[63] tximport_1.18.0 rappdirs_0.3.1
+[65] bitops_1.0-6 rmarkdown_2.6
+[67] DBI_1.1.1 curl_4.3
+[69] R6_2.5.0 GenomicAlignments_1.26.0
+[71] knitr_1.30 dplyr_1.0.3
+[73] rtracklayer_1.50.0 fastmap_1.1.0
+[75] bit_4.0.4 ProtGenerics_1.22.0
+[77] readr_1.4.0 stringi_1.5.3
+[79] Rcpp_1.0.6 vctrs_0.3.6
+[81] dbplyr_2.0.0 tidyselect_1.1.0
+[83] xfun_0.20
+
+
+This notebook will demonstrate how to:
+DESeq2
data set from a SummarizedExperiment
In this notebook, we’ll import the gastric cancer data and do some exploratory analyses and visual inspection. We’ll use the DESeq2
package for this.
DESeq2
also has an excellent vignette from Love, Anders, and Huber from which this is adapted (see also: Love, Anders, and Huber. Genome Biology. 2014.).
# Load the DESeq2 library
+library(DESeq2)
+
+
+Loading required package: S4Vectors
+
+
+Loading required package: stats4
+
+
+Loading required package: BiocGenerics
+
+
+Loading required package: parallel
+
+
+
+Attaching package: 'BiocGenerics'
+
+
+The following objects are masked from 'package:parallel':
+
+ clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+ clusterExport, clusterMap, parApply, parCapply, parLapply,
+ parLapplyLB, parRapply, parSapply, parSapplyLB
+
+
+The following objects are masked from 'package:stats':
+
+ IQR, mad, sd, var, xtabs
+
+
+The following objects are masked from 'package:base':
+
+ anyDuplicated, append, as.data.frame, basename, cbind, colnames,
+ dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
+ grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
+ order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+ rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
+ union, unique, unsplit, which.max, which.min
+
+
+
+Attaching package: 'S4Vectors'
+
+
+The following object is masked from 'package:base':
+
+ expand.grid
+
+
+Loading required package: IRanges
+
+
+Loading required package: GenomicRanges
+
+
+Loading required package: GenomeInfoDb
+
+
+Loading required package: SummarizedExperiment
+
+
+Loading required package: MatrixGenerics
+
+
+Loading required package: matrixStats
+
+
+
+Attaching package: 'MatrixGenerics'
+
+
+The following objects are masked from 'package:matrixStats':
+
+ colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+ colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+ colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+ colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+ colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+ colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+ colWeightedMeans, colWeightedMedians, colWeightedSds,
+ colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+ rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+ rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+ rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+ rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+ rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+ rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+ rowWeightedSds, rowWeightedVars
+
+
+Loading required package: Biobase
+
+
+Welcome to Bioconductor
+
+ Vignettes contain introductory material; view with
+ 'browseVignettes()'. To cite Bioconductor, see
+ 'citation("Biobase")', and for packages 'citation("pkgname")'.
+
+
+
+Attaching package: 'Biobase'
+
+
+The following object is masked from 'package:MatrixGenerics':
+
+ rowMedians
+
+
+The following objects are masked from 'package:matrixStats':
+
+ anyMissing, rowMedians
+
+
+
+
+
+
+# magrittr for the pipe
+library(magrittr)
+
+
+
+# Main data directory
+data_dir <- file.path("data", "gastric-cancer")
+
+# directory with the tximeta processed data
+txi_dir <- file.path(data_dir, "txi")
+txi_file <- file.path(txi_dir, "gastric-cancer_tximeta.RDS")
+
+
+
+
+
+
+# sample metadata file
+meta_file <- file.path(data_dir, "gastric-cancer_metadata.tsv")
+
+
+
+We’ll create a directory to hold our plots.
+# Create a plots directory if it does not exist yet
+plots_dir <- file.path("plots", "gastric-cancer")
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir, recursive = TRUE)
+}
+
+
+
+Output
+# We will save a PDF copy of the PCA plot to the plots directory
+# and name the file "gastric-cancer_PC_scatter.pdf"
+pca_plot_file <- file.path(plots_dir, "gastric-cancer_PC_scatter.pdf")
+
+
+
+First, let’s read in the data we processed with tximeta.
# Read in the RDS file we created in the last notebook
+gene_summarized <- readr::read_rds(txi_file)
+
+
+
+We use the tissue of origin in the design formula because that will allow us to model this variable of interest.
+ddset <- DESeqDataSet(gene_summarized,
+ design = ~ tissue)
+
+
+using counts and average transcript lengths from tximeta
+
+
+Warning in DESeqDataSet(gene_summarized, design = ~tissue): some variables in
+design formula are characters, converting to factors
+
+
+
+Raw count data is not usually suitable for the algorithms we use for dimensionality reduction, clustering, or heatmaps. To improve this, we will transform the count data to create an expression measure that is better suited for these analyses. The core transformation will map the expression to a log2 scale, while accounting for some of the expected variation among samples and genes.
+Since different samples are usually sequenced to different depths, we want to transform our RNA-seq count data to make different samples more directly comparable. We also want to deal with the fact that genes with low counts are also likely to have higher variance (on the log2 scale), as that could bias our clustering. To handle both of these considerations, we can calculate a Variance Stabilizing Transformation of the count data, and work with that transformed data for our analysis.
+See this section of the DESeq2
vignette for more on this topic.
vst_data <- vst(ddset)
+
+
+using 'avgTxLength' from assays(dds), correcting for library size
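To see why a transformation is worth the trouble, here is a minimal base R sketch with simulated counts. It assumes Poisson-distributed counts with made-up means of 5 and 500 (real RNA-seq counts are typically more variable than this), and all object names are hypothetical. It is not what `vst()` does internally; it only illustrates that low counts are noisier than high counts once we move to the log2 scale, which is the pattern a variance stabilizing transformation is meant to even out.

```r
# Simulate counts for one hypothetical low-expression gene and one
# hypothetical high-expression gene across many samples.
# Assumption: Poisson counts, which is a simplification of real RNA-seq data.
set.seed(2021)
n_samples <- 1000
low_counts <- rpois(n_samples, lambda = 5)
high_counts <- rpois(n_samples, lambda = 500)

# On the raw scale, the high-count gene has the larger standard deviation
sd_raw <- c(low = sd(low_counts), high = sd(high_counts))

# On the log2 scale (with a pseudocount of 1), the pattern flips:
# the low-count gene is now the more variable one
sd_log2 <- c(low = sd(log2(low_counts + 1)),
             high = sd(log2(high_counts + 1)))

print(round(sd_raw, 2))
print(round(sd_log2, 2))
```

If low-count genes were left this noisy, they could dominate distance-based methods like PCA or clustering for reasons that have little to do with biology.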
+
+
+
+Principal component analysis (PCA) is a dimensionality reduction technique that allows us to identify the largest components of variation in a complex dataset. Our expression data can be thought of as mapping each sample in a multidimensional space defined by the expression level of each gene. The expression levels of many of those genes are correlated, so we can often get a better, simpler picture of the data by combining the information from those correlated genes.
+PCA rotates and transforms this space so that each axis is now a combination of multiple correlated genes, ordered so the first axes capture the most variation from the data. These new axes are the “principal components.” If we look at the first few components, we can often get a nice overview of relationships among the samples in the data.
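The rotation described above can be sketched with base R's `prcomp()` on a small simulated matrix. Everything here is an assumption made for illustration: the 6 samples, the 50 genes, the group labels, and the shift we add to the first 10 genes are all made up. The last step shows one way to compute the percent of variance captured by each principal component.

```r
# Simulated expression matrix: 6 hypothetical samples (rows) x 50 genes (columns)
set.seed(42)
expr <- matrix(rnorm(6 * 50),
               nrow = 6,
               dimnames = list(paste0("sample", 1:6),
                               paste0("gene", 1:50)))

# Make the first 3 samples differ from the rest in 10 correlated genes
expr[1:3, 1:10] <- expr[1:3, 1:10] + 4

# prcomp() expects samples in rows and centers each gene by default
pca <- prcomp(expr, center = TRUE, scale. = FALSE)

# Percent of the total variance captured by each principal component
percent_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2))

# Scores for the first two components, ready for a scatter plot
pca_scores <- data.frame(pca$x[, 1:2],
                         group = rep(c("A", "B"), each = 3))

print(percent_var)
print(pca_scores)
```

Because the 10 shifted genes all move together, PC1 should pick up the difference between the two made-up groups, which is the kind of structure we hope to see (or rule out) in real data.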
+The plotPCA() function we will use from the DESeq2 package calculates and plots the first two principal components (PC1 and PC2). Visualizing PC1 and PC2 can give us insight into how different variables (e.g., tissue source) affect our dataset and help us spot any technical effects (more on that below).
# DESeq2 built in function is called plotPCA and we want to color points by
+# tissue
+plotPCA(vst_data, intgroup = "tissue")
+
+
+
+
+
+
+Save the most recent plot to file with ggsave
from ggplot2
# Save the PDF file
+ggplot2::ggsave(pca_plot_file, plot = ggplot2::last_plot())
+
+
+Saving 7 x 5 in image
+
+
+
+We don’t have batch information (i.e., when the samples were run) for this particular experiment, but let’s imagine that SRR585574 and SRR585576 were run separately from all other samples. We’ll add this as a new “toy” column in the sample data (colData).
# Extract colData
+sample_info <- colData(vst_data)
+
+# Print out preview
+sample_info
+
+
+DataFrame with 8 rows and 3 columns
+ names tissue title
+ <character> <factor> <character>
+SRR585570 SRR585570 gastric_normal Gastric normal (CGC-..
+SRR585571 SRR585571 gastric_normal Gastric normal (CGC-..
+SRR585572 SRR585572 primary_gastric_tumor Primary gastric tumo..
+SRR585573 SRR585573 primary_gastric_tumor Primary gastric tumo..
+SRR585574 SRR585574 primary_gastric_tumor Primary gastric tumo..
+SRR585575 SRR585575 gastric_cancer_cell_line SNU484
+SRR585576 SRR585576 gastric_cancer_cell_line SNU601
+SRR585577 SRR585577 gastric_cancer_cell_line SNU668
+
+
+
+Now we can add a new column with toy batch information and re-store the colData().
# Add batch information
+sample_info$batch <- c("batch1", "batch1", "batch1", "batch1", "batch2",
+ "batch1", "batch2", "batch1")
+
+
+
+If this batch information were real we would have included it with the sample metadata when we made the original SummarizedExperiment object with tximeta. We would then include it in the model stored in our DESeq2 object using the design argument (design = ~ tissue + batch) and we would re-run the DESeqDataSet() and vst() steps we did above. Here we will take a bit of a shortcut and add it directly to the colData() for our vst()-transformed data.
# Add colData() with batch info to vst_data
+colData(vst_data) <- sample_info
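To get a feel for what a design like `~ tissue + batch` means, here is a small base R sketch that builds the corresponding model matrix with `model.matrix()`. The tissue labels and toy batch assignments are copied from the sample table above; the object names `toy_samples` and `design_matrix` are hypothetical. This only illustrates how a formula is expanded into columns, and is not a substitute for re-running `DESeqDataSet()`.

```r
# Toy sample table using the tissue labels and made-up batches from above
toy_samples <- data.frame(
  tissue = c("gastric_normal", "gastric_normal",
             "primary_gastric_tumor", "primary_gastric_tumor",
             "primary_gastric_tumor", "gastric_cancer_cell_line",
             "gastric_cancer_cell_line", "gastric_cancer_cell_line"),
  batch = c("batch1", "batch1", "batch1", "batch1", "batch2",
            "batch1", "batch2", "batch1"),
  row.names = paste0("SRR58557", 0:7),
  stringsAsFactors = TRUE
)

# Expand the formula into one column per estimated coefficient
design_matrix <- model.matrix(~ tissue + batch, data = toy_samples)

# One intercept, two tissue contrasts, and one batch effect
print(colnames(design_matrix))

# The design is only usable if its columns are not redundant
print(qr(design_matrix)$rank == ncol(design_matrix))
```

If every sample of one tissue had been run in its own batch, the batch and tissue columns would be redundant and the two effects could not be separated.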
+
+
+
+
+
+
+# PCA plot - tissue *and* batch
+# We want plotPCA to return the data so we can have more control over the plot
+pca_data <- plotPCA(vst_data,
+ intgroup = c("tissue", "batch"),
+ returnData = TRUE)
+
+
+
+
+
+
+# Here we are setting up the percent variance that we are extracting from the `pca_data` object
+percent_var <- round(100 * attr(pca_data, "percentVar"))
+
+
+
+Let’s use ggplot to visualize the first two principal components.
+# Color points by "batch" and use shape to indicate the tissue of origin
+ggplot2::ggplot(pca_data, ggplot2::aes(PC1, PC2,
+ color = batch,
+ shape = tissue)) +
+ ggplot2::geom_point(size = 3) +
+ ggplot2::xlab(paste0("PC1: ", percent_var[1],"% variance")) +
+ ggplot2::ylab(paste0("PC2: ", percent_var[2],"% variance")) +
+ ggplot2::coord_fixed()
+
+
+
+
+
+
+Record session info for reproducibility & provenance purposes.
+sessionInfo()
+
+
+R version 4.0.3 (2020-10-10)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 20.04 LTS
+
+Matrix products: default
+BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+attached base packages:
+[1] parallel stats4 stats graphics grDevices utils datasets
+[8] methods base
+
+other attached packages:
+ [1] magrittr_2.0.1 DESeq2_1.30.1
+ [3] SummarizedExperiment_1.20.0 Biobase_2.50.0
+ [5] MatrixGenerics_1.2.0 matrixStats_0.57.0
+ [7] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
+ [9] IRanges_2.24.1 S4Vectors_0.28.1
+[11] BiocGenerics_0.36.0 optparse_1.6.6
+
+loaded via a namespace (and not attached):
+ [1] locfit_1.5-9.4 Rcpp_1.0.6 lattice_0.20-41
+ [4] assertthat_0.2.1 digest_0.6.27 R6_2.5.0
+ [7] RSQLite_2.2.3 evaluate_0.14 httr_1.4.2
+[10] ggplot2_3.3.3 pillar_1.4.7 zlibbioc_1.36.0
+[13] rlang_0.4.10 annotate_1.68.0 blob_1.2.1
+[16] Matrix_1.3-2 rmarkdown_2.6 labeling_0.4.2
+[19] splines_4.0.3 BiocParallel_1.24.1 readr_1.4.0
+[22] geneplotter_1.68.0 stringr_1.4.0 RCurl_1.98-1.2
+[25] bit_4.0.4 munsell_0.5.0 DelayedArray_0.16.2
+[28] compiler_4.0.3 xfun_0.20 pkgconfig_2.0.3
+[31] htmltools_0.5.1.1 tidyselect_1.1.0 tibble_3.0.5
+[34] GenomeInfoDbData_1.2.4 XML_3.99-0.5 crayon_1.3.4
+[37] dplyr_1.0.3 bitops_1.0-6 grid_4.0.3
+[40] jsonlite_1.7.2 xtable_1.8-4 gtable_0.3.0
+[43] lifecycle_0.2.0 DBI_1.1.1 scales_1.1.1
+[46] stringi_1.5.3 farver_2.0.3 XVector_0.30.0
+[49] genefilter_1.72.1 getopt_1.20.3 ellipsis_0.3.1
+[52] vctrs_0.3.6 generics_0.1.0 RColorBrewer_1.1-2
+[55] tools_4.0.3 bit64_4.0.5 glue_1.4.2
+[58] purrr_0.3.4 hms_1.0.0 survival_3.2-7
+[61] yaml_2.2.1 AnnotationDbi_1.52.0 colorspace_2.0-0
+[64] memoise_1.1.0 knitr_1.30
+
+
+In this section, we’ll be working with RNA-seq data from neuroblastoma (NB) cell lines from Harenza, et al. (2017).
+The course directors have already processed the raw data using salmon quant, and the quant.sf files for each sample can be found in data/NB-cell/salmon_quant/<SAMPLE>.
In the gastric cancer example, we imported Salmon-processed data with tximeta to then use with DESeq2. We will also use DESeq2 for these analyses, specifically for differential expression analysis.
In order to prepare the NB cell line data for differential expression analysis, we will modify the gastric cancer tximeta notebook (02-gastric_cancer_tximeta-live.Rmd) and save this new notebook as nb_cell_line_tximeta.Rmd:
To create a new notebook, select File > New File > R Notebook. The new notebook should appear in your Source Pane in RStudio. Save the new notebook, using Ctrl+S (Cmd+S on Mac) or File > Save, in the training-modules/RNA-seq directory with the name nb_cell_line_tximeta.Rmd. You can add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I (Cmd+Option+I on Mac).
Alter the code from 02-gastric_cancer_tximeta-live.Rmd to use the NB cell line data. The quant.sf files for each sample can be found in data/NB-cell/salmon_quant/<SAMPLE>.
Save the tximeta output as data/NB-cell/txi/NB-cell_tximeta.RDS. Note that data/NB-cell/txi/ is a new directory.
This notebook will demonstrate how to:
+DESeq2
EnhancedVolcano
 package
+In this notebook, we’ll perform an analysis to identify the genes that are differentially expressed in MYCN amplified vs. nonamplified neuroblastoma cell lines.
+These RNA-seq data are from Harenza, et al. (2017).
+More information about DESeq2 can be found in the excellent vignette from Love, Anders, and Huber from which this is adapted (see also: Love, et al. (2014)).
+DESeq2 takes unnormalized counts or estimated counts and does the following:
+# magrittr pipe
+library(magrittr)
+
+# Load the DESeq2 library
+library(DESeq2)
+
+
+
+
+# We will be making fancy volcano plots
+library(EnhancedVolcano)
+
+
+Loading required package: ggplot2
+
+
+Loading required package: ggrepel
+
+
+Registered S3 methods overwritten by 'ggalt':
+ method from
+ grid.draw.absoluteGrob ggplot2
+ grobHeight.absoluteGrob ggplot2
+ grobWidth.absoluteGrob ggplot2
+ grobX.absoluteGrob ggplot2
+ grobY.absoluteGrob ggplot2
+
+
+
+Input
+# directory with the tximeta processed data
+txi_dir <- file.path("data", "NB-cell", "txi")
+txi_file <- file.path(txi_dir, "NB-cell_tximeta.RDS")
+
+
+
+Output
+We’ll create a results directory to hold our results.
+# Create a results directory if it doesn't already exist
+results_dir <- file.path("results", "NB-cell")
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir, recursive = TRUE)
+}
+
+
+
+We will also need a directory to store our plots.
+# Create a plots directory if it doesn't already exist
+plots_dir <- file.path("plots", "NB-cell")
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir, recursive = TRUE)
+}
+
+
+
+
+
+
+# RDS for the output of DESeq analysis
+deseq_file <- file.path(results_dir,
+ "NB-cell_DESeq_amplified_v_nonamplified.RDS")
+
+# DESeq2 results table
+deseq_df_file <- file.path(results_dir,
+ "NB-cell_DESeq_amplified_v_nonamplified_results.tsv")
+
+# PNG of the volcano plot
+volcano_file <- file.path(plots_dir, "NB-cell_volcano.png")
+
+
+
+First, let’s read in the data we processed with tximeta.
# Read in the RDS file we created in the last notebook
+gene_summarized <- readr::read_rds(txi_file)
+
+
+
+We’re most interested in MYCN amplification, which we had stored in the status column of the sample metadata of gene_summarized. While the sample metadata is stored internally in the colData slot, the SummarizedExperiment object makes it easy for us to access it as if it were just a column of a data frame, using the familiar $ syntax.
gene_summarized$status
+
+
+ [1] "Amplified" "Amplified" "Amplified" "Amplified" "Amplified"
+ [6] "Amplified" "Amplified" "Amplified" "Nonamplified" "Nonamplified"
+[11] "Amplified" "Amplified" "Amplified" "Nonamplified" "Amplified"
+[16] "Amplified" "Amplified" "Amplified" "Nonamplified" "Amplified"
+[21] "Amplified" "Amplified" "Nonamplified" "Amplified" "Nonamplified"
+[26] "Amplified" "Amplified" "Amplified" "Nonamplified" "Nonamplified"
+[31] "Nonamplified" "Amplified" "Amplified" "Amplified" "Nonamplified"
+[36] "Nonamplified" "Amplified" "Amplified" "Nonamplified"
+
+
+
+
+
+This is stored as a character type, but to give a bit more information to DESeq, we will convert this to a factor.
gene_summarized$status <- as.factor(gene_summarized$status)
+
+
+
+We’ll want to use the “Nonamplified” samples as our reference. Let’s look at the levels of status.
levels(gene_summarized$status)
+
+
+[1] "Amplified" "Nonamplified"
+
+
+
+
+
+We can see that these are in alphabetical order, so “Amplified” samples would be the reference. We can use the relevel() function to remedy this.
gene_summarized$status <- relevel(gene_summarized$status, ref = "Nonamplified")
+
+
+
+
+
+
+# Check what the levels are now
+levels(gene_summarized$status)
+
+
+[1] "Nonamplified" "Amplified"
+
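The same steps can be tried on a plain character vector, without any RNA-seq objects involved. This is a minimal base R sketch; the four values in `status` are made up for illustration and the variable names are hypothetical.

```r
# A small made-up vector of MYCN status labels
status <- c("Amplified", "Nonamplified", "Amplified", "Amplified")

# Converting to a factor sorts the levels alphabetically
status_factor <- as.factor(status)
print(levels(status_factor))

# relevel() moves the chosen level to the first (reference) position
status_factor <- relevel(status_factor, ref = "Nonamplified")
print(levels(status_factor))

# The values themselves are unchanged; only the level order is different
print(as.character(status_factor))
```

The reference level matters because fold changes are reported relative to it, so with "Nonamplified" first, a positive log2 fold change means higher expression in the amplified cell lines.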
+
+
+
+
+# Create a DESeq2 dataset from `gene_summarized`
+# remember that `status` is the variable of interest here
+ddset <- DESeqDataSet(gene_summarized,
+ design = ~ status)
+
+
+using counts and average transcript lengths from tximeta
+
+
+
+Genes that have very low counts are not likely to yield reliable differential expression results, so we will do some light pre-filtering. We will keep only genes with total counts of at least 10 across all samples.
+genes_to_keep <- rowSums(counts(ddset)) >= 10
+ddset <- ddset[genes_to_keep, ]
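The filtering logic is easy to check on a tiny matrix. This sketch uses three made-up genes and four made-up samples in a plain base R matrix (not a DESeq2 object), with hypothetical names throughout.

```r
# A toy count matrix: genes in rows, samples in columns
toy_counts <- matrix(
  c(0,   0,   1,   2,    # geneA: total of 3
    5,   3,   1,   1,    # geneB: total of exactly 10
    120, 98,  143, 110), # geneC: total of 471
  nrow = 3,
  byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  paste0("sample", 1:4))
)

# TRUE for genes with a total count of at least 10 across all samples
genes_to_keep <- rowSums(toy_counts) >= 10
print(genes_to_keep)

# Keep only those rows; drop = FALSE keeps the result a matrix
filtered_counts <- toy_counts[genes_to_keep, , drop = FALSE]
print(filtered_counts)
```

Note that the threshold applies to the sum across all samples, so a gene with a handful of reads in each sample can still pass.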
+
+
+
+DESeq() function
+We’ll now use the wrapper function DESeq() to perform our differential expression analysis. As mentioned earlier, this performs a number of steps, including an outlier removal procedure. For this particular dataset, there is a pretty large number of outliers, which can be a bit of a red flag, but we will proceed for now.
deseq_object <- DESeq(ddset)
+
+
+estimating size factors
+
+
+using 'avgTxLength' from assays(dds), correcting for library size
+
+
+estimating dispersions
+
+
+gene-wise dispersion estimates
+
+
+mean-dispersion relationship
+
+
+final dispersion estimates
+
+
+fitting model and testing
+
+
+-- replacing outliers and refitting for 2895 genes
+-- DESeq argument 'minReplicatesForReplace' = 7
+-- original counts are preserved in counts(dds)
+
+
+estimating dispersions
+
+
+fitting model and testing
+
+
+
+Let’s save this to our results file.
+# Save the results as an RDS
+readr::write_rds(deseq_object, file = deseq_file)
+
+
+
+Now we will have a look at the results table.
+deseq_results <- results(deseq_object)
+deseq_results
+
+
+log2 fold change (MLE): status Amplified vs Nonamplified
+Wald test p-value: status Amplified vs Nonamplified
+DataFrame with 24912 rows and 6 columns
+ baseMean log2FoldChange lfcSE stat pvalue
+ <numeric> <numeric> <numeric> <numeric> <numeric>
+ENSG00000000003 1148.278399 0.921536 0.424309 2.171849 0.029867
+ENSG00000000005 0.627406 1.672285 2.247996 0.743900 0.456937
+ENSG00000000419 1680.109464 -0.176649 0.215485 -0.819775 0.412344
+ENSG00000000457 962.907631 -0.257752 0.166387 -1.549110 0.121355
+ENSG00000000460 1595.937423 -0.133821 0.197230 -0.678504 0.497452
+... ... ... ... ... ...
+ENSG00000285976 1874.02776 0.0285397 0.183730 0.155335 0.876557
+ENSG00000285978 1.40743 -1.1452465 0.874165 -1.310103 0.190161
+ENSG00000285982 90.93868 0.1131803 0.493040 0.229556 0.818437
+ENSG00000285990 13.77859 0.3673226 0.456293 0.805015 0.420811
+ENSG00000285991 17.07491 0.0709553 0.333191 0.212957 0.831361
+ padj
+ <numeric>
+ENSG00000000003 0.133479
+ENSG00000000005 NA
+ENSG00000000419 0.656626
+ENSG00000000457 0.326981
+ENSG00000000460 0.721065
+... ...
+ENSG00000285976 0.946696
+ENSG00000285978 0.427379
+ENSG00000285982 0.918545
+ENSG00000285990 0.662821
+ENSG00000285991 0.926078
+
+
+
+How many genes were differentially expressed (FDR < 0.05)?
+summary(deseq_results, alpha = 0.05)
+
+
+
+out of 24799 with nonzero total read count
+adjusted p-value < 0.05
+LFC > 0 (up) : 1071, 4.3%
+LFC < 0 (down) : 1798, 7.3%
+outliers [1] : 0, 0%
+low counts [2] : 3478, 14%
+(mean count < 1)
+[1] see 'cooksCutoff' argument of ?results
+[2] see 'independentFiltering' argument of ?results
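The `padj` column holds p-values adjusted for multiple testing; to our knowledge, `DESeq2` uses the Benjamini-Hochberg (BH) procedure unless told otherwise. Here is a base R sketch of that adjustment using `p.adjust()` on five made-up p-values (the numbers and variable names are hypothetical, chosen only to show the effect).

```r
# Five made-up raw p-values
pvalues <- c(0.001, 0.008, 0.02, 0.045, 0.5)

# Benjamini-Hochberg adjustment, which controls the false discovery rate
padj <- p.adjust(pvalues, method = "BH")
print(round(padj, 4))

# How many tests pass a 0.05 cutoff before and after adjustment?
n_raw <- sum(pvalues < 0.05)
n_adjusted <- sum(padj < 0.05)
print(c(raw = n_raw, adjusted = n_adjusted))
```

In this toy case, one test that looks significant on its raw p-value no longer passes after adjustment. With tens of thousands of genes, the difference between the two counts can be much larger.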
+
+
+
+The estimates of log2 fold change calculated by DESeq() are not corrected for expression level. This means that when counts are small, we are likely to end up with some large fold change values that overestimate the true extent of the change between conditions.
We can correct this by applying a “shrinkage” procedure, which will adjust large values with small counts downward, while preserving values with larger counts, which are likely to be more accurate.
+To do this, we will use the lfcShrink() function, but first we need to know the name and/or position of the “coefficient” that was calculated by DESeq(), which we can find with the resultsNames() function.
# get the deseq coefficient names:
+resultsNames(deseq_object)
+
+
+[1] "Intercept" "status_Amplified_vs_Nonamplified"
+
+
+
+
+
+We are interested in the status coefficient, which is in position 2.
There are a few options for the shrinkage estimation. The default is apeglm (Zhu et al. 2018), but we have found that this can be sensitive to extreme outliers, which are definitely a factor in this data set. So for this data set we will be using ashr (Stephens 2017).
# calculate shrunken log2 fold change estimates
+deseq_shrunken <- lfcShrink(deseq_object,
+ # the coefficient we want to reestimate
+ coef = 2,
+ # We will use `ashr` for estimation
+ type = "ashr"
+ )
+
+
+using 'ashr' for LFC shrinkage. If used in published research, please cite:
+ Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
+ https://doi.org/10.1093/biostatistics/kxw041
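Before comparing the two sets of estimates, it may help to see the general idea behind shrinkage in a toy calculation. This sketch is **not** the `ashr` algorithm. It assumes a much simpler model in which true log2 fold changes follow a normal distribution centered at zero with a standard deviation of 1 (a made-up value, `prior_sd`), so that each estimate is pulled toward zero by an amount that depends on its standard error. The two estimates and standard errors are taken from the first two rows of the results table above.

```r
# Unshrunken estimates and standard errors from the results table above
lfc <- c(ENSG00000000003 = 0.921536, ENSG00000000005 = 1.672285)
lfc_se <- c(ENSG00000000003 = 0.424309, ENSG00000000005 = 2.247996)

# Assumption for illustration only: true effects ~ Normal(0, prior_sd^2)
prior_sd <- 1

# Weight given to the observed estimate; noisier estimates get less weight
shrink_weight <- prior_sd^2 / (prior_sd^2 + lfc_se^2)

# Toy shrunken estimates, pulled toward zero
lfc_shrunken_toy <- lfc * shrink_weight

print(round(shrink_weight, 3))
print(round(lfc_shrunken_toy, 3))
```

The low-count gene (ENSG00000000005, base mean below 1) starts with the larger fold change but a very large standard error, so it is pulled much closer to zero than the well-measured gene. The real methods estimate the distribution of effects from the data rather than assuming it.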
+
+
+
+Let’s compare the log2 fold change estimates from the two results tables by creating a plot.
+First we will combine the results into a new data frame.
+comparison_df <- data.frame(
+ lfc_original = deseq_results$log2FoldChange,
+ lfc_shrunken = deseq_shrunken$log2FoldChange,
+ logmean = log10(deseq_results$baseMean)
+ )
+
+
+
+Now we can plot the original and shrunken log2 fold change values to see what happened after shrinkage.
+ggplot(comparison_df,
+ aes(x = lfc_original,
+ y = lfc_shrunken,
+ color = logmean)) +
+ geom_point(alpha = 0.1) +
+ theme_bw() +
+ scale_color_viridis_c() +
+ coord_cartesian(xlim = c(-10,10), ylim = c(-10,10)) # zoom in on the middle
+
+
+
+
+
+
+We will now do a bit of manipulation to store the results in a data frame and add the gene symbols.
+# this is of class DESeqResults -- we want a data frame
+deseq_df <- deseq_shrunken %>%
+ # convert to a data frame
+ as.data.frame() %>%
+ # the gene ids were stored as row names -- let's make them a column for easy display
+ tibble::rownames_to_column(var = "gene_id") %>%
+ # add on the gene symbols from the original deseq object
+ dplyr::mutate(gene_symbol = rowData(deseq_object)$gene_name)
+
+
+
+Let’s print out the results table, sorted by log2 fold change. The highest values should be genes that are more highly expressed in the MYCN amplified cell lines.
+# Print the table sorted by log2FoldChange
+deseq_df %>%
+ dplyr::arrange(dplyr::desc(log2FoldChange))
+
+Now let’s write the full results table to a file.
+readr::write_tsv(deseq_df, file = deseq_df_file)
+
+
+
+With these shrunken effect sizes, we will draw a volcano plot, using the EnhancedVolcano package to make it a bit easier. This package automatically color codes the points by cutoffs for both significance and fold change and labels many of the significant genes (subject to spacing). EnhancedVolcano has many, many options, which is a good thing if you don’t like all of its default settings. Even better, it outputs a ggplot2 object, so if we want to customize it further, we can do that with the same ggplot2 commands we have used before.
EnhancedVolcano(deseq_df,
+ x = 'log2FoldChange', # fold change statistic to plot
+ y = 'pvalue', # significance values
+ lab = deseq_df$gene_symbol, # labels for points
+ pCutoff = 1e-05, # The p value cutoff we will use (default)
+ FCcutoff = 1, # The fold change cutoff (default)
+ title = NULL, # no title
+ subtitle = NULL, # or subtitle
+ caption = NULL, # or caption
+ labSize = 3 # smaller labels
+ ) +
+ # change the overall theme
+ theme_classic() +
+ # move the legend to the bottom
+ theme(legend.position = "bottom")
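The color coding in the plot above comes from comparing each gene to the two cutoffs. The sketch below reproduces that kind of grouping by hand in base R, under the assumption that a gene passes when its absolute log2 fold change is greater than `FCcutoff` and its p-value is less than `pCutoff`. The four genes, their values, and the category labels are made up, and this is not the package's own code.

```r
# Four made-up genes covering each combination of the two cutoffs
toy_results <- data.frame(
  gene_symbol = c("geneA", "geneB", "geneC", "geneD"),
  log2FoldChange = c(2.5, 0.3, -1.8, 0.1),
  pvalue = c(1e-10, 1e-8, 1e-3, 0.2)
)

# The same cutoff values used in the plot above
p_cutoff <- 1e-05
fc_cutoff <- 1

passes_fc <- abs(toy_results$log2FoldChange) > fc_cutoff
passes_p <- toy_results$pvalue < p_cutoff

# Assign each gene to one of four groups
toy_results$category <- ifelse(
  passes_fc & passes_p, "significant and large change",
  ifelse(passes_p, "significant only",
         ifelse(passes_fc, "large change only", "neither"))
)

# The y-axis of a volcano plot is -log10 of the p-value
toy_results$neg_log10_p <- -log10(toy_results$pvalue)

print(toy_results)
```

Genes in the first group are the ones that end up in the upper left and upper right corners of the volcano plot.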
+
+
+
+
+
+
+We will save this plot to a file as well:
+ggsave(volcano_file, plot = last_plot())
+
+
+Saving 7 x 5 in image
+
+
+
+Record session info for reproducibility & provenance purposes.
+sessionInfo()
+
+
+R version 4.0.3 (2020-10-10)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 20.04 LTS
+
+Matrix products: default
+BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+attached base packages:
+[1] parallel stats4 stats graphics grDevices utils datasets
+[8] methods base
+
+other attached packages:
+ [1] EnhancedVolcano_1.8.0 ggrepel_0.9.1
+ [3] ggplot2_3.3.3 DESeq2_1.30.1
+ [5] SummarizedExperiment_1.20.0 Biobase_2.50.0
+ [7] MatrixGenerics_1.2.0 matrixStats_0.57.0
+ [9] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
+[11] IRanges_2.24.1 S4Vectors_0.28.1
+[13] BiocGenerics_0.36.0 magrittr_2.0.1
+[15] optparse_1.6.6
+
+loaded via a namespace (and not attached):
+ [1] bitops_1.0-6 bit64_4.0.5 ash_1.0-15
+ [4] RColorBrewer_1.1-2 httr_1.4.2 tools_4.0.3
+ [7] irlba_2.3.3 R6_2.5.0 KernSmooth_2.23-18
+[10] vipor_0.4.5 DBI_1.1.1 colorspace_2.0-0
+[13] withr_2.4.0 tidyselect_1.1.0 ggrastr_0.2.1
+[16] ggalt_0.4.0 bit_4.0.4 compiler_4.0.3
+[19] extrafontdb_1.0 DelayedArray_0.16.2 labeling_0.4.2
+[22] scales_1.1.1 SQUAREM_2021.1 proj4_1.0-10
+[25] readr_1.4.0 genefilter_1.72.1 mixsqp_0.3-43
+[28] stringr_1.4.0 digest_0.6.27 rmarkdown_2.6
+[31] XVector_0.30.0 pkgconfig_2.0.3 htmltools_0.5.1.1
+[34] extrafont_0.17 invgamma_1.1 maps_3.3.0
+[37] rlang_0.4.10 RSQLite_2.2.3 farver_2.0.3
+[40] generics_0.1.0 jsonlite_1.7.2 BiocParallel_1.24.1
+[43] dplyr_1.0.3 RCurl_1.98-1.2 GenomeInfoDbData_1.2.4
+[46] Matrix_1.3-2 Rcpp_1.0.6 ggbeeswarm_0.6.0
+[49] munsell_0.5.0 lifecycle_0.2.0 stringi_1.5.3
+[52] yaml_2.2.1 MASS_7.3-53 zlibbioc_1.36.0
+[55] grid_4.0.3 blob_1.2.1 crayon_1.3.4
+[58] lattice_0.20-41 splines_4.0.3 annotate_1.68.0
+[61] hms_1.0.0 locfit_1.5-9.4 knitr_1.30
+[64] pillar_1.4.7 geneplotter_1.68.0 XML_3.99-0.5
+[67] glue_1.4.2 evaluate_0.14 vctrs_0.3.6
+[70] Rttf2pt1_1.3.8 gtable_0.3.0 getopt_1.20.3
+[73] purrr_0.3.4 assertthat_0.2.1 ashr_2.2-47
+[76] xfun_0.20 xtable_1.8-4 viridisLite_0.3.0
+[79] survival_3.2-7 truncnorm_1.0-8 tibble_3.0.5
+[82] AnnotationDbi_1.52.0 beeswarm_0.2.3 memoise_1.1.0
+[85] ellipsis_0.3.1
+
+
+This notebook will demonstrate how to:
+DESeq2
ComplexHeatmap
 package
+In this notebook, we cluster RNA-seq data from the Open Pediatric Brain Tumor Atlas (OpenPBTA) project and create a heatmap. OpenPBTA is a collaborative project organized by the CCDL and the Center for Data-Driven Discovery in Biomedicine (D3b) at the Children’s Hospital of Philadelphia, conducted openly on GitHub.
+You can read more about the project here.
+We’ve downloaded some of the publicly available expression data from the project and selected a subset with the most common disease types for analysis here.
+We’ll use a package called ComplexHeatmap to make our heatmap. This package allows us to annotate our heatmap with sample information, and will also perform clustering as part of generating the heatmap. It is highly flexible and opinionated: the data structures we pass to ComplexHeatmap functions often need to be just right. See the ComplexHeatmap Complete Reference for more information.
# We will manipulate RNASeq data with DESeq2 at the start
+library(DESeq2)
+
+
+
+
+# Then we'll be doing a bit of data wrangling with the Tidyverse
+library(tidyverse)
+
+
+── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
+
+
+✔ ggplot2 3.3.3 ✔ purrr 0.3.4
+✔ tibble 3.0.5 ✔ dplyr 1.0.3
+✔ tidyr 1.1.2 ✔ stringr 1.4.0
+✔ readr 1.4.0 ✔ forcats 0.5.0
+
+
+── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+✖ dplyr::collapse() masks IRanges::collapse()
+✖ dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
+✖ dplyr::count() masks matrixStats::count()
+✖ dplyr::desc() masks IRanges::desc()
+✖ tidyr::expand() masks S4Vectors::expand()
+✖ dplyr::filter() masks stats::filter()
+✖ dplyr::first() masks S4Vectors::first()
+✖ dplyr::lag() masks stats::lag()
+✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
+✖ purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()
+✖ dplyr::rename() masks S4Vectors::rename()
+✖ dplyr::slice() masks IRanges::slice()
+
+
+# ComplexHeatmap is the package we'll use for making a heatmap
+# It will do the hierarchical clustering for us as well
+library(ComplexHeatmap)
+
+
+Loading required package: grid
+
+
+========================================
+ComplexHeatmap version 2.6.2
+Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
+Github page: https://github.com/jokergoo/ComplexHeatmap
+Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
+
+If you use it in published research, please cite:
+Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
+ genomic data. Bioinformatics 2016.
+
+This message can be suppressed by:
+ suppressPackageStartupMessages(library(ComplexHeatmap))
+========================================
+
+
+
+We have stored the data we’ll use in this notebook in data/open-pbta
.
data_dir <- file.path("data", "open-pbta")
+
+# We'll store the heatmap in plots/open-pbta - create directory if it doesn't exist yet
+plots_dir <- file.path("plots", "open-pbta")
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir, recursive = TRUE)
+}
+
+
+
+# The metadata describing the samples
+histologies_file <- file.path(data_dir, "pbta-histologies-subset.tsv")
+
+# The RNA-seq counts table
+rnaseq_file <- file.path(data_dir, "pbta-rsem-expected_count-subset.rds")
+
+
+
+heatmap_file <- file.path(plots_dir,
+ "common_histologies_high_variance_heatmap.png")
+
+
+
Let’s read in the metadata file and take a look at the data.
+histologies_df <- read_tsv(histologies_file)
+
+
+
+── Column specification ────────────────────────────────────────────────────────
+cols(
+ .default = col_character(),
+ OS_days = col_double(),
+ age_last_update_days = col_double()
+)
+ℹ Use `spec()` for the full column specifications.
+
+
+
+Use the chunk below to explore the metadata data frame.
+View(histologies_df)
+
+
+
+We’ll use the disease labels in a column called short_histology
when we label the heatmap. Let’s count how many samples are assigned each short_histology
label using the Tidyverse.
histology_count_df <- histologies_df %>%
+ # Count how many samples are in each short_histology and name the column
+ # with that number n
+ count(short_histology) %>%
+ # Sort from largest number of samples to smallest number of samples in a
+ # histology
+ arrange(desc(n))
+
+histology_count_df
+
+Read in the expression count matrix (stored as a data frame).
+# Read in and examine the RNA-seq data
+rnaseq_exp <- read_rds(rnaseq_file)
+
+
+
+The count data we have consists mostly of integers, but because of the estimation procedure that RSEM uses, some counts are fractional. DESeq2
, which we will be using below, expects integer counts, so we will round all of the count data. This is easiest if we first convert from a data frame to a matrix.
rnaseq_mat <- rnaseq_exp %>%
+ # move gene_id to the rownames
+ tibble::column_to_rownames("gene_id") %>%
+ # convert to a matrix and round
+ as.matrix() %>%
+ round()
+
+
+
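To make the rounding step concrete, here is a minimal sketch on a made-up table. The gene names, sample names, and values are invented for illustration, and it uses base R instead of `tibble::column_to_rownames()` so that it runs without any packages.

```r
# Toy data frame standing in for the RSEM expected counts (values are made up)
toy_counts_df <- data.frame(
  gene_id = c("geneA", "geneB", "geneC"),
  sample1 = c(10.4, 0, 250.5),
  sample2 = c(3.6, 1.2, 99),
  stringsAsFactors = FALSE
)

# Drop the gene_id column, convert to a numeric matrix,
# and use the gene ids as row names instead
toy_mat <- as.matrix(toy_counts_df[, -1])
rownames(toy_mat) <- toy_counts_df$gene_id

# Round every value to the nearest whole number
toy_mat <- round(toy_mat)
toy_mat

# Check that no fractional values remain
all(toy_mat == floor(toy_mat))
```

Keeping the gene identifiers as row names (not as a column) is what allows the whole table to become a single numeric matrix.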
+Raw counts are not usually suitable for the algorithms we use for clustering and heatmap display, so we will use the vst()
function from the DESeq2
package to transform our data.
Since we are starting from a matrix, not a SummarizedExperiment
as we did previously, we will need to provide the sample information ourselves. Just to be sure nothing is out of order, we will check that the identifiers for the sample information stored in histologies_df
match the column names of our matrix.
all.equal(histologies_df$Kids_First_Biospecimen_ID,
+ colnames(rnaseq_mat))
+
+
+[1] TRUE
+
+
+
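If a check like this ever fails, the metadata can be reordered to follow the matrix columns. The sketch below uses invented sample identifiers and a hypothetical `sample_id` column. Note that `all.equal()` returns a description of the differences, not `FALSE`, when its inputs differ, so `identical()` (or `isTRUE(all.equal(...))`) is the safer choice inside an `if` statement.

```r
# Hypothetical metadata whose rows are in a different order than the matrix columns
toy_metadata <- data.frame(
  sample_id = c("S3", "S1", "S2"),
  histology = c("HGAT", "LGAT", "Ependymoma"),
  stringsAsFactors = FALSE
)
toy_mat <- matrix(1:6, nrow = 2,
                  dimnames = list(c("geneA", "geneB"), c("S1", "S2", "S3")))

# identical() always returns a single TRUE or FALSE
identical(toy_metadata$sample_id, colnames(toy_mat))

# match() gives the metadata row that corresponds to each matrix column,
# which we use to reorder the metadata rows
toy_metadata <- toy_metadata[match(colnames(toy_mat), toy_metadata$sample_id), ]

# Now the identifiers line up
identical(toy_metadata$sample_id, colnames(toy_mat))
```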
+Now we can make our matrix into a DESeq2
dataset, adding on the sample information from histologies_df
. Unlike when we were performing differential expression analysis, we won’t provide an experimental design at this stage.
ddset <- DESeqDataSetFromMatrix(rnaseq_mat,
+ colData = histologies_df,
+ design = ~ 1) # don't store an experimental design
+
+
+converting counts to integer mode
+
+
+
+We will again remove low count genes, as they are not likely to be informative.
+genes_to_keep <- rowSums(counts(ddset)) >= 10
+ddset <- ddset[genes_to_keep, ]
+
+
+
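The same filter can be tried on a plain matrix to see what the logical vector looks like. This is only a sketch with made-up counts; the threshold of 10 is the one used above.

```r
# Toy count matrix: 3 genes (rows) by 3 samples (columns), values are made up
toy_counts <- matrix(
  c(0, 1, 2,
    5, 3, 4,
    100, 250, 75),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("S1", "S2", "S3"))
)

# One TRUE/FALSE per gene: is the total count across samples at least 10?
genes_to_keep <- rowSums(toy_counts) >= 10
genes_to_keep

# Keep only the rows marked TRUE; drop = FALSE keeps the result a matrix
# even if only one gene passes the filter
filtered_counts <- toy_counts[genes_to_keep, , drop = FALSE]
rownames(filtered_counts)
```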
+Now we can apply the variance stabilizing transformation, saving the results in a new object.
+# apply variance stabilizing transformation
+vst_data <- vst(ddset, blind = TRUE)
+
+
+
+This object stores information about the transformation that was applied, but for now, we will only need the matrix of transformed data, which we can extract with assay()
.
# extract transformed data
+expr_mat <- assay(vst_data)
+
+
+
+What are the dimensions of this transformed RNA-seq data matrix?
+dim(expr_mat)
+
+
+[1] 48663 607
+
+
+
+Almost 50k genes would be hard to visualize on a single heatmap! If we are making a heatmap to get an idea of the structure in our data, it can be helpful to subset to high variance genes. This is because the genes that don’t vary are not likely to contribute much to the overall patterns we are interested in.
+First, we’ll calculate the variance for each gene using rowVars()
from the matrixStats
package and then take the genes in the top 10%.
# Calculate variance from the expression data
+gene_variance <- matrixStats::rowVars(expr_mat)
+
+# Find the value that we'll use as a threshold to filter the top 10%
+variance_threshold <- quantile(gene_variance, 0.9)
+
+# Row indices of high variance genes
+high_variance_index <- which(gene_variance > variance_threshold)
+
+# What does a row index look like?
+head(high_variance_index)
+
+
+[1] 7 15 24 25 26 28
+
+
+
+
+
+
+# Get a matrix that is subset to just the high variance genes
+high_var_mat <- expr_mat[high_variance_index, ]
+
+
+
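For readers who want to experiment with the thresholding logic on something small, here is a sketch using simulated data. The matrix is random noise and the names are invented; `apply(..., 1, var)` is used in place of `matrixStats::rowVars()` only to keep the example free of package dependencies.

```r
# Simulated expression matrix: 20 genes by 10 samples (random values)
set.seed(2021)
toy_expr <- matrix(rnorm(200), nrow = 20,
                   dimnames = list(paste0("gene", 1:20), paste0("S", 1:10)))

# Variance of each row (gene)
gene_variance <- apply(toy_expr, 1, var)

# The 90th percentile of the variances is the cutoff for the top 10%
variance_threshold <- quantile(gene_variance, 0.9)

# Row indices of the genes above the cutoff
high_variance_index <- which(gene_variance > variance_threshold)

# Subset the matrix to those genes
high_var_mat <- toy_expr[high_variance_index, , drop = FALSE]
dim(high_var_mat)
```

Because `which()` is applied to a named vector, the indices it returns carry the gene names along with the row positions.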
+First, we’ll set up the metadata that we want to use to label samples in the heatmap. In ComplexHeatmap
terminology, this is called annotation, or HeatmapAnnotation
, specifically.
sample_annotation_df <- histologies_df %>%
+ # Select only the columns that we'll use
+ select(Kids_First_Biospecimen_ID,
+ short_histology,
+ composition,
+ tumor_descriptor)
+
+# Let's examine these columns
+sample_annotation_df
+
+ComplexHeatmap
is going to want the data frame we provide to have the sample identifiers as row names, so let’s set that up.
sample_annotation_df <- sample_annotation_df %>%
+ tibble::column_to_rownames("Kids_First_Biospecimen_ID")
+
+
+
+To specify the colors in our annotation bar, we need to create a list of named vectors. The names of the list need to exactly match the column names in sample_annotation_df
and the names in each vector need to exactly match the values in those columns.
# The Okabe Ito palette is recommended for those with color vision deficiencies
+histology_colors <- palette.colors(palette = "Okabe-Ito")[2:5]
+# `palette.colors()` returns a named vector, which can cause trouble
+histology_colors <- unname(histology_colors)
+
+# annotation color list for ComplexHeatmap
+sample_annotation_colors <- list(
+ short_histology = c(
+ "LGAT" = histology_colors[[1]],
+ "Ependymoma" = histology_colors[[2]],
+ "HGAT" = histology_colors[[3]],
+ "Medulloblastoma" = histology_colors[[4]]
+ ),
+ composition = c(
+ "Solid Tissue" = "#A0A0A0", # light grey
+ "Derived Cell Line" = "#000000" # black
+ ),
+ tumor_descriptor = c(
+ "Initial CNS Tumor" = "#3333FF",
+ "Progressive" = "#FFFF99",
+ "Recurrence" = "#CCCCFF",
+ "Second Malignancy" = "#000033",
+ "Unavailable" = "#FFFFFF" # white for missing data
+ )
+)
+
+
+
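Because the names have to match exactly, it can be worth checking for values that have no color assigned before drawing anything. The following is a sketch of one way to do that with base R; the toy data frame, the color list, and the deliberately missing entry are all invented for illustration.

```r
# Toy annotation data frame and color list (values are made up)
toy_annotation_df <- data.frame(
  short_histology = c("LGAT", "HGAT", "LGAT"),
  composition = c("Solid Tissue", "Solid Tissue", "Derived Cell Line"),
  stringsAsFactors = FALSE
)
toy_colors <- list(
  short_histology = c("LGAT" = "#E69F00", "HGAT" = "#009E73"),
  # "Derived Cell Line" is left out on purpose to show what the check reports
  composition = c("Solid Tissue" = "#A0A0A0")
)

# For each annotation column, find values that do not appear
# among the names of the matching color vector
missing_colors <- lapply(names(toy_annotation_df), function(column_name) {
  setdiff(unique(toy_annotation_df[[column_name]]),
          names(toy_colors[[column_name]]))
})
names(missing_colors) <- names(toy_annotation_df)
missing_colors
```

An empty character vector for a column means every value in that column has a color.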
+We need to create a special type of object with HeatmapAnnotation
using the annotation data frame and the list of color vectors. We will also make the annotation labels a bit nicer to look at than the raw column names.
column_annotation <- HeatmapAnnotation(
+ df = sample_annotation_df,
+ col = sample_annotation_colors,
+ annotation_label = c("Histology", "Composition", "Tumor descriptor")
+)
+
+
+
+We will z-score the expression values for display. This is sometimes called a standardized score. Some heatmap plotting packages will do this for you, but ComplexHeatmap
does not. It’s calculated for each value in a row (gene) by subtracting the gene’s mean and dividing by the gene’s standard deviation. This will result in every row having a mean of 0 and a standard deviation of 1.
zscores_mat <-
+ (high_var_mat - rowMeans(high_var_mat)) / matrixStats::rowSds(high_var_mat)
+
+
+
+Since all genes have the same z-score variance (by definition!), if you’re filtering to high variance values, it’s important to do that prior to standardizing.
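A small worked example can confirm both claims: each standardized row has mean 0 and standard deviation 1, and genes on very different scales become indistinguishable. The matrix here is invented, and `apply(..., 1, sd)` stands in for `matrixStats::rowSds()`. The subtraction works row by row because R recycles the vector of row means down each column, and the vector length equals the number of rows.

```r
# Toy matrix: geneB is geneA multiplied by 10 (values are made up)
toy_mat <- matrix(
  c(1, 2, 3, 4,
    10, 20, 30, 40,
    5, 5, 6, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("S", 1:4))
)

# Standard deviation of each row (gene)
row_sds <- apply(toy_mat, 1, sd)

# Subtract each gene's mean and divide by its standard deviation
zscores <- (toy_mat - rowMeans(toy_mat)) / row_sds

# Row means are 0 (up to floating point error) and row sds are 1
round(rowMeans(zscores), 10)
apply(zscores, 1, sd)

# geneA and geneB differ 10-fold in scale but have the same z-scores
all.equal(zscores["geneA", ], zscores["geneB", ])
```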
+Okay, now we’re ready to make a heatmap complete with annotation bars.
+Heatmap(zscores_mat,
+ # The distance metric used for clustering the rows
+ # This is different from the default (Euclidean)
+ clustering_distance_rows = "pearson",
+ # Linkage method for row clustering
+ # This is different from the default (complete)
+ clustering_method_rows = "average",
+ # Distance metric for columns
+ clustering_distance_columns = "pearson",
+ # Linkage for columns
+ clustering_method_columns = "average",
+ show_row_names = FALSE,
+ show_column_names = FALSE,
+ # Add annotation bars to the top of the heatmap
+ top_annotation = column_annotation,
+ # This will be used as the label for the color bar
+ # of the cells of the heatmap itself
+ name = "z-score")
+
+
+The automatically generated colors map from the minus and plus 99^th of
+the absolute values in the matrix. There are outliers in the matrix
+whose patterns might be hidden by this color mapping. You can manually
+set the color to `col` argument.
+
+Use `suppressMessages()` to turn off this message.
+
+
+`use_raster` is automatically set to TRUE for a matrix with more than
+2000 rows. You can control `use_raster` argument by explicitly setting
+TRUE/FALSE to it.
+
+Set `ht_opt$message = FALSE` to turn off this message.
+
+
+
+
+
+
+ComplexHeatmap
gives a couple of warnings here, but nothing to be concerned about. One notes that the color scale may be truncated relative to our data. The other warns that our very large heatmap is being drawn at a lower resolution to speed things up.
We have to create the heatmap twice – once for display in the notebook and once to save to a PNG file. For the second drawing we will deal with the raster warning and produce a higher resolution output.
+# Open PNG plot device
+png(filename = heatmap_file,
+ width = 11,
+ height = 7,
+ units = "in",
+ res = 300)
+# Heatmap, again!
+Heatmap(zscores_mat,
+ clustering_distance_rows = "pearson",
+ clustering_method_rows = "average",
+ clustering_distance_columns = "pearson",
+ clustering_method_columns = "average",
+ show_row_names = FALSE,
+ show_column_names = FALSE,
+ top_annotation = column_annotation,
+ name = "z-score",
+ use_raster = FALSE) # higher resolution for output (be careful with PDF output!)
+
+
+The automatically generated colors map from the minus and plus 99^th of
+the absolute values in the matrix. There are outliers in the matrix
+whose patterns might be hidden by this color mapping. You can manually
+set the color to `col` argument.
+
+Use `suppressMessages()` to turn off this message.
+
+
+# Shut down current graphics device
+dev.off()
+
+
+png
+ 2
+
+
+
+You probably noticed that there were some arguments that we defined that were related to the clustering of samples and genes when we built the heatmaps. The clustering determines the arrangement of genes and samples in the heatmap, making it so genes or samples with similar expression patterns are grouped. If we didn’t do that, we would have a heatmap that looked basically like random static!
+The kind of clustering we performed was hierarchical clustering. This is an agglomerative method of clustering - each sample starts in its own cluster and then at each step of the algorithm the two most similar clusters are joined until there’s only a single cluster left.
+The arguments we chose for clustering_distance_*
and clustering_method_*
told ComplexHeatmap
to use 1 minus the correlation between samples as the distance measure, and to average samples when they were merged into a group. This method is also known as UPGMA (unweighted pair group method with arithmetic mean).
The result of this clustering process is a tree (dendrogram), which is shown at the top for samples and to the left for genes. It is important to note that the linear arrangement of samples in the tree is somewhat arbitrary, so we have to be careful not to overinterpret it. We can rotate around any branch of the tree in our visualization with no change to the tree topology itself. In fact, the samples at opposite ends of the heatmap could be right next to one another with the right set of rotations!
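The clustering described above can be approximated outside of a heatmap with base R. This is a sketch on simulated data, not the code that `ComplexHeatmap` runs internally: it builds a distance of 1 minus the Pearson correlation between samples and clusters with average linkage via `hclust()`.

```r
# Simulated expression matrix: 10 genes by 6 samples (random values)
set.seed(2021)
toy_expr <- matrix(rnorm(60), nrow = 10,
                   dimnames = list(paste0("gene", 1:10), paste0("S", 1:6)))

# cor() computes correlations between columns, which are samples here;
# 1 - correlation turns similarity into a distance
sample_dist <- as.dist(1 - cor(toy_expr, method = "pearson"))

# Agglomerative clustering with average linkage (UPGMA)
sample_clust <- hclust(sample_dist, method = "average")

# The left-to-right order of samples in the resulting dendrogram
sample_clust$labels[sample_clust$order]
```

Calling `plot(sample_clust)` draws the dendrogram. The order printed above is only one of many equivalent arrangements, since any branch can be rotated without changing the tree.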
+You probably noticed that the hierarchical clustering we performed didn’t perfectly separate out the different cancer subtypes we were looking at.
+What would have happened if we had used PCA to visualize the relationships among samples? Would that have made it easier to separate disease types? Let’s try it!
+We will use the VST transformed data here too, and the plotPCA()
function from DESeq2
.
# Use plotPCA, but return the data for custom plotting
+pca_df <- plotPCA(vst_data,
+ ntop = 5000, # use the top 5000 genes by variance
+ intgroup = "short_histology",
+ returnData = TRUE)
+
+
+ggplot(pca_df, aes(PC1, PC2, color = short_histology)) +
+ geom_point() +
+ theme_bw() +
+ scale_color_manual(
+ values = c("LGAT" = histology_colors[[1]],
+ "Ependymoma" = histology_colors[[2]],
+ "HGAT" = histology_colors[[3]],
+ "Medulloblastoma" = histology_colors[[4]])
+ ) +
+ labs(color = "Histology")
+
+
+
+
+
+
+We can see that neither method perfectly separates the different disease types. With PCA, if we didn’t color the points, it might even be difficult to identify distinct clusters in the data at all by eye. (We could look at more than two PCs, which might help out, but we would need to use a different function for our calculation, since plotPCA()
only returns the first two.)
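One option for looking beyond the first two components is base R's `prcomp()`. The sketch below runs on simulated data with invented names; it illustrates the general approach and is not a drop-in replacement for the `plotPCA()` call above, which also selects the top genes by variance first.

```r
# Simulated expression matrix: 50 genes by 6 samples (random values)
set.seed(2021)
toy_expr <- matrix(rnorm(300), nrow = 50,
                   dimnames = list(paste0("gene", 1:50), paste0("S", 1:6)))

# prcomp() expects observations (samples) in rows, so we transpose
pca_res <- prcomp(t(toy_expr))

# Scores for the first four principal components, one row per sample
pc_scores <- as.data.frame(pca_res$x[, 1:4])
head(pc_scores)

# Percent of the total variance captured by each component
percent_var <- 100 * pca_res$sdev^2 / sum(pca_res$sdev^2)
round(percent_var, 1)
```

The `pc_scores` data frame can be joined to sample metadata and passed to `ggplot()` to plot any pair of components, for example PC3 against PC4.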
For a nice comparison of the relative advantages of these two methods (with a little mention of some further directions), we recommend the blog post by Soneson.
+sessionInfo()
+
+
+R version 4.0.3 (2020-10-10)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 20.04 LTS
+
+Matrix products: default
+BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+attached base packages:
+ [1] grid parallel stats4 stats graphics grDevices utils
+ [8] datasets methods base
+
+other attached packages:
+ [1] ComplexHeatmap_2.6.2 forcats_0.5.0
+ [3] stringr_1.4.0 dplyr_1.0.3
+ [5] purrr_0.3.4 readr_1.4.0
+ [7] tidyr_1.1.2 tibble_3.0.5
+ [9] ggplot2_3.3.3 tidyverse_1.3.0
+[11] DESeq2_1.30.1 SummarizedExperiment_1.20.0
+[13] Biobase_2.50.0 MatrixGenerics_1.2.0
+[15] matrixStats_0.57.0 GenomicRanges_1.42.0
+[17] GenomeInfoDb_1.26.2 IRanges_2.24.1
+[19] S4Vectors_0.28.1 BiocGenerics_0.36.0
+[21] optparse_1.6.6
+
+loaded via a namespace (and not attached):
+ [1] bitops_1.0-6 fs_1.5.0 lubridate_1.7.9.2
+ [4] bit64_4.0.5 RColorBrewer_1.1-2 httr_1.4.2
+ [7] tools_4.0.3 backports_1.2.1 R6_2.5.0
+[10] DBI_1.1.1 colorspace_2.0-0 GetoptLong_1.0.5
+[13] withr_2.4.0 tidyselect_1.1.0 bit_4.0.4
+[16] compiler_4.0.3 cli_2.2.0 rvest_0.3.6
+[19] Cairo_1.5-12.2 xml2_1.3.2 DelayedArray_0.16.2
+[22] labeling_0.4.2 scales_1.1.1 genefilter_1.72.1
+[25] digest_0.6.27 rmarkdown_2.6 XVector_0.30.0
+[28] pkgconfig_2.0.3 htmltools_0.5.1.1 dbplyr_2.0.0
+[31] GlobalOptions_0.1.2 rlang_0.4.10 readxl_1.3.1
+[34] rstudioapi_0.13 RSQLite_2.2.3 farver_2.0.3
+[37] shape_1.4.5 generics_0.1.0 jsonlite_1.7.2
+[40] BiocParallel_1.24.1 RCurl_1.98-1.2 magrittr_2.0.1
+[43] GenomeInfoDbData_1.2.4 Matrix_1.3-2 Rcpp_1.0.6
+[46] munsell_0.5.0 fansi_0.4.2 lifecycle_0.2.0
+[49] stringi_1.5.3 yaml_2.2.1 zlibbioc_1.36.0
+[52] blob_1.2.1 crayon_1.3.4 lattice_0.20-41
+[55] haven_2.3.1 splines_4.0.3 annotate_1.68.0
+[58] circlize_0.4.12 hms_1.0.0 magick_2.6.0
+[61] locfit_1.5-9.4 ps_1.5.0 knitr_1.30
+[64] pillar_1.4.7 rjson_0.2.20 geneplotter_1.68.0
+[67] reprex_0.3.0 XML_3.99-0.5 glue_1.4.2
+[70] evaluate_0.14 modelr_0.1.8 png_0.1-7
+[73] vctrs_0.3.6 cellranger_1.1.0 gtable_0.3.0
+[76] getopt_1.20.3 clue_0.3-58 assertthat_0.2.1
+[79] xfun_0.20 xtable_1.8-4 broom_0.7.3
+[82] survival_3.2-7 AnnotationDbi_1.52.0 memoise_1.1.0
+[85] cluster_2.1.0 ellipsis_0.3.1
+
+
+For example, we can do some simple multiplication like this. When you execute code within the notebook, the results appear beneath the code. Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
- +5 * 6
+
+[1] 30
+
Use the console to calculate other expressions. Standard order of operations applies (mostly), and you can use parentheses ()
as you might expect (but not brackets []
or braces {}
, which have special meanings). Note however, that you must always specify multiplication with *
; implicit multiplication such as 10(3 + 4)
or 10x
will not work and will generate an error, or worse.
10 * (3 + 4)^2
+
+[1] 490
+
To define a variable, we use the assignment operator which looks like an arrow: <-
, for example x <- 7
takes the value on the right-hand side of the operator and assigns it to the variable name on the left-hand side.
# Define a variable x to equal 7, and print out the value of x
x <- 7
# We can have R repeat back to us what `x` is by just using `x`
x
+
+[1] 7
+
Some features of variables, considering the example x <- 7
: Every variable has a name, a value, and a type. This variable’s name is x
, its value is 7
, and its type is numeric
(7 is a number!). Re-defining a variable will overwrite the value.
x <- 5.5
x
+
+[1] 5.5
+
We can modify an existing variable by reassigning it to its same name. Here we’ll add 2
to x
and reassign the result back to x
.
x <- x + 2
x
+
+[1] 7.5
+
@@ -480,13 +496,20 @@ Comments in R code are indicated with pound signs (aka hashtags, octothorps). R will ignore any text in a line after the pound sign, so you can put whatever text you like there.
- -22/7 # not quite pi
-
-# If we need a better approximation of pi, we can use Euler's formula
+
+22/7 # not quite pi
+
+
+[1] 3.142857
+
+
+# If we need a better approximation of pi, we can use Euler's formula
# This uses atan(), which calculates arctangent.
20 * atan(1/7) + 8 * atan(3/79)
+
+[1] 3.141593
+
Help out Future You by adding lots of comments! Future You next week thinks Today You is an idiot, and the only way you can convince Future You that Today You is reasonably competent is by adding comments in your code explaining why Today You is actually not so bad.
@@ -503,42 +526,57 @@ Functions
Functions also return values for us to use. In the case of log()
, the returned value is the log’d value the function computed.
-
+
log(73)
+
+[1] 4.290459
+
Here we can specify an argument of base
to calculate log base 3.
-
+
log(81, base = 3)
+
+[1] 4
+
If we don’t specify the argument names, it assumes they are in the order that log
defines them. See ?log
to see more about its arguments.
-
+
log(8, 2)
+
+[1] 3
+
We can switch the order if we specify the argument names.
-
+
log(base = 10, x = 4342)
+
+[1] 3.63769
+
We can also provide variables as arguments in the same way as the raw values.
-
+
meaning <- 42
log(meaning)
+
+[1] 3.73767
+
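To see why argument names matter, compare what happens when the two values are swapped without names. This short sketch uses only `log()`; the variable names are invented for illustration.

```r
# Positional matching: the first value is x, the second is base
log_by_position <- log(8, 2)

# Named arguments can appear in any order
log_by_name <- log(base = 2, x = 8)

# Swapping unnamed arguments changes the question being asked:
# this is the log of 2 in base 8
log_swapped <- log(2, 8)

log_by_position
log_by_name
log_swapped
```

The first two calls give the same answer, while the third does not, which is why naming arguments is the safer habit for anything beyond the first argument.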
@@ -549,7 +587,7 @@ Variable Types
Variable types in R can sometimes be coerced (converted) from one type to another.
-
+
# Define a variable with a number
x <- 15
@@ -558,32 +596,50 @@ Variable Types
The function class()
will tell us the variable’s type.
-
+
class(x)
+
+[1] "numeric"
+
Let’s coerce it to a character.
-
+
x <- as.character(x)
class(x)
+
+[1] "character"
+
See it now has quotes around it? It’s now a character and will behave as such
-
+
x
+
+[1] "15"
+
Use this chunk to try to perform calculations with x
, now that it is a character, what happens?
-
+
# Try to perform calculations on `x`
@@ -591,7 +647,7 @@ Variable Types
But we can’t coerce everything:
-
+
# Let's create a character variable
x <- "look at my character variable"
@@ -600,17 +656,23 @@ Variable Types
Let’s try making this a numeric variable:
-
+
x <- as.numeric(x)
+
+Warning: NAs introduced by coercion
+
Print out x
.
-
+
x
+
+[1] NA
+
R is telling us it doesn’t know how to convert this to a numeric variable, so it has returned NA
instead.
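The same thing happens element by element when a vector is coerced, which is a common situation with messy data. Here is a small sketch with made-up values; `suppressWarnings()` is used only to keep the output tidy once we know the warning is expected.

```r
# A character vector where only some values look like numbers
mixed_values <- c("15", "3.5", "not a number")

# Values that cannot be converted become NA
converted <- suppressWarnings(as.numeric(mixed_values))
converted

# is.na() tells us which values failed to convert
is.na(converted)

# Many functions can skip NA values with na.rm = TRUE
sum(converted, na.rm = TRUE)
```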
@@ -669,7 +731,7 @@ Vectors
You will have noticed that all your computations tend to pop up with a [1]
preceding them in R’s output. This is because, in fact, all (ok mostly all) variables are by default vectors, and our answers are the first (in these cases only) value in the vector. As vectors get longer, new index indicators will appear at the start of new lines.
-
+
# This is actually a vector that has one item in it.
x <- 7
@@ -677,70 +739,94 @@ Vectors
-
+
# The length() function tells us how long a vector is:
length(x)
+
+[1] 1
+
We can define vectors with the function c()
, which stands for “combine”. This function takes a comma-separated set of values to place in the vector, and returns the vector itself:
-
+
my_numeric_vector <- c(1, 1, 2, 3, 5, 8, 13, 21)
my_numeric_vector
+
+[1] 1 1 2 3 5 8 13 21
+
We can build on vectors in place by redefining them
-
+
# add the next two Fibonacci numbers to the series.
my_numeric_vector <- c(my_numeric_vector, 34, 55)
my_numeric_vector
+
+ [1] 1 1 2 3 5 8 13 21 34 55
+
We can pull out specific items from a vector using a process called indexing, which uses brackets []
to specify the position of an item.
-
+
# Grab the fourth value from my_numeric_vector
# This gives us a vector of length 1
my_numeric_vector[4]
+
+[1] 3
+
Colons are also a nice way to quickly make ordered numeric vectors. Use a colon to specify an inclusive range of indices. This will return a vector containing the items at positions 2, 3, 4, and 5.
-
+
my_numeric_vector[2:5]
+
+[1] 1 2 3 5
+
One major benefit of vectors is the concept of vectorization, where R by default performs operations on the entire vector at once. For example, we can get the log of all numbers 1-20 with a single, simple call, and more!
-
+
values_1_to_20 <- 1:20
-
+
# calculate the log of values_1_to_20
log(values_1_to_20)
+
+ [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
+ [8] 2.0794415 2.1972246 2.3025851 2.3978953 2.4849066 2.5649494 2.6390573
+[15] 2.7080502 2.7725887 2.8332133 2.8903718 2.9444390 2.9957323
+
Finally, we can apply logical expressions to vectors, just as we can do for single values. The output here is a logical vector telling us whether the expression is TRUE or FALSE for each value in values_1_to_20.
-
+
# Which values are <= 3?
values_1_to_20 <= 3
+
+ [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
+[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
+
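A logical vector like this one is useful for more than inspection: it can be placed inside brackets to keep only the items where it is `TRUE`, or summed to count them. This is a brief sketch; `is_small` is a name invented for the example.

```r
values_1_to_20 <- 1:20

# Store the logical vector so we can reuse it
is_small <- values_1_to_20 <= 3

# Indexing with a logical vector keeps the items where it is TRUE
values_1_to_20[is_small]

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(is_small)
```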
There are several key functions which can be used on vectors containing numeric values, some of which are below.
@@ -753,11 +839,14 @@ Vectors
We can try out these functions on the vector values_1_to_20
we’ve created.
-
-mean(values_1_to_20)
-
-# Try out some of the other functions we've listed above
-
+
+mean(values_1_to_20)
+
+
+[1] 10.5
+
+
+# Try out some of the other functions we've listed above
@@ -771,28 +860,37 @@ The %in%
logical operator
%in%
is useful for determining whether a given item (or items) is in a vector.
-
+
# is `7` in our vector?
7 %in% values_1_to_20
+
+[1] TRUE
+
-
+
# is `50` in our vector?
50 %in% values_1_to_20
+
+[1] FALSE
+
We can test a vector of values being within another vector of values.
-
+
question_values <- c(1:3, 7, 50)
# Are these values in our vector?
question_values %in% values_1_to_20
+
+[1] TRUE TRUE TRUE TRUE FALSE
+
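Since `%in%` returns a logical vector, it combines naturally with indexing to pull out the values that were (or were not) found. A minimal sketch, using names invented for the example:

```r
values_1_to_20 <- 1:20
question_values <- c(1:3, 7, 50)

# Keep only the question values that are present in values_1_to_20
found_values <- question_values[question_values %in% values_1_to_20]
found_values

# The ! operator flips TRUE and FALSE, giving the values that are absent
absent_values <- question_values[!(question_values %in% values_1_to_20)]
absent_values
```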
@@ -802,7 +900,7 @@ Data frames
Data frames are one of the most useful tools for data analysis in R. They are tables which consist of rows and columns, much like a spreadsheet. Each column is a variable which behaves as a vector, and each row is an observation. We will begin our exploration with a dataset of measurements from three penguin species, which we can find in the palmerpenguins
package. We’ll talk more about packages soon! To use this dataset, we will load it from the palmerpenguins
package using a ::
(more on this later) and assign it to a variable named penguins
in our current environment.
-
+
penguins <- palmerpenguins::penguins
@@ -823,65 +921,129 @@ Exploring data frames
This provides summary statistics for each column:
-
+
summary(penguins)
+
+ species island bill_length_mm bill_depth_mm
+ Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
+ Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
+ Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
+ Mean :43.92 Mean :17.15
+ 3rd Qu.:48.50 3rd Qu.:18.70
+ Max. :59.60 Max. :21.50
+ NA's :2 NA's :2
+ flipper_length_mm body_mass_g sex year
+ Min. :172.0 Min. :2700 female:165 Min. :2007
+ 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
+ Median :197.0 Median :4050 NA's : 11 Median :2008
+ Mean :200.9 Mean :4202 Mean :2008
+ 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
+ Max. :231.0 Max. :6300 Max. :2009
+ NA's :2 NA's :2
+
This provides a short view of the structure and contents of the data frame.
-
+
str(penguins)
+
+tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
+ $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
+ $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
+ $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
+ $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
+ $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
+ $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
+ $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
+ $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
+
You’ll notice that the column species
is a factor: This is a special type of character variable that represents distinct categories known as “levels”. We have learned here that there are three levels in the species
column: Adelie, Chinstrap, and Gentoo. We might want to explore individual columns of the data frame more in-depth. We can examine individual columns using the dollar sign $
to select one by name:
-
+
# Extract bill_length_mm as a vector
-penguins$bill_length_mm
-
-# indexing operators can be used too
+penguins$bill_length_mm
+
+
+ [1] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
+ [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
+ [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
+ [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
+ [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
+ [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
+ [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2
+ [ reached getOption("max.print") -- omitted 244 entries ]
+
+
+# indexing operators can be used too
penguins$bill_depth_mm[1:10]
+
+ [1] 18.7 17.4 18.0 NA 19.3 20.6 17.8 19.6 18.1 20.2
+
We can perform our regular vector operations on columns directly.
-
+
# calculate the mean of the bill_length_mm column
mean(penguins$bill_length_mm,
na.rm = TRUE) # remove missing values before calculating the mean
+
+[1] 43.92193
+
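To see what `na.rm = TRUE` is protecting us from, here is a small sketch with a made-up vector that contains one missing value.

```r
# A short vector with one missing value (values are made up)
bill_lengths <- c(39.1, 39.5, NA, 36.7)

# Without na.rm, a single NA makes the whole result NA
mean(bill_lengths)

# With na.rm = TRUE, the mean is calculated from the non-missing values
mean(bill_lengths, na.rm = TRUE)

# Count how many values are missing
sum(is.na(bill_lengths))
```

Returning `NA` by default is a deliberate safeguard, because it forces us to notice missing data instead of silently ignoring it.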
We can also calculate the full summary statistics for a single column directly.
-
+
# show a summary of the bill_length_mm column
summary(penguins$bill_length_mm)
+
+ Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+ 32.10 39.23 44.45 43.92 48.50 59.60 2
+
Extract species
as a vector and subset it to see a preview.
-
+
# get the first 10 values of the species column
penguins$species[1:10]
+
+ [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
+Levels: Adelie Chinstrap Gentoo
+
And view its levels with the levels()
function.
-
+
levels(penguins$species)
+
+[1] "Adelie" "Chinstrap" "Gentoo"
+
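Factors can also be built by hand, which helps show how levels relate to the underlying values. In this sketch the vector is made up; by default `factor()` sorts the levels alphabetically and stores each value as an integer code pointing at its level.

```r
# Create a factor from a character vector (values are made up)
species <- factor(c("Gentoo", "Adelie", "Gentoo", "Chinstrap"))

# Levels are the distinct categories, sorted alphabetically by default
levels(species)

# Behind the scenes, each value is stored as the index of its level
as.integer(species)

# table() counts how many times each level occurs
table(species)
```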
@@ -892,7 +1054,7 @@ Files and directories
Here we will use a function, read_tsv()
from the readr
package. Before we are able to use the function, we have to load the package using library()
.
-
+
library(readr)
@@ -900,15 +1062,21 @@ Files and directories
file.path()
creates a properly formatted file path by adding a path separator (/
on Mac and Linux operating systems; our RStudio Server runs on Linux) between separate folders or directories. Because file path separators can differ between your computer and the computer of someone who wants to use your code, we use file.path()
instead of typing out "data/gene_results_GSE44971.tsv"
. Each argument to file.path()
is a directory or file name. You’ll notice each argument is in quotes; we specify data
first because the file, gene_results_GSE44971.tsv
is in the data
folder.
-
+
file.path("data", "gene_results_GSE44971.tsv")
+
+[1] "data/gene_results_GSE44971.tsv"
+
We can store this file path as a variable in our environment.
-
+
gene_file_path <- file.path("data", "gene_results_GSE44971.tsv")
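Once a path is stored in a variable, a few base R helpers can take it apart or check it before we try to read the file. This sketch reuses the path from above; whether `file.exists()` returns `TRUE` depends on your current working directory.

```r
gene_file_path <- file.path("data", "gene_results_GSE44971.tsv")

# The file name without its directory
basename(gene_file_path)

# The directory portion of the path
dirname(gene_file_path)

# TRUE only if the file can be found relative to the working directory
file.exists(gene_file_path)
```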
@@ -916,19 +1084,38 @@ Files and directories
Now we are ready to use read_tsv()
to read the file into R. The resulting data frame will be stored in a variable named stats_df
. Note the <-
which is responsible for saving this to our global environment.
-
+
# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(gene_file_path)
+
+
+── Column specification ────────────────────────────────────────────────────────
+cols(
+ ensembl_id = col_character(),
+ gene_symbol = col_character(),
+ contrast = col_character(),
+ log_fold_change = col_double(),
+ avg_expression = col_double(),
+ t_statistic = col_double(),
+ p_value = col_double(),
+ adj_p_value = col_double()
+)
+
Take a look at your environment panel to see what stats_df
looks like. We can also print out a preview of the stats_df
data frame here.
-
+
# display stats_df
stats_df
+
+
+
@@ -936,14 +1123,49 @@ Session Info
At the end of every notebook, you will see us print out sessionInfo
. This aids in the reproducibility of your code by showing exactly what packages and versions were being used the last time the notebook was run.
-
+
sessionInfo()
+
+R version 4.0.3 (2020-10-10)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 20.04 LTS
+
+Matrix products: default
+BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+other attached packages:
+[1] readr_1.4.0 optparse_1.6.6
+
+loaded via a namespace (and not attached):
+ [1] rstudioapi_0.13 knitr_1.30 magrittr_2.0.1
+ [4] hms_1.0.0 getopt_1.20.3 R6_2.5.0
+ [7] rlang_0.4.10 fansi_0.4.2 stringr_1.4.0
+[10] tools_4.0.3 palmerpenguins_0.1.0 xfun_0.20
+[13] cli_2.2.0 htmltools_0.5.1.1 ellipsis_0.3.1
+[16] yaml_2.2.1 digest_0.6.27 assertthat_0.2.1
+[19] tibble_3.0.5 lifecycle_0.2.0 crayon_1.3.4
+[22] ps_1.5.0 vctrs_0.3.6 glue_1.4.2
+[25] evaluate_0.14 rmarkdown_2.6 stringi_1.5.3
+[28] compiler_4.0.3 pillar_1.4.7 jsonlite_1.7.2
+[31] pkgconfig_2.0.3
+
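If it is useful to keep this record outside the notebook, one possible approach is to capture the printed session information and write it to a text file. This is only a sketch: the file name session_info.txt is an assumption, and we write to a temporary folder so nothing in the project is changed.

```r
# Capture what sessionInfo() would print as a character vector, one element per line
session_lines <- capture.output(sessionInfo())

# Write the lines to a text file in a temporary folder
# ("session_info.txt" is a made-up file name for this example)
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(session_lines, session_file)

# Individual versions can also be queried directly
r_version <- R.version.string
stats_version <- packageVersion("stats")

print(r_version)
print(stats_version)
```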
diff --git a/intro-to-R-tidyverse/02-intro_to_ggplot2-live.Rmd b/intro-to-R-tidyverse/02-intro_to_ggplot2-live.Rmd
index 8084d96b..c011309e 100644
--- a/intro-to-R-tidyverse/02-intro_to_ggplot2-live.Rmd
+++ b/intro-to-R-tidyverse/02-intro_to_ggplot2-live.Rmd
@@ -1,14 +1,24 @@
---
title: "Introduction to ggplot2"
+author: "CCDL for ALSF"
+date: 2021
output:
html_notebook:
toc: true
toc_float: true
---
-**CCDL 2020**
-## Objective for this notebook analysis:
+## Objectives
+
+This notebook will demonstrate how to:
+
+- Load and use R packages
+- Read in and perform simple manipulations of data frames
+- Use `ggplot2` to plot and visualize data
+- Customize plots using features of `ggplot2`
+
+---
We'll use a real gene expression dataset to get comfortable making visualizations using ggplot2.
We've [performed differential expression analyses](./scripts/00-setup-intro-to-R.R) on a pre-processed [astrocytoma microarray dataset](https://www.refine.bio/experiments/GSE44971/gene-expression-data-from-pilocytic-astrocytoma-tumour-samples-and-normal-cerebellum-controls).
@@ -20,10 +30,11 @@ We performed three sets of contrasts:
3) An interaction of both `sex` and `tissue`.
**More ggplot2 resources:**
-+ [handy cheatsheet for ggplot2](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
-+ [ggplot2 courses and books from Tidyverse](https://ggplot2.tidyverse.org/)
-+ [ggplot2 data viz chapter of R for Data Science](https://r4ds.had.co.nz/data-visualisation.html)
-+ [ggplot2 online tutorial](http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html)
+
+- [handy cheatsheet for ggplot2](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
+- [ggplot2 courses and books from Tidyverse](https://ggplot2.tidyverse.org/)
+- [ggplot2 data viz chapter of R for Data Science](https://r4ds.had.co.nz/data-visualisation.html)
+- [ggplot2 online tutorial](http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html)
## Set Up
@@ -217,7 +228,6 @@ For now, we will choose the value of 5.5 (that is close to a Bonferroni correcti
```
-We can also change the background and appearance of the plot as a whole by adding a `theme`.
We can change the x and y labels by using `ylab` and `xlab` functions and add a
title using `ggtitle`.
diff --git a/intro-to-R-tidyverse/02-intro_to_ggplot2.nb.html b/intro-to-R-tidyverse/02-intro_to_ggplot2.nb.html
index 315f553a..e83cc0f2 100644
--- a/intro-to-R-tidyverse/02-intro_to_ggplot2.nb.html
+++ b/intro-to-R-tidyverse/02-intro_to_ggplot2.nb.html
@@ -323,10 +323,11 @@ Objectives
- Load and use R packages
-- Manipulate data frames
-- Use ggplot to plot and visualize data
-- Customize plots using features of ggplot
+- Read in and perform simple manipulations of data frames
+- Use
ggplot2
to plot and visualize data
+- Customize plots using features of
ggplot2
+
We’ll use a real gene expression dataset to get comfortable making visualizations using ggplot2. We’ve performed differential expression analyses on a pre-processed astrocytoma microarray dataset. We’ll start by making a volcano plot of differential gene expression results from this experiment. We performed three sets of contrasts:
sex
category contrasting: Male
vs Female
@@ -351,7 +352,7 @@ Set Up
We saved these results to a tab separated values (TSV) file called gene_results_GSE44971.tsv
. It’s been saved to the data
folder. File paths are relative to where this notebook file (.Rmd) is saved. So we can reference it later, let’s make a variable with our data directory name.
-
+
data_dir <- "data"
@@ -359,7 +360,7 @@ Set Up
Let’s declare our output folder name as its own variable.
-
+
plots_dir <- "plots"
@@ -367,7 +368,7 @@ Set Up
We can also create a directory if it doesn’t already exist.
-
+
# The if statement here tests whether the plot directory exists and
# only executes the expressions between the braces if it does not.
if (!dir.exists(plots_dir)) {
@@ -379,9 +380,23 @@ Set Up
In this notebook we will be using functions from the Tidyverse set of packages, so we need to load in those functions using library()
. We could load the individual packages we need one at a time, but it is convenient for now to load them all with the tidyverse
package, which groups many of them together. Keep a look out for where we tell you which individual package different functions come from.
-
+
library(tidyverse)
+
+── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
+
+
+✔ ggplot2 3.3.3 ✔ purrr 0.3.4
+✔ tibble 3.0.5 ✔ dplyr 1.0.3
+✔ tidyr 1.1.2 ✔ stringr 1.4.0
+✔ readr 1.4.0 ✔ forcats 0.5.0
+
+
+── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+✖ dplyr::filter() masks stats::filter()
+✖ dplyr::lag() masks stats::lag()
+
@@ -390,47 +405,101 @@ Read in the differential expression analysis results file
Here we are using a tidyverse
function read_tsv()
from the readr
package. Like we did in the previous notebook, we will store the resulting data frame as stats_df
.
-
+
# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(file.path(
data_dir,
"gene_results_GSE44971.tsv"
))
+
+
+── Column specification ────────────────────────────────────────────────────────
+cols(
+ ensembl_id = col_character(),
+ gene_symbol = col_character(),
+ contrast = col_character(),
+ log_fold_change = col_double(),
+ avg_expression = col_double(),
+ t_statistic = col_double(),
+ p_value = col_double(),
+ adj_p_value = col_double()
+)
+
We can take a look at a column individually by using a $
. Note we are using head()
so the whole thing doesn’t print out.
-
+
head(stats_df$contrast)
+
+[1] "male_female" "male_female" "male_female" "male_female" "male_female"
+[6] "male_female"
+
+
+male_female
+
+male_female
+
+male_female
+
+male_female
+
+male_female
+
+male_female
+
If we want to see a specific set of values, we can use brackets with the indices of the values we’d like returned.
-
+
stats_df$avg_expression[6:10]
+
+[1] 19.084011 8.453933 5.116563 6.345609 25.473133
+
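The same bracket syntax works on any vector. Here is a small sketch with made-up numbers (not values from stats_df) showing three common ways to pick out values: a range of positions, specific positions, and a logical test.

```r
# A small made-up vector standing in for a data frame column
expression_values <- c(19.1, 8.5, 5.1, 6.3, 25.5, 7.2)

# A range of positions (R starts counting at 1)
range_values <- expression_values[2:4]

# Specific, non-adjacent positions
picked_values <- expression_values[c(1, 5)]

# A logical test keeps only the values where the test is TRUE
high_values <- expression_values[expression_values > 8]

print(range_values)
print(picked_values)
print(high_values)
```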
Let’s look at some basic statistics from the data set using summary()
-
+
# summary of stats_df
summary(stats_df)
+
+ ensembl_id gene_symbol contrast log_fold_change
+ Length:6804 Length:6804 Length:6804 Min. :-180.8118
+ Class :character Class :character Class :character 1st Qu.: -1.6703
+ Mode :character Mode :character Mode :character Median : 0.1500
+ Mean : 0.2608
+ 3rd Qu.: 2.1049
+ Max. : 129.3009
+ avg_expression t_statistic p_value adj_p_value
+ Min. : 5.003 Min. :-32.84581 Min. :0.00000 Min. :0.00000
+ 1st Qu.: 6.304 1st Qu.: -1.16444 1st Qu.:0.01309 1st Qu.:0.05657
+ Median : 8.482 Median : 0.10619 Median :0.18919 Median :0.41354
+ Mean : 13.847 Mean : -0.00819 Mean :0.31223 Mean :0.44833
+ 3rd Qu.: 14.022 3rd Qu.: 1.46589 3rd Qu.:0.57634 3rd Qu.:0.82067
+ Max. :190.708 Max. : 10.48302 Max. :0.99979 Max. :0.99988
+
The statistics for contrast
are not very informative, so let’s do that again with just the contrast
column after converting it to a factor
-
+
# summary of `stats_df$contrast` as a factor
summary(as.factor(stats_df$contrast))
+
+astrocytoma_normal interaction male_female
+ 2268 2268 2268
+
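To see why converting to a factor helps, here is a sketch using a short made-up character vector rather than the real contrast column. A factor stores the set of distinct values as levels, so summary() can count them; table() gives the same counts without converting first.

```r
# A short made-up vector that mimics the `contrast` column
contrast <- c(
  "male_female", "interaction", "astrocytoma_normal",
  "male_female", "interaction", "male_female"
)

# Converting to a factor records the distinct values as levels
contrast_factor <- as.factor(contrast)
contrast_levels <- levels(contrast_factor)

# summary() on a factor counts how many times each level occurs
factor_counts <- summary(contrast_factor)

# table() produces the same counts directly from the character vector
table_counts <- table(contrast)

print(contrast_levels)
print(factor_counts)
print(table_counts)
```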
@@ -439,7 +508,7 @@ Set up the dataset
Before we make our plot, we want to calculate a set of new values for each row; transformations of the raw statistics in our table. To do this we will use a function from the dplyr
package called mutate()
to make a new column of -log10 p values.
-
+
# add a `neg_log10_p` column to the data frame
stats_df <- mutate(stats_df, # data frame we'd like to add a variable to
neg_log10_p = -log10(p_value) # column name and values
@@ -458,16 +527,21 @@ Set up the dataset
Now we can try out the filter()
function. Notice that we are not assigning the results to a variable, so this filtered dataset will not be saved to the environment.
-
+
# filter stats_df to "male_female" only
filter(stats_df, contrast == "male_female")
+
+
+
Now we can assign the results to a new data frame: male_female_df
.
-
+
# filter and save to male_female_df
male_female_df <- filter(stats_df, contrast == "male_female")
@@ -479,7 +553,7 @@ Plotting this data
Let’s make a volcano plot with this data. First let’s take a look at only the tumor vs. normal comparison. Let’s save this as a separate data frame by assigning it a new name.
-
+
tumor_normal_df <- filter(stats_df, contrast == "astrocytoma_normal")
@@ -492,7 +566,7 @@ Plotting this data
-
+
ggplot(
tumor_normal_df, # This first argument is the data frame with the data we want to plot
aes(
@@ -502,12 +576,15 @@ Plotting this data
) # This is the column name of the data we want for the y-axis
)
+
+
+
You’ll notice this plot doesn’t have anything on it because we haven’t specified a plot type yet. To do that, we will add another ggplot layer with +
which will specify exactly what we want to plot. A volcano plot is a special kind of scatter plot, so to make that we will want to plot individual points, which we can do with geom_point()
.
-
+
# This first part is the same as before
ggplot(
tumor_normal_df,
@@ -519,6 +596,9 @@ Plotting this data
# Now we are adding on a layer to specify what kind of plot we want
geom_point()
+
+
+
Here’s a brief summary of ggplot2 structure.
@@ -527,7 +607,7 @@ Adjust our ggplot
Now that we have a base plot that shows our data, we can add layers on to it and adjust it. We can adjust the color of points using the color
aesthetic.
-
+
ggplot(
tumor_normal_df,
aes(
@@ -538,12 +618,15 @@ Adjust our ggplot
) +
geom_point()
+
+
+
Because we have so many points overlapping one another, we will want to adjust the transparency, which we can do with an alpha
argument.
-
+
ggplot(
tumor_normal_df,
aes(
@@ -554,12 +637,15 @@ Adjust our ggplot
) +
geom_point(alpha = 0.2) # We are using the `alpha` argument to make our points transparent
+
+
+
Notice that we added the alpha within the geom_point()
function, not to the aes()
. We did this because we want all of the points to have the same level of transparency, and it will not vary depending on any variable in the data. We can also change the background and appearance of the plot as a whole by adding a theme
.
-
+
ggplot(
tumor_normal_df,
aes(
@@ -571,12 +657,15 @@ Adjust our ggplot
geom_point(alpha = 0.2) +
theme_bw() # Add on this set of appearance presets to make it pretty
+
+
+
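To make the difference between setting alpha as a fixed argument and mapping it inside aes() concrete, here is a sketch with a made-up data frame. The toy_df data and its weight column are assumptions for illustration only.

```r
library(ggplot2)

# Made-up data for illustration only
toy_df <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 1, 3),
  weight = c(0.2, 0.4, 0.6, 0.8)
)

# Fixed transparency: every point gets alpha = 0.2,
# so the argument goes in geom_point(), outside aes()
fixed_alpha_plot <- ggplot(toy_df, aes(x = x, y = y)) +
  geom_point(alpha = 0.2)

# Mapped transparency: alpha varies with the `weight` column,
# so it goes inside aes() and ggplot2 adds a legend for it
mapped_alpha_plot <- ggplot(toy_df, aes(x = x, y = y, alpha = weight)) +
  geom_point()
```

Printing either object (for example, by typing fixed_alpha_plot on its own line) draws the plot.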
We are not limited to a single plotting layer. For example, if we want to add a horizontal line to indicate a significance cutoff, we can do that with geom_hline()
. For now, we will choose the value of 5.5 (that is close to a Bonferroni correction) and add that to the plot.
-
+
ggplot(
tumor_normal_df,
aes(
@@ -588,12 +677,15 @@ Adjust our ggplot
geom_point(alpha = 0.2) +
geom_hline(yintercept = 5.5, color = "darkgreen") # we can specify colors by names here
+
+
+
We can change the x and y labels by using ylab
and xlab
functions and add a title using ggtitle
.
-
+
ggplot(
tumor_normal_df,
aes(
@@ -609,21 +701,23 @@ Adjust our ggplot
ylab("-log10 p value") + # Add a y label
ggtitle("Astrocytoma Tumor vs Normal Cerebellum") # Add main title
+
+
+
Use this chunk to make the same kind of plot as in the previous chunk, but plot the male-female contrast data, which is stored in male_female_df
.
-
-# Use this chunk to make the same kind of volcano plot, but with the male-female contrast data.
-
+
+# Use this chunk to make the same kind of volcano plot, but with the male-female contrast data.
It turns out that we don’t have to plot each contrast separately. Instead, we can use the original data frame that contains all three contrasts’ data, stats_df
, and add a facet_wrap
to make each contrast its own plot.
-
+
ggplot(
stats_df, # Switch to the bigger data frame with all three contrasts' data
aes(
@@ -641,12 +735,15 @@ Adjust our ggplot
ylab("-log10 p value") +
coord_cartesian(xlim = c(-25, 25)) # zoom in on the x-axis
+
+
+
We can store the plot as an object in the global environment by using the <-
operator. Here we will call it volcano_plot
.
-
+
volcano_plot <- ggplot(
stats_df, # We are calling this plot `volcano_plot`
aes(
@@ -668,12 +765,15 @@ Adjust our ggplot
When we are happy with our plot, we can save the plot using ggsave
.
-
+
ggsave(
plot = volcano_plot,
filename = file.path(plots_dir, "volcano_plot.png")
)
+
+Saving 7 x 5 in image
+
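The "Saving 7 x 5 in image" message appears because no size was given, so ggsave() falls back to the size of the current graphics device. If we want the saved image to be the same size every time, we can set the dimensions ourselves. The sketch below uses a made-up plot and saves to a temporary folder so it does not touch the plots directory.

```r
library(ggplot2)

# A small made-up plot for illustration only
toy_df <- data.frame(x = c(1, 2, 3), y = c(3, 1, 2))
toy_plot <- ggplot(toy_df, aes(x = x, y = y)) +
  geom_point()

# Save to a temporary folder so the example leaves the project untouched
output_file <- file.path(tempdir(), "toy_plot.png")

# Setting width, height, and units makes the output size explicit
ggsave(
  filename = output_file,
  plot = toy_plot,
  width = 7,
  height = 5,
  units = "in"
)
```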
@@ -681,15 +781,55 @@ Adjust our ggplot
Session Info
-
+
# Print out the versions and packages we are using in this session
sessionInfo()
+
+R version 4.0.3 (2020-10-10)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 20.04 LTS
+
+Matrix products: default
+BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+other attached packages:
+ [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.3 purrr_0.3.4
+ [5] readr_1.4.0 tidyr_1.1.2 tibble_3.0.5 ggplot2_3.3.3
+ [9] tidyverse_1.3.0 optparse_1.6.6
+
+loaded via a namespace (and not attached):
+ [1] tidyselect_1.1.0 xfun_0.20 haven_2.3.1 colorspace_2.0-0
+ [5] vctrs_0.3.6 generics_0.1.0 htmltools_0.5.1.1 getopt_1.20.3
+ [9] yaml_2.2.1 rlang_0.4.10 pillar_1.4.7 glue_1.4.2
+[13] withr_2.4.0 DBI_1.1.1 dbplyr_2.0.0 modelr_0.1.8
+[17] readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0
+[21] cellranger_1.1.0 rvest_0.3.6 evaluate_0.14 labeling_0.4.2
+[25] knitr_1.30 ps_1.5.0 fansi_0.4.2 broom_0.7.3
+[29] Rcpp_1.0.6 scales_1.1.1 backports_1.2.1 jsonlite_1.7.2
+[33] farver_2.0.3 fs_1.5.0 hms_1.0.0 digest_0.6.27
+[37] stringi_1.5.3 grid_4.0.3 cli_2.2.0 tools_4.0.3
+[41] magrittr_2.0.1 crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1
+[45] xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9.2 assertthat_0.2.1
+[49] rmarkdown_2.6 httr_1.4.2 rstudioapi_0.13 R6_2.5.0
+[53] compiler_4.0.3
+
diff --git a/intro-to-R-tidyverse/03-intro_to_tidyverse-live.Rmd b/intro-to-R-tidyverse/03-intro_to_tidyverse-live.Rmd
index 440e4856..91149b7a 100644
--- a/intro-to-R-tidyverse/03-intro_to_tidyverse-live.Rmd
+++ b/intro-to-R-tidyverse/03-intro_to_tidyverse-live.Rmd
@@ -1,5 +1,7 @@
---
title: "Introduction to tidyverse"
+author: "CCDL for ALSF"
+date: 2021
output:
html_notebook:
toc: true
@@ -8,15 +10,24 @@ editor_options:
chunk_output_type: inline
---
-**CCDL 2020**
-## Objective for this notebook analysis:
+## Objectives
+
+This notebook will demonstrate how to:
+
+- Use functions from the tidyverse to read and write data frames
+- Implement and use tidyverse functions to wrangle data (i.e. filter, mutate, arrange, join)
+- Use `magrittr` pipes (`%>%`) to combine multiple operations
+- Use the `apply()` function to apply functions across rows or columns of a matrix
+
+---
We'll use the same gene expression dataset we used in the [previous notebook](./02-intro_to_ggplot2.Rmd).
It is a pre-processed [astrocytoma microarray dataset](https://www.refine.bio/experiments/GSE44971/gene-expression-data-from-pilocytic-astrocytoma-tumour-samples-and-normal-cerebellum-controls)
that we performed a set of [differential expression analyses on](./scripts/00-setup-intro-to-R.R).
-**More tidyverse resources:**
+**More tidyverse resources:**
+
- [R for Data Science](https://r4ds.had.co.nz/)
- [tidyverse documentation](https://dplyr.tidyverse.org/)
- [Cheatsheet of tidyverse data transformation](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)
@@ -187,10 +198,10 @@ Use this chunk to explore what `gene_df` looks like.
What information is contained in `gene_df`?
-## dplyr pipes
+## `magrittr` pipes
One nifty feature of the tidyverse is pipes: `%>%`
-These handy things allows you to funnel the result of one expression to the next,
+These handy things, which come from the `magrittr` package, allow you to funnel the result of one expression to the next,
making your code a little more streamlined.
For example, the output from this:
diff --git a/intro-to-R-tidyverse/03-intro_to_tidyverse.nb.html b/intro-to-R-tidyverse/03-intro_to_tidyverse.nb.html
index 938626aa..f87fbe44 100644
--- a/intro-to-R-tidyverse/03-intro_to_tidyverse.nb.html
+++ b/intro-to-R-tidyverse/03-intro_to_tidyverse.nb.html
@@ -321,11 +321,12 @@ 2021
Objectives
This notebook will demonstrate how to:
-- Use the tidyverse package to read and write data frames
-- Use dplyr pipes to manipulate data frames
-- Implement and use functions in the tidyverse package to wrangle data (i.e. filter, mutate, arrange, join)
-- Use the apply function to apply functions across entire data frames
+- Use functions from the tidyverse to read and write data frames
+- Implement and use tidyverse functions to wrangle data (i.e. filter, mutate, arrange, join)
+- Use
magrittr
pipes (%>%
) to combine multiple operations
+- Use the
apply()
function to apply functions across rows or columns of a matrix
+
We’ll use the same gene expression dataset we used in the previous notebook. It is a pre-processed astrocytoma microarray dataset that we performed a set of differential expression analyses on.
More tidyverse resources:
@@ -344,306 +345,185 @@ Set Up
Our RStudio Server already has the tidyverse
group of packages installed for you. But if you needed to install it or other packages available on CRAN, you would do so using the install.packages()
function like this: install.packages("tidyverse")
.
-
-
```r
-library(tidyverse)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-### Referencing a library's function with `::`
-
-Note that if we had not imported the tidyverse set of packages using `library()`
-like above, and we wanted to use a tidyverse function like `read_tsv()`, we
-would need to tell R what package to find this function in.
-To do this, we would use `::` to tell R to load in this function from the
-`readr` package by using `readr::read_tsv()`.
-You will see this `::` method of referencing libraries within packages
-throughout the course.
-We like to use it in part to remove any ambiguity in which version of a
-function we are using; it is not too uncommon for different packages to use the
-same name for very different functions!
-
-## Managing directories
-
-Before we can import the data we need, we should double check where R is
-looking for files, aka the current **working directory**.
-We can do this by using the `getwd()` function, which will tell us what folder
-we are in.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBMZXQncyBjaGVjayB3aGF0IGRpcmVjdG9yeSB3ZSBhcmUgaW46XG5nZXR3ZCgpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-# Let's check what directory we are in:
+
+library(tidyverse)
+
+
+── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
+
+
+✔ ggplot2 3.3.3 ✔ purrr 0.3.4
+✔ tibble 3.0.5 ✔ dplyr 1.0.3
+✔ tidyr 1.1.2 ✔ stringr 1.4.0
+✔ readr 1.4.0 ✔ forcats 0.5.0
+
+
+── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+✖ dplyr::filter() masks stats::filter()
+✖ dplyr::lag() masks stats::lag()
+
+
+
+
+Referencing a library’s function with ::
+Note that if we had not imported the tidyverse set of packages using library()
like above, and we wanted to use a tidyverse function like read_tsv()
, we would need to tell R what package to find this function in. To do this, we would use ::
to tell R to load in this function from the readr
package by using readr::read_tsv()
. You will see this ::
method of referencing functions within packages throughout the course. We like to use it in part to remove any ambiguity in which version of a function we are using; it is not too uncommon for different packages to use the same name for very different functions!
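As a small sketch of the :: syntax, the chunk below uses functions from the stats and utils packages, which ship with R, so it runs without installing or loading anything extra. The last line connects to the Conflicts message above: even after dplyr masks filter(), the original is still reachable as stats::filter(), which computes a moving average rather than filtering rows.

```r
# No library() call is needed when the package name is given explicitly
middle_value <- stats::median(c(3, 1, 2))
first_letters <- utils::head(letters, n = 3)

# stats::filter() is a very different function from dplyr::filter():
# here it computes a three-point moving average of the numbers 1 to 5
moving_average <- stats::filter(1:5, rep(1 / 3, 3))

print(middle_value)
print(first_letters)
print(moving_average)
```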
+
+
+
+Managing directories
+Before we can import the data we need, we should double check where R is looking for files, aka the current working directory. We can do this by using the getwd()
function, which will tell us what folder we are in.
+
+
+
+# Let's check what directory we are in:
getwd()
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiWzFdIFxcL19fdy90cmFpbmluZy1tb2R1bGVzL3RyYWluaW5nLW1vZHVsZXMvaW50cm8tdG8tUi10aWR5dmVyc2VcXFxuL19fdy90cmFpbmluZy1tb2R1bGVzL3RyYWluaW5nLW1vZHVsZXMvaW50cm8tdG8tUi10aWR5dmVyc2VcbiJ9 -->
-
-[1] /__w/training-modules/training-modules/intro-to-R-tidyverse
-/__w/training-modules/training-modules/intro-to-R-tidyverse
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-For Rmd files, the working directory is wherever the file is located, but
-commands executed in the console may have a different working directory.
-
-We will want to make a directory for our output and we will call this directory:
-`results`.
-But before we create the directory, we should check if it already exists.
-We will show two ways that we can do this.
-
-First, we can use the `dir()` function to have R list the files in our working
-directory.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBMZXQncyBjaGVjayB3aGF0IGZpbGVzIGFyZSBoZXJlXG5kaXIoKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# Let's check what files are here
+
+
+[1] "/__w/training-modules/training-modules/intro-to-R-tidyverse"
+
+
+/__w/training-modules/training-modules/intro-to-R-tidyverse
+
+
+
+For Rmd files, the working directory is wherever the file is located, but commands executed in the console may have a different working directory.
+We will want to make a directory for our output and we will call this directory: results
. But before we create the directory, we should check if it already exists. We will show two ways that we can do this.
+First, we can use the dir()
function to have R list the files in our working directory.
+
+
+
+# Let's check what files are here
dir()
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiIFsxXSBcXDAwYS1yc3R1ZGlvX2d1aWRlLm1kXFwgICAgICAgICAgICAgICAgICAgXG4gWzJdIFxcMDBiLWRlYnVnZ2luZ19yZXNvdXJjZXMubWRcXCAgICAgICAgICAgICBcbiBbM10gXFwwMGMtZ29vZC1zY2llbnRpZmljLWNvZGluZy1wcmFjdGljZXMubWRcXFxuIFs0XSBcXDAxLWludHJvX3RvX2Jhc2VfUi1saXZlLlJtZFxcICAgICAgICAgICAgXG4gWzVdIFxcMDEtaW50cm9fdG9fYmFzZV9SLm5iLmh0bWxcXCAgICAgICAgICAgICBcbiBbNl0gXFwwMS1pbnRyb190b19iYXNlX1IuUm1kXFwgICAgICAgICAgICAgICAgIFxuIFs3XSBcXDAyLWludHJvX3RvX2dncGxvdDItbGl2ZS5SbWRcXCAgICAgICAgICAgXG4gWzhdIFxcMDItaW50cm9fdG9fZ2dwbG90Mi5uYi5odG1sXFwgICAgICAgICAgICBcbiBbOV0gXFwwMi1pbnRyb190b19nZ3Bsb3QyLlJtZFxcICAgICAgICAgICAgICAgIFxuWzEwXSBcXDAzLWludHJvX3RvX3RpZHl2ZXJzZS1saXZlLlJtZFxcICAgICAgICAgXG5bMTFdIFxcMDMtaW50cm9fdG9fdGlkeXZlcnNlLm5iLmh0bWxcXCAgICAgICAgICBcblsxMl0gXFwwMy1pbnRyb190b190aWR5dmVyc2UuUm1kXFwgICAgICAgICAgICAgIFxuWzEzXSBcXGRhdGFcXCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgXG5bMTRdIFxcZGlhZ3JhbXNcXCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBcblsxNV0gXFxleGVyY2lzZV8wMS1pbnRyb190b19iYXNlX1IuUm1kXFwgICAgICAgIFxuWzE2XSBcXGV4ZXJjaXNlXzAyLWludHJvX3RvX1IuUm1kXFwgICAgICAgICAgICAgXG5bMTddIFxcZXhlcmNpc2VfMDNhLWludHJvX3RvX3RpZHl2ZXJzZS5SbWRcXCAgICBcblsxOF0gXFxleGVyY2lzZV8wM2ItaW50cm9fdG9fdGlkeXZlcnNlLlJtZFxcICAgIFxuWzE5XSBcXHBsb3RzXFwgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgXG5bMjBdIFxcUkVBRE1FLm1kXFwgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBcblsyMV0gXFxyZXN1bHRzXFwgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIFxuWzIyXSBcXHNjcmVlbnNob3RzXFwgICAgICAgICAgICAgICAgICAgICAgICAgICAgXG5bMjNdIFxcc2NyaXB0c1xcICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBcbjAwYS1yc3R1ZGlvX2d1aWRlLm1kXG5cbjAwYi1kZWJ1Z2dpbmdfcmVzb3VyY2VzLm1kXG5cbjAwYy1nb29kLXNjaWVudGlmaWMtY29kaW5nLXByYWN0aWNlcy5tZFxuXG4wMS1pbnRyb190b19iYXNlX1ItbGl2ZS5SbWRcblxuMDEtaW50cm9fdG9fYmFzZV9SLm5iLmh0bWxcblxuMDEtaW50cm9fdG9fYmFzZV9SLlJtZFxuXG4wMi1pbnRyb190b19nZ3Bsb3QyLWxpdmUuUm1kXG5cbjAyLWludHJvX3RvX2dncGxvdDIubmIuaHRtbFxuXG4wMi1pbnRyb190b19nZ3Bsb3QyLlJtZFxuXG4wMy1pbnRyb190b190aWR5dmVyc2UtbGl2ZS5SbWRcblxuMDMtaW50cm9fdG9fdGlke
XZlcnNlLm5iLmh0bWxcblxuMDMtaW50cm9fdG9fdGlkeXZlcnNlLlJtZFxuXG5kYXRhXG5cbmRpYWdyYW1zXG5cbmV4ZXJjaXNlXzAxLWludHJvX3RvX2Jhc2VfUi5SbWRcblxuZXhlcmNpc2VfMDItaW50cm9fdG9fUi5SbWRcblxuZXhlcmNpc2VfMDNhLWludHJvX3RvX3RpZHl2ZXJzZS5SbWRcblxuZXhlcmNpc2VfMDNiLWludHJvX3RvX3RpZHl2ZXJzZS5SbWRcblxucGxvdHNcblxuUkVBRE1FLm1kXG5cbnJlc3VsdHNcblxuc2NyZWVuc2hvdHNcblxuc2NyaXB0c1xuIn0= -->
-
-[1] \00a-rstudio_guide.md
-[2] \00b-debugging_resources.md
-[3] \00c-good-scientific-coding-practices.md
-[4] \01-intro_to_base_R-live.Rmd
-[5] \01-intro_to_base_R.nb.html
-[6] \01-intro_to_base_R.Rmd
-[7] \02-intro_to_ggplot2-live.Rmd
-[8] \02-intro_to_ggplot2.nb.html
-[9] \02-intro_to_ggplot2.Rmd
-[10] \03-intro_to_tidyverse-live.Rmd
-[11] \03-intro_to_tidyverse.nb.html
-[12] \03-intro_to_tidyverse.Rmd
-[13]
-[14]
-[15] _01-intro_to_base_R.Rmd
-[16] _02-intro_to_R.Rmd
-[17] _03a-intro_to_tidyverse.Rmd
-[18] _03b-intro_to_tidyverse.Rmd
-[19]
-[20] .md
-[21]
-[22]
-[23]
-00a-rstudio_guide.md
-00b-debugging_resources.md
-00c-good-scientific-coding-practices.md
-01-intro_to_base_R-live.Rmd
-01-intro_to_base_R.nb.html
-01-intro_to_base_R.Rmd
-02-intro_to_ggplot2-live.Rmd
-02-intro_to_ggplot2.nb.html
-02-intro_to_ggplot2.Rmd
-03-intro_to_tidyverse-live.Rmd
-03-intro_to_tidyverse.nb.html
-03-intro_to_tidyverse.Rmd
-data
-diagrams
-exercise_01-intro_to_base_R.Rmd
-exercise_02-intro_to_R.Rmd
-exercise_03a-intro_to_tidyverse.Rmd
-exercise_03b-intro_to_tidyverse.Rmd
-plots
-README.md
-results
-screenshots
-scripts
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-This shows us there is no folder called "results" yet.
-
-If we want to more pointedly look for "results" in our working directory we can
-use the `dir.exists()` function.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBDaGVjayBpZiB0aGUgcmVzdWx0cyBkaXJlY3RvcnkgZXhpc3RzXG5kaXIuZXhpc3RzKFxccmVzdWx0c1xcKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# Check if the results directory exists
-dir.exists(\results\)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiWzFdIFRSVUVcbiJ9 -->
-
-[1] TRUE
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-If the above says `FALSE` that means we will need to create a `results`
-directory using the function `dir.create()`.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBNYWtlIGEgZGlyZWN0b3J5IHdpdGhpbiB0aGUgd29ya2luZyBkaXJlY3RvcnkgY2FsbGVkICdyZXN1bHRzJ1xuZGlyLmNyZWF0ZShcXHJlc3VsdHNcXClcbmBgYFxuYGBgIn0= -->
-
-```r
-```r
-# Make a directory within the working directory called 'results'
-dir.create(\results\)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-After creating the results directory above, let's re-run `dir.exists()` to see
-if now it exists.
+
+
+ [1] "00a-rstudio_guide.md"
+ [2] "00b-debugging_resources.md"
+ [3] "00c-good-scientific-coding-practices.md"
+ [4] "01-intro_to_base_R-live.Rmd"
+ [5] "01-intro_to_base_R.nb.html"
+ [6] "01-intro_to_base_R.Rmd"
+ [7] "02-intro_to_ggplot2-live.Rmd"
+ [8] "02-intro_to_ggplot2.nb.html"
+ [9] "02-intro_to_ggplot2.Rmd"
+[10] "03-intro_to_tidyverse-live.Rmd"
+[11] "03-intro_to_tidyverse.nb.html"
+[12] "03-intro_to_tidyverse.Rmd"
+[13] "data"
+[14] "diagrams"
+[15] "exercise_01-intro_to_base_R.Rmd"
+[16] "exercise_02-intro_to_R.Rmd"
+[17] "exercise_03a-intro_to_tidyverse.Rmd"
+[18] "exercise_03b-intro_to_tidyverse.Rmd"
+[19] "plots"
+[20] "README.md"
+[21] "results"
+[22] "screenshots"
+[23] "scripts"
+
+
+00a-rstudio_guide.md
+00b-debugging_resources.md
-<!-- rnb-text-end -->
+00c-good-scientific-coding-practices.md
+01-intro_to_base_R-live.Rmd
-<!-- rnb-chunk-begin -->
+01-intro_to_base_R.nb.html
+01-intro_to_base_R.Rmd
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBSZS1jaGVjayBpZiB0aGUgcmVzdWx0cyBkaXJlY3RvcnkgZXhpc3RzXG5kaXIuZXhpc3RzKFxccmVzdWx0c1xcKVxuYGBgXG5gYGAifQ== -->
+02-intro_to_ggplot2-live.Rmd
-```r
-```r
-# Re-check if the results directory exists
-dir.exists(\results\)
-
-<!-- rnb-source-end -->
+02-intro_to_ggplot2.nb.html
-<!-- rnb-output-begin eyJkYXRhIjoiWzFdIFRSVUVcbiJ9 -->
-
-[1] TRUE
-
+02-intro_to_ggplot2.Rmd
+03-intro_to_tidyverse-live.Rmd
-<!-- rnb-output-end -->
+03-intro_to_tidyverse.nb.html
-<!-- rnb-chunk-end -->
+03-intro_to_tidyverse.Rmd
+data
-<!-- rnb-text-begin -->
+diagrams
+exercise_01-intro_to_base_R.Rmd
-We can use the output of `dir.exists()` to automatically create or hold off on
-creating a directory by putting this together in an `if` statement like below.
-An `if` statement has two main parts:
-First, the test, which is an expression that will result in either `TRUE` or `FALSE`.
-This is put in parenthesis immediately after the `if`.
-The next part is the body, which is the commands that will be executed *if* the
-test is `TRUE`.
-These are placed within a set of braces `{ }`.
-Note that we used an exclamation point in the test to signify that we want a
-directory to be created only *if* `dir.exists(results)` is NOT equal to `TRUE`.
+exercise_02-intro_to_R.Rmd
+exercise_03a-intro_to_tidyverse.Rmd
-<!-- rnb-text-end -->
+exercise_03b-intro_to_tidyverse.Rmd
+plots
-<!-- rnb-chunk-begin -->
+README.md
+results
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBJZiAncmVzdWx0cycgZGlyZWN0b3J5IGRvZXNuJ3QgZXhpc3QuLi5cbmlmICghZGlyLmV4aXN0cyhcXHJlc3VsdHNcXCkpIHtcbiAgIyAuLi4gY3JlYXRlIGEgJ3Jlc3VsdHMnIGRpcmVjdG9yeVxuICBkaXIuY3JlYXRlKFxccmVzdWx0c1xcKVxufVxuYGBgXG5gYGAifQ== -->
+screenshots
-```r
-```r
-# If 'results' directory doesn't exist...
-if (!dir.exists(\results\)) {
+scripts
+
+
+
+This shows us there is no folder called “results” yet.
+If we want to more pointedly look for “results” in our working directory we can use the dir.exists()
function.
+
+
+
+# Check if the results directory exists
+dir.exists("results")
+
+
+[1] TRUE
+
+
+
+If the above says FALSE
that means we will need to create a results
directory using the function dir.create()
.
+
+
+
+# Make a directory within the working directory called 'results'
+dir.create("results")
+
+
+Warning in dir.create("results"): 'results' already exists
+
+
+
+After creating the results directory above, let’s re-run dir.exists()
to see if now it exists.
+
+
+
+# Re-check if the results directory exists
+dir.exists("results")
+
+
+[1] TRUE
+
+
+
+We can use the output of dir.exists()
to automatically create or hold off on creating a directory by putting this together in an if
statement like below. An if
statement has two main parts: First, the test, which is an expression that will result in either TRUE
or FALSE
. This is put in parentheses immediately after the if
. The next part is the body, which is the commands that will be executed if the test is TRUE
. These are placed within a set of braces { }
. Note that we used an exclamation point in the test to signify that we want a directory to be created only if dir.exists("results")
is NOT equal to TRUE
.
+
+
+
+# If 'results' directory doesn't exist...
+if (!dir.exists("results")) {
# ... create a 'results' directory
- dir.create(\results\)
+ dir.create("results")
}
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-The `dir.exists()` function will not work on files themselves.
-In that case, there is an analogous function called `file.exists()`.
-
-Try using the `file.exists()` function to see if the file
-`gene_results_GSE44971.tsv` exists in the current directory.
-Use the code chunk we set up for you below.
-Note that in our notebooks (and sometimes elsewhere), wherever you see a
-`<FILL_IN_THE_BLANK>` like in the chunk below, that is meant for you to replace
-(including the angle brackets) with the correct phrase before you run the chunk
-(otherwise you will get an error).
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjpbIiMgUmVwbGFjZSB0aGUgPFBVVF9GSUxFX05BTUVfSEVSRT4gd2l0aCB0aGUgbmFtZSBvZiB0aGUgZmlsZSB5b3UgYXJlIGxvb2tpbmcgZm9yIiwiIyBSZW1lbWJlciB0byB1c2UgcXVvdGVzIHRvIG1ha2UgaXQgYSBjaGFyYWN0ZXIgc3RyaW5nIiwiZmlsZS5leGlzdHMoPFBVVF9GSUxFX05BTUVfSEVSRT4pIl19 -->
-
-```r
-# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
+
+
+
+The dir.exists()
function will not work on files themselves. In that case, there is an analogous function called file.exists()
.
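As a small illustration of the difference between the two functions, the sketch below writes a throwaway file into R's temporary directory and checks it both ways. The file name `example_file.txt` is made up for this demonstration and is not part of the training materials.

```r
# Hypothetical file in the session's temporary directory (not a course file)
example_path <- file.path(tempdir(), "example_file.txt")
writeLines("hello", example_path)

# file.exists() is TRUE for the file we just wrote
file_check <- file.exists(example_path)

# dir.exists() is FALSE for the same path, because it is a file, not a directory
dir_check <- dir.exists(example_path)

# dir.exists() is TRUE for the temporary directory itself
tmp_check <- dir.exists(tempdir())

# Clean up the throwaway file
file.remove(example_path)
```

Printing `file_check`, `dir_check`, and `tmp_check` afterwards shows which test succeeded for which kind of path.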
+Try using the file.exists()
function to see if the file gene_results_GSE44971.tsv
exists in the current directory. Use the code chunk we set up for you below. Note that in our notebooks (and sometimes elsewhere), wherever you see a <FILL_IN_THE_BLANK>
like in the chunk below, that is meant for you to replace (including the angle brackets) with the correct phrase before you run the chunk (otherwise you will get an error).
+
+
+
+# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
# Remember to use quotes to make it a character string
file.exists(<PUT_FILE_NAME_HERE>)
@@ -655,362 +535,192 @@ Read a TSV file
Declare the name of the directory where we will read in the data.
-
-```r
-data_dir <- \data\
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Although base R has functions to read in data files, the functions in the
-`readr` package (part of the tidyverse) are faster and more straightforward
-to use so we are going to use those here.
-Because the file we are reading in is a TSV (tab separated values) file we will
-be using the `read_tsv` function.
-There are analogous functions for CSV (comma separated values) files
-(`read_csv()`) and other files types.
-
-## Read in the differential expression analysis results file
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc3RhdHNfZGYgPC0gcmVhZHI6OnJlYWRfdHN2KFxuICBmaWxlLnBhdGgoZGF0YV9kaXIsXG4gICAgICAgICAgICBcXGdlbmVfcmVzdWx0c19HU0U0NDk3MS50c3ZcXClcbiAgKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-stats_df <- readr::read_tsv(
+
+data_dir <- "data"
+
+
+
+Although base R has functions to read in data files, the functions in the readr
package (part of the tidyverse) are faster and more straightforward to use so we are going to use those here. Because the file we are reading in is a TSV (tab separated values) file we will be using the read_tsv
function. There are analogous functions for CSV (comma separated values) files (read_csv()
) and other file types.
+
+
+
+Read in the differential expression analysis results file
+
+
+
+stats_df <- readr::read_tsv(
file.path(data_dir,
- \gene_results_GSE44971.tsv\)
+ "gene_results_GSE44971.tsv")
)
+
+
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Following the template of the previous chunk, use this chunk to read in the file
-`GSE44971.tsv` that is in the `data` folder and save it in the variable `gene_df`.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBVc2UgdGhpcyBjaHVuayB0byByZWFkIGluIGRhdGEgZnJvbSB0aGUgZmlsZSBgR1NFNDQ5NzEudHN2YFxuZ2VuZV9kZiA8LSByZWFkcjo6cmVhZF90c3YoXG4gIGZpbGUucGF0aChkYXRhX2RpcixcbiAgICAgICAgICAgIFxcR1NFNDQ5NzEudHN2XFwpXG4gIClcbmBgYFxuYGBgIn0= -->
-
-```r
-```r
-# Use this chunk to read in data from the file `GSE44971.tsv`
+── Column specification ────────────────────────────────────────────────────────
+cols(
+ ensembl_id = col_character(),
+ gene_symbol = col_character(),
+ contrast = col_character(),
+ log_fold_change = col_double(),
+ avg_expression = col_double(),
+ t_statistic = col_double(),
+ p_value = col_double(),
+ adj_p_value = col_double()
+)
+
+
+
+Following the template of the previous chunk, use this chunk to read in the file GSE44971.tsv
that is in the data
folder and save it in the variable gene_df
.
+
+
+
+# Use this chunk to read in data from the file `GSE44971.tsv`
gene_df <- readr::read_tsv(
file.path(data_dir,
- \GSE44971.tsv\)
+ "GSE44971.tsv")
)
+
+
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Use this chunk to explore what `gene_df` looks like.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBFeHBsb3JlIGBnZW5lX2RmYFxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# Explore `gene_df`
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-What information is contained in `gene_df`?
-
-## dplyr pipes
-
-One nifty feature of the tidyverse is pipes: `%>%`
-These handy things allows you to funnel the result of one expression to the next,
-making your code a little more streamlined.
-
-For example, the output from this:
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZmlsdGVyKHN0YXRzX2RmLCBjb250cmFzdCA9PSBcXG1hbGVfZmVtYWxlXFwpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-filter(stats_df, contrast == \male_female\)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-...is the same as the output from this:
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc3RhdHNfZGYgJT4lIGZpbHRlcihjb250cmFzdCA9PSBcXG1hbGVfZmVtYWxlXFwpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-stats_df %>% filter(contrast == \male_female\)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-This can make your code cleaner and easier to follow a series of related
-commands.
-Let's look at an example with our stats of of how the same
-functions look with or without pipes:
-
-*Example 1:* without pipes:
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc3RhdHNfYXJyYW5nZWQgPC0gYXJyYW5nZShzdGF0c19kZiwgdF9zdGF0aXN0aWMpXG5zdGF0c19maWx0ZXJlZCA8LSBmaWx0ZXIoc3RhdHNfYXJyYW5nZWQsIGF2Z19leHByZXNzaW9uID4gNTApXG5zdGF0c19ub3BpcGUgPC0gc2VsZWN0KHN0YXRzX2ZpbHRlcmVkLCBjb250cmFzdCwgbG9nX2ZvbGRfY2hhbmdlLCBwX3ZhbHVlKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-stats_arranged <- arrange(stats_df, t_statistic)
+── Column specification ────────────────────────────────────────────────────────
+cols(
+ .default = col_double(),
+ Gene = col_character()
+)
+ℹ Use `spec()` for the full column specifications.
+
+
+
+Use this chunk to explore what gene_df
looks like.
+
+
+
+# Explore `gene_df`
+
+
+
+What information is contained in gene_df
?
+
+
+magrittr
pipes
+One nifty feature of the tidyverse is pipes: %>%
These handy things, which come from the magrittr
package, allow you to funnel the result of one expression to the next, making your code a little more streamlined.
+For example, the output from this:
+
+
+
+filter(stats_df, contrast == "male_female")
+
+
+
+
+
+
+…is the same as the output from this:
+
+
+
+stats_df %>% filter(contrast == "male_female")
+
+
+
+
+
+
+This can make your code cleaner and make a series of related commands easier to follow. Let’s look at an example, using our stats data frame, of how the same functions look with or without pipes:
+Example 1: without pipes:
+
+
+
+stats_arranged <- arrange(stats_df, t_statistic)
stats_filtered <- filter(stats_arranged, avg_expression > 50)
stats_nopipe <- select(stats_filtered, contrast, log_fold_change, p_value)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-UGH, we have to keep track of all of those different intermediate data frames
-and type their names so many times here!
-We could maybe streamline things by using the same variable name at each stage,
-but even then there is a lot of extra typing, and it is easy to get confused
-about what has been done where.
-It's annoying and makes it harder for people to read.
-
-*Example 2:* Same result as 1 but with pipes!
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBFeGFtcGxlIG9mIHRoZSBzYW1lIG1vZGlmaWNhdGlvbnMgYXMgYWJvdmUgYnV0IHdpdGggcGlwZXMhXG5zdGF0c19waXBlICA8LSBzdGF0c19kZiAlPiVcbiAgICAgICAgICAgICAgIGFycmFuZ2UodF9zdGF0aXN0aWMpICU+JVxuICAgICAgICAgICAgICAgZmlsdGVyKGF2Z19leHByZXNzaW9uID4gNTApICU+JVxuICAgICAgICAgICAgICAgc2VsZWN0KGNvbnRyYXN0LCBsb2dfZm9sZF9jaGFuZ2UsIHBfdmFsdWUpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-# Example of the same modifications as above but with pipes!
+
+
+
+UGH, we have to keep track of all of those different intermediate data frames and type their names so many times here! We could maybe streamline things by using the same variable name at each stage, but even then there is a lot of extra typing, and it is easy to get confused about what has been done where. It’s annoying and makes it harder for people to read.
+Example 2: Same result as 1 but with pipes!
+
+
+
+# Example of the same modifications as above but with pipes!
stats_pipe <- stats_df %>%
arrange(t_statistic) %>%
filter(avg_expression > 50) %>%
select(contrast, log_fold_change, p_value)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-What the `%>%` (pipe) is doing here is feeding the result of the expression on
-its left into the first argument of the next function (to its right, or on the
-next line here).
-We can then skip that first argument (the data in these cases), and move right
-on to the part we care about at that step: what we are arranging, filtering, or
-selecting in this case.
-
-Let's double check that these are the same by using the function, `all.equal()`.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuYWxsLmVxdWFsKHN0YXRzX25vcGlwZSwgc3RhdHNfcGlwZSlcbmBgYFxuYGBgIn0= -->
-
-```r
-```r
-all.equal(stats_nopipe, stats_pipe)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiWzFdIFRSVUVcbiJ9 -->
-
-[1] TRUE
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-`all.equal()` is letting us know that these two objects are the same.
-
-Now that hopefully you are convinced that the tidyverse can help you make your
-code neater and easier to use and read, let's go through some of the popular
-tidyverse functions and so we can create pipelines like this.
-
-## Common tidyverse functions
-
-Let's say we wanted to filter this gene expression dataset to particular sample
-groups.
-In order to do this, we would use the function `filter()` as well as a logic
-statement (usually one that refers to a column or columns in the data frame).
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBIZXJlIGxldCdzIGZpbHRlciBzdGF0c19kZiB0byB0aGUgZ2VuZV9zeW1ib2wgXFxTTkNBXFxcbnN0YXRzX2RmICU+JSBcbiAgZmlsdGVyKGdlbmVfc3ltYm9sID09IFxcU05DQVxcKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# Here let's filter stats_df to the gene_symbol \SNCA\
+
+
+
+What the %>%
(pipe) is doing here is feeding the result of the expression on its left into the first argument of the next function (to its right, or on the next line here). We can then skip that first argument (the data in these cases), and move right on to the part we care about at that step: what we are arranging, filtering, or selecting in this case.
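To see the "first argument" rule in isolation, here is a minimal sketch using only base R functions and a made-up numeric vector (`values` is not from the course data). It assumes the `magrittr` package is installed, which is the case wherever the tidyverse is available.

```r
library(magrittr)

# A made-up vector for illustration
values <- c(4, 9, 16)

# Nested calls: read from the inside out
nested_result <- sum(sqrt(values))

# Piped calls: read from left to right
piped_result <- values %>% sqrt() %>% sum()

# The left-hand side becomes the FIRST argument; any other arguments
# are written as usual, so this is the same as round(3.14159, 2)
rounded <- 3.14159 %>% round(2)
```

Both `nested_result` and `piped_result` hold the same value; the piped version simply lists the steps in the order they happen.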
+Let’s double check that these are the same by using the function, all.equal()
.
+
+
+
+all.equal(stats_nopipe, stats_pipe)
+
+
+[1] TRUE
+
+
+
+all.equal()
is letting us know that these two objects are the same.
+Now that hopefully you are convinced that the tidyverse can help you make your code neater and easier to use and read, let’s go through some of the popular tidyverse functions so we can create pipelines like this.
+
+
+Common tidyverse functions
+Let’s say we wanted to filter this gene expression dataset to particular sample groups. In order to do this, we would use the function filter()
as well as a logical statement (usually one that refers to a column or columns in the data frame).
+
+
+
+# Here let's filter stats_df to the gene_symbol "SNCA"
stats_df %>%
- filter(gene_symbol == \SNCA\)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-We can use `filter()` similarly for numeric statements.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBIZXJlIGxldCdzIGZpbHRlciB0aGUgZGF0YSB0byByb3dzIHdpdGggYXZlcmFnZSBleHByZXNzaW9uIHZhbHVlcyBhYm92ZSA1MFxuc3RhdHNfZGYgJT4lXG4gIGZpbHRlcihhdmdfZXhwcmVzc2lvbiA+IDUwKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# Here let's filter the data to rows with average expression values above 50
+ filter(gene_symbol == "SNCA")
+
+
+
+
+
+
+We can use filter()
similarly for numeric statements.
+
+
+
+# Here let's filter the data to rows with average expression values above 50
stats_df %>%
filter(avg_expression > 50)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-We can apply multiple filters at once, which will require all of them to be
-satisfied for every row in the results:
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBmaWx0ZXIgdG8gaGlnaGx5IGV4cHJlc3NlZCBnZW5lcyB3aXRoIGNvbnRyYXN0IFxcbWFsZV9mZW1hbGVcXFxuc3RhdHNfZGYgJT4lXG4gIGZpbHRlcihjb250cmFzdCA9PSBcXG1hbGVfZmVtYWxlXFwsIFxuICAgICAgICAgYXZnX2V4cHJlc3Npb24gPiA1MClcbmBgYFxuYGBgIn0= -->
-
-```r
-```r
-# filter to highly expressed genes with contrast \male_female\
+
+
+
+
+
+
+We can apply multiple filters at once, which will require all of them to be satisfied for every row in the results:
+
+
+
+# filter to highly expressed genes with contrast "male_female"
stats_df %>%
- filter(contrast == \male_female\,
+ filter(contrast == "male_female",
avg_expression > 50)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-When we are filtering, the `%in%` operator can come in handy if we have multiple
-items we would like to match.
-Let's take a look at what using `%in%` does.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjpbImdlbmVzX29mX2ludGVyZXN0IDwtIGMoXCJTTkNBXCIsIFwiQ0RLTjFBXCIpIiwic3RhdHNfZGYkZ2VuZV9zeW1ib2wgJWluJSBnZW5lc19vZl9pbnRlcmVzdCJdfQ== -->
-
-```r
-genes_of_interest <- c("SNCA", "CDKN1A")
+
+
+
+
+
+
+When we are filtering, the %in%
operator can come in handy if we have multiple items we would like to match. Let’s take a look at what using %in%
does.
+
+
+
+genes_of_interest <- c("SNCA", "CDKN1A")
stats_df$gene_symbol %in% genes_of_interest
@@ -1018,627 +728,314 @@ Read a TSV file
%in%
returns a logical vector that now we can use in dplyr::filter
.
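For a version that is easy to check by eye, the sketch below applies `%in%` to a short, made-up vector of gene symbols standing in for `stats_df$gene_symbol`.

```r
# Hypothetical gene symbols, standing in for stats_df$gene_symbol
gene_symbols <- c("TP53", "SNCA", "BRCA1", "CDKN1A", "SNCA")
genes_of_interest <- c("SNCA", "CDKN1A")

# One TRUE/FALSE per element of the left-hand vector
matches <- gene_symbols %in% genes_of_interest
matches
# FALSE TRUE FALSE TRUE TRUE

# The logical vector can be used to subset, which is what filter() does per row
matched_symbols <- gene_symbols[matches]
```

Note that the result always has the same length as the vector on the left of `%in%`, which is why it lines up with the rows of a data frame.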
-
-```r
-# filter to genes of interest
+
+# filter to genes of interest
stats_df %>%
- filter(gene_symbol %in% c(\SNCA\, \CDKN1A\))
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Let's return to our first `filter()` and build on to it.
-This time, let's keep only some of the columns from the data frame using the
-`select()` function.
-Let's also save this as a new data frame called `stats_filtered_df`.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBmaWx0ZXIgdG8gaGlnaGx5IGV4cHJlc3NlZCBcXG1hbGVfZmVtYWxlXFxcbiMgYW5kIHNlbGVjdCBnZW5lX3N5bWJvbCwgbG9nX2ZvbGRfY2hhbmdlIGFuZCB0X3N0YXRpc3RpY1xuc3RhdHNfZmlsdGVyZWRfZGYgPC0gc3RhdHNfZGYgJT4lXG4gIGZpbHRlcihjb250cmFzdCA9PSBcXG1hbGVfZmVtYWxlXFwsIFxuICAgICAgICAgYXZnX2V4cHJlc3Npb24gPiA1MCkgJT4lXG4gIHNlbGVjdChsb2dfZm9sZF9jaGFuZ2UsIHRfc3RhdGlzdGljKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# filter to highly expressed \male_female\
+ filter(gene_symbol %in% c("SNCA", "CDKN1A"))
+
+
+
+
+
+
+Let’s return to our first filter()
and build on to it. This time, let’s keep only some of the columns from the data frame using the select()
function. Let’s also save this as a new data frame called stats_filtered_df
.
+
+
+
+# filter to highly expressed genes in the "male_female" contrast
# and select log_fold_change and t_statistic
stats_filtered_df <- stats_df %>%
- filter(contrast == \male_female\,
+ filter(contrast == "male_female",
avg_expression > 50) %>%
select(log_fold_change, t_statistic)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Let's say we wanted to arrange this dataset so that the genes are arranged by
-the smallest p values to the largest.
-In order to do this, we would use the function `arrange()` as well as the column
-we would like to sort by (in this case `p_value`).
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc3RhdHNfZGYgJT4lIFxuICBhcnJhbmdlKHBfdmFsdWUpIFxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-stats_df %>%
+
+
+
+Let’s say we wanted to arrange this dataset so that the genes are ordered from the smallest p value to the largest. In order to do this, we would use the function arrange()
as well as the column we would like to sort by (in this case p_value
).
+
+
+
+stats_df %>%
arrange(p_value)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-What if we want to sort from largest to smallest?
-Like if we want to see the genes with the highest average expression?
-We can use the same function, but instead use the `desc()` function and now we
-are using `avg_expression` column.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBhcnJhbmdlIGRlc2NlbmRpbmcgYnkgYXZnX2V4cHJlc3Npb25cbnN0YXRzX2RmICU+JVxuICBhcnJhbmdlKGRlc2MoYXZnX2V4cHJlc3Npb24pKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# arrange descending by avg_expression
+
+
+
+
+
+
+What if we want to sort from largest to smallest, for example to see the genes with the highest average expression? We can use the same function, but wrap the column name in desc()
, this time sorting by the avg_expression
column.
+
+
+
+# arrange descending by avg_expression
stats_df %>%
arrange(desc(avg_expression))
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-What if we would like to create a new column of values?
-For that we use `mutate()` function.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc3RhdHNfZGYgJT4lIFxuICBtdXRhdGUobG9nMTBfcF92YWx1ZSA9IC1sb2cxMChwX3ZhbHVlKSlcbmBgYFxuYGBgIn0= -->
-
-```r
-```r
-stats_df %>%
+
+
+
+
+
+
+What if we would like to create a new column of values? For that we use the mutate()
function.
+
+
+
+stats_df %>%
mutate(log10_p_value = -log10(p_value))
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-What if we want to obtain summary statistics for a column or columns?
-The `summarize` function allows us to calculate summary statistics for a column.
-Here we will use summarize to obtain an mean log folder change over all the
-genes, and its standard deviation.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc3RhdHNfZGYgJT4lIFxuICBzdW1tYXJpemUobWVhbihsb2dfZm9sZF9jaGFuZ2UpLFxuICAgICAgICAgICAgc2QobG9nX2ZvbGRfY2hhbmdlKSlcbmBgYFxuYGBgIn0= -->
-
-```r
-```r
-stats_df %>%
- summarize(mean(log_fold_change),
- sd(log_fold_change))
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-What if we'd like to obtain a summary statistics but have them for various
-groups?
-Conveniently named, there's a function called `group_by()` that seamlessly
-allows us to do this.
-Also note that `group_by()` allows us to group by multiple variables at a time
-if you want to.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc3RhdHNfc3VtbWFyeV9kZiA8LSBzdGF0c19kZiAlPiVcbiAgICAgIGdyb3VwX2J5KGNvbnRyYXN0KSAlPiUgXG4gICAgICBzdW1tYXJpemUobWVhbihsb2dfZm9sZF9jaGFuZ2UpLFxuICAgICAgICAgICAgICAgIHNkKGxvZ19mb2xkX2NoYW5nZSkpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-stats_summary_df <- stats_df %>%
- group_by(contrast) %>%
- summarize(mean(log_fold_change),
- sd(log_fold_change))
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Let's look at a preview of what we made:
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc3RhdHNfc3VtbWFyeV9kZlxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-stats_summary_df
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Here we have the mean log fold change expression per each contrast we made.
-
-## A brief intro to the `apply` family of functions
-
-In base R, the `apply` family of functions can be an alternative methods for
-performing transformations across a data frame, matrix or other object structures.
-
-One of this family is (shockingly) the function `apply()`, which operates on
-matrices.
-
-A matrix is similar to a data frame in that it is a rectangular table of data,
-but it has an additional constraint:
-rather than each column having a type, ALL data in a matrix has the same type.
-
-The first argument to `apply()` is the data object we want to work on.
-The third argument is the function we will apply to each row or column of the
-data object.
-The second argument in specifies whether we are applying the function
-across rows or across columns (1 for rows, 2 for columns).
-
-Remember that `gene_df` is a gene x sample gene expression data frame that has
-columns of two different types, character and numeric.
-Converting it to a matrix will require us to make them all the same type.
-We can coerce it into a matrix using `as.matrix()`, in which case R will
-pick a type that it can convert everything to.
-What does it choose?
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBDb2VyY2UgYGdlbmVfZGZgIGludG8gYSBtYXRyaXhcbmdlbmVfbWF0cml4IDwtIGFzLm1hdHJpeChnZW5lX2RmKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# Coerce `gene_df` into a matrix
+
+
+
+
+
+
+What if we want to obtain summary statistics for a column or columns? The summarize
function allows us to calculate summary statistics for a column. Here we will use summarize to obtain the mean log fold change over all the genes, and its standard deviation.
+
+
+
+stats_df %>%
+ summarize(mean(log_fold_change),
+ sd(log_fold_change))
+
+
+
+
+
+
+What if we’d like to obtain summary statistics separately for each of several groups? Conveniently named, there’s a function called group_by()
that seamlessly allows us to do this. Also note that group_by()
allows us to group by multiple variables at a time if you want to.
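As a hedged sketch of grouping by more than one variable, the example below uses a small made-up data frame. `toy_stats`, its `direction` column, and all of its values are invented for illustration and are not taken from the GSE44971 results. It also names the summary columns (`mean_lfc`, `n_genes`), which is optional but makes the output easier to refer to later.

```r
library(dplyr)

# Made-up values for illustration only
toy_stats <- data.frame(
  contrast = c("a_b", "a_b", "a_b", "c_d", "c_d", "c_d"),
  direction = c("up", "up", "down", "up", "down", "down"),
  log_fold_change = c(1, 3, -2, 2, -1, -3)
)

# One output row per combination of contrast and direction
toy_summary <- toy_stats %>%
  group_by(contrast, direction) %>%
  summarize(mean_lfc = mean(log_fold_change),
            n_genes = n())

toy_summary
```

Depending on the dplyr version, `summarize()` may print a message about how the result is grouped; that message is informational and does not indicate a problem.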
+
+
+
+stats_summary_df <- stats_df %>%
+ group_by(contrast) %>%
+ summarize(mean(log_fold_change),
+ sd(log_fold_change))
+
+
+
+Let’s look at a preview of what we made:
+
+
+
+stats_summary_df
+
+
+
+
+
+
+Here we have the mean log fold change, and its standard deviation, for each contrast.
+
+
+A brief intro to the apply
family of functions
+In base R, the apply
family of functions can be an alternative method for performing transformations across a data frame, matrix or other object structures.
+One member of this family is (shockingly) the function apply()
, which operates on matrices.
+A matrix is similar to a data frame in that it is a rectangular table of data, but it has an additional constraint: rather than each column having a type, ALL data in a matrix has the same type.
+The first argument to apply()
is the data object we want to work on. The second argument specifies whether we are applying the function across rows or across columns (1 for rows, 2 for columns). The third argument is the function we will apply to each row or column of the data object.
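The three arguments can be seen on a matrix small enough to check by hand. `toy_matrix` and its row and column names below are made up for this sketch.

```r
# A 2 x 3 matrix; matrix() fills column by column, so
# gene_a holds 1, 3, 5 and gene_b holds 2, 4, 6
toy_matrix <- matrix(
  1:6,
  nrow = 2,
  dimnames = list(c("gene_a", "gene_b"), c("s1", "s2", "s3"))
)

# Second argument 1: apply sum() to each row (one value per gene)
row_sums <- apply(toy_matrix, 1, sum)

# Second argument 2: apply sum() to each column (one value per sample)
col_sums <- apply(toy_matrix, 2, sum)
```

The length of the result matches the number of rows when the second argument is 1 and the number of columns when it is 2.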
+Remember that gene_df
is a gene x sample gene expression data frame that has columns of two different types, character and numeric. Converting it to a matrix will require us to make them all the same type. We can coerce it into a matrix using as.matrix()
, in which case R will pick a type that it can convert everything to. What does it choose?
+
+
+
+# Coerce `gene_df` into a matrix
gene_matrix <- as.matrix(gene_df)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBFeHBsb3JlIHRoZSBzdHJ1Y3R1cmUgb2YgdGhlIGBnZW5lX21hdHJpeGAgb2JqZWN0XG5zdHIoZ2VuZV9tYXRyaXgpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-# Explore the structure of the `gene_matrix` object
+
+
+
+
+
+
+# Explore the structure of the `gene_matrix` object
str(gene_matrix)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiIGNociBbMToyMDA1NiwgMTo1OV0gXFxFTlNHMDAwMDAwMDAwMDNcXCBcXEVOU0cwMDAwMDAwMDAwNVxcIFxcRU5TRzAwMDAwMDAwNDE5XFwgLi4uXG4gLSBhdHRyKCosIFxcZGltbmFtZXNcXCk9TGlzdCBvZiAyXG4gIC4uJCA6IE5VTExcbiAgLi4kIDogY2hyIFsxOjU5XSBcXEdlbmVcXCBcXEdTTTEwOTQ4MTRcXCBcXEdTTTEwOTQ4MTVcXCBcXEdTTTEwOTQ4MTZcXCAuLi5cbiJ9 -->
-
-chr [1:20056, 1:59] … - attr(*, )=List of 2 ..$ : NULL ..$ : chr [1:59] …
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-While that worked, it is rare that we want numbers converted to text, so we are
-going to select only the columns with numeric values before converting it to a matrix.
-We can do this most easily by removing the first column, which contains the gene names
-stored as character values.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBMZXQncyBzYXZlIGEgbmV3IG1hdHJpeCBvYmplY3QgbmFtZXMgYGdlbmVfbnVtX21hdHJpeGAgY29udGFpbmluZyBvbmx5XG4jIHRoZSBudW1lcmljIHZhbHVlc1xuZ2VuZV9udW1fbWF0cml4IDwtIGFzLm1hdHJpeChnZW5lX2RmWywgLTFdKVxuXG4jIEV4cGxvcmUgdGhlIHN0cnVjdHVyZSBvZiB0aGUgYGdlbmVfbnVtX21hdHJpeGAgb2JqZWN0XG5zdHIoZ2VuZV9udW1fbWF0cml4KVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# Let's save a new matrix object names `gene_num_matrix` containing only
+
+
+ chr [1:20056, 1:59] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" ...
+ - attr(*, "dimnames")=List of 2
+ ..$ : NULL
+ ..$ : chr [1:59] "Gene" "GSM1094814" "GSM1094815" "GSM1094816" ...
+
+
+
+While that worked, it is rare that we want numbers converted to text, so we are going to select only the columns with numeric values before converting it to a matrix. We can do this most easily by removing the first column, which contains the gene names stored as character values.
+
+
+
+# Let's save a new matrix object named `gene_num_matrix` containing only
# the numeric values
gene_num_matrix <- as.matrix(gene_df[, -1])
# Explore the structure of the `gene_num_matrix` object
str(gene_num_matrix)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiIG51bSBbMToyMDA1NiwgMTo1OF0gOS41OTUxIC0wLjA0MzYgOC41MjQ2IDEuNjAxMyAwLjYxODkgLi4uXG4gLSBhdHRyKCosIFxcZGltbmFtZXNcXCk9TGlzdCBvZiAyXG4gIC4uJCA6IE5VTExcbiAgLi4kIDogY2hyIFsxOjU4XSBcXEdTTTEwOTQ4MTRcXCBcXEdTTTEwOTQ4MTVcXCBcXEdTTTEwOTQ4MTZcXCBcXEdTTTEwOTQ4MTdcXCAuLi5cbiJ9 -->
-
-num [1:20056, 1:58] 9.5951 -0.0436 8.5246 1.6013 0.6189 … - attr(*, )=List of 2 ..$ : NULL ..$ : chr [1:58] …
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Why do we have a `[, -1]` after `gene_df` in the above chunk?
-
-Now that the matrix is all numbers, we can do things like calculate the column
-or row statistics using `apply()`.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBDYWxjdWxhdGUgcm93IG1lYW5zXG5nZW5lX21lYW5zIDwtIGFwcGx5KGdlbmVfbnVtX21hdHJpeCwgMSwgbWVhbikgIyBOb3RpY2Ugd2UgYXJlIHVzaW5nIDEgaGVyZVxuXG4jIEhvdyBsb25nIHdpbGwgYGdlbmVfbWVhbnNgIGJlPyBcbmxlbmd0aChnZW5lX21lYW5zKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# Calculate row means
+
+
+ num [1:20056, 1:58] 9.5951 -0.0436 8.5246 1.6013 0.6189 ...
+ - attr(*, "dimnames")=List of 2
+ ..$ : NULL
+ ..$ : chr [1:58] "GSM1094814" "GSM1094815" "GSM1094816" "GSM1094817" ...
+
+
+
+Why do we have a [, -1]
after gene_df
in the above chunk?
+Now that the matrix is all numbers, we can do things like calculate the column or row statistics using apply()
.
+
+
+
+# Calculate row means
gene_means <- apply(gene_num_matrix, 1, mean) # Notice we are using 1 here
# How long will `gene_means` be?
length(gene_means)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiWzFdIDIwMDU2XG4ifQ== -->
-
-[1] 20056
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Note that we can obtain the same results if we select just the columns with numeric
-values from the `gene_df` data frame.
-This allows R to do the as.matrix() coercion automatically, and can be a handy shortcut
-if you have a *mostly* numeric data frame.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBDYWxjdWxhdGUgcm93IG1lYW5zIHVzaW5nIHRoZSBgZ2VuZV9kZmAgb2JqZWN0IGFmdGVyIHJlbW92aW5nIHRoZSBjaGFyYWN0ZXIgY29sdW1uXG4jIGFwcGx5KCkgY29udmVydHMgdGhpcyB0byBhIG1hdHJpeCBpbnRlcm5hbGx5XG5nZW5lX21lYW5zX2Zyb21fZGYgPC0gYXBwbHkoZ2VuZV9kZlssIC0xXSwgMSwgbWVhbikgXG5cbiMgTGV0J3MgY2hlY2sgdGhhdCB0aGUgdHdvIGdlbmUgbWVhbnMgb2JqZWN0cyBhcmUgZXF1YWxcbmFsbC5lcXVhbChnZW5lX21lYW5zLCBnZW5lX21lYW5zX2Zyb21fZGYpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-# Calculate row means using the `gene_df` object after removing the character column
+
+
+[1] 20056
+
+
+
+Note that we can obtain the same results if we select just the columns with numeric values from the gene_df
data frame. This allows R to do the as.matrix() coercion automatically, and can be a handy shortcut if you have a mostly numeric data frame.
+
+
+
+# Calculate row means using the `gene_df` object after removing the character column
# apply() converts this to a matrix internally
gene_means_from_df <- apply(gene_df[, -1], 1, mean)
# Let's check that the two gene means objects are equal
all.equal(gene_means, gene_means_from_df)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiWzFdIFRSVUVcbiJ9 -->
-
-[1] TRUE
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Now let's investigate the same set up, but use 2 to `apply` over the columns of
-our matrix.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBDYWxjdWxhdGUgc2FtcGxlIG1lYW5zXG5zYW1wbGVfbWVhbnMgPC0gYXBwbHkoZ2VuZV9udW1fbWF0cml4LCAyLCBtZWFuKSAjIE5vdGljZSB3ZSB1c2UgMiBoZXJlXG5cbiMgSG93IGxvbmcgd2lsbCBgc2FtcGxlX21lYW5zYCBiZT8gXG5sZW5ndGgoc2FtcGxlX21lYW5zKVxuYGBgXG5gYGAifQ== -->
-
-```r
-```r
-# Calculate sample means
+
+
+[1] TRUE
+
+
+
+Now let’s investigate the same set up, but use 2 to apply
over the columns of our matrix.
+
+
+
+# Calculate sample means
sample_means <- apply(gene_num_matrix, 2, mean) # Notice we use 2 here
# How long will `sample_means` be?
length(sample_means)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiWzFdIDU4XG4ifQ== -->
-
-[1] 58
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-We can put the gene names back into the numeric matrix object by
-assigning them as rownames.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBBc3NpZ24gdGhlIGdlbmUgbmFtZXMgZnJvbSBnZW5lX2RmJEdlbmUgdG8gdGhlIGBnZW5lX251bV9tYXRyaXhgIG9iamVjdCB1c2luZ1xuIyB0aGUgYHJvd25hbWVzKClgIGZ1bmN0aW9uXG5yb3duYW1lcyhnZW5lX251bV9tYXRyaXgpIDwtIGdlbmVfZGYkR2VuZVxuXG4jIEV4cGxvcmUgdGhlIGBnZW5lX251bV9tYXRyaXhgIG9iamVjdFxuaGVhZChnZW5lX251bV9tYXRyaXgpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-# Assign the gene names from gene_df$Gene to the `gene_num_matrix` object using
+
+
+[1] 58
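To make the role of the second argument to `apply()` concrete, here is a minimal sketch on a made-up matrix. The object `toy_matrix` and its values are hypothetical and are not part of the workshop data.

```r
# A hypothetical 3-gene x 2-sample matrix, for illustration only
toy_matrix <- matrix(
  c(1, 2, 3,
    4, 5, 6),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

# MARGIN = 1 applies the function to each row: one value per gene
toy_gene_means <- apply(toy_matrix, 1, mean)

# MARGIN = 2 applies the function to each column: one value per sample
toy_sample_means <- apply(toy_matrix, 2, mean)

length(toy_gene_means)   # same as nrow(toy_matrix)
length(toy_sample_means) # same as ncol(toy_matrix)

# For means specifically, rowMeans() and colMeans() give the same values
all.equal(toy_gene_means, rowMeans(toy_matrix))
all.equal(toy_sample_means, colMeans(toy_matrix))
```

The length of the result tells you which dimension was used: it matches the number of rows for `1` and the number of columns for `2`.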
+
+
+
+We can put the gene names back into the numeric matrix object by assigning them as rownames.
+
+
+
+# Assign the gene names from gene_df$Gene to the `gene_num_matrix` object using
# the `rownames()` function
rownames(gene_num_matrix) <- gene_df$Gene
# Explore the `gene_num_matrix` object
head(gene_num_matrix)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin  -->
-
- GSM1094814 GSM1094815 GSM1094816 GSM1094817 GSM1094818
-ENSG00000000003 9.59510150 8.4785070 12.6802129 8.677614838 10.75552946 ENSG00000000005 -0.04361838 -0.1307889 0.5345931 -0.005805166 -0.05430255 ENSG00000000419 8.52458571 9.8405725 11.9923201 9.639163317 10.03349327 ENSG00000000457 1.60130552 1.8895554 1.3747388 1.637826214 1.63562493 ENSG00000000460 0.61891285 0.5321708 0.4805598 0.617947976 0.70636135 ENSG00000000938 0.55573058 0.9942862 1.8030176 1.237317457 0.84152852 GSM1094819 GSM1094820 GSM1094821 GSM1094822 GSM1094823 ENSG00000000003 6.37470691 9.10028584 7.3546860 8.51847190 9.4216113 ENSG00000000005 -0.04831174 0.01411359 -0.1108279 -0.02625776 -0.1692604 ENSG00000000419 12.78335826 10.75552946 9.1711113 9.30210174 9.4915415 ENSG00000000457 1.46586071 1.79852032 1.6389259 1.80586748 1.5813979 ENSG00000000460 0.77224572 0.89607132 0.6740559 0.63157954 0.7480556 ENSG00000000938 3.32404606 0.81562856 0.9728617 0.77129700 1.3596402 GSM1094824 GSM1094825 GSM1094826 GSM1094827 GSM1094828 ENSG00000000003 5.0239629 7.89737460 8.1126876 7.03444640 9.6984918 ENSG00000000005 -0.1359247 -0.08624286 -0.2044839 -0.09037887 -0.1602416 ENSG00000000419 11.8835897 10.88079782 9.9174930 10.41753701 10.2695503 ENSG00000000457 1.5525410 1.92489254 1.8046590 1.50382159 1.6198069 ENSG00000000460 0.7072273 0.89196068 0.8223559 0.61970982 0.6776549 ENSG00000000938 0.8758421 0.62191515 0.7675971 0.92791338 1.0351067 GSM1094829 GSM1094830 GSM1094831 GSM1094832 GSM1094833 ENSG00000000003 13.98689230 10.5868331 7.6836223 8.3862587 11.18932763 ENSG00000000005 -0.05038705 0.3096031 -0.1551062 -0.1938994 -0.08369537 ENSG00000000419 10.12104053 9.3653576 10.2184110 9.5951015 12.58713982 ENSG00000000457 1.67741832 1.5762471 2.0663493 1.7504928 1.67321632 ENSG00000000460 0.71786250 0.4991620 0.7912559 0.8103023 0.94248698 ENSG00000000938 0.82152165 0.6556572 0.9782599 0.6568353 0.73458782 GSM1094834 GSM1094835 GSM1094836 GSM1094837 GSM1094838 ENSG00000000003 9.7562003 9.6984918 10.56891510 9.9391025 7.8738131 ENSG00000000005 -0.0437601 
-0.1120755 -0.08208306 -0.2067112 -0.1211891 ENSG00000000419 9.9799646 10.0798974 9.59510150 9.7417927 10.2105827 ENSG00000000457 1.3778594 1.5630889 1.74146532 1.6518036 1.7806133 ENSG00000000460 0.6201925 0.5570300 0.70084983 0.7137118 0.7355154 ENSG00000000938 0.7674132 1.2165228 0.60856106 0.6041645 1.0624067 GSM1094839 GSM1094840 GSM1094841 GSM1094842 GSM1094843 ENSG00000000003 8.6311353 8.58077557 9.1579585 6.3317019 10.1939387 ENSG00000000005 -0.1109070 -0.03963564 -0.1148106 -0.1137150 -0.1645040 ENSG00000000419 10.1517344 11.18932763 10.5775132 13.3760971 12.2271693 ENSG00000000457 1.7281446 1.70242287 1.6503090 1.2990208 1.5687866 ENSG00000000460 0.5284219 0.67643466 0.7353276 0.6223990 0.5646406 ENSG00000000938 0.6469084 0.88120942 0.5230414 0.9909517 0.8484174 GSM1094844 GSM1094845 GSM1094846 GSM1094847 GSM1094848 ENSG00000000003 10.44364159 9.62435722 16.05075944 6.9334508 8.55180910 ENSG00000000005 -0.03132427 -0.01400534 -0.03529112 0.1268899 -0.03857382 ENSG00000000419 10.28571947 11.47682424 9.88540523 8.9646682 11.10911330 ENSG00000000457 1.66150175 1.62312829 1.37320729 1.3402742 0.81703931 ENSG00000000460 0.66798926 0.58089659 0.46957607 0.4222455 0.29500657 ENSG00000000938 0.53726484 1.08997535 0.91859664 0.8170393 1.92489254 GSM1094849 GSM1094850 GSM1094851 GSM1094852 GSM1094853 ENSG00000000003 9.29497760 7.5027098 6.9593119 8.33588532 8.16826110 ENSG00000000005 -0.09269777 -0.1712545 0.6359455 -0.04951916 -0.05576644 ENSG00000000419 11.16941260 14.4432389 9.5329440 11.50854450 9.46361675 ENSG00000000457 1.45557567 1.4528744 1.1872029 1.31744971 1.20863565 ENSG00000000460 0.43611482 0.3623008 0.5544340 0.41423708 0.33909186 ENSG00000000938 1.18025639 1.5680560 1.6256345 0.86704140 1.48853231 GSM1094854 GSM1094855 GSM1094856 GSM1094857 GSM1094858 ENSG00000000003 9.8020077 7.92580451 8.5122426 8.5300217 6.45774124 ENSG00000000005 -0.1606687 0.02223393 -0.1340762 0.0143126 -0.02043163 ENSG00000000419 11.8011262 8.66606873 8.6484013 8.7775212 
9.88540523 ENSG00000000457 1.4582023 1.40127133 1.2111514 1.3356778 1.78911533 ENSG00000000460 0.8086875 0.44922354 0.5992390 0.3759818 0.35878489 ENSG00000000938 2.2089847 0.75746472 0.7677913 0.8519271 1.05184938 GSM1094859 GSM1094860 GSM1094861 GSM1094862 GSM1094863 ENSG00000000003 8.06834906 6.29704946 9.59510150 8.2198571 6.0207988 ENSG00000000005 -0.08355142 -0.04656716 0.03031249 -0.1646146 -0.1405284 ENSG00000000419 8.67761484 14.59843763 8.03382884 11.0881792 10.1123251 ENSG00000000457 1.45785045 1.19855713 1.31224455 1.1589472 1.5021307 ENSG00000000460 0.62352156 0.64477354 0.30380735 0.3217531 0.3130842 ENSG00000000938 0.70757412 1.78181966 0.78496368 1.5556700 0.6950488 GSM1094864 GSM1094865 GSM1094866 GSM1094867 GSM1094868 ENSG00000000003 10.60783176 10.23536609 9.031964212 7.66629540 1.06494502 ENSG00000000005 0.02602656 -0.07968734 0.007326678 -0.09030723 -0.09543977 ENSG00000000419 9.68412621 12.79993557 9.794484639 10.23536609 12.97072224 ENSG00000000457 1.47876418 1.46084185 1.585389962 1.59111604 1.76232531 ENSG00000000460 0.57717175 0.64888234 0.777446911 0.64510940 0.11986491 ENSG00000000938 0.54561199 0.59862564 0.479774397 0.37790623 0.83880649 GSM1094869 GSM1094870 GSM1094871 ENSG00000000003 1.0408332 1.7262079 1.0292255 ENSG00000000005 -0.1023734 -0.1537910 -0.1100603 ENSG00000000419 8.6196300 11.9588188 10.8900846 ENSG00000000457 1.5755541 1.7445362 1.6275308 ENSG00000000460 0.2372586 0.3456454 0.1885289 ENSG00000000938 0.7200364 0.9199667 0.7240255
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Row names like this can be very convenient for keeping matrices organized, but
-row names (and column names) can be lost or misordered if you are not careful,
-especially during input and output, so treat them with care.
-
-Although the `apply` functions may not be as easy to use as the tidyverse
-functions, for some applications, `apply` methods can be better suited.
-In this workshop, we will not delve too deeply into the various other apply
-functions (`tapply()`, `lapply()`, etc.) but you can read more information about
-them [here](https://www.guru99.com/r-apply-sapply-tapply.html).
-
-## The dplyr::join functions
-
-Let's say we have a scenario where we have two data frames that we would like to
-combine.
-Recall that `stats_df` and `gene_df` are data frames that contain information
-about some of the same genes.
-The [`dplyr::join` family of functions](https://dplyr.tidyverse.org/reference/join.html)
-are useful for various scenarios of combining data frames.
-
-For now, we will focus on `inner_join()`, which will combine data frames by only
-keeping information about matching rows that are in both data frames.
-We need to use the `by` argument to designate what column(s)
-should be used as a key to match the data frames.
-In this case we want to match the gene information between the two, so we will
-specify that we want to compare values in the `ensembl_id` column from
-`stats_df` to the `Gene` column from `gene_df`.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc3RhdHNfZGYgJT4lIFxuICBpbm5lcl9qb2luKGdlbmVfZGYsIGJ5ID0gYygnZW5zZW1ibF9pZCcgPSAnR2VuZScpKSBcbmBgYFxuYGBgIn0= -->
-
-```r
-```r
-stats_df %>%
+
+
+ GSM1094814 GSM1094815 GSM1094816 GSM1094817 GSM1094818
+ENSG00000000003 9.59510150 8.4785070 12.6802129 8.677614838 10.75552946
+ GSM1094819 GSM1094820 GSM1094821 GSM1094822 GSM1094823
+ENSG00000000003 6.37470691 9.10028584 7.3546860 8.51847190 9.4216113
+ GSM1094824 GSM1094825 GSM1094826 GSM1094827 GSM1094828
+ENSG00000000003 5.0239629 7.89737460 8.1126876 7.03444640 9.6984918
+ GSM1094829 GSM1094830 GSM1094831 GSM1094832 GSM1094833
+ENSG00000000003 13.98689230 10.5868331 7.6836223 8.3862587 11.18932763
+ GSM1094834 GSM1094835 GSM1094836 GSM1094837 GSM1094838
+ENSG00000000003 9.7562003 9.6984918 10.56891510 9.9391025 7.8738131
+ GSM1094839 GSM1094840 GSM1094841 GSM1094842 GSM1094843
+ENSG00000000003 8.6311353 8.58077557 9.1579585 6.3317019 10.1939387
+ GSM1094844 GSM1094845 GSM1094846 GSM1094847 GSM1094848
+ENSG00000000003 10.44364159 9.62435722 16.05075944 6.9334508 8.55180910
+ GSM1094849 GSM1094850 GSM1094851 GSM1094852 GSM1094853
+ENSG00000000003 9.29497760 7.5027098 6.9593119 8.33588532 8.16826110
+ GSM1094854 GSM1094855 GSM1094856 GSM1094857 GSM1094858
+ENSG00000000003 9.8020077 7.92580451 8.5122426 8.5300217 6.45774124
+ GSM1094859 GSM1094860 GSM1094861 GSM1094862 GSM1094863
+ENSG00000000003 8.06834906 6.29704946 9.59510150 8.2198571 6.0207988
+ GSM1094864 GSM1094865 GSM1094866 GSM1094867 GSM1094868
+ENSG00000000003 10.60783176 10.23536609 9.031964212 7.66629540 1.06494502
+ GSM1094869 GSM1094870 GSM1094871
+ENSG00000000003 1.0408332 1.7262079 1.0292255
+ [ reached getOption("max.print") -- omitted 5 rows ]
+
+
+
+Row names like this can be very convenient for keeping matrices organized, but row names (and column names) can be lost or misordered if you are not careful, especially during input and output, so treat them with care.
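One way to treat row names with care is to check that they still line up with any related object before combining the two. The sketch below uses made-up objects (`toy_expr_matrix` and `toy_annot_df` are hypothetical names) and shows one possible check, not the only one.

```r
# Hypothetical data: a matrix with gene row names and a data frame of
# annotations that lists the same genes in a different order
toy_expr_matrix <- matrix(
  1:6,
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)
toy_annot_df <- data.frame(
  Gene = c("geneC", "geneA", "geneB"),
  biotype = c("lncRNA", "protein_coding", "protein_coding"),
  stringsAsFactors = FALSE
)

# Check whether the two objects list the genes in the same order
order_matches_before <- identical(rownames(toy_expr_matrix), toy_annot_df$Gene)

# Reorder the annotation rows to match the matrix, then check again
toy_annot_df <- toy_annot_df[match(rownames(toy_expr_matrix), toy_annot_df$Gene), ]
order_matches_after <- identical(rownames(toy_expr_matrix), toy_annot_df$Gene)
```

Here the first check fails because the genes are in a different order, and the second passes after reordering with `match()`.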
+Although the apply
functions may not be as easy to use as the tidyverse functions, for some applications, apply
methods can be better suited. In this workshop, we will not delve too deeply into the various other apply functions (tapply()
, lapply()
, etc.) but you can read more information about them here.
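As a brief taste of the other apply functions, here is a small sketch on made-up inputs. All object names (`toy_list`, `toy_values`, `toy_groups`) are hypothetical and chosen only for illustration.

```r
# Hypothetical inputs, for illustration only
toy_list <- list(a = c(1, 2, 3), b = c(10, 20))

# lapply() applies a function to each element and always returns a list
toy_means_list <- lapply(toy_list, mean)

# sapply() does the same but simplifies the result to a vector when it can
toy_means <- sapply(toy_list, mean)

# tapply() applies a function to groups of a vector, where the groups are
# defined by a second vector of the same length
toy_values <- c(5, 7, 1, 3)
toy_groups <- c("tumor", "tumor", "normal", "normal")
toy_group_means <- tapply(toy_values, toy_groups, mean)
```

The main difference between these functions is the kind of input they iterate over and the shape of what they return.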
+
+
+The dplyr::join functions
+Let’s say we have a scenario where we have two data frames that we would like to combine. Recall that stats_df
and gene_df
are data frames that contain information about some of the same genes. The dplyr::join
family of functions is useful for combining data frames in various scenarios.
+For now, we will focus on inner_join()
, which will combine data frames by only keeping information about matching rows that are in both data frames. We need to use the by
argument to designate what column(s) should be used as a key to match the data frames. In this case we want to match the gene information between the two, so we will specify that we want to compare values in the ensembl_id
column from stats_df
to the Gene
column from gene_df
.
+
+
+
+stats_df %>%
inner_join(gene_df, by = c('ensembl_id' = 'Gene'))
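To see what "only keeping matching rows" means in practice, here is a minimal sketch with two made-up data frames. `toy_stats_df` and `toy_gene_df` are hypothetical stand-ins for `stats_df` and `gene_df`, with invented values.

```r
library(dplyr)

# Hypothetical data frames that share some, but not all, genes
toy_stats_df <- tibble(
  ensembl_id = c("gene1", "gene2", "gene3"),
  p_value = c(0.01, 0.20, 0.03)
)
toy_gene_df <- tibble(
  Gene = c("gene2", "gene3", "gene4"),
  sample1 = c(5.1, 7.3, 2.2)
)

# Only gene2 and gene3 appear in both, so only those two rows are kept.
# The key column keeps the name from the left-hand data frame (ensembl_id).
toy_joined_df <- toy_stats_df %>%
  inner_join(toy_gene_df, by = c("ensembl_id" = "Gene"))

toy_joined_df
```

Rows for `gene1` and `gene4` are dropped because each appears in only one of the two data frames.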
-
-<!-- rnb-source-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-## Save data to files
-
-#### Save to TSV files
-
-Let's write some of the data frames we created to a file.
-To do this, we can use the `readr` library of `_write()` functions.
-The first argument of `write_tsv()` is the data we want to write, and the second
-argument is a character string that describes the path to the new file we would
-like to create.
-Remember that we created a `results` directory to put our output in,
-but if we want to save our data to a directory other than our working directory,
-we need to specify this.
-This is what we will use the `file.path()` function for.
-Let's look in a bit more detail what `file.path()` does, by examining the
-results of the function in the examples below.
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyBXaGljaCBvZiB0aGVzZSBmaWxlIHBhdGhzIGlzIHdoYXQgd2Ugd2FudCB0byB1c2UgdG8gc2F2ZSBvdXIgZGF0YSB0byB0aGVcbiMgcmVzdWx0cyBkaXJlY3Rvcnkgd2UgY3JlYXRlZCBhdCB0aGUgYmVnaW5uaW5nIG9mIHRoaXMgbm90ZWJvb2s/XG5maWxlLnBhdGgoXFxkb2NrZXItaW5zdGFsbFxcLCBcXHN0YXRzX3N1bW1hcnkudHN2XFwpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-# Which of these file paths is what we want to use to save our data to the
+
+
+
+
+
+
+
+
+Save data to files
+
+Save to TSV files
+Let’s write some of the data frames we created to a file. To do this, we can use the readr
family of write_*()
functions. The first argument of write_tsv()
is the data we want to write, and the second argument is a character string that describes the path to the new file we would like to create. Remember that we created a results
directory to put our output in, but if we want to save our data to a directory other than our working directory, we need to specify this. This is what we will use the file.path()
function for. Let’s look in a bit more detail at what file.path()
does, by examining the results of the function in the examples below.
+
+
+
+# Which of these file paths is what we want to use to save our data to the
# results directory we created at the beginning of this notebook?
-file.path(\docker-install\, \stats_summary.tsv\)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiWzFdIFxcZG9ja2VyLWluc3RhbGwvc3RhdHNfc3VtbWFyeS50c3ZcXFxuZG9ja2VyLWluc3RhbGwvc3RhdHNfc3VtbWFyeS50c3ZcbiJ9 -->
-
-[1] -install/stats_summary.tsv
-docker-install/stats_summary.tsv
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZmlsZS5wYXRoKFxccmVzdWx0c1xcLCBcXHN0YXRzX3N1bW1hcnkudHN2XFwpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-file.path(\results\, \stats_summary.tsv\)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiWzFdIFxccmVzdWx0cy9zdGF0c19zdW1tYXJ5LnRzdlxcXG5yZXN1bHRzL3N0YXRzX3N1bW1hcnkudHN2XG4ifQ== -->
-
-[1] /stats_summary.tsv
-results/stats_summary.tsv
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZmlsZS5wYXRoKFxcc3RhdHNfc3VtbWFyeS50c3ZcXCwgXFxyZXN1bHRzXFwpXG5gYGBcbmBgYCJ9 -->
-
-```r
-```r
-file.path(\stats_summary.tsv\, \results\)
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiWzFdIFxcc3RhdHNfc3VtbWFyeS50c3YvcmVzdWx0c1xcXG5zdGF0c19zdW1tYXJ5LnRzdi9yZXN1bHRzXG4ifQ== -->
-
-[1] _summary.tsv/results
-stats_summary.tsv/results
-
-
-
-<!-- rnb-output-end -->
-
-<!-- rnb-chunk-end -->
-
-
-<!-- rnb-text-begin -->
-
-
-Replace `<NEW_FILE_PATH>` below with the `file.path()` statement from above that
-will successfully save our file to the `results` folder
-
-
-<!-- rnb-text-end -->
-
-
-<!-- rnb-chunk-begin -->
-
-
-<!-- rnb-source-begin eyJkYXRhIjpbIiMgV3JpdGUgb3VyIGRhdGEgZnJhbWUgdG8gYSBUU1YgZmlsZSIsInJlYWRyOjp3cml0ZV90c3Yoc3RhdHNfc3VtbWFyeV9kZiwgPE5FV19GSUxFX1BBVEg+KSJdfQ== -->
-
-```r
-# Write our data frame to a TSV file
+file.path("docker-install", "stats_summary.tsv")
+
+
+[1] "docker-install/stats_summary.tsv"
+
+
+
+
+file.path("results", "stats_summary.tsv")
+
+
+[1] "results/stats_summary.tsv"
+
+
+
+
+file.path("stats_summary.tsv", "results")
+
+
+[1] "stats_summary.tsv/results"
+
+
+
+
+
+Replace <NEW_FILE_PATH>
below with the file.path()
statement from above that will successfully save our file to the results
folder.
+
+
+
+# Write our data frame to a TSV file
readr::write_tsv(stats_summary_df, <NEW_FILE_PATH>)
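The same pattern can be tried end to end on a throwaway example. The sketch below assumes a made-up data frame (`toy_df`) and writes to a temporary directory rather than the notebook's `results` folder, so every name in it is hypothetical.

```r
# Hypothetical data frame and a temporary directory, for illustration only
toy_df <- data.frame(
  gene = c("geneA", "geneB"),
  mean_expr = c(1.5, 2.5),
  stringsAsFactors = FALSE
)
toy_results_dir <- file.path(tempdir(), "results")

# Create the output directory if it does not exist yet
if (!dir.exists(toy_results_dir)) {
  dir.create(toy_results_dir, recursive = TRUE)
}

# Directory first, file name last
toy_file <- file.path(toy_results_dir, "toy_summary.tsv")
readr::write_tsv(toy_df, toy_file)

# Read the file back in to confirm it was written as expected
toy_read_df <- readr::read_tsv(toy_file)
```

Reading the file back in is a quick sanity check that the path pointed where you intended.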
@@ -1674,44 +1071,54 @@ Read an RDS file
Session Info
-
-```r
-# Print out the versions and packages we are using in this session
+
+# Print out the versions and packages we are using in this session
sessionInfo()
-
-<!-- rnb-source-end -->
-
-<!-- rnb-output-begin eyJkYXRhIjoiUiB2ZXJzaW9uIDQuMC4zICgyMDIwLTEwLTEwKVxuUGxhdGZvcm06IHg4Nl82NC1wYy1saW51eC1nbnUgKDY0LWJpdClcblJ1bm5pbmcgdW5kZXI6IFVidW50dSAyMC4wNCBMVFNcblxuTWF0cml4IHByb2R1Y3RzOiBkZWZhdWx0XG5CTEFTL0xBUEFDSzogL3Vzci9saWIveDg2XzY0LWxpbnV4LWdudS9vcGVuYmxhcy1wdGhyZWFkL2xpYm9wZW5ibGFzcC1yMC4zLjguc29cblxubG9jYWxlOlxuIFsxXSBMQ19DVFlQRT1lbl9VUy5VVEYtOCAgICAgICBMQ19OVU1FUklDPUMgICAgICAgICAgICAgIFxuIFszXSBMQ19USU1FPWVuX1VTLlVURi04ICAgICAgICBMQ19DT0xMQVRFPWVuX1VTLlVURi04ICAgIFxuIFs1XSBMQ19NT05FVEFSWT1lbl9VUy5VVEYtOCAgICBMQ19NRVNTQUdFUz1DICAgICAgICAgICAgIFxuIFs3XSBMQ19QQVBFUj1lbl9VUy5VVEYtOCAgICAgICBMQ19OQU1FPUMgICAgICAgICAgICAgICAgIFxuIFs5XSBMQ19BRERSRVNTPUMgICAgICAgICAgICAgICBMQ19URUxFUEhPTkU9QyAgICAgICAgICAgIFxuWzExXSBMQ19NRUFTVVJFTUVOVD1lbl9VUy5VVEYtOCBMQ19JREVOVElGSUNBVElPTj1DICAgICAgIFxuXG5hdHRhY2hlZCBiYXNlIHBhY2thZ2VzOlxuWzFdIHN0YXRzICAgICBncmFwaGljcyAgZ3JEZXZpY2VzIHV0aWxzICAgICBkYXRhc2V0cyAgbWV0aG9kcyAgIGJhc2UgICAgIFxuXG5vdGhlciBhdHRhY2hlZCBwYWNrYWdlczpcbiBbMV0gZm9yY2F0c18wLjUuMCAgIHN0cmluZ3JfMS40LjAgICBkcGx5cl8xLjAuMyAgICAgcHVycnJfMC4zLjQgICAgXG4gWzVdIHJlYWRyXzEuNC4wICAgICB0aWR5cl8xLjEuMiAgICAgdGliYmxlXzMuMC41ICAgIGdncGxvdDJfMy4zLjMgIFxuIFs5XSB0aWR5dmVyc2VfMS4zLjAgb3B0cGFyc2VfMS42LjYgXG5cbmxvYWRlZCB2aWEgYSBuYW1lc3BhY2UgKGFuZCBub3QgYXR0YWNoZWQpOlxuIFsxXSBSY3BwXzEuMC42ICAgICAgICBjZWxscmFuZ2VyXzEuMS4wICBwaWxsYXJfMS40LjcgICAgICBjb21waWxlcl80LjAuMyAgIFxuIFs1XSBkYnBseXJfMi4wLjAgICAgICB0b29sc180LjAuMyAgICAgICBkaWdlc3RfMC42LjI3ICAgICBsdWJyaWRhdGVfMS43LjkuMlxuIFs5XSBqc29ubGl0ZV8xLjcuMiAgICBldmFsdWF0ZV8wLjE0ICAgICBsaWZlY3ljbGVfMC4yLjAgICBndGFibGVfMC4zLjAgICAgIFxuWzEzXSBwa2djb25maWdfMi4wLjMgICBybGFuZ18wLjQuMTAgICAgICByZXByZXhfMC4zLjAgICAgICBjbGlfMi4yLjAgICAgICAgIFxuWzE3XSByc3R1ZGlvYXBpXzAuMTMgICBEQklfMS4xLjEgICAgICAgICB5YW1sXzIuMi4xICAgICAgICBoYXZlbl8yLjMuMSAgICAgIFxuWzIxXSB4ZnVuXzAuMjAgICAgICAgICB3aXRocl8yLjQuMCAgICAgICB4bWwyXzEuMy4yICAgICAgICBodHRyXzEuNC4yICAgICAgIFxuWzI1XSBrbml0cl8xLjMwICAgICAgICBmc18xLjUuMCAgICAgICAgICBobXNfMS4wLjAgICAgICAgICBnZW5lcmljc18wL
jEuMCAgIFxuWzI5XSB2Y3Ryc18wLjMuNiAgICAgICBncmlkXzQuMC4zICAgICAgICBnZXRvcHRfMS4yMC4zICAgICB0aWR5c2VsZWN0XzEuMS4wIFxuWzMzXSBnbHVlXzEuNC4yICAgICAgICBSNl8yLjUuMCAgICAgICAgICBmYW5zaV8wLjQuMiAgICAgICByZWFkeGxfMS4zLjEgICAgIFxuWzM3XSBybWFya2Rvd25fMi42ICAgICBtb2RlbHJfMC4xLjggICAgICBtYWdyaXR0cl8yLjAuMSAgICBwc18xLjUuMCAgICAgICAgIFxuWzQxXSBiYWNrcG9ydHNfMS4yLjEgICBzY2FsZXNfMS4xLjEgICAgICBlbGxpcHNpc18wLjMuMSAgICBodG1sdG9vbHNfMC41LjEuMVxuWzQ1XSBydmVzdF8wLjMuNiAgICAgICBhc3NlcnR0aGF0XzAuMi4xICBjb2xvcnNwYWNlXzIuMC0wICBzdHJpbmdpXzEuNS4zICAgIFxuWzQ5XSBtdW5zZWxsXzAuNS4wICAgICBicm9vbV8wLjcuMyAgICAgICBjcmF5b25fMS4zLjQgICAgIFxuIn0= -->
-
-R version 4.0.3 (2020-10-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04 LTS
-Matrix products: default BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
-locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
-[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
-[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
-[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
-[9] LC_ADDRESS=C LC_TELEPHONE=C
-[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
-attached base packages: [1] stats graphics grDevices utils datasets methods base
-other attached packages: [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.3 purrr_0.3.4
-[5] readr_1.4.0 tidyr_1.1.2 tibble_3.0.5 ggplot2_3.3.3
-[9] tidyverse_1.3.0 optparse_1.6.6
-loaded via a namespace (and not attached): [1] Rcpp_1.0.6 cellranger_1.1.0 pillar_1.4.7 compiler_4.0.3
-[5] dbplyr_2.0.0 tools_4.0.3 digest_0.6.27 lubridate_1.7.9.2 [9] jsonlite_1.7.2 evaluate_0.14 lifecycle_0.2.0 gtable_0.3.0
-[13] pkgconfig_2.0.3 rlang_0.4.10 reprex_0.3.0 cli_2.2.0
-[17] rstudioapi_0.13 DBI_1.1.1 yaml_2.2.1 haven_2.3.1
-[21] xfun_0.20 withr_2.4.0 xml2_1.3.2 httr_1.4.2
-[25] knitr_1.30 fs_1.5.0 hms_1.0.0 generics_0.1.0
-[29] vctrs_0.3.6 grid_4.0.3 getopt_1.20.3 tidyselect_1.1.0 [33] glue_1.4.2 R6_2.5.0 fansi_0.4.2 readxl_1.3.1
-[37] rmarkdown_2.6 modelr_0.1.8 magrittr_2.0.1 ps_1.5.0
-[41] backports_1.2.1 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.1.1 [45] rvest_0.3.6 assertthat_0.2.1 colorspace_2.0-0 stringi_1.5.3
-[49] munsell_0.5.0 broom_0.7.3 crayon_1.3.4
-```
+
+
+R version 4.0.3 (2020-10-10)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 20.04 LTS
+
+Matrix products: default
+BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+other attached packages:
+ [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.3 purrr_0.3.4
+ [5] readr_1.4.0 tidyr_1.1.2 tibble_3.0.5 ggplot2_3.3.3
+ [9] tidyverse_1.3.0 optparse_1.6.6
+
+loaded via a namespace (and not attached):
+ [1] Rcpp_1.0.6 cellranger_1.1.0 pillar_1.4.7 compiler_4.0.3
+ [5] dbplyr_2.0.0 tools_4.0.3 digest_0.6.27 lubridate_1.7.9.2
+ [9] jsonlite_1.7.2 evaluate_0.14 lifecycle_0.2.0 gtable_0.3.0
+[13] pkgconfig_2.0.3 rlang_0.4.10 reprex_0.3.0 cli_2.2.0
+[17] rstudioapi_0.13 DBI_1.1.1 yaml_2.2.1 haven_2.3.1
+[21] xfun_0.20 withr_2.4.0 xml2_1.3.2 httr_1.4.2
+[25] knitr_1.30 fs_1.5.0 hms_1.0.0 generics_0.1.0
+[29] vctrs_0.3.6 grid_4.0.3 getopt_1.20.3 tidyselect_1.1.0
+[33] glue_1.4.2 R6_2.5.0 fansi_0.4.2 readxl_1.3.1
+[37] rmarkdown_2.6 modelr_0.1.8 magrittr_2.0.1 ps_1.5.0
+[41] backports_1.2.1 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.1.1
+[45] rvest_0.3.6 assertthat_0.2.1 colorspace_2.0-0 stringi_1.5.3
+[49] munsell_0.5.0 broom_0.7.3 crayon_1.3.4


diff --git a/machine-learning/01-openpbta_heatmap.nb.html b/machine-learning/01-openpbta_heatmap.nb.html
index ed74a473..222fa3cb 100644
--- a/machine-learning/01-openpbta_heatmap.nb.html
+++ b/machine-learning/01-openpbta_heatmap.nb.html
@@ -611,7 +611,7 @@ Annotation
diff --git a/machine-learning/02-openpbta_consensus_clustering.nb.html b/machine-learning/02-openpbta_consensus_clustering.nb.html
index 936a8a41..bdeaf8cf 100644
--- a/machine-learning/02-openpbta_consensus_clustering.nb.html
+++ b/machine-learning/02-openpbta_consensus_clustering.nb.html
@@ -907,7 +907,7 @@ PCA
-
+
diff --git a/machine-learning/04-openpbta_plot_LV.nb.html b/machine-learning/04-openpbta_plot_LV.nb.html
index 2516ed0d..bbbe6235 100644
--- a/machine-learning/04-openpbta_plot_LV.nb.html
+++ b/machine-learning/04-openpbta_plot_LV.nb.html
@@ -508,7 +508,7 @@ PLIER results
diff --git a/pathway-analysis/01-overrepresentation_analysis-live.Rmd b/pathway-analysis/01-overrepresentation_analysis-live.Rmd
index 8f08a7c0..957cbe79 100644
--- a/pathway-analysis/01-overrepresentation_analysis-live.Rmd
+++ b/pathway-analysis/01-overrepresentation_analysis-live.Rmd
@@ -8,6 +8,17 @@ author: CCDL for ALSF
date: 2020
---
+## Objectives
+
+This notebook will demonstrate how to:
+
+- Perform gene identifier conversion with [`AnnotationDbi` annotation packages](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf)
+- Access [Molecular Signatures Database gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) via the `msigdbr` package
+- Prepare gene sets for over-representation analysis, including an appropriate background set
+- Perform over-representation analysis with the `clusterProfiler` package
+
+---
+
In this notebook, we'll cover a type of pathway or gene set analysis called over-representation analysis (ORA).
The idea behind ORA is relatively straightforward: given a set of genes, do these genes overlap with a pathway more than we expect by chance?
The simplicity of only requiring an input gene set (sort of, more on that below) can be attractive.
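The "more than we expect by chance" question is commonly framed as a hypergeometric test. The sketch below works through that calculation in base R with made-up counts. All of the numbers and variable names are hypothetical, and this is an illustration of the idea rather than the exact code any particular package runs.

```r
# Hypothetical counts, for illustration only
n_background <- 10000  # genes in the background set
n_pathway <- 100       # background genes annotated to the pathway
n_interest <- 200      # genes in our gene set of interest
n_overlap <- 10        # genes of interest that are also in the pathway

# Overlap we would expect by chance alone
expected_overlap <- n_interest * n_pathway / n_background

# Probability of seeing an overlap of at least n_overlap genes by chance,
# using the hypergeometric distribution
ora_p_value <- phyper(
  q = n_overlap - 1,
  m = n_pathway,
  n = n_background - n_pathway,
  k = n_interest,
  lower.tail = FALSE
)
```

With these counts the expected overlap is 2 genes, so an observed overlap of 10 gives a small p-value. Note that the result depends on `n_background`, which is why the choice of background set matters.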
@@ -219,9 +230,9 @@ vs_low_df <- vs_low_df %>%
```
**Now we'll read in our data frame of DGE results from another comparison.**
-To save us some time during instruction, we've already done the gene identifier conversion and filtering to remove `NA` values in [this notebook](https://github.com/AlexsLemonade/training-modules/tree/master/pathway-analysis/setup/03-leukemia_DGE.Rmd).
+To save us some time during instruction, we've already done the gene identifier conversion and filtering to remove `NA` values in [this notebook](https://github.com/AlexsLemonade/training-modules/tree/master/pathway-analysis/setup/01-leukemia_DGE.Rmd).
We took a different series of steps to achieve the same thing, which is often possible in R!
-
+
```{r read_in_unsorted}
vs_unsorted_df <- readr::read_tsv(vs_unsorted_file)
@@ -268,8 +279,7 @@ The authors sorted populations of primary leukemia cells and examined the stem c
We compared the population that the authors identified as having high stem cell capacity to a low stem cell capacity population.
We also compared the high stem cell capacity cells to a mix of populations (e.g., unsorted cells).
-You can see the code in [here](https://github.com/AlexsLemonade/training-modules/tree/master/pathway-analysis/setup/03-leukemia_DGE.Rmd).
-
+You can see the code [here](https://github.com/AlexsLemonade/training-modules/tree/master/pathway-analysis/setup/01-leukemia_DGE.Rmd).
We're interested in what pathways are over-represented in genes that specifically distinguish the high capacity population from the low capacity population.
diff --git a/pathway-analysis/01-overrepresentation_analysis.nb.html b/pathway-analysis/01-overrepresentation_analysis.nb.html
index a1a2bd94..07850257 100644
--- a/pathway-analysis/01-overrepresentation_analysis.nb.html
+++ b/pathway-analysis/01-overrepresentation_analysis.nb.html
@@ -345,87 +345,125 @@ Set up
Libraries
-
+
# Pipes
library(magrittr)
# Package we'll use to perform over-representation analysis
library(clusterProfiler)
-
+
+
+
+
clusterProfiler v3.18.1 For help: https://guangchuangyu.github.io/software/clusterProfiler
If you use clusterProfiler in published research, please cite:
-Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
-
-Attaching package: ‘clusterProfiler’
-
-The following object is masked from ‘package:stats’:
+Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
+
+
+
+Attaching package: 'clusterProfiler'
+
+
+The following object is masked from 'package:stats':
filter
-
-
+
+
# Package that contains MSigDB gene sets in tidy format
library(msigdbr)
# Mus musculus annotation package we'll use for gene identifier conversion
library(org.Mm.eg.db)
-
-Loading required package: AnnotationDbi
-Loading required package: stats4
-Loading required package: BiocGenerics
-Loading required package: parallel
-
-Attaching package: ‘BiocGenerics’
-
-The following objects are masked from ‘package:parallel’:
-
- clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply,
- parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB
-
-The following objects are masked from ‘package:stats’:
-
- IQR, mad, sd, var, xtabs
-
-The following objects are masked from ‘package:base’:
-
- anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated,
- eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match,
- mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames,
- sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min
-
-Loading required package: Biobase
-Welcome to Bioconductor
-
- Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see
- 'citation("Biobase")', and for packages 'citation("pkgname")'.
-
-Loading required package: IRanges
-Loading required package: S4Vectors
-
-Attaching package: ‘S4Vectors’
-
-The following object is masked from ‘package:clusterProfiler’:
-
- rename
-
-The following object is masked from ‘package:base’:
-
- expand.grid
-
-
-Attaching package: ‘IRanges’
-
-The following object is masked from ‘package:clusterProfiler’:
-
- slice
-
-
-Attaching package: ‘AnnotationDbi’
-
-The following object is masked from ‘package:clusterProfiler’:
+
+Loading required package: AnnotationDbi
+
+
+Loading required package: stats4
+
+
+Loading required package: BiocGenerics
+
+
+Loading required package: parallel
+
+
+
+Attaching package: 'BiocGenerics'
+
+
+The following objects are masked from 'package:parallel':
+
+ clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+ clusterExport, clusterMap, parApply, parCapply, parLapply,
+ parLapplyLB, parRapply, parSapply, parSapplyLB
+
+
+The following objects are masked from 'package:stats':
+
+ IQR, mad, sd, var, xtabs
+
+
+The following objects are masked from 'package:base':
+
+ anyDuplicated, append, as.data.frame, basename, cbind, colnames,
+ dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
+ grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
+ order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+ rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
+ union, unique, unsplit, which.max, which.min
+
+
+Loading required package: Biobase
+
+
+Welcome to Bioconductor
+
+ Vignettes contain introductory material; view with
+ 'browseVignettes()'. To cite Bioconductor, see
+ 'citation("Biobase")', and for packages 'citation("pkgname")'.
+
+
+Loading required package: IRanges
+
+
+Loading required package: S4Vectors
+
+
+
+Attaching package: 'S4Vectors'
+
+
+The following object is masked from 'package:clusterProfiler':
+
+ rename
+
+
+The following object is masked from 'package:base':
+
+ expand.grid
+
+
+
+Attaching package: 'IRanges'
+
+
+The following object is masked from 'package:clusterProfiler':
+
+ slice
+
+
+
+Attaching package: 'AnnotationDbi'
+
+
+The following object is masked from 'package:clusterProfiler':
select
-
+
+
+
+
@@ -435,7 +473,7 @@ Directories and files
Directories
-
+
# We'll create a directory to specifically hold the ORA results if it doesn't
# exist yet
results_dir <- file.path("results", "leukemia")
@@ -451,7 +489,7 @@ Input files
For our ORA example, we’re going to use two tables of differential gene expression (DGE) analysis results.
-
+
input_dir <- file.path("data", "leukemia")
# This file contains the DGE results for a cell population with high stem cell
@@ -472,7 +510,7 @@ Output files
We’ll save the table of ORA results (e.g., p-values).
-
+
kegg_results_file <- file.path(results_dir, "leukemia_kegg_ora_results.tsv")
@@ -486,22 +524,20 @@ Gene sets
Let’s take a look at what organisms the package supports.
-
+
msigdbr_species()
-
-
The results we’re interested in here come from mouse samples, so we can obtain just the gene sets relevant to M. musculus with the species
argument to msigdbr()
.
-
+
mm_msigdb_df <- msigdbr(species = "Mus musculus")
@@ -522,7 +558,7 @@ Gene sets
And are a subset of C2: curated gene sets
. Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways.
-
+
# Filter the mouse data frame to the KEGG pathways that are included in the
# curated gene sets
mm_kegg_df <- mm_msigdb_df %>%
@@ -534,15 +570,51 @@ Gene sets
Note: We could have specified that we wanted the KEGG gene sets using the category
and subcategory
arguments of msigdbr()
, but we’re going for general steps!
-
+
colnames(mm_kegg_df)
-
- [1] "gs_cat" "gs_subcat" "gs_name" "entrez_gene"
- [5] "gene_symbol" "human_entrez_gene" "human_gene_symbol" "gs_id"
- [9] "gs_pmid" "gs_geoid" "gs_exact_source" "gs_url"
-[13] "gs_description" "species_name" "species_common_name" "ortholog_sources"
-[17] "num_ortholog_sources"
+
+ [1] "gs_cat" "gs_subcat" "gs_name"
+ [4] "entrez_gene" "gene_symbol" "human_entrez_gene"
+ [7] "human_gene_symbol" "gs_id" "gs_pmid"
+[10] "gs_geoid" "gs_exact_source" "gs_url"
+[13] "gs_description" "species_name" "species_common_name"
+[16] "ortholog_sources" "num_ortholog_sources"
+
@@ -552,12 +624,12 @@ Gene sets
Read in DGE results and prep
-
+
vs_low_df <- readr::read_tsv(vs_low_file)
-
+
-── Column specification ───────────────────────────────────────────────────────────────────────────────────────────
+── Column specification ────────────────────────────────────────────────────────
cols(
Gene = col_character(),
baseMean = col_double(),
@@ -567,22 +639,20 @@ Read in DGE results and prep
pvalue = col_double(),
padj = col_double()
)
-
+
Let’s take a peek at the top of the DGE results data frame.
-
+
head(vs_low_df)
-
-
@@ -593,14 +663,64 @@ Gene identifier conversion
We can see what types of IDs are available to us in an annotation package with keytypes()
.
-
+
keytypes(org.Mm.eg.db)
-
- [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME"
- [8] "EVIDENCE" "EVIDENCEALL" "GENENAME" "GO" "GOALL" "IPI" "MGI"
-[15] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID" "PROSITE" "REFSEQ"
-[22] "SYMBOL" "UNIGENE" "UNIPROT"
+
+ [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
+ [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
+[11] "GO" "GOALL" "IPI" "MGI" "ONTOLOGY"
+[16] "ONTOLOGYALL" "PATH" "PFAM" "PMID" "PROSITE"
+[21] "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"
+
@@ -608,7 +728,7 @@ Gene identifier conversion
The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds()
.
-
+
# This returns a named vector which we can convert to a data frame, where
# the keys (Ensembl IDs) are the names
symbols_vector <- mapIds(org.Mm.eg.db, # Specify the annotation package
@@ -624,10 +744,10 @@ Gene identifier conversion
# first one. This is default behavior!
multiVals = "first")
-
+
'select()' returned 1:many mapping between keys and columns
-
-
+
+
# We would like a data frame we can join to the DGE results
symbols_df <- data.frame(
ensembl_id = names(symbols_vector),
@@ -641,7 +761,7 @@ Gene identifier conversion
Let’s do this first for the comparison to the low stem cell capacity population.
-
+
vs_low_df <- symbols_df %>%
# An *inner* join will only return rows that are in both data frames
dplyr::inner_join(vs_low_df,
@@ -658,7 +778,7 @@ Drop NA
values
Let’s filter to rows that do not have any NA
using a function tidyr::drop_na()
. This will also drop genes that have an Ensembl gene identifier but no gene symbol!
-
+
# Remove rows that are not complete (e.g., contain NAs) by filtering to only
# complete rows
vs_low_df <- vs_low_df %>%
@@ -669,12 +789,12 @@ Drop NA
values
Now we’ll read in our data frame of DGE results from another comparison. To save us some time during instruction, we’ve already done the gene identifier conversion and filtering to remove NA
values in this notebook. We took a different series of steps to achieve the same thing, which is often possible in R!
-
+
vs_unsorted_df <- readr::read_tsv(vs_unsorted_file)
-
+
-── Column specification ───────────────────────────────────────────────────────────────────────────────────────────
+── Column specification ────────────────────────────────────────────────────────
cols(
Gene = col_character(),
baseMean = col_double(),
@@ -685,7 +805,7 @@ Drop NA
values
padj = col_double(),
gene_symbol = col_character()
)
-
+
@@ -702,7 +822,7 @@ Over-representation Analysis (ORA)
We’ll call genes that are differentially expressed gene_in_interest
and genes that are in the gene set in_gene_set
.
-
+
gene_table <- data.frame(
gene_not_interest = c(2613, 15310),
gene_in_interest = c(28, 29)
@@ -711,19 +831,17 @@ Over-representation Analysis (ORA)
gene_table
-
-
We can assess if the 28 overlapping genes mean that the differentially expressed genes are over-represented in the gene set with the hypergeometric distribution. This corresponds to a one-sided Fisher’s exact test.
-
+
fisher.test(gene_table, alternative = "greater")
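
As a rough cross-check of the link between the hypergeometric distribution and a one-sided Fisher's exact test, the sketch below computes the same upper-tail probability both ways in base R. It assumes the counts above mean 28 genes of interest inside the gene set, 29 genes of interest outside it, 2613 other genes inside the set and 15310 other genes outside it. The names `overlap`, `set_size`, `interest_size`, `universe_size` and `contingency` are illustrative, and the table is laid out with the overlap in the top-left cell, which may not match the orientation of `gene_table`.

```r
# Counts assumed from the example table above (an assumption, see lead-in)
overlap <- 28                            # genes of interest that are in the gene set
set_size <- 28 + 2613                    # all genes in the gene set
interest_size <- 28 + 29                 # all genes of interest
universe_size <- 28 + 29 + 2613 + 15310  # all genes considered

# P(X >= overlap) when drawing `interest_size` genes from the universe
p_hyper <- phyper(overlap - 1,
                  set_size,
                  universe_size - set_size,
                  interest_size,
                  lower.tail = FALSE)

# The same question posed as a 2 x 2 table, overlap in the top-left cell
contingency <- matrix(
  c(overlap, set_size - overlap,
    interest_size - overlap,
    universe_size - set_size - interest_size + overlap),
  nrow = 2,
  dimnames = list(c("gene_in_interest", "gene_not_interest"),
                  c("in_gene_set", "not_in_gene_set"))
)
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

# The two p-values should agree
all.equal(p_hyper, p_fisher)
```

With `alternative = "greater"`, the test asks whether the top-left cell is larger than expected, so the orientation of the table matters when setting up a test like this.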
@@ -751,7 +869,7 @@ High stem cell capacity ORA
We’ll start with the high stem cell capacity vs. low stem cell capacity population comparison. Genes with positive log2 fold-changes (LFC) will be more highly expressed in the high stem cell capacity cells based on how we set up the analysis.
-
+
vs_low_genes <- vs_low_df %>%
# Filter to the positive LFC and filter based on significance too (padj)
dplyr::filter(log2FoldChange > 0,
@@ -765,7 +883,7 @@ High stem cell capacity ORA
Now, we’ll take the same steps for our other results.
-
+
vs_unsorted_genes <- vs_unsorted_df %>%
dplyr::filter(log2FoldChange > 0,
padj < 0.05) %>%
@@ -776,7 +894,7 @@ High stem cell capacity ORA
We want genes that are in the first comparison but not in the second! We can use setdiff()
, a base R function for set operations, to get the list that we want.
-
+
# What genes are in the first set but *not* in the second set
genes_for_ora <- setdiff(vs_low_genes, vs_unsorted_genes)
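
To make the behaviour of the set operations concrete, here is a minimal sketch on toy vectors. The gene symbols and the names `first_genes`, `second_genes`, `only_in_first` and `in_both` are made up for illustration and are not taken from the leukemia results.

```r
# Toy vectors of gene symbols (hypothetical values)
first_genes <- c("Kit", "Gata2", "Meis1", "Hoxa9")
second_genes <- c("Meis1", "Hoxa9", "Cd34")

# Elements of the first vector that are *not* in the second
only_in_first <- setdiff(first_genes, second_genes)

# Elements that appear in both vectors
in_both <- intersect(first_genes, second_genes)

# setdiff() is not symmetric: swapping the arguments changes the answer
only_in_second <- setdiff(second_genes, first_genes)
```

Because `setdiff()` depends on argument order, it is worth double-checking which comparison comes first when building a list of genes of interest this way.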
@@ -793,7 +911,7 @@ Background set
We can use another function for set operations, intersect()
, to get our background set of genes that were included in both comparisons.
-
+
# intersect() will return the genes in both sets - we are using the entire data
# frame here (complete cases), not just the significant genes
background_set <- intersect(vs_low_df$gene_symbol,
@@ -812,7 +930,7 @@ Run enricher()
Now that we have our background set, our genes of interest, and our pathway information, we’re ready to run ORA using the enricher()
function.
-
+
kegg_ora_results <- enricher(
gene = genes_for_ora, # Genes of interest
pvalueCutoff = 0.05,
@@ -839,7 +957,7 @@ Run enricher()
The information we’re most likely interested in is in the result
slot. Let’s convert this into a data frame that we can write to file.
-
+
kegg_result_df <- data.frame(kegg_ora_results@result)
@@ -850,25 +968,25 @@ Visualizing results
We can use a dot plot to visualize our significant enrichment results.
-
+
enrichplot::dotplot(kegg_ora_results)
-
+
wrong orderBy parameter; set to default `orderBy = "x"`
-
-
-
+
+
+
We can use an UpSet plot to visualize the overlap between the gene sets that were returned as significant.
-
+
enrichplot::upsetplot(kegg_ora_results)
-
-
+
+
@@ -876,18 +994,16 @@ Visualizing results
We can look at the geneID
column of our results to see what genes overlap; it’s a good idea to take a look.
-
+
kegg_result_df %>%
# Use dplyr::select() - the name of the pathway is in the ID column
dplyr::select(ID, geneID)
-
-
@@ -895,7 +1011,7 @@ Visualizing results
Write results to file
-
+
readr::write_tsv(kegg_result_df, file = kegg_results_file)
@@ -907,50 +1023,65 @@ Write results to file
Session Info
-
+
sessionInfo()
-
+
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
-Running under: Ubuntu 18.04.3 LTS
+Running under: Ubuntu 20.04 LTS
Matrix products: default
-BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
-LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
+BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
- [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8
- [5] LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C
- [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
-[1] parallel stats4 stats graphics grDevices utils datasets methods base
+[1] parallel stats4 stats graphics grDevices utils datasets
+[8] methods base
other attached packages:
-[1] org.Mm.eg.db_3.12.0 AnnotationDbi_1.52.0 IRanges_2.24.1 S4Vectors_0.28.1
-[5] Biobase_2.50.0 BiocGenerics_0.36.0 msigdbr_7.2.1 clusterProfiler_3.18.1
-[9] magrittr_2.0.1
+ [1] org.Mm.eg.db_3.12.0 AnnotationDbi_1.52.0 IRanges_2.24.1
+ [4] S4Vectors_0.28.1 Biobase_2.50.0 BiocGenerics_0.36.0
+ [7] msigdbr_7.2.1 clusterProfiler_3.18.1 magrittr_2.0.1
+[10] optparse_1.6.6
loaded via a namespace (and not attached):
- [1] enrichplot_1.10.2 bit64_4.0.5 RColorBrewer_1.1-2 tools_4.0.3 R6_2.5.0
- [6] DBI_1.1.1 colorspace_2.0-0 tidyselect_1.1.0 gridExtra_2.3 bit_4.0.4
-[11] compiler_4.0.3 cli_2.2.0 scatterpie_0.1.5 labeling_0.4.2 shadowtext_0.0.7
-[16] scales_1.1.1 readr_1.4.0 stringr_1.4.0 digest_0.6.27 ggupset_0.3.0
-[21] rmarkdown_2.6 DOSE_3.16.0 pkgconfig_2.0.3 htmltools_0.5.1.1 fastmap_1.1.0
-[26] rlang_0.4.10 rstudioapi_0.13 RSQLite_2.2.3 farver_2.0.3 generics_0.1.0
-[31] jsonlite_1.7.2 BiocParallel_1.24.1 GOSemSim_2.16.1 dplyr_1.0.3 GO.db_3.12.1
-[36] Matrix_1.3-2 fansi_0.4.2 Rcpp_1.0.6 munsell_0.5.0 viridis_0.5.1
-[41] lifecycle_0.2.0 stringi_1.5.3 yaml_2.2.1 ggraph_2.0.4 MASS_7.3-53
-[46] plyr_1.8.6 qvalue_2.22.0 grid_4.0.3 blob_1.2.1 ggrepel_0.9.1
-[51] DO.db_2.9 crayon_1.3.4 lattice_0.20-41 graphlayouts_0.7.1 cowplot_1.1.1
-[56] splines_4.0.3 hms_1.0.0 knitr_1.30 pillar_1.4.7 fgsea_1.16.0
-[61] igraph_1.2.6 reshape2_1.4.4 fastmatch_1.1-0 glue_1.4.2 evaluate_0.14
-[66] downloader_0.4 data.table_1.13.6 renv_0.12.5-2 BiocManager_1.30.10 vctrs_0.3.6
-[71] tweenr_1.0.1 gtable_0.3.0 purrr_0.3.4 polyclip_1.10-0 tidyr_1.1.2
-[76] assertthat_0.2.1 cachem_1.0.1 ggplot2_3.3.3 xfun_0.20 ggforce_0.3.2
-[81] tidygraph_1.2.0 viridisLite_0.3.0 tibble_3.0.5 rvcheck_0.1.8 memoise_1.1.0
-[86] ellipsis_0.3.1
+ [1] enrichplot_1.10.2 bit64_4.0.5 RColorBrewer_1.1-2
+ [4] tools_4.0.3 R6_2.5.0 DBI_1.1.1
+ [7] colorspace_2.0-0 tidyselect_1.1.0 gridExtra_2.3
+[10] bit_4.0.4 compiler_4.0.3 cli_2.2.0
+[13] scatterpie_0.1.5 labeling_0.4.2 shadowtext_0.0.7
+[16] scales_1.1.1 readr_1.4.0 stringr_1.4.0
+[19] digest_0.6.27 ggupset_0.3.0 rmarkdown_2.6
+[22] DOSE_3.16.0 pkgconfig_2.0.3 htmltools_0.5.1.1
+[25] fastmap_1.1.0 rlang_0.4.10 rstudioapi_0.13
+[28] RSQLite_2.2.3 farver_2.0.3 generics_0.1.0
+[31] jsonlite_1.7.2 BiocParallel_1.24.1 GOSemSim_2.16.1
+[34] dplyr_1.0.3 GO.db_3.12.1 Matrix_1.3-2
+[37] fansi_0.4.2 Rcpp_1.0.6 munsell_0.5.0
+[40] viridis_0.5.1 lifecycle_0.2.0 stringi_1.5.3
+[43] yaml_2.2.1 ggraph_2.0.4 MASS_7.3-53
+[46] plyr_1.8.6 qvalue_2.22.0 grid_4.0.3
+[49] blob_1.2.1 ggrepel_0.9.1 DO.db_2.9
+[52] crayon_1.3.4 lattice_0.20-41 graphlayouts_0.7.1
+[55] cowplot_1.1.1 splines_4.0.3 hms_1.0.0
+[58] ps_1.5.0 knitr_1.30 pillar_1.4.7
+[61] fgsea_1.16.0 igraph_1.2.6 reshape2_1.4.4
+[64] fastmatch_1.1-0 glue_1.4.2 evaluate_0.14
+[67] downloader_0.4 data.table_1.13.6 BiocManager_1.30.10
+[70] vctrs_0.3.6 tweenr_1.0.1 gtable_0.3.0
+[73] getopt_1.20.3 purrr_0.3.4 polyclip_1.10-0
+[76] tidyr_1.1.2 assertthat_0.2.1 cachem_1.0.1
+[79] ggplot2_3.3.3 xfun_0.20 ggforce_0.3.2
+[82] tidygraph_1.2.0 viridisLite_0.3.0 tibble_3.0.5
+[85] rvcheck_0.1.8 memoise_1.1.0 ellipsis_0.3.1
diff --git a/pathway-analysis/02-gene_set_enrichment_analysis-live.Rmd b/pathway-analysis/02-gene_set_enrichment_analysis-live.Rmd
index 99403826..7fced4a3 100644
--- a/pathway-analysis/02-gene_set_enrichment_analysis-live.Rmd
+++ b/pathway-analysis/02-gene_set_enrichment_analysis-live.Rmd
@@ -8,20 +8,27 @@ author: CCDL for ALSF
date: 2020
---
+## Objectives
+
+This notebook will demonstrate how to:
+
+- Prepare tabular data of gene-level statistics for use with Gene Set Enrichment Analysis (GSEA), including how to remove duplicate gene identifiers
+- Perform GSEA with the `clusterProfiler` package
+- Visualize GSEA results with the `enrichplot` package
+
+---
+
In this notebook, we will perform Gene Set Enrichment Analysis (GSEA) on the neuroblastoma cell line differential gene expression (DGE) results we generated during the RNA-seq module.
To refresh our memory, we analyzed data from [Harenza *et al.* (2017)](https://doi.org/10.1038/sdata.2017.33) and specifically tested for DGE between _MYCN_ amplified cell lines and non-amplified cell lines using `DESeq2`.
We have a table of results that contains our log2 fold changes and adjusted p-values.
-*Note: We've taken an additional step to get more accurate log2 fold change estimates.
-You can see the code for that in `setup/01-prepare_NB_cell_line.Rmd`.*
-
-GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian, Tamayo *et al.* (2005)](https://doi.org/10.1073/pnas.0506580102).
+GSEA is a functional class scoring (FCS) approach to pathway analysis that was first introduced in [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).
The rationale behind FCS approaches is that small changes in individual genes that participate in the same biological process or pathway can be significant and of biological interest.
FCS methods are better suited for identifying these pathways that show coordinated changes than ORA.
In ORA, we pick a cutoff that _typically_ only captures genes with large individual changes.
-There are 3 general steps in FCS methods ([Khatri, Sirota, and Butte. 2012]( https://doi.org/10.1371/journal.pcbi.1002375)):
+There are 3 general steps in FCS methods ([Khatri _et al._ 2012](https://doi.org/10.1371/journal.pcbi.1002375)):
1. Calculate a gene-level statistic (we'll use log2 fold change from DESeq2 here)
2. Gene-level statistics are aggregated into a pathway-level statistic
@@ -31,7 +38,8 @@ There are 3 general steps in FCS methods ([Khatri, Sirota, and Butte. 2012]( htt
* For another example using `clusterProfiler` for GSEA, see [_Intro to DGE: Functional Analysis._ from Harvard Chan Bioinformatics Core Training.](https://hbctraining.github.io/DGE_workshop/lessons/09_functional_analysis.html)
* The way we'll use `clusterProfiler` here uses `fgsea` (Fast Gene Set Enrichment Analysis) under the hood.
-You can read more about fgsea in [Korotkevich, Sukhov, and Sergushichev. (2019)](https://doi.org/10.1101/060012).
+You can read more about fgsea in [Korotkevich _et al._ (2021)](https://doi.org/10.1101/060012).
+* [refine.bio examples Gene set enrichment analysis - RNA-seq](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) from which this material has been adapted.
## Set up
@@ -42,8 +50,6 @@ You can read more about fgsea in [Korotkevich, Sukhov, and Sergushichev. (2019)]
library(clusterProfiler)
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
-# Annotation package that we will use for human gene identifier conversion
-library(org.Hs.eg.db)
```
### Directories and Files
@@ -52,11 +58,11 @@ library(org.Hs.eg.db)
```{r}
# Where the DGE results are stored
-input_dir <- file.path("results", "gene-metrics")
+input_dir <- file.path("..", "RNA-seq", "results", "NB-cell")
# We will create a directory to specifically hold our GSEA results if it does
# not yet exist
-output_dir <- file.path("results", "gsea")
+output_dir <- file.path("results", "NB-cell")
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
@@ -67,7 +73,7 @@ if (!dir.exists(output_dir)) {
```{r input_files}
# DGE results
dge_results_file <- file.path(input_dir,
- "nb_cell_line_mycn_amplified_v_nonamplified.tsv")
+ "NB-cell_DESeq_amplified_v_nonamplified_results.tsv")
```
#### Output files
@@ -75,10 +81,30 @@ dge_results_file <- file.path(input_dir,
```{r output_files}
# GSEA pathway-level scores and statistics
gsea_results_file <- file.path(output_dir,
- "nb_cell_line_gsea_results.tsv")
+ "NB-cell_gsea_results.tsv")
```
-## Gene sets
+## Gene Set Enrichment Analysis
+
+_Adapted from [refine.bio examples](https://github.com/AlexsLemonade/refinebio-examples/blob/33cdeff66d57f9fe8ee4fcb5156aea4ac2dce07f/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd)_
+
+![](diagrams/subramanian_fig1.jpg)
+
+**Figure 1. [Subramanian _et al._ (2005)](https://doi.org/10.1073/pnas.0506580102).**
+
+GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic.
+This score reflects whether or not a gene set or pathway is over-represented at the top or bottom of the gene rankings ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).
+
+Specifically, all genes are ranked from most positive to most negative based on their statistic and a running sum is calculated:
+Starting with the most highly ranked genes, the running sum increases for each gene in the pathway and decreases for each gene not in the pathway.
+The enrichment score for a pathway is the running sum's maximum deviation from zero.
+GSEA also assesses statistical significance of the scores for each pathway through permutation testing.
+As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing ([Subramanian _et al._ 2005](https://doi.org/10.1073/pnas.0506580102); [Yu](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)).
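
The running sum described above can be sketched in a few lines of base R. This is a simplified, unweighted version written for illustration only: every gene in the pathway adds the same amount and every gene outside it subtracts the same amount, whereas the GSEA implementation used by `clusterProfiler` also weights steps by the gene-level statistic. The function name `running_es` and the toy gene names are hypothetical.

```r
# Simplified, unweighted enrichment score (illustrative sketch only)
running_es <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit <- sum(in_set)    # genes in the pathway
  n_miss <- sum(!in_set)  # genes not in the pathway
  # Step up for a gene in the pathway, step down otherwise
  steps <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running_sum <- cumsum(steps)
  # The score is the running sum's maximum deviation from zero
  running_sum[which.max(abs(running_sum))]
}

# Genes ordered from most positive to most negative statistic
ranked_genes <- c("A", "B", "C", "D", "E", "F")

es_top <- running_es(ranked_genes, gene_set = c("A", "B"))     # set at the top
es_bottom <- running_es(ranked_genes, gene_set = c("E", "F"))  # set at the bottom
```

A gene set concentrated at the top of the ranking gives a positive score, and one concentrated at the bottom gives a negative score.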
+
+The implementation of GSEA we use in these examples requires a gene list ordered by some statistic and input gene sets.
+When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.
+
+### Gene sets
In the previous notebook, we used KEGG pathways for over-representation analysis.
We identified pathways that were significantly over-represented and found that the significant pathways shared genes.
@@ -110,72 +136,117 @@ We can retrieve only the Hallmark gene sets by specifying `category = "H"` to th
head(dge_results_df)
```
-We can take the same steps we did earlier to convert these Ensembl gene IDs to Entrez IDs.
-We have to change the `column` argument to `mapIds()` (and the downstream steps).
+Since this data frame of DGE results includes gene symbols, we do not need to perform any kind of gene identifier conversion.
+We do, however, need to check for duplicate gene symbols.
+We can accomplish this with `duplicated()`, which returns a logical vector (e.g., `TRUE` or `FALSE`).
+The function `sum()` will count `TRUE` values as 1s and `FALSE` as 0s, so using it with `duplicated()` will count the number of duplicate values.
+
+```{r any_duplicated, live = TRUE}
+
+```
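
For a feel of how `duplicated()` and `sum()` work together, here is a small sketch on a toy vector. The values and the names `toy_symbols`, `n_duplicates` and `repeated_symbols` are made up for illustration and are separate from `dge_results_df`.

```r
# Toy vector of gene symbols with repeats (hypothetical values)
toy_symbols <- c("MYCN", "TP53", "MYCN", "ALK", "TP53", "MYCN")

# TRUE marks every occurrence *after* the first one
is_duplicate <- duplicated(toy_symbols)

# TRUE counts as 1 and FALSE as 0, so this counts the repeated entries
n_duplicates <- sum(is_duplicate)

# The distinct symbols that occur more than once
repeated_symbols <- unique(toy_symbols[is_duplicate])
```

Note that the first occurrence of each symbol is not flagged, so the count is the number of extra rows rather than the number of affected genes.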
-When `mapIds()` is run with `multiValues = "first"` and it encounters multiple matches, only the first identifier is returned.
-This is the default behavior.
-It will also return a named vector of the IDs you queried for (`column`), where the names are the input identifiers (`keys`), in the same order as the keys which means you can use it directly with `dplyr::mutate()`.
+This will cause a problem when we go to run GSEA.
-```{r convert_entrez, live = TRUE}
+### Removing duplicates
- # Create a new column 'entrez_id' that contains the Entrez IDs returned by
- # mapIds()
+The GSEA approach relies on discriminating between genes that are in a gene set and those that are not.
+Practically speaking, gene sets are just collections of gene identifiers!
+When the function we use for GSEA pre-ranked gets a list with duplicated gene identifiers, it can produce unexpected results.
+
+Compared to the total number of genes that are in our results, there are not a lot of duplicates but we'll still need to make a decision about how to handle them.
+
+Let's get a vector of the duplicated gene symbols so we can use it to explore our filtering steps.
+
+```{r gene_dups, live = TRUE}
+
+```
- # We need these gene identifiers to perform our pathway analysis, so remove
- # any genes where we don't have Entrez IDs
+Now we'll look at the values for the duplicated gene symbols.
+```{r show_gene_dups}
+dge_results_df %>%
+ dplyr::filter(gene_symbol %in% duplicated_gene_symbols) %>%
+ dplyr::arrange(gene_symbol)
```
-### Preranked list
+We can see that the associated values vary for each row.
-The `GSEA()` function takes a preranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
+Let's keep the gene symbols associated with the higher absolute value of the log2 fold change.
+
+Retaining the instance of each gene symbol with the higher absolute value of a gene-level statistic means that we keep the value that is likely to be ranked nearer the top or bottom of the list or, put another way, the value less likely to fall in the middle of the ranked gene list.
+We should keep this decision in mind when interpreting our results.
+For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking.
+
+In the next chunk, we are going to filter out the duplicated rows using the `dplyr::distinct()` function after sorting by absolute value of the log2 fold change.
+This will keep the first row with the duplicated value, thus keeping the row with the largest absolute value.
+
+```{r filter_dge}
+filtered_dge_df <- dge_results_df %>%
+ # Sort so that the highest absolute values of the log2 fold change are at the
+ # top
+ dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>%
+ # Filter out the duplicated rows using `dplyr::distinct()`
+ dplyr::distinct(gene_symbol, .keep_all = TRUE)
+```
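
The same sort-then-deduplicate idea can be sketched in base R on a toy data frame, which makes it easy to check which rows survive. This is an illustration under assumed values, not a replacement for the `dplyr` chunk above; `toy_dge_df`, `sorted_df` and `filtered_toy_df` are hypothetical names.

```r
# Toy DGE-like table with duplicated gene symbols (hypothetical values)
toy_dge_df <- data.frame(
  gene_symbol = c("GENE_A", "GENE_B", "GENE_A", "GENE_C", "GENE_B"),
  log2FoldChange = c(0.5, -3.2, -2.1, 1.0, 0.4),
  stringsAsFactors = FALSE
)

# Sort so the largest absolute log2 fold changes come first
sorted_df <- toy_dge_df[order(abs(toy_dge_df$log2FoldChange),
                              decreasing = TRUE), ]

# Keep only the first row seen for each gene symbol
filtered_toy_df <- sorted_df[!duplicated(sorted_df$gene_symbol), ]
```

Each gene symbol now appears once, paired with whichever of its values was furthest from zero.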
+
+Let's see what happened to our duplicate identifiers.
+
+```{r show_filtered_dge, live = TRUE}
+# Subset to & arrange by gene symbols that were duplicated in the original
+# data frame of results
+
+```
+
+Now we're ready to prep our pre-ranked list for GSEA.
+
+### Pre-ranked list
+
+The `GSEA()` function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers.
This is step 1 -- gene-level statistics.
```{r lfc_vector}
-lfc_vector <- dge_results_df$log2FoldChange
-names(lfc_vector) <- dge_results_df$entrez_id
+lfc_vector <- filtered_dge_df %>%
+ # Extract a vector of `log2FoldChange` named by `gene_symbol`
+ dplyr::pull(log2FoldChange, name = gene_symbol)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
```
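
For reference, a named and sorted vector of this shape can also be built with base R alone. The sketch below uses a toy data frame with made-up values; `toy_df` and `toy_lfc_vector` are hypothetical names.

```r
# Toy table of gene-level statistics (hypothetical values)
toy_df <- data.frame(
  gene_symbol = c("GENE_A", "GENE_B", "GENE_C"),
  log2FoldChange = c(-2.1, 1.0, 3.2),
  stringsAsFactors = FALSE
)

# Values become the vector, gene symbols become the names
toy_lfc_vector <- setNames(toy_df$log2FoldChange, toy_df$gene_symbol)

# sort() reorders the values and carries the names along with them
toy_lfc_vector <- sort(toy_lfc_vector, decreasing = TRUE)
```

The names travel with their values during sorting, which is what lets a pre-ranked list keep track of which gene each statistic belongs to.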
+Let's look at the top ranked values.
+
```{r head_lfc, live = TRUE}
# Look at first entries of the log2 fold change vector
```
+And the bottom of the list.
+
```{r tail_lfc, live = TRUE}
# Look at the last entries of the log2 fold change vector
```
-## GSEA
-
-![](diagrams/subramanian_fig1.jpg)
-
-**Figure 1. [Subramanian, Tamayo *et al.* (2005)](https://doi.org/10.1073/pnas.0506580102).**
+## Run GSEA
-The enrichment score (ES) for a pathway, a pathway-level statistic, is calculated using our gene-level statistics.
-Genes are ranked from most highly positive to most highly negative and weighting them according to their gene-level statistic.
-A running score is calculated by starting with the most highly ranked genes and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway.
-The ES is the maximum deviation from zero.
-Significance is assessed by generating a null distribution by sampling random gene sets of the same size and an FDR (false discovery rate) value is calculated to account for multiple hypothesis testing.
-([Subramanian, Tamayo *et al.* 2005](https://doi.org/10.1073/pnas.0506580102); [Korotkevich, Sukhov, and Sergushichev. 2019](https://doi.org/10.1101/060012)).
+Now for the analysis!
We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`).
```{r run_gsea}
gsea_results <- GSEA(geneList = lfc_vector, # ordered ranked gene list
- nPerm = 1000, # number of permutations
minGSSize = 25, # minimum gene set size
maxGSSize = 500, # maximum gene set size
pvalueCutoff = 0.05,
- pAdjustMethod = "BH", # Benjamini-Hochberg correction
+ pAdjustMethod = "BH", # correction for multiple hypothesis testing
TERM2GENE = dplyr::select(hs_hallmark_df,
gs_name,
- entrez_gene))
+ gene_symbol))
```
-Let's take a look at the results.
+The warning about ties means that there are multiple genes that have the same log2 fold change value.
+This percentage is small and unlikely to impact our results.
+A large number of ties might tell us there's something wrong with our DGE results ([Ballereau _et al._ 2018](https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/RNASeq2018/html/06_Gene_set_testing.nb.html)).
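
One rough way to gauge how many ties a ranked list contains is to count repeated values directly. The sketch below uses a toy vector and is only an approximation for illustration; it is not necessarily how the figure in the warning message is computed, and `toy_stats`, `n_ties` and `percent_ties` are hypothetical names.

```r
# Toy named vector of gene-level statistics (hypothetical values)
toy_stats <- c(GENE_A = 2.5, GENE_B = 1.0, GENE_C = 1.0,
               GENE_D = -0.3, GENE_E = -1.7)

# Count values that repeat an earlier value in the vector
n_ties <- sum(duplicated(toy_stats))

# Express the count as a percentage of the list
percent_ties <- 100 * n_ties / length(toy_stats)
```

A small percentage, as in our results, suggests the arbitrary ordering of tied genes will have little effect on the ranking.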
+
+Let's take a look at the GSEA results.
```{r view_gsea, live = TRUE, eval = FALSE}
@@ -189,21 +260,20 @@ Let's write these results to file.
gsea_results@result %>% readr::write_tsv(gsea_results_file)
```
-
### Visualizing GSEA results
We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`.
-Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one somewhere in the middle -- to get more insight into how ES are calculated.
+Let's take a look at 3 different pathways -- one with a highly positive NES, one with a highly negative NES, and one that was not a significant result -- to get more insight into how ES are calculated.
#### Highly Positive NES
-The gene set `HALLMARK_MYC_TARGETS_V2` had high positive log2 fold changes.
+The gene set `HALLMARK_MYC_TARGETS_V1` had high positive log2 fold changes.
Recall that a positive log2 fold change means a gene had a higher expression value in _MYCN_ amplified cell lines.
-```{r myc_v2}
+```{r myc_v1}
enrichplot::gseaplot(gsea_results,
- geneSetID = "HALLMARK_MYC_TARGETS_V2",
- title = "HALLMARK_MYC_TARGETS_V2",
+ geneSetID = "HALLMARK_MYC_TARGETS_V1",
+ title = "HALLMARK_MYC_TARGETS_V1",
color.line = "#0066FF")
```
@@ -211,32 +281,29 @@ Notice how the genes that are in the gene set, indicated by the black bars, tend
#### Highly Negative NES
-The gene set `HALLMARK_INFLAMMATORY_RESPONSE` had a highly negative NES.
+The gene set `HALLMARK_INTERFERON_ALPHA_RESPONSE` had a highly negative NES.
```{r inflammatory}
enrichplot::gseaplot(gsea_results,
- geneSetID = "HALLMARK_INFLAMMATORY_RESPONSE",
- title = "HALLMARK_INFLAMMATORY_RESPONSE",
+ geneSetID = "HALLMARK_INTERFERON_ALPHA_RESPONSE",
+ title = "HALLMARK_INTERFERON_ALPHA_RESPONSE",
color.line = "#0066FF")
```
This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph.
-#### Somewhere in the middle
+#### A non-significant result
-A moderately negative NES is somewhere in the middle in this particular experiment.
-Let's look at `HALLMARK_P53_PATHWAY`.
+The `@result` slot will only show gene sets that pass the `pvalueCutoff` threshold we supplied to `GSEA()`, but we can plot any gene set so long as we know its name.
+Let's look at `HALLMARK_P53_PATHWAY`, which was not in the results we viewed earlier.
+
+```{r p53, live = TRUE}
-```{r p53}
-enrichplot::gseaplot(gsea_results,
- geneSetID = "HALLMARK_P53_PATHWAY",
- title = "HALLMARK_P53_PATHWAY",
- color.line = "#0066FF")
```
Genes in the pathway are distributed more evenly throughout the ranked list, resulting in a more "middling" score.
-*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we can use `ggplot2::ggsave()` to save them to file.*
+*Note: The plots returned by `enrichplot::gseaplot` are ggplots, so we could use `ggplot2::ggsave()` to save them to file if we wanted to.*
## Session Info
diff --git a/pathway-analysis/02-gene_set_enrichment_analysis.nb.html b/pathway-analysis/02-gene_set_enrichment_analysis.nb.html
index 724644de..a46c6641 100644
--- a/pathway-analysis/02-gene_set_enrichment_analysis.nb.html
+++ b/pathway-analysis/02-gene_set_enrichment_analysis.nb.html
@@ -350,23 +350,29 @@ Set up
Libraries
-
+
# Package to run GSEA
library(clusterProfiler)
-
+
+
+
+
clusterProfiler v3.18.1 For help: https://guangchuangyu.github.io/software/clusterProfiler
If you use clusterProfiler in published research, please cite:
-Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
-
-Attaching package: ‘clusterProfiler’
-
-The following object is masked from ‘package:stats’:
+Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
+
+
+
+Attaching package: 'clusterProfiler'
+
+
+The following object is masked from 'package:stats':
filter
-
-
+
+
# Package that contains the MSigDB gene sets in tidy format
library(msigdbr)
@@ -379,7 +385,7 @@ Directories and Files
Directories
-
+
# Where the DGE results are stored
input_dir <- file.path("..", "RNA-seq", "results", "NB-cell")
@@ -397,7 +403,7 @@ Directories
Input files
-
+
# DGE results
dge_results_file <- file.path(input_dir,
"NB-cell_DESeq_amplified_v_nonamplified_results.tsv")
@@ -409,7 +415,7 @@ Input files
Output files
-
+
# GSEA pathway-level scores and statistics
gsea_results_file <- file.path(output_dir,
"NB-cell_gsea_results.tsv")
@@ -438,7 +444,7 @@ Gene sets
We can retrieve only the Hallmark gene sets by specifying category = "H"
to the msigdbr()
function.
-
+
hs_hallmark_df <- msigdbr(species = "Homo sapiens",
category = "H")
@@ -450,13 +456,13 @@ Gene sets
Differential gene expression results
-
+
# Read in the DGE results
dge_results_df <- readr::read_tsv(dge_results_file)
-
+
-── Column specification ───────────────────────────────────────────────────────────────────────────────────────────
+── Column specification ────────────────────────────────────────────────────────
cols(
gene_id = col_character(),
baseMean = col_double(),
@@ -466,27 +472,25 @@ Differential gene expression results
padj = col_double(),
gene_symbol = col_character()
)
-
+
-
+
head(dge_results_df)
-
-
Since this data frame of DGE results includes gene symbols, we do not need to perform any kind of gene identifier conversion. We do, however, need to check for duplicate gene symbols. We can accomplish this with duplicated()
, which returns a logical vector (e.g., TRUE
or FALSE
). The function sum()
will count TRUE
values as 1s and FALSE
as 0s, so using it with duplicated()
will count the number of duplicate values.
-
+
sum(duplicated(dge_results_df$gene_symbol))
@@ -502,7 +506,7 @@ Removing duplicates
Let’s get a vector of the duplicated gene symbols so we can use it to explore our filtering steps.
-
+
duplicated_gene_symbols <- dge_results_df %>%
dplyr::filter(duplicated(gene_symbol)) %>%
dplyr::pull(gene_symbol)
@@ -512,18 +516,16 @@ Removing duplicates
Now we’ll look at the values for the duplicated gene symbols.
-
+
dge_results_df %>%
dplyr::filter(gene_symbol %in% duplicated_gene_symbols) %>%
dplyr::arrange(gene_symbol)
-
-
We can see that the associated values vary for each row.
@@ -532,7 +534,7 @@ Removing duplicates
In the next chunk, we are going to filter out the duplicated rows using the dplyr::distinct()
function after sorting by absolute value of the log2 fold change. This will keep the first row with the duplicated value, thus keeping the row with the largest absolute value.
-
+
filtered_dge_df <- dge_results_df %>%
# Sort so that the highest absolute values of the log2 fold change are at the
# top
@@ -545,20 +547,18 @@ Removing duplicates
Let’s see what happened to our duplicate identifiers.
-
+
# Subset to & arrange by gene symbols that were duplicated in the original
# data frame of results
filtered_dge_df %>%
dplyr::filter(gene_symbol %in% duplicated_gene_symbols) %>%
dplyr::arrange(gene_symbol)
-
-
Now we’re ready to prep our pre-ranked list for GSEA.
@@ -568,7 +568,7 @@ Pre-ranked list
The GSEA()
function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers. This is step 1 – gene-level statistics.
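To make the expected input shape concrete, here is a minimal base R sketch of a named, sorted vector of statistics. The data frame `toy_stats_df` and its values are hypothetical; the decreasing sort order is an assumption based on the output shown in this notebook, where the largest values appear first.

```r
# Hypothetical gene-level statistics
toy_stats_df <- data.frame(
  gene_symbol = c("GENE_A", "GENE_B", "GENE_C"),
  log2FoldChange = c(-3.1, 1.2, 0.4),
  stringsAsFactors = FALSE
)

# Build a numeric vector whose names are the gene identifiers
toy_lfc_vector <- setNames(toy_stats_df$log2FoldChange,
                           toy_stats_df$gene_symbol)

# Rank the genes by sorting from the largest to the smallest statistic
toy_lfc_vector <- sort(toy_lfc_vector, decreasing = TRUE)
toy_lfc_vector
```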
-
+
lfc_vector <- filtered_dge_df %>%
# Extract a vector of `log2FoldChange` named by `gene_symbol`
dplyr::pull(log2FoldChange, name = gene_symbol)
@@ -579,26 +579,26 @@ Pre-ranked list
Let’s look at the top ranked values.
-
+
# Look at first entries of the log2 fold change vector
head(lfc_vector)
-
+
GAGE12D GABBR1 FABP5P7 KIAA0355 MNX1 GAGE2A
-23.620487 23.130027 22.682631 22.562940 6.813764 6.801245
+23.620487 23.130046 22.682641 22.562941 6.813764 6.801245
And the bottom of the list.
-
+
# Look at the last entries of the log2 fold change vector
tail(lfc_vector)
-
+
MUC15 TFPI2 LENG1 MT1M DAZ2 CSPG4P4Y
- -8.017004 -8.768669 -22.754783 -25.318887 -25.459848 -26.095612
+ -8.017004 -8.768669 -22.754783 -25.318887 -25.444288 -26.095612
@@ -610,7 +610,7 @@ Run GSEA
We can use the GSEA()
function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., gseKEGG()
).
-
+
gsea_results <- GSEA(geneList = lfc_vector, # ordered ranked gene list
minGSSize = 25, # minimum gene set size
maxGSSize = 500, # maximum gene set size
@@ -620,13 +620,22 @@ Run GSEA
gs_name,
gene_symbol))
-
-preparing geneSet collections...
-GSEA analysis...
-There are ties in the preranked stats (0.06% of the list).
-The order of those tied genes will be arbitrary, which may produce unexpected results.leading edge analysis...
-done...
-
+
+preparing geneSet collections...
+
+
+GSEA analysis...
+
+
+Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.06% of the list).
+The order of those tied genes will be arbitrary, which may produce unexpected results.
+
+
+leading edge analysis...
+
+
+done...
+
The warning about ties means that there are multiple genes that have the same log2 fold change value. This percentage is small and unlikely to impact our results. A large number of ties might tell us there’s something wrong with our DGE results (Ballereau et al. 2018).
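One rough way to gauge how many ties a ranked vector contains is sketched below. The vector `toy_stats` is hypothetical, and this calculation may differ in detail from the one behind the warning message, so treat it as an approximation.

```r
# Hypothetical pre-ranked statistics with two pairs of tied values
toy_stats <- c(GENE_A = 2.5, GENE_B = 1.0, GENE_C = 1.0,
               GENE_D = -0.3, GENE_E = -0.3, GENE_F = -4.2)

# Count values that repeat an earlier value in the vector
n_tied <- sum(duplicated(toy_stats))

# Express the ties as a percentage of the full list
percent_tied <- round(100 * n_tied / length(toy_stats), digits = 2)
percent_tied
```

A small percentage, like the 0.06% reported above, suggests ties are rare; a large one would be a reason to revisit the upstream statistics.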
@@ -644,7 +653,7 @@ Run GSEA
Let’s write these results to file.
-
+
gsea_results@result %>% readr::write_tsv(gsea_results_file)
@@ -657,14 +666,14 @@ Highly Positive NES
The gene set HALLMARK_MYC_TARGETS_V1
had high positive log2 fold changes. Recall that a positive log2 fold change means a gene had a higher expression value in MYCN amplified cell lines.
-
+
enrichplot::gseaplot(gsea_results,
geneSetID = "HALLMARK_MYC_TARGETS_V1",
title = "HALLMARK_MYC_TARGETS_V1",
color.line = "#0066FF")
-
-
+
+
@@ -675,14 +684,14 @@ Highly Negative NES
The gene set HALLMARK_INTERFERON_ALPHA_RESPONSE
had a highly negative NES.
-
+
enrichplot::gseaplot(gsea_results,
geneSetID = "HALLMARK_INTERFERON_ALPHA_RESPONSE",
title = "HALLMARK_INTERFERON_ALPHA_RESPONSE",
color.line = "#0066FF")
-
-
+
+
@@ -693,14 +702,14 @@ A non-significant result
The @result
slot will only show gene sets that pass the pvalueCutoff
threshold we supplied to GSEA()
, but we can plot any gene set so long as we know its name. Let’s look at HALLMARK_P53_PATHWAY
, which was not in the results we viewed earlier.
-
+
enrichplot::gseaplot(gsea_results,
geneSetID = "HALLMARK_P53_PATHWAY",
title = "HALLMARK_P53_PATHWAY",
color.line = "#0066FF")
-
-
+
+
@@ -713,49 +722,64 @@ A non-significant result
Session Info
-
+
sessionInfo()
-
+
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
-Running under: Ubuntu 18.04.3 LTS
+Running under: Ubuntu 20.04 LTS
Matrix products: default
-BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
-LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
+BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
- [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8
- [5] LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C
- [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
-[1] stats graphics grDevices datasets utils methods base
+[1] stats graphics grDevices utils datasets methods base
other attached packages:
-[1] msigdbr_7.2.1 clusterProfiler_3.18.1
+[1] msigdbr_7.2.1 clusterProfiler_3.18.1 optparse_1.6.6
loaded via a namespace (and not attached):
- [1] enrichplot_1.10.2 bit64_4.0.5 RColorBrewer_1.1-2 tools_4.0.3 R6_2.5.0
- [6] DBI_1.1.1 BiocGenerics_0.36.0 colorspace_2.0-0 tidyselect_1.1.0 gridExtra_2.3
-[11] bit_4.0.4 compiler_4.0.3 cli_2.2.0 Biobase_2.50.0 scatterpie_0.1.5
-[16] labeling_0.4.2 shadowtext_0.0.7 scales_1.1.1 readr_1.4.0 stringr_1.4.0
-[21] digest_0.6.27 rmarkdown_2.6 DOSE_3.16.0 pkgconfig_2.0.3 htmltools_0.5.1.1
-[26] fastmap_1.1.0 rlang_0.4.10 rstudioapi_0.13 RSQLite_2.2.3 farver_2.0.3
-[31] generics_0.1.0 jsonlite_1.7.2 BiocParallel_1.24.1 GOSemSim_2.16.1 dplyr_1.0.3
-[36] magrittr_2.0.1 GO.db_3.12.1 Matrix_1.3-2 fansi_0.4.2 Rcpp_1.0.6
-[41] munsell_0.5.0 S4Vectors_0.28.1 viridis_0.5.1 lifecycle_0.2.0 stringi_1.5.3
-[46] yaml_2.2.1 ggraph_2.0.4 MASS_7.3-53 plyr_1.8.6 qvalue_2.22.0
-[51] grid_4.0.3 blob_1.2.1 parallel_4.0.3 ggrepel_0.9.1 DO.db_2.9
-[56] crayon_1.3.4 lattice_0.20-41 graphlayouts_0.7.1 cowplot_1.1.1 splines_4.0.3
-[61] hms_1.0.0 knitr_1.30 pillar_1.4.7 fgsea_1.16.0 igraph_1.2.6
-[66] reshape2_1.4.4 stats4_4.0.3 fastmatch_1.1-0 glue_1.4.2 evaluate_0.14
-[71] downloader_0.4 data.table_1.13.6 renv_0.12.5-2 BiocManager_1.30.10 vctrs_0.3.6
-[76] tweenr_1.0.1 gtable_0.3.0 purrr_0.3.4 polyclip_1.10-0 tidyr_1.1.2
-[81] assertthat_0.2.1 cachem_1.0.1 ggplot2_3.3.3 xfun_0.20 ggforce_0.3.2
-[86] tidygraph_1.2.0 viridisLite_0.3.0 tibble_3.0.5 rvcheck_0.1.8 AnnotationDbi_1.52.0
-[91] memoise_1.1.0 IRanges_2.24.1 ellipsis_0.3.1
+ [1] enrichplot_1.10.2 bit64_4.0.5 RColorBrewer_1.1-2
+ [4] tools_4.0.3 R6_2.5.0 DBI_1.1.1
+ [7] BiocGenerics_0.36.0 colorspace_2.0-0 tidyselect_1.1.0
+[10] gridExtra_2.3 bit_4.0.4 compiler_4.0.3
+[13] cli_2.2.0 Biobase_2.50.0 scatterpie_0.1.5
+[16] labeling_0.4.2 shadowtext_0.0.7 scales_1.1.1
+[19] readr_1.4.0 stringr_1.4.0 digest_0.6.27
+[22] rmarkdown_2.6 DOSE_3.16.0 pkgconfig_2.0.3
+[25] htmltools_0.5.1.1 fastmap_1.1.0 rlang_0.4.10
+[28] rstudioapi_0.13 RSQLite_2.2.3 farver_2.0.3
+[31] generics_0.1.0 jsonlite_1.7.2 BiocParallel_1.24.1
+[34] GOSemSim_2.16.1 dplyr_1.0.3 magrittr_2.0.1
+[37] GO.db_3.12.1 Matrix_1.3-2 fansi_0.4.2
+[40] Rcpp_1.0.6 munsell_0.5.0 S4Vectors_0.28.1
+[43] viridis_0.5.1 lifecycle_0.2.0 stringi_1.5.3
+[46] yaml_2.2.1 ggraph_2.0.4 MASS_7.3-53
+[49] plyr_1.8.6 qvalue_2.22.0 grid_4.0.3
+[52] blob_1.2.1 parallel_4.0.3 ggrepel_0.9.1
+[55] DO.db_2.9 crayon_1.3.4 lattice_0.20-41
+[58] graphlayouts_0.7.1 cowplot_1.1.1 splines_4.0.3
+[61] hms_1.0.0 ps_1.5.0 knitr_1.30
+[64] pillar_1.4.7 fgsea_1.16.0 igraph_1.2.6
+[67] reshape2_1.4.4 stats4_4.0.3 fastmatch_1.1-0
+[70] glue_1.4.2 evaluate_0.14 downloader_0.4
+[73] data.table_1.13.6 BiocManager_1.30.10 vctrs_0.3.6
+[76] tweenr_1.0.1 gtable_0.3.0 getopt_1.20.3
+[79] purrr_0.3.4 polyclip_1.10-0 tidyr_1.1.2
+[82] assertthat_0.2.1 cachem_1.0.1 ggplot2_3.3.3
+[85] xfun_0.20 ggforce_0.3.2 tidygraph_1.2.0
+[88] viridisLite_0.3.0 tibble_3.0.5 rvcheck_0.1.8
+[91] AnnotationDbi_1.52.0 memoise_1.1.0 IRanges_2.24.1
+[94] ellipsis_0.3.1
diff --git a/pathway-analysis/03-gene_set_variation_analysis-live.Rmd b/pathway-analysis/03-gene_set_variation_analysis-live.Rmd
index fc47b880..13369210 100644
--- a/pathway-analysis/03-gene_set_variation_analysis-live.Rmd
+++ b/pathway-analysis/03-gene_set_variation_analysis-live.Rmd
@@ -8,15 +8,25 @@ author: CCDL for ALSF
date: 2020
---
+## Objectives
+
+This notebook will demonstrate how to:
+
+- Identify when Gene Set Variation Analysis (GSVA) is well-suited for an analysis
+- Perform GSVA on transformed RNA-seq data with the `GSVA` package
+- Explore the dependence of GSVA scores on gene set size with random gene sets
+
+---
+
So far every pathway analysis method we've covered relies on some information about groups of samples in our data.
-For over-representation analysis (ORA), we took the top 100 genes that distinguished neurons from other cell types in a scRNA-seq experiment -- it relied on cell type labels.
+For over-representation analysis (ORA), we created gene sets from two different two-group comparisons.
In the Gene Set Enrichment Analysis (GSEA) example, we used statistics from a differential gene expression (DGE) analysis where we compared _MYCN_ amplified cell lines to non-amplified cell lines; we needed that amplification status information.
What if we're less sure about groups in our data or we want to analyze our data in a more unsupervised manner?
-In this notebook we will cover a method called Gene Set Variation Analysis (GSVA) ([Hänzelmann, Castelo, and Guinney. 2013](https://doi.org/10.1186/1471-2105-14-7)) that allows us to calculate gene set or pathway scores on a per-sample basis.
+In this notebook we will cover a method called Gene Set Variation Analysis (GSVA) ([Hänzelmann _et al._ 2013](https://doi.org/10.1186/1471-2105-14-7)) that allows us to calculate gene set or pathway scores on a per-sample basis.
-We like this quote from the GSVA paper ([Hänzelmann, Castelo, and Guinney. 2013](https://doi.org/10.1186/1471-2105-14-7)) to set the stage:
+We like this quote from the GSVA paper ([Hänzelmann _et al._ 2013](https://doi.org/10.1186/1471-2105-14-7)) to set the stage:
> While [gene set enrichment] methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology.
@@ -45,10 +55,10 @@ library(GSVA)
```{r directories}
# We have some medulloblastoma data from the OpenPBTA project that we've
# prepared ahead of time
-input_dir <- file.path("data", "medulloblastoma")
+input_dir <- file.path("data", "open-pbta")
-# Create a directory specifically for our GSVA results if it does not yet exist
-output_dir <- file.path("results", "gsva")
+# Create a directory specifically for the results using this dataset
+output_dir <- file.path("results", "open-pbta")
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
@@ -124,14 +134,14 @@ For GSVA, we need a matrix.
![](diagrams/hanzelmann_fig1.jpg)
-**Figure 1 from [Hänzelmann, Castelo, and Guinney. (2013)](https://doi.org/10.1186/1471-2105-14-7).**
+**Figure 1 from [Hänzelmann _et al._ (2013)](https://doi.org/10.1186/1471-2105-14-7).**
You may notice that GSVA has some commonalities with GSEA.
Rather than ranking genes based on some statistic _we_ selected ahead of time, GSVA fits a model and ranks genes based on their expression level relative to the sample distribution.
-This is a way of asking if a gene _i_ is highly or lowly expressed in a sample _j_ in the context of this experiment and ranking accordingly ([Hänzelmann, Castelo, and Guinney. 2013](https://doi.org/10.1186/1471-2105-14-7)).
+This is a way of asking if a gene _i_ is highly or lowly expressed in a sample _j_ in the context of this experiment and ranking accordingly ([Hänzelmann _et al._ 2013](https://doi.org/10.1186/1471-2105-14-7)).
The pathway-level score calculated is a way of asking how genes _within_ a gene set vary as compared to genes that are _outside_ of that gene set ([Malhotra. 2018](https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3)).
(This is sometimes called a competitive test in gene set enrichment literature.)
-The intuition here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (overexpressed or underexpressed relative to the overall population) ([Hänzelmann, Castelo, and Guinney. 2013](https://doi.org/10.1186/1471-2105-14-7)).
+The intuition here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (overexpressed or underexpressed relative to the overall population) ([Hänzelmann _et al._ 2013](https://doi.org/10.1186/1471-2105-14-7)).
The output is a gene set by sample matrix of GSVA scores.
@@ -174,22 +184,22 @@ all_genes <- rownames(rnaseq_mat)
set.seed(2020)
```
-Our minimum gene set size earlier was 15 genes and our maximum gene set size was 500 genes. We'll use the same minimum and maximum values for our random gene sets and some values in between.
+Our minimum gene set size earlier was 15 genes and our maximum gene set size was 500 genes.
+We'll use the same minimum and maximum values for our random gene sets and some values in between.
-```{r}
+```{r gene_set_sizes}
# Make a list of integers that indicate the random gene set sizes
gene_set_size <- list(15, 25, 50, 100, 250, 500)
```
-For each gene set size, we will generate 100 random gene sets
+For each gene set size, we will generate 100 random gene sets.
-```{r}
+```{r random_gene_sets}
# Set number of replicates
-nreps <- 100
+nreps <- 100
# Generate 100 random gene sets of each size
-random_gene_sets <-
+random_gene_sets <- rep(gene_set_size, nreps) %>% # Repeat gene sizes so we run `nreps` times
purrr::map(
- rep(gene_set_size, nreps), # Repeat gene sizes so we run `nreps` times
# Sample the vector of all genes, choosing the number of items specified
# in the element of gene set size
~ base::sample(x = all_genes,
@@ -199,16 +209,19 @@ random_gene_sets <-
The Hallmarks list we used earlier stored the gene set names as the name of the list, so let's add names to our random gene sets that indicate what size they are and so `gsva()` doesn't get upset.
-```{r}
+```{r name_random_gene_sets}
# We will include the size of the gene set in the gene set name
# Start by taking the length of each pathway and prepending "pathway_" to that
# number
-lengths_vector <- purrr::map(random_gene_sets, ~ length(.x)) %>%
+lengths_vector <- random_gene_sets %>%
+ # Get the length of each gene set (number of genes)
+ purrr::map(~ length(.x)) %>%
+ # Prefix each length with "pathway_"
purrr::map(~ paste0("pathway_", .x)) %>%
# Return a vector
purrr::flatten_chr()
-# Add the names in lengths_vector to the list
+# Add the "pathway_"-prefixed names in lengths_vector to the list
random_gene_sets <- random_gene_sets %>%
# make.names() appends a "version" if something is not unique
purrr::set_names(nm = make.names(lengths_vector, unique = TRUE))
@@ -299,7 +312,7 @@ Here's a figure from the OpenPBTA project, where the middle panel is a heatmap o
gsva_results %>%
as.data.frame() %>%
tibble::rownames_to_column("pathway") %>%
- readr::write_tsv(path = gsva_results_file)
+ readr::write_tsv(file = gsva_results_file)
```
## Session Info
diff --git a/pathway-analysis/03-gene_set_variation_analysis.nb.html b/pathway-analysis/03-gene_set_variation_analysis.nb.html
index 6350426d..c3733022 100644
--- a/pathway-analysis/03-gene_set_variation_analysis.nb.html
+++ b/pathway-analysis/03-gene_set_variation_analysis.nb.html
@@ -347,7 +347,7 @@ Set up
Libraries
-
+
# Pipes
library(magrittr)
# Gene Set Variation Analysis
@@ -362,7 +362,7 @@ Directories and files
Directories
-
+
# We have some medulloblastoma data from the OpenPBTA project that we've
# prepared ahead of time
input_dir <- file.path("data", "open-pbta")
@@ -381,7 +381,7 @@ Input
We have VST transformed RNA-seq data, annotated with gene symbols, that has been collapsed such that there are no duplicated gene identifiers (see setup
).
-
+
rnaseq_file <- file.path(input_dir, "medulloblastoma_vst_collapsed.tsv")
@@ -391,7 +391,7 @@ Input
Output
-
+
gsva_results_file <- file.path(output_dir, "medulloblastoma_gsva_results.tsv")
@@ -407,7 +407,7 @@ Gene sets
The RNA-seq data uses gene symbols, so we need gene sets that use gene symbols, too.
-
+
# R can often read in data from a URL
hallmarks_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.1/h.all.v7.1.symbols.gmt"
@@ -433,39 +433,37 @@ RNA-seq data
We’re only working with the medulloblastoma samples in this example.
-
+
rnaseq_df <- readr::read_tsv(rnaseq_file)
-
+
-── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────
+── Column specification ────────────────────────────────────────────────────────
cols(
.default = col_double(),
gene_symbol = col_character()
)
ℹ Use `spec()` for the full column specifications.
-
+
-
+
# What does the RNA-seq data frame look like?
rnaseq_df[1:5, 1:5]
-
-
For GSVA, we need a matrix.
-
+
rnaseq_mat <- rnaseq_df %>%
tibble::column_to_rownames("gene_symbol") %>%
as.matrix()
@@ -484,7 +482,7 @@ GSVA
Perform GSVA
-
+
gsva_results <- gsva(rnaseq_mat,
hallmarks_list,
method = "gsva",
@@ -497,132 +495,143 @@ Perform GSVA
# Compute Gaussian-distributed scores
mx.diff = TRUE)
-
-945 genes with constant expression values throuhgout the samples.Since argument method!="ssgsea", genes with constant expression values are discarded.
-
-
+
+Warning in .filterFeatures(expr, method): 945 genes with constant expression
+values throuhgout the samples.
+
+
+Warning in .filterFeatures(expr, method): Since argument method!="ssgsea", genes
+with constant expression values are discarded.
+
+
Estimating GSVA scores for 50 gene sets.
Estimating ECDFs with Gaussian kernels
- |
- | | 0%
- |====================================================================================================================| 100%
+ |
+ | | 0%
+ |======================================================================| 100%
Note: the gsva()
documentation says we can use kcdf = "Gaussian"
if we had RNA-seq log-CPMs, log-RPKMs or log-TPMs, but we would use kcdf = "Poisson"
on integer counts.
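The choice between the two kernels could be expressed as a small check on the input matrix. This is only a heuristic sketch: `choose_kcdf()` is a hypothetical helper, not part of the GSVA package, and it assumes a numeric matrix with no missing values.

```r
# Hypothetical helper: suggest a kcdf value from the contents of a matrix
choose_kcdf <- function(expr_mat) {
  # Integer counts have no fractional part
  is_integer_valued <- all(expr_mat == round(expr_mat))
  if (is_integer_valued) "Poisson" else "Gaussian"
}

# Toy integer counts and a log-transformed version of them
counts_mat <- matrix(c(0, 12, 5, 30), nrow = 2)
log_mat <- log2(counts_mat + 1)

choose_kcdf(counts_mat)
choose_kcdf(log_mat)
```

Knowing how the data were processed is more reliable than inspecting values, so a check like this is best used as a sanity test rather than as the deciding factor.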
-
+
# Let's explore what the output of gsva() looks like
gsva_results[1:5, 1:5]
-
- BS_09Z7TC35 BS_1AYRM596 BS_1BWP5MCT BS_1QXEC43H BS_1TWCV047
-HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.44911439 0.5208667 -0.5609193 0.32300630 -0.1468081
-HALLMARK_HYPOXIA -0.38297104 0.2436910 -0.5058759 0.36247083 -0.2971559
-HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.26534735 0.1054224 -0.5180933 0.32418657 -0.4561386
-HALLMARK_MITOTIC_SPINDLE 0.12727006 0.2339489 -0.4338076 -0.07023068 -0.1861483
-HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.09287646 0.1602124 -0.4427594 0.41372523 -0.1152437
+
+ BS_09Z7TC35 BS_1AYRM596 BS_1BWP5MCT
+HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.44911439 0.5208667 -0.5609193
+HALLMARK_HYPOXIA -0.38297104 0.2436910 -0.5058759
+HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.26534735 0.1054224 -0.5180933
+HALLMARK_MITOTIC_SPINDLE 0.12727006 0.2339489 -0.4338076
+HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.09287646 0.1602124 -0.4427594
+ BS_1QXEC43H BS_1TWCV047
+HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.32300630 -0.1468081
+HALLMARK_HYPOXIA 0.36247083 -0.2971559
+HALLMARK_CHOLESTEROL_HOMEOSTASIS 0.32418657 -0.4561386
+HALLMARK_MITOTIC_SPINDLE -0.07023068 -0.1861483
+HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.41372523 -0.1152437
@@ -633,7 +642,7 @@ A note on gene set size
We need to get a collection of all possible genes we will sample from to create random gene sets. Because we’re doing some random sampling, we need to set a seed for this to be reproducible.
-
+
# Use all the gene symbols in the dataset as the pool of possible genes
all_genes <- rownames(rnaseq_mat)
@@ -645,7 +654,7 @@ A note on gene set size
Our minimum gene set size earlier was 15 genes and our maximum gene set size was 500 genes. We’ll use the same minimum and maximum values for our random gene sets and some values in between.
-
+
# Make a list of integers that indicate the random gene set sizes
gene_set_size <- list(15, 25, 50, 100, 250, 500)
@@ -654,7 +663,7 @@ A note on gene set size
For each gene set size, we will generate 100 random gene sets.
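The sampling scheme can be sketched in base R on a much smaller scale. Everything prefixed with `toy_` is hypothetical, and `lapply()` is used here in place of the purrr call in the notebook.

```r
# Make the random sampling reproducible
set.seed(2020)

# Hypothetical pool of genes, set sizes, and number of replicates
toy_genes <- paste0("GENE_", 1:50)
toy_sizes <- c(3, 5, 10)
toy_nreps <- 4

# Repeat the sizes so each one is sampled toy_nreps times, then draw
# that many genes (without replacement) for each element
toy_random_sets <- lapply(
  rep(toy_sizes, toy_nreps),
  function(set_size) sample(x = toy_genes, size = set_size)
)

# Check the sizes of the resulting gene sets
lengths(toy_random_sets)
```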
-
+
# Set number of replicates
nreps <- 100
# Generate 100 random gene sets of each size
@@ -671,7 +680,7 @@ A note on gene set size
The Hallmarks list we used earlier stored the gene set names as the name of the list, so let’s add names to our random gene sets that indicate what size they are and so gsva()
doesn’t get upset.
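Because many random gene sets share the same size, their names would collide without some extra handling. The sketch below shows how `make.names()` with `unique = TRUE` resolves this on a hypothetical vector of sizes.

```r
# Hypothetical gene set sizes, with 15 repeated three times
toy_lengths <- c(15, 25, 15, 15)

# Prefix each size with "pathway_"
toy_names <- paste0("pathway_", toy_lengths)

# Repeated names get a numeric suffix so every name is unique
unique_names <- make.names(toy_names, unique = TRUE)
unique_names
```

The size remains recoverable from each name, since the suffix is appended after the original `pathway_` label.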
-
+
# We will include the size of the gene set in the gene set name
# Start by taking the length of each pathway and prepending "pathway_" to that
# number
@@ -693,7 +702,7 @@ A note on gene set size
Run GSVA on our dataset with the same parameters as before, but now with random gene sets.
-
+
random_gsva_results <- gsva(rnaseq_mat,
random_gene_sets,
method = "gsva",
@@ -707,414 +716,343 @@ A note on gene set size
# Compute Gaussian-distributed scores
mx.diff = TRUE)
-
-945 genes with constant expression values throuhgout the samples.Since argument method!="ssgsea", genes with constant expression values are discarded.
-
-
+
+Warning in .filterFeatures(expr, method): 945 genes with constant expression
+values throuhgout the samples.
+
+
+Warning in .filterFeatures(expr, method): Since argument method!="ssgsea", genes
+with constant expression values are discarded.
+
+
Estimating GSVA scores for 579 gene sets.
Estimating ECDFs with Gaussian kernels
- |
- | | 0%
- |============================================================================= | 66%
- |
- |============================================================================= | 67%
- |
- |============================================================================== | 67%
- |
- |============================================================================== | 68%
- |
- |=============================================================================== | 68%
- |
- |================================================================================ | 69%
- |
- |================================================================================= | 69%
- |
- |================================================================================= | 70%
- |
- |================================================================================== | 70%
- |
- |================================================================================== | 71%
- |
- |=================================================================================== | 71%
- |
- |=================================================================================== | 72%
- |
- |==================================================================================== | 72%
- |
- |==================================================================================== | 73%
- |
- |===================================================================================== | 73%
- |
- |===================================================================================== | 74%
- |
- |====================================================================================== | 74%
- |
- |======================================================================================= | 75%
- |
- |======================================================================================== | 75%
- |
- |======================================================================================== | 76%
- |
- |========================================================================================= | 76%
- |
- |========================================================================================= | 77%
- |
- |========================================================================================== | 77%
- |
- |========================================================================================== | 78%
- |
- |=========================================================================================== | 78%
- |
- |=========================================================================================== | 79%
- |
- |============================================================================================ | 79%
- |
- |============================================================================================ | 80%
- |
- |============================================================================================= | 80%
- |
- |============================================================================================== | 81%
- |
- |=============================================================================================== | 82%
- |
- |================================================================================================ | 82%
- |
- |================================================================================================ | 83%
- |
- |================================================================================================= | 83%
- |
- |================================================================================================= | 84%
- |
- |================================================================================================== | 84%
- |
- |================================================================================================== | 85%
- |
- |=================================================================================================== | 85%
- |
- |=================================================================================================== | 86%
- |
- |==================================================================================================== | 86%
- |
- |==================================================================================================== | 87%
- |
- |===================================================================================================== | 87%
- |
- |====================================================================================================== | 88%
- |
- |======================================================================================================= | 88%
- |
- |======================================================================================================= | 89%
- |
- |======================================================================================================== | 89%
- |
- |======================================================================================================== | 90%
- |
- |========================================================================================================= | 90%
- |
- |========================================================================================================= | 91%
- |
- |========================================================================================================== | 91%
- |
- |========================================================================================================== | 92%
- |
- |=========================================================================================================== | 92%
- |
- |=========================================================================================================== | 93%
- |
- |============================================================================================================ | 93%
- |
- |============================================================================================================= | 94%
- |
- |============================================================================================================== | 94%
- |
- |============================================================================================================== | 95%
- |
- |=============================================================================================================== | 95%
- |
- |=============================================================================================================== | 96%
- |
- |================================================================================================================ | 96%
- |
- |================================================================================================================ | 97%
- |
- |================================================================================================================= | 97%
- |
- |================================================================================================================= | 98%
- |
- |================================================================================================================== | 98%
- |
- |================================================================================================================== | 99%
- |
- |=================================================================================================================== | 99%
- |
- |====================================================================================================================| 100%
+ [progress bar console output elided, 0% to 100%]
Now let’s make a plot to look at the distribution of scores from random gene sets. First we need to get this data in an appropriate format for ggplot2.
-
+
# The random results are a matrix
random_long_df <- random_gsva_results %>%
data.frame() %>%
@@ -1130,15 +1068,14 @@ A note on gene set size
dplyr::mutate(gene_set_size = stringr::word(gene_set, 2, sep = "_")) %>%
# We want to plot smallest no. genes -> largest no. genes
dplyr::mutate(gene_set_size = factor(gene_set_size,
- levels = c(15, 25, 50, 100, 250, 500)))
-
+ levels = c(15, 25, 50, 100, 250, 500)))
Let’s make a violin plot so we can look at the distribution of scores by gene set size.
-
+
# Violin plot comparing GSVA scores of different random gene set sizes
random_long_df %>%
ggplot2::ggplot(ggplot2::aes(x = gene_set_size,
@@ -1161,8 +1098,8 @@ A note on gene set size
y = "GSVA score") +
ggplot2::theme_bw()
-
-
+
+
@@ -1179,7 +1116,7 @@ How can you use these scores?
Write results to file
-
+
gsva_results %>%
as.data.frame() %>%
tibble::rownames_to_column("pathway") %>%
@@ -1193,51 +1130,75 @@ Write results to file
Session Info
-
+
sessionInfo()
-
+
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
-Running under: Ubuntu 18.04.3 LTS
+Running under: Ubuntu 20.04 LTS
Matrix products: default
-BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
-LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
+BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
- [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8
- [6] LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
-[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
-[1] stats graphics grDevices datasets utils methods base
+[1] stats graphics grDevices utils datasets methods base
other attached packages:
-[1] GSVA_1.38.2 magrittr_2.0.1
+[1] GSVA_1.38.2 magrittr_2.0.1 optparse_1.6.6
loaded via a namespace (and not attached):
- [1] MatrixGenerics_1.2.0 Biobase_2.50.0 httr_1.4.2 tidyr_1.1.2
- [5] bit64_4.0.5 jsonlite_1.7.2 assertthat_0.2.1 BiocManager_1.30.10
- [9] stats4_4.0.3 blob_1.2.1 renv_0.12.5-2 GenomeInfoDbData_1.2.4
-[13] pillar_1.4.7 RSQLite_2.2.3 lattice_0.20-41 glue_1.4.2
-[17] limma_3.46.0 digest_0.6.27 GenomicRanges_1.42.0 XVector_0.30.0
-[21] colorspace_2.0-0 Matrix_1.3-2 GSEABase_1.52.1 XML_3.99-0.5
-[25] pkgconfig_2.0.3 zlibbioc_1.36.0 purrr_0.3.4 xtable_1.8-4
-[29] mvtnorm_1.1-1 scales_1.1.1 BiocParallel_1.24.1 emmeans_1.5.3
-[33] tibble_3.0.5 annotate_1.68.0 farver_2.0.3 generics_0.1.0
-[37] IRanges_2.24.1 ggplot2_3.3.3 ellipsis_0.3.1 SummarizedExperiment_1.20.0
-[41] BiocGenerics_0.36.0 cli_2.2.0 crayon_1.3.4 memoise_1.1.0
-[45] estimability_1.3 fansi_0.4.2 nlme_3.1-151 graph_1.68.0
-[49] tools_4.0.3 hms_1.0.0 lifecycle_0.2.0 matrixStats_0.57.0
-[53] stringr_1.4.0 S4Vectors_0.28.1 fftw_1.0-6 munsell_0.5.0
-[57] DelayedArray_0.16.2 AnnotationDbi_1.52.0 compiler_4.0.3 GenomeInfoDb_1.26.2
-[61] rlang_0.4.10 grid_4.0.3 RCurl_1.98-1.2 rstudioapi_0.13
-[65] labeling_0.4.2 bitops_1.0-6 qusage_2.24.0 gtable_0.3.0
-[69] DBI_1.1.1 R6_2.5.0 knitr_1.30 dplyr_1.0.3
-[73] bit_4.0.4 readr_1.4.0 stringi_1.5.3 parallel_4.0.3
-[77] Rcpp_1.0.6 vctrs_0.3.6 tidyselect_1.1.0 xfun_0.20
-[81] coda_0.19-4
+ [1] MatrixGenerics_1.2.0 Biobase_2.50.0
+ [3] httr_1.4.2 tidyr_1.1.2
+ [5] bit64_4.0.5 jsonlite_1.7.2
+ [7] assertthat_0.2.1 stats4_4.0.3
+ [9] blob_1.2.1 GenomeInfoDbData_1.2.4
+[11] yaml_2.2.1 pillar_1.4.7
+[13] RSQLite_2.2.3 lattice_0.20-41
+[15] glue_1.4.2 limma_3.46.0
+[17] digest_0.6.27 GenomicRanges_1.42.0
+[19] XVector_0.30.0 colorspace_2.0-0
+[21] htmltools_0.5.1.1 Matrix_1.3-2
+[23] GSEABase_1.52.1 XML_3.99-0.5
+[25] pkgconfig_2.0.3 zlibbioc_1.36.0
+[27] purrr_0.3.4 xtable_1.8-4
+[29] mvtnorm_1.1-1 scales_1.1.1
+[31] getopt_1.20.3 BiocParallel_1.24.1
+[33] emmeans_1.5.3 tibble_3.0.5
+[35] annotate_1.68.0 farver_2.0.3
+[37] ggplot2_3.3.3 generics_0.1.0
+[39] IRanges_2.24.1 ellipsis_0.3.1
+[41] SummarizedExperiment_1.20.0 BiocGenerics_0.36.0
+[43] cli_2.2.0 crayon_1.3.4
+[45] memoise_1.1.0 estimability_1.3
+[47] evaluate_0.14 ps_1.5.0
+[49] fansi_0.4.2 nlme_3.1-151
+[51] graph_1.68.0 tools_4.0.3
+[53] hms_1.0.0 lifecycle_0.2.0
+[55] matrixStats_0.57.0 stringr_1.4.0
+[57] S4Vectors_0.28.1 fftw_1.0-6
+[59] munsell_0.5.0 DelayedArray_0.16.2
+[61] AnnotationDbi_1.52.0 compiler_4.0.3
+[63] GenomeInfoDb_1.26.2 rlang_0.4.10
+[65] grid_4.0.3 RCurl_1.98-1.2
+[67] rstudioapi_0.13 labeling_0.4.2
+[69] bitops_1.0-6 rmarkdown_2.6
+[71] qusage_2.24.0 gtable_0.3.0
+[73] DBI_1.1.1 R6_2.5.0
+[75] knitr_1.30 dplyr_1.0.3
+[77] bit_4.0.4 readr_1.4.0
+[79] stringi_1.5.3 parallel_4.0.3
+[81] Rcpp_1.0.6 vctrs_0.3.6
+[83] tidyselect_1.1.0 xfun_0.20
+[85] coda_0.19-4
diff --git a/scRNA-seq/02-normalizing_scRNA-seq.nb.html b/scRNA-seq/02-normalizing_scRNA-seq.nb.html
index 8557a52f..6f490d01 100644
--- a/scRNA-seq/02-normalizing_scRNA-seq.nb.html
+++ b/scRNA-seq/02-normalizing_scRNA-seq.nb.html
@@ -762,7 +762,7 @@ Identify marker genes