introduction_to_genomic_data.Rmd

---
title: "Introduction to genomic data"
output: 
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
knit: (function(input_file, encoding) {
  out_dir <- 'docs';
  rmarkdown::render(input_file,
 encoding=encoding,
 output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
author: "Todd W. Pierson"
date: "16 October 2019"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=200)
```

<div class="alert alert-danger">
  <strong>Note:</strong> This brief tutorial is written for UNR's NRES 721 as a basic introduction to genomic data using simple commands that students can run through their Terminal (Linux/Mac) or in `R`. Students: if you're running Windows (e.g., on a computer in the teaching laboratory at UNR), you can try to follow along by using the Linux commands on your Windows machine through [one of these options](https://itsfoss.com/run-linux-commands-in-windows/), just follow along for the bits in `R`, or just review what we're doing on the shared screen. 
  
<strong>Students: if you're using Mac/Linux, use the `svn` command further down in the document to download the data for this tutorial. If you're on a PC, you can download them from the `\data` directory [here](https://github.com/twpierson/nres721_introduction).</strong>
</div>

## Introduction
In this tutorial, we'll work through a basic introduction to handling and understanding raw genomic data. In general, biologists often use the term "genomic" to refer to datasets consisting of more than several loci. This could mean shotgun libraries used to assemble full genomes, reduced-representation libraries (e.g., RADseq), or many other kinds of data. In practice, this usually means that these data were generated on a "next generation sequencing" or "high-throughput sequencing" platform. Today, [Illumina](https://www.illumina.com/index-d.html) platforms generate the vast majority of these data, and that's what we'll focus on for these tutorials.

We'll use a few example data files that we need to download. First, navigate to a local directory from which you'd like to work.

```{bash, comment = NA, eval = FALSE}
cd [your working directory]
```

Then, download the data. If you're within your working directory, the following command will download the `\data` directory and place it within it.

```{bash, eval = FALSE}
svn checkout https://github.com/twpierson/nres721_introduction/trunk/data
```

## Interpreting a FASTQ file

Straight off of the sequencing platform, data are most often in a BCL format. However, most (or many) sequencing facilities will have already "demultiplexed" (see later in this tutorial) these data into individual FASTQ files. If these sound similar to the FASTA files you're more familiar with, it's because they are! It's worth taking a minute to review the structure of a FASTA format. 

We can peek at the first two lines of an example FASTA file of the `/data` directory. If you are following along, the commands we'll use in this tutorial should be run in your Terminal window, rather than in `R`.

```{bash, comment=NA}
cat data/example.fasta
```

In the FASTA file, each sequence has two lines dedicated to it—one for the sample or sequence name and the other for the actual sequence. The FASTQ format is built off of this same premise, but it has a bit more information included.

We can peek at an example FASTQ file in the same directory.
```{bash, comment=NA}
cat data/example_R1.fastq
```

This file clearly contains more information! FASTQ files have four lines per sequencing read; they represent:

1) sample information in a header; this string contains information about the sequencing platform, the physical position on the "tile" where this sequencing reaction took place, and (often) index information
2) the actual sequence read
3) "+"; this is just a placeholder, although sometimes other information is included on this line
4) a quality score for each nucleotide of the read; this is the "Q" in "FASTQ" and is important for downstream applications.

The [Wikipedia page](https://en.wikipedia.org/wiki/FASTQ_format) for FASTQ files has much more useful information. In our simple example here, we can see the dual-index sequences (`AGCAAGCA` and `CTTGTCGA`) at the end of the first line (more about that kind of information in the next section). Each of these four lines is present in a FASTQ file for *each molecule sequenced*. That means that for some super high-throughput platforms (e.g., the [Illumina NovaSeq](https://www.illumina.com/systems/sequencing-platforms/novaseq.html), which can generate up to 2.2 billion reads per lane!), these FASTQ files can be huge.

You may notice that the FASTQ file we viewed has a "R1" designator in the file name. This is because on many sequencing platforms, each molecule is sequenced from both directions—generating a "read 1" (i.e., "R1") and "read 2" ("R2") read. Thus, we have a corresponding "R2" file. Let's look at it:

```{bash, comment=NA}
cat data/example_R2.fastq
```

Note that the information from the first line is identical, but that the sequences and quality scores are different. 

<div class="alert alert-info">
  <strong>Discussion:</strong> How long is the sequence in the R1 file? How about the R2 file? What are the implications of this if your library insert (e.g., amplicon) is 75 bp, 150 bp, or 300 bp?
</div>

## Multiplexing and demultiplexing

The best way to leverage the power of next-generation sequencing technologies to generate data for many individuals (e.g., for a typical population genetic or phylogenetic project) is to "multiplex" many samples onto a single sequencing run. For example, some Illumina NextSeq runs can generate 400 million paired-end reads per sequencing lane. For many purposes (e.g., amplicon data from a microbiome project or RADseq-style data for a population genetic project), this amount of data would be incredible overkill (and cost-prohibitive!), so it's useful to spread these reads across many samples.

To do this, we probably want to mark molecules during the library preparation stage so that we can later determine which sample they came from. This is often called "indexing" or "tagging" (or confusing, "barcoding"), and the act of pooling these samples for sequencing is "multiplexing". Thus, when we receive our raw data, one of our first tasks is "demultiplexing"---or separating our reads by sample of origin. These indexes may be read in separate sequencing reads and stored in the FASTQ headers (e.g., as we saw earlier), or they could be in the actual sequencing read (sometimes called "in-line" indexes). There are many ways to skin a cat, and many application-specific software packages have their own way of demultiplexing reads.

<div class="alert alert-info">
  <strong>Discussion:</strong> What might the advantages be of including indexes in multiple positions on the library molecule?
</div>

## Quality filtering

As discussed earlier, FASTQ files contain quality scores associated with each nucleotide of each read. Two general rules: 1) quality scores are generally lowest at the beginning and (especially) the end of reads; and 2) R1 data typically have higher quality scores than R2 data. 

Why do these quality scores matter? Simply put, they tell us information about the probability that a given nucleotide is called in error. To reduce the probability of including these erroneous data in our analyses, we can, for example, trim sections of reads that have low quality scores or throw out reads altogether that have low overall quality scores. There are many, many ways to visualize quality score data and filter reads. 

We'll use a quick visualization in `R`. If you haven't already, install the `seqTools` package (installation instructions [here](https://bioconductor.org/packages/release/bioc/html/seqTools.html)). Then, open `R` and load the package.

```{r,  results = "hide", warning = FALSE, message = FALSE}
library(seqTools)
```

We also want to navigate to our working directory.
```{r, eval = FALSE}
setwd("[your working directory]")
```

Next, we'll separately load our R1 and R2 data. This time, we'll use real sequencing data from an environmental DNA project. These are amplicons for a "barcoding" locus used to characterize amphibian communities. 

```{r, results = "hide", warning = FALSE, message = FALSE}
fq1 <- fastqq(c("data/eDNA_08_R1.fastq"))
fq2 <- fastqq(c("data/eDNA_08_R2.fastq"))
```

Then, we can plot our quality scores.
```{r}
par(mfrow = c(1,2))
plotMergedPhredQuant(fq1, main = "Phred quantiles for R1")
plotMergedPhredQuant(fq2, main = "Phred quantiles for R2")
```

<div class="alert alert-info">
  <strong>Discussion:</strong> What do you notice in these plots? How might you filter these data, and why would that be important?
</div>

Now that we've covered some of the basic concepts of interpreting and handling genomic data, we'll move on to [assembling an example dataset](https://twpierson.github.io/nres721_assembly/).