Commit

mv docs from wiki
rdnajac committed Sep 5, 2024
1 parent 0e6e969 commit 082dac9
Showing 7 changed files with 508 additions and 1 deletion.
58 changes: 58 additions & 0 deletions Packages.md
@@ -0,0 +1,58 @@
# Packages

Bioconda recommends these default channels:

```sh
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
```

These commands write a `.condarc` file in the user's home directory. Each `--add` prepends to the channel list, so the channel added last has the highest priority:

```yaml
channels:
- conda-forge
- bioconda
- defaults
channel_priority: strict
```

However, the Mamba documentation recommends against using any of the
[Anaconda default channels](https://docs.anaconda.com/working-with-conda/reference/default-repositories/):
deactivate them rather than merely deprioritizing them.

Instead of fighting against defaults, write spec files:

```yaml
name: RNAseq
channels:
- bioconda
- conda-forge
dependencies:
- fastqc
- hisat2
- bwa
- bowtie2
- samtools
- htslib
- bcftools
- stringtie
- bowtie
- subread
```

Then create the environment:

```sh
# it doesn't matter if you use .yml or .yaml, but be consistent!
micromamba env create -f RNAseq.yaml
```

When you correctly activate the environment, the usual prompt
will be prefixed with the name of the environment in parentheses:

```console
(RNAseq) ubuntu@ip-172-31-70-15:~/micromamba/envs$
```
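
As a quick sanity check after activation, a sketch like the one below can confirm that the executables from the spec file are on `PATH`. The helper name `missing_tools` is made up for illustration and is not part of micromamba; it relies only on the POSIX `command -v` builtin. Note that the `subread` package provides the `featureCounts` executable, and `htslib` is a library, so it is not checked here.

```sh
# Print every tool name that cannot be found on PATH.
# Returns 0 if all tools are found and 1 otherwise.
# `missing_tools` is a hypothetical helper, not part of micromamba.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Run this after `micromamba activate RNAseq`; it prints nothing when all is well.
missing_tools fastqc hisat2 bwa bowtie2 samtools bcftools stringtie featureCounts ||
  echo "some tools are missing: is the environment active?" >&2
```
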
2 changes: 2 additions & 0 deletions README.md
@@ -1,3 +1,5 @@
# how-to

[![code style: prettier](https://img.shields.io/badge/code_style-prettier-ff69b4.svg?style=flat-square)](https://github.com/prettier/prettier)

How to do stuff
104 changes: 104 additions & 0 deletions RNAseq.md
@@ -0,0 +1,104 @@
# RNA Sequencing Analysis

RNA sequencing (RNA-seq) is a technique for analyzing the transcriptome:
the complete set of RNA transcripts in a cell. RNA molecules are
reverse-transcribed to cDNA and sequenced to measure gene expression levels
and identify novel transcripts.

This document describes two pipelines for RNA-seq analysis:

- Using `featureCounts` and then analyzing with DESeq2 in R
- Using the Tuxedo Suite (HISAT2, StringTie, Ballgown)

## `featureCounts`

[`featureCounts`](https://subread.sourceforge.net/featureCounts.html)
is a program for counting reads mapped to genomic features, such as genes, exons,
and promoters. It is part of the [Subread](https://subread.sourceforge.net)
package and can be used for RNA-seq as well as DNA-seq analysis.

> featureCounts takes as input SAM/BAM files and an annotation file including
> chromosomal coordinates of features. It outputs numbers of reads assigned to features
> (or meta-features). It also outputs stat info for the overall summarization results,
> including number of successfully assigned reads and number of reads that failed to be
> assigned due to various reasons (these reasons are included in the stat info).

Use `featureCounts` to count the number of reads that map to each gene in a GTF
file and summarize the results for downstream analysis (e.g., differential expression).

### Input

Aligned reads in BAM format and a GTF file containing genomic features.

Example usage:

```sh
featureCounts -T "$NUM_THREADS" --verbose -t exon -g gene_id --countReadPairs \
-a "$REF_GTF" -p -P -C -B -o "${OUT_DIR}/${OUT_PREFIX}.tsv" ./*.bam
```

### Output

Feature counts are written to a tab-delimited file (`.tsv` or `.txt`)
with columns for each sample. You can import this file into R or other
statistical software for further analysis.
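
Before importing, it can help to reduce the output to a plain count matrix. The sketch below assumes the default `featureCounts` layout: a leading `#` comment line, then a header row whose first six columns are `Geneid`, `Chr`, `Start`, `End`, `Strand`, and `Length`, followed by one column per BAM file. The helper name `counts_matrix` and the output file name are hypothetical.

```sh
# Reduce featureCounts output to a plain count matrix:
# drop the leading "#" comment line and the annotation columns
# (Chr, Start, End, Strand, Length), keeping Geneid plus one column per BAM.
# `counts_matrix` is a hypothetical helper name.
counts_matrix() {
  grep -v '^#' "$1" | cut -f 1,7-
}

# Example (hypothetical output name):
# counts_matrix "${OUT_DIR}/${OUT_PREFIX}.tsv" > counts_matrix.tsv
```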

`featureCounts` also writes a summary file (the output name with a `.summary`
suffix) that reports how many reads were assigned to features and how many
were left unassigned, broken down by reason.
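
A quick way to read that summary is to compute the assigned fraction per sample. This is a minimal sketch that assumes the tab-delimited layout of the `.summary` file that `featureCounts` writes next to its main output: a header row (`Status` plus one column per BAM file) followed by one row per category. The helper name `assigned_rate` is hypothetical.

```sh
# Print the percentage of reads (or fragments) assigned to a feature, per sample.
# Assumes a header row ("Status" plus one column per BAM file) followed by
# one row per category ("Assigned", "Unassigned_...").
# `assigned_rate` is a hypothetical helper name.
assigned_rate() {
  awk -F '\t' '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
    {
      for (i = 2; i <= NF; i++) total[i] += $i
      if ($1 == "Assigned") for (i = 2; i <= NF; i++) assigned[i] = $i
    }
    END {
      for (i = 2; i <= n; i++)
        printf "%s\t%.1f%%\n", name[i], (total[i] ? 100 * assigned[i] / total[i] : 0)
    }
  ' "$1"
}

# Example: assigned_rate "${OUT_DIR}/${OUT_PREFIX}.tsv.summary"
```

A low assigned percentage often points to a mismatch between the annotation and the reference used for alignment, or to the wrong strandedness or pairing options.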

## Downstream Analysis using R

Import the results from `featureCounts` into R for further analysis, such as
differential expression analysis using packages like DESeq2 or edgeR.

### DESeq2

[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) is an
R package for differential gene expression analysis based on the negative
binomial distribution.

> DESeq2 provides methods to test for differential expression by use of negative
> binomial generalized linear models. The models use the raw counts as input and
> perform regularized log transformation and variance stabilizing
> transformation. The package also provides functions to visualize the data and
> results.
<!-- -->

## Tuxedo Suite

The Tuxedo Suite is a collection of tools for transcript-level expression analysis of RNA-seq experiments.

1. Align reads to the reference genome and sort to BAM format (HISAT2)
2. Assemble transcripts (StringTie)
3. Prepare for differential expression analysis (Ballgown setup)
4. Perform differential expression analysis
5. Visualize results

| Tool | Description | Manual | Source |
| --------- | ---------------------------------------------- | --------------------------------------------------------------------------- | -------------------------------------------------- |
| HISAT2 | Align reads to reference genome | [manual](https://daehwankimlab.github.io/hisat2/manual/) | [source](https://github.com/DaehwanKimLab/hisat2) |
| StringTie | Assemble RNA-Seq alignments into transcripts | [manual](https://ccb.jhu.edu/software/stringtie/index.shtml) | [source](https://github.com/gpertea/stringtie) |
| Ballgown | Isoform-level differential expression analysis | [manual](https://bioconductor.org/packages/release/bioc/html/ballgown.html) | [source](https://github.com/alyssafrazee/ballgown) |
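
The first three steps can be sketched for a single paired-end sample as follows. The flags follow the Pertea et al. (2016) protocol listed in the references; every path, the sample name, and the thread count are hypothetical placeholders, and the `run` wrapper is an illustrative helper that only prints the commands unless `DRY_RUN=0` is set.

```sh
# Sketch of steps 1-3 for one paired-end sample.
# All paths, the sample name, and the thread count are hypothetical.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
THREADS=8
INDEX=index/genome        # HISAT2 index prefix
REF_GTF=genes/genome.gtf  # reference annotation
SAMPLE=sample1

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. Align with HISAT2 (--dta tailors alignments for transcript assemblers),
#    then sort the alignments into a BAM file.
run hisat2 -p "$THREADS" --dta -x "$INDEX" \
  -1 "${SAMPLE}_1.fastq.gz" -2 "${SAMPLE}_2.fastq.gz" -S "${SAMPLE}.sam"
run samtools sort -@ "$THREADS" -o "${SAMPLE}.bam" "${SAMPLE}.sam"

# 2. Assemble transcripts, guided by the reference annotation.
run stringtie -p "$THREADS" -G "$REF_GTF" -l "$SAMPLE" -o "${SAMPLE}.gtf" "${SAMPLE}.bam"

# 3. After merging all per-sample assemblies with `stringtie --merge`
#    (into merged.gtf), re-estimate abundances and write Ballgown tables (-e -B).
run stringtie -e -B -p "$THREADS" -G merged.gtf \
  -o "ballgown/${SAMPLE}/${SAMPLE}.gtf" "${SAMPLE}.bam"
```

Printing the commands first makes it easy to review paths before committing hours of compute; set `DRY_RUN=0` once they look right.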

> [!TIP]
> StringTie comes packaged with `gffcompare` for comparing and evaluating the
> accuracy of RNA-seq transcript assemblers. Read the
> [manual](https://ccb.jhu.edu/software/stringtie/gffcompare.shtml) for details.

Using Ballgown:

1. Filter to remove low-abundance genes
2. Identify differentially expressed transcripts and genes
3. Add gene names
4. Sort by p-value
5. Write results to file

## References

- Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923-30. [PMID: 24227677](https://pubmed.ncbi.nlm.nih.gov/24227677/)
- Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc. 2016;11(9):1650-1667. [PMID: 27560171](https://pubmed.ncbi.nlm.nih.gov/27560171/)
- Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357-360. [PMID: 25751142](https://pubmed.ncbi.nlm.nih.gov/25751142/)
- Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290-295. [PMID: 25690850](https://pubmed.ncbi.nlm.nih.gov/25690850/)
- Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT. Ballgown bridges the gap between transcriptome assembly and expression analysis. Nat Biotechnol. 2015;33(3):243-246. [PMID: 25748911](https://pubmed.ncbi.nlm.nih.gov/25748911/)
102 changes: 102 additions & 0 deletions docs/Illumina.md
@@ -0,0 +1,102 @@
# Illumina

If you used a core facility, Azenta, or another commercial service for
sequencing, they will send a link to download the demultiplexed FASTQ files
directly, usually with corresponding md5 checksums.
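
Checking the downloads against those checksums might look like the sketch below. It assumes GNU coreutils `md5sum` (as on Ubuntu) and a checksum list in its usual `<hash>  <filename>` format; the helper name `verify_md5` and the default list name `checksums.md5` are hypothetical, so substitute whatever file the facility provides.

```sh
# Verify downloaded files against a checksum list.
# Assumes GNU coreutils `md5sum` and a list in its usual format
# ("<hash>  <filename>" per line). The default list name `checksums.md5`
# is a hypothetical placeholder.
verify_md5() {
  dir="$1"
  list="${2:-checksums.md5}"
  # File names in the list are resolved relative to the download directory.
  (cd "$dir" && md5sum --check --quiet "$list")
}

# Example: verify_md5 ./fastq && echo "all files intact"
```

With `--quiet`, only files that fail verification are reported, and the exit status is non-zero if any file is corrupted or missing.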

For instructions on how to download data from Azenta's sFTP server,
click [here](https://3478602.fs1.hubspotusercontent-na1.net/hubfs/3478602/13012-M%26G%200222%20sFTP%20Guide-3.pdf).

> [!NOTE]
> sFTP (Secure File Transfer Protocol) provides an encrypted channel for data transfer.\
> The md5 checksum is a unique character sequence that is computed from the
> contents of a file and changes if the file is modified.
> Read the original [RFC 1321](https://www.ietf.org/rfc/rfc1321.txt).

Otherwise, consult the [documentation](https://developer.basespace.illumina.com/docs)
for the appropriate Illumina sequencer:

- [MiSeq](https://support.illumina.com/sequencing/sequencing_instruments/miseq/documentation.html)
- [NextSeq500](https://support.illumina.com/sequencing/sequencing_instruments/nextseq-550/documentation.html)

## BaseSpace and the `bs` command-line interface

The browser-based interface is useful for small-scale projects, but the
command-line interface is more efficient for large-scale projects.
Check out [examples](https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-examples).

Install on macOS using Homebrew:

```sh
brew tap basespace/basespace && brew install bs-cli
```

Otherwise, download the latest version from Illumina:

```sh
install_directory="${HOME}/.local/bin"
basespace_executable="${install_directory}/bs"
# make sure the target directory exists before downloading into it
mkdir -p "$install_directory"
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O "$basespace_executable"
chmod +x "$basespace_executable"
# add to PATH (single quotes keep $PATH unexpanded until the shell starts)
# echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
# or add an alias
# echo "alias bs=$basespace_executable" >> ~/.bashrc
```

After installation, authenticate with your BaseSpace credentials:

```sh
bs authenticate
```

Follow the link and log in with your Illumina credentials,
then run `bs whoami` to verify that you are authenticated:

```console
+----------------+----------------------------------------------------+
| Name           | Ryan Najac                                         |
| Id             | ########                                           |
| Email          | [email protected]                                  |
| DateCreated    | 2021-07-13 15:29:51 +0000 UTC                      |
| DateLastActive | 2024-06-03 18:59:47 +0000 UTC                      |
| Host           | https://api.basespace.illumina.com                 |
| Scopes         | READ GLOBAL, CREATE GLOBAL, BROWSE GLOBAL,         |
|                | CREATE PROJECTS, CREATE RUNS, START APPLICATIONS,  |
|                | MOVETOTRASH GLOBAL, WRITE GLOBAL                   |
+----------------+----------------------------------------------------+
```

## Demultiplexing Illumina sequencing data

Illumina instruments will demultiplex the data for you if you provide a valid
sample sheet prior to sequencing. For more information on how to create a sample
sheet, consult the [Illumina Support](https://support.illumina.com/) page.

Skip ahead to the [Quality Control](#quality-control) section if the
FASTQ files are already demultiplexed and ready for analysis, or keep reading
for instructions on how to demultiplex the data yourself.

Illumina hosts `.rpm` files for CentOS/RedHat Linux distros and the
source code (which must be compiled) for other distros.

Download bcl2fastq2 Conversion Software v2.20 Installer (Linux rpm) from
[Illumina](https://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html).

The AWS EC2 instance used for this project is based on Ubuntu, so we will
have to convert the `.rpm` file to a `.deb` file using the `alien` package,
as per this [post](https://www.biostars.org/p/266897/).

```sh
sudo alien -i bcl2fastq2-v2.20.0.422-Linux-x86_64.rpm
```

> [!WARNING]
> As of 2024-08-05, `bcl2fastq` is no longer supported; use `bclconvert` instead.\
> You can install `bclconvert` using the same methods as described above.

Read the docs:

- [bcl2fastq](https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq_letterbooklet_15038058brpmi.pdf)
- [bclconvert](https://support-docs.illumina.com/SW/BCL_Convert_v4.0/Content/SW/BCLConvert/BCLConvert.htm)
81 changes: 81 additions & 0 deletions docs/Jupyter.md
@@ -0,0 +1,81 @@
# Jupyter

Jupyter notebooks provide an interactive environment for data analysis, visualization, and code execution. They are particularly useful for bioinformatics workflows and exploratory data analysis.

## Running Jupyter on an EC2 Instance

When running Jupyter on an EC2 instance, you need to configure it to allow remote access. Follow these steps:

1. Generate a config file:

```sh
jupyter notebook --generate-config
```

2. Edit the config file:

```sh
nano ~/.jupyter/jupyter_notebook_config.py
```

3. Add or modify these lines:

```python
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
```

4. Start Jupyter:

```sh
jupyter notebook
```

## Accessing Jupyter from Your Local Machine

There are two main methods to access Jupyter running on an EC2 instance:

### Method 1: Direct Access (Less Secure)

1. Ensure port 8888 is open in your EC2 security group.
2. Access Jupyter using your EC2 instance's public IP:

```
http://<your-ec2-public-ip>:8888
```

3. Use the token provided in the Jupyter server output for authentication.

### Method 2: SSH Tunnel (More Secure)

1. Create an SSH tunnel:

```sh
ssh -i <your-key.pem> -L 8888:localhost:8888 ubuntu@<your-ec2-public-ip>
```

2. Access Jupyter locally:

```
http://localhost:8888
```

3. Use the token provided in the Jupyter server output for authentication.

## Troubleshooting

If you're having trouble connecting to Jupyter, check the following:

1. EC2 Security Group: Ensure port 8888 is open (for direct access method).
2. Jupyter Configuration: Verify the config file settings.
3. Firewall: Check if the EC2 instance's firewall is blocking connections.
4. Jupyter Server: Confirm Jupyter is running and note any error messages.

## Best Practices

1. Use virtual environments to manage dependencies.
2. Regularly save your work and consider version control for notebooks.
3. For long-running tasks, consider using tools like `tmux` or `screen` to keep Jupyter running even if your SSH connection drops.

Remember to always prioritize security when working with remote servers and sensitive data.