Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 7 changed files with 508 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# Packages | ||
|
||
Bioconda recommends these default channels: | ||
|
||
```sh | ||
conda config --add channels defaults | ||
conda config --add channels bioconda | ||
conda config --add channels conda-forge | ||
conda config --set channel_priority strict | ||
``` | ||
|
||
These commands create or update the `.condarc` file in the user's home directory. Each `--add` prepends to the channel list, so the channel added last (`conda-forge`) ends up with the highest priority:

```yaml
channels:
  - conda-forge
  - bioconda
  - defaults
channel_priority: strict
```
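
Note that `conda config --add` prepends to the channel list, so the highest-priority channel is the one added last. A minimal Python sketch of that ordering, assuming a hypothetical `add_channel` helper that only imitates the list handling and never calls conda:

```python
def add_channel(channels, name):
    """Imitate `conda config --add channels NAME`: put NAME at the front.

    If NAME is already listed, it is moved to the front rather than duplicated.
    """
    return [name] + [c for c in channels if c != name]


channels = []
# Same order as the three `conda config --add channels ...` commands above.
for name in ["defaults", "bioconda", "conda-forge"]:
    channels = add_channel(channels, name)

print(channels)  # ['conda-forge', 'bioconda', 'defaults']
```

With `channel_priority: strict`, packages are taken from the first channel in this list that provides them.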
|
||
However, the Mamba documentation recommends against using any of the
[Anaconda default channels](https://docs.anaconda.com/working-with-conda/reference/default-repositories/),
and suggests deactivating them rather than deprioritizing them.
|
||
Instead of fighting against defaults, write spec files: | ||
|
||
```yaml | ||
name: RNAseq | ||
channels: | ||
- bioconda | ||
- conda-forge | ||
dependencies: | ||
- fastqc | ||
- hisat2 | ||
- bwa | ||
- bowtie2 | ||
- samtools | ||
- htslib | ||
- bcftools | ||
- stringtie | ||
- bowtie | ||
- subread | ||
``` | ||
Then create the environment: | ||
```sh | ||
# it doesn't matter if you use .yml or .yaml, but be consistent! | ||
micromamba env create -f RNAseq.yaml | ||
``` | ||
|
||
After you activate the environment (`micromamba activate RNAseq`), the usual prompt
is prefixed with the name of the environment in parentheses:
|
||
```console | ||
(RNAseq) ubuntu@ip-172-31-70-15:~/micromamba/envs$ | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,5 @@ | ||
# how-to | ||
|
||
[![code style: prettier](https://img.shields.io/badge/code_style-prettier-ff69b4.svg?style=flat-square)](https://github.com/prettier/prettier) | ||
|
||
How to do stuff |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,104 @@ | ||
# RNA Sequencing Analysis | ||
|
||
RNA sequencing (RNA-seq) is a technique used to analyze the transcriptome,
the complete set of RNA transcripts in a cell. RNA molecules are reverse
transcribed to cDNA and then sequenced, which makes it possible to measure
gene expression levels and identify novel transcripts.
|
||
This document describes two pipelines for RNA-seq analysis: | ||
|
||
- Using `featureCounts` and then analyzing with DESeq2 in R | ||
- Using the Tuxedo Suite (HISAT2, StringTie, Ballgown) | ||
|
||
## `featureCounts` | ||
|
||
[`featureCounts`](https://subread.sourceforge.net/featureCounts.html) | ||
is a program for counting reads mapped to genomic features, such as genes, exons, | ||
and promoters. It is part of the [Subread](https://subread.sourceforge.net) | ||
package and can be used for RNA-seq as well as DNA-seq analysis. | ||
|
||
> featureCounts takes as input SAM/BAM files and an annotation file including | ||
> chromosomal coordinates of features. It outputs numbers of reads assigned to features | ||
> (or meta-features). It also outputs stat info for the overall summarization results,
> including number of successfully assigned reads and number of reads that failed to be
> assigned due to various reasons (these reasons are included in the stat info).

Use `featureCounts` to count the number of reads that map to each gene in a GTF
file and summarize the results for downstream analysis (e.g., differential expression).
|
||
### Input | ||
|
||
Aligned reads in BAM format and a GTF file containing genomic features. | ||
|
||
Example usage: | ||
|
||
```sh | ||
featureCounts -T "$NUM_THREADS" --verbose -t exon -g gene_id --countReadPairs \ | ||
-a "$REF_GTF" -p -P -C -B -o "${OUT_DIR}/${OUT_PREFIX}.tsv" ./*.bam | ||
``` | ||
|
||
### Output | ||
|
||
Feature counts are written to a tab-delimited file (`.tsv` or `.txt`) | ||
with columns for each sample. You can import this file into R or other | ||
statistical software for further analysis. | ||
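
The count table can also be read in Python. The sketch below assumes the default `featureCounts` layout: a leading `#` comment line, six annotation columns (`Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`), then one integer column per BAM file. The function name `read_feature_counts` is hypothetical.

```python
import csv


def read_feature_counts(path):
    """Return (sample_names, {gene_id: [count, ...]}) from a featureCounts table."""
    annotation_columns = 6  # Geneid, Chr, Start, End, Strand, Length
    with open(path, newline="") as handle:
        # Skip the leading "# Program:featureCounts ..." comment line(s).
        data_lines = (line for line in handle if not line.startswith("#"))
        rows = csv.reader(data_lines, delimiter="\t")
        header = next(rows)
        samples = header[annotation_columns:]
        counts = {
            row[0]: [int(value) for value in row[annotation_columns:]]
            for row in rows
            if row
        }
    return samples, counts
```

Calling `read_feature_counts("counts.tsv")` on such a file returns the sample column names and a per-gene list of counts in the same order. If fractional counting is enabled, the values are not integers and `int` would need to be replaced with `float`.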
|
||
`featureCounts` also writes a summary file (the output file name with a
`.summary` suffix) that reports, for each BAM file, the number of reads assigned
to features and the number left unassigned, broken down by reason.
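
A low assignment rate is often the first sign of a mismatched annotation or a wrong strandedness setting, so it can be worth computing. The sketch below assumes the summary is the tab-delimited file that `featureCounts` writes next to the count table with a `.summary` suffix: a `Status` header row, then one row per category such as `Assigned`. The function name `assignment_rates` is hypothetical.

```python
import csv


def assignment_rates(summary_path):
    """Return {sample: fraction of reads (or fragments) assigned to a feature}."""
    with open(summary_path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    samples = rows[0][1:]  # first row: "Status", then one column per BAM file
    assigned = [0] * len(samples)
    total = [0] * len(samples)
    for status, *values in rows[1:]:
        for i, value in enumerate(values):
            total[i] += int(value)
            if status == "Assigned":
                assigned[i] = int(value)
    return {
        sample: (assigned[i] / total[i] if total[i] else 0.0)
        for i, sample in enumerate(samples)
    }
```

The denominator is the sum of all categories in the summary, so unmapped reads count against the rate if they are present in the BAM file.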
|
||
## Downstream Analysis using R | ||
|
||
Export the results from `featureCounts` to R for further analysis, such as | ||
differential expression analysis using packages like DESeq2 or edgeR. | ||
|
||
### DESeq2 | ||
|
||
[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) is an | ||
R package for differential gene expression analysis based on the negative | ||
binomial distribution. | ||
|
||
> DESeq2 provides methods to test for differential expression by use of negative | ||
> binomial generalized linear models. The models use the raw counts as input and | ||
> perform regularized log transformation and variance stabilizing | ||
> transformation. The package also provides functions to visualize the data and | ||
> results. | ||
<!-- --> | ||
|
||
## Tuxedo Suite | ||
|
||
The Tuxedo Suite is a collection of tools for transcript-level expression analysis of RNA-seq experiments. | ||
|
||
1. Align reads to the reference genome and sort to BAM format (HISAT2) | ||
2. Assemble transcripts (StringTie) | ||
3. Prepare for differential expression analysis (Ballgown setup) | ||
4. Perform differential expression analysis | ||
5. Visualize results | ||
|
||
| Tool | Description | Manual | Source | | ||
| --------- | ---------------------------------------------- | --------------------------------------------------------------------------- | -------------------------------------------------- | | ||
| HISAT2 | Align reads to reference genome | [manual](https://daehwankimlab.github.io/hisat2/manual/) | [source](https://github.com/DaehwanKimLab/hisat2) | | ||
| StringTie | Assemble RNA-Seq alignments into transcripts | [manual](https://ccb.jhu.edu/software/stringtie/index.shtml) | [source](https://github.com/gpertea/stringtie) | | ||
| Ballgown | Isoform-level differential expression analysis | [manual](https://bioconductor.org/packages/release/bioc/html/ballgown.html) | [source](https://github.com/alyssafrazee/ballgown) | | ||
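
The first three steps can be sketched as command lines built in Python. This is a sketch under stated assumptions, not the protocol's exact invocation: the function name `tuxedo_commands` and every path are placeholders, paired-end reads are assumed, and the `stringtie --merge` step that produces the merged annotation is omitted.

```python
def tuxedo_commands(sample, index, gtf, merged_gtf, reads1, reads2, threads=4):
    """Return argv lists for alignment, sorting, assembly, and Ballgown setup.

    Nothing is executed here; every path argument is a placeholder.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.bam"
    n = str(threads)
    return [
        # 1. Align paired-end reads; --dta tailors output for transcript assemblers.
        ["hisat2", "-p", n, "--dta", "-x", index, "-1", reads1, "-2", reads2, "-S", sam],
        # 1. (cont.) Coordinate-sort the alignments into BAM.
        ["samtools", "sort", "-@", n, "-o", bam, sam],
        # 2. Assemble transcripts, guided by the reference annotation.
        ["stringtie", bam, "-p", n, "-G", gtf, "-l", sample, "-o", f"{sample}.gtf"],
        # 3. Re-estimate abundances against a merged annotation;
        #    -e limits output to its transcripts, -B writes Ballgown tables.
        ["stringtie", bam, "-p", n, "-e", "-B", "-G", merged_gtf,
         "-o", f"ballgown/{sample}/{sample}.gtf"],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order. Building argument lists rather than shell strings avoids quoting problems with unusual file names.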
|
||
> [!TIP] | ||
> StringTie comes packaged with `gffcompare` for comparing and evaluating the | ||
> accuracy of RNA-seq transcript assemblers. Read the | ||
> [manual](https://ccb.jhu.edu/software/stringtie/gffcompare.shtml) for details. | ||

Using Ballgown:
|
||
1. Filter to remove low-abundance genes | ||
2. Identify differentially expressed transcripts and genes | ||
3. Add gene names | ||
4. Sort by p-value | ||
5. Write results to file | ||
|
||
## References | ||
|
||
- Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923-30. [PMID: 24227677](https://pubmed.ncbi.nlm.nih.gov/24227677/) | ||
- Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc. 2016;11(9):1650-1667. [PMID: 27560171](https://pubmed.ncbi.nlm.nih.gov/27560171/)
- Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357-360. [PMID: 25751142](https://pubmed.ncbi.nlm.nih.gov/25751142/) | ||
- Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290-295. [PMID: 25690850](https://pubmed.ncbi.nlm.nih.gov/25690850/) | ||
- Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT. Ballgown bridges the gap between transcriptome assembly and expression analysis. Nat Biotechnol. 2015;33(3):243-246. [PMID: 25748911](https://pubmed.ncbi.nlm.nih.gov/25748911/) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,102 @@ | ||
# Illumina | ||
|
||
If you used a core facility, Azenta, or another commercial service for
sequencing, they will send a link to download the demultiplexed
FASTQ files directly, usually with corresponding md5 checksums.
|
||
For instructions on how to download data from Azenta's sFTP server, | ||
click [here](https://3478602.fs1.hubspotusercontent-na1.net/hubfs/3478602/13012-M%26G%200222%20sFTP%20Guide-3.pdf). | ||
|
||
> [!NOTE] | ||
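> sFTP (SSH File Transfer Protocol, also called Secure File Transfer Protocol) transfers files over an encrypted SSH connection.\
> An md5 checksum is a 128-bit digest, written as 32 hexadecimal characters, that is
> computed from the contents of a file and changes if the file is modified. Comparing it with
> the value supplied by the sequencing provider detects corrupted or incomplete downloads.
> Read the original [RFC 1321](https://www.ietf.org/rfc/rfc1321.txt).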
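
Checking a download against its md5 checksum can be sketched in Python with the standard library. The function names `md5sum` and `verify` are hypothetical, and the expected value is assumed to be the hexadecimal string supplied by the sequencing provider.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's digest matches the supplied checksum."""
    return md5sum(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for multi-gigabyte FASTQ files. A `False` result means the file should be downloaded again.
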

Otherwise, consult the [documentation](https://developer.basespace.illumina.com/docs)
for the appropriate Illumina sequencer:
|
||
- [MiSeq](https://support.illumina.com/sequencing/sequencing_instruments/miseq/documentation.html) | ||
- [NextSeq500](https://support.illumina.com/sequencing/sequencing_instruments/nextseq-550/documentation.html) | ||
|
||
## BaseSpace and the `bs` command-line interface | ||
|
||
The browser-based interface is useful for small-scale projects, but the | ||
command-line interface is more efficient for large-scale projects. | ||
Check out [examples](https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-examples). | ||
|
||
Install on macOS using Homebrew: | ||
|
||
```sh | ||
brew tap basespace/basespace && brew install bs-cli | ||
``` | ||
|
||
Otherwise, download the latest Linux build from Illumina:
|
||
```sh | ||
install_directory="${HOME}/.local/bin"
basespace_executable="$install_directory/bs"
# make sure the install directory exists before downloading into it
mkdir -p "$install_directory"
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O "$basespace_executable"
chmod +x "$basespace_executable"
# add to PATH (the escaped \$PATH is expanded when ~/.bashrc is sourced, not now)
# echo "export PATH=\"$install_directory:\$PATH\"" >> ~/.bashrc
# or add an alias
# echo "alias bs=\"$basespace_executable\"" >> ~/.bashrc
``` | ||
|
||
After installation, authenticate with your BaseSpace credentials: | ||
|
||
```sh | ||
bs authenticate | ||
# prints a link; open it in a browser to complete the sign-in
``` | ||
|
||
Follow the link and log in with your Illumina credentials,
then run `bs whoami` to verify that you are authenticated:
|
||
```console | ||
+----------------+----------------------------------------------------+ | ||
| Name | Ryan Najac | | ||
| Id | ######## | | ||
| Email | [email protected] | | ||
| DateCreated | 2021-07-13 15:29:51 +0000 UTC | | ||
| DateLastActive | 2024-06-03 18:59:47 +0000 UTC | | ||
| Host | https://api.basespace.illumina.com | | ||
| Scopes | READ GLOBAL, CREATE GLOBAL, BROWSE GLOBAL, | | ||
| | CREATE PROJECTS, CREATE RUNS, START APPLICATIONS, | | ||
| | MOVETOTRASH GLOBAL, WRITE GLOBAL | | ||
+----------------+----------------------------------------------------+ | ||
``` | ||
|
||
## Demultiplexing Illumina sequencing data | ||
|
||
Illumina instruments will demultiplex the data for you if you provide a valid | ||
sample sheet prior to sequencing. For more information on how to create a sample | ||
sheet, consult the [Illumina Support](https://support.illumina.com/) page. | ||
|
||
Skip ahead to the [Quality Control](#quality-control) section if the | ||
FASTQ files are already demultiplexed and ready for analysis, or keep reading | ||
for instructions on how to demultiplex the data yourself. | ||
|
||
Illumina hosts `.rpm` files for CentOS/RedHat Linux distros and the | ||
source code (which must be compiled) for other distros. | ||
|
||
Download bcl2fastq2 Conversion Software v2.20 Installer (Linux rpm) from | ||
[Illumina](https://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html). | ||
|
||
The AWS EC2 instance used for this project is based on Ubuntu, so we will | ||
have to convert the `.rpm` file to a `.deb` file using the `alien` package, | ||
as per this [post](https://www.biostars.org/p/266897/). | ||
|
||
```sh | ||
sudo alien -i bcl2fastq2-v2.20.0.422-Linux-x86_64.rpm | ||
``` | ||
|
||
> [!WARNING] | ||
> As of 2024-08-05, `bcl2fastq` is no longer supported; use BCL Convert (`bcl-convert`) instead.\
> You can install it using the same methods as described above.

Read the docs:
|
||
- [bcl2fastq](https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq_letterbooklet_15038058brpmi.pdf) | ||
- [bclconvert](https://support-docs.illumina.com/SW/BCL_Convert_v4.0/Content/SW/BCLConvert/BCLConvert.htm) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
# Jupyter | ||
|
||
Jupyter notebooks provide an interactive environment for data analysis, visualization, and code execution. They are particularly useful for bioinformatics workflows and exploratory data analysis. | ||
|
||
## Running Jupyter on an EC2 Instance | ||
|
||
When running Jupyter on an EC2 instance, you need to configure it to allow remote access. Follow these steps: | ||
|
||
1. Generate a config file: | ||
|
||
```sh | ||
jupyter notebook --generate-config | ||
``` | ||
|
||
2. Edit the config file: | ||
|
||
```sh | ||
nano ~/.jupyter/jupyter_notebook_config.py | ||
``` | ||
|
||
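3. Add or modify these lines (on releases built on Jupyter Server, such as Notebook 7 and JupyterLab, the same options are usually set on `c.ServerApp` instead of `c.NotebookApp`):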
|
||
```python | ||
c.NotebookApp.ip = '0.0.0.0' | ||
c.NotebookApp.open_browser = False | ||
c.NotebookApp.port = 8888 | ||
``` | ||
|
||
4. Start Jupyter: | ||
|
||
```sh | ||
jupyter notebook | ||
``` | ||
|
||
## Accessing Jupyter from Your Local Machine | ||
|
||
There are two main methods to access Jupyter running on an EC2 instance: | ||
|
||
### Method 1: Direct Access (Less Secure) | ||
|
||
1. Ensure port 8888 is open in your EC2 security group. | ||
2. Access Jupyter using your EC2 instance's public IP: | ||
|
||
``` | ||
http://<your-ec2-public-ip>:8888 | ||
``` | ||
|
||
3. Use the token provided in the Jupyter server output for authentication. | ||
|
||
### Method 2: SSH Tunnel (More Secure) | ||
|
||
1. Create an SSH tunnel: | ||
|
||
```sh | ||
ssh -i <your-key.pem> -L 8888:localhost:8888 ubuntu@<your-ec2-public-ip> | ||
``` | ||
|
||
2. Access Jupyter locally: | ||
|
||
``` | ||
http://localhost:8888 | ||
``` | ||
|
||
3. Use the token provided in the Jupyter server output for authentication. | ||
|
||
## Troubleshooting | ||
|
||
If you're having trouble connecting to Jupyter, check the following: | ||
|
||
1. EC2 Security Group: Ensure port 8888 is open (for direct access method). | ||
2. Jupyter Configuration: Verify the config file settings. | ||
3. Firewall: Check if the EC2 instance's firewall is blocking connections. | ||
4. Jupyter Server: Confirm Jupyter is running and note any error messages. | ||
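
The first and third checks come down to whether a TCP connection to the port succeeds. A minimal Python sketch, assuming a hypothetical `port_is_open` helper run from your local machine:

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the SSH tunnel method, call `port_is_open("localhost", 8888)` while the tunnel is running. For direct access, pass the instance's public IP instead. A `False` result points at the security group, the firewall, or a server that is not running, not at the browser.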
|
||
## Best Practices | ||
|
||
1. Use virtual environments to manage dependencies. | ||
2. Regularly save your work and consider version control for notebooks. | ||
3. For long-running tasks, consider using tools like `tmux` or `screen` to keep Jupyter running even if your SSH connection drops. | ||
|
||
Remember to always prioritize security when working with remote servers and sensitive data. |