mv docs from wiki

PalomeroLab · Sep 5, 2024 · 082dac9
1 parent 0e6e969
commit 082dac9

Showing 7 changed files with 508 additions and 1 deletion.
diff --git a/Packages.md b/Packages.md
@@ -0,0 +1,58 @@
+# Packages
+
+Bioconda recommends these default channels:
+
+```sh
+conda config --add channels defaults
+conda config --add channels bioconda
+conda config --add channels conda-forge
+conda config --set channel_priority strict
+```
+
+Because `conda config --add` prepends each channel to the list, these commands
+write a `.condarc` file in the user's home directory with the following contents:
+
+```yaml
+channels:
+  - conda-forge
+  - bioconda
+  - defaults
+channel_priority: strict
+```
+
+However, the Mamba documentation recommends against using any of the
+[Anaconda default channels](https://docs.anaconda.com/working-with-conda/reference/default-repositories/)
+and suggests deactivating them rather than deprioritizing them.
+
+Instead of fighting against defaults, write spec files:
+
+```yaml
+name: RNAseq
+channels:
+  - bioconda
+  - conda-forge
+dependencies:
+  - fastqc
+  - hisat2
+  - bwa
+  - bowtie2
+  - samtools
+  - htslib
+  - bcftools
+  - stringtie
+  - bowtie
+  - subread
+```
+
+Then create the environment:
+
+```sh
+# it doesn't matter if you use .yml or .yaml, but be consistent!
+micromamba env create -f RNAseq.yaml
+```
+
+After you activate the environment with `micromamba activate RNAseq`, the shell
+prompt is prefixed with the name of the environment in parentheses:
+
+```console
+(RNAseq) ubuntu@ip-172-31-70-15:~/micromamba/envs$
+```
diff --git a/README.md b/README.md
@@ -1,3 +1,5 @@
 # how-to
 
+[![code style: prettier](https://img.shields.io/badge/code_style-prettier-ff69b4.svg?style=flat-square)](https://github.com/prettier/prettier)
+
 How to do stuff
diff --git a/RNAseq.md b/RNAseq.md
@@ -0,0 +1,104 @@
+# RNA Sequencing Analysis
+
+RNA sequencing (RNA-seq) is a technique used to analyze the transcriptome,
+the complete set of RNA transcripts in a cell. This method involves sequencing
+RNA molecules after reverse transcription to cDNA to examine gene expression levels
+and identify novel transcripts.
+
+This document describes two pipelines for RNA-seq analysis:
+
+- Using `featureCounts` and then analyzing with DESeq2 in R
+- Using the Tuxedo Suite (HISAT2, StringTie, Ballgown)
+
+## `featureCounts`
+
+[`featureCounts`](https://subread.sourceforge.net/featureCounts.html)
+is a program for counting reads mapped to genomic features, such as genes, exons,
+and promoters. It is part of the [Subread](https://subread.sourceforge.net)
+package and can be used for RNA-seq as well as DNA-seq analysis.
+
+> featureCounts takes as input SAM/BAM files and an annotation file including
+> chromosomal coordinates of features. It outputs numbers of reads assigned to features
+> (or meta-features). It also outputs stat info for the overall summarization results,
+> including number of successfully assigned reads and number of reads that failed to be
+> assigned due to various reasons (these reasons are included in the stat info).
+
+Use `featureCounts` to count the number of reads that map to each gene in a GTF
+file and summarize the results for downstream analysis (e.g., differential expression).
+
+### Input
+
+Aligned reads in BAM format and a GTF file containing genomic features.
+
+Example usage:
+
+```sh
+featureCounts -T "$NUM_THREADS" --verbose -t exon -g gene_id --countReadPairs \
+		-a "$REF_GTF" -p -P -C -B -o "${OUT_DIR}/${OUT_PREFIX}.tsv" ./*.bam
+```
+
+### Output
+
+Feature counts are written to a tab-delimited file (`.tsv` or `.txt`)
+with columns for each sample. You can import this file into R or other
+statistical software for further analysis.
+
+`featureCounts` also writes a summary file (the output path with a `.summary`
+suffix) that reports how many reads were assigned to features and how many
+were left unassigned, broken down by reason.
+
+## Downstream Analysis using R
+
+Export the results from `featureCounts` to R for further analysis, such as
+differential expression analysis using packages like DESeq2 or edgeR.
+
+### DESeq2
+
+[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) is an
+R package for differential gene expression analysis based on the negative
+binomial distribution.
+
+> DESeq2 provides methods to test for differential expression by use of negative
+> binomial generalized linear models. The models use the raw counts as input and
+> perform regularized log transformation and variance stabilizing
+> transformation. The package also provides functions to visualize the data and
+> results.
+
+<!-- -->
+
+## Tuxedo Suite
+
+The Tuxedo Suite is a collection of tools for transcript-level expression analysis of RNA-seq experiments.
+
+1. Align reads to the reference genome (HISAT2), then sort and convert the alignments to BAM format (`samtools sort`)
+2. Assemble transcripts (StringTie)
+3. Prepare for differential expression analysis (Ballgown setup)
+4. Perform differential expression analysis
+5. Visualize results
+
+| Tool      | Description                                    | Manual                                                                      | Source                                             |
+| --------- | ---------------------------------------------- | --------------------------------------------------------------------------- | -------------------------------------------------- |
+| HISAT2    | Align reads to reference genome                | [manual](https://daehwankimlab.github.io/hisat2/manual/)                    | [source](https://github.com/DaehwanKimLab/hisat2)  |
+| StringTie | Assemble RNA-Seq alignments into transcripts   | [manual](https://ccb.jhu.edu/software/stringtie/index.shtml)                | [source](https://github.com/gpertea/stringtie)     |
+| Ballgown  | Isoform-level differential expression analysis | [manual](https://bioconductor.org/packages/release/bioc/html/ballgown.html) | [source](https://github.com/alyssafrazee/ballgown) |
+
+> [!TIP]
+> StringTie comes packaged with `gffcompare` for comparing and evaluating the
+> accuracy of RNA-seq transcript assemblers. Read the
+> [manual](https://ccb.jhu.edu/software/stringtie/gffcompare.shtml) for details.
+
+Using Ballgown:
+
+1. Filter to remove low-abundance genes
+2. Identify differentially expressed transcripts and genes
+3. Add gene names
+4. Sort by p-value
+5. Write results to file
+
+## References
+
+- Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923-30. [PMID: 24227677](https://pubmed.ncbi.nlm.nih.gov/24227677/)
+- Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc. 2016;11(9):1650-1667. [PMID: 27560171](https://pubmed.ncbi.nlm.nih.gov/27560171/)
+- Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357-360. [PMID: 25751142](https://pubmed.ncbi.nlm.nih.gov/25751142/)
+- Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290-295. [PMID: 25690850](https://pubmed.ncbi.nlm.nih.gov/25690850/)
+- Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT. Ballgown bridges the gap between transcriptome assembly and expression analysis. Nat Biotechnol. 2015;33(3):243-246. [PMID: 25748911](https://pubmed.ncbi.nlm.nih.gov/25748911/)
diff --git a/docs/Illumina.md b/docs/Illumina.md
@@ -0,0 +1,102 @@
+# Illumina
+
+If you used a core facility, Azenta, or another commercial service for
+sequencing, they will send a link to download the demultiplexed
+FASTQ files directly, usually with corresponding md5 checksums.
+
+For instructions on how to download data from Azenta's sFTP server,
+click [here](https://3478602.fs1.hubspotusercontent-na1.net/hubfs/3478602/13012-M%26G%200222%20sFTP%20Guide-3.pdf).
+
+> [!NOTE]
+> sFTP (Secure File Transfer Protocol) provides an encrypted channel for data transfer.\
+> The md5 checksum is a 128-bit fingerprint (32 hexadecimal characters) computed
+> from the contents of a file; it changes if the file is modified, so comparing
+> checksums confirms that a download is complete and intact.
+> Read the original [RFC 1321](https://www.ietf.org/rfc/rfc1321.txt).
+
+Otherwise, consult the [documentation](https://developer.basespace.illumina.com/docs)
+for the appropriate Illumina sequencer:
+
+- [MiSeq](https://support.illumina.com/sequencing/sequencing_instruments/miseq/documentation.html)
+- [NextSeq500](https://support.illumina.com/sequencing/sequencing_instruments/nextseq-550/documentation.html)
+
+## BaseSpace and the `bs` command-line interface
+
+The browser-based interface is useful for small-scale projects, but the
+command-line interface is more efficient for large-scale projects.
+Check out [examples](https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-examples).
+
+Install on macOS using Homebrew:
+
+```sh
+brew tap basespace/basespace && brew install bs-cli
+```
+
+Otherwise, download the latest version from Illumina:
+
+```sh
+install_directory="${HOME}/.local/bin"
+basespace_executable="$install_directory/bs"
+# create the install directory if it does not exist yet
+mkdir -p "$install_directory"
+wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O "$basespace_executable"
+chmod +x "$basespace_executable"
+# add to PATH (escape $PATH so it is expanded at login, not when this line is written)
+# echo "export PATH=\"$install_directory:\$PATH\"" >> ~/.bashrc
+# or add an alias
+# echo "alias bs='$basespace_executable'" >> ~/.bashrc
+```
+
+After installation, authenticate with your BaseSpace credentials:
+
+```sh
+bs authenticate
+```
+
+Follow the link and log in with your Illumina credentials,
+then run `bs whoami` to verify that you are authenticated:
+
+```console
++----------------+----------------------------------------------------+
+| Name           | Ryan Najac                                         |
+| Id             | ########                                           |
+| Email          | [email protected]                                  |
+| DateCreated    | 2021-07-13 15:29:51 +0000 UTC                      |
+| DateLastActive | 2024-06-03 18:59:47 +0000 UTC                      |
+| Host           | https://api.basespace.illumina.com                 |
+| Scopes         | READ GLOBAL, CREATE GLOBAL, BROWSE GLOBAL,         |
+|                | CREATE PROJECTS, CREATE RUNS, START APPLICATIONS,  |
+|                | MOVETOTRASH GLOBAL, WRITE GLOBAL                   |
++----------------+----------------------------------------------------+
+```
+
+## Demultiplexing Illumina sequencing data
+
+Illumina instruments will demultiplex the data for you if you provide a valid
+sample sheet prior to sequencing. For more information on how to create a sample
+sheet, consult the [Illumina Support](https://support.illumina.com/) page.
+
+Skip ahead to the [Quality Control](#quality-control) section if the
+FASTQ files are already demultiplexed and ready for analysis, or keep reading
+for instructions on how to demultiplex the data yourself.
+
+Illumina hosts `.rpm` files for CentOS/RedHat Linux distros and the
+source code (which must be compiled) for other distros.
+
+Download bcl2fastq2 Conversion Software v2.20 Installer (Linux rpm) from
+[Illumina](https://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html).
+
+The AWS EC2 instance used for this project is based on Ubuntu, so we will
+have to convert the `.rpm` file to a `.deb` file using the `alien` package,
+as per this [post](https://www.biostars.org/p/266897/).
+
+```sh
+# -i converts the .rpm to a .deb and installs the generated package
+sudo alien -i bcl2fastq2-v2.20.0.422-Linux-x86_64.rpm
+```
+
+> [!WARNING]
+> As of 2024-08-05, `bcl2fastq` is no longer supported; use `bclconvert` instead.\
+> You can install `bclconvert` using the same methods as described above.
+
+Read the docs:
+
+- [bcl2fastq](https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq_letterbooklet_15038058brpmi.pdf)
+- [bclconvert](https://support-docs.illumina.com/SW/BCL_Convert_v4.0/Content/SW/BCLConvert/BCLConvert.htm)
diff --git a/docs/Jupyter.md b/docs/Jupyter.md
@@ -0,0 +1,81 @@
+# Jupyter
+
+Jupyter notebooks provide an interactive environment for data analysis, visualization, and code execution. They are particularly useful for bioinformatics workflows and exploratory data analysis.
+
+## Running Jupyter on an EC2 Instance
+
+When running Jupyter on an EC2 instance, you need to configure it to allow remote access. Follow these steps:
+
+1. Generate a config file:
+
+   ```sh
+   jupyter notebook --generate-config
+   ```
+
+2. Edit the config file:
+
+   ```sh
+   nano ~/.jupyter/jupyter_notebook_config.py
+   ```
+
+3. Add or modify these lines:
+
+   ```python
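+   c.NotebookApp.ip = '0.0.0.0'        # listen on all network interfaces
+   c.NotebookApp.open_browser = False  # headless server: do not launch a browser
+   c.NotebookApp.port = 8888           # must match the port opened or tunneled below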
+   ```
+
+4. Start Jupyter:
+
+   ```sh
+   jupyter notebook
+   ```
+
+## Accessing Jupyter from Your Local Machine
+
+There are two main methods to access Jupyter running on an EC2 instance:
+
+### Method 1: Direct Access (Less Secure)
+
+1. Ensure port 8888 is open in your EC2 security group.
+2. Access Jupyter using your EC2 instance's public IP:
+
+   ```
+   http://<your-ec2-public-ip>:8888
+   ```
+
+3. Use the token provided in the Jupyter server output for authentication.
+
+### Method 2: SSH Tunnel (More Secure)
+
+1. Create an SSH tunnel:
+
+   ```sh
+   ssh -i <your-key.pem> -L 8888:localhost:8888 ubuntu@<your-ec2-public-ip>
+   ```
+
+2. Access Jupyter locally:
+
+   ```
+   http://localhost:8888
+   ```
+
+3. Use the token provided in the Jupyter server output for authentication.
+
+## Troubleshooting
+
+If you're having trouble connecting to Jupyter, check the following:
+
+1. EC2 Security Group: Ensure port 8888 is open (for direct access method).
+2. Jupyter Configuration: Verify the config file settings.
+3. Firewall: Check if the EC2 instance's firewall is blocking connections.
+4. Jupyter Server: Confirm Jupyter is running and note any error messages.
+
+## Best Practices
+
+1. Use virtual environments to manage dependencies.
+2. Regularly save your work and consider version control for notebooks.
+3. For long-running tasks, consider using tools like `tmux` or `screen` to keep Jupyter running even if your SSH connection drops.
+
+Remember to always prioritize security when working with remote servers and sensitive data.