bcbio Run

Setting up for bcbio single cell RNA-Seq analysis

Questions during initial consult
- Protocol for disassociation - how difficult?
- Type of method used?
- % viability of cells
- What types of cells expect to exist and a ballpark proportion expected of each
  - If rare cell type, any method to enrich for cell type?
- Potential sources of variation that we might expect?
Analyst should ask client for the following:
- How many samples were sequenced?
- What were the sample indices used?
- How many cells were encapsulated and sequenced per sample?
- What is the main experimental question - does it require clustering using markers and/or cell trajectory analyses?
- What are a handful of markers for the expected cell types using official gene symbols or Ensembl IDs.
  - Human gene symbols, clients can search here
  - Mouse gene symbols, clients can search here.
Acquire data from sequencing core. The way in which you handle/process your data will differ depending on the sequencing core that you obtain it from. The key thing to keep in mind is that the input to bcbio cannot be demultiplexed. The data needs to remain multiplexed, but split into four FASTQ files (R1-R4, as described in detail below).
- Bauer sequencing core: uses Basespace. To download the sequencing files use BaseMount.
  - The BaseSpaceRunDownloader tool previously used and shown below is deprecated:
```
 wget https://da1s119xsxmu0.cloudfront.net/sites/knowledgebase/API/08052014/Script/BaseSpaceRunDownloader_v2.zip
 unzip BaseSpaceRunDownloader_v2.zip
 python BaseSpaceRunDownloader_v2.py -r <Run ID> -a <access token>
```
    The option -r is the number in the basespace url and the access token is something you have to get for your basespace account.
  The files output will be BCL files that can be turned into FASTQ files with the bcl2fastq tool (instructions below).
- DFCI sequencing center (Zach): will output the FASTQ files in the correct format since the Core has provided a script to Zach, but should check the files - should have 4 reads, not a huge undetermined file, etc.
- Biopolymers sequencing facility: should be FASTQ, but should check the files - should have 4 reads, not a huge undetermined file, etc
- Broad Institute: has their own single cell distribution platform - should be FASTQ, but should check the files - should have 4 reads, not a huge undetermined file, etc
- CCCB: will generally provide tarballs that correspond to different runs. Sometimes they have run bcl2fastq on the data but you do not want to use this output. It is likely demultiplexed and cannot be used as input to bcbio.
If downloaded sequencing files are BCL format, then need to convert to FASTQ after completing changes to the Samplesheet that are detailed below.
- Change directories to the sequencing folder downloaded from the facility. The folder should be arranged according to the image below for NextSeq or MiniSeq:
  
  Image acquired from bcl2fastq documentation.
```
 cd path/to/YYMMDD_machinename_XXXX_FCexperimentname 
```
- Update the Samplesheet.csv so that it does not demultiplex.
  - In the run-level folder (decompress the tarball), you should see a Samplesheet.csv file. This is a standard file obtained from Illumina sequencing. In the file you will notice four sections (Header, Reads, Settings, Data). The [Data] section is what we are interested in. It should look something like:
```
 		[Data],,,,,,,,,
 Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
 Sample_JP1_11a_1,JP1_11a_1,,,D701,ATTACTCG,D501,AGGCTATA,JP_06092018_1641,
```
  There are two things we need to check in this section of the CSV file:
  1. The columns I7_Index and I5_index are empty.
  2. The barcode sequences that are in index and index2 columns do not match sample barcodes. These can be changed to a dummy sequence like AAAAAAAA just to be safe.
  NOTE: If we do not make these changes, bcl2fastq will attempt to demultiplex the samples. We can make changes because we don't need this samplesheet for any steps downstream other than the bcl2fastq step.
- Log on to O2 to run bcl2fastq. Load bcl2fastq module and convert files to FASTQ by using the following command:
```
 bcl2fastq \
 --use-bases-mask y*,y*,y*,y* \
 --mask-short-adapter-reads 0 \
 --minimum-trimmed-read-length 0
```
  More information regarding the bcl2fastq command and directory structures for other sequencing machines can be found in the documentation.
  
  NOTE: This can sometimes take awhile and is best run as a job submission script.
The output files should be in the BaseCalls directory. For each file of sequenced reads, there should be four associated FASTQ files (R1-R4) for the inDrops technology.
- R1 (61 bp Read 1): sequence of the read
- R2 (8 bp Index Read 1 (i7)): cellular barcode - which cell read originated from
- R3 (8 bp Index Read 2 (i5)): library index - which sample read originated from
- R4 (14 bp Read 2): read 2 and barcode/UMI - remaining cellular barcode and UMI - which transcript read originated from (to find PCR duplicates)
The reads for each sequence are depicted in the image below:

Image credit: Sarah Boswell, Harvard Staff Scientist for Sequencing Technologies
To quickly view the counts for the barcodes with the top five highest counts based on the first 10,000 reads in a file:
```
gzip -cd filename_R3.fq.gz | head -40000 | awk 'NR % 4 == 2' | sort | uniq -c | awk 	'{ print $2 "," $1}' | sort -t"," -n --key=2 | tail -5
```
NOTE: awk 'NR % 4 == 2' gets every 4th line starting from the 2nd, which is a useful trick when you want to count up FASTQ file entries (Rory's code)

The reverse complement sequences of the sample indices given by the client should correspond to the most abundant indices in the file.

Automatization

Alternatively, you can use this script to make a list of top N barcodes and match them with a list provided in a CSV file:
```
python check_sc_barcode.py --fastq FILE_R3.fastq.gz --barcodes barcodes.csv
```
It needs python3 and biopython (for people on O2 python is available at /n/app/bcbio/conda3/bin/python and the script at /n/app/bcbio/scripts/check_sc_barcode.py).

The barcode.csv file looks like this:
```
CTATTAAG,M3L_Basal amygdala
AAGGCTAT,M2_BA25_1
GAGCCTTA,M3R_BA25
TTATGCGA,M3L_BA25
```
The output will look like this:
```
This barcode CTTAATAG has not been detected.
This barcode ATAGCCTT has not been detected.
This barcode TAAGGCTC (M2_BA25 2) is detected with 1394 reads.
This barcode AGATCTCG is not in your list (2893 reads).
```

Use the cat command to concatenate all of the files for a given sample across lanes:

cat Undetermined_S0_L001_R1_001.fastq.gz Undetermined_S0_L002_R1_001.fastq.gz Undetermined_S0_L003_R1_001.fastq.gz Undetermined_S0_L004_R1_001.fastq.gz > cat_R1.fastq.gz

or

cat *R1*.fastq.gz > cat_R1.fastq.gz

Do the same for the R2, R3, and R4 files.

Create metadata file as normal for bcbio run. Note that your FASTQ files are not demultiplexed, so you will often have multiple samples in each of the FASTQ files.
```
fileName,description
cat,run1
```

Download the most recent transcriptome FASTA and GTF files:

# Most recent mouse FASTA from Ensembl FTP
wget ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz

# Most recent mouse GTF from Ensembl FTP
wget ftp://ftp.ensembl.org/pub/release-92/gtf/mus_musculus/Mus_musculus.GRCm38.92.gtf.gz

# Perform the checksums
sum Mus_musculus.GRCm38.cdna.all.fa.gz
sum Mus_musculus.GRCm38.92.gtf.gz

# Decompress FASTA and GTF to run in bcbio
gzip -d Mus_musculus.GRCm38.cdna.all.fa.gz
gzip -d Mus_musculus.GRCm38.92.gtf.gz

Overview of bcbio single cell RNA-Seq workflow on O2

The bcbio single cell RNA-Seq pipeline will perform the following steps:

Identify the sample barcodes in the R3 read, which will be provided in the config file in the sample_barcodes parameter. A single mismatch between known sample barcodes and sequences is allowed.
Identify the cellular barcodes by parsing the R2 and R4 reads.
Identify the unique molecular identifiers (UMIs) by parsing R4 read.
Filter out the sequence data with cellular barcodes matching less than 1000 reads (indicating poor quality cells due to encapsulation of free floating RNA from dying cells, small cells, or set of cells that failed for some reason). The threshold for the number of matching reads used for filtering can be specified in the config file with the minimum_barcode_depth parameter.
Align reads with Rapmap tool.
Take reads that mapped to more than one transcript and divide the count between all of the transcripts to which the reads aligned.

NOTE: The location of the barcodes and UMIs differs by library method, and the description given above reflects the locations for inDrop data. However, bcbio will perform similar steps for other methods; it will just parse the reads a bit differently.

Running bcbio single cell RNA-Seq workflow on O2

Create sample barcodes file (.txt) to identify samples in bcbio. The reverse-complement of the sample barcodes supplied by the client are written as a single barcode per line in a file. No other text should be present in the file, for example the following is the contents of a barcode file for an experiment with four samples:
```
AGGCTTAG
CGGAGAGA
TACTCCTT
ATTAGACG
```

NOTE: This is information that should have been supplied by the client. While it is possible to run bcbio without this, it is advisable not to.

NOTE: The barcodes written here should match the most prevalent barcodes in the Setting up for bcbio single cell RNA-Seq analysis section, Step 5.

Create configuration template for single cell run:

details:
  - analysis: scRNA-seq
    algorithm:
      transcriptome_fasta: /n/data1/cores/bcbio/PIs/PI_name/ref_data/Mus_musculus.GRCm38.cdna.all.fa
      transcriptome_gtf: /n/data1/cores/bcbio/PIs/PI_name/ref_data/Mus_musculus.GRCm38.92.gtf
      umi_type: harvard-indrop-v3
      minimum_barcode_depth: 1000
      cellular_barcode_correction: 1
      sample_barcodes: /n/data1/cores/bcbio/PIs/PI_name/meta/hbc02055-sample-barcodes-rc.txt
    genome_build: mm10

NOTE: If you want to perform the same barcode selection as cellranger, then you can add auto in minimum_barcode_depth parameter.

NOTE: The .gtf is only used to link genes to transcripts - does not use the coordinates, so it is fine that the coordinates reference the genome.

Normal bcbio configuration file creation:

bcbio_nextgen.py -w template ../config/scRNAseq_config_template.yaml ../meta/PI_name.csv ../hbcXXXXX/seq_dir/Data/Intensities/BaseCalls/cat*fastq.gz

Create script (below) to run job on O2 and run with sbatch ../../runJob-PI_name-scRNAseq.slurm:

#!/bin/sh
#SBATCH -p medium
#SBATCH -J win-full
#SBATCH -o run.o
#SBATCH -e run.e
#SBATCH -t 4-00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=8000
#SBATCH --mail-type=ALL
#SBATCH --mail-user=piper@hsph.harvard.edu

/n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py ../config/PI_name.yaml -n 48 -t ipython -s slurm -q medium -r t=4-00:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

01_bcbio_run.md

01_bcbio_run.md

bcbio Run

Setting up for bcbio single cell RNA-Seq analysis

Overview of bcbio single cell RNA-Seq workflow on O2

Running bcbio single cell RNA-Seq workflow on O2

Files

01_bcbio_run.md

Latest commit

History

01_bcbio_run.md

File metadata and controls

bcbio Run

Setting up for bcbio single cell RNA-Seq analysis

Overview of bcbio single cell RNA-Seq workflow on O2

Running bcbio single cell RNA-Seq workflow on O2