This repository contains code for running an analysis pipeline in AWS Batch.
Given a set of SRA accession numbers, AWS Batch will start an array job where each child will process a single accession number, doing the following:
- Download the file(s) associated with the accession number from SRA, using the `prefetch` tool with the Aspera Connect transport.
- Start a bash pipe which runs the following steps, once for each of three viral genomes (a sketch of this pipe appears after the list):
  - Extract the downloaded `.sra` file to `fastq` format using `fastq-dump`. The `.sra` file is highly compressed and this step can expand it to more than 20 times its size, which is one reason we stream the data in a pipe: so as not to need lots of scratch space.
  - Pipe the `fastq` data through `bowtie2` to search for the virus.
  - Pipe the output of `bowtie2` through `gzip` to compress it prior to the next step.
  - Stream the compressed output of `bowtie2` to an S3 bucket. The resulting file will have an S3 URL like this: `s3://<bucket-name>/pipeline-results2/<SRA-accession-number>/<virus>/<SRA-accession-number>.sam.gz`
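For illustration, the per-accession, per-genome portion of the job might look roughly like the sketch below. This is a hypothetical sketch rather than the exact commands used by this repository: the accession number, bowtie2 index path, and bucket name are placeholders, and the real pipeline's flags may differ.

```bash
#!/bin/bash
# Hypothetical sketch of one per-accession, per-genome job. The accession,
# index path, and bucket name are placeholders, NOT taken from this repository.
set -euo pipefail

ACCESSION="SRR000001"        # example SRA accession number
GENOME="virus1"              # assumed name for one of the three viral genomes
BUCKET="example-bucket"      # placeholder S3 bucket name

# Download the .sra file for this accession (prefetch can use the Aspera
# Connect transport when it is available).
prefetch "$ACCESSION"

# Extract fastq to stdout, align it against the viral genome index, compress
# the SAM output, and stream it to S3 -- the expanded data never hits local disk.
fastq-dump --stdout "$ACCESSION" \
  | bowtie2 -x "indexes/$GENOME" -U - \
  | gzip -c \
  | aws s3 cp - "s3://$BUCKET/pipeline-results2/$ACCESSION/$GENOME/$ACCESSION.sam.gz"
```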
- The tools and setup steps below must all be run on the Fred Hutch internal network.
- Obtain your S3 credentials using the `awscreds` script. You only need to do this once.
- Request access to AWS Batch.
- Clone this repository to a location under your home directory, and then change directories into the repository (you only need to do this once, although you may need to run `git pull` periodically to keep your cloned repository up to date):
```
git clone https://github.com/FredHutch/sra-pipeline.git
cd sra-pipeline
```
A script called `sra_pipeline` is available to simplify the following:
- Display accession numbers that have already been processed.
- Display accession numbers which are currently being processed.
- Submit some number of new accession numbers to the pipeline, choosing either randomly, by picking the smallest available data sets, or by providing a file containing accession numbers.
Running the utility with `--help` gives usage information:
```
$ ./sra_pipeline --help
usage: sra_pipeline.py [-h] [-c] [-i] [-s N] [-r N] [-f FILE]

optional arguments:
  -h, --help            show this help message and exit
  -c, --completed       show completed accession numbers
  -i, --in-progress     show accession numbers that are in progress
  -s N, --submit-small N
                        submit N jobs of ascending size
  -r N, --submit-random N
                        submit N randomly chosen jobs
  -f FILE, --submit-file FILE
                        submit accession numbers contained in FILE
```
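For example (the file name below is just an illustration; the file presumably contains one accession number per line):

```
# show accession numbers that have already been processed
./sra_pipeline --completed

# submit the 5 smallest available data sets
./sra_pipeline --submit-small 5

# submit the accession numbers listed in a file
./sra_pipeline --submit-file my_accessions.txt
```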
You can get more detail about running jobs by using the Batch Dashboard and/or the AWS command-line client for Batch.
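For example, the AWS CLI can list running jobs, assuming you know the name of the Batch job queue used by the pipeline (the queue name below is a placeholder):

```
# list jobs currently running in the pipeline's job queue (placeholder name)
aws batch list-jobs --job-queue my-job-queue --job-status RUNNING

# show details for a specific job, given its job ID
aws batch describe-jobs --jobs <job-id>
```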
See Using AWS Batch at Fred Hutch for more information.