This repository contains code for running an analysis pipeline in AWS Batch.
Given a set of SRA accession numbers, AWS Batch will start an array job where each child will process a single accession number, doing the following:
- Download the file(s) associated with the accession number from SRA, using the `prefetch` tool with the Aspera Connect transport.
- Start a bash pipe which runs the following steps, once for each of three viral genomes (a sketch of this pipe appears after the list):
  - Extract the downloaded `.sra` file to `fastq` format using `fastq-dump`. The `.sra` file is highly compressed and this step can expand it to more than 20 times its size, which is one reason we stream the data in a pipe: so as not to need lots of scratch space.
  - Pipe the `fastq` data through `bowtie2` to search for the virus.
  - Pipe the output of `bowtie2` through `gzip` to compress it prior to the next step.
  - Stream the compressed output of `bowtie2` to an S3 bucket. The resulting file will have an S3 URL like this: `s3://<bucket-name>/pipeline-results2/<SRA-accession-number>/<virus>/<SRA-accession-number>.sam.gz`
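For illustration, the per-accession, per-genome portion of the job might look roughly like the sketch below. This is a hypothetical sketch rather than the exact commands used by this repository: the accession number, bowtie2 index path, and bucket name are placeholders, and the real pipeline's flags may differ.

```bash
#!/bin/bash
# Hypothetical sketch of one per-accession, per-genome job. The accession,
# index path, and bucket name are placeholders, NOT taken from this repository.
set -euo pipefail

ACCESSION="SRR000001"        # example SRA accession number
GENOME="virus1"              # assumed name for one of the three viral genomes
BUCKET="example-bucket"      # placeholder S3 bucket name

# Download the .sra file for this accession (prefetch can use the Aspera
# Connect transport when it is available).
prefetch "$ACCESSION"

# Extract fastq to stdout, align it against the viral genome index, compress
# the SAM output, and stream it to S3 -- the expanded data never hits local disk.
fastq-dump --stdout "$ACCESSION" \
  | bowtie2 -x "indexes/$GENOME" -U - \
  | gzip -c \
  | aws s3 cp - "s3://$BUCKET/pipeline-results2/$ACCESSION/$GENOME/$ACCESSION.sam.gz"
```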
- The tools and setup steps below must all be run on the Fred Hutch internal network.
- Obtain your S3 credentials using the `awscreds` script. You only need to do this once.
- Request access to AWS Batch.
- Clone this repository to a location under your home directory, and then change directories into the repository (you only need to do this once, although you may need to run `git pull` periodically to keep your cloned repository up to date):
```
git clone https://github.com/FredHutch/sra-pipeline.git
cd sra-pipeline
```
A script called `sra_pipeline` is available to simplify the following:
- Display accession numbers that have already been processed.
- Display accession numbers which are currently being processed.
- Submit some number of new accession numbers to the pipeline, choosing either randomly, by picking the smallest available data sets, or by providing a file containing accession numbers.
Running the utility with `--help` gives usage information:
```
$ ./sra_pipeline --help
usage: sra_pipeline.py [-h] [-c] [-i] [-s N] [-r N] [-f FILE]

optional arguments:
  -h, --help            show this help message and exit
  -c, --completed       show completed accession numbers
  -i, --in-progress     show accession numbers that are in progress
  -s N, --submit-small N
                        submit N jobs of ascending size
  -r N, --submit-random N
                        submit N randomly chosen jobs
  -f FILE, --submit-file FILE
                        submit accession numbers contained in FILE
```
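For example (the file name below is just an illustration; the file presumably contains one accession number per line):

```
# show accession numbers that have already been processed
./sra_pipeline --completed

# submit the 5 smallest available data sets
./sra_pipeline --submit-small 5

# submit the accession numbers listed in a file
./sra_pipeline --submit-file my_accessions.txt
```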
You can get more detail about running jobs by using the Batch Dashboard and/or the AWS command-line client for Batch.
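For example, the AWS CLI can list running jobs, assuming you know the name of the Batch job queue used by the pipeline (the queue name below is a placeholder):

```
# list jobs currently running in the pipeline's job queue (placeholder name)
aws batch list-jobs --job-queue my-job-queue --job-status RUNNING

# show details for a specific job, given its job ID
aws batch describe-jobs --jobs <job-id>
```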
See Using AWS Batch at Fred Hutch for more information.