Parallelize BWA across FASTQs #157

iskandr · 2016-03-02T22:32:38Z

The alignments produced by BWA are independent across reads.

Each FASTQ can be aligned separately, for example:

# align all tumor FASTQs
bwa aln tumor_L005_001_R1.fastq.gz > tumor_L005_001_R1.sai
bwa aln tumor_L005_001_R2.fastq.gz > tumor_L005_001_R2.sai
bwa aln tumor_L005_002_R1.fastq.gz > tumor_L005_002_R1.sai
bwa aln tumor_L005_002_R2.fastq.gz > tumor_L005_002_R2.sai
# align normal FASTQs
bwa aln normal_L004_001_R1.fastq.gz > normal_L004_001_R1.sai
bwa aln normal_L004_001_R2.fastq.gz > normal_L004_001_R2.sai

(see: https://wikis.utexas.edu/display/CoreNGSTools/Alignment#Alignment-Performingthebwaalignment)

Once we have a collection of aligned sai files, each R1/R2 sai pair can be converted to a SAM file:

bwa sampe grch38.fa tumor_L005_001_R1.sai tumor_L005_001_R2.sai tumor_L005_001_R1.fastq tumor_L005_001_R2.fastq > tumor_L005_001.sam

The sam files for each R1/R2 pair can be converted to bam files (with samtools view) and then merged into a single bam file with e.g. samtools merge tumor.bam tumor*.bam

I'm surprised that there isn't much documentation online about doing this but it does seem that aside from combining mate pairs, nothing done by bwa involves interactions between reads.

A similar pipeline seems to be implemented in Snakemake: https://github.com/inodb/snakemake-parallel-bwa/blob/master/Snakefile

The text was updated successfully, but these errors were encountered:

iskandr · 2016-03-02T23:51:56Z

Hopefully something similar can be done for MOSAIK, but making sure separate alignment of FASTQs won't give different results: wanpinglee/MOSAIK#24

iskandr · 2016-03-02T23:53:28Z

@smondet I think I'd rather create the workflow nodes for parallelization explicitly than rely on a compiler optimization (unless there's some way to enforce that this optimization has to run). I think that for larger sequencing runs it can't be thought of as optional (especially when we're under a tight time budget)

smondet · 2016-03-02T23:59:03Z

@iskandr what I'm doing there is adding the option to fail loudly if the compiler pass does not happen:
https://github.com/hammerlab/biokepi/pull/150/files#diff-6cc6e1c3d49a4d36b4884380ad53d585R467
(it's not merged yet because testing/fixing in progress)

iskandr · 2016-03-03T00:08:37Z

For BWA-MEM the sequence of commands for multiple tumor FASTQs should be:

bwa mem grch38.fa  tumor_L005_001_R1.fastq.gz  tumor_L005_001_R2.fastq.gz >  tumor_L005_001.sam
bwa mem grch38.fa  tumor_L005_002_R1.fastq.gz  tumor_L005_002_R2.fastq.gz >  tumor_L005_002.sam

Followed by merging into a bam file.

JPFinnigan · 2016-03-04T18:55:37Z

I'm surprised that there isn't much documentation online about doing this but it does seem that aside from combining mate pairs, nothing done by bwa involves interactions between reads.

Does BWA-MEM try 2-pass alignement (e.g. rescue alignment of unaligned R1 using the aligned R2, or v.v.) in a manner similar to a the gapped-aligners TopHat2/HISAT and STAR? I imagine this would fail if alignment was done independently on R1/R2.

Relevant portion of HISAT paper:

HISATx1 uses a one-pass approach that aligns each pair of reads independently of others. HISATx2 is a two-pass version of HISAT to mimic the two-step approach used in TopHat2."

Link: http://www.nature.com/nmeth/journal/v12/n4/full/nmeth.3317.html?WT.ec_id=NMETH-201504

iskandr · 2016-03-04T20:22:14Z

@JPFinnigan BWA-MEM does take into account mate pairs, so every call to bwa mem requires both the _R1 and _R2 FASTQs. However, that still leaves room for coarse-grained parallelism in the alignments since we can potentially get a lot of these FASTQ pairs back from the sequencing core.

JPFinnigan · 2016-03-10T18:41:49Z

BWA-MEM does take into account mate pairs, so every call to bwa mem requires both the _R1 and _R2 FASTQs

I don't think I clearly communicated my question/concern. BWA-MEM run in paired-end uses information about the successful alignment of one read in a mate pair (e.g. R1) to adjust or potentially rescue alignment of the other read (R2) as the case w/ gapped-aligners (STAR/HISAT). I don't think you produce the same alignment running a single instance of bwa mem in paired-end mode (e.g. supplying -R1 -R2) as you do running two parallel instances of bwa aln on R1 and R2 followed by merger w/ bwa sampe. If this is the case I would be careful about parallelizing w/ instances of bwa aln versus w/ instances bwa-mem. For this reason when working with the B16 data I ran multiple instances of BWA-MEM, but perhaps I'm wrong...

Relevant section from BWA-MEM paper (http://arxiv.org/pdf/1303.3997v2.pdf):

Paired-end mapping
2.2.1 Rescuing missing hits Like BWA (Li and Durbin, 2009), BWA- MEM processes a batch of reads at a time. For each batch, it estimates the mean μ and the variance σ2 of the insert size distribution from reliable single-end hits. For the top 100 hits (by default) of either end, if the mate is unmapped in a window [μ − 4σ, μ + 4σ] from each hit, BWA-MEM performs SSE2-based Smith-Waterman alignment (SW; Farrar 2007) for the mate within the window. The second best SW score is recorded to detect potential mismapping in a long tandem repeat. Hits found from both the single-sequence alignment and SW rescuing will be used for pairing

Pairing
Given the i-th hit for the first read and j-th hit for the second, BWA-MEM computes their distance dij if the two hits are in the right orientation, or sets d to infinity otherwise. It scores the pair (i,j) ij with Sij = Si + Sj − min{−alog4 P(dij),U}, where Si and Sj are the SW score of the two hits, respectively, a is the matching score and P (d) gives the probability of observing an insert size larger than d assuming a normal distribution. ‘log4’ arises when we interpret SW score as odds ratio (Durbin et al., 1998). U is a threshold that controls pairing: if dij is small enough such that −alog4 P(dij) < U, BWA-MEM prefers to pair the two ends; otherwise it prefers the unpaired alignments. BWA-MEM chooses the pair (i,j) that maximizes Sij as the final alignments for both ends. This pairing strategy jointly considers the alignment scores, the insert size and the possibility of chimeric read pairs.

iskandr · 2016-03-10T19:07:06Z

@JPFinnigan We're not going to run parallel bwa aln in place of bwa mem. Rather, whichever of BWA or BWA-MEM you want to use, we're going to parallelize them. In the case of bwa aln, we're going to parallelize across all FASTQ files. For bwa mem we're going to parallelize across all R1/R2 FASTQ pairs. Does that make sense?

JPFinnigan · 2016-03-10T19:23:20Z

Sure. NP. From your original example it seemed as though you were proposing to use `bwa aln`` for paired-reads.

iskandr · 2016-03-10T19:28:18Z

Sorry, that was confusing!

On Thu, Mar 10, 2016 at 2:23 PM, JPFinnigan [email protected]
wrote:

Sure. NP. From your original example it seemed as though you were
proposing to use bwa aln` for paired-reads.

—
Reply to this email directly or view it on GitHub
#157 (comment).

smondet · 2016-03-10T20:32:31Z

Anyway, this now implemented as an option. For any aligner, we can run it with a single pair of FASTQs or parallelized over pairs of FASTQs fragments. And we can choose just before submitting a given pipeline (it's a Pipeline.Compiler.t option).

smondet mentioned this issue Mar 9, 2016

Optimize aligners over FASTQ fragments on-demand #162

Merged

smondet closed this as completed Mar 10, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallelize BWA across FASTQs #157

Parallelize BWA across FASTQs #157

iskandr commented Mar 2, 2016

iskandr commented Mar 2, 2016

iskandr commented Mar 2, 2016

smondet commented Mar 2, 2016

iskandr commented Mar 3, 2016

JPFinnigan commented Mar 4, 2016

iskandr commented Mar 4, 2016

JPFinnigan commented Mar 10, 2016

iskandr commented Mar 10, 2016

JPFinnigan commented Mar 10, 2016

iskandr commented Mar 10, 2016

smondet commented Mar 10, 2016

Parallelize BWA across FASTQs #157

Parallelize BWA across FASTQs #157

Comments

iskandr commented Mar 2, 2016

iskandr commented Mar 2, 2016

iskandr commented Mar 2, 2016

smondet commented Mar 2, 2016

iskandr commented Mar 3, 2016

JPFinnigan commented Mar 4, 2016

iskandr commented Mar 4, 2016

JPFinnigan commented Mar 10, 2016

iskandr commented Mar 10, 2016

JPFinnigan commented Mar 10, 2016

iskandr commented Mar 10, 2016

smondet commented Mar 10, 2016