Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize BWA across FASTQs #157

Closed
iskandr opened this issue Mar 2, 2016 · 11 comments
Closed

Parallelize BWA across FASTQs #157

iskandr opened this issue Mar 2, 2016 · 11 comments

Comments

@iskandr
Copy link
Member

iskandr commented Mar 2, 2016

The alignments produced by BWA are independent across reads.

Each FASTQ can be aligned separately, for example:

# align all tumor FASTQs
bwa aln tumor_L005_001_R1.fastq.gz > tumor_L005_001_R1.sai
bwa aln tumor_L005_001_R2.fastq.gz > tumor_L005_001_R2.sai
bwa aln tumor_L005_002_R1.fastq.gz > tumor_L005_002_R1.sai
bwa aln tumor_L005_002_R2.fastq.gz > tumor_L005_002_R2.sai
# align normal FASTQs
bwa aln normal_L004_001_R1.fastq.gz > normal_L004_001_R1.sai
bwa aln normal_L004_001_R2.fastq.gz > normal_L004_001_R2.sai

(see: https://wikis.utexas.edu/display/CoreNGSTools/Alignment#Alignment-Performingthebwaalignment)

Once we have a collection of aligned sai files, each R1/R2 sai pair can be converted to a SAM file:

bwa sampe grch38.fa tumor_L005_001_R1.sai tumor_L005_001_R2.sai tumor_L005_001_R1.fastq tumor_L005_001_R2.fastq > tumor_L005_001.sam

The sam files for each R1/R2 pair can be converted to bam files (with samtools view) and then merged into a single bam file with e.g. samtools merge tumor.bam tumor*.bam

I'm surprised that there isn't much documentation online about doing this but it does seem that aside from combining mate pairs, nothing done by bwa involves interactions between reads.

A similar pipeline seems to be implemented in Snakemake: https://github.com/inodb/snakemake-parallel-bwa/blob/master/Snakefile

@iskandr
Copy link
Member Author

iskandr commented Mar 2, 2016

Hopefully something similar can be done for MOSAIK, but making sure separate alignment of FASTQs won't give different results: wanpinglee/MOSAIK#24

@iskandr
Copy link
Member Author

iskandr commented Mar 2, 2016

@smondet I think I'd rather create the workflow nodes for parallelization explicitly than rely on a compiler optimization (unless there's some way to enforce that this optimization has to run). I think that for larger sequencing runs it can't be thought of as optional (especially when we're under a tight time budget)

@smondet
Copy link
Member

smondet commented Mar 2, 2016

@iskandr what I'm doing there is adding the option to fail loudly if the compiler pass does not happen:
https://github.com/hammerlab/biokepi/pull/150/files#diff-6cc6e1c3d49a4d36b4884380ad53d585R467
(it's not merged yet because testing/fixing in progress)

@iskandr
Copy link
Member Author

iskandr commented Mar 3, 2016

For BWA-MEM the sequence of commands for multiple tumor FASTQs should be:

bwa mem grch38.fa  tumor_L005_001_R1.fastq.gz  tumor_L005_001_R2.fastq.gz >  tumor_L005_001.sam
bwa mem grch38.fa  tumor_L005_002_R1.fastq.gz  tumor_L005_002_R2.fastq.gz >  tumor_L005_002.sam

Followed by merging into a bam file.

@JPFinnigan
Copy link

I'm surprised that there isn't much documentation online about doing this but it does seem that aside from combining mate pairs, nothing done by bwa involves interactions between reads.

Does BWA-MEM try 2-pass alignement (e.g. rescue alignment of unaligned R1 using the aligned R2, or v.v.) in a manner similar to a the gapped-aligners TopHat2/HISAT and STAR? I imagine this would fail if alignment was done independently on R1/R2.

Relevant portion of HISAT paper:

HISATx1 uses a one-pass approach that aligns each pair of reads independently of others. HISATx2 is a two-pass version of HISAT to mimic the two-step approach used in TopHat2."

Link: http://www.nature.com/nmeth/journal/v12/n4/full/nmeth.3317.html?WT.ec_id=NMETH-201504

@iskandr
Copy link
Member Author

iskandr commented Mar 4, 2016

@JPFinnigan BWA-MEM does take into account mate pairs, so every call to bwa mem requires both the _R1 and _R2 FASTQs. However, that still leaves room for coarse-grained parallelism in the alignments since we can potentially get a lot of these FASTQ pairs back from the sequencing core.

@JPFinnigan
Copy link

BWA-MEM does take into account mate pairs, so every call to bwa mem requires both the _R1 and _R2 FASTQs

I don't think I clearly communicated my question/concern. BWA-MEM run in paired-end uses information about the successful alignment of one read in a mate pair (e.g. R1) to adjust or potentially rescue alignment of the other read (R2) as the case w/ gapped-aligners (STAR/HISAT). I don't think you produce the same alignment running a single instance of bwa mem in paired-end mode (e.g. supplying -R1 -R2) as you do running two parallel instances of bwa aln on R1 and R2 followed by merger w/ bwa sampe. If this is the case I would be careful about parallelizing w/ instances of bwa aln versus w/ instances bwa-mem. For this reason when working with the B16 data I ran multiple instances of BWA-MEM, but perhaps I'm wrong...

Relevant section from BWA-MEM paper (http://arxiv.org/pdf/1303.3997v2.pdf):

Paired-end mapping
2.2.1 Rescuing missing hits Like BWA (Li and Durbin, 2009), BWA- MEM processes a batch of reads at a time. For each batch, it estimates the mean μ and the variance σ2 of the insert size distribution from reliable single-end hits. For the top 100 hits (by default) of either end, if the mate is unmapped in a window [μ − 4σ, μ + 4σ] from each hit, BWA-MEM performs SSE2-based Smith-Waterman alignment (SW; Farrar 2007) for the mate within the window. The second best SW score is recorded to detect potential mismapping in a long tandem repeat. Hits found from both the single-sequence alignment and SW rescuing will be used for pairing

Pairing
Given the i-th hit for the first read and j-th hit for the second, BWA-MEM computes their distance dij if the two hits are in the right orientation, or sets d to infinity otherwise. It scores the pair (i,j) ij with Sij = Si + Sj − min{−alog4 P(dij),U}, where Si and Sj are the SW score of the two hits, respectively, a is the matching score and P (d) gives the probability of observing an insert size larger than d assuming a normal distribution. ‘log4’ arises when we interpret SW score as odds ratio (Durbin et al., 1998). U is a threshold that controls pairing: if dij is small enough such that −alog4 P(dij) < U, BWA-MEM prefers to pair the two ends; otherwise it prefers the unpaired alignments. BWA-MEM chooses the pair (i,j) that maximizes Sij as the final alignments for both ends. This pairing strategy jointly considers the alignment scores, the insert size and the possibility of chimeric read pairs.

@iskandr
Copy link
Member Author

iskandr commented Mar 10, 2016

@JPFinnigan We're not going to run parallel bwa aln in place of bwa mem. Rather, whichever of BWA or BWA-MEM you want to use, we're going to parallelize them. In the case of bwa aln, we're going to parallelize across all FASTQ files. For bwa mem we're going to parallelize across all R1/R2 FASTQ pairs. Does that make sense?

@JPFinnigan
Copy link

Sure. NP. From your original example it seemed as though you were proposing to use `bwa aln`` for paired-reads.

@iskandr
Copy link
Member Author

iskandr commented Mar 10, 2016

Sorry, that was confusing!

On Thu, Mar 10, 2016 at 2:23 PM, JPFinnigan [email protected]
wrote:

Sure. NP. From your original example it seemed as though you were
proposing to use bwa aln` for paired-reads.


Reply to this email directly or view it on GitHub
#157 (comment).

@smondet
Copy link
Member

smondet commented Mar 10, 2016

Anyway, this now implemented as an option. For any aligner, we can run it with a single pair of FASTQs or parallelized over pairs of FASTQs fragments. And we can choose just before submitting a given pipeline (it's a Pipeline.Compiler.t option).

@smondet smondet closed this as completed Mar 10, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants