-
Notifications
You must be signed in to change notification settings - Fork 0
Stitcher 5.1.6 Design Document
The Stitcher takes paired reads from a bam, stitches them, and writes out a new bam. Reads that cannot be stitched are left unchanged. A stitched read is a single read that has been bioinformatically stitched from a pair of reads. The stitched read contains one consensus sequence of base calls, one consensus sequence of base call qualities, one consensus cigar string, and one direction tag, which gives the original directions of the contributing reads. Optionally, The Stitcher filters out reads that will not be used downstream, such as PCR duplicates or poorly mapped reads, in order to save on future processing time.
The Stitcher supports configuration of parameters so that its behavior can be fine tuned depending on the application context.
SDS ID | Specification | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SDS-1 |
The Stitcher shall accept command line arguments as a whitespace-separated list of name and value pairs. For example: Format: Stitcher.exe [-options] Example: Stitcher.exe –Bam C:\test.bam –OutFolder C:\OutFolder |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SDS-2 | If an invalid command is given, the Stitcher shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SDS-3 | The Stitcher command line shall be capitalization invariant. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SDS-4 |
The Stitcher shall require the command line arguments listed below.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SDS-5 |
The Stitcher shall optionally support the command line arguments listed below.
|
The Stitcher requires as input one BAM file. The BAM file is assumed to be sorted, i.e. alignment reads should be sorted by mapped reference position. This assumption allows The Stitcher to process BAMs in one pass without a large memory footprint. Most standard aligners produce sorted BAM files. Positions in BAM files are expected to use 0-based coordinate system.
SDS ID | Specification |
---|---|
SDS-6 | The Stitcher shall require as input one bam. |
SDS-7 |
The Stitcher shall require as input an accompanying BAI file with the same file name in the same directory. |
The Stitcher outputs one (stitched) BAM file. The stitched file name shall be .stitched.bam.
SDS ID | Specification |
---|---|
SDS-8 | The Stitcher shall output one stitched bam. |
SDS-9 | The output bam name shall be .stitched.bam. |
SDS-10 | The output bam shall be sorted. NOT IMPLEMENTED IN THIS VERSION. (Workaround: samtools in pipeline). |
SDS-11 |
The Stitcher shall output an accompanying BAI file with the corresponding file name (.stitched.bam.bai) in the same directory. NOT IMPLEMENTED IN THIS VERSION. (Workaround: samtools in pipeline). |
SDS-24 |
The output bam shall contain a single, merged read for each read pair that was successfully stitched. The original read pair shall not be outputted. Reads that were not successfully stitched should be outputted as their original read pair unless filtering unstitchable pairs is enabled. Singletons shall be outputted in their original form. |
SDS-25 | Merged reads in the bam outputted by the Stitcher shall be flagged as:
Unmerged reads shall retain their original flags. |
SDS-26 |
Merged reads shall inherit the following from the Read1 of the pair:
They shall inherit the following from the component reads:
They shall be assigned the following as a result of stitching: Read sequence, Qualities, CIGAR, New tags - XD (Cigar Directions). Unless otherwise explicitly mentioned above, all existing tags will be removed from the component reads in the merged read, as it would be unclear how to fairly and consistently reconcile tags from the two component reads. Unmerged reads shall retain their existing tags. |
The Stitcher streams through the bam, holding on to each eligible read until its mate is found. Once the mate is found, the read and its mate shall be stitched.
Alignments that are not proper pairs, are duplicates, or which fail mapping quality thresholds shall optionally be filtered. Unmapped or supplementary alignments will be automatically skipped.
Pairs of reads are processed once the pair is deemed complete. Complete pairs are those for which Read1, Read2, and any expected supplementary alignments have been collected. (Note: as a result, reads having supplementary alignments will never be processed and will either be outputted separately or not at all, depending on configuration settings. Supplementary reads are always skipped). Reads which do not overlap and reads which are improperly paired are optionally treated as incomplete.
Stitching is then attempted on complete pairs. Paired reads that successfully stitch will be reconciled into one consensus read to be used for downstream processing. The XD tag is used to give the directions of the stitched read. Read pairs which cannot be successfully stitched together will be outputted separately as their original alignments, or skipped if filtering unstitchable pairs.
Unpaired reads and incomplete pairs will be flushed at the end of each chromosome. Read pairs that represent fusions (i.e. reads are on different chromosomes) will not be stitched.
ac:link<ri:attachment ri:filename="Stitcher AlignmentPairFilter Implementation.pptx"></ri:attachment></ac:link>
<ac:image ac:height="400"><ri:attachment ri:filename="image2016-11-28 13:25:10.png"></ri:attachment></ac:image>
Schematic of AlignmentPairFilter.TryPair functionality, with notes indicating the specific StitcherPairFilter behaviors in green.
<ac:image ac:height="400"><ri:attachment ri:filename="image2016-11-28 13:25:36.png"></ri:attachment></ac:image>
SDS ID | Specification |
---|---|
SDS-13 | The Stitcher shall filter reads with map quality less than FilterMinMapQuality from the final bam. |
SDS-14 | The Stitcher shall optionally filter out duplicate reads. |
SDS-15 | The Stitcher shall optionally filter out any reads not marked as proper pairs. |
SDS-16 | The Stitcher shall optionally filter out any reads that have incompatible cigar strings with its mate (i.e. "unstitchable" pairs). |
A) If the pairs simply do not overlap, they are outputted separately.
B) If the pairs do overlap, they are tested for compatibility, where compatible means the cigar strings match such that indels on one read do not overlap with match/mismatch segments on the other, or with disagreeing indels.
- Compatibility is evaluated for each overlapping base.
- An overlapping base is deemed incompatible if: it is an indel on one read and is either a match/mismatch on the other, or is the opposite type of indel (i.e. Read1 shows an insertion and Read2 shows a deletion at this position).
- Softclips are not considered disagreements.
Getting StitchedPositions Representations
The two reads are abstracted into an array of StitchedPositions, each of which can have cigar sub-strings for mapped (i.e. M or D) and unmapped (i.e. I or S) bases for each mate.
Redistributing Softclips
"Bookending" softclips (at either end of the component reads) are then redistributed so that they match up with the further-extending positions on the opposite read. In this way, the softclipped bases can be used to support match/mismatch and inserted bases on the opposite read. Optionally, softclipped bases can also support deletions – if allowing terminal softclips to support overlapping deletions, and if in the process of redistributing softclips we get to a deletion in the opposite read has mapped bases after it, we can "kick over" the softclip so that it supports the mapped bases on the other side, thereby also supporting the deletion.
Reconciling StitchedPositions Between Reads
For each StitchedPosition in the combined representation of the two reads, the cigar operations for both reads are reconciled. Cigar operations are deemed to be compatible if they are the same for both read, empty for one of the reads, or softclips for one of the reads. If either read has an insertion while the opposite has no unmapped bases (insertion or softclipped), stitching has failed. Successfully reconciled softclips count toward stitched support only if configured to use softclipped bases, and only if it is not on the probe-end of the read or we are not ignoring probe softclips. The result of the reconciling step is a stitched CIGAR and directions array. Finally, if the stitching has created an internal softclip, we assert that stitching has failed (overlapping internal softclips are too nebulous for us to confidently call even the rest of the stitched read).
If the reads are compatible, they are reconciled into one consensus read as described in the section below.
If the paired reads are compatible and successfully stitch into a merged read, the Stitcher will assemble a consensus sequence and qualities array based on the component reads.
Worked examples with results directly from this release are given hereac:link<ri:page ri:content-title="Read Stitcher 5.1.6 Stitching Scenarios">ac:plain-text-link-body</ac:plain-text-link-body></ri:page></ac:link>
SDS ID | Specification | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Region |
Rules |
|||||||||||||||||||
SDS-17 | Positions preceding overlap region | Use upstream read sequence, qscores, and direction. | ||||||||||||||||||
SDS-18 |
Positions within overlap region |
For each position:
Direction is "stitched". |
||||||||||||||||||
SDS-22 |
Positions trailing overlap region |
Use downstream sequence, qscores, and direction. |
The consensus read shall have have a new XD tag. The rules for the XD tag are as follows:
SDS ID | specification | |
---|---|---|
SDS-23 | **base category** | **XD item** |
N consecutive forward bases | NF | |
N consecutive reverse bases | NR | |
N consecutive stitched | NS |
All bases relevant to the stitched cigar string must have a direction. This includes all bases in the stitched sequence, but also gaps (such as deletions or strings of N's that might result from reads that do not stitch easily). All these bases shall have a direction and shall be available to contribute to (directional) coverage counts made by Pisces.
For example, the following pair of reads, would yield the following stitched cigar string and XD tag:
Read pair:
Cigar | Direction | Coordinate -> | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
read1 | 2M2D4M | Forward | x | x | |||||||
read2 | 1M2D5M | Reverse | x | x |
Stitched read:
Cigar | XD tag | Coordinate -> | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
stitched read | 2M2D5M | 1F7S1R | x | x |
For reads that are completely un-stitchable (incompatible cigar strings), The Stitcher
-
If (FilterUnstichablePairs=TRUE), the Stitcher shall not try to stitch these reads. These reads are omitted from the final bam.
-
Otherwise:
-
Easy Case: dont stitch these reads. Treat them as unpaired reads.
-
Medium Case:
-
if reads are not inconsistent (collapseble indels, softclips,)
- push it together and N-ify disagreements.
-
if reads are inconsistent (can't even come up with an insertion length that makes sense)
- If one read is bad throughout the overlap (by mapping or by basecall q score), stitch, but take the other read through the overlap. Qscore and direction corresponding to the read that was taken.
- If both reads are bad, or have some mix of acceptable data in the overlap (by mapping and basecall q score), stitch, but make the overlap entirely N. Score for each base is 0 and direction is "stitched".
-
-
-
- Hard Case:
- TBD, probably involves some local realignment.
- Hard Case:
SDS ID | Specification |
---|---|
SDS-24 |
If (FilterUnstichablePairs=FALSE) The Stitcher shall treat un-stitchable reads as unpaired reads. (if FilterUnstichablePairs=TRUE, see SDS-16) |
By design, the stitched read will contribute 1x coverage to down stream variant calling. Any overlapping read pair shall contribute 1x to coverage, as this is a more accurate reflection of what is happening on a molecular level. This might be counter-intuitive to the user, who expects 2x coverage in the "overlap" region of the stitched read. Correspondingly, any unpaired reads included in the output bam will therefore count x2 to coverage, or be double-weighted relative to stitched reads.
It is possible that an MNV begins on a base with one read direction and ends on a base with another read direction. In this case, the MNVs direction should be considered stitched by the downstream variant caller (which exempts it from the strand bias filter).
The Stitcher does not do any form of indel realignment.
The Stitcher does not implement its own PCR duplicate detection algorithms, and as such is subject to errors made upstream in in the detection of PCR duplicates. (This is known to happen in the case of low-frequency variants and shot gun sequencing). The Stitcher optionally filters duplicates, when marked in the bam by an upstream duplicate-marking tool. The stitcher presumes if one read is a duplicate, so is its mate. (ie, if one read in a pair is a duplicate, both reads are considered duplicates.) This should always be true biologically, but is not a requirement of the mark duplicate tool.
The Stitcher is not multi-threaded by sample. Its expected that the calling pipeline will handle this, submitting 1 job per bam.
The Stitcher does not include indexing or sorting the output bam. This may be done with Samtools.
The Stitcher expects that input bams are sorted and indexed.
The Stitcher is not able to correctly process BAMs which have basecall quality scores of >46 (this is highly unlikely to occur, but may happen in the case of some kind of basecall quality post-processing). This limitation is due to the fact that Stitcher sums basecall qualities for agreeing overlapping bases between R1 and R2. The Phred formatting used in BAM quality score encoding does not support values over 93, so having a quality score of >46 in both agreeing bases would lead to invalid BAM quality encoding.
- Pisces 5.3.0 Quick Start
- Pisces 5.3.0 Design Document
- Pisces 5.3.0 Supported Options
- Scylla 5.3.0 Design Document
- Stitcher 5.3.0 Design Document
- VQR 5.3.0 Design Document
- VennVcf 5.3.0 Design Document
- Gemini 5.3.0 Design Document
- AdaptiveGenotyper 5.3.0 Design Document
- Pisces Tools 5.3.0
- Suggested Pipeline Configuration 5.3.0
- Pisces 5.2.10 Design Document
- Pisces 5.2.10 Supported Options
- Scylla 5.2.10 Design Document
- Stitcher 5.2.10 Design Document
- VQR 5.2.10 Design Document
- VennVcf 5.2.10 Design Document
- Gemini 5.2.10 Design Document
- AdaptiveGenotyper 5.2.10 Design Document
- Pisces Tools 5.2.10
- Suggested Pipeline Configuration 5.2.10
- Pisces 5.2.9 Quick Start
- Pisces 5.2.9 Design Document
- Pisces 5.2.9 Supported Options
- Scylla 5.2.9 Design Document
- Stitcher 5.2.9 Design Document
- VQR 5.2.9 Design Document
- VennVcf 5.2.9 Design Document
- Pisces Tools 5.2.9
- Suggested Pipeline Configuration 5.2.9
- Pisces 5.2.7 Quick Start
- Pisces 5.2.7 Design Document
- Pisces 5.2.7 Supported Options
- Scylla 5.2.7 Design Document
- Stitcher 5.2.7 Design Document
- VQR 5.2.7 Design Document
- VennVcf 5.2.7 Design Document
- Pisces Tools 5.2.7
- Suggested Pipeline Configuration 5.2.7
- Pisces 5.2.5 Design Document
- Pisces 5.2.5 Supported Options
- Scylla 5.2.5 Design Document
- Stitcher 5.2.5 Design Document
- VQR 5.2.5 Design Document
- Suggested Pipeline Configuration 5.2.5
- Pisces 5.2.0 Design Document
- Pisces 5.2.0 Supported Options
- Scylla 5.2.0 Design Document
- Stitcher 5.2.0 Design Document
- VQR 5.2.0 Design Document
- Suggested Pipeline Configuration 5.2.0
- Pisces Suite 5.2.0 Known Issues and Limitations