-
Notifications
You must be signed in to change notification settings - Fork 0
Stitcher 5.2.0 Design Document
The Stitcher takes paired reads from a bam, stitches them, and writes out a new bam. Reads that cannot be stitched are left unchanged. A stitched read is a single read, that has been bioinformatically stitched from a pair of reads. The stitched read contains one consensus sequence of base calls, one consensus sequence of base call qualities, one consensus cigar string, and one direction tag, which gives the original directions of the contributing reads. Optionally, The stitcher filters out reads that will not be used downstream, such as PCR duplicates or poorly mapped reads, in order to save on future processing time. Note that this document includes some references to extensions of current algorithms. These are highlighted in red, where not currently implemented.
The Stitcher supports configuration of parameters so that its behavior can be fine tuned depending on the application context.
Format: dotnet Stitcher.dll [-options]
Example: dotnet Stitcher.dll –Bam C:\test.bam –OutFolder C:\OutFolder
SDS ID | Specification |
---|---|
SDS-1 | The Stitcher shall accept command line arguments as a whitespace-separated list of name and value pairs. |
SDS-2 | If an invalid command is given, the Stitcher shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands. |
SDS-3 | The Stitcher command line shall be capitalization invariant. |
SDS ID | Specification |
---|---|
SDS-4 | The Stitcher shall require the command line arguments listed below: |
Argument Name | Type | Default value | Description |
---|---|---|---|
Bam | string | none | File path for input bam |
SDS ID | Specification |
---|---|
SDS-5 | The Stitcher shall optionally support the command line arguments listed below. |
Argument Name | Type | Default value | Description |
---|---|---|---|
OutFolder | string | none. By default the output destination will be the original bam folder | destination for output bam |
MinBaseCallQuality | integer | 20 | in case of a stitching conflict, bases with qscore less than this value will automatically be disregarded in favor of the mate's bases. |
FilterMinMapQuality | integer | 1 | reads with map quality less than this value shall be filtered |
FilterDuplicates | bool | true | reads marked as duplicate reads shall be filtered |
FilterForProperPairs | bool | false | reads marked as not proper pairs shall be filtered |
FilterUnstichablePairs | bool | false | reads pairs with incompatible cigar strings shall be filtered |
NifyUnstitchablePairs | bool | true | read pairs with incompatible CIGAR strings shall be N-ified. |
NifyDisagreement | bool | false | whether or not to turn high-quality disagreeing overlap bases to Ns. |
UseSoftClippedBases | bool | true | allow bases soft clipped from the (non-probe) ends of reads to inform stitching. |
The Stitcher requires as input one BAM file. The BAM file is assumed to be sorted, i.e. alignment reads should be sorted by mapped reference position. This assumption allows The Stitcher to process BAMs in one pass without a large memory footprint. Most standard aligners produce sorted BAM files. Positions in BAM files are expected to use 0-based coordinate system.
SDS ID | Specification |
---|---|
SDS-6 | The Stitcher shall require as input one bam. |
SDS-7 | The Stitcher shall require as input an accompanying BAI file with the same file name in the same directory. |
The Stitcher outputs one (stitched) BAM file. The stitched file name shall be .stitched.bam. The bam shall be sorted.
SDS ID | Specification |
---|---|
SDS-8 | The Stitcher shall require as input one bam. |
SDS-9 | The Stitcher shall require as input an accompanying BAI file with the same file name in the same directory. |
SDS-24 | The output bam shall contain a single, merged read for each read pair that was successfully stitched. The original read pair shall not be outputted. Reads that were not successfully stitched should be outputted as their original read pair unless filtering unstitchable pairs is enabled. Singletons shall be outputted in their original form. |
SDS-25 | Merged reads in the bam outputted by the Stitcher shall be flagged as, Read 1, Unmated. Unmerged reads shall retain their original flags. |
The Stitcher is essentially a pipeline program. The Stitcher shall stream through the bam, holding on to each proper-pair read until its mate is found. Once the mate is found, the read and its mate shall be stitched. The stitched read will be held until the all possible upstream alignments have been written. Once this condition has been met, the stitched read shall be written to the output bam. Alignments that are not proper pairs or which fail mapping quality thresholds shall optionally be filtered. Paired reads shall be reconciled into one consensus read to be used for downstream processing. The XD tag will be used to give the directions of the stitched read. If no mate is found for a proper-pair read (that is flagged as having a proper mate in the bam), the Stitcher will assume the bam is corrupted, log an exception, and exit.
SDS ID | Specification |
---|---|
SDS-12 | The Stitcher shall log an exception if a bam is detected to be corrupted. |
SDS ID | Specification |
---|---|
SDS-13 | The Stitcher shall filter reads with map quality less than FilterMinMapQuality from the final bam. |
SDS-14 | The Stitcher shall optionally filter out duplicate reads. |
SDS-15 | The Stitcher shall optionally filter out any reads not marked as proper pairs. |
SDS-16 | The Stitcher shall optionally filter out any reads that have incompatible cigar strings with its mate. |
A) If the pairs simply do not overlap, they may be evaluated separately or treated as a single read with a gap in the middle, to facilitate phasing. (see the "StitchGappedPairs" option) B) If the pairs do overlap, they are tested for compatibility, where compatible means the cigar strings match such that insertions and deletions are called at the same place and are of the same length within the stitched region. Indels must have the same start and endpoint, where ever closed. (insertions or deletions running off the edge of the overlap are considered acceptable if their mate-indel is fully anchored) Inserted content can disagree - The Stitcher will take the highest quality bases) Mismatch sections are of the same length. If a softclipped section falls within the overlap region and its sequence is compatible with its mate, it may contribute to the stitched section. (For example, if two bases are softclipped from the end of R1, but they match and overlap with the first two bases of an insertion coming from R2, we can call that insertion present in the stitched region)
If the reads are compatible, they are reconciled into one consensus read as follows:
Generating Consensus Reads
If the paired reads are compatible, the Stitcher shall reconcile the pair into one consensus read by region as follows below.
SDS ID | Specification |
---|---|
SDS-17 | Positions preceding overlap region, shall use the upstream read sequence, Qscores, and direction |
SDS-22 | Positions trailing overlap region, shall use the downstream read sequence, Qscores, and direction |
SDS ID | Specification |
---|---|
SDS-18 | Matched/Mismatched Positions within overlap region, shall have the stitched direction, and for each position: |
condition | Base | Qscore |
---|---|---|
Bases agree | from read 1 | Max(read1, read2) |
Bases disagree and both qscores ≥ minimum basecall quality configured | “N”, i.e. no call | 0 |
Bases disagree and both qscores < minimum basecall quality | Read1 if read1 qscore ≥ minimum basecall quality, otherwise read2 | Qscore corresponding to read for which base was extracted |
Bases disagree and either qscores < minimum basecall quality configured | Read1 if read1 qscore ≥ minimum basecall quality, otherwise read2 | Qscore corresponding to read for which base was extracted |
SDS ID | Specification |
---|---|
SDS-19 | Insertion/Deletion Positions within overlap region, shall have the stitched direction, and for each position: |
condition | Base | Qscore |
---|---|---|
Bases agree | from read 1 | Max(read1, read2) |
Insertion agree in length, but disagree in content, and both qscores ≥ minimum basecall quality configured | “N”, i.e. no call | 0 |
Insertion agree in length, but disagree in content, and bases disagree and either qscores < minimum basecall quality configured | choose inserted content from read1 if read1 qscore ≥ minimum basecall quality, otherwise read2 | Qscore corresponding to read for which base was extracted |
Deletions or insertions disagree in length, or one read has and indel and the other has no indel or a different indel. | see un-stitchable strategy | see un-stitchable strategy |
SDS ID | Specification |
---|---|
SDS-20 | If (UseSoftClippedBases=TRUE), softclipped positions from the end of the reads within overlap region may inform read stitching, shall have the stitched direction, and for each position: |
condition | Base | Qscore |
---|---|---|
Bases in both directions are softclipped. Aligner has given up. | see un-stitchable strategy | see un-stitchable strategy |
Softclipped in one direction, and the softclipped base matches the inserted base from the other direction. | “N”, i.e. no call | 0 |
Softclipped in one direction, and the softclipped base disagrees with the base from the other direction | Base from non-softclipped read. | Qscore corresponding to read for which base was extracted |
The consensus read shall have have a new XD tag:
SDS ID | Specification |
---|---|
SDS-23 | The rules for the XD tag are as follows: |
Base category | XD item |
---|---|
N consecutive forward bases | NF |
N consecutive reverse bases | NR |
N consecutive stitched | NS |
All bases relevant to the stitched cigar string must have a direction. This includes all bases in the stitched sequence, but also gaps (such as deletions or strings of N's that might result from reads that do not stitch easily). All these bases shall have a direction and shall be available to contribute to (directional) coverage counts made by Pisces. For example, the following pair of reads, would yield the following stitched cigar string and XD tag:
Original read pair:
Cigar | Direction | Coordinate -> | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
read1 | 2M2D4M | Forward | A | A | X | X | A | A | A | A | ||
read2 | 1M2D5M | Reverse | A | X | X | A | A | A | A | A |
Stitched read pair:
Cigar | Direction | Coordinate -> | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
read | 2M2D5M | 1F7S1R | A | A | A | X | X | A | A | A | A |
For reads that have incompatible cigar strings, the stitcher shall
(A) if reads are not inconsistent (collapsable indels, equivalent softclips and inserted bases,) push it together and N-ify disagreements.
(B) if reads are inconsistent (can't even come up with an insertion length that makes sense) or have no way to anchor the overlap (overlapping softclips) If (FilterUnstichablePairs=TRUE) , these reads are omitted from the final bam. If (FilterUnstichablePairs=FALSE) , these reads are returned, unstitched, to the final bam.
SDS ID | Specification |
---|---|
SDS-24 | If (FilterUnstichablePairs=FALSE) The Stitcher shall treat un-stitchable reads as unpaired reads. |
(if FilterUnstichablePairs=TRUE, see SDS-16) |
By design, the stitched read will contribute 1x coverage to down stream variant calling. Any overlapping read pair shall contribute 1x to coverage, as this is a more accurate reflection of what is happening on a molecular level. This might be counter intuitive to the user, who expects 2x coverage in the "overlap" region of the stitched read. Correspondingly, any unpaired reads included in the output bam will therefore count x2 to coverage, or be double-weighted relative to stitched reads.
It is possible that an MNV begins on a base with one read direction and ends on a base with another read direction. In this case, the MNVs direction should be considered stitched by the downstream variant caller (which exempts it from the strand bias filter).
The Stitcher does not do any form of indel realignment.
The Stitcher does not implement its own PCR duplicate detection algorithms, and as such is subject to errors made upstream in in the detection of PCR duplicates. (This is known to happen in the case of low-frequency variants and shot gun sequencing) Stitcher.exe optionally filters duplicates, when marked in the bam by an upstream duplicate-marking tool. The stitcher presumes if one read is a duplicate, so is its mate. (ie, if one read in a pair is a duplicate, both reads are considered duplicates.) This should always be true biologically, but is not a requirement of the mark duplicate tool.
The Stitcher is not multi-threaded, by default. Its expected that the calling pipeline will handle this, submitting 1 job per bam. By default, the Stitcher does not include indexing or sorting the output bam. This may be done with Samtools. In multi-threaded mode,the stitcher shall sort but not index the bams. The multi-threaded sort and samtools sort do not necessarily result in identical read orders when several reads have the same sort position. For unique output (ie, for test purposes) a unix sort may be performed.
The Stitcher does not include indexing of the output bam. This may be done with Samtools.
BAMs fed to Pisces and Stitcher.exe should always be sorted and then indexed.
- Pisces 5.3.0 Quick Start
- Pisces 5.3.0 Design Document
- Pisces 5.3.0 Supported Options
- Scylla 5.3.0 Design Document
- Stitcher 5.3.0 Design Document
- VQR 5.3.0 Design Document
- VennVcf 5.3.0 Design Document
- Gemini 5.3.0 Design Document
- AdaptiveGenotyper 5.3.0 Design Document
- Pisces Tools 5.3.0
- Suggested Pipeline Configuration 5.3.0
- Pisces 5.2.10 Design Document
- Pisces 5.2.10 Supported Options
- Scylla 5.2.10 Design Document
- Stitcher 5.2.10 Design Document
- VQR 5.2.10 Design Document
- VennVcf 5.2.10 Design Document
- Gemini 5.2.10 Design Document
- AdaptiveGenotyper 5.2.10 Design Document
- Pisces Tools 5.2.10
- Suggested Pipeline Configuration 5.2.10
- Pisces 5.2.9 Quick Start
- Pisces 5.2.9 Design Document
- Pisces 5.2.9 Supported Options
- Scylla 5.2.9 Design Document
- Stitcher 5.2.9 Design Document
- VQR 5.2.9 Design Document
- VennVcf 5.2.9 Design Document
- Pisces Tools 5.2.9
- Suggested Pipeline Configuration 5.2.9
- Pisces 5.2.7 Quick Start
- Pisces 5.2.7 Design Document
- Pisces 5.2.7 Supported Options
- Scylla 5.2.7 Design Document
- Stitcher 5.2.7 Design Document
- VQR 5.2.7 Design Document
- VennVcf 5.2.7 Design Document
- Pisces Tools 5.2.7
- Suggested Pipeline Configuration 5.2.7
- Pisces 5.2.5 Design Document
- Pisces 5.2.5 Supported Options
- Scylla 5.2.5 Design Document
- Stitcher 5.2.5 Design Document
- VQR 5.2.5 Design Document
- Suggested Pipeline Configuration 5.2.5
- Pisces 5.2.0 Design Document
- Pisces 5.2.0 Supported Options
- Scylla 5.2.0 Design Document
- Stitcher 5.2.0 Design Document
- VQR 5.2.0 Design Document
- Suggested Pipeline Configuration 5.2.0
- Pisces Suite 5.2.0 Known Issues and Limitations