Releasing MarkDuplicatesSpark #5603
Changes from 2 commits
@@ -34,12 +34,78 @@
import java.util.*;

/**
 * <p>This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
 * parallel on multiple cores on a local machine or on multiple machines in a Spark cluster while still matching
 * the output of the single-core Picard version. Since the tool requires holding all of the read names in memory
 * while it groups the read information, it is recommended to run this tool on a machine/configuration
 * with at least 8 GB of memory for a typical 30x bam.</p>
Review comment: Is this 8 GB per core or 8 GB total? Does it matter? How does the memory usage scale with the number of cores?

Reply: The memory usage scales approximately with the size of the bam (and somewhat with the complexity of the pairs).
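As a concrete illustration of the memory recommendation above, the JVM heap for the tool can be sized through the gatk wrapper's --java-options flag. A minimal sketch, reusing the -I/-O/-M arguments from the usage example below; the 8g figure simply mirrors the paragraph's recommendation:

```
gatk --java-options "-Xmx8g" MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt
```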
 *
 * <p>This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are
 * defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation, e.g. library
 * construction using PCR. See also <a href='https://broadinstitute.github.io/picard/command-line-overview.html#EstimateLibraryComplexity'>EstimateLibraryComplexity</a>
 * for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster,
 * incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are
 * referred to as optical duplicates.</p>
 *
 * <p>The MarkDuplicates tool works by comparing sequences at the 5-prime positions of both reads and read-pairs in a SAM/BAM file.
 * After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
 * reads by the sums of their base-quality scores (default method).</p>
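To make the default ranking concrete, here is a simplified Java sketch of a sum-of-base-qualities score using htsjdk's SAMRecord; the MIN_QUALITY cutoff and the per-read (rather than per-pair) scoring are simplifying assumptions for illustration, not the tool's exact implementation:

```java
import htsjdk.samtools.SAMRecord;

final class DuplicateScoreSketch {
    // Assumed minimum base quality to count toward the score (illustrative only).
    private static final int MIN_QUALITY = 15;

    // Rank a read by the sum of its base qualities: in a duplicate set, the
    // read (or pair, by summing both mates' scores) with the highest total
    // is kept as the primary read and the rest are marked as duplicates.
    static int sumOfBaseQualities(final SAMRecord read) {
        int score = 0;
        for (final byte quality : read.getBaseQualities()) {
            if (quality >= MIN_QUALITY) {
                score += quality;
            }
        }
        return score;
    }
}
```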
 *
 * <p>The tool's main output is a new SAM or BAM file in which duplicates have been identified in the SAM flags field for each
 * read. Duplicates are marked with the hexadecimal value 0x0400, which corresponds to the decimal value 1024.
 * If you are not familiar with this type of annotation, please see the following <a href='https://www.broadinstitute.org/gatk/blog?id=7019'>blog post</a> for additional information.</p>
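For readers unfamiliar with SAM flag bits, a small htsjdk sketch of testing the 0x0400 bit described above (illustrative, not part of this tool's code):

```java
import htsjdk.samtools.SAMRecord;

final class DuplicateFlagSketch {
    // Bit 0x0400 (decimal 1024) of the SAM flags field marks a PCR or
    // optical duplicate.
    static boolean isMarkedDuplicate(final SAMRecord read) {
        return (read.getFlags() & 0x0400) != 0;
        // htsjdk also exposes this bit directly via read.getDuplicateReadFlag().
    }
}
```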
 *
 * <p>Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of
 * duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in
 * the 'optional field' section of a SAM/BAM file. Invoking the 'duplicate-tagging-policy' option,
Review comment: Have you checked that all of the Picard args mentioned here are actually present in the Spark version?

Reply: Yes, I have checked and updated the names involved.
 * you can instruct the program to mark all the duplicates (All), only the optical duplicates (OpticalOnly), or no
 * duplicates (DontTag). The records within the output of a SAM/BAM file will have values for the 'DT' tag (depending on the invoked
 * 'duplicate-tagging-policy'), as either library/PCR-generated duplicates (LB) or sequencing-platform artifact duplicates (SQ).
 * This tool uses the 'read-name-regex' and the 'optical-duplicate-pixel-distance' options as the primary methods to identify
 * and differentiate duplicate types. Set 'read-name-regex' to null to skip optical duplicate detection, e.g. for RNA-seq
 * or other data where duplicate sets are extremely large and estimating library complexity is not an aim.
 * Note that without optical duplicate counts, library size estimation will be inaccurate.</p>
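As an illustration of the options this paragraph names, a command-line sketch that tags only optical duplicates; the option spelling follows the doc text above and should be verified against the tool's --help output:

```
gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    --duplicate-tagging-policy OpticalOnly
```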
 *
 * <p>MarkDuplicates also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.</p>
 *
 * <p>The program can take either coordinate-sorted or query-sorted inputs; however, it is recommended that the input be
 * query-sorted or query-grouped, as the tool will otherwise have to perform an extra sort operation on the data in order to associate
 * reads from the input bam with their mates.</p>
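If an input is coordinate-sorted, one possible way to produce the recommended query-sorted input beforehand is a queryname sort with an external tool such as samtools (an assumption about the user's toolchain, not a requirement of this tool):

```
samtools sort -n coordinate_sorted.bam -o query_sorted.bam
```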
 *
 * <p>If desired, duplicates can be removed using the 'remove-all-duplicates' and 'remove-sequencing-duplicates' options.</p>
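A sketch of duplicate removal using one of the options named above, assuming 'remove-all-duplicates' is a plain boolean flag:

```
gatk MarkDuplicatesSpark \
    -I input.bam \
    -O deduplicated.bam \
    -M marked_dup_metrics.txt \
    --remove-all-duplicates
```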
 *
 * <h4>Usage example:</h4>
 * <pre>
 * gatk MarkDuplicatesSpark \\<br />
 *      -I input.bam \\<br />
 *      -O marked_duplicates.bam \\<br />
 *      -M marked_dup_metrics.txt
 * </pre>
 *
 * <h4>MarkDuplicates run on a Spark cluster with 5 machines</h4>
Review comment: Include a working usage example that involves running locally on multiple cores as well.

Reply: Running locally, it automatically uses all available cores.
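Building on that reply, a possible local multi-core example for the docs, assuming Spark's standard local[N] master syntax is accepted for capping the core count (with all available cores used by default):

```
gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    -- \
    --spark-master local[8]
```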
 * <pre>
 * gatk MarkDuplicatesSpark \\<br />
 *      -I input.bam \\<br />
 *      -O marked_duplicates.bam \\<br />
 *      -M marked_dup_metrics.txt \\<br />
 *      -- \\<br />
 *      --spark-runner SPARK \\<br />
 *      --spark-master <master_url> \\<br />
 *      --num-executors 5 \\<br />
 *      --executor-cores 8 <br />
 * </pre>
 *
 * Please see
 * <a href='http://broadinstitute.github.io/picard/picard-metric-definitions.html#DuplicationMetrics'>MarkDuplicates</a>
 * for detailed explanations of the output metrics.
 * <hr />
 */
@DocumentedFeature
@CommandLineProgramProperties(
        summary = "Marks duplicates on Spark",
        oneLineSummary = "MarkDuplicates on Spark",
        programGroup = ReadDataManipulationProgramGroup.class)
@BetaFeature
public final class MarkDuplicatesSpark extends GATKSparkTool {
    private static final long serialVersionUID = 1L;
Review comment: Remove "paired-down" (I think you meant "pared down", but just "the read information" is sufficient).