Releasing MarkDuplicatesSpark #5603

Merged · 3 commits · Jan 25, 2019
Changes shown from 2 commits
@@ -34,12 +34,78 @@

import java.util.*;

/**
* <p>This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
* parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
* the output of the single-core Picard version. Since the tool requires holding all of the read names in memory
* while it groups the paired-down read information, it is recommended to run this tool on a machine/configuration
Contributor:
Remove "paired-down" (I think you meant "pared down", but just "the read information" is sufficient)

* with at least 8 GB of memory for a typical 30x bam.</p>
Contributor:
Is this 8GB per core or 8 GB total? Does it matter? How does the memory usage scale with the number of cores?

Collaborator (author):
Well, the memory usage scales approximately with the size of the bam (and somewhat with the complexity of the pairs).
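
For illustration only, a local run can be given roughly the recommended 8 GB by sizing the JVM heap through the gatk wrapper's --java-options flag (a sketch, not part of this diff; in local mode the driver and the workers share this single JVM, and the value actually needed scales with the bam as noted above):

gatk --java-options "-Xmx8g" MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt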

*
* <p>This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are
* defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library
* construction using PCR. See also "<a href='https://broadinstitute.github.io/picard/command-line-overview.html#EstimateLibraryComplexity'>EstimateLibraryComplexity</a>"
* for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster,
* incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are
* referred to as optical duplicates.</p>
*
* <p>The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.
* After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
* reads by the sums of their base-quality scores (default method).</p>
*
* <p>The tool's main output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field for each
* read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024.
* If you are not familiar with this type of annotation, please see the following <a href='https://www.broadinstitute.org/gatk/blog?id=7019'>blog post</a> for additional information.</p>
*
* <p>Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of
* duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in
* the 'optional field' section of a SAM/BAM file. Invoking the 'duplicate-tagging-policy' option,
Contributor:
Have you checked that all of the Picard args mentioned here are actually present in the Spark version?

Collaborator (author):
Yes, I have checked and updated the names involved.
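
As a sketch of how the Spark-style option names documented in this Javadoc might look on the command line (assuming the spellings given above, e.g. 'duplicate-tagging-policy', are the final ones):

gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    --duplicate-tagging-policy OpticalOnly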

* you can instruct the program to mark all the duplicates (All), only the optical duplicates (OpticalOnly), or no
* duplicates (DontTag). The records within the output of a SAM/BAM file will have values for the 'DT' tag (depending on the invoked
* 'duplicate-tagging-policy'), as either library/PCR-generated duplicates (LB), or sequencing-platform artifact duplicates (SQ).
* This tool uses the 'read-name-regex' and the 'optical-duplicate-pixel-distance' options as the primary methods to identify
* and differentiate duplicate types. Set 'read-name-regex' to null to skip optical duplicate detection, e.g. for RNA-seq
* or other data where duplicate sets are extremely large and estimating library complexity is not an aim.
* Note that without optical duplicate counts, library size estimation will be inaccurate.</p>
*
* <p>MarkDuplicates also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.</p>
*
* <p>The program can take either coordinate-sorted or query-sorted inputs; however, it is recommended that the input be
* query-sorted or query-grouped, as otherwise the tool will have to perform an extra sort operation on the data in order to associate
* reads from the input bam with their mates.</p>
*
* <p>If desired, duplicates can be removed using the 'remove-all-duplicates' and 'remove-sequencing-duplicates' options.</p>
*
* <h4>Usage example:</h4>
* <pre>
* gatk MarkDuplicatesSpark \\<br />
* -I input.bam \\<br />
* -O marked_duplicates.bam \\<br />
* -M marked_dup_metrics.txt
* </pre>
*
* <h4>MarkDuplicates run on a Spark cluster with 5 machines</h4>
Contributor:
Include a working usage example that involves running locally on multiple cores as well.

Collaborator (author):
Running locally, it automatically uses all available cores.
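
For reference, the core count of a local run can also be pinned explicitly by passing a Spark master of local[N] after the -- separator (local[*], the default, uses all available cores); a sketch assuming the standard GATK Spark argument pass-through:

gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    -- \
    --spark-master local[8]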

* <pre>
* gatk MarkDuplicatesSpark \\<br />
* -I input.bam \\<br />
* -O marked_duplicates.bam \\<br />
* -M marked_dup_metrics.txt \\<br />
* -- \\<br />
* --spark-runner SPARK \\<br />
* --spark-master <master_url> \\<br />
* --num-executors 5 \\<br />
* --executor-cores 8 <br />
* </pre>
*
* Please see
* <a href='http://broadinstitute.github.io/picard/picard-metric-definitions.html#DuplicationMetrics'>MarkDuplicates</a>
* for detailed explanations of the output metrics.
* <hr />
*/
@DocumentedFeature
@CommandLineProgramProperties(
summary = "Marks duplicates on Spark",
oneLineSummary = "MarkDuplicates on Spark",
programGroup = ReadDataManipulationProgramGroup.class)
@BetaFeature
public final class MarkDuplicatesSpark extends GATKSparkTool {
private static final long serialVersionUID = 1L;

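As a quick sanity check of the 0x0400/1024 duplicate flag described in the Javadoc above, the marked output can be inspected with samtools (a sketch assuming samtools is available; the DT:Z tag appears only when a tagging policy other than DontTag is in effect):

samtools view -c -f 1024 marked_duplicates.bam      # count reads with the duplicate bit set
samtools view -f 1024 marked_duplicates.bam | head  # show a few duplicate records, including any DT:Z: tags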