
Releasing MarkDuplicatesSpark #5603

Merged
merged 3 commits into master from je_releaseMarkDuplicates on Jan 25, 2019

Conversation

jamesemery
Collaborator

@jamesemery jamesemery commented Jan 24, 2019

Removing the beta tag in advance of the 4.1 release.

Resolves #4675

@codecov-io

codecov-io commented Jan 24, 2019

Codecov Report

Merging #5603 into master will increase coverage by 0.001%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #5603       +/-   ##
===============================================
+ Coverage     87.035%   87.035%   +0.001%     
- Complexity     31535     31537        +2     
===============================================
  Files           1930      1930               
  Lines         145443    145443               
  Branches       16090     16090               
===============================================
+ Hits          126586    126587        +1     
+ Misses         12997     12996        -1     
  Partials        5860      5860
Impacted Files Coverage Δ Complexity Δ
...transforms/markduplicates/MarkDuplicatesSpark.java 94.521% <ø> (ø) 36 <0> (ø) ⬇️
...e/hellbender/engine/spark/SparkContextFactory.java 71.233% <0%> (-2.74%) 11% <0%> (ø)
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java 60% <0%> (ø) 2% <0%> (ø) ⬇️
...utils/smithwaterman/SmithWatermanIntelAligner.java 80% <0%> (+30%) 3% <0%> (+2%) ⬆️

Contributor

@droazen droazen left a comment

The new tool docs need some work @jamesemery -- back to you for a few changes.

@@ -34,12 +34,62 @@

import java.util.*;

/**
* <p>This tool is a Spark implementation of the tool MarkDuplicates in Picard allowing for better utilization
* of available system resources to speed up duplicate marking.</p>
Contributor

Can you add a bit more Spark-specific information? For example, a working example Spark command showing how to run the tool on multiple cores, information about memory requirements per core, etc.
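(For reference, such an example might look like the sketch below. The -I/-O/-M arguments appear in the existing docs; passing Spark runner arguments after the -- separator follows GATK's Spark-tool conventions, and the specific master string is an illustration, not the documented example:)

    gatk MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        -M marked_dup_metrics.txt \
        -- \
        --spark-master local[8]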

Contributor

Also, the wording here could be better. How about this:

This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in 
parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching 
the output of the single-core Picard version.

*
* <p>Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of
* duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in
* the 'optional field' section of a SAM/BAM file. Invoking the 'duplicate-tagging-policy' option,
Contributor

Have you checked that all of the Picard args mentioned here are actually present in the Spark version?

Collaborator Author

Yes, I have checked and updated the names involved.
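(For instance, the Spark spelling of Picard's TAGGING_POLICY argument is the --duplicate-tagging-policy option quoted above; a sketch of its use, where the OpticalOnly value is assumed from Picard's tagging-policy choices rather than taken from this diff:)

    gatk MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        --duplicate-tagging-policy OpticalOnly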

* referred to as optical duplicates.</p>
*
* <p>The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.
* After duplicate reads arecollected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
Contributor

arecollected -> are collected

@jamesemery jamesemery assigned droazen and unassigned jamesemery Jan 25, 2019
@jamesemery
Collaborator Author

@droazen responded to your comments

Contributor

@droazen droazen left a comment

A few more comments @jamesemery -- back to you.

* <p>This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
* parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
* the output of the single-core Picard version. Since the tool requires holding all of the readnames in memory
* while it groups the paired-down read information, it is recommended running this tool on a machine/configuration
Contributor

Remove "paired-down" (I think you meant "pared down", but just "the read information" is sufficient)

* with at least 8 GB of memory for a typical 30x bam.</p>
Contributor

Is this 8GB per core or 8 GB total? Does it matter? How does the memory usage scale with the number of cores?

Collaborator Author

Well, the memory usage scales approximately with the size of the bam (and somewhat with the complexity of the pairs).
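(Under that scaling, a cluster run would budget memory per executor rather than per core. A sketch, assuming the standard spark-submit memory flags are forwarded after the -- separator; the host name and sizes below are placeholders:)

    gatk MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        -M marked_dup_metrics.txt \
        -- \
        --spark-runner SPARK \
        --spark-master spark://cluster-master:7077 \
        --executor-memory 8G \
        --driver-memory 8G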

* -M marked_dup_metrics.txt
* </pre>
*
* <h4>MarkDuplicates run on a Spark cluster with 5 machines</h4>
Contributor

Include a working usage example that involves running locally on multiple cores as well.

Collaborator Author

Running locally it automatically uses all available cores.
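(That default corresponds to Spark's local[*] master; to cap the core count explicitly, a user could override it, as in this illustrative sketch using standard Spark master syntax:)

    gatk MarkDuplicatesSpark -I input.bam -O marked_duplicates.bam -- --spark-master local[4]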

@droazen droazen assigned jamesemery and unassigned droazen Jan 25, 2019
@jamesemery jamesemery assigned droazen and unassigned jamesemery Jan 25, 2019
@jamesemery
Collaborator Author

@droazen back to you

Contributor

@droazen droazen left a comment

👍 Merging! Congrats @jamesemery !

@droazen droazen merged commit ec2a6f7 into master Jan 25, 2019
@droazen droazen deleted the je_releaseMarkDuplicates branch January 25, 2019 22:26