
Releasing MarkDuplicatesSpark #5603

Merged
merged 3 commits into master from je_releaseMarkDuplicates on Jan 25, 2019

Conversation

jamesemery
Collaborator

@jamesemery jamesemery commented Jan 24, 2019

Removing the beta tag in advance of the 4.1 release.

Resolves #4675

@codecov-io

codecov-io commented Jan 24, 2019

Codecov Report

Merging #5603 into master will increase coverage by 0.001%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #5603       +/-   ##
===============================================
+ Coverage     87.035%   87.035%   +0.001%     
- Complexity     31535     31537        +2     
===============================================
  Files           1930      1930               
  Lines         145443    145443               
  Branches       16090     16090               
===============================================
+ Hits          126586    126587        +1     
+ Misses         12997     12996        -1     
  Partials        5860      5860
Impacted Files Coverage Δ Complexity Δ
...transforms/markduplicates/MarkDuplicatesSpark.java 94.521% <ø> (ø) 36 <0> (ø) ⬇️
...e/hellbender/engine/spark/SparkContextFactory.java 71.233% <0%> (-2.74%) 11% <0%> (ø)
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java 60% <0%> (ø) 2% <0%> (ø) ⬇️
...utils/smithwaterman/SmithWatermanIntelAligner.java 80% <0%> (+30%) 3% <0%> (+2%) ⬆️

Contributor

@droazen droazen left a comment

The new tool docs need some work @jamesemery -- back to you for a few changes.

@@ -34,12 +34,62 @@

import java.util.*;

/**
* <p>This tool is a Spark implementation of the tool MarkDuplicates in Picard allowing for better utilization
* of available system resources to speed up duplicate marking.</p>
Contributor

Can you add a bit more Spark-specific information? For example, a working example Spark command showing how to run the tool on multiple cores, information about memory requirements per core, etc.
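(For reference, such an example might look like the sketch below. The -I/-O/-M arguments appear in the existing docs; passing Spark runner arguments after the -- separator follows GATK's Spark-tool conventions, and the specific master string is an illustration, not the documented example:)

    gatk MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        -M marked_dup_metrics.txt \
        -- \
        --spark-master local[8]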

Contributor

Also, the wording here could be better. How about this:

This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in 
parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching 
the output of the single-core Picard version.

*
* <p>Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of
* duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in
* the 'optional field' section of a SAM/BAM file. Invoking the 'duplicate-tagging-policy' option,
Contributor

Have you checked that all of the Picard args mentioned here are actually present in the Spark version?

Collaborator Author

Yes, I have checked and updated the names involved.
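(For instance, the Spark spelling of Picard's TAGGING_POLICY argument is the --duplicate-tagging-policy option quoted above; a sketch of its use, where the OpticalOnly value is assumed from Picard's tagging-policy choices rather than taken from this diff:)

    gatk MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        --duplicate-tagging-policy OpticalOnly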

* referred to as optical duplicates.</p>
*
* <p>The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.
* After duplicate reads arecollected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
Contributor

arecollected -> are collected

@jamesemery jamesemery assigned droazen and unassigned jamesemery Jan 25, 2019
@jamesemery
Collaborator Author

@droazen responded to your comments

Contributor

@droazen droazen left a comment

A few more comments @jamesemery -- back to you.

* <p>This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
* parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
* the output of the single-core Picard version. Since the tool requires holding all of the readnames in memory
* while it groups the paired-down read information, it is recommended running this tool on a machine/configuration
Contributor

Remove "paired-down" (I think you meant "pared down", but just "the read information" is sufficient)

* with at least 8 GB of memory for a typical 30x bam.</p>
Contributor

Is this 8GB per core or 8 GB total? Does it matter? How does the memory usage scale with the number of cores?

Collaborator Author

Well, the memory usage scales approximately with the size of the bam (and somewhat with the complexity of the pairs).
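(Under that scaling, a cluster run would budget memory per executor rather than per core. A sketch, assuming the standard spark-submit memory flags are forwarded after the -- separator; the host name and sizes below are placeholders:)

    gatk MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        -M marked_dup_metrics.txt \
        -- \
        --spark-runner SPARK \
        --spark-master spark://cluster-master:7077 \
        --executor-memory 8G \
        --driver-memory 8G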

* -M marked_dup_metrics.txt
* </pre>
*
* <h4>MarkDuplicates run on a Spark cluster with 5 machines</h4>
Contributor

Include a working usage example that involves running locally on multiple cores as well.

Collaborator Author

Running locally it automatically uses all available cores.
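(That default corresponds to Spark's local[*] master; to cap the core count explicitly, a user could override it, as in this illustrative sketch using standard Spark master syntax:)

    gatk MarkDuplicatesSpark -I input.bam -O marked_duplicates.bam -- --spark-master local[4]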

@droazen droazen assigned jamesemery and unassigned droazen Jan 25, 2019
@jamesemery jamesemery assigned droazen and unassigned jamesemery Jan 25, 2019
@jamesemery
Collaborator Author

@droazen back to you

Contributor

@droazen droazen left a comment

👍 Merging! Congrats @jamesemery !

@droazen droazen merged commit ec2a6f7 into master Jan 25, 2019
@droazen droazen deleted the je_releaseMarkDuplicates branch January 25, 2019 22:26