Releasing MarkDuplicatesSpark #5603
Conversation
Codecov Report
@@             Coverage Diff              @@
##              master     #5603    +/-   ##
===============================================
+ Coverage     87.035%   87.035%   +0.001%
- Complexity     31535     31537        +2
===============================================
  Files           1930      1930
  Lines         145443   145443
  Branches       16090     16090
===============================================
+ Hits          126586    126587        +1
+ Misses         12997     12996        -1
  Partials        5860      5860
The new tool docs need some work @jamesemery -- back to you for a few changes.
@@ -34,12 +34,62 @@

import java.util.*;

/**
 * <p>This tool is a Spark implementation of the tool MarkDuplicates in Picard allowing for better utilization
 * of available system resources to speed up duplicate marking.</p>
Can you add a bit more Spark-specific information? For example, a working example Spark command showing how to run the tool on multiple cores, information about memory requirements per core, etc.
Also, the wording here could be better. How about this:
This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
the output of the single-core Picard version.
 *
 * <p>Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of
 * duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in
 * the 'optional field' section of a SAM/BAM file. Invoking the 'duplicate-tagging-policy' option,
Have you checked that all of the Picard args mentioned here are actually present in the Spark version?
Yes, I have checked and updated the names involved.
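For reference, a sketch of an invocation exercising the option as it now reads (file names are placeholders, and the OpticalOnly value is assumed from the corresponding Picard tagging-policy enum):

# Mark duplicates and emit the DT tag for optical duplicates only
gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    --duplicate-tagging-policy OpticalOnly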
 * referred to as optical duplicates.</p>
 *
 * <p>The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.
 * After duplicate reads arecollected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
arecollected
-> are collected
@droazen responded to your comments
A few more comments @jamesemery -- back to you.
 * <p>This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
 * parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
 * the output of the single-core Picard version. Since the tool requires holding all of the readnames in memory
 * while it groups the paired-down read information, it is recommended running this tool on a machine/configuration
Remove "paired-down" (I think you meant "pared down", but just "the read information" is sufficient)
 * parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
 * the output of the single-core Picard version. Since the tool requires holding all of the readnames in memory
 * while it groups the paired-down read information, it is recommended running this tool on a machine/configuration
 * with at least 8 GB of memory for a typical 30x bam.</p>
Is this 8GB per core or 8 GB total? Does it matter? How does the memory usage scale with the number of cores?
Well, the memory usage scales approximately with the size of the bam (and somewhat with the complexity of the pairs).
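To make that concrete, a sketch of a local run with the JVM heap raised via the standard gatk launcher's --java-options passthrough (the 8g value just mirrors the 30x bam guidance above; file names are placeholders):

# Give the Spark JVM an 8 GB heap for a typical 30x bam
gatk --java-options "-Xmx8g" MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt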
 * -M marked_dup_metrics.txt
 * </pre>
 *
 * <h4>MarkDuplicates run on a Spark cluster with 5 machines</h4>
Include a working usage example that involves running locally on multiple cores as well.
Running locally it automatically uses all available cores.
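That said, for anyone who wants to cap rather than use all cores, a sketch passing a Spark master string through after the -- separator (local[4] is an arbitrary example; file names are placeholders):

# Restrict the local Spark run to 4 cores instead of all available ones
gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    -- \
    --spark-master local[4]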
@droazen back to you
👍 Merging! Congrats @jamesemery !
Removing the beta tag in advance of the 4.1 release.
Resolves #4675