-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added some javadoc notes about disk throughput to MarkDuplicatesSpark and SortSamSpark #5672
Conversation
… and SortSamSpark
Codecov Report
@@ Coverage Diff @@
## master #5672 +/- ##
==============================================
- Coverage 87.05% 80.287% -6.763%
+ Complexity 31708 30237 -1471
==============================================
Files 1940 1943 +3
Lines 146142 146770 +628
Branches 16128 16223 +95
==============================================
- Hits 127216 117837 -9379
- Misses 13041 23220 +10179
+ Partials 5885 5713 -172
|
@jamesemery, how do you feel about me making changes to your branch via a secondary commit? I think this might be easier than my asking for minor documentation fixes, explaining them and then having you make them. |
I just noticed that SortSamSpark has zero javadoc. So here is where I am developing content: https://docs.google.com/document/d/13_fZ8y8692aKH3jT09cf9VCNxD2NpkISjq8EJqq0ZJ4/edit?usp=sharing @jamesemery. |
Will wait for your return in a week to discuss. |
Assigning back to @jamesemery to review @sooheelee 's proposed Google doc, and incorporate whatever changes are appropriate |
@jamesemery I think that the warning about coordinate-sorted input vs. name-sorted input needs to be much more prominent, and perhaps repeated several times at various points in the docs. |
Also, we should confirm that the tool itself emits a warning message to the logger when given coordinate-sorted input. |
@sooheelee Updated this branch, let me know what your thoughts are (based on the google document discussion) |
@sooheelee Updated the documentation in this branch according to the external discussion. Thoughts? Can this be merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery, here is my review. Since we have worked together to extensively shape the content and preemptively address user questions, what remains are minor typos etc.
* read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024. | ||
* If you are not familiar with this type of annotation, please see the following <a href='https://www.broadinstitute.org/gatk/blog?id=7019'>blog post</a> for additional information.</p>" + | ||
* <ul> | ||
* <li>FMarkDuplicatesSpark processing can replace both the MarkDuplicates and SortSam steps of the Best Practices <a href="https://software.broadinstitute.org/gatk/documentation/article?id=7899#2">single sample pipeline </a>. After flagging duplicate sets, the tool automatically coordinate-sorts the records. It is still necessary to subsequently run SetNmMdAndUqTags before running BQSR. </li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- remove
F
in FMarkDuplicatesSpark - remove extraneous space in
>single sample pipeline </a>
* <ul> | ||
* <li>FMarkDuplicatesSpark processing can replace both the MarkDuplicates and SortSam steps of the Best Practices <a href="https://software.broadinstitute.org/gatk/documentation/article?id=7899#2">single sample pipeline </a>. After flagging duplicate sets, the tool automatically coordinate-sorts the records. It is still necessary to subsequently run SetNmMdAndUqTags before running BQSR. </li> | ||
* <li>The tool is optimized to run on queryname-grouped alignments. If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances.</li> | ||
* <li>Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. <li/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- correct closing list item tag:
<li/>
--></li>
* <li>FMarkDuplicatesSpark processing can replace both the MarkDuplicates and SortSam steps of the Best Practices <a href="https://software.broadinstitute.org/gatk/documentation/article?id=7899#2">single sample pipeline </a>. After flagging duplicate sets, the tool automatically coordinate-sorts the records. It is still necessary to subsequently run SetNmMdAndUqTags before running BQSR. </li> | ||
* <li>The tool is optimized to run on queryname-grouped alignments. If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances.</li> | ||
* <li>Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. <li/> | ||
* <li>CCollecting duplicate metrics slows down performance and thus the metrics collection is optional and must be specified for the Spark version of the tool with '-M'. It is possible to collect the metrics with the standalone Picard tool, <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_EstimateLibraryComplexity.php'>EstimateLibraryComplexity</a>.</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- remove extraneous
C
in CCollecting - remove
,
before EstimateLibraryComplexity
* <li>The tool is optimized to run on queryname-grouped alignments. If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances.</li> | ||
* <li>Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. <li/> | ||
* <li>CCollecting duplicate metrics slows down performance and thus the metrics collection is optional and must be specified for the Spark version of the tool with '-M'. It is possible to collect the metrics with the standalone Picard tool, <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_EstimateLibraryComplexity.php'>EstimateLibraryComplexity</a>.</li> | ||
* <li>MarkDuplicatesSpark is optimized to run locally on a single machine by leveraging core parallelism that MarkDuplicates and SortSam cannot. It will typically run faster than MarkDuplicates and SortSam by a factor of 15% over the same data at 2 cores and will scale linearly to upwards of 16 cores. This means that the tool can be used to speedup MarkDuplicates even without access to a Spark cluster.</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that the tool can be used to speedup MarkDuplicates even without access to a Spark cluster.
--> This means MarkDuplicatesSpark, even without access to a Spark cluster, is faster than MarkDuplicates.
/** | ||
* SortSam on Spark (works on SAM/BAM/CRAM) | ||
* | ||
* <h4>Overview</h4> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* <h3>Additional Notes</h3> | ||
* <ul> | ||
* <li>This Spark tool requires a significant amount of disk operations. Run with both the input data and outputs on high throughput SSDs when possible. When pipelining this tool on Google Compute Engine instances, for best performance requisition machines with LOCAL SSDs. </li> | ||
* <li>Furthermore, we recommend explicitly setting the Spark temp directory to an available SSD when running this in local mode by adding the argument --conf 'spark.local.dir=/PATH/TO/TEMP/DIR'. See the discussion at https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337 for details.</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change "See the discussion at https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337 for details."
-->
See <a href="https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337">this forum discussion</a> for details.
* SortSam on Spark (works on SAM/BAM/CRAM) | ||
* | ||
* <h4>Overview</h4> | ||
* <p>A <a href='https://software.broadinstitute.org/gatk/blog?id=23420'>Spark</a> implementation of <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_SortSam.php'>Picard SortSam</a>. The Spark version can run in parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching the output of the single-core Picard version. See <a href="https://software.broadinstitute.org/gatk/blog?id=23420">Blog#23420</a> for performance benchmarks.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove link surrounding 'Spark'. We link to the blog article in the next sentence.
* | ||
* <p>The tool sorts reads by coordinate order by default or alternatively by read name, the QNAME field, if asked with the '-SO queryname' option. The contig ordering in the reference dictionary defines coordinate order, and the tool uses the sequence dictionary represented by the @SQ header lines or that of the optionally provided reference to sort reads by the RNAME field. For those reads mapping to a contig, coordinate sorting further orders reads by the POS field of the SAM record, which contains the leftmost mapping position.</p> | ||
* | ||
* <p>Queryname-sorted alignments are grouped first by readname and then are deterministically ordered among equal readnames by read flags including orientation, secondary, and supplemntary record flags (See {@link htsjdk.samtools.SAMRecordQueryNameComparator#compare(SAMRecord, SAMRecord)}} for details). For paired-end reads, reads in the pair share the same queryname. Because aligners can generate secondary and supplementary alignments, queryname groups can consists of, e.g. more than two records for a paired-end pair.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can improve this sentence:
Queryname-sorted alignments are grouped first by readname and then are deterministically ordered among equal readnames by read flags including orientation, secondary, and supplemntary record flags (See htsjdk.samtools.SAMRecordQueryNameComparator#compare(SAMRecord, SAMRecord)} for details).
May I suggest
"To queryname-sort, the tool first groups by readname and then deterministically sorts within a readname set by orientation, secondary and supplementary SAM flags. "
- supplemntary record flags --> supplementary
- remove link to detailed information. It doesn't show up well in javadoc:
* <h4>Usage examples</h4> | ||
* Coordinate-sort aligned reads using all cores available locally | ||
* <pre> | ||
* gatk SortSamSpark \<br /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mention below, here and elsewhere:
Please remove the <br>
elements as they are unnecessary for gatkdocs and actually add a weird additional line between command lines.
Right now, this is what the doc looks like:
We want to tighten up the line spacing and removing the breaks will do that for us.
* </pre> | ||
* | ||
* <h3>Notes</h3> | ||
* <ul> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Mutect2, I decided to make the Notes section an ordered list. It's up to you if you want to change ul
tool
since you only have two bullets.
@sooheelee I have responded to your comments. They were mostly typos and quick changes. I might go ahead and merge this soon. |
Yes, feel free to merge. I gave approval with my last review. |
* -I coordinatesorted.bam \ | ||
* -SO queryname \ | ||
* -O querygroupsorted.bam \ | ||
* -- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just skimming through and this needs a \
.
Hopefully this will help better inform people as to how these tools should be run in cromwell.