
Added some javadoc notes about disk throughput to MarkDuplicatesSpark and SortSamSpark #5672

Merged
jamesemery merged 5 commits into master on Mar 7, 2019

Conversation

jamesemery
Copy link
Collaborator

Hopefully this will help better inform people about how these tools should be run in Cromwell.

@codecov-io

codecov-io commented Feb 13, 2019

Codecov Report

Merging #5672 into master will decrease coverage by 6.763%.
The diff coverage is n/a.

@@              Coverage Diff               @@
##             master     #5672       +/-   ##
==============================================
- Coverage     87.05%   80.287%   -6.763%     
+ Complexity    31708     30237     -1471     
==============================================
  Files          1940      1943        +3     
  Lines        146142    146770      +628     
  Branches      16128     16223       +95     
==============================================
- Hits         127216    117837     -9379     
- Misses        13041     23220    +10179     
+ Partials       5885      5713      -172
Impacted Files Coverage Δ Complexity Δ
...transforms/markduplicates/MarkDuplicatesSpark.java 94.521% <ø> (ø) 36 <0> (ø) ⬇️
...hellbender/tools/spark/pipelines/SortSamSpark.java 100% <ø> (ø) 5 <0> (ø) ⬇️
...rs/variantutils/SelectVariantsIntegrationTest.java 0.25% <0%> (-99.75%) 1% <0%> (-70%)
...kers/filters/VariantFiltrationIntegrationTest.java 0.826% <0%> (-99.174%) 1% <0%> (-25%)
...dorientation/CollectF1R2CountsIntegrationTest.java 0.917% <0%> (-99.083%) 1% <0%> (-12%)
.../walkers/bqsr/BaseRecalibratorIntegrationTest.java 1.031% <0%> (-98.969%) 1% <0%> (-7%)
...ers/vqsr/FilterVariantTranchesIntegrationTest.java 1.053% <0%> (-98.947%) 1% <0%> (-5%)
...s/variantutils/VariantsToTableIntegrationTest.java 1.205% <0%> (-98.795%) 1% <0%> (-20%)
...ientation/ReadOrientationModelIntegrationTest.java 1.667% <0%> (-98.333%) 1% <0%> (-5%)
...on/FindBreakpointEvidenceSparkIntegrationTest.java 1.754% <0%> (-98.246%) 1% <0%> (-6%)
... and 237 more

@sooheelee
Contributor

@jamesemery, how do you feel about me making changes to your branch via a secondary commit? I think this might be easier than my asking for minor documentation fixes, explaining them and then having you make them.

@sooheelee
Contributor

I just noticed that SortSamSpark has zero javadoc. So here is where I am developing content: https://docs.google.com/document/d/13_fZ8y8692aKH3jT09cf9VCNxD2NpkISjq8EJqq0ZJ4/edit?usp=sharing @jamesemery.

@sooheelee
Contributor

Will wait for your return in a week to discuss.

droazen assigned jamesemery and unassigned sooheelee on Feb 14, 2019
@droazen
Contributor

droazen commented Feb 14, 2019

Assigning back to @jamesemery to review @sooheelee's proposed Google doc and incorporate whatever changes are appropriate.

@droazen
Contributor

droazen commented Feb 14, 2019

@jamesemery I think that the warning about coordinate-sorted input vs. name-sorted input needs to be much more prominent, and perhaps repeated several times at various points in the docs.

@droazen
Contributor

droazen commented Feb 14, 2019

Also, we should confirm that the tool itself emits a warning message to the logger when given coordinate-sorted input.

@jamesemery
Collaborator Author

@sooheelee Updated this branch, let me know what your thoughts are (based on the Google document discussion).

jamesemery assigned sooheelee and unassigned jamesemery on Feb 21, 2019
@jamesemery
Collaborator Author

@sooheelee Updated the documentation in this branch according to the external discussion. Thoughts? Can this be merged?

Contributor

sooheelee left a comment


@jamesemery, here is my review. Since we have worked together to extensively shape the content and preemptively address user questions, what remains are minor typos etc.

* read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024.
* If you are not familiar with this type of annotation, please see the following <a href='https://www.broadinstitute.org/gatk/blog?id=7019'>blog post</a> for additional information.</p>" +
* <ul>
* <li>FMarkDuplicatesSpark processing can replace both the MarkDuplicates and SortSam steps of the Best Practices <a href="https://software.broadinstitute.org/gatk/documentation/article?id=7899#2">single sample pipeline </a>. After flagging duplicate sets, the tool automatically coordinate-sorts the records. It is still necessary to subsequently run SetNmMdAndUqTags before running BQSR. </li>

  • remove F in FMarkDuplicatesSpark
  • remove extraneous space in >single sample pipeline </a>
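As a side note on the 0x0400 / 1024 value in the quoted javadoc: a quick way to spot-check which records ended up flagged is to count reads carrying that flag. This is only a hedged illustration; samtools is not part of this PR and the file name is a placeholder.

    # count records with the duplicate flag set (0x400 = 1024)
    samtools view -c -f 1024 marked_duplicates.bam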

* <ul>
* <li>FMarkDuplicatesSpark processing can replace both the MarkDuplicates and SortSam steps of the Best Practices <a href="https://software.broadinstitute.org/gatk/documentation/article?id=7899#2">single sample pipeline </a>. After flagging duplicate sets, the tool automatically coordinate-sorts the records. It is still necessary to subsequently run SetNmMdAndUqTags before running BQSR. </li>
* <li>The tool is optimized to run on queryname-grouped alignments. If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances.</li>
* <li>Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. <li/>

  • correct closing list item tag: <li/> --> </li>

* <li>FMarkDuplicatesSpark processing can replace both the MarkDuplicates and SortSam steps of the Best Practices <a href="https://software.broadinstitute.org/gatk/documentation/article?id=7899#2">single sample pipeline </a>. After flagging duplicate sets, the tool automatically coordinate-sorts the records. It is still necessary to subsequently run SetNmMdAndUqTags before running BQSR. </li>
* <li>The tool is optimized to run on queryname-grouped alignments. If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances.</li>
* <li>Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. <li/>
* <li>CCollecting duplicate metrics slows down performance and thus the metrics collection is optional and must be specified for the Spark version of the tool with '-M'. It is possible to collect the metrics with the standalone Picard tool, <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_EstimateLibraryComplexity.php'>EstimateLibraryComplexity</a>.</li>

  • remove extraneous C in CCollecting
  • remove , before EstimateLibraryComplexity
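To make the '-M' point concrete, a sketch of an invocation that also writes duplicate metrics (file names are placeholders, not taken from the PR):

    gatk MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        -M marked_dup_metrics.txt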

* <li>The tool is optimized to run on queryname-grouped alignments. If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances.</li>
* <li>Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. <li/>
* <li>CCollecting duplicate metrics slows down performance and thus the metrics collection is optional and must be specified for the Spark version of the tool with '-M'. It is possible to collect the metrics with the standalone Picard tool, <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_EstimateLibraryComplexity.php'>EstimateLibraryComplexity</a>.</li>
* <li>MarkDuplicatesSpark is optimized to run locally on a single machine by leveraging core parallelism that MarkDuplicates and SortSam cannot. It will typically run faster than MarkDuplicates and SortSam by a factor of 15% over the same data at 2 cores and will scale linearly to upwards of 16 cores. This means that the tool can be used to speedup MarkDuplicates even without access to a Spark cluster.</li>

  • This means that the tool can be used to speedup MarkDuplicates even without access to a Spark cluster. --> This means MarkDuplicatesSpark, even without access to a Spark cluster, is faster than MarkDuplicates.
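For the local-parallelism point above, a rough sketch of pinning the tool to a fixed number of local cores; the --spark-master argument after the -- separator is assumed here rather than quoted from this PR:

    gatk MarkDuplicatesSpark \
        -I input.bam \
        -O marked_duplicates.bam \
        -- \
        --spark-master 'local[16]'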

/**
* SortSam on Spark (works on SAM/BAM/CRAM)
*
* <h4>Overview</h4>

Remove <h4>Overview</h4> line. Currently, documentation looks like this:
[screenshot of the rendered documentation]

* <h3>Additional Notes</h3>
* <ul>
* <li>This Spark tool requires a significant amount of disk operations. Run with both the input data and outputs on high throughput SSDs when possible. When pipelining this tool on Google Compute Engine instances, for best performance requisition machines with LOCAL SSDs. </li>
* <li>Furthermore, we recommend explicitly setting the Spark temp directory to an available SSD when running this in local mode by adding the argument --conf 'spark.local.dir=/PATH/TO/TEMP/DIR'. See the discussion at https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337 for details.</li>

Change "See the discussion at https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337 for details."
-->
See <a href="https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337">this forum discussion</a> for details.
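Concretely, the quoted recommendation would translate to passing the Spark property after the -- separator, along these lines (the path is a placeholder):

    gatk SortSamSpark \
        -I input.bam \
        -O sorted.bam \
        -- \
        --conf 'spark.local.dir=/path/to/ssd/tmp'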

* SortSam on Spark (works on SAM/BAM/CRAM)
*
* <h4>Overview</h4>
* <p>A <a href='https://software.broadinstitute.org/gatk/blog?id=23420'>Spark</a> implementation of <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_SortSam.php'>Picard SortSam</a>. The Spark version can run in parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching the output of the single-core Picard version. See <a href="https://software.broadinstitute.org/gatk/blog?id=23420">Blog#23420</a> for performance benchmarks.</p>

Remove link surrounding 'Spark'. We link to the blog article in the next sentence.

*
* <p>The tool sorts reads by coordinate order by default or alternatively by read name, the QNAME field, if asked with the '-SO queryname' option. The contig ordering in the reference dictionary defines coordinate order, and the tool uses the sequence dictionary represented by the @SQ header lines or that of the optionally provided reference to sort reads by the RNAME field. For those reads mapping to a contig, coordinate sorting further orders reads by the POS field of the SAM record, which contains the leftmost mapping position.</p>
*
* <p>Queryname-sorted alignments are grouped first by readname and then are deterministically ordered among equal readnames by read flags including orientation, secondary, and supplemntary record flags (See {@link htsjdk.samtools.SAMRecordQueryNameComparator#compare(SAMRecord, SAMRecord)}} for details). For paired-end reads, reads in the pair share the same queryname. Because aligners can generate secondary and supplementary alignments, queryname groups can consists of, e.g. more than two records for a paired-end pair.</p>

We can improve this sentence:

Queryname-sorted alignments are grouped first by readname and then are deterministically ordered among equal readnames by read flags including orientation, secondary, and supplemntary record flags (See htsjdk.samtools.SAMRecordQueryNameComparator#compare(SAMRecord, SAMRecord)} for details).

May I suggest
"To queryname-sort, the tool first groups by readname and then deterministically sorts within a readname set by orientation, secondary and supplementary SAM flags. "

  • supplemntary record flags --> supplementary
  • remove link to detailed information. It doesn't show up well in javadoc:

[screenshot of the rendered javadoc showing the broken link]

* <h4>Usage examples</h4>
* Coordinate-sort aligned reads using all cores available locally
* <pre>
* gatk SortSamSpark \<br />

As I mention below, here and elsewhere:
Please remove the <br> elements as they are unnecessary for gatkdocs and actually add a weird additional line between command lines.

Right now, this is what the doc looks like:
[screenshot of the rendered usage example with extra line spacing]

We want to tighten up the line spacing and removing the breaks will do that for us.
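For reference, once the breaks are removed the rendered usage block should simply read as consecutive command lines, for example (a sketch with placeholder names, not the PR's exact example):

    gatk SortSamSpark \
        -I input.bam \
        -O sorted.bam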

* </pre>
*
* <h3>Notes</h3>
* <ul>

For Mutect2, I decided to make the Notes section an ordered list. It's up to you if you want to change the ul to an ol since you only have two bullets.

sooheelee assigned jamesemery and unassigned sooheelee on Mar 6, 2019
@jamesemery
Collaborator Author

@sooheelee I have responded to your comments. They were mostly typos and quick changes. I might go ahead and merge this soon.

@sooheelee
Contributor

Yes, feel free to merge. I gave approval with my last review.

* -I coordinatesorted.bam \
* -SO queryname \
* -O querygroupsorted.bam \
* --

Just skimming through and this needs a \.

jamesemery merged commit 3570af9 into master on Mar 7, 2019
jamesemery deleted the je_updateMDSparkDocumentation branch on March 7, 2019 at 19:55