Improved memory requirements of CollectReadCounts. #5715
Conversation
Force-pushed from 39ed129 to a43ff73.
"Input intervals may not be overlapping."); | ||
final SAMFileHeader.SortOrder sortOrder = getHeaderForReads().getSortOrder(); | ||
isCoordinateSorted = sortOrder == SAMFileHeader.SortOrder.coordinate; | ||
if (!isCoordinateSorted) { |
This and other checks might be overkill; I'm not sure whether they were inherited from older tools where sorting wasn't guaranteed. We can get rid of them if anyone feels strongly.
IMO, I don't think the coordinate-sorted check is justified if we currently require the BAM to be indexed. How expensive is the interval-overlap check? I would say keep it unless it takes more than a minute or so on 100bp WGS intervals.
Sure, I agree. The interval-overlap check is cheap, and I'm OK with keeping it.
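For context, a minimal sketch of why the overlap check is cheap on coordinate-sorted input (a hypothetical helper, not the tool's actual code; htsjdk's `Interval` stands in for the real interval type): any overlap in a sorted list must show up between adjacent intervals, so a single O(n) pass with O(1) extra memory suffices, which should finish well within a minute even on ~30M 100bp WGS intervals.

```java
import htsjdk.samtools.util.Interval;

import java.util.List;

final class IntervalValidation {
    // If intervals i < j overlap in a list sorted by contig and start, then
    // interval i also overlaps interval i + 1, so checking adjacent pairs
    // is sufficient.
    static boolean hasOverlaps(final List<Interval> sorted) {
        for (int i = 1; i < sorted.size(); i++) {
            final Interval prev = sorted.get(i - 1);
            final Interval curr = sorted.get(i);
            if (prev.getContig().equals(curr.getContig()) && curr.getStart() <= prev.getEnd()) {
                return true;
            }
        }
        return false;
    }
}
```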
Codecov Report
```diff
@@            Coverage Diff             @@
##            master     #5715    +/-   ##
===========================================
- Coverage    87.069%   87.062%   -0.008%
- Complexity    31875     31880        +5
===========================================
  Files          1940      1940
  Lines        146738    146804       +66
  Branches      16226     16234        +8
===========================================
+ Hits         127764    127810       +46
- Misses        13061     13073       +12
- Partials       5913      5921        +8
```
@samuelklee Thanks for this! Looks good - just a couple of questions regarding the input validation.
```diff
 @Override
 public List<ReadFilter> getDefaultReadFilters() {
-    final List<ReadFilter> filters = new ArrayList<>(super.getDefaultReadFilters());
+    final List<ReadFilter> filters = new ArrayList<>();
```
What was the motivation for this? I know we experimented with it some. @vruano This was related to CRAM performance, correct?
See #5233. I don't know if we should disable this by default or just do it at the WDL level for the production WDL. I'm also not sure whether we should disable it over in CollectAllelicCounts, or whether that would have any effect. With the cost reductions from the other fixes, I don't think it's absolutely necessary to disable the filter (although we should rerun cost estimates to be sure memory requirements don't change). Up to you.
I think we should keep it enabled by default, since that's the expected behavior for all GATK tools, but disable it from the WDL (and make it part of best practices).
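For reference, a hedged sketch of the behavior being kept (an excerpt-style illustration, not the tool's actual source; the MAPPED filter below is a placeholder for whatever tool-specific filters apply): retaining the `super` call means the engine-level defaults, which include `WellformedReadFilter`, stay in effect, and a pipeline can still opt out per-run with GATK's `--disable-read-filter WellformedReadFilter` argument instead of changing the tool's defaults.

```java
// Inside a GATK tool class; illustrative only, not CollectReadCounts's code.
@Override
public List<ReadFilter> getDefaultReadFilters() {
    // Start from the engine defaults, which include WellformedReadFilter...
    final List<ReadFilter> filters = new ArrayList<>(super.getDefaultReadFilters());
    // ...then layer on tool-specific filters (placeholder example).
    filters.add(ReadFilterLibrary.MAPPED);
    return filters;
}
```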
"Input intervals may not be overlapping."); | ||
final SAMFileHeader.SortOrder sortOrder = getHeaderForReads().getSortOrder(); | ||
isCoordinateSorted = sortOrder == SAMFileHeader.SortOrder.coordinate; | ||
if (!isCoordinateSorted) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, I don't think the coordinate-sorted check is justified if we currently require the bam to be indexed. How expensive is the interval overlap check? I would say keep it unless it takes more than a minute or so on 100bp WGS intervals.
CRAM + NIO looks to be ~3 cents per sample. This essentially includes disk optimizations, since the disk size is determined by the CRAM size; the CRAMs are not too large, so disk costs come to ~0.3 cents per sample. Note that I ran on the CRAMs in gs://broad-sv-dev-data/TCGA_blood_normals.
CRAM w/o NIO is also ~3 cents per sample (it was marginally more expensive than CRAM w/ NIO, but within the noise). CRAM w/o NIO w/ SSD is ~5 cents. So I'd say CRAM w/ or w/o NIO is fine. Strictly speaking, we can't directly compare the BAM and CRAM costs, since they were run on different sets of TCGA samples. But both are well under the goal of ~15 cents per sample, so I think it's safe to say that we can turn our attention to optimizing inference costs.
Thanks, looks good to merge assuming the tests pass.
The goal was to get WGS coverage collection at 100bp down to ~15 cents per sample. Since this is I/O bound (it takes ~2 hours to stream or localize a BAM, or about the same to decompress a CRAM), cost reduction is most easily achieved by reducing memory requirements and moving down to a cheaper VM.
Memory requirements at 100bp are dominated by manipulations of the list of ~30M intervals. There were a few easy fixes to reduce requirements that did not require changing the collection method (which can be easily modified for future investigations, see #4551):
- Removed WellformedReadFilter. See #5233. EDIT: We decided after PR review to retain this filter by default and disable it at the WDL level when Best Practices is released. Leaving the issue open.
- Initialized HashMultiset capacity.
- Removed an unnecessary call to OverlapDetector.getAll.
- Avoided a redundant defensive copy in SimpleCountCollection.
- Used per-contig OverlapDetectors rather than a global one (see the sketch after this list).
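A hedged sketch of two of these fixes, with simplified names (`bins` is assumed to be the full list of genomic bins, and htsjdk's `Interval` stands in for the tool's interval type):

```java
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import htsjdk.samtools.util.Interval;
import htsjdk.samtools.util.OverlapDetector;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class BinIndexing {
    // One OverlapDetector per contig keeps each detector's internal
    // structures small, rather than one global detector over ~30M bins.
    static Map<String, OverlapDetector<Interval>> detectorsByContig(final List<Interval> bins) {
        return bins.stream().collect(Collectors.groupingBy(
                Interval::getContig,
                Collectors.collectingAndThen(Collectors.toList(), OverlapDetector::create)));
    }

    // Sizing the multiset to the number of bins up front avoids repeated
    // rehashing as counts accumulate.
    static Multiset<Interval> createCounts(final List<Interval> bins) {
        return HashMultiset.create(bins.size());
    }
}
```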
This brought the cost down to ~9 cents per sample using n1-standard-2s with 7.5 GB of memory when collecting on BAMs with NIO. Note that I didn't optimize disk size, which accounts for ~50% of the total cost and is unused when running with NIO, so we are actually closer to ~5 cents per sample. It is possible that using CRAMs, with or without NIO and with or without SSDs, might be cheaper.
Note that OverlapDetectors may be overkill for our case, since bins are guaranteed to be sorted and non-overlapping and queries are also sorted. We could probably roll something that is O(1) in memory; a sketch of such a streaming approach follows below. However, since we are I/O bound, as long as we are satisfied with the current cost, I am willing to trade memory for lower implementation and maintenance costs, as well as the option to change strategies easily. In any case, @lbergelson found some easy wins in OverlapDetector that may bring memory usage down further, and will issue a fix in htsjdk soon.
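To make that trade-off concrete, here is a rough sketch of the O(1)-extra-memory alternative (single contig for simplicity, reads assumed to be assigned to bins by start position, htsjdk's `Interval` again standing in): because bins are sorted and non-overlapping and reads arrive in coordinate order, a single forward-moving cursor can replace the interval tree.

```java
import htsjdk.samtools.util.Interval;

import java.util.List;

// Rough sketch, not the tool's implementation: counts read starts against
// sorted, non-overlapping bins on a single contig using one cursor.
final class StreamingBinCounter {
    private final List<Interval> bins;  // sorted and non-overlapping
    private final long[] counts;
    private int cursor = 0;

    StreamingBinCounter(final List<Interval> bins) {
        this.bins = bins;
        this.counts = new long[bins.size()];
    }

    // Read starts must be presented in coordinate-sorted order.
    void count(final int readStart) {
        // A bin that ends before this read's start can never match a later
        // read either, so the cursor only ever moves forward.
        while (cursor < bins.size() && bins.get(cursor).getEnd() < readStart) {
            cursor++;
        }
        if (cursor < bins.size() && bins.get(cursor).getStart() <= readStart) {
            counts[cursor]++;
        }
    }

    long[] getCounts() {
        return counts;
    }
}
```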