
Added MinibatchSliceSampler and replaced naive subsampling in ModelSegments. #5575

Merged
merged 3 commits on Jan 18, 2019

Conversation

samuelklee
Contributor

@samuelklee commented Jan 14, 2019

Naive subsampling of the data in the ModelSegments MCMC (i.e., limiting per-segment and global quantities to 1000 points randomly sampled each MCMC iteration and rescaling likelihoods accordingly) was implemented in #3913 to bring WGS runtime down to reasonable levels. However, this sort of naive subsampling does not accurately preserve the posterior, which leads to some artifacts in posterior estimation. @MartonKN suspected that this negatively affected downstream performance in his caller, since weights of larger segments were underestimated.

For example, the copy-ratio posterior widths should scale with the inverse square root of the number of copy-ratio bins in each segment. However, subsampling yields an artificial break at 1000 bins and screws up the scaling:

[Figure cr-ss: copy-ratio posterior width vs. number of bins per segment, showing an artificial break at 1000 bins under naive subsampling]
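The expected scaling is easy to sanity check. A minimal back-of-the-envelope sketch (illustrative only, not GATK code), assuming a normal likelihood with known per-bin noise `sigma` and a flat prior, in which case the posterior standard deviation of the segment mean is `sigma / sqrt(numBins)`:

```java
// Illustrative scaling check (hypothetical class, not GATK code): under a normal
// likelihood with known per-bin noise sigma and a flat prior, the posterior
// standard deviation of a segment's mean shrinks as 1 / sqrt(numBins), so a
// segment with 4x the bins should have half the posterior width.
public class PosteriorWidthScaling {
    public static double posteriorWidth(final double sigma, final int numBins) {
        return sigma / Math.sqrt(numBins);
    }

    public static void main(String[] args) {
        for (final int numBins : new int[]{100, 1000, 10000}) {
            System.out.printf("bins=%d width=%.4f%n", numBins, posteriorWidth(1.0, numBins));
        }
    }
}
```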

To fix this, I implemented minibatch slice sampling as described in http://proceedings.mlr.press/v33/dubois14.pdf. This uses early stopping of sampling, as determined by a simple statistical test, to perform approximate sampling of the posterior in a way that is better behaved:

[Figure cr-mb: copy-ratio posterior width vs. number of bins per segment under minibatch slice sampling, with the artificial break removed]

Note that the scaling levels off for larger segments, but the approximation can be made exact by taking the appropriate parameter to zero (here, this parameter is set to 0.1). However, since subsampling parameters were not exposed in the old code, I have not exposed the parameters for the approximation here. We can do this in a future PR if desired. Changing these parameters can affect runtime and results, but I've set them to reasonable values for now.
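For readers unfamiliar with the reference, a rough sketch of the early-stopping test it describes follows. The class and parameter names are hypothetical, not the actual MinibatchSliceSampler API, and the Gaussian tail bound is a simplification of the paper's sequential test:

```java
// Hedged sketch (hypothetical names) of the early-stopping acceptance test at the
// heart of minibatch slice sampling (DuBois et al. 2014). Slice sampling accepts a
// proposed point when the mean per-datum log-likelihood difference exceeds a
// threshold mu0 derived from the slice height; the minibatch version estimates
// that mean from growing subsamples and stops as soon as a crude Gaussian tail
// bound says the decision is unlikely to change. Taking epsilon -> 0 forces the
// test to consume all of the data, recovering the exact decision.
public class MinibatchAcceptanceSketch {
    public static boolean accept(final double[] perDatumLogLikelihoodDiffs,
                                 final double mu0,
                                 final int batchSize,
                                 final double epsilon) {
        final int numData = perDatumLogLikelihoodDiffs.length;
        final double tCritical = Math.sqrt(-2. * Math.log(epsilon));  // Chernoff-style bound
        int n = 0;
        double sum = 0.;
        double sumOfSquares = 0.;
        while (n < numData) {
            final int end = Math.min(n + batchSize, numData);
            for (int i = n; i < end; i++) {
                sum += perDatumLogLikelihoodDiffs[i];
                sumOfSquares += perDatumLogLikelihoodDiffs[i] * perDatumLogLikelihoodDiffs[i];
            }
            n = end;
            final double mean = sum / n;
            final double variance = Math.max(0., sumOfSquares / n - mean * mean);
            final double standardError = Math.sqrt(variance / n);
            // stop early once the estimate is confidently on one side of mu0
            if (standardError == 0. || Math.abs(mean - mu0) / standardError > tCritical) {
                return mean > mu0;
            }
        }
        return sum / numData > mu0;  // consumed all data: exact decision
    }
}
```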

The implementation involved 1) creating an abstract class to extract some common functionality shared with the old batch SliceSampler (which is now no longer used in production code), 2) implementing the MinibatchSliceSampler as described in the above reference, and 3) adding some hash-based caching functionality to both the batch/minibatch implementations, as well as to the allele-fraction likelihood calculations (see related discussion in #2860). I also made a few miscellaneous improvements to code style, etc.
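The caching idea can be sketched roughly as follows (hypothetical names, not the actual GATK implementation): memoize the expensive log-likelihood by input value, so repeated slice-sampler evaluations at the same point skip the full pass over the data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Hedged sketch (illustrative class, not the GATK implementation) of memoizing an
// expensive log-likelihood so that repeated evaluations at the same parameter
// value hit a cache instead of recomputing over all data points.
public final class CachingLogLikelihood implements DoubleUnaryOperator {
    private final DoubleUnaryOperator logLikelihood;
    private final Map<Double, Double> cache = new HashMap<>();

    public CachingLogLikelihood(final DoubleUnaryOperator logLikelihood) {
        this.logLikelihood = logLikelihood;
    }

    @Override
    public double applyAsDouble(final double x) {
        // compute once per distinct x, then serve cached values
        return cache.computeIfAbsent(x, logLikelihood::applyAsDouble);
    }
}
```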

This is a relatively sizable change and can rather dramatically change the number of segments remaining after smoothing, etc. (although primarily on small scales and probably well within the noise). I will rerun the TCGA SNP array evaluations to make sure there are no negative effects on performance from this change or those introduced in #5556. @LeeTL1220 should also run some tests. The branch might require some further tweaking based on the results.

@codecov-io

codecov-io commented Jan 14, 2019

Codecov Report

Merging #5575 into master will decrease coverage by 0.006%.
The diff coverage is 95.847%.

```
@@               Coverage Diff               @@
##              master     #5575       +/-   ##
===============================================
- Coverage     87.054%   87.048%   -0.006%
- Complexity     31452     31520       +68
===============================================
  Files           1921      1928        +7
  Lines         144995    145320      +325
  Branches       16064     16089       +25
===============================================
+ Hits          126224    126498      +274
- Misses         12929     12963       +34
- Partials        5842      5859       +17
```
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...er/tools/copynumber/CollectAllelicCountsSpark.java | 96% <ø> (ø) | 10 <0> (ø) ⬇️ |
| .../CreateReadCountPanelOfNormalsIntegrationTest.java | 96.196% <ø> (ø) | 39 <0> (ø) ⬇️ |
| ...ber/formats/collections/SimpleCountCollection.java | 100% <ø> (ø) | 12 <0> (ø) ⬇️ |
| ...s/copynumber/GermlineCNVCallerIntegrationTest.java | 90.909% <ø> (ø) | 7 <0> (ø) ⬇️ |
| ...mber/denoising/HDF5SVDReadCountPanelOfNormals.java | 88.406% <ø> (ø) | 41 <0> (ø) ⬇️ |
| ...s/copynumber/CollectReadCountsIntegrationTest.java | 87.879% <ø> (ø) | 9 <0> (ø) ⬇️ |
| ...s/copynumber/models/CopyRatioModellerUnitTest.java | 100% <ø> (ø) | 10 <0> (ø) ⬇️ |
| ...rmats/collections/AnnotatedIntervalCollection.java | 67.47% <ø> (ø) | 14 <0> (ø) ⬇️ |
| ...nder/utils/mcmc/GibbsSamplerCopyRatioUnitTest.java | 89.922% <ø> (ø) | 13 <0> (ø) ⬇️ |
| ...ools/copynumber/formats/records/LegacySegment.java | 46.875% <ø> (ø) | 7 <0> (ø) ⬇️ |

... and 51 more

@samuelklee
Contributor Author

TCGA SNP validation looks about the same on WES---perhaps a slight tradeoff of sensitivity for specificity at the 0.1% level, but nothing to write home about. WGS validations are still running due to the (intermittent?) NIO failures discussed elsewhere.

@samuelklee
Contributor Author

The WGS validations that have completed so far look fine, though. Changes at the 0.1% level, mostly improvements, but nothing appears to be obviously broken. Should still run some of the other validations to be sure, but let's get this in for 4.1.

@samuelklee
Contributor Author

WGS validations are done and still look fine. @LeeTL1220 want to run your validations? 4.0.12.0 Docker + override jar should do the trick. In any case, I think the review can proceed.

@LeeTL1220
Contributor

@samuelklee Kicked 'em off.

@LeeTL1220
Contributor

@samuelklee The results are definitely improved for the WES. WGS looks about the same. Here is the reproducibility WES test on HCC1143, which is improved.

[Figure: WES reproducibility test on HCC1143, showing improved results]


@LeeTL1220 left a comment


@samuelklee Good stuff. My requests are really minor, so feel free to merge after addressed.

```diff
@@ -163,8 +163,8 @@ void fitMCMC(final int numSamples, final int numBurnIn) {
         }
         final int numSegments = minorFractionsSamples.get(0).size();
         final List<ModeledSegment.SimplePosteriorSummary> posteriorSummaries = new ArrayList<>(numSegments);
-        for (int segment = 0; segment < numSegments; segment++) {
-            final int j = segment;
+        for (int segmentIndex = 0; segmentIndex < numSegments; segmentIndex++) {
```
Contributor

Thanks for doing this rename.

```java
 * a continuous, univariate, unimodal, unnormalized log probability density function
 * (assumed to be a posterior and specified by a prior and a likelihood),
 * hard limits on the random variable, a step width, a minibatch size, and a minibatch approximation threshold.
 * @param rng random number generator
```
Contributor

Can you add the usual `Never {@code null}` for parameters that are not allowed to be null?
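For example, a parameter line following that convention (illustrative only) might read:

```java
 * @param rng random number generator. Never {@code null}.
```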

```java
 * log probability density function, hard limits on the random variable, and a step width.
 * Creates a new sampler for a bounded univariate random variable, given a random number generator,
 * a continuous, univariate, unimodal, unnormalized log probability density function,
 * hard limits on the random variable, and a step width.
 * @param rng random number generator
```
Contributor

Can you specify which parameters can be null?

@samuelklee
Contributor Author

Done, thanks @LeeTL1220. Will merge when tests pass.
