
Added MinibatchSliceSampler and replaced naive subsampling in ModelSegments. #5575

Merged
merged 3 commits on Jan 18, 2019

Conversation

samuelklee
Contributor

@samuelklee commented Jan 14, 2019

Naive subsampling of the data in the ModelSegments MCMC (i.e., limiting per-segment and global quantities to 1000 points randomly sampled each MCMC iteration and rescaling likelihoods accordingly) was implemented in #3913 to bring WGS runtime down to reasonable levels. However, this sort of naive subsampling does not accurately preserve the posterior, which leads to some artifacts in posterior estimation. @MartonKN suspected that this negatively affected downstream performance in his caller, since weights of larger segments were underestimated.

For example, the copy-ratio posterior widths should scale with the inverse square root of the number of copy-ratio bins in each segment. However, subsampling yields an artificial break at 1000 bins and screws up the scaling:

[Figure cr-ss: copy-ratio posterior width vs. number of bins per segment, showing an artificial break at 1000 bins under naive subsampling]
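The expected scaling is easy to sanity check. A minimal back-of-the-envelope sketch (illustrative only, not GATK code), assuming a normal likelihood with known per-bin noise `sigma` and a flat prior, in which case the posterior standard deviation of the segment mean is `sigma / sqrt(numBins)`:

```java
// Illustrative scaling check (hypothetical class, not GATK code): under a normal
// likelihood with known per-bin noise sigma and a flat prior, the posterior
// standard deviation of a segment's mean shrinks as 1 / sqrt(numBins), so a
// segment with 4x the bins should have half the posterior width.
public class PosteriorWidthScaling {
    public static double posteriorWidth(final double sigma, final int numBins) {
        return sigma / Math.sqrt(numBins);
    }

    public static void main(String[] args) {
        for (final int numBins : new int[]{100, 1000, 10000}) {
            System.out.printf("bins=%d width=%.4f%n", numBins, posteriorWidth(1.0, numBins));
        }
    }
}
```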

To fix this, I implemented minibatch slice sampling as described in http://proceedings.mlr.press/v33/dubois14.pdf. This uses early stopping of sampling, as determined by a simple statistical test, to perform approximate sampling of the posterior in a way that is better behaved:

[Figure cr-mb: copy-ratio posterior width vs. number of bins per segment under minibatch slice sampling, with the artificial break removed]

Note that the scaling levels off for larger segments, but the approximation can be made exact by taking the appropriate parameter to zero (here, this parameter is set to 0.1). However, since subsampling parameters were not exposed in the old code, I have not exposed the parameters for the approximation here. We can do this in a future PR if desired. Changing these parameters can affect runtime and results, but I've set them to reasonable values for now.
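For readers unfamiliar with the reference, a rough sketch of the early-stopping test it describes follows. The class and parameter names are hypothetical, not the actual MinibatchSliceSampler API, and the Gaussian tail bound is a simplification of the paper's sequential test:

```java
// Hedged sketch (hypothetical names) of the early-stopping acceptance test at the
// heart of minibatch slice sampling (DuBois et al. 2014). Slice sampling accepts a
// proposed point when the mean per-datum log-likelihood difference exceeds a
// threshold mu0 derived from the slice height; the minibatch version estimates
// that mean from growing subsamples and stops as soon as a crude Gaussian tail
// bound says the decision is unlikely to change. Taking epsilon -> 0 forces the
// test to consume all of the data, recovering the exact decision.
public class MinibatchAcceptanceSketch {
    public static boolean accept(final double[] perDatumLogLikelihoodDiffs,
                                 final double mu0,
                                 final int batchSize,
                                 final double epsilon) {
        final int numData = perDatumLogLikelihoodDiffs.length;
        final double tCritical = Math.sqrt(-2. * Math.log(epsilon));  // Chernoff-style bound
        int n = 0;
        double sum = 0.;
        double sumOfSquares = 0.;
        while (n < numData) {
            final int end = Math.min(n + batchSize, numData);
            for (int i = n; i < end; i++) {
                sum += perDatumLogLikelihoodDiffs[i];
                sumOfSquares += perDatumLogLikelihoodDiffs[i] * perDatumLogLikelihoodDiffs[i];
            }
            n = end;
            final double mean = sum / n;
            final double variance = Math.max(0., sumOfSquares / n - mean * mean);
            final double standardError = Math.sqrt(variance / n);
            // stop early once the estimate is confidently on one side of mu0
            if (standardError == 0. || Math.abs(mean - mu0) / standardError > tCritical) {
                return mean > mu0;
            }
        }
        return sum / numData > mu0;  // consumed all data: exact decision
    }
}
```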

The implementation involved 1) creating an abstract class to extract some common functionality shared with the old batch SliceSampler (which is now no longer used in production code), 2) implementing the MinibatchSliceSampler as described in the above reference, and 3) adding some hash-based caching functionality to both the batch/minibatch implementations, as well as to the allele-fraction likelihood calculations (see related discussion in #2860). I also made a few miscellaneous improvements to code style, etc.
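The caching idea can be sketched roughly as follows (hypothetical names, not the actual GATK implementation): memoize the expensive log-likelihood by input value, so repeated slice-sampler evaluations at the same point skip the full pass over the data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Hedged sketch (illustrative class, not the GATK implementation) of memoizing an
// expensive log-likelihood so that repeated evaluations at the same parameter
// value hit a cache instead of recomputing over all data points.
public final class CachingLogLikelihood implements DoubleUnaryOperator {
    private final DoubleUnaryOperator logLikelihood;
    private final Map<Double, Double> cache = new HashMap<>();

    public CachingLogLikelihood(final DoubleUnaryOperator logLikelihood) {
        this.logLikelihood = logLikelihood;
    }

    @Override
    public double applyAsDouble(final double x) {
        // compute once per distinct x, then serve cached values
        return cache.computeIfAbsent(x, logLikelihood::applyAsDouble);
    }
}
```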

This is a relatively sizable change and can rather dramatically change the number of segments remaining after smoothing, etc. (although primarily on small scales and probably well within the noise). I will rerun the TCGA SNP array evaluations to make sure there are no negative effects on performance from this change or those introduced in #5556. @LeeTL1220 should also run some tests. The branch might require some further tweaking based on the results.

@codecov-io

codecov-io commented Jan 14, 2019

Codecov Report

Merging #5575 into master will decrease coverage by 0.006%.
The diff coverage is 95.847%.

```
@@               Coverage Diff               @@
##              master     #5575       +/-   ##
===============================================
- Coverage     87.054%   87.048%   -0.006%
- Complexity     31452     31520       +68
===============================================
  Files           1921      1928        +7
  Lines         144995    145320      +325
  Branches       16064     16089       +25
===============================================
+ Hits          126224    126498      +274
- Misses         12929     12963       +34
- Partials        5842      5859       +17
```
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...er/tools/copynumber/CollectAllelicCountsSpark.java | 96% <ø> (ø) | 10 <0> (ø) ⬇️ |
| .../CreateReadCountPanelOfNormalsIntegrationTest.java | 96.196% <ø> (ø) | 39 <0> (ø) ⬇️ |
| ...ber/formats/collections/SimpleCountCollection.java | 100% <ø> (ø) | 12 <0> (ø) ⬇️ |
| ...s/copynumber/GermlineCNVCallerIntegrationTest.java | 90.909% <ø> (ø) | 7 <0> (ø) ⬇️ |
| ...mber/denoising/HDF5SVDReadCountPanelOfNormals.java | 88.406% <ø> (ø) | 41 <0> (ø) ⬇️ |
| ...s/copynumber/CollectReadCountsIntegrationTest.java | 87.879% <ø> (ø) | 9 <0> (ø) ⬇️ |
| ...s/copynumber/models/CopyRatioModellerUnitTest.java | 100% <ø> (ø) | 10 <0> (ø) ⬇️ |
| ...rmats/collections/AnnotatedIntervalCollection.java | 67.47% <ø> (ø) | 14 <0> (ø) ⬇️ |
| ...nder/utils/mcmc/GibbsSamplerCopyRatioUnitTest.java | 89.922% <ø> (ø) | 13 <0> (ø) ⬇️ |
| ...ools/copynumber/formats/records/LegacySegment.java | 46.875% <ø> (ø) | 7 <0> (ø) ⬇️ |

... and 51 more

@samuelklee
Contributor Author

TCGA SNP validation looks about the same on WES---perhaps a slight tradeoff of sensitivity for specificity at the 0.1% level, but nothing to write home about. WGS validations are still running due to the (intermittent?) NIO failures discussed elsewhere.

@samuelklee
Contributor Author

The WGS validations that have completed so far look fine, though. Changes at the 0.1% level, mostly improvements, but nothing appears to be obviously broken. Should still run some of the other validations to be sure, but let's get this in for 4.1.

@samuelklee
Contributor Author

WGS validations are done and still look fine. @LeeTL1220 want to run your validations? 4.0.12.0 Docker + override jar should do the trick. In any case, I think the review can proceed.

@LeeTL1220
Contributor

@samuelklee Kicked 'em off.

@LeeTL1220
Contributor

@samuelklee The results are definitely improved for the WES. WGS looks about the same. Here is the reproducibility WES test on HCC1143, which is improved.

[Figure: WES reproducibility test on HCC1143, showing improved results]


@LeeTL1220 left a comment


@samuelklee Good stuff. My requests are really minor, so feel free to merge after addressed.

```diff
@@ -163,8 +163,8 @@ void fitMCMC(final int numSamples, final int numBurnIn) {
         }
         final int numSegments = minorFractionsSamples.get(0).size();
         final List<ModeledSegment.SimplePosteriorSummary> posteriorSummaries = new ArrayList<>(numSegments);
-        for (int segment = 0; segment < numSegments; segment++) {
-            final int j = segment;
+        for (int segmentIndex = 0; segmentIndex < numSegments; segmentIndex++) {
```
Contributor

Thanks for doing this rename.

```java
 * a continuous, univariate, unimodal, unnormalized log probability density function
 * (assumed to be a posterior and specified by a prior and a likelihood),
 * hard limits on the random variable, a step width, a minibatch size, and a minibatch approximation threshold.
 * @param rng random number generator
```
Contributor

Can you add the usual `Never {@code null}` for parameters that are not allowed to be null?
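For example, a parameter line following that convention (illustrative only) might read:

```java
 * @param rng random number generator. Never {@code null}.
```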

```java
 * log probability density function, hard limits on the random variable, and a step width.
 * Creates a new sampler for a bounded univariate random variable, given a random number generator,
 * a continuous, univariate, unimodal, unnormalized log probability density function,
 * hard limits on the random variable, and a step width.
 * @param rng random number generator
```
Contributor

Can you specify which parameters can be null?

@samuelklee
Contributor Author

Done, thanks @LeeTL1220. Will merge when tests pass.
