-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added MinibatchSliceSampler and replaced naive subsampling in ModelSegments. #5575
Conversation
9999097
to
efbfd3c
Compare
Codecov Report
@@ Coverage Diff @@
## master #5575 +/- ##
===============================================
- Coverage 87.054% 87.048% -0.006%
- Complexity 31452 31520 +68
===============================================
Files 1921 1928 +7
Lines 144995 145320 +325
Branches 16064 16089 +25
===============================================
+ Hits 126224 126498 +274
- Misses 12929 12963 +34
- Partials 5842 5859 +17
|
TCGA SNP validation looks about the same on WES---perhaps a slight tradeoff of sensitivity for specificity at the 0.1% level, but nothing to write home about. WGS validations are still running due to the (intermittent?) NIO failures discussed elsewhere. |
The WGS validations that have completed so far look fine, though. Changes at the 0.1% level, mostly improvements, but nothing appears to be obviously broken. Should still run some of the other validations to be sure, but let's get this in for 4.1. |
WGS validations are done and still look fine. @LeeTL1220 want to run your validations? 4.0.12.0 Docker + override jar should do the trick. In any case, I think the review can proceed. |
@samuelklee Kicked'em off |
@samuelklee The results are definitely improved for the WES. WGS looks about the same. Here is the reproducibility WES test on HCC1143, which is improved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samuelklee Good stuff. My requests are really minor, so feel free to merge after addressed.
@@ -163,8 +163,8 @@ void fitMCMC(final int numSamples, final int numBurnIn) { | |||
} | |||
final int numSegments = minorFractionsSamples.get(0).size(); | |||
final List<ModeledSegment.SimplePosteriorSummary> posteriorSummaries = new ArrayList<>(numSegments); | |||
for (int segment = 0; segment < numSegments; segment++) { | |||
final int j = segment; | |||
for (int segmentIndex = 0; segmentIndex < numSegments; segmentIndex++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this rename.
* a continuous, univariate, unimodal, unnormalized log probability density function | ||
* (assumed to be a posterior and specified by a prior and a likelihood), | ||
* hard limits on the random variable, a step width, a minibatch size, and a minibatch approximation threshold. | ||
* @param rng random number generator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the usual Never {@code null}
for parameters that are not allowed to be null.
* log probability density function, hard limits on the random variable, and a step width. | ||
* Creates a new sampler for a bounded univariate random variable, given a random number generator, | ||
* a continuous, univariate, unimodal, unnormalized log probability density function, | ||
* hard limits on the random variable, and a step width. | ||
* @param rng random number generator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you specify which parameters can be null?
Done, thanks @LeeTL1220. Will merge when tests pass. |
Naive subsampling of the data in the ModelSegments MCMC (i.e., limiting per-segment and global quantities to 1000 points randomly sampled each MCMC iteration and rescaling likelihoods accordingly) was implemented in #3913 to bring WGS runtime down to reasonable levels. However, this sort of naive subsampling does not accurately preserve the posterior, which leads to some artifacts in posterior estimation. @MartonKN suspected that this negatively affected downstream performance in his caller, since weights of larger segments were underestimated.
For example, the copy-ratio posterior widths should scale with the inverse square root of the number of copy-ratio bins in each segment. However, subsampling yields an artificial break at 1000 bins and screws up the scaling:
To fix this, I implemented minibatch slice sampling as described in http://proceedings.mlr.press/v33/dubois14.pdf. This uses early stopping of sampling as determined by a simple statistical test to perform approximate sampling of the posterior in a way that is more well behaved:
Note that the scaling levels off for larger segments, but the approximation can be made exact by taking the appropriate parameter to zero (here, this parameter is set to 0.1). However, since subsampling parameters were not exposed in the old code, I have not exposed the parameters for the approximation here. We can do this in a future PR if desired. Changing these parameters can affect runtime and results, but I've set them to reasonable values for now.
The implementation involved 1) creating an abstract class to extract some common functionality shared with the old batch SliceSampler (which is now no longer used in production code), 2) implementing the MinibatchSliceSampler as described in the above reference, and 3) adding some hash-based caching functionality to both the batch/minibatch implementations, as well as to the allele-fraction likelihood calculations (see related discussion in #2860). I also made a few miscellaneous improvements to code style, etc.
This is a relatively sizable change and can rather dramatically change the number of segments remaining after smoothing, etc. (although primarily on small scales and probably well within the noise). I will rerun the TCGA SNP array evaluations to make sure there are no negative effects on performance from this change or those introduced in #5556. @LeeTL1220 should also run some tests. The branch might require some further tweaking based on the results.