Improved memory usage in gCNV. #5781
Conversation
Prior to this, gCNV shard size and VM type had never been optimized. For WES cohort mode, we arbitrarily ran with:

37 shards of 200 samples by 5000 intervals on n1-standard-8s (8 CPU, 30GB memory, $0.08/hr), each taking ~2 hours = ~3 cents/sample.

This PR gets rid of a memory spike in the sampling of denoised copy ratios, fixes a memory leak by updating theano, and adds some theano flags that typically yield a factor of ~2 speedup (notably, the OpenMP elemwise flag, although we also get a slight boost from using numpy MKL). This allows us to run, e.g.:

2 shards of 50 samples by 100000 intervals on n1-standard-8s (8 CPU, 30GB memory, $0.08/hr), each taking ~5 hours = ~1.6 cents/sample.

For these runs we used a slightly larger interval list and 1/4 as many samples as in the first example, but because everything scales linearly, it's probably fair to compare the per-sample-and-interval costs; that works out to a factor of ~8 savings if we keep the shard size the same (see the cost sketch below). The cost was already satisfactory, but fixing the leak makes it easier to run scatters that are not so wide, which may be crucial for running the megaWDL. Adding the OpenMP flag also lets CPU scalability work as intended.

We can do a more systematic optimization for cost if desired, and we should also revalidate to make sure performance doesn't vary too much with shard size (from spot checking, it looks like marginal and/or single-bin calls may flicker on and off). Note that we still have not optimized inference for WES, although I believe @vruano has done some optimizations for WGS. @mwalker174 @vruano: for WGS with 2kb bins, I would expect the cost of the gCNV step to be ~10 cents per sample in cohort mode before inference optimizations, assuming we address #5716 to minimize disk costs.

@asmirnov239 can you review? And maybe you can address dCR output in PostprocessGermlineCNVCalls and expose the number of samples in a separate PR? We can make further changes to the dCR format there if needed.
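As a quick check on the per-sample cost arithmetic above, here is a minimal sketch. It assumes each shard processes the full sample set over its own interval subset (which is what the stated figures imply); the function name and the preemptible VM price are illustrative.

```python
# Sketch of the per-sample cost arithmetic from the description above.
# Assumes every shard processes the full sample set over its own interval subset.
def cost_in_cents_per_sample(num_shards: int, num_samples: int,
                             hours_per_shard: float,
                             vm_dollars_per_hour: float = 0.08) -> float:
    total_dollars = num_shards * hours_per_shard * vm_dollars_per_hour
    return 100.0 * total_dollars / num_samples

print(cost_in_cents_per_sample(37, 200, 2.0))  # ~3.0 cents/sample (old WES configuration)
print(cost_in_cents_per_sample(2, 50, 5.0))    # ~1.6 cents/sample (after this PR)
```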
@lucidtronix @cmnbroad @jamesemery also note that I went ahead and switched over to numpy MKL, I'm assuming you have no objections!
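For context, here is a minimal sketch of how the theano flags mentioned above (OpenMP elemwise and an MKL-backed BLAS) can be set via `THEANO_FLAGS`. The specific values are assumptions for illustration, not necessarily the configuration shipped in the GATK conda environment.

```python
# Illustrative THEANO_FLAGS setup; the values here are assumptions, not the exact
# GATK configuration. Flags must be set before theano is imported for the first time.
import os

os.environ["THEANO_FLAGS"] = ",".join([
    "device=cpu",
    "openmp=True",                   # allow OpenMP-parallelized elemwise ops
    "openmp_elemwise_minsize=200",   # hypothetical threshold for parallelizing elemwise
    "blas.ldflags=-lmkl_rt",         # link BLAS against MKL (e.g. numpy/MKL from conda defaults)
])

import theano  # noqa: E402  (imported only after the flags are set)
print(theano.config.openmp)
```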
Codecov Report
@@ Coverage Diff @@
## master #5781 +/- ##
===============================================
- Coverage 86.999% 86.998% -0.001%
Complexity 32110 32110
===============================================
Files 1974 1974
Lines 147249 147249
Branches 16218 16218
===============================================
- Hits 128105 128104 -1
Misses 13236 13236
- Partials 5908 5909 +1
@samuelklee Back to you!
            num_samples: number of samples to draw

        Returns:
            A generator that will yield `num_samples` samples from an approximation to a posterior
        """
        return (model_approx.sample()[model_var_name] for _ in range(num_samples))

        sample = stochastic_node_mean_symbolic(model_approx, node, size=1)
Can you try using approximation.sample_node method instead?
Reminder to self to recheck for memory spike as we discussed.
No spike, and the results are exactly the same when using sample_node. Sorry for not realizing earlier that the other method was cribbed from this one. Interestingly, I also checked whether compilation affects memory usage: it doesn't. Compilation can take a non-trivial amount of time for short-running shards, so we might experiment with distributing a precompiled model for fixed shard sizes in production. Not sure if this works in practice or what the effect of getting VMs on different architectures might be, though.
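For reference, a rough sketch (not the actual gcnvkernel change) of how pymc3's `Approximation.sample_node` can be used to draw samples one at a time from a single compiled function; the names `model_approx`, `node`, and `num_samples` follow the snippet under review.

```python
# Sketch of per-draw sampling via sample_node; not the exact code in this PR.
# `model_approx` is a pymc3 variational approximation (e.g. MeanField) and
# `node` is a symbolic quantity in the underlying model.
import theano
import pymc3 as pm


def sample_node_draws(model_approx: pm.MeanField, node, num_samples: int):
    """Yield `num_samples` draws of `node` from the approximate posterior."""
    sampled_node = model_approx.sample_node(node, size=1)  # symbolic single draw
    draw_fn = theano.function([], sampled_node)            # compile once, reuse per draw
    for _ in range(num_samples):
        yield draw_fn()[0]  # each call produces a fresh stochastic draw
```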
The branch was force-pushed from db9bad5 to f772b3b.
Thanks @asmirnov239, back to you!
Looks good @samuelklee! Thanks for fixing it.
- Changed sampling of denoised copy ratios to address memory spike and updated output formats and filenames. Partially addresses #5754.
- Updated theano version to 1.0.4 and changed numpy install source to conda defaults to enable MKL.
- Updated theano flags to use MKL and OpenMP elemwise.
Closes #5764.