model_utils.yield_stdev is very slow (#315)
Comments
**@alexander-held** commented:

Hi @lhenkelm, thanks for investigating and bringing this up! I have a few thoughts below.

The implemented method uses linear error propagation, but this is not the only possibility. Especially for models with many parameters, a sampling-based approach may be more suitable, as briefly described in #221. We could implement that, and it may be significantly faster for the setup you are describing. The usage of awkward arrays may also contribute to the overhead.
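A minimal sketch of what the sampling-based alternative could look like (my illustration, not the #221 proposal verbatim): draw parameter points from the post-fit multivariate normal and take the spread of the resulting yields. `expected_yields` is a hypothetical stand-in for the model evaluation.

```python
import numpy as np

def sampled_yield_stdev(expected_yields, best_fit, covariance, n_samples=10_000, seed=0):
    """Per-bin yield uncertainty from sampling instead of linear propagation.

    expected_yields: callable mapping a parameter vector to per-bin yields
    best_fit, covariance: post-fit parameter values and covariance matrix
    """
    rng = np.random.default_rng(seed)
    # correlated parameter points drawn from the post-fit distribution
    points = rng.multivariate_normal(best_fit, covariance, size=n_samples)
    # one model evaluation per sampled parameter point
    yields = np.asarray([expected_yields(pars) for pars in points])
    # the spread of the sampled yields approximates the propagated uncertainty
    return np.std(yields, axis=0)
```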
What does your post-fit plotting script look like, exactly? Results from `yield_stdev` are cached, so repeated calls for the same model and parameters should not recompute anything.

How does the performance change when removing the nested pair loop completely? I would have expected that this loop dominates the runtime. I cannot spontaneously think of a way to get rid of the loop completely, but it should be possible to pre-calculate some of the quantities used inside it, or to skip parameter pairs whose correlations are negligible.
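One possible pre-calculation, as a sketch: determine once which parameter pairs have a non-negligible correlation, so the loop only visits those (the threshold value here is illustrative).

```python
import numpy as np

def relevant_pairs(corr_mat, threshold=1e-5):
    """Index pairs (i, j) with i < j whose correlation exceeds the threshold."""
    # upper triangle above the diagonal: each unordered pair appears exactly once
    mask = np.triu(np.abs(corr_mat), k=1) > threshold
    return np.argwhere(mask)
```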
**@lhenkelm** commented:

Hi @alexander-held,

The alternatives look really interesting. I'll have a go at swapping them in in place of the current implementation. I am not sure that the awkward arrays are the issue: since the number of in-interpreter calls to the awkward array methods is so large, it could be that an implementation with pure numpy arrays is still similarly slow.
Oh, that is a very good point, I missed that this is already cached! The 8 calls are all for the same model and data:

```python
>>> pdf
<pyhf.pdf.Model object at 0x7f6e8f31c280>
>>> other
<pyhf.pdf.Model object at 0x7f6e8f336e50>
>>> pdf.spec == other.spec
True
>>> pdf == other
False
>>> hash(pdf) == hash(other)
False
>>> pdf in {other: 1}
False
```

(The code calling `yield_stdev` re-builds the model each time, so equal specifications still produce distinct objects and the cache never hits.)
I'll do the experiment of skipping the loop and let you know. Regarding the idea of skipping negligible correlations: I agree that requires careful thought, but I can try how far I get with pre-computing/masking and so on.

Thanks a lot for all of the suggestions! I will see how they work out and let you know.
**@alexander-held** commented:

I think this is almost safe; the only thing I can think of are the interpolation codes, which are not contained in the specification. I think we could cache with a combination of the spec plus these codes. Thanks a lot for bringing this up, I had not thought of this complication when re-building the model before.

I am benchmarking another setup I have, and pre-calculating the data for the loop already helps. I am going to try whether I can optimize it further, but it seems in any case there is a lot of performance to be gained here.

edit: It should be possible to do this via matrix multiplication, it's just a matter of setting it up correctly.
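A sketch of what spec-plus-interpolation-code caching could look like; whether the interpolation codes are best read from `model.config.modifier_settings` is an assumption on my part.

```python
import json

def model_cache_key(model):
    """Hashable key built from the model specification plus interpolation codes.

    The spec alone is not enough (see above): two models with equal specs but
    different interpolation codes would otherwise collide in the cache.
    """
    spec_key = json.dumps(model.spec, sort_keys=True)
    # interpolation codes are not part of the spec; assumed to live in the config
    interp_key = json.dumps(getattr(model.config, "modifier_settings", {}), sort_keys=True)
    return (spec_key, interp_key)
```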
**@lhenkelm** commented:

Thank you for the update! I had a go at re-running the profiling with some of the above suggestions monkey-patched in to replace the current implementation. I put the exact implementations I used in this gist.

I am going to hold off on trying my hand at optimizing the loop for now, since you are already investigating that.
**@alexander-held** commented:

See #316 for an update: thanks to help from @agoose77, the off-diagonal calculation can be fully vectorized. In my setup this now results in a runtime of 0.4 s for that part (compared to ~180 s previously). The bottleneck is now actually the initial setup of all yields for all variations; there is probably some further room for optimization there.
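The idea behind the off-diagonal vectorization, as a minimal sketch (not the exact #316 code): express the sum over parameter pairs as a single `einsum` contraction over the correlation matrix and the per-parameter yield variations.

```python
import numpy as np

def offdiag_contribution(deltas, corr_mat):
    """Sum over parameter pairs i != j of corr[i, j] * delta_i * delta_j, per bin.

    deltas: shape (n_params, n_bins), symmetrized yield variations per parameter
    corr_mat: shape (n_params, n_params), parameter correlation matrix
    """
    # zero out the diagonal so only cross-terms enter the sum
    corr_offdiag = corr_mat - np.diag(np.diag(corr_mat))
    # contract both parameter indices at once, leaving the bin axis
    return np.einsum("ij,ib,jb->b", corr_offdiag, deltas, deltas)
```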
I had noticed this as well and need to compare more closely.

As for whether the model evaluation can be batched: there is a way, but I believe it would involve re-creating the model (scikit-hep/pyhf#545). That could probably be done, and I would expect that to drastically speed up this approach.
**@lhenkelm** commented:

I am very glad to see this nice optimization. Thank you @agoose77 and @alexander-held!

I now tried a batched version of random sampling. It took 118 s for the use case that with 0.4.0 would take 137 s, so it is faster, but not by as much as I thought. But I also found that as I increase the number of toys, the interpolator call in normsys takes more and more time (more so than the rest of the model evaluation).
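For reference, a rough sketch of what batched toy sampling could look like, assuming the pyhf `batch_size` mechanism from scikit-hep/pyhf#545 with the numpy backend; the re-creation of the model and the exact shapes here are assumptions for illustration, not tested code.

```python
import numpy as np
import pyhf

def sampled_stdev_batched(spec, best_fit, covariance, n_toys=50_000, batch_size=25_000, seed=0):
    """Toy-based per-bin uncertainties, evaluating the model in batches."""
    # batching requires re-creating the model with a batch dimension (pyhf#545)
    model = pyhf.Model(spec, batch_size=batch_size)
    rng = np.random.default_rng(seed)
    toys = rng.multivariate_normal(best_fit, covariance, size=n_toys)
    yields = [
        # one call evaluates the expected yields for a whole batch of parameter points
        model.expected_actualdata(batch)
        # n_toys must be divisible by batch_size in this sketch
        for batch in toys.reshape(-1, batch_size, toys.shape[-1])
    ]
    return np.std(np.concatenate(yields), axis=0)
```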
**@lhenkelm** commented:

Maybe you already looked into it @alexander-held (if so, sorry for the noise), but I had a little bit of time and I also tried out vectorizing the initial setup of the yield variations (including batching the pyhf model evaluation). BTW, how do you feel about caching based on spec + interpcodes instead of the model directly? I'd be happy to submit a PR.
**@alexander-held** commented:

How many samples are you considering for this? I do not have a good feeling for how many are needed, but thought that one may get away with a fairly small number.

Is the bottleneck within the interpolator itself, and does it depend on the backend used?
Nice! We can try this for a few other setups, but it might very well be that even with model re-creation this is worth it in general as an additional optimization.
Yes please, a PR would be very welcome! I still need to understand the difference I saw after the refactor in #316, but we can include other improvements in separate PRs. The change to spec + interpcode makes perfect sense to me, I don't see a reason to stick with the current model-based caching.
**@lhenkelm** commented:

For the timings with random sampling above I used 50k samples. When batching, I split that into 2 batches of 25k samples each, to stay within RAM.
That is a good question. I only tried numpy here, and I never looked into how normsys is implemented, so I don't have an answer.
**@alexander-held** commented:

I found out why the tests in #316 failed for the per-channel uncertainties: there is a bug in the old implementation. The bug has been around since the original implementation in #189 as far as I can tell (see `src/cabinetry/model_utils.py`, lines 303 to 307 at commit f34c73a).

This was a performance optimization, introduced before per-channel uncertainties were calculated. The argument that the cross-term vanishes for two such parameters holds for the per-bin uncertainties, but not for the per-channel ones, where yields are summed over bins before squaring. I will implement a fix for this first for clarity (tracked in #323), and then replace it all with the new implementation (which matches the old results after this fix). I expect that the impact of this is generally small, since typically these skipped contributions are small.
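To spell out the cross-term argument (notation mine, assuming the skipped pairs are parameters acting on disjoint sets of bins): per bin, such a pair satisfies $\Delta_{i,b}\,\Delta_{j,b} = 0$, so its cross-term can safely be dropped. Per channel, however, yields are summed over bins before squaring, and the corresponding cross-term

$$\Big(\sum_b \Delta_{i,b}\Big)\,\rho_{ij}\,\Big(\sum_b \Delta_{j,b}\Big)$$

does not vanish in general, which is why the optimization breaks the per-channel uncertainties.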
**@alexander-held** commented:

I'd propose to close this issue via #316 and track this part in a new issue as an additional improvement. I can open a new issue for that once #316 is done.

I quite like the solution

```python
# every row of pars_square is the (float-cast) best-fit parameter vector ...
pars_square = np.tile(parameters.copy().astype(float), (model.config.npars, 1))
# ... with one parameter per row shifted up/down by its uncertainty via the diagonal matrix
up_variations_ak = _prepare_yield_variations(model, pars_square + np.diagflat(uncertainty))
down_variations_ak = _prepare_yield_variations(model, pars_square - np.diagflat(uncertainty))
```

for the parameter variations. All the vectorization (also #316) can potentially make the code quite a bit harder to read, and I think the comments you have in the gist are very helpful. I'll try to incorporate this information into #316 as well.
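(With this construction, row k of `pars_square` is the full best-fit parameter vector with only parameter k shifted by its uncertainty, so a single batched evaluation produces all up, or all down, variations at once rather than one model call per parameter; `_prepare_yield_variations` is the helper from the gist.)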
**@lhenkelm** commented:

I like that plan. Once #316 is in, the title of this issue will no longer be so appropriate, after all.

Thanks! I am glad you can see some use for it :)
**@alexander-held** commented:

I think you can drop the `copy()` there, since `astype` already returns a copy.
**@lhenkelm** commented:

You're probably right about that.
**@alexander-held** commented:

The batching of model calls for yield variations can be tracked in #325.
**Original issue** (posted by @lhenkelm):

First, thank you for providing and maintaining this nice package! I get a lot of utility out of it, in particular the `model_utils` module.

I use `cabinetry.model_utils.yield_stdev` in version 0.4.0 to get post-fit uncertainties for plots, and when I profiled my fitting code I noticed that most time is actually taken up preparing the plots after the fits are done. Specifically, the post-fit plotting script calls `yield_stdev` 8 times, and since every call takes roughly 17 s, in total the script runs for 160 s or so, most of which is spent in `yield_stdev`:

[profiling flame graph: the x-axis is cumulative time spent in a function and all the functions it calls in turn, which are placed lower on the y-axis]
The functions taking up all the time down-stack in `cabinetry.model_utils.yield_stdev` are a mix of numpy code and hooks for numpy provided by awkward-array. These seem individually quite fast, e.g. `__array_ufunc__` calls average 6 ms each; however, they are called very many times from `yield_stdev` (202,512 times in the case of `__array_ufunc__`).
From a first glance at the implementation, the nested loop over parameter pairs looks like a likely candidate for the bottleneck: my model has O(100) parameters, and most of them are at least a little bit correlated, so the number of non-zero off-diagonal elements in the correlation matrix is about 12k.
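For reference, the quantity assembled by linear error propagation can be written (notation mine: $\Delta_{i,b}$ is the symmetrized yield variation of parameter $i$ in bin $b$, and $\rho$ the parameter correlation matrix) as

$$\sigma_b^2 \;=\; \sum_{i,j} \Delta_{i,b}\,\rho_{ij}\,\Delta_{j,b} \;=\; \sum_i \Delta_{i,b}^2 \;+\; \sum_{i\neq j} \Delta_{i,b}\,\rho_{ij}\,\Delta_{j,b}\,,$$

where the second sum over parameter pairs is what the nested loop evaluates.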
Is it feasible to optimize `yield_stdev`, e.g. to replace the nested loop over ~6k parameter pairs with an equivalent operation over larger arrays, such that the looping is left to numpy?
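For illustration, the structure in question is a double loop over parameter pairs (schematic only, not the actual cabinetry code), which the vectorized `einsum` version discussed above replaces:

```python
import numpy as np

def offdiag_loop(deltas, corr_mat):
    """Schematic version of the nested pair loop over cross-terms."""
    n_params, n_bins = deltas.shape
    total_var = np.zeros(n_bins)
    for i in range(n_params):
        for j in range(i + 1, n_params):  # each unordered pair once
            if corr_mat[i, j] == 0:
                continue  # zero correlation contributes nothing
            # cross-term of linear error propagation; factor 2 covers (i, j) and (j, i)
            total_var += 2 * corr_mat[i, j] * deltas[i] * deltas[j]
    return total_var
```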