Smooth histograms — which methods are best? #3
Comments
This is a very pragmatic question and shows my ignorance about these things: Have you already tried to investigate the scaling behavior of your current KDE implementation (resource-wise)? Do you (need to) keep all the kernels of all data points in memory at the same time for this to work? (Put differently: "what happens if you fill a KDE-style histogram with 10M events?")
@philippwindischhofer In practice, our code looks something like:
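(A minimal sketch of the kind of binned-kde computation being described, assuming a Gaussian kernel and `jax.scipy`'s `norm.cdf`; the function and argument names are illustrative rather than the project's actual code:)

```python
# Illustrative sketch of a binned KDE: each event contributes the integral of a
# Gaussian kernel over each bin, computed as the difference of two normal CDFs.
# Names here are assumptions, not the project's code.
import jax.numpy as jnp
from jax.scipy.stats import norm

def bkde_hist(data, bin_edges, bandwidth):
    # data: (n_events,) observable values; bin_edges: (n_bins + 1,)
    lo = bin_edges[:-1][None, :]   # lower edges, broadcast over events
    hi = bin_edges[1:][None, :]    # upper edges
    centres = data[:, None]        # one kernel centred on each event
    # kernel mass falling inside each bin, shape (n_events, n_bins)
    per_event = norm.cdf(hi, loc=centres, scale=bandwidth) - norm.cdf(
        lo, loc=centres, scale=bandwidth
    )
    # smooth 'counts': sum the per-event contributions over all events
    return jnp.sum(per_event, axis=0)
```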
Since this is essentially just applying normcdf twice in a vectorised way, I imagine there wouldn't be any issues with scaling (but I haven't tried!)
I think use of KDEs is fine, but using them with a normal Poisson-based maximum-likelihood function as implemented in pyhf is mathematically wrong. Maximum-likelihood fits only produce unbiased estimates of the value and its uncertainty when the data in each bin are Poisson-distributed, which is not the case when you convert data into KDEs. If you add a single new event, it will not only increase the value in a single bin, but also in neighboring bins. These additional correlations are not included in the fitted model. Whether this is a problem in practice, I cannot say. The problem I describe may not affect the optimization significantly, but this point needs to be studied.
The bandwidth could be reduced throughout analysis optimization to slowly approach the discrete bin case. For the final inference step after the optimization, the KDE can be replaced by a traditional histogram. |
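A minimal sketch of such an annealing schedule, with purely illustrative names and values (e.g. an exponential decay from a wide starting bandwidth towards something below the bin width):

```python
# Illustrative bandwidth-annealing schedule (names/values are assumptions):
# keep the bandwidth wide early on for smooth gradients, then shrink it so
# the smooth histogram approaches the discrete-bin case.
def annealed_bandwidth(step, n_steps, bw_init=0.5, bw_final=0.05):
    frac = step / n_steps
    return bw_init * (bw_final / bw_init) ** frac
```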
btw, the per-bin pair of normcdf calls can be more efficiently written as a single vectorised evaluation over all bin edges.
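A sketch of one such vectorised form, under the assumption that the point is to evaluate the CDF once at all bin edges and take differences along the edge axis, rather than making two separate CDF calls per bin:

```python
# Possible "more efficient" variant (an assumption about what was meant):
# one CDF evaluation at all bin edges, then a difference along the edge axis.
import jax.numpy as jnp
from jax.scipy.stats import norm

def bkde_hist_diff(data, bin_edges, bandwidth):
    # CDF of each event's kernel at every bin edge, shape (n_events, n_edges)
    cdf_at_edges = norm.cdf(bin_edges[None, :], loc=data[:, None], scale=bandwidth)
    # adjacent differences give the per-bin mass; sum over events
    return jnp.sum(jnp.diff(cdf_at_edges, axis=1), axis=0)
```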
Following up on the discussion that took place during the kick-off meeting, an interesting point arose: which methods are best suited to smoothly approximating a set of binned counts?
In the case where one has a neural network-based observable, one could apply the softmax function

exp(nn / T) / sum(exp(nn / T))

(where T is a temperature hyperparameter) to the nn output, as one would do in a classification problem. This essentially puts every event in all bins, but weights each bin with the corresponding softmaxed nn output (normalized to 1 per event). The histogram is then the elementwise sum of the softmaxed outputs over all events.

An alternative to this is to take an approach using a kernel density estimate (kde), which is defined by instantiating a distribution (kernel) centered around each data point (e.g. a standard normal distribution) and averaging their contributions. The smoothness of the resulting estimate is controlled by a 'bandwidth' hyperparameter, which scales the widths of the kernels. Once you choose a bandwidth, you can get 'counts' by integrating the pdf over a set of intervals (bins); one can equivalently use the cumulative distribution function.
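As an illustrative sketch (not code from the project) of the softmax-based histogram described above, assuming the network produces one logit per bin for each event:

```python
# Illustrative softmax histogram: nn_outputs is assumed to have shape
# (n_events, n_bins), one logit per bin, and T is the temperature.
import jax
import jax.numpy as jnp

def softmax_hist(nn_outputs, T=1.0):
    # per-event bin weights that sum to 1; sharper (more histogram-like) as T -> 0
    weights = jax.nn.softmax(nn_outputs / T, axis=-1)
    # the smooth histogram is the element-wise sum over all events
    return jnp.sum(weights, axis=0)
```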
Revisiting the case with the nn-based observable, one could make the network output a single value (a regression net) and define a kde over the set of outputs from a batch of events. The question is then: is there any clear advantage, in expressivity or in principle, to using a kde, a softmax, or other methods?
Another interesting question is where these approaches break down; e.g. using a kde in this way seems to be very sensitive to the bandwidth.
It's also worth noting that a kde may make more sense for general use as a candidate for smooth histogramming, but the case of the nn observable may be more nuanced.