Efficiency diagram #53
Comments
Ok here is my current solution:

pt = 1*np.random.randn(n) + 5
ps = np.random.randint(0,2,n)
e = Hist(bin('pt', 50, 0, 10), profile('eff'))
e.fill(pt=pt, eff=ps)
beside(e.step('pt'), e.marker("pt", "eff")).to(canvas)

Still, I have the problem that the errors are not binomial.
If the pass and total are coming from two different sources, then no, you can't use a cut axis. I hadn't realized that this is a use-case because from two sources you can't as easily control for accidentally lost events. The different axes guarantee that numerator and denominator have the same normalization; if they're in two different sources, you have to take care to avoid losses in whatever system is producing the two sources.

But okay, supposing you've done that, now what? There are two techniques for merging pre-filled histograms from different sources, neither of which will help you in your particular case. One (

I am working on another iteration of the histogram API right now, so it's good for me to know that there are uses like this that need to be made possible, but that doesn't help you right now.

Do you know about Physt? After writing histbook, I learned about more histogram tools in this ecosystem and have turned toward writing connectors to use them all together. Physt might have something that will help.

At worst, here's a hack: introduce a dummy variable

Needless to say, it wasn't supposed to be this complicated. Traditionally, we've had all these free-floating histograms (like the two in
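As a library-agnostic illustration of the two-source case discussed above, here is a minimal NumPy sketch. It is not histbook API, and it is not the dummy-variable hack mentioned in the comment. The arrays `var_total` and `var_pass` are randomly generated stand-ins, and the sketch assumes the passing sample is a strict subset of the total sample, so that no events were lost between the two sources.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two samples (assumption: var_pass is a subset of var_total)
var_total = rng.normal(5.0, 1.0, size=10000)
var_pass = var_total[rng.random(var_total.size) > 0.3]

# Use identical bin edges for both histograms so the ratio is taken bin by bin
edges = np.linspace(0.0, 10.0, 51)
n_total, _ = np.histogram(var_total, bins=edges)
n_pass, _ = np.histogram(var_pass, bins=edges)

# Efficiency per bin; empty bins stay NaN instead of dividing by zero
eff = np.full(n_total.shape, np.nan)
np.divide(n_pass, n_total, out=eff, where=n_total > 0)
```

Sharing one set of edges is what keeps numerator and denominator aligned; it does nothing to protect against lost events, which remains the caller's responsibility.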
Ok I see the problem. Actually my problem is not as complicated as I described before. So let's simplify the example from above:

data = pd.DataFrame(columns=['pt','found'])
data['pt']=1*np.random.randn(n) + 5
data['found']=[x>0.3 for x in np.random.rand(n)]
e = Hist(bin('pt', 50, 0, 10), profile('found'))
e.fill(pt=data['pt'], found=data['found'])
e.marker("pt", "found").to(canvas)

Now I still have the problem with the bin-to-bin uncertainties. I tried to export the histogram to a DataFrame and calculate the uncertainties manually:

# Calculating binomial errors sigma = 1/N sqrt( N e (1-e))
d['err(found)'] = 1/d['count()']*np.sqrt(d['count()']*d['found']*(1-d['found']))
print(d.head(20))

How can I get this column back to the histogram in order to draw it? You suggest to use
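For reference, the manual formula above, 1/N sqrt(N e (1-e)), simplifies to sqrt(e (1-e) / N). A minimal sketch with plain NumPy arrays follows; the helper name `binomial_error` is made up for illustration and is not part of histbook. Note that this normal approximation gives zero error for bins with e = 0 or e = 1, which is one reason interval methods are often preferred for efficiencies.

```python
import numpy as np

def binomial_error(n_total, eff):
    """Normal-approximation binomial error sqrt(e * (1 - e) / N) per bin.

    Hypothetical helper for illustration; bins with N == 0 are returned as NaN.
    """
    n_total = np.asarray(n_total, dtype=float)
    eff = np.asarray(eff, dtype=float)
    err = np.full(n_total.shape, np.nan)
    filled = n_total > 0
    err[filled] = np.sqrt(eff[filled] * (1.0 - eff[filled]) / n_total[filled])
    return err

# Example: one bin with 100 events at 70% efficiency, and one empty bin
err = binomial_error([100, 0], [0.7, np.nan])
print(err)
```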
Oh! You're free to construct the data as you'd like! Well, in that case, please do buy into the paradigm of having the axis apply the cut for you (which is not exactly what you did in your example above: a profile is not an efficiency; see below). I tried to walk through the "intended" method for making this sort of plot. What I found were several indications of incomplete work:
Here's how I managed to use it (with the bug-fix):

%matplotlib inline
import pandas
import numpy
from histbook import *
data = pandas.DataFrame(columns=["pt", "found"])
data["pt"] = numpy.random.randn(10000) + 5
data["found"] = (numpy.random.randn(10000) > 0.3) # do vectorized Numpy operations!
h = Hist(bin("pt", 50, 0, 10), cut("found"))
h.fill(data)
# get a DataFrame with "count()", "found", and "err(found)" as columns
# this DataFrame is the histogram (indexes are intervals)
df = h.pandas("found", error="normal")
# to plot it, however, we need a column of midpoints of those interval indexes
df["midpoints"] = [x[0].mid if isinstance(x[0], pandas.Interval) else numpy.nan for x in df.index]
# now scatter plot with error bars
df.plot.scatter(x="midpoints", y="found", yerr="err(found)")

About the future of this package: it will at least be rewritten. With uproot, it was pretty clear what the package needed to do and how it should be internally organized (though that went through one major revision, from v1 to v2). With histogramming, it's both a field crowded with alternatives (though you and one other user have expressed doubts about Physt, and Boost.Histogram's Python bindings aren't here yet), and it's also less clear what the package ought to do. HBOOK, PAW, and ROOT have a way of doing histograms in languages where it's not a performance bottleneck to fill one entry in a loop. What's the right calling convention when filling whole arrays at once? The histograms need to do some calculations internally since they now get a whole dataset or a chunk of a dataset at a time; they can't just take precalculated data in a

Moreover, is the HBOOK/PAW/ROOT way the right way anymore? That kind of API encourages the creation of many small objects, each a 1-, 2-, or 3-dimensional binning of the data, and profiles are a completely separate case from visualizing distributions (the legacy of the HPROF package). That was great for analyses when I was a grad student, in which all the data I wanted to fit were in a single histogram, but now physicists are performing massively combined fits using many histograms to specify signal and control regions and systematic variations of the same distribution. Hundreds of little histograms that have to be filled different ways then have to be gathered up and put into the right places in the fit, and the CMS Combine tool's configuration is getting very complicated. histbook attempts to solve some of that, but we're going to need to make big changes to iterate toward the right solution.
Instead of building "the right" histogramming library, I think it makes more sense at this point to make the right histogram representation so that users can move between histogramming libraries without manual translations at the border. So I've been focusing on the smaller problem of just representing histograms in as general a way as possible: not filling, adding, or plotting. (See https://github.com/diana-hep/histos, a histogram protocol in development using Google Flatbuffers.) Boost.Histogram will likely be the best histogram-filler, but it remains to be seen what is the most convenient way to manipulate them or plot them. Anything that goes into histbook will be raw material for that development. If, for instance, you want to add the Clopper-Pearson error calculation to histbook as a pull request, I'll accept it and it will make it that much easier to integrate it into future histogramming libraries. I just don't want to get your hopes up that histbook itself will be a fixed syntax going forward.
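To make the Clopper-Pearson calculation mentioned above concrete, here is a minimal standard-library sketch for a single bin with k passing out of n total. It is only an illustration, not code from histbook or a proposed patch, and the function names are made up. It finds the interval bounds by bisection on the exact binomial CDF, which is only practical for modest counts; for large bins the usual route is a beta-distribution quantile function such as `scipy.stats.beta.ppf`.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def solve_decreasing(f, target, tol=1e-12):
    """Bisection on [0, 1] for f(p) = target, where f decreases with p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid   # f is still too large, so the root lies at larger p
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(k, n, cl=0.95):
    """Clopper-Pearson interval (lower, upper) for k successes in n trials."""
    alpha = 1.0 - cl
    # Lower bound solves P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1.0 - alpha / 2)
    # Upper bound solves P(X <= k | p) = alpha/2
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

# Example: 7 passing events out of 10 in one bin
lower, upper = clopper_pearson(7, 10)
print(lower, upper)
```

Unlike the normal approximation, the interval is asymmetric and stays inside [0, 1], and it has nonzero width even when k = 0 or k = n.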
Thanks a lot for your work and your detailed explanations.
Hi, I would like to produce a Graph/Histogram similar to ROOT's TEfficiency. So I have two data samples var_total and var_pass. I thought one could probably use the cut axis but I didn't succeed. Furthermore, a correct error estimate (e.g. binomial errors) would be necessary. Any idea how to do this using histbook?