Skip to content

How To: Setting up analysis regions, histograms and cuts

Alp edited this page Jul 30, 2022 · 9 revisions

This section is meant to be a primer for the analysis setup using bucoffea framework, and also includes information about interacting with coffea histograms.

Each analysis definition is pretty much constrained the two files. Using the VBF H(inv) as an example, these files are:

  • vbfhinv/definitions.py: Histogram definitions, analysis regions, and so forth.
  • vbfhinv/vbfhinvProcessor.py: Contains the processor class that runs on input data, computes values, cuts and saves output histograms.

More details about the analysis setup is explained below.

Defining Analysis Cuts

We use a coffea.processor.PackedSelection object to keep track of the selections we use. This object stores the name of the cut, and the mask associated with it (that is, an array of True and False booleans for each event, specifying whether the event passed the cut or not). A given cut can be added to this object using selection.add() method, as described below.

Let's say we want to cut on dijet invariant mass, mjj > 200. So we compute mjj from our dijets, and compute the mask as follows:

# Here is our PackedSelection object
selection = processor.PackedSelection()

# See vbfhinvProcessor.py for specifics, this command computes the invariant mass for the two leading jets,
# and stores them in a 1D array
mjj = diak4.mass.max()

# Define the mask and add it to the selection object
selection.add("mjj", mjj > 200)

Note that in the processor files, the cut values (e.g. 200 here) are read from the configuration files, instead of being hard-coded into the script like that.

Note: Please note that the array that is passed as a mask (second argument of selection.add() call) must be a 1D array.

This is all pretty cool and handy, executing the code above will make PackedSelection object remember what our cut is. But how do we actually make use of these cuts and define regions?

Defining Analysis Regions

A region in each analysis is defined by a list of cuts to be applied (this could be the signal region, or say, dimuon control region). For each region, we get the events passing all the cuts, and save histograms. In order to achieve this in this framework, we define each region by a list of unique cut names. In practice, this is done in the vbfhinv_regions function, implemented here. We'll give a simpler example below.

Let's say we add 3 cuts to define a region as such:

selection = processor.PackedSelection()
selection.add("cut_1", ...)
selection.add("cut_2", ...)
selection.add("cut_3", ...)

Now, we can construct the region, and get the final mask for events passing the region selection as such:

region = ["cut_1", "cut_2", "cut_3"]
# This mask is a 1D array that stores True for an event passing ALL the selections, False otherwise
mask = selection.all(*region)

An important note: You can add more cuts to the PackedSelection object via selection.add, but if they are not referenced in any region definition, they won't be used at all. As an example, we could've done something like this:

# Let's add a fourth cut
selection.add("cut_4", ...)

# That cut is not in our region definition
region = ["cut_1", "cut_2", "cut_3"]

# The 4th cut will not be considered in the mask here
mask = selection.all(*region)

With the mask array, we can pick events passing the selections:

# Will return a filtered array for mjj, containing only the passing events
df["mjj"][mask]

# We can negate this array, to get only the FAILING events (if for some reason, we're interested)
df["mjj"][~mask] 

Next section will cover how to define histograms, so that we can fill those histograms with the events that pass our selection.

Defining Histograms

Histograms for the VBF H(inv) analysis are defined in the vbfhinv_accumulator function, located in definitions.py here. This function returns a dictionary-based accumulator that maps histogram names to histogram objects. And when the analysis is run, these histograms are filled. Note that if you're trying to fill a histogram that is not specified here, you'll get an error. Let's take a closer look at how the individual histograms are defined.

First thing to do is to define an axis for the histograms to use:

from coffea import hist

# Shortcuts 
Bin = hist.Bin
Cat = hist.Cat

# Set up a numerical axis
mjj_ax = Bin("mjj", r"$M_{jj}$ (GeV)", 150, 0, 7500)

# We can also set up categorical axes
dataset_ax = Cat("dataset", "Primary dataset")

With those defined, we can go ahead and set up our histograms:

Hist = hist.Hist

# Mjj histogram with 3 axes:
# - One categorical axis storing the dataset name
# - One categorical axis storing the region name
# - One numerical axis storing the mjj values

# "Counts" refers to the y-label of the histogram
h_mjj = Hist("Counts", dataset_ax, region_ax, mjj_ax)

Great! Now that we defined this histogram, we can go ahead and fill it:

# Let's say this is MET dataset, and we're in signal region ("sr_vbf") 
h_mjj.fill(
  dataset="MET_ver1_2017C",
  region="sr_vbf",
  mjj=df["mjj"][mask]
)

# If we miss a field (or add an undefined field), that is an error:
h_mjj.fill(
  dataset="MET_ver1_2017C",
  mjj=df["mjj"][mask]
)
# coffea will complain that the value for "region_ax" is missing

Reading Files & Plotting Histograms

Once the processor is ran, the output .coffea file can be loaded into a script and histogram contents can be plotted. If we're dealing with a single .coffea file, one can use the load(filepath) function, as explained below.

coffea provides useful functions to plot histograms. Once the histograms are set up and filled as explained in the previous sections, a few function calls can be used to plot them. For more detailed information on coffea histograms, the users can consult here.

The general idea when plotting histograms is typically as follows:

  • Obtain the histogram from the file
  • Integrate over the categorical axes (e.g. if you want to look at signal region only)
  • Plot the remaining numerical axis with plot1d

Users can consult the code snippet below for the workflow outlined above:

from matplotlib import pyplot as plt
from coffea import hist
from coffea.util import load

# Get the file
acc = load('/path/to/file.coffea')

# Get the histogram
h = acc['histoName']

# Integrate over the categorical axes of the histogram
# For this example, let's say we have dataset and region axes (typical in our histograms)
h = h.integrate('dataset', 'datasetName')
h = h.integrate('region', 'regionName')

# Plot the remaining numerical axis
fig, ax = plt.subplots()
hist.plot1d(h, ax=ax)

fig.savefig('plot.pdf')

Overlaying Several Plots

If the user wants to overlay several plots into one figure, that's also easily possible using plot1d. The function can be used as follows:

# Integrate over the dataset
h = h.integrate('dataset', 'datasetName')

# Plot different regions together in a single plot
hist.plot1d(h, overlay="region", ax=ax)

For more information about the plot1d function, it is recommended to check the documentation page of coffea here.

Integrating With Regular Expression

It is also possible to integrate an axis of a histogram using a compiled regular expression (instead of a string that is an exact match). For example, if we want to integrate over all datasets that match the regular expression Data.*2017, we can do the following:

import re

# The histogram will include contents of every dataset that matches Data.*2017 regex
h = h.integrate('dataset', re.compile('Data.*2017'))

Plotting Ratios

Using coffea.hist module, it is also easy to plot the ratio of two coffea histograms. For this, the plotratio function (documentation here) can be used. Following is an example, if we have two coffea histograms called h_num and h_denom:

from coffea import hist

fig, ax = plt.subplots()
hist.plotratio(
  h_num,
  h_denom,
  ax=ax,
  unc="num", # Use the uncertainty from numerator, check out the docs for possible values!
)

It is a possible use case to divide the figure into two sub-figures (top and bottom), and plot the histograms separately on the top figure, while plotting the ratio of the two on the bottom. This can be achieved by using the bucoffea workaround fig_ratio(), together with plotting functions from coffea:

from bucoffea.plot.util import fig_ratio
from coffea import hist

# Create the figure with the sub-figures
fig, ax, rax = fig_ratio()

# Plot histograms on top
hist.plot1d(h_num, ax=ax)
hist.plot1d(h_denom, ax=ax)

# Plot ratio on bottom
hist.plotratio(
  h_num,
  h_denom,
  ax=rax,
  unc="num", # Use the uncertainty from numerator, check out the docs for possible values!
)