Output file merging

Alp edited this page May 19, 2022 · 2 revisions

PR #42 introduced a new file merging scheme. This page explains how to use it and how it differs from the old one.

In both the old and the new scheme, we assume that all necessary coffea input files are located in a single directory, indir.

Old scheme

In the old scheme, merging was hidden from the user. The user would simply call:

from bucoffea.util.plot import acc_from_dir
acc = acc_from_dir(indir)

and acc would be the merged accumulator. The merging would happen only the first time this command was called, and the result would then be cached in a separate cache file. The downside of this method is that everything is pkl based, so we can only load everything or nothing: it is not possible to load just one of the many histograms in the accumulator.

After loading, all histograms are in memory and are immediately accessible:

histo = acc['recoil'] # Gives the recoil histogram

New scheme

In the new scheme, merging is explicit. The user has to call bumerge to merge their inputs:

bumerge $indir -o ./path/to/output/directory -j 4

-o / --output specifies the output directory to use.
-j / --jobs specifies the number of parallel jobs to use for merging.

The merging now uses only a small amount of memory and can therefore easily be run in parallel threads.

Once everything has been merged, the user can access the merged outputs using a klepto dir_archive, which is just a cached dictionary using a directory as its backend.

from klepto.archives import dir_archive
acc = dir_archive(
    './path/to/output/directory',  # Same as the -o argument to bumerge
    serialized=True,
    compression=0,
    memsize=1e3,
)
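To illustrate why a directory backend makes it possible to read a single histogram, here is a minimal sketch using only the standard library. It is not klepto's actual on-disk layout and not the bucoffea implementation. The helper names and the "met" key are hypothetical, and plain lists stand in for histograms.

```python
import pickle
import tempfile
from pathlib import Path


def save_single_file(path, data):
    """Old-style layout: the whole dictionary goes into one pickle file."""
    with open(path, "wb") as f:
        pickle.dump(data, f)


def load_single_file(path):
    """Reading any entry requires unpickling the entire dictionary."""
    with open(path, "rb") as f:
        return pickle.load(f)


def save_per_key(directory, data):
    """Directory layout (illustrative only): one pickle file per key."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for key, value in data.items():
        with open(directory / f"{key}.pkl", "wb") as f:
            pickle.dump(value, f)


def load_one_key(directory, key):
    """Only the file belonging to the requested key is read."""
    with open(Path(directory) / f"{key}.pkl", "rb") as f:
        return pickle.load(f)


# Toy stand-ins for histograms; "met" is a made-up second key.
toy = {"recoil": [1, 2, 3], "met": [4, 5, 6]}

with tempfile.TemporaryDirectory() as tmp:
    save_single_file(Path(tmp) / "all.pkl", toy)
    save_per_key(Path(tmp) / "per_key", toy)

    everything = load_single_file(Path(tmp) / "all.pkl")      # all keys at once
    recoil_only = load_one_key(Path(tmp) / "per_key", "recoil")  # one key only
```

With the per-key layout, memory use scales with the histograms actually requested rather than with the size of the full accumulator.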

The only significant difference to before is that the user now has to explicitly load the keys before accessing them. Once loaded, the copy in memory is independent of the copy on disk, so operations like rebinning or dataset merging can be applied to the copy in memory without touching the files. Note that if load is called again, the copy in memory is replaced by the copy from disk, which erases any changes made to the working copy.

# Gives a KeyError:
histo = acc['recoil']

# Instead:
acc.load('recoil')
histo = acc['recoil']
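When several histograms are needed, the load-then-access pattern can be wrapped in a small helper. The sketch below is only an illustration: load_histograms is a hypothetical helper, not part of bucoffea or klepto, and FakeArchive is a toy stand-in that mimics the behaviour described above so that the example runs without any merged output on disk. With a real dir_archive, only the acc.load(name) and acc[name] calls shown earlier are assumed.

```python
import copy


def load_histograms(acc, names):
    """Load each requested key into memory, then return them as a dict.

    Hypothetical helper: it only relies on acc.load(name) and acc[name].
    """
    histograms = {}
    for name in names:
        acc.load(name)  # pull the key from disk into memory
        histograms[name] = acc[name]
    return histograms


class FakeArchive(dict):
    """Toy stand-in for an archive (NOT klepto): keys live 'on disk'
    until load() copies them into the in-memory dictionary."""

    def __init__(self, on_disk):
        super().__init__()
        self._on_disk = dict(on_disk)

    def load(self, *keys):
        for key in keys:
            # A fresh copy replaces whatever is currently in memory.
            self[key] = copy.deepcopy(self._on_disk[key])


acc = FakeArchive({"recoil": [1, 2, 3]})

missing_before_load = "recoil" not in acc  # nothing is in memory yet
histos = load_histograms(acc, ["recoil"])

acc["recoil"].append(99)        # modify the working copy only
modified = list(acc["recoil"])  # snapshot of the modified working copy

acc.load("recoil")              # reloading discards the modification
restored = acc["recoil"]
```

Because a second load replaces the working copy, it is safest to load each key once and keep the returned objects around while rebinning or merging datasets.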

Plotting

All changes have been propagated to stack_plot.py and lo_vs_nlo.py, so check these out for hints on how to manage the plotting migration.