Output file merging
PR #42 introduced a new file merging scheme. This page explains how to use it and how it differs from the old scheme.
In both the old and new schemes, we assume that all necessary coffea input files are located in one directory, indir.
In the old scheme, merging was hidden from the user. The user would simply call:
from bucoffea.util.plot import acc_from_dir
acc = acc_from_dir(indir)
and acc would be the merged accumulator. The merging happened only the first time this command was called, and the result was then cached in a separate cache file. The downside of this method is that everything is pkl-based, so we can only load everything or nothing: it is not possible to load just one of the many histograms in the accumulator.
After loading, all histograms are in memory and are immediately accessible:
histo = acc['recoil'] # Gives the recoil histogram
In the new scheme, merging is explicit. The user has to call bumerge to merge their inputs:
bumerge $indir -o ./path/to/output/directory -j 4
-o / --output specifies the output directory to use
-j / --jobs specifies the number of parallel jobs to use for merging
The merging now uses only a small amount of memory and can therefore easily be run in parallel threads.
Once everything has been merged, the user can access the merged outputs using a klepto dir_archive, which is just a cached dictionary using a directory as its backend.
from klepto.archives import dir_archive
acc = dir_archive(
'./path/to/output/directory', # Same as the -o argument to bumerge
serialized=True,
compression=0,
memsize=1e3,
)
The only significant difference to before is that the user now has to explicitly load each key before accessing it. Once loaded, the copy in memory is independent of the copy on disk, so operations like rebinning or dataset merging can be applied to the copy in memory. Note that if load is called again, the copy in memory is replaced by the copy from disk, erasing any changes made to the working copy.
# Gives a KeyError:
histo = acc['recoil']
# Instead:
acc.load('recoil')
histo = acc['recoil']
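The load semantics described above can be tried out without any merged inputs. The following is a minimal sketch using only the Python standard library: ToyDirArchive is a hypothetical stand-in written for this illustration, not klepto's implementation and not part of bucoffea, and plain lists stand in for histograms. It only mimics the three behaviours described here: a KeyError before load, a working copy that is independent of the disk copy, and a reload that discards in-memory changes.

```python
import pickle
import tempfile
from pathlib import Path


class ToyDirArchive:
    """Hypothetical toy model of a directory-backed archive (not klepto)."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)
        self._cache = {}  # working copies held in memory

    def store(self, key, value):
        """Write one entry to its own file on disk."""
        with open(self.path / f"{key}.pkl", "wb") as f:
            pickle.dump(value, f)

    def load(self, key):
        """Read one entry from disk, replacing any existing working copy."""
        with open(self.path / f"{key}.pkl", "rb") as f:
            self._cache[key] = pickle.load(f)

    def __getitem__(self, key):
        # Raises KeyError unless load(key) has been called before.
        return self._cache[key]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        acc = ToyDirArchive(tmp)
        acc.store("recoil", [1, 2, 3])  # stand-in for a merged histogram
        acc.store("met", [4, 5, 6])     # never loaded, so never held in memory

        try:
            acc["recoil"]
            raised = False
        except KeyError:
            raised = True               # access before load fails

        acc.load("recoil")
        acc["recoil"].append(99)        # change the working copy only
        modified = list(acc["recoil"])

        acc.load("recoil")              # reload: the change is discarded
        reloaded = list(acc["recoil"])
    return raised, modified, reloaded
```

Calling demo() walks through the full cycle. In practice, this means any rebinning or dataset merging should be redone after each call to load, or load should be called only once per key.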
All changes have been propagated to stack_plot.py and lo_vs_nlo.py, so check these out for hints on how to manage the plotting migration.