This package is used by the ATLAS HH->bbyy search.
It defines a classifier to choose which non-b-tagged jet to pair with the b-tagged jet in 1-tag events.
This code uses the skTMVA
class from https://github.com/yuraic/koza4ok to convert scikit-learn
output into ROOT TMVA
xml input.
This package relies on two different Machine Learning libraries to train BDTs, and the user can control which one to use. One is scikit-learn
; the other is TMVA
, a CERN specific data science package included in ROOT
.
Both ROOT
and scikit-learn
are required even if the user decides to only use one of the two for the ML portion of the project. ROOT
is required to open the .root
input files and scikit-learn
is used in different capacities during the data pre-processing stages.
Other necessary libraries include:
For updated requirements check the requirements.txt file.
The most recent set of input TTrees are in:
/afs/cern.ch/user/j/jrobinso/work/public/ML_inputs
There are two main script to execute: run_classifier.py
and evaluate_event_performance.py
. These need to be executed sequentially whenever the output of the former looks satisfactory.
This is the main script. It is composed of different parts that handle data processing, plotting, training and/or testing. This script connects all the steps in the pipeline.
There are multiple options and flags that can be passed to this script.
usage: run_classifier.py [-h] --input INPUT [INPUT ...] [--tree TREE]
[--output OUTPUT]
[--exclude VARIABLE_NAME [VARIABLE_NAME ...]]
[--strategy STRATEGY [STRATEGY ...]] [--grid_search]
[--ftrain FTRAIN] [--training_sample TRAINING_SAMPLE]
[--max_events MAX_EVENTS]
Run ML algorithms over ROOT TTree input
optional arguments:
-h, --help show this help message and exit
--input INPUT [INPUT ...]
List of input file names
--tree TREE Name of the tree in the ntuples. Default: events_1tag
--output OUTPUT Output directory. Default: output
--exclude VARIABLE_NAME [VARIABLE_NAME ...]
List of variables that are present in the tree but
should not be used by the classifier
--strategy STRATEGY [STRATEGY ...]
Type of BDT to use. Options are: RootTMVA, sklBDT.
Default: both
--grid_search Pass this flag to run a grid search to determine BDT
parameters
--ftrain FTRAIN Fraction of events to use for training. Default: 0.6.
Set to 0 for testing only.
--training_sample TRAINING_SAMPLE
Directory with pre-trained BDT to be used for testing
--max_events MAX_EVENTS
Maximum number of events to use (for debugging).
Default: all
Event level performance evaluation quantified in terms of Asimov significance.
It compares the performance of three old strategies (mHmatch
, pThigh
, pTjb
) with that of the BDT. The BDT performance is evaluated after excluding events in which the highest BDT score is < threshold. For many threshold values, the performance can be computed in paralled.
It outputs a plot and a pickled dictionary.
usage: evaluate_event_performance.py [-h] [--strategy STRATEGY]
[--category CATEGORY]
[--intervals INTERVALS]
Check event level performance
optional arguments:
-h, --help show this help message and exit
--strategy STRATEGY Strategy to evaluate. Options are: root_tmva, skl_BDT.
Default: skl_BDT
--category CATEGORY Trained classifier to use for event-level evaluation.
Examples are: low_mass, high_mass. Default: low_mass
--intervals INTERVALS
Number of threshold values to test. Default: 21
This will train two BDTs (one with TMVA
, one with scikit-learn
) on 1000 events (for debugging purposes) from the Standard Model input files, without usin the Delta_phi_jb variable.
python run_classifier.py --input inputs/SM_bkg_photon_jet.root inputs/SM_hh.root --exclude Delta_phi_jb --strategy RootTMVA sklBDT --max_events 1000
Testing previously trained classifier on different individual BSM inputs -- NB. be careful about double counting here
Note: --ftrain 0
indicates that 0% of the input samples should be used for training; therefore, this will only test the performance of a previously trained net (the location of which is indicated by --training_sample
) on the input files specified after --input
.
python run_classifier.py --input inputs/X275_hh.root --exclude Delta_phi_jb --strategy sklBDT --ftrain 0 --training_sample SM_merged
python run_classifier.py --input inputs/X300_hh.root --exclude Delta_phi_jb --strategy sklBDT --ftrain 0 --training_sample SM_merged
python run_classifier.py --input inputs/X325_hh.root --exclude Delta_phi_jb --strategy sklBDT --ftrain 0 --training_sample SM_merged
python run_classifier.py --input inputs/X350_hh.root --exclude Delta_phi_jb --strategy sklBDT --ftrain 0 --training_sample SM_merged
python run_classifier.py --input inputs/X400_hh.root --exclude Delta_phi_jb --strategy sklBDT --ftrain 0 --training_sample SM_merged
By default, 60% of the input files will be used for training, 40% for testing.
python run_classifier.py --input inputs/*root --exclude Delta_phi_jb
The signal samples in the mass range between 275 and 350 GeV are commonly identified as the "low mass" samples. A classifier trained on those will then be identified by that tag low_mass
and placed in an homonymous folder.
python run_classifier.py --input inputs/SM_bkg_photon_jet.root inputs/X275_hh.root inputs/X300_hh.root inputs/X325_hh.root inputs/X350_hh.root --exclude Delta_phi_jb --output low_mass
Conversely, the Standard Model signal samples and the BSM sample with resonant mass of 400 GeV are commonly identified as the "high mass samples. A classifier trained on those will then be identified by that tag high_mass
and placed in an homonymous folder.
python run_classifier.py --input inputs/SM_bkg_photon_jet.root inputs/X400_hh.root inputs/SM_hh.root --exclude Delta_phi_jb --output high_mass
run_classifier.py
only handles the pipeline to the individual jet pair classification stage. A final step is to evaluate the event-level performance of the tagger, by selecting in each event the jet pair that has the highest BDT score and checking how often that corresponds to the correct pair. The event-level performance is then compared with that of other methods that were previosuly tested in the analysis.
python evaluate_event_performance.py --category low_mass --strategy root_tmva --intervals 21