Skip to content

Official implementation of "Accelerating Resonance Searches via Signature-Oriented Pre-training".

License

Notifications You must be signed in to change notification settings

jet-universe/sophon

Repository files navigation

Sophon

[Paper] [Dataset] [Model] [Colab]

Sophon (Signature-Oriented Pre-training for Heavy-resonance ObservatioN) is a method proposed for developing foundation AI models tailored for future usage in LHC experimental analyses. The approach focuses on pre-training a model using a comprehensive jet dataset designed to capture extensive jet signatures.

This work introduces:

  1. The JetClass-II dataset: a large-scale and comprehensive large-R jet dataset.
  2. The Sophon model: a Particle Transformer model pre-trained on a 188-class classification task utilizing the JetClass-II dataset.

Further details are provided below.

Introduction to JetClass-II

JetClass-II is a large-scale and comprehensive dataset covering extensive large-radius jet signatures and a wide range of jet $p_{\rm T}$ and mass values.

The dataset consists of three major parts:

  1. Res2P: Generic $X \to$ 2 prong resonant jets.
  2. Res34P: Generic $X \to$ 3 or 4 prong resonant jets.
  3. QCD: Jets from QCD multijet background.

Each part is further subdivided into detailed categories, indicating which partons, leptons, or combinations thereof initiated the jet.

jetclass2_table

The dataset can be downloaded from [HuggingFace]. The three major parts (Res2P, Res34P, and QCD) are separately packed and can be downloaded individually for ease of use. The sizes of the training sets are 20M, 86M, and 28M entries, respectively. The dataset also includes validation and test sets, with the sizes for training/validation/test following a 4:1:1 ratio.

Data files

Every 100k entries (jets) are stored in a Parquet file. A complete view of the JetClass-II data files are shown in the table below.

Type File name range File number total entries
Res2P, train Res2P_0000.parquetRes2P_0199.parquet 200 20M
Res2P, val Res2P_0200.parquetRes2P_0249.parquet 50 5M
Res2P, test Res2P_0250.parquetRes2P_0299.parquet 50 5M
Res34P, train Res34P_0000.parquetRes34P_0859.parquet 860 86M
Res34P, val Res34P_0860.parquetRes34P_1074.parquet 215 21.5M
Res34P, test Res34P_1075.parquetRes34P_1289.parquet 215 21.5M
QCD, train QCD_0000.parquetQCD_0279.parquet 280 28M
QCD, val QCD_0280.parquetQCD_0349.parquet 70 7M
QCD, test QCD_0350.parquetQCD_0419.parquet 70 7M

Quick dive into JetClass-II

Use [Colab] to inspect and visualize data in JetClass-II.

Here are some visualizations of jets marked with the top-5 probability scores interpreted by the Sophon model (see the Sophon model's section below).

jetclass2_vis_example

Generation details

The dataset is generated using MadGraph + Pythia + Delphes.

During the Delphes (fast simulation) step, the pileup (PU) effect, with an average of 50 PU interactions, is emulated to mimic the realistic LHC collision environment. The PUPPI algorithm is then applied to remove the PU, correcting the E-flow objects used to cluster jets. This distinguishes it from the original JetClass dataset. The Delphes card can be found in the jetclass2-generation repository.

The complete generation script (the one-stop MadGraph + Pythia + Delphes production) and the n-tuplizer script are provided in the jetclass2-generation repository to facilitate reproducibility.

Variable details

The JetClass-II dataset includes the following variables:

  1. part_*: Features for jet constituent particles (i.e., E-flow objects in Delphes).
  2. jet_*: Features for jets. A specific variable is jet_label, which indicates the label in 188 classes.
  3. genpart_*: Features for generator-level jet (GEN-jet) constituent particles. The GEN-jet is clustered from the stable particles generated by Pythia, excluding neutrinos, using the same clustering configuration. The GEN-jets are matched with jets based on angular separation. The entry is left empty if no matched GEN-jet is found.
  4. genjet_*: Jet-level features for the matched GEN-jet.
  5. aux_genpart_*: Auxiliary variables storing features of selected truth particles. Five types of particles are chosen if they are valid:
    1. The initial resonance $X$ (in both 2-prong and 3/4-prong resonance cases).
    2. The two secondary resonances $Y$ produced by $X$ ($X \to Y_1Y_2$) in the 3/4-prong resonance case.
    3. The direct decay products (partons and leptons) from $X$ and $Y$.
    4. The subsequent decay products of tau leptons in case (iii).
    5. The partons ($p_{\rm T}$ > 5 GeV) matched within a QCD jet.
**Expand to see detailed descriptions for JetClass-II variables and a comparison with JetClass variables.**
Variable Type Description Exists in JetClass?
For jet constituent particles
part_px vector<float> particle's $p_x$ ✔️
part_py vector<float> particle's $p_y$ ✔️
part_pz vector<float> particle's $p_z$ ✔️
part_energy vector<float> particle's energy ✔️
part_deta vector<float> difference in pseudorapidity $\eta$ between the particle and the jet axis ✔️
part_dphi vector<float> difference in azimuthal angle $\phi$ between the particle and the jet axis ✔️
part_d0val vector<float> particle's transverse impact parameter value $d_0$, in mm ✔️
part_d0err vector<float> error of the particle's transverse impact parameter $\sigma_{d_0}$, in mm ✔️
part_dzval vector<float> particle's longitudinal impact parameter value $d_z$, in mm ✔️
part_dzerr vector<float> error of the particle's longitudinal impact parameter $\sigma_{d_z}$, in mm ✔️
part_charge vector<int32_t> particle's electric charge ✔️
part_isElectron vector<bool> if the particle is an electron (abs(pid)==11) ✔️
part_isMuon vector<bool> if the particle is an muon (abs(pid)==13) ✔️
part_isPhoton vector<bool> if the particle is an photon (pid==22) ✔️
part_isChargedHadron vector<bool> if the particle is a charged hadron (charge!=0 && !isElectron && !isMuon) ✔️
part_isNeutralHadron vector<bool> if the particle is a neutral hadron (charge==0 && !isPhoton) ✔️
For jet
jet_pt float jet's transverse momentum $p_{\rm T}$ ✔️
jet_eta float jet's pseudorapidity $\eta$ ✔️
jet_phi float jet's azimuthal angle $\phi$ ✔️
jet_energy float jet's energy ✔️
jet_sdmass float jet's soft-drop mass ✔️
jet_nparticles int32_t number of jet constituent particles ✔️
jet_tau1 float jet's $N$-subjettiness variable $\tau_1$ ✔️
jet_tau2 float jet's $N$-subjettiness variable $\tau_2$ ✔️
jet_tau3 float jet's $N$-subjettiness variable $\tau_3$ ✔️
jet_tau4 float jet's $N$-subjettiness variable $\tau_4$ ✔️
jet_label int32_t jet's label index in JetClass-II, detailed in the above table 🆕
For GEN-jet constituent particles (if a GEN-jet is found matched to a jet)
genpart_px vector<float> particle's $p_x$ 🆕
genpart_py vector<float> particle's $p_y$ 🆕
genpart_pz vector<float> particle's $p_z$ 🆕
genpart_energy vector<float> particle's energy 🆕
genpart_jet_deta vector<float> difference in pseudorapidity $\eta$ between the particle and the jet (not the GEN-jet) axis 🆕
genpart_jet_dphi vector<float> difference in azimuthal angle $\phi$ between the particle and the jet (not the GEN-jet) axis 🆕
genpart_x vector<float> $x$ coordinate of the particle’s production vertex, in mm 🆕
genpart_y vector<float> $y$ coordinate of the particle’s production vertex, in mm 🆕
genpart_z vector<float> $z$ coordinate of the particle’s production vertex, in mm 🆕
genpart_t vector<float> $t$ coordinate of the particle’s production vertex, in mm/c 🆕
genpart_pid vector<int32_t> particle's PDGID 🆕
For GEN-jet (if matched to a jet)
genjet_pt float GEN-jet's transverse momentum $p_{\rm T}$ 🆕
genjet_eta float GEN-jet's pseudorapidity $\eta$ 🆕
genjet_phi float GEN-jet's azimuthal angle $\phi$ 🆕
genjet_energy float GEN-jet's energy 🆕
genjet_sdmass float GEN-jet's soft-drop mass 🆕
genjet_nparticles int32_t number of GEN-jet constituent particles 🆕
For selected truth particles
aux_genpart_pt vector<float> selected truth particles' $p_{\rm T}$ ✔️ (different rules to select truth particles)
aux_genpart_eta vector<float> selected truth particles' $\eta$ ✔️ (different rules to select truth particles)
aux_genpart_phi vector<float> selected truth particles' $\phi$ ✔️ (different rules to select truth particles)
aux_genpart_mass vector<float> selected truth particles' mass ✔️ (different rules to select truth particles)
aux_genpart_pid vector<int32_t> selected truth particles' PDGID 🆕
aux_genpart_isResX vector<bool> if the particle is the initial resonance $X$ 🆕
aux_genpart_isResY vector<bool> if the particle is the secondary resonance $Y$ 🆕
aux_genpart_isResDecayProd vector<bool> if the particle is the direct decay product (parton and lepton) from $X$ and $Y$ 🆕
aux_genpart_isTauDecayProd vector<bool> if the particle is the subsequent decay product of tau leptons 🆕
aux_genpart_isQcdParton vector<bool> if the particle is the parton with $p_{\rm T}$ > 5 GeV stored in the QCD jet case 🆕

Pre-training Sophon model

Install dependencies

The Sophon model is based on the ParT architecture. It is implemented in PyTorch, with training based on the weaver framework for dataset loading and transformation. To install weaver, run:

pip install git+https://github.com/hqucms/weaver-core.git@dev/custom_train_eval

Note: We are temporarily using a development branch of weaver.

For instructions on setting up Miniconda and installing PyTorch, refer to the weaver page.

Download dataset

Download the JetClass-II dataset from [HuggingFace]. The training and validation files are used in this work, while the test files are not used.

Ensure that all data files are accessible from:

./datasets/JetClassII/Pythia/{Res2P,Res34P,QCD}_*.parquet

Training

Step 1: Generate dataset sampling weights according to the weights section in the data configuration. The processed config with pre-calculated weights will be saved to data/JetClassII.

./train_sophon.sh make_weight

Step 2: Start training.

./train_sophon.sh train

Note: Depending on your machine and GPU configuration, additional settings may be useful. Here are a few examples:

  • Enable PyTorch's DDP for parallel training, e.g., CUDA_VISIBLE_DEVICES=0,1,2,3 DDP_NGPUS=4 ./train_sophon.sh train --start-lr 2e-3 (the learning rate should be scaled according to DDP_NGPUS).
  • Configure the number of data loader workers and the number of splits for the entire dataset. The script uses the default configuration --num-workers 5 --data-split-num 200, which means there are 5 workers, each responsible for processing 1/5 of the data files and reading the data synchronously; the data assigned to each worker is split into 200 parts, with each worker sequentially reading 1/200 of the total data in order.

Step 3 (optional): Convert the model to ONNX.

./train_sophon.sh convert

Using Sophon model (Python/C++)

We introduce two methods for inferring the Sophon model: using Python and C++ (with C++ macros for analyzing Delphes files).

Python workflow

Please refer to our Jupyter notebook example on [Colab] for detailed instructions. See the section "Inferring Sophon model" for more information.

C++ workflow for analyzing Delphes files

For details on using the C++ workflow, please see the ./analyzers directory.

Citation

If you use the JetClass-II dataset or the Sophon model, please cite:

@article{Li:2024htp,
    author = "Li, Congqiao and Agapitos, Antonios and Drews, Jovin and Duarte, Javier and Fu, Dawei and Gao, Leyun and Kansal, Raghav and Kasieczka, Gregor and Moureaux, Louis and Qu, Huilin and Suarez, Cristina Mantilla and Li, Qiang",
    title = "{Accelerating Resonance Searches via Signature-Oriented Pre-training}",
    eprint = "2405.12972",
    archivePrefix = "arXiv",
    primaryClass = "hep-ph",
    month = "5",
    year = "2024"
}

About

Official implementation of "Accelerating Resonance Searches via Signature-Oriented Pre-training".

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published