Commit

Merge pull request #67 from iris-hep/update-IRIS-HEP-projects-0131
Update iris hep projects 0131
davidlange6 authored Feb 1, 2024
2 parents 6981384 + 566cfe6 commit b47d9b0
Showing 20 changed files with 133 additions and 81 deletions.
6 changes: 4 additions & 2 deletions projects/agc-julia-rntuple.yml
@@ -25,8 +25,10 @@ program:
- IRIS-HEP fellow
shortdescription: Implement an analysis pipeline for the Analysis Grand Challenge (AGC) using [JuliaHEP](https://github.com/JuliaHEP/) ecosystem.
description: >
The project's main goal is to implement AGC pipeline using Julia to demonstrate usability and as a test of performance. New utility packages can be expected especially for systematics handling and out-of-core orchestration. (built on existing packages such as `FHist.jl` and `Dagger.jl`)
At the same time, the project can explore using `RNTuple` instead of `TTree` for AGC data storage. As the interface is exactly transparent, this goal mainly requires data conversion unless performance bugs are spotted. This will be help inform transition at LHC experiments in near future (Run 4).
The project's main goal is to implement AGC pipeline using Julia to demonstrate usability and as a test of performance. New utility packages can be expected
especially for systematics handling and out-of-core orchestration. (built on existing packages such as `FHist.jl` and `Dagger.jl`) At the same time, the
project can explore using `RNTuple` instead of `TTree` for AGC data storage. As the interface is exactly transparent, this goal mainly requires data
conversion unless performance bugs are spotted. This will be help inform transition at LHC experiments in near future (Run 4).
contacts:
- name: Jerry Ling
email: [email protected]
14 changes: 7 additions & 7 deletions projects/agc-physlite.yml
@@ -22,13 +22,13 @@ program:
- IRIS-HEP fellow
shortdescription: Create an Analysis Grand Challenge implementation using ATLAS PHYSLITE data
description: >
The IRIS-HEP Analysis Grand Challenge (AGC) is a realistic environment for investigating how high energy physics data analysis workflows scale to the demands of the High-Luminosity LHC (HL-LHC).
It captures relevant workflow aspects from data delivery to statistical inference.
The AGC has so far been based on publicly available Open Data from the CMS experiment.
The ATLAS collaboration aims to use a data format called PHYSLITE at the HL-LHC, which slightly differs from the data formats used so far within the AGC.
This project involves implementing the capability to analyze PHYSLITE ATLAS data within the AGC workflow and optimizing the related performance under large volumes of data.
In addition to this, the evaluation of systematic uncertainties for ATLAS with PHYSLITE is expected to differ in some aspects from what the AGC has considered thus far.
This project will also investigate workflows to integrate the evaluation of such sources of uncertainty within a Python-based implementation of an AGC analysis task.
The IRIS-HEP Analysis Grand Challenge (AGC) is a realistic environment for investigating how high energy physics data analysis workflows scale to the demands
of the High-Luminosity LHC (HL-LHC). It captures relevant workflow aspects from data delivery to statistical inference. The AGC has so far been based on
publicly available Open Data from the CMS experiment. The ATLAS collaboration aims to use a data format called PHYSLITE at the HL-LHC, which slightly differs
from the data formats used so far within the AGC. This project involves implementing the capability to analyze PHYSLITE ATLAS data within the AGC workflow and
optimizing the related performance under large volumes of data. In addition to this, the evaluation of systematic uncertainties for ATLAS with PHYSLITE is
expected to differ in some aspects from what the AGC has considered thus far. This project will also investigate workflows to integrate the evaluation of such
sources of uncertainty within a Python-based implementation of an AGC analysis task.
contacts:
- name: Matthew Feickert
email: [email protected]
10 changes: 6 additions & 4 deletions projects/agc-rdf.yml
@@ -26,10 +26,12 @@ program:
- IRIS-HEP fellow
shortdescription: Develop and test an analysis pipeline using ROOT's RDataFrame for the next iteration of the Analysis Grand Challenge
description: >
The IRIS-HEP Analysis Grand Challenge (AGC) aims to develop examples of realistic, end-to-end high-energy physics analyses, as well as demonstrate the advantages of modern tools and technologies when applied to such tasks.
The next iteration of the AGC (v2) will put the capabilities of modern analysis interfaces such as Coffea and ROOT's RDataFrame under further test, for example by including more complex systematic variations and sophisticated machine learning techniques.
The project consists in the investigation and implementation of such new developments in the context of RDataFrame as well as their benchmarking on state-of-the-art analysis facilities.
The goal is to gain insights useful to guide the future design of both the analysis facilities and the applications that will be deployed on them.
The IRIS-HEP Analysis Grand Challenge (AGC) aims to develop examples of realistic, end-to-end high-energy physics analyses, as well as demonstrate the
advantages of modern tools and technologies when applied to such tasks. The next iteration of the AGC (v2) will put the capabilities of modern analysis
interfaces such as Coffea and ROOT's RDataFrame under further test, for example by including more complex systematic variations and sophisticated machine
learning techniques. The project consists in the investigation and implementation of such new developments in the context of RDataFrame as well as their
benchmarking on state-of-the-art analysis facilities. The goal is to gain insights useful to guide the future design of both the analysis facilities and the
applications that will be deployed on them.
contacts:
- name: Enrico Guiraud
email: [email protected]
16 changes: 5 additions & 11 deletions projects/agc-recast.yml
@@ -22,17 +22,11 @@ commitment:
- Full time
shortdescription: Implement the CMS open data AGC analysis with RECAST and REANA
description: >
[RECAST](https://iris-hep.org/projects/recast.html) is a platform for systematic
interpretation of LHC searches.
It reuses preserved analysis workflows from the LHC experiments, which is now
possible with containerization and tools such as [REANA](http://reanahub.io).
A yet unrealized component of the IRIS-HEP [Analysis Grand Challenge](https://agc.readthedocs.io/)
(AGC) is reuse and reinterpretation of the analysis.
This project would aim to preserve the AGC CMS open data analysis and the
accompanying distributed infrastructure and implement a RECAST workflow allowing
REANA integration with the AGC.
A key challenge of the project is creating a preservation scheme for the associated
Kubernetes distributed infrastructure.
[RECAST](https://iris-hep.org/projects/recast.html) is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from
the LHC experiments, which is now possible with containerization and tools such as [REANA](http://reanahub.io). A yet unrealized component of the IRIS-HEP
[Analysis Grand Challenge](https://agc.readthedocs.io/) (AGC) is reuse and reinterpretation of the analysis. This project would aim to preserve the AGC CMS
open data analysis and the accompanying distributed infrastructure and implement a RECAST workflow allowing REANA integration with the AGC. A key challenge of
the project is creating a preservation scheme for the associated Kubernetes distributed infrastructure.
contacts:
- name: Kyle Cranmer
email: [email protected]
5 changes: 4 additions & 1 deletion projects/cms-data-pop.yml
@@ -23,7 +23,10 @@ commitment:
- Full time
shortdescription: Predict data popularity to improve its availability for physics analysis
description: >
The CMS data management team is responsible for distributing data among computing centers worldwide. Given the limited disk space at these sites, the team must dynamically manage the available data on disk. Whenever users attempt to access unavailable data, they are required to wait for the data to be retrieved from permanent tape storage. This delay impedes data analysis and hinders the scientific productivity of the collaboration. The objective of this project is to create a tool that utilizes machine learning algorithms to predict which data should be retained, based on current usage patterns.
The CMS data management team is responsible for distributing data among computing centers worldwide. Given the limited disk space at these sites, the team
must dynamically manage the available data on disk. Whenever users attempt to access unavailable data, they are required to wait for the data to be retrieved
from permanent tape storage. This delay impedes data analysis and hinders the scientific productivity of the collaboration. The objective of this project is
to create a tool that utilizes machine learning algorithms to predict which data should be retained, based on current usage patterns.
contacts:
- name: Dmytro Kovalskyi
email: [email protected]
7 changes: 6 additions & 1 deletion projects/cms-monit-micro-services.yml
@@ -26,7 +26,12 @@ program:
- IRIS-HEP fellow
shortdescription: To develop microservice architecture for CMS HTCondor Job Monitoring
description: >
Current implementation of HTCondor Job Monitoring, internally known as Spider service, is a monolithic application which query HTCondor Schedds periodically. This implementation does not allow deployment in modern Kubernetes infrastructures with advantages like auto-scaling, resilience, self-healing, and so on. However, it can be separated into microservices responsible for “ClassAds calculation and conversion to JSON documents”, “transmitting results to ActiveMQ and OpenSearch without any duplicates” and “highly durable query management”. Such a microservice architecture will allow the use of appropriate languages like GoLang when it has advantages over Python. Moreover, intermediate monitoring pipelines can be integrated into this microservice architecture and it will drop the work-power needed for the services that produce monitoring outcomes using HTCondor Job Monitoring data
Current implementation of HTCondor Job Monitoring, internally known as Spider service, is a monolithic application which query HTCondor Schedds periodically.
This implementation does not allow deployment in modern Kubernetes infrastructures with advantages like auto-scaling, resilience, self-healing, and so on.
However, it can be separated into microservices responsible for “ClassAds calculation and conversion to JSON documents”, “transmitting results to ActiveMQ and
OpenSearch without any duplicates” and “highly durable query management”. Such a microservice architecture will allow the use of appropriate languages like
GoLang when it has advantages over Python. Moreover, intermediate monitoring pipelines can be integrated into this microservice architecture and it will drop
the work-power needed for the services that produce monitoring outcomes using HTCondor Job Monitoring data
contacts:
- name: Brij Kishor Jashal
email: [email protected]
7 changes: 5 additions & 2 deletions projects/cms-t0-test.yml
@@ -24,7 +24,10 @@ commitment:
- Full time
shortdescription: Improve functional testing before deployment of critical changes for CMS Tier-0
description: >
The CMS Tier-0 service is responsible for the prompt processing and distribution of the data collected by the CMS Experiment. Thorough testing of any code or configuration changes for the service is critical for timely data processing. The existing system has a Jenkins pipeline to execute a large-scale "replay" of the data processing using old data for the final functional testing before deployment of critical changes. The project is focusing on integration of unit tests and smaller functional tests in the integration pipeline to speed up testing and reduce resource utilization.
The CMS Tier-0 service is responsible for the prompt processing and distribution of the data collected by the CMS Experiment. Thorough testing of any code or
configuration changes for the service is critical for timely data processing. The existing system has a Jenkins pipeline to execute a large-scale "replay" of
the data processing using old data for the final functional testing before deployment of critical changes. The project is focusing on integration of unit
tests and smaller functional tests in the integration pipeline to speed up testing and reduce resource utilization.
contacts:
- name: Dmytro Kovalskyi
email: [email protected]
@@ -34,4 +37,4 @@ contacts:
email: [email protected]
mentees:
- name: Mycola Kolomiiets
link: https://iris-hep.org/fellows/MycolaKolomiiets.html
link: https://iris-hep.org/fellows/MycolaKolomiiets.html
6 changes: 3 additions & 3 deletions projects/diff-geant.yml
@@ -24,9 +24,9 @@ program:
- IRIS-HEP fellow
shortdescription: Developing an automatic differentiation and initial parameters optimisation pipeline for the particle shower model.
description: >
The goal of this project is to develop a differentiable simulation and optimization pipeline for Geant4. The narrow task of this
Fellowship project is to develop a trial automatic differentiation and backpropagation pipeline for the Markov-like stochastic
branching process that is modeling a particle shower spreading inside a detector material in three spatial dimensions.
The goal of this project is to develop a differentiable simulation and optimization pipeline for Geant4. The narrow task of this Fellowship project is to
develop a trial automatic differentiation and backpropagation pipeline for the Markov-like stochastic branching process that is modeling a particle shower
spreading inside a detector material in three spatial dimensions.
contacts:
- name: Lukas Heinrich
email: [email protected]
15 changes: 6 additions & 9 deletions projects/energy-cost-vre-coffea-casa.yml
@@ -22,19 +22,16 @@ commitment:
- Full time
shortdescription: Implementing energy consumption benchmarks on different analysis platforms and facilities
description: >
Benchmarks for software energy consumption are starting to appear
(see e.g. the [SCI score](https://github.com/Green-Software-Foundation/software_carbon_intensity/blob/main/Software_Carbon_Intensity/Software_Carbon_Intensity_Specification.md#quantification-method))
alongside more common performance benchmarks.
In this project, we will pilot the implementation of selected software energy consumption benchmarks
on two different facilities for user analysis:
Benchmarks for software energy consumption are starting to appear (see e.g. the [SCI
score](https://github.com/Green-Software-Foundation/software_carbon_intensity/blob/main/Software_Carbon_Intensity/Software_Carbon_Intensity_Specification.md#quantification-method))
alongside more common performance benchmarks. In this project, we will pilot the implementation of selected software energy consumption benchmarks on two
different facilities for user analysis:
* the [Virtual Research Environment](https://indico.jlab.org/event/459/contributions/11671/),
a prototype analysis platform for the European Open Science Cloud.
* [Coffea-casa](https://coffea-casa.readthedocs.io/), a prototype Analysis
Facility (AF), which provides services for "low-latency columnar analysis."
We will then test them with simple user software pipelines.
The candidate will work in collaboration with another IRIS-HEP fellow
investigating energy consumption benchmarks for ML algorithms,
and alongside a team of students and interns working on the selection and implementation of the benchmarks.
We will then test them with simple user software pipelines. The candidate will work in collaboration with another IRIS-HEP fellow investigating energy
consumption benchmarks for ML algorithms, and alongside a team of students and interns working on the selection and implementation of the benchmarks.
contacts:
- name: Caterina Doglioni
email: [email protected]
22 changes: 12 additions & 10 deletions projects/gnn-tracking.yml
@@ -25,23 +25,25 @@ program:

shortdescription: Reconstruct the trajectories of particle with graph neural networks
description: |
In the GNN tracking project, we use [graph neural networks][gnn-wiki] (GNNs) to reconstruct trajectories ("tracks") of elementary particles traveling through a detector.
In the GNN tracking project, we use [graph neural networks][gnn-wiki] (GNNs) to reconstruct trajectories ("tracks") of elementary particles traveling through
a detector.
This task is called ["tracking"][tracking-wiki] and is different from many other problems that involve trajectories:
* there are several thousand particles that need to be tracked at once,
* there is no time information (the particles travel too fast),
* we do not observe a continuous trajectory but instead only around five points ("hits") along the way in different detector layers.
The task can be described as a combinatorically very challenging "connect-the-dots" problem, essentially turning a cloud of points (hits) in 3D space into a set of O(1000) trajectories.
Expressed differently, each hit (containing not much more than the x/y/z coordinate) must be assigned to the particle/track it belongs to.
The task can be described as a combinatorically very challenging "connect-the-dots" problem, essentially turning a cloud of points (hits) in 3D space into a
set of O(1000) trajectories. Expressed differently, each hit (containing not much more than the x/y/z coordinate) must be assigned to the particle/track it
belongs to.
A conceptually simple way to turn this problem into a machine learning task is to create a fully connected graph of all points and then train an edge classifier to reject any edge that doesn't connect points that belong to the same particle.
In this way, only the individual trajectories remain as components of the initial fully connected graph.
However, this strategy does not seem to lead to perfect results in practice.
The approach of this project uses this strategy only as the first step to arrive at "small" graphs.
It then projects all hits into a learned latent space with the model learning to place hits of the same particle close to each other, such that the hits belonging to the same particle form clusters.
A conceptually simple way to turn this problem into a machine learning task is to create a fully connected graph of all points and then train an edge
classifier to reject any edge that doesn't connect points that belong to the same particle. In this way, only the individual trajectories remain as components
of the initial fully connected graph. However, this strategy does not seem to lead to perfect results in practice. The approach of this project uses this
strategy only as the first step to arrive at "small" graphs. It then projects all hits into a learned latent space with the model learning to place hits of
the same particle close to each other, such that the hits belonging to the same particle form clusters.
The project code together with documentation and a reading list is available on [github][ghorganization] and uses [pytorch geometric][pyg].
See also [our GSoC proposal for the same project][gsoc-proposal], which lists prerequisites and possible tasks.
The project code together with documentation and a reading list is available on [github][ghorganization] and uses [pytorch geometric][pyg]. See also [our GSoC
proposal for the same project][gsoc-proposal], which lists prerequisites and possible tasks.
[ghorganization]: https://github.com/gnn-tracking