

helq edited this page Jan 30, 2024 · 2 revisions

# CODES kronos-develop branch

Kronos is the tag name for our current (as of Jan 2024) funding project. The objective of the project is to speed up CODES simulations through the use of surrogates (network surrogates and workload surrogates). On this page, we document how to compile and run CODES models that demonstrate the capabilities of surrogates.

To follow the instructions presented here, download this compressed file.

## Compile

To compile CODES, copy the script CODES-compile-script.sh into an empty folder and run it. The compilation script comes with three switches.

When all switches are deactivated, CODES is compiled with minimal requirements: it only needs C and C++ compilers and an MPI library (preferably MPICH, though OpenMPI has also been tested).
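As a quick sanity check before running the script, the minimal requirements can be probed from the shell (a sketch; `mpicc` is the usual MPI compiler wrapper shipped by both MPICH and OpenMPI):

```shell
# Check that the minimal build requirements are on PATH
# (C/C++ compilers plus an MPI compiler wrapper).
for tool in cc c++ mpicc; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
```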

As of Jan 17, 2024, the branch kronos-develop implements a strategy to replace the network simulation with a surrogate of the network. When activated, the network surrogate reduces (in most cases) the overall simulation time.

When a switch is turned on, the compilation script downloads additional software to enable further capabilities of CODES. These capabilities are:

  • SWM: CODES comes preloaded with some synthetic workloads. SWM is a library (of sorts) that allows us to implement further, more complicated synthetic workloads.
  • UNION: Writing synthetic workloads directly in CODES or with SWM is complex. To simplify their creation even further, we can use a "programming language" called coNCePTuaL. UNION requires SWM to work.
  • TORCH: As of now, CODES implements only one surrogate model (average packet latency). For more complex surrogate models, we can train a model offline using PyTorch and then load it into CODES through the PyTorch C++ API.

## Running example (high-fidelity and hybrid)

Copy the folder experiments/ into the folder where the compile script was executed, and then run:

```shell
cd experiments
bash run-experiment.sh dfly-72/mpireplay-synthetic-10ms.sh
```

Running the script will take several minutes. A folder results will be created under experiments/dfly-72/. Each time the script is run, a new subfolder is created under results, containing all the output generated by the example experiment.
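Since every run creates a fresh subfolder, one way to locate the most recent one is to sort the entries (a sketch; the exp-001/exp-002 names here are stand-ins created for the demo, real names may differ):

```shell
# Stand-in subfolders, just to show the lookup.
mkdir -p experiments/dfly-72/results/exp-001 experiments/dfly-72/results/exp-002
# The lexicographically last entry is the newest run.
latest=$(ls -d experiments/dfly-72/results/exp-* | sort | tail -n 1)
echo "latest run: $latest"
```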

The example experiment consists of two simulations (we run CODES twice): high-fidelity and hybrid.

  • Under high-fidelity, the network is never replaced by the surrogate network. This is our gold standard, our ground truth.
  • Under hybrid, the simulation runs in high-fidelity mode for 3 ms, then switches the network to the surrogate network and runs for 5 ms. Finally, at 8 ms, the network is switched back from surrogate to high-fidelity.
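Put as numbers, the hybrid schedule spends half of the simulated time in surrogate mode:

```shell
# Hybrid timeline: high-fidelity 0-3 ms, surrogate 3-8 ms,
# high-fidelity again 8-10 ms.
awk 'BEGIN {
  total_ms = 10; surrogate_ms = 8 - 3
  printf "surrogate: %d ms (%.0f%% of the run)\n", surrogate_ms, 100 * surrogate_ms / total_ms
}'
```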

The parameters of the simulation(s) are:

  • A 72-node dragonfly network (72 compute nodes and 36 routers arranged in 9 groups)
  • The simulation runs for 10 ms.
  • The traffic injected into the network is uniform random (a synthetic workload) with a period of 470 ns: each compute node sends a new message to a random compute node in the network every 470 ns.
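A back-of-envelope count of the injected traffic under these parameters (this ignores any warm-up or drain behavior the simulation may have):

```shell
# 72 nodes, one new message per node every 470 ns, over 10 ms (= 1e7 ns).
awk 'BEGIN {
  nodes = 72; sim_ns = 1e7; period_ns = 470
  per_node = int(sim_ns / period_ns)
  printf "messages per node: %d\n", per_node
  printf "total messages:    %d\n", per_node * nodes
}'
```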

Under the folder results/exp-XXX/synthetic1-uniform-ms/high-fidelity/packet-latency, we can find the packet-latency data generated by the high-fidelity simulation. It contains the source, destination, latency, and other pieces of data collected for each packet transmitted through the network.

## Training PyTorch model from example results

We need to install PyTorch, along with Pandas and NumPy (the training script uses all three).

First, we need to clean the output data so that Pandas can read it without any issues.

```shell
cd experiments/dfly-72/results/exp-XXX/synthetic1-uniform-ms/high-fidelity/packet-latency
# Take the header line from one file and strip the leading '#'
head -n 1 'packets-delay-gid=0.txt' | sed 's/#//' > packets-delay.csv
# Append the data rows (skipping each file's header) from every per-GID file
tail -n +2 -q 'packets-delay-gid='*.txt >> packets-delay.csv
```
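To see what the merge does, here is the same pipeline run on two tiny stand-in files (the column names are made up for the demo; the real header comes from the simulation output):

```shell
# Create two fake per-GID files, each with a '#'-prefixed header.
printf '#src,dst,latency\n0,1,950\n' > 'packets-delay-gid=0.txt'
printf '#src,dst,latency\n1,0,980\n' > 'packets-delay-gid=1.txt'
# One cleaned header line, then the data rows from every file.
head -n 1 'packets-delay-gid=0.txt' | sed 's/#//' > packets-delay.csv
tail -n +2 -q 'packets-delay-gid='*.txt >> packets-delay.csv
cat packets-delay.csv
```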

Copy the subfolder pytorch-training into the folder where everything was compiled, and then run the script to train the model:

```shell
python pytorch-training/ml-packetdelay.py --method MLP --epoch 50 \
   --input-file <path-to-data>/packets-delay.csv \
   --model-path experiments/dfly-72/ml-model.pt
```

## Running surrogate model from PyTorch model

CODES has to be compiled with Torch support, which means activating the torch_enable switch in the compilation script and running it again.
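Flipping the switch presumably amounts to editing a variable near the top of the script; a stand-in sketch (the variable name torch_enable is from this section, but the exact format inside CODES-compile-script.sh may differ):

```shell
# Stand-in for the real compile script, just to show the edit.
printf 'torch_enable=no\n' > compile-script-demo.sh
# Turn Torch support on; the real script would then be re-run.
sed 's/^torch_enable=.*/torch_enable=yes/' compile-script-demo.sh > tmp && mv tmp compile-script-demo.sh
cat compile-script-demo.sh
```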

Once training is complete (previous section), we have a trained ML model that we can load into CODES. To run the experiment using Torch, run:

```shell
cd experiments
bash run-experiment.sh dfly-72/mpireplay-synthetic-10ms-torch.sh
```

This will create another entry inside results. The output generated by the hybrid simulation using the Torch model should be similar to that of the hybrid or high-fidelity runs (a similar number of transmitted packets).

## FAQ

  • What is a synthetic workload? Ans: On a real system (not in a simulation), a parallel program advances its computation via messages exchanged across compute nodes. Some programs require more messages than others to advance, and each has its own pattern of communication. We can log a trace of a real application's traffic and then replay it in CODES, but this is space intensive, so instead we can simulate a traffic pattern. That is what a synthetic workload is: a traffic pattern generated by code instead of taken from a trace.
  • What's the difference between packet and message sizes? Ans: A message is broken into packets. Packets have a size limit; messages do "not". In the example presented here, the message size and packet size are the same.
  • In the source code, where is the average packet latency predictor and the PyTorch predictor? Ans: Most of the infrastructure to switch from high-fidelity to surrogate and back is contained within codes/src/surrogate (codes/codes/surrogate for the header files). The average packet latency predictor (the default predictor used in surrogate mode) can be found in codes/src/surrogate/packet-latency-predictor/average.c. The PyTorch-based predictor can be found in codes/src/surrogate/packet-latency-predictor/torch-jit.C.
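The message-to-packet split in the second question can be illustrated with made-up sizes (in the example on this page the two sizes are equal, so each message is exactly one packet):

```shell
# Made-up sizes: a 4096-byte message under a 1024-byte packet limit
# splits into ceil(4096/1024) packets.
awk 'BEGIN {
  msg_bytes = 4096; max_packet_bytes = 1024
  printf "packets: %d\n", int((msg_bytes + max_packet_bytes - 1) / max_packet_bytes)
}'
```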