kronos-develop
Kronos is the tag name for our current (as of Jan 2024) funding project. The objective of the project is to speed up CODES simulations via the use of surrogates (network surrogates and workload surrogates). On this page, we document how to compile and run CODES models that demonstrate the capabilities of surrogates.
To follow the instructions presented here, download this compressed file.
To compile CODES, copy the script CODES-compile-script.sh into an empty folder and run it. The compilation script comes with three switches.
When all switches are deactivated, CODES is compiled with minimal requirements: just C and C++ compilers and an MPI library (preferably MPICH, though OpenMPI has also been tested).
As of Jan 17, 2024, the branch kronos-develop implements a strategy to replace the network simulation with a surrogate of the network. When activated, the network surrogate reduces (in most cases) the overall simulation time.
When switched on, the compilation script will download additional software to enable further capabilities of CODES. These capabilities are:
- SWM: CODES comes preloaded with some synthetic workloads. SWM is a library (of sorts) that allows us to implement further, more complicated synthetic workloads.
- UNION: Writing synthetic workloads directly in CODES or via SWM is complex, so to simplify their creation even further we can use a "programming language" called coNCePTuaL. UNION requires SWM to work.
- TORCH: As of now, CODES implements only one surrogate model (average packet latency). For more complex surrogate models, we can train a model offline using PyTorch. This model can then be loaded into CODES via the PyTorch C++ API.
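As a rough sketch, the three switches might look like the following inside CODES-compile-script.sh. Only torch_enable is named later on this page; the other variable names are guesses, so check the script itself for the real ones:

```shell
# Illustrative only: switch names other than torch_enable are assumptions.
swm_enable=no     # SWM: extra synthetic workloads
union_enable=no   # UNION: coNCePTuaL workloads (requires SWM)
torch_enable=no   # TORCH: load PyTorch-trained surrogate models
```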
Copy the folder experiments/ into the folder where the compile script was executed, and then run:
```shell
cd experiments
bash run-experiment.sh dfly-72/mpireplay-synthetic-10ms.sh
```
Running the script will take several minutes. A folder results will be created under experiments/dfly-72/. Each time the script is run, a new subfolder is created under results containing all the output generated by the example experiment.
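Since each run adds a new subfolder under results, one way to pick out the latest run is to sort the subfolder names. The exp-001/exp-002 names below are mock examples; the actual naming may differ:

```shell
# Create two mock run folders, then select the newest by name
mkdir -p results/exp-001 results/exp-002
ls -d results/exp-* | sort | tail -n 1   # -> results/exp-002
```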
The example experiment consists of two simulations (we run CODES twice): high-fidelity and hybrid.
- Under high-fidelity, the network is never replaced by the surrogate network. This is our gold standard, our ground truth.
- Under hybrid, the simulation runs in high-fidelity mode for 3 ms, then switches the network to the surrogate and runs for 5 ms. Finally, at 8 ms, the network is switched back from surrogate to high-fidelity.
The parameters of the simulation(s) are:
- A 72-node dragonfly network (72 compute nodes and 36 routers across 9 groups)
- The simulation runs for 10ms.
- The traffic injected into the network is uniform random (a synthetic workload) with a period of 470 ns: each compute node sends a new message to a random compute node in the network every 470 ns.
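As a back-of-envelope check of these parameters (not CODES output), a 470 ns injection period over a 10 ms simulation means each compute node sends roughly 21,000 messages:

```shell
# 10 ms expressed in nanoseconds, divided by the injection period
period_ns=470
sim_ns=$((10 * 1000 * 1000))
echo $((sim_ns / period_ns))   # -> 21276 messages per node
```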
Under the folder results/exp-XXX/synthetic1-uniform-ms/high-fidelity/packet-latency, we can find the packet latency data generated by the high-fidelity simulation. It contains source, destination, latency, and other pieces of data collected per packet transmitted in the network.
We need to install PyTorch, along with Pandas and NumPy (the training script uses both).
First, we need to clean the output data so that Pandas can read it without any issues.
```shell
cd experiments/dfly-72/results/exp-XXX/synthetic1-uniform-ms/high-fidelity/packet-latency
# Take the header line from one file and strip the leading '#'
head -n 1 'packets-delay-gid=0.txt' | sed 's/#//' > packets-delay.csv
# Append the data rows (skipping each file's header) from every per-rank file
tail -n +2 -q 'packets-delay-gid='*.txt >> packets-delay.csv
```
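The cleaning recipe can be sanity-checked on tiny mock files. The column names below are made up for illustration; the real files contain more fields:

```shell
# Two mock per-rank files, each with a '#'-prefixed header line
printf '#src dst latency\n0 5 120\n' > 'packets-delay-gid=0.txt'
printf '#src dst latency\n1 3 250\n' > 'packets-delay-gid=1.txt'
# Same recipe: keep one header (without '#'), then append all data rows
head -n 1 'packets-delay-gid=0.txt' | sed 's/#//' > packets-delay.csv
tail -n +2 -q 'packets-delay-gid='*.txt >> packets-delay.csv
cat packets-delay.csv
```

The resulting packets-delay.csv has one clean header line followed by one line per packet, which Pandas can read without issues.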
Copy the subfolder pytorch-training into the folder where everything was compiled, and then run the script to train the model:
```shell
python pytorch-training/ml-packetdelay.py --method MLP --epoch 50 \
    --input-file <path-to-data>/packets-delay.csv \
    --model-path experiments/dfly-72/ml-model.pt
```
CODES has to be compiled with Torch support. This means activating the switch torch_enable in the compilation script and running it again.
Once training has completed (previous section), we have a trained ML model that we can load into CODES. To run the experiment using Torch, run:
```shell
cd experiments
bash run-experiment.sh dfly-72/mpireplay-synthetic-10ms-torch.sh
```
This will create another entry inside results. The output generated by the hybrid simulation using the Torch model should be similar to that of the hybrid or high-fidelity runs (a similar number of transmitted packets).
- What is a synthetic workload? Ans: On a real system (not in a simulation), a parallel program running across compute nodes advances its computation via messages. Some programs require more messages than others to advance, and each has a different communication pattern. We can log a trace of a real application's traffic and then use it in CODES, but traces are space intensive, so instead we can simulate a traffic pattern. That is what a synthetic workload is: a traffic pattern generated by code instead of a trace.
- What's the difference between packet and message sizes? Ans: A message is broken into packets. Packets have a size limit; messages do "not". In the example presented here, message size and packet size are the same.
- In the source code, where are the average packet latency predictor and the PyTorch predictor? Ans: Most of the infrastructure to switch from high-fidelity to surrogate and back is contained within codes/src/surrogate (codes/codes/surrogate for the header files). The average packet latency predictor (the default predictor used in surrogate mode) can be found in codes/src/surrogate/packet-latency-predictor/average.c. The PyTorch-based predictor can be found in codes/src/surrogate/packet-latency-predictor/torch-jit.C.
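The message-to-packet split described above can be illustrated with made-up sizes (these numbers are not CODES defaults):

```shell
# A message larger than the packet limit is split into ceil(message/packet) packets
message_bytes=10000
packet_bytes=4096
echo $(( (message_bytes + packet_bytes - 1) / packet_bytes ))   # -> 3 packets here
```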