The profiling data is stored in CSV format under the `data/profiling` directory, organized as follows:

```
profiling/
- compute
  - a100
    - codellama/CodeLlama-34b-Instruct-hf/
      - mlp.csv
      - attention.csv
    - internlm/internlm-20b/
      - mlp.csv
      - attention.csv
  - h100
    - meta-llama/Llama-2-70b-hf/
      - mlp.csv
      - attention.csv
- network
  - a100_pair_nvlink
    - allreduce.csv
    - send_recv.csv
  - h100_dgx
    - allreduce.csv
    - send_recv.csv
```
For compute profiling, only the GPU SKU matters, not the network configuration of the node. So we don't distinguish between `a100_pair_nvlink` (Azure Standard_NC96ads_A100_v4 with 4 A100s) and `a100_dgx` (A100 DGX with 8 A100s); the same compute data, stored in the folder called `a100`, is used for both.
For network profiling, the network configuration of the node does matter, so we have separate folders for `a100_pair_nvlink` and `a100_dgx`. For example, TP4 behaves differently on the two: in `a100_pair_nvlink`, the GPUs form two pairs connected internally by NVLink with a relatively slower link between the pairs, whereas in `a100_dgx` all 8 GPUs are connected by NVLink.
We need actual GPUs to get profiling data for a new model. Once the profiling is done, simulations can be run on CPUs only.
- Clone the `sarathi-serve` GitHub repo.
  - Checkout branch `vidur`.
  - Follow its README to install it.
  - Let us assume that the Python virtual environment was created in `sarathi-serve/env`.
- Now clone this repo (`vidur`) but keep the `sarathi-serve/env` virtual environment activated.
- Add a YAML model config for the new model in `data/model_configs` (see the config sketch after this list).
  - Use the model's HuggingFace model id for the file name, e.g. `data/model_configs/meta-llama/Llama-2-70b-hf.yml`.
  - Refer to the model's HuggingFace `config.json`, e.g. https://huggingface.co/meta-llama/Llama-2-70b-hf/blob/main/config.json.
  - Ensure that the correct parameters are set in the YAML file so that the reference transformer model `GPTModel` closely resembles the new model.
  - We use this reference model to profile only the MLP operations of all the models, so the attention operations are no-op'ed here.
- Run `python -m pip install -e .` from the `vidur/` directory to install the simulator in the virtual environment.
- For compute profiling (MLP and attention), 1 GPU is enough even for tensor parallel degrees greater than 1, so `num_gpus` set to 1 is sufficient, albeit slower, for profiling.
- Now we need to do the MLP profiling:

  ```sh
  python vidur/profiling/mlp/main.py \
      --models codellama/CodeLlama-34b-Instruct-hf \
      --num_gpus 4
  ```

  - Run `python vidur/profiling/mlp/main.py --help` for more options.
  - Copy the CSV file from `profiling_outputs/mlp/<timestamp>/codellama/CodeLlama-34b-Instruct-hf/mlp.csv` to `data/profiling/compute/a100/codellama/CodeLlama-34b-Instruct-hf/mlp.csv` (see the copy sketch after this list).
- Now we need to do the attention profiling:

  ```sh
  python vidur/profiling/attention/main.py \
      --models codellama/CodeLlama-34b-Instruct-hf \
      --num_gpus 4
  ```

  - Run `python vidur/profiling/attention/main.py --help` for more options.
  - Copy the CSV file from `profiling_outputs/attention/<timestamp>/codellama/CodeLlama-34b-Instruct-hf/attention.csv` to `data/profiling/compute/a100/codellama/CodeLlama-34b-Instruct-hf/attention.csv`.
  - Note that we are using `a100` as the device name. If you are using `h100` or some other device, then you need to create a new folder for that device in `data/profiling/compute` and copy the CSV files there.
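A practical way to write the model config (the "Add a YAML model config" step above) is to start from an existing file. The sketch below is illustrative rather than part of the documented workflow: it assumes `data/model_configs/meta-llama/Llama-2-70b-hf.yml` is already present and is architecturally close enough to serve as a template, and it uses the `codellama/CodeLlama-34b-Instruct-hf` example from above; the exact YAML fields should be taken from the existing configs and the model's HuggingFace `config.json`.

```sh
# Illustrative sketch: copy a config for a similar architecture and adapt it.
# Assumes data/model_configs/meta-llama/Llama-2-70b-hf.yml already exists.
mkdir -p data/model_configs/codellama
cp data/model_configs/meta-llama/Llama-2-70b-hf.yml \
   data/model_configs/codellama/CodeLlama-34b-Instruct-hf.yml
# Then edit the copied file so that parameters such as the number of layers,
# attention heads, and hidden sizes match the new model's config.json on HuggingFace.
```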
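The copy steps in the MLP and attention items above can be scripted. This is a minimal sketch assuming the example model and the `a100` device name used above; the `<timestamp>` values are placeholders for the actual run directories created under `profiling_outputs/`.

```sh
# Minimal sketch: copy freshly profiled compute CSVs into the simulator's data tree.
MODEL="codellama/CodeLlama-34b-Instruct-hf"
DEVICE="a100"                # use h100 (or another folder name) for a different SKU
MLP_RUN="<timestamp>"        # placeholder: timestamp of the MLP profiling run
ATTENTION_RUN="<timestamp>"  # placeholder: timestamp of the attention profiling run

mkdir -p "data/profiling/compute/${DEVICE}/${MODEL}"
cp "profiling_outputs/mlp/${MLP_RUN}/${MODEL}/mlp.csv" \
   "data/profiling/compute/${DEVICE}/${MODEL}/mlp.csv"
cp "profiling_outputs/attention/${ATTENTION_RUN}/${MODEL}/attention.csv" \
   "data/profiling/compute/${DEVICE}/${MODEL}/attention.csv"
```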
Network profiling is not dependent on the model 🎉, so the same network profiling data can be used for all models. However, we need to ensure that network profiling data is available for the node configuration we are using; if not, we need to profile the network for that device.

1. For network profiling, the node setup, i.e. the type of connectivity between the GPUs, matters. This is why we have the concept of a `network_device`: an informal name for the network configuration of the node, e.g. `a100_pair_nvlink`, `a100_dgx`, `h100_dgx` etc.
   1. For tensor parallelism, 4 GPUs are needed for TP4, 8 GPUs for TP8, etc.
   2. For pipeline parallelism across nodes, 2 nodes are needed to profile the link between the nodes.
Currently available data include:

- `a100_pair_nvlink`: Azure Standard_NC96ads_A100_v4 with 4 80GB A100 PCIe GPUs with pair-wise NVLINK connectivity.
- `h100_pair_nvlink`: Azure internal VM with 4 80GB H100 NVL GPUs with pair-wise NVLINK connectivity.
- `a100_dgx`: A100 DGX with 8 80GB A100s.
- `h100_dgx`: H100 DGX with 8 H100s.
- Clone this (`vidur`) repo and create a Python virtual environment as in Setup.
- Set up a Ray cluster:
  - Tensor parallelism is typically done on a single node, so we don't need a multi-node cluster for it.
  - However, pipeline parallelism is typically done across multiple nodes, so we need at least 2 nodes there.
  - Run `ray start --head` from the root node.
  - Run `ray start --address <head-node-ip>:<head-node-port>` from the other nodes. The other nodes also need to have the same git commit checked out.
- Run the following command to profile the `all_reduce` operation (sufficient for TP):

  ```sh
  python vidur/profiling/collectives/main.py \
      --num_workers_per_node_combinations 1,2,4,8 \
      --collective all_reduce
  ```

  - One may need to adjust `--num_workers_per_node_combinations` depending on the number of GPUs in the node, e.g. `--num_workers_per_node_combinations 1,2,4` for an Azure Standard_NC96ads_A100_v4 node.
  - Copy the CSV file from `profiling_outputs/collectives/<timestamp>/all_reduce.csv` to `data/profiling/network/{network_device}/allreduce.csv`, where `network_device` is an informal name for the network configuration of the node, e.g. `a100_pair_nvlink`, `a100_dgx`, `h100_dgx` etc. (see the sketch after this list).
  - Run `python vidur/profiling/collectives/main.py --help` for more options.
- Run the following command to profile the `send_recv` operation (required only for PP):

  ```sh
  python vidur/profiling/collectives/main.py \
      --num_workers_per_node_combinations 1,2,4,8 \
      --collective send_recv
  ```

  - Typically, PP is done across nodes, so `num_workers_per_node_combinations` should be the same as the number of GPUs available in one node. Profile with `num_workers_per_node_combinations` less than the number of GPUs in the node to model PP inside a node; this can be useful when each GPU is not connected to every other GPU by the same high-speed link.
  - Copy the CSV file from `profiling_outputs/collectives/<timestamp>/send_recv.csv` to `data/profiling/network/{network_device}/send_recv.csv`, where `network_device` is an informal name for the network configuration of the node, e.g. `a100_pair_nvlink`, `a100_dgx`, `h100_dgx` etc.
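As with the compute CSVs, the collectives outputs can be copied into place with a short script. A minimal sketch, assuming `a100_pair_nvlink` as the `network_device` (e.g. an Azure Standard_NC96ads_A100_v4 node); the `<timestamp>` values are placeholders for the actual run directories under `profiling_outputs/collectives/`.

```sh
# Minimal sketch: place collectives profiling results under data/profiling/network.
NETWORK_DEVICE="a100_pair_nvlink"  # informal name for this node's network configuration
ALL_REDUCE_RUN="<timestamp>"       # placeholder: timestamp of the all_reduce profiling run
SEND_RECV_RUN="<timestamp>"        # placeholder: timestamp of the send_recv profiling run

mkdir -p "data/profiling/network/${NETWORK_DEVICE}"
cp "profiling_outputs/collectives/${ALL_REDUCE_RUN}/all_reduce.csv" \
   "data/profiling/network/${NETWORK_DEVICE}/allreduce.csv"   # note: destination is allreduce.csv
cp "profiling_outputs/collectives/${SEND_RECV_RUN}/send_recv.csv" \
   "data/profiling/network/${NETWORK_DEVICE}/send_recv.csv"
```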
Implementation overheads such as scheduling time, sampling time, detokenization, etc. are not captured by the compute and network profiling above. For better fidelity, these should also be profiled. However, they tie the simulator closely to the implementation, e.g. vLLM. Scripts are available here but are not documented yet; they follow a similar pattern to the compute and network profiling scripts.