Multi GPU compilation support
TL;DR
An important use case for Torch-TRT is supporting multi-GPU and multi-node execution. The goal is to boost performance by using data parallelism and tensor parallelism to compile the model across multiple GPUs. There are different ways to do this: fully sharded data parallelism, sequence parallelism, and tensor parallelism. Data parallelism examples can be found in /examples/distributed_inference/data_parallel_gpt2 and /examples/distributed_inference/data_parallel_stable_diffusion. This RFC focuses on tensor parallelism, an efficient model parallelism method for large model compilation.
Goal
Accelerate the model compilation time using tensor parallelism
Implementation stages
Development of tensor parallel (TP) inference examples.
Compiling the model with tensor parallelism is agnostic to the framework, provided the model has been sharded properly across the network. The following two frameworks have been explored so far:
Below is an example of TP using torch.distributed.tensor.parallel:
https://github.com/pytorch/TensorRT/pull/3047/files#diffd70d4b88b03c3208178bf10992dfbb8e8ddb4ec6db247bbf42e0c2e706b0d5a5R5-R83
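For illustration, here is a minimal sketch of this flow. It assumes a toy two-layer MLP and a hand-written colwise/rowwise sharding plan; the actual model, sharding plan, and compile options in the linked PR differ.

```python
# Minimal tensor-parallel sketch with torch.distributed.tensor.parallel,
# compiled through the Torch-TensorRT backend of torch.compile.
# The model and sharding plan below are illustrative placeholders.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch_tensorrt  # noqa: F401  (makes the "torch_tensorrt" backend available)
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")


class ToyMLP(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.in_proj = nn.Linear(dim, 4 * dim)
        self.out_proj = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.out_proj(torch.relu(self.in_proj(x)))


# One 1-D device mesh over the local GPUs. The plan below is the manual,
# model-specific sharding layout discussed in the text: column-wise then
# row-wise, so each MLP block needs a single all-reduce on its output.
mesh = init_device_mesh("cuda", (world_size,))
model = ToyMLP().cuda()
model = parallelize_module(
    model,
    mesh,
    {"in_proj": ColwiseParallel(), "out_proj": RowwiseParallel()},
)

# Compile the sharded module with the Torch-TensorRT backend.
x = torch.randn(8, 1024, device="cuda")
compiled = torch.compile(model, backend="torch_tensorrt", dynamic=False)
out = compiled(x)
```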
We start with this approach, the primary reason being that torch.distributed is more stable and has no dependency on external libraries. Because of this it does not introduce torch.compile graph breaks, and the forward function remains more under our control. The main limitation, however, is that the sharding layout is set manually and will change from model to model.
We run the above using torchrun --nproc_per_node=2 tensor_parallel.py
Megatron-LM uses a JSON configuration file to set various parameters such as model size, sequence length, and parallelism settings.
This is a future path to be explored, since it causes dynamic graph breaks in torch.compile. It does, however, support the mainstream LLM models via its configuration file.
Wrapping the NCCL ops in the Torch-TensorRT converter library
NCCL (NVIDIA Collective Communications Library) is a library designed to optimize collective communication operations across multiple GPUs and nodes. It is developed by NVIDIA and is widely used in distributed deep learning and high-performance computing to handle communication between GPUs efficiently.
Since NCCL collective communication is not supported in Torch-TRT, the torch.distributed calls above cause graph breaks, leading to slower compilation time compared to torch in the first iteration while forming the TRT engine. The following are the operations to be supported: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html
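For reference, these NCCL collectives surface in the traced graph as PyTorch functional-collective ops (names as in recent PyTorch releases; treat the exact op set per model as an assumption, since it depends on the sharding plan):

```python
# Illustrative list of the ATen-level functional-collective ops that correspond
# to the NCCL calls above and would need Torch-TRT converter support.
import torch

collective_ops = [
    torch.ops._c10d_functional.all_reduce.default,              # ncclAllReduce
    torch.ops._c10d_functional.all_gather_into_tensor.default,  # ncclAllGather
    torch.ops._c10d_functional.reduce_scatter_tensor.default,   # ncclReduceScatter
    torch.ops._c10d_functional.broadcast.default,               # ncclBroadcast
    torch.ops._c10d_functional.wait_tensor.default,             # completes the async collective
]
for op in collective_ops:
    print(op)
```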
Below are some examples of how the NCCL ops can be supported in Torch-TRT using an NCCL-based plugin. It uses a TensorRT plugin registered under the namespace TRT_LLM_PLUGIN_NAMESPACE, which wraps the NCCL operations:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/functional.py#L3701-L4003
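As a rough, non-authoritative sketch of how such a wrapper could be shaped: it assumes the TRT-LLM plugin shared library has already been loaded so its creators are registered, that an "AllReduce" plugin (version "1") exists under TRT_LLM_PLUGIN_NAMESPACE, and that a single "group" field is enough; the real plugin fields are in the linked functional.py, and the dynamo_tensorrt_converter registration shown here is only one possible integration point.

```python
# Hedged sketch: wrap the NCCL-backed TRT-LLM "AllReduce" plugin as a
# Torch-TensorRT dynamo converter. Plugin name, namespace, and field layout
# are assumptions; consult the linked TRT-LLM functional.py for the real ones.
import numpy as np
import tensorrt as trt
import torch
import torch.distributed as dist
from torch_tensorrt.dynamo.conversion import dynamo_tensorrt_converter

TRT_LLM_PLUGIN_NAMESPACE = "tensorrt_llm"  # assumed namespace string


@dynamo_tensorrt_converter(torch.ops._c10d_functional.all_reduce.default)
def all_reduce_converter(ctx, target, args, kwargs, name):
    """Lower a functional all_reduce node onto the NCCL-backed TRT plugin."""
    # Look up the plugin creator that the (pre-loaded) TRT-LLM library registered.
    registry = trt.get_plugin_registry()
    creator = registry.get_plugin_creator("AllReduce", "1", TRT_LLM_PLUGIN_NAMESPACE)

    # Single assumed field: the participating ranks. The reduce-op and
    # process-group arguments of the ATen op are ignored in this sketch.
    group = np.array(list(range(dist.get_world_size())), dtype=np.int32)
    fields = trt.PluginFieldCollection(
        [trt.PluginField("group", group, trt.PluginFieldType.INT32)]
    )
    plugin = creator.create_plugin(name, fields)

    # args[0] is the ITensor already converted for the all_reduce input.
    layer = ctx.net.add_plugin_v2([args[0]], plugin)
    return layer.get_output(0)
```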
Implementing the above directly from the TRT-LLM library
There can be two methods