(perf-overview)=

Overview

This document summarizes performance measurements of TensorRT-LLM on H200 and H100 (Hopper), L40S (Ada) and A100 (Ampere) GPUs for a few key models.

The data in the following tables is provided as a reference point to help users validate observed performance. It should not be considered as the peak performance that can be delivered by TensorRT-LLM.

Methodology

The performance numbers below were collected using the methodology described in the benchmarks folder.

Peak Throughput

The tables below provide reference data at large batch sizes, representing high-throughput offline tasks.

All data was generated using version 0.9.0.

H200 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1024 | 1 | 128 | 128 | 27,304 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 8,530 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,785 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,753 |
| Mistral 7B | 896 | 1 | 128 | 128 | 20,460 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 8,950 |
| Mistral 7B | 64 | 1 | 2048 | 128 | 2,423 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 3,867 |
| LLaMA 7B | 896 | 1 | 128 | 128 | 20,618 |
| LLaMA 7B | 120 | 1 | 128 | 2048 | 8,348 |
| LLaMA 7B | 64 | 1 | 2048 | 128 | 2,391 |
| LLaMA 7B | 56 | 1 | 2048 | 2048 | 3,522 |
| LLaMA 70B | 1024 | 1 | 128 | 128 | 3,989 |
| LLaMA 70B | 512 | 2 | 128 | 2048 | 3,963 |
| LLaMA 70B | 64 | 1 | 2048 | 128 | 418 |
| LLaMA 70B | 64 | 1 | 2048 | 2048 | 1,458 |
| Falcon 180B | 1024 | 4 | 128 | 128 | 1,118 |
| Falcon 180B | 1024 | 4 | 128 | 2048 | 990 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 118 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 265 |

H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1024 | 1 | 128 | 128 | 25,860 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 7,350 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,570 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,212 |
| Mistral 7B | 896 | 1 | 128 | 128 | 20,404 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 8,623 |
| Mistral 7B | 84 | 1 | 2048 | 128 | 2,405 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 3,731 |
| LLaMA 7B | 896 | 1 | 128 | 128 | 19,854 |
| LLaMA 7B | 120 | 1 | 128 | 2048 | 6,944 |
| LLaMA 7B | 84 | 1 | 2048 | 128 | 2,163 |
| LLaMA 7B | 56 | 1 | 2048 | 2048 | 2,826 |
| LLaMA 70B | 1024 | 2 | 128 | 128 | 3,214 |
| LLaMA 70B | 512 | 4 | 128 | 2048 | 2,725 |
| LLaMA 70B | 96 | 2 | 2048 | 128 | 346 |
| LLaMA 70B | 64 | 2 | 2048 | 2048 | 1,011 |
| Falcon 180B | 1024 | 4 | 128 | 128 | 1,100 |
| Falcon 180B | 1024 | 8 | 128 | 2048 | 837 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 112 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 246 |

L40S GPUs (FP8)

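| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 512 | 1 | 128 | 128 | 7,859 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 1,904 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 684 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 768 |
| Mistral 7B | 896 | 1 | 128 | 128 | 9,562 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 4,387 |
| Mistral 7B | 84 | 1 | 2048 | 128 | 971 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 1,721 |
| LLaMA 7B | 256 | 1 | 128 | 128 | 5,885 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 1,654 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 574 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 537 |
| LLaMA 70B | 256 | 2 | 128 | 128 | 562 |
| LLaMA 70B | 256 | 4 | 128 | 2048 | 478 |
| LLaMA 70B | 16 | 2 | 2048 | 128 | 49 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 185 |
| Falcon 180B | 512 | 8 | 128 | 128 | 152 |
| Falcon 180B | 256 | 8 | 128 | 2048 | 200 |
| Falcon 180B | 32 | 8 | 2048 | 128 | 15 |
| Falcon 180B | 32 | 8 | 2048 | 2048 | 52 |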

A100 GPUs (FP16)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 512 | 1 | 128 | 128 | 5,876 |
| GPT-J 6B | 32 | 1 | 128 | 2048 | 1,549 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 545 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 815 |
| Mistral 7B | 896 | 1 | 128 | 128 | 6,251 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 3,776 |
| Mistral 7B | 64 | 1 | 2048 | 128 | 698 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 1,576 |
| Mixtral 8x7B | 512 | 2 | 128 | 128 | 2,842 |
| Mixtral 8x7B | 128 | 2 | 128 | 2048 | 1,724 |
| Mixtral 8x7B | 64 | 2 | 2048 | 128 | 319 |
| Mixtral 8x7B | 32 | 2 | 2048 | 2048 | 801 |
| LLaMA 7B | 256 | 1 | 128 | 128 | 5,390 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,484 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 533 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 603 |
| LLaMA 70B | 1024 | 4 | 128 | 128 | 686 |
| LLaMA 70B | 512 | 8 | 128 | 2048 | 684 |
| LLaMA 70B | 96 | 4 | 2048 | 128 | 80 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 289 |
| Falcon 180B | 1024 | 8 | 128 | 128 | 254 |
| Falcon 180B | 512 | 8 | 128 | 2048 | 266 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 29 |
| Falcon 180B | 64 | 8 | 2048 | 2048 | 93 |

(1) TP stands for Tensor Parallelism.
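
As a worked example of reading the per-GPU figures: the throughput column is normalized by GPU count, so an aggregate figure for one model instance can be estimated by multiplying by the TP size. This is only a sketch. It assumes the instance occupies exactly TP GPUs (pipeline parallelism of 1), and `aggregate_throughput` is a hypothetical helper name, not part of TensorRT-LLM.

```shell
# Estimate aggregate output tokens/s for one model instance.
# Assumption: the instance spans exactly `tp` GPUs (no pipeline parallelism).
aggregate_throughput() {
    per_gpu=$1   # throughput from the table, in out tok/s/GPU
    tp=$2        # tensor parallelism size from the table
    echo $(( per_gpu * tp ))
}

# LLaMA 70B on H100, input 128 / output 128: 3,214 out tok/s/GPU at TP 2
aggregate_throughput 3214 2   # prints 6428
```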

Low Latency**

All data was generated using version 0.9.0.

** Low latency numbers will soon be updated to reflect real-time latency with in-flight batching.

The tables below provide reference data at batch size 1 for first token latency, representing the end user's perceived latency for online streaming tasks.

H200 GPUs (FP8)

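| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 5.0 |
| GPT-J 6B | 1 | 1 | 2048 | 23.5 |
| Mistral 7B | 1 | 1 | 128 | 5.9 |
| Mistral 7B | 1 | 1 | 2048 | 31.7 |
| LLaMA 7B | 1 | 1 | 128 | 5.7 |
| LLaMA 7B | 1 | 1 | 2048 | 30.2 |
| LLaMA 70B | 1 | 4 | 128 | 17.8 |
| LLaMA 70B | 1 | 4 | 2048 | 103.0 |
| Falcon 180B | 1 | 4 | 128 | 36.4 |
| Falcon 180B | 1 | 4 | 2048 | 194.4 |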

H100 GPUs (FP8)

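| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 5.5 |
| GPT-J 6B | 1 | 1 | 2048 | 23.8 |
| Mistral 7B | 1 | 1 | 128 | 6.5 |
| Mistral 7B | 1 | 1 | 2048 | 32.4 |
| LLaMA 7B | 1 | 1 | 128 | 6.3 |
| LLaMA 7B | 1 | 1 | 2048 | 30.8 |
| LLaMA 70B | 1 | 4 | 128 | 19.6 |
| LLaMA 70B | 1 | 8 | 2048 | 85.1 |
| Falcon 180B | 1 | 4 | 128 | 41.1 |
| Falcon 180B | 1 | 8 | 2048 | 129.9 |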

L40S GPUs (FP8)

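| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 12.4 |
| GPT-J 6B | 1 | 1 | 2048 | 61.7 |
| Mistral 7B | 1 | 1 | 128 | 15.4 |
| Mistral 7B | 1 | 1 | 2048 | 87.3 |
| LLaMA 7B | 1 | 1 | 128 | 14.1 |
| LLaMA 7B | 1 | 1 | 2048 | 80.1 |
| LLaMA 70B | 1 | 8 | 128 | 70.4 |
| LLaMA 70B | 1 | 4 | 2048 | 673.3 |
| Falcon 180B | 1 | 8 | 128 | 91.0 |
| Falcon 180B | 1 | 8 | 2048 | 768.8 |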

A100 GPUs (FP16)

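| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 14.8 |
| GPT-J 6B | 1 | 1 | 2048 | 136.4 |
| Mistral 7B | 1 | 1 | 128 | 16.3 |
| Mistral 7B | 1 | 1 | 2048 | 139.6 |
| Mixtral 8x7B | 1 | 2 | 128 | 23.8 |
| Mixtral 8x7B | 1 | 2 | 2048 | 160.9 |
| LLaMA 7B | 1 | 1 | 128 | 16.2 |
| LLaMA 7B | 1 | 1 | 2048 | 132.4 |
| LLaMA 70B | 1 | 4 | 128 | 45.6 |
| LLaMA 70B | 1 | 8 | 2048 | 249.2 |
| Falcon 180B | 1 | 8 | 128 | 76.5 |
| Falcon 180B | 1 | 8 | 2048 | 456.0 |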

(1) TP stands for Tensor Parallelism.

Known Issues

The following issues are being addressed to improve the efficiency of TensorRT-LLM.

Fused Matmul + Gated-SiLU (LLaMA)

The current implementation combines two Matmul operations into one Matmul followed by a separate SwiGLU kernel (when --use_fused_mlp is enabled). A future release will include a more efficient implementation that runs a single fused Matmul + SwiGLU kernel.
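
To make the distinction concrete, here is a sketch of the computation in our own notation (the symbols are illustrative and not taken from the TensorRT-LLM source). Assume a LLaMA-style gated MLP with input $x$, gate projection $W_g$ and up projection $W_u$; the final down projection is omitted.

```latex
\begin{aligned}
\text{two Matmuls:}\quad & g = x W_g, \qquad u = x W_u \\
\text{one Matmul on concatenated weights:}\quad & [\, g \;\; u \,] = x \, [\, W_g \;\; W_u \,] \\
\text{separate SwiGLU kernel:}\quad & y = \operatorname{SiLU}(g) \odot u, \qquad \operatorname{SiLU}(z) = \frac{z}{1 + e^{-z}}
\end{aligned}
```

In this picture, the current implementation performs the second and third lines as two kernel launches, while a fully fused kernel would apply the SwiGLU step inside the Matmul itself.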

Reproducing Benchmarked Results

Building the TensorRT-LLM Container


To benchmark TensorRT-LLM, follow the Quick Start build process to create a baseline container for building a wheel. The development container also needs a copy of the source code to build the wheel and the benchmarking script. To create the right build environment, use the following commands:

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull
make -C docker build
make -C docker run LOCAL_USER=1

Warning

If you have elevated privileges on your system, skip the make -C docker run LOCAL_USER=1 command above. The build forces your UID and GID to match those of your non-elevated user, which can leave you unable to access some required system libraries within the container. In some cases the container is booted as root (e.g., on some SLURM systems with the pyxis plugin), which will cause libraries to be missing.

If you are benchmarking in a shared environment, you need to specify the GPU indices that you would like the container to use; otherwise the Makefile defaults to loading the container with all GPUs on the system. For example, to restrict the container to four GPUs, list their indices in NV_GPU (indices 0 through 3 are shown here; substitute the ones assigned to you):

# Export NV_GPU so that make can read it from the environment
export NV_GPU=0,1,2,3
make -C docker run LOCAL_USER=1 GPU_OPTS='--gpus \"device=${NV_GPU}\"'

Additionally, if you'd like to mount external storage to access persistent storage, or previously built engines, you can mount directories as follows (simply replace source and destination with the appropriate paths):

make -C docker run LOCAL_USER=1 DOCKER_RUN_ARGS="-v /source:/destination"

Once the container starts, you'll need to build the wheel and the benchmarking scripts. From the code root (the default directory when the container is loaded), the following commands will build the TensorRT-LLM wheel, install dependencies, and build the benchmark scripts:

python3 ./scripts/build_wheel.py --benchmarks --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl

Methodology

Engine Building Setups

Each engine needs to be built before it can be benchmarked, and each build requires the source code for its respective build script. For smaller models, it is fine to build the engine on the fly in the container; however, for larger engines it is recommended to pre-build and mount a directory with the engine, because engine files are quite large and take time to rebuild repeatedly. Additionally, a built engine can be used for any input length, output length, and batch size up to the maximums set in its build options, meaning you can use one engine to benchmark multiple input configurations.
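
The reuse rule can be sketched as a small shell check. This is an illustration only: `fits_engine` is a hypothetical helper, and the limits passed to it stand for the --max_batch_size, --max_input_len, and --max_output_len values used when the engine was built.

```shell
# Succeed if a benchmark setting fits within an engine's build limits.
# Arguments: batch_size isl osl max_batch_size max_input_len max_output_len
fits_engine() {
    [ "$1" -le "$4" ] && [ "$2" -le "$5" ] && [ "$3" -le "$6" ]
}

# An engine built with limits 64 / 2048 / 2048 can serve 64:128,2048 ...
fits_engine 64 128 2048 64 2048 2048 && echo "reusable"
# ... but a batch size of 96 exceeds its max_batch_size.
fits_engine 96 128 128 64 2048 2048 || echo "needs a rebuild"
```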

In order to benchmark the various networks, our engine building scheme is as follows:

  • The GPT-J, Llama2-7b, and Llama2-70b benchmarks were run using a single-setting engine build for each network, configured for our maximum expected throughput.
  • For Falcon-180B, where memory limits and model size have a higher impact on running the model, our benchmarks transition to a per-configuration engine build.

Below we document how to benchmark each model on an H100-HBM3-80GB system and reproduce the throughput numbers we document in our [Performance section](#performance-of-tensorrt-llm).

Running on A100

To run the benchmarks below on A100, you will need to set the following quantization fields to null (or remove them) in each config json file, because A100 GPUs do not support FP8 computation.

"quantization": {
	"quant_algo": null,
	"kv_cache_quant_algo": null
}
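
If you prefer not to edit each file by hand, the substitution can be scripted. The following is a minimal sketch using sed. It assumes config files formatted like the examples in this document, with "FP8" as the only quantization value, and `disable_fp8` is a hypothetical helper name.

```shell
# Rewrite FP8 quantization settings to null in a config read from stdin.
# Matches both "quant_algo" and "kv_cache_quant_algo" entries.
disable_fp8() {
    sed -E 's/("(kv_cache_)?quant_algo":[[:space:]]*)"FP8"/\1null/'
}

# Demonstration on two lines taken from a quantization block:
printf '%s\n' '"quant_algo": "FP8",' '"kv_cache_quant_algo": "FP8"' | disable_fp8
```

To apply it to a real file, redirect the config into the function and write the result to a new file (for example, a copy named for A100) rather than overwriting the FP8 version.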

Reproducing First Token Latency

To test the latency to the first token, build the engines as specified below (or with the tweaks specified above on A100). Once an engine is built, benchmark it with a single output token to find the time to first token. We provide the appropriate command lines below for each of the benchmarked models, but you can use this same method to benchmark other models available in TensorRT-LLM.
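
Since a first-token run differs from a throughput run only in its output length, a throughput setting can be converted mechanically. The sketch below assumes settings written as batch_size:input_len,output_len, the format used by the benchmark loops in this document; `to_first_token` is a hypothetical helper name.

```shell
# Turn a throughput setting into a first-token setting by
# replacing the output length with 1.
to_first_token() {
    # Strip everything from the last comma onward, then append ",1"
    echo "${1%,*},1"
}

to_first_token "64:2048,128"   # prints 64:2048,1
```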

Benchmarking per Model

Warning

In some cases, using Group Query Attention (GQA) kernels can improve the performance of some networks. These kernels are currently experimental and not enabled by default. To enable them, run export TRTLLM_ENABLE_XQA=1 in your shell. The kernels are an inference runtime optimization, so previously built engines should still function. For the benchmarks below, we have enabled GQA where our tests displayed performance benefits. If your network is not listed below, be sure to try both GQA-enabled and GQA-disabled configurations to find the one that works best. For more details, see our documentation about GPT Attention.

GPT-J 6B


Prepare a config json file /tmp/engines/gptj/ckpt_config.json:

{
    "architecture": "GPTJForCausalLM",
    "dtype": "float16",
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "hidden_size": 4096,
    "norm_epsilon": 1e-05,
    "vocab_size": 50400,
    "position_embedding_type": "rope_gptj",
    "max_position_embeddings": 2048,
    "hidden_act": "gelu",
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "rotary_dim": 64
}

Build an engine:

trtllm-build --model_config /tmp/engines/gptj/ckpt_config.json \
	--output_dir /tmp/engines/gptj \
	--paged_kv_cache disable \
	--context_fmha enable \
	--gpt_attention_plugin float16 \
	--max_batch_size 64 \
	--max_input_len 2048 \
	--max_output_len 2048 \
	--strongly_typed

Throughput Benchmark

in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "64:2048,2048")
for in_out in ${in_out_sizes[@]}
do
	batch_size=$(echo $in_out | awk -F':' '{ print $1 }')
	in_out_dims=$(echo $in_out | awk -F':' '{ print $2 }')
	echo "BS: $batch_size, ISL/OSL: $in_out_dims"

	./cpp/build/benchmarks/gptSessionBenchmark --engine_dir /tmp/engines/gptj/ --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len $in_out_dims
done
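
The loops in this document encode each setting as batch_size:input_len,output_len and split it with awk. As an alternative sketch, the same split can be done with shell parameter expansion alone, which avoids spawning a subprocess per field. `parse_setting` is a hypothetical helper, shown only to illustrate the format.

```shell
# Split "BS:ISL,OSL" into its three fields using parameter expansion.
parse_setting() {
    setting=$1
    batch_size=${setting%%:*}   # text before the colon
    dims=${setting#*:}          # text after the colon, e.g. "2048,128"
    isl=${dims%,*}              # input sequence length
    osl=${dims#*,}              # output sequence length
    echo "$batch_size $isl $osl"
}

parse_setting "64:2048,128"   # prints 64 2048 128
```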

First Token Latency Benchmark

in_out_sizes=("64:128,1" "64:2048,1")
for in_out in ${in_out_sizes[@]}
do
	batch_size=$(echo $in_out | awk -F':' '{ print $1 }')
	in_out_dims=$(echo $in_out | awk -F':' '{ print $2 }')
	echo "BS: $batch_size, ISL/OSL: $in_out_dims"

	./cpp/build/benchmarks/gptSessionBenchmark --engine_dir /tmp/engines/gptj/ --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len $in_out_dims
done

Llama2-7b


Prepare a config json file /tmp/engines/llama/7b/ckpt_config.json:

{
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "hidden_size": 4096,
    "intermediate_size": 11008,
    "num_key_value_heads": 32,
    "vocab_size": 32000,
    "position_embedding_type": "rope_gpt_neox",
    "max_position_embeddings": 4096,
    "hidden_act": "silu",
    "rotary_base": 10000.0,
    "rotary_scaling": null,
    "norm_epsilon": 1e-05,
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    }
}

Build an engine:

pip install -r examples/llama/requirements.txt
trtllm-build --model_config /tmp/engines/llama/7b/ckpt_config.json \
	--output_dir /tmp/engines/llama/7b \
	--paged_kv_cache disable \
	--context_fmha enable \
	--gpt_attention_plugin float16 \
	--max_batch_size 64 \
	--max_input_len 2048 \
	--max_output_len 2048 \
	--strongly_typed

Throughput Benchmark

in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "32:2048,2048")
for in_out in ${in_out_sizes[@]}
do
	batch_size=$(echo $in_out | awk -F':' '{ print $1 }')
	in_out_dims=$(echo $in_out | awk -F':' '{ print $2 }')
	echo "BS: $batch_size, ISL/OSL: $in_out_dims"

	./cpp/build/benchmarks/gptSessionBenchmark --engine_dir /tmp/engines/llama/7b --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len $in_out_dims
done

First Token Latency Benchmark

in_out_sizes=("64:128,1" "32:2048,1")
for in_out in ${in_out_sizes[@]}
do
	batch_size=$(echo $in_out | awk -F':' '{ print $1 }')
	in_out_dims=$(echo $in_out | awk -F':' '{ print $2 }')
	echo "BS: $batch_size, ISL/OSL: $in_out_dims"

	./cpp/build/benchmarks/gptSessionBenchmark --engine_dir /tmp/engines/llama/7b --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len $in_out_dims
done

Llama2-70b


Prepare a config json file /tmp/engines/llama/70b/ckpt_config.json:

{
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "hidden_size": 8192,
    "intermediate_size": 28672,
    "num_key_value_heads": 8,
    "vocab_size": 32000,
    "position_embedding_type": "rope_gpt_neox",
    "max_position_embeddings": 4096,
    "hidden_act": "silu",
    "rotary_base": 10000.0,
    "rotary_scaling": null,
    "norm_epsilon": 1e-05,
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "mapping": {
        "world_size": 4,
        "tp_size": 4,
        "pp_size": 1
    }
}

Build an engine:

pip install -r examples/llama/requirements.txt
trtllm-build --model_config /tmp/engines/llama/70b/ckpt_config.json \
	--output_dir /tmp/engines/llama/70b \
	--workers 4 \
	--paged_kv_cache disable \
	--context_fmha enable \
	--gpt_attention_plugin float16 \
	--max_batch_size 64 \
	--max_input_len 2048 \
	--max_output_len 2048 \
	--strongly_typed

Throughput Benchmark

export TRTLLM_ENABLE_XQA=1
in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "64:2048,2048")
for in_out in ${in_out_sizes[@]}
do
	batch_size=$(echo $in_out | awk -F':' '{ print $1 }')
	in_out_dims=$(echo $in_out | awk -F':' '{ print $2 }')
	echo "BS: $batch_size, ISL/OSL: $in_out_dims"

	mpirun -n 4 --allow-run-as-root --oversubscribe ./cpp/build/benchmarks/gptSessionBenchmark --engine_dir /tmp/engines/llama/70b --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len $in_out_dims
done
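
The mpirun -n value in these commands matches the world_size in the config's mapping block, with one MPI rank per GPU. As a sanity check before launching, the rank count can be derived from the mapping. This sketch assumes world_size equals tp_size times pp_size, which holds for the mappings shown in this document; `ranks_for_mapping` is a hypothetical helper name.

```shell
# Number of MPI ranks to launch for a given mapping (one rank per GPU).
# Assumption: world_size = tp_size * pp_size.
ranks_for_mapping() {
    tp_size=$1
    pp_size=$2
    echo $(( tp_size * pp_size ))
}

ranks_for_mapping 4 1   # Llama2-70b mapping: prints 4
ranks_for_mapping 8 1   # Falcon-180B mapping: prints 8
```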

First Token Latency Benchmark

export TRTLLM_ENABLE_XQA=1
in_out_sizes=("64:128,1" "64:2048,1")
for in_out in ${in_out_sizes[@]}
do
	batch_size=$(echo $in_out | awk -F':' '{ print $1 }')
	in_out_dims=$(echo $in_out | awk -F':' '{ print $2 }')
	echo "BS: $batch_size, ISL/OSL: $in_out_dims"

	mpirun -n 4 --allow-run-as-root --oversubscribe ./cpp/build/benchmarks/gptSessionBenchmark --engine_dir /tmp/engines/llama/70b --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len $in_out_dims
done

Falcon-180B


Benchmarking Falcon-180B requires a custom engine for each combination of batch size and input/output sequence length, due to the large footprint of the model and the large input size of 2048. After preparing the config file, you can build and benchmark each engine one at a time with the loop that follows it.

Prepare a config json file /tmp/engines/falcon/180b/ckpt_config.json:

{
    "architecture": "FalconForCausalLM",
    "dtype": "bfloat16",
    "num_hidden_layers": 80,
    "num_attention_heads": 232,
    "num_key_value_heads": 8,
    "hidden_size": 14848,
    "norm_epsilon": 1e-05,
    "vocab_size": 65024,
    "position_embedding_type": "rope_gpt_neox",
    "max_position_embeddings": 2048,
    "hidden_act": "gelu",
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "mapping": {
        "world_size": 8,
        "tp_size": 8,
        "pp_size": 1
    },
    "bias": false,
    "parallel_attention": true,
    "new_decoder_architecture": true
}

Build and benchmark each engine:

export TRTLLM_ENABLE_XQA=1
# Benchmark specific batch size:isl:osl combinations.
in_out_sizes=("96:128,128" "96:128,2048" "64:2048,128")
for in_out in ${in_out_sizes[@]}
do
	batch_size=$(echo $in_out | awk -F':' '{ print $1 }')
	in_out_dims=$(echo $in_out | awk -F':' '{ print $2 }')
	isl=$(echo $in_out_dims | awk -F',' '{ print $1 }')
	osl=$(echo $in_out_dims | awk -F',' '{ print $2 }')
	engine_path="/tmp/engines/falcon/180b/${batch_size}_${isl}_${osl}"
	echo "BS: $batch_size, ISL/OSL: ${isl},${osl}"

	# Build the specific engine for the BS,ISL,OSL combination
	trtllm-build --model_config /tmp/engines/falcon/180b/ckpt_config.json \
		--output_dir $engine_path \
		--workers 8 \
		--paged_kv_cache disable \
		--context_fmha enable \
		--gpt_attention_plugin bfloat16 \
		--max_batch_size $batch_size \
		--max_input_len $isl \
		--max_output_len $osl \
		--strongly_typed

	# Throughput benchmark
	mpirun -n 8 --allow-run-as-root --oversubscribe ./cpp/build/benchmarks/gptSessionBenchmark --engine_dir $engine_path --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len "${isl},${osl}"
	# Time to first token benchmark
	mpirun -n 8 --allow-run-as-root --oversubscribe ./cpp/build/benchmarks/gptSessionBenchmark --engine_dir $engine_path --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len "${isl},1"

	# The Falcon-180b engine is quite large, remove after the benchmark to free up space
	# Remove this line if you'd like to save the engines.
	rm -r $engine_path
done