System Info

While trying to load a Falcon 40B model (OpenAssistant/falcon-40b-sft-mix-1226) 8-bit quantized (--quantize bitsandbytes), we noticed that this is currently not possible on H100 machines.

On an A100 80GB the model can be loaded with the following command:
text-generation-launcher --model-id OpenAssistant/falcon-40b-sft-mix-1226 -p 8080 --quantize bitsandbytes --max-input-length 1024 --max-total-tokens 2048 --max-batch-prefill-tokens 1024

The same command on an H100 machine crashes with "RuntimeError: Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens". Full output including the traceback: here.

While the error message suggests a memory problem, that is probably not the actual issue. When TGI prints the warmup message, the GPU memory usage reported by nvidia-smi is ~45000 MiB.

The RuntimeError is raised in flash_causal_lm.py#L730. The actual exception that triggers it inside the self.generate_token() call is "Exception: cublasLt ran into an error!", which could be related to bitsandbytes-foundation/bitsandbytes#599.

Installation details:
The TGI installation on both the A100 and H100 machines (Lambda Labs and RunPod) was done outside Docker in a Python 3.10 venv; the commands used for the installation can be found in the following gist.

Loading the 40B model 4-bit quantized with --quantize bitsandbytes-nf4 also works on an H100.

(Thanks to @tju01 for the cross-check on RunPod machines and the analysis of the error.)
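For reference, here is a minimal sketch of how the suspected bitsandbytes int8 path could be exercised outside of TGI. This is an assumption-laden isolation test rather than something from the original report; the layer size is arbitrary and not taken from Falcon 40B.

```python
# Isolation sketch (assumption: if the cublasLt failure comes from the
# bitsandbytes int8 kernels, it should also surface in a bare 8-bit linear layer).
import torch
import bitsandbytes as bnb

print("device:", torch.cuda.get_device_name(0),
      "capability:", torch.cuda.get_device_capability(0))  # A100 = (8, 0), H100 = (9, 0)
free, total = torch.cuda.mem_get_info()
print(f"free/total GPU memory: {free / 2**30:.1f} / {total / 2**30:.1f} GiB")

# has_fp16_weights=False selects the pure int8 weight path used for 8-bit inference
layer = bnb.nn.Linear8bitLt(4096, 4096, bias=False, has_fp16_weights=False).cuda()
x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
with torch.no_grad():
    y = layer(x)  # if the int8 kernels are the culprit, "cublasLt ran into an error!" would surface here
print("ok:", y.shape)
```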
Information
Docker
The CLI directly
Tasks
An officially supported command
My own modifications
Reproduction
1. Install TGI on an H100 system (e.g. using the commands here).
2. Launch the server: text-generation-launcher --model-id OpenAssistant/falcon-40b-sft-mix-1226 -p 8080 --quantize bitsandbytes --max-input-length 1024 --max-total-tokens 2048 --max-batch-prefill-tokens 1024
3. Error shown:
text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1024 prefill tokens.
Expected behavior
Check what type of exception was thrown and emit a helpful error message instead of always assuming that the cause is the prefill token budget, because sometimes it is not. (And of course it would be great if the 8-bit quantized model also ran on an H100, but that probably needs to be resolved in bitsandbytes.)
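As a rough illustration of the kind of check this asks for (a simplified stand-in, not the actual TGI code; the function name and arguments here are invented for the example), only a genuine CUDA out-of-memory error would be mapped to the prefill-token hint, while anything else, such as the cublasLt failure, would propagate with its original message:

```python
# Sketch only: simplified stand-in for the warmup error handling around
# flash_causal_lm.py#L730; generate_token is passed in as a callable here.
import torch

def run_warmup(generate_token, batch, max_prefill_tokens: int):
    try:
        return generate_token(batch)
    except torch.cuda.OutOfMemoryError as e:
        # only a real OOM should suggest lowering the prefill budget
        raise RuntimeError(
            f"Not enough memory to handle {max_prefill_tokens} prefill tokens. "
            "You need to decrease --max-batch-prefill-tokens"
        ) from e
    # any other exception (e.g. "cublasLt ran into an error!") propagates
    # unchanged, so the real root cause shows up in the server logs
```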