System Info

While trying to load a Falcon 40B model (OpenAssistant/falcon-40b-sft-mix-1226) 8-bit quantized (--quantize bitsandbytes), we noticed that this is currently not possible on H100 machines.

On an A100 80GB the model can be loaded with the following command:
text-generation-launcher --model-id OpenAssistant/falcon-40b-sft-mix-1226 -p 8080 --quantize bitsandbytes --max-input-length 1024 --max-total-tokens 2048 --max-batch-prefill-tokens 1024

The same command on an H100 machine crashes with "RuntimeError: Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens". Full output including the traceback: here.

While the error message suggests a memory problem, that is probably not the actual issue. When TGI prints the warmup message, the GPU memory usage reported by nvidia-smi is ~45000 MiB.

The RuntimeError is raised in flash_causal_lm.py#L730. The actual exception that triggers it inside the self.generate_token() call is "Exception: cublasLt ran into an error!", which could be related to bitsandbytes-foundation/bitsandbytes#599.

Installation details:
The TGI installation on both the A100 and H100 machines (Lambda Labs and RunPod) was done outside Docker in a Python 3.10 venv; the commands used for the installation can be found in the following gist.

Loading the 40B model 4-bit quantized with --quantize bitsandbytes-nf4 also works on an H100.

(Thanks to @tju01 for the cross-check on RunPod machines and the analysis of the error.)
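For reference, here is a minimal sketch of how the suspected bitsandbytes int8 path could be exercised outside of TGI. This is an assumption-laden isolation test rather than something from the original report; the layer size is arbitrary and not taken from Falcon 40B.

```python
# Isolation sketch (assumption: if the cublasLt failure comes from the
# bitsandbytes int8 kernels, it should also surface in a bare 8-bit linear layer).
import torch
import bitsandbytes as bnb

print("device:", torch.cuda.get_device_name(0),
      "capability:", torch.cuda.get_device_capability(0))  # A100 = (8, 0), H100 = (9, 0)
free, total = torch.cuda.mem_get_info()
print(f"free/total GPU memory: {free / 2**30:.1f} / {total / 2**30:.1f} GiB")

# has_fp16_weights=False selects the pure int8 weight path used for 8-bit inference
layer = bnb.nn.Linear8bitLt(4096, 4096, bias=False, has_fp16_weights=False).cuda()
x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
with torch.no_grad():
    y = layer(x)  # if the int8 kernels are the culprit, "cublasLt ran into an error!" would surface here
print("ok:", y.shape)
```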
Information
Docker
The CLI directly
Tasks
An officially supported command
My own modifications
Reproduction
1. Install TGI on an H100 system (e.g. using the commands here).
2. Launch the server: text-generation-launcher --model-id OpenAssistant/falcon-40b-sft-mix-1226 -p 8080 --quantize bitsandbytes --max-input-length 1024 --max-total-tokens 2048 --max-batch-prefill-tokens 1024
3. Error shown:
text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1024 prefill tokens.
Expected behavior
Check what type of exception was thrown and emit a helpful error message instead of always assuming that the cause is the prefill token budget, because sometimes it is not. (And of course it would be great if the 8-bit quantized model also ran on an H100, but that probably needs to be resolved in bitsandbytes.)
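As a rough illustration of the kind of check this asks for (a simplified stand-in, not the actual TGI code; the function name and arguments here are invented for the example), only a genuine CUDA out-of-memory error would be mapped to the prefill-token hint, while anything else, such as the cublasLt failure, would propagate with its original message:

```python
# Sketch only: simplified stand-in for the warmup error handling around
# flash_causal_lm.py#L730; generate_token is passed in as a callable here.
import torch

def run_warmup(generate_token, batch, max_prefill_tokens: int):
    try:
        return generate_token(batch)
    except torch.cuda.OutOfMemoryError as e:
        # only a real OOM should suggest lowering the prefill budget
        raise RuntimeError(
            f"Not enough memory to handle {max_prefill_tokens} prefill tokens. "
            "You need to decrease --max-batch-prefill-tokens"
        ) from e
    # any other exception (e.g. "cublasLt ran into an error!") propagates
    # unchanged, so the real root cause shows up in the server logs
```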