
Misleading "Not enough memory" message (H100, 8-bit quantized bitsandbytes issue) #780

Closed

andreaskoepf opened this issue Aug 6, 2023 · 2 comments


andreaskoepf commented Aug 6, 2023

System Info

While trying to load the Falcon 40B model OpenAssistant/falcon-40b-sft-mix-1226 with 8-bit quantization (--quantize bitsandbytes), we noticed that this is currently not possible on H100 machines.

On an A100 80GB it is possible to load the model with the following command:
text-generation-launcher --model-id OpenAssistant/falcon-40b-sft-mix-1226 -p 8080 --quantize bitsandbytes --max-input-length 1024 --max-total-tokens 2048 --max-batch-prefill-tokens 1024

The same command on an H100 machine crashes with RuntimeError: Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens. Full output including the traceback: here.

While the error seems to indicate a memory problem, this is probably not the actual issue here: when TGI prints the warmup message, the GPU memory usage reported by nvidia-smi is ~45,000 MiB.

The RuntimeError is thrown in flash_causal_lm.py#L730. The actual exception that triggers it inside the self.generate_token() call is "Exception: cublasLt ran into an error!". It could be related to bitsandbytes-foundation/bitsandbytes#599.
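
For context, the failing path roughly follows the pattern below. This is a simplified sketch, not the actual TGI source; `warmup`, `model`, `batch`, and `max_prefill_tokens` are stand-in names:

```python
# Simplified sketch of the warmup error handling around flash_causal_lm.py#L730.
# Not the actual TGI code; names below are stand-ins for illustration.

def warmup(model, batch, max_prefill_tokens: int):
    try:
        model.generate_token(batch)
    except Exception:
        # Whatever generate_token actually raised (here: "cublasLt ran into an
        # error!") is discarded and replaced by a generic memory message, so the
        # launcher only ever shows the misleading prefill-token error.
        raise RuntimeError(
            f"Not enough memory to handle {max_prefill_tokens} prefill tokens. "
            "You need to decrease `--max-batch-prefill-tokens`"
        )
```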

Installation details:
The TGI installation on both the A100 and H100 machines (Lambda Labs and RunPod) was done outside Docker in a Python 3.10 venv; the commands used for the installation can be found in the following gist.

Loading the 40B model 4-bit quantized with --quantize bitsandbytes-nf4 also works on an H100.
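
For reference, the working 4-bit invocation was presumably the same command with only the quantize flag changed (other arguments assumed identical):
text-generation-launcher --model-id OpenAssistant/falcon-40b-sft-mix-1226 -p 8080 --quantize bitsandbytes-nf4 --max-input-length 1024 --max-total-tokens 2048 --max-batch-prefill-tokens 1024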

(Thanks to @tju01 for cross-checking on the RunPod machines and analyzing the error.)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Install TGI on an H100 system (e.g. use the commands here)
  2. Execute text-generation-launcher --model-id OpenAssistant/falcon-40b-sft-mix-1226 -p 8080 --quantize bitsandbytes --max-input-length 1024 --max-total-tokens 2048 --max-batch-prefill-tokens 1024

Error shown:

text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1024 prefill tokens.

Expected behavior

Check what type of exception was thrown and give a helpful error message, instead of always assuming the cause is the max prefill tokens, because sometimes it is not. (And of course it would be great if the 8-bit quantized model also ran on an H100, but that probably needs to be resolved in bitsandbytes.)
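
For example, the warmup wrapper could distinguish a genuine CUDA OOM from other failures and keep the original message. A minimal sketch of that idea (illustrative only, not a proposed patch; `model`, `batch`, and `max_prefill_tokens` are stand-ins, and it assumes torch.cuda.OutOfMemoryError is available):

```python
import torch

# Illustrative sketch only, not the actual TGI implementation: raise the
# "decrease --max-batch-prefill-tokens" hint only for a real CUDA OOM and
# surface every other failure (such as the cublasLt error) with its message.
def warmup(model, batch, max_prefill_tokens: int):
    try:
        model.generate_token(batch)
    except torch.cuda.OutOfMemoryError:
        raise RuntimeError(
            f"Not enough memory to handle {max_prefill_tokens} prefill tokens. "
            "You need to decrease `--max-batch-prefill-tokens`."
        )
    except Exception as e:
        # Keep the original exception as the cause so the real error is visible.
        raise RuntimeError(f"Warmup failed: {e}") from e
```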

@abhinavkulkarni (Contributor) commented:
Hey @andreaskoepf,

I think we are both facing the same issue with the warmup method of client.rs. I have commented about it here: #778 (comment)

github-actions (bot) commented:

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Apr 15, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 21, 2024