
Waiting for output from MQLLMEngine #3

Open · ekmekovski opened this issue Nov 8, 2024 · 7 comments
Labels: question (Further information is requested)

Comments

@ekmekovski

Hello sir, firstly thanks for the repository.
I am trying to run a Llama 3.1 model with GPTQ-8 quantization on multiple GPUs using the repo. When I tried to load it, the first shard loaded (it took ~4 GB, but the model consumes ~10 GB on a single GPU), and then it entered a hanging state. While debugging I saw: "Waiting for output from MQLLMEngine". Do you have any idea?
Thanks in advance

@sasha0552
Owner

Hi! If you are using P40s, do NOT use GPTQ. Use literally anything else: AQLM, GGUF. Or load the models in fp16. GPTQ doesn't work adequately on the P40s; it will load (after a few hours) but will be very slow.
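As an illustration of the fp16 route (not part of the original thread), here is a minimal sketch using vLLM's offline Python API; the model name and tensor-parallel size are placeholders for whatever checkpoint and GPU count are actually in use:

```python
# Minimal sketch: load the model in fp16 instead of GPTQ on Pascal GPUs.
# "meta-llama/Llama-3.1-8B-Instruct" and tensor_parallel_size=2 are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: any non-GPTQ checkpoint
    dtype="float16",             # fp16 weights, avoiding the GPTQ kernels entirely
    tensor_parallel_size=2,      # shard across the available GPUs
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```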

@sasha0552 sasha0552 added the question Further information is requested label Nov 8, 2024
@ekmekovski
Author

ekmekovski commented Nov 8, 2024

Correction: it was a P4 😃 but I think the same case still applies, right? I will try a different quantization. Thanks again!

@sasha0552
Owner

Yeah, this applies to the P4 as well. GPTQ only works on P100.

@ekmekovski
Author

Thank you for the replies. I was only able to use GGUF. I tried BnB and AWQ too, but when I start the vLLM engine it says these quantization methods are supported for compute capability 7.0 and 8.0 respectively. I wanted to ask: is this about the vLLM implementation, or is it that the P4 / Pascal architecture cannot use these quantization methods? Thanks again.

@sasha0552
Owner

sasha0552 commented Nov 9, 2024

Theoretically, AWQ can be patched (you need at least something like vllm-project/vllm#1345), and it should have acceptable performance (even faster than fp16). Currently it generally works (i.e. it doesn't crash), but it generates gibberish (ref: vllm-project/vllm#5058). Anything INT8/FP8 related cannot run on Pascal due to architecture limitations.

[screenshot: table of quantization methods with checkmarks indicating hardware support]
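For context on the capability numbers mentioned above (this snippet is not from the thread), the following checks what the GPUs report; Pascal cards return 6.x (P100 is 6.0, P4/P40 are 6.1), which is below the 7.0/8.0 minimums the engine prints and below what the INT8/FP8 paths require:

```python
# Print the CUDA compute capability of each visible GPU.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i}: {name} -> compute capability {major}.{minor}")
```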

@ekmekovski
Author

What about BnB? I see that you put a checkmark for it, but I think it does not work well. Am I wrong?

@sasha0552
Owner

You should replace 70 with 60 in this file to make it work.

It worked at least three months ago.

Alternatively, you can try the Triton AWQ implementation that vLLM uses for AMD GPUs. It should work (in theory). You can enable it with the VLLM_USE_TRITON_AWQ environment variable (note that you may also need to replace 75 with 60 there too).
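A minimal sketch of that suggestion (untested on Pascal, as noted above, and assuming the environment variable behaves on CUDA builds as it does for the ROCm path); the AWQ model name is a placeholder:

```python
# Sketch: force vLLM's Triton AWQ path via the environment variable mentioned above.
# Set it before vLLM starts the engine.
import os
os.environ["VLLM_USE_TRITON_AWQ"] = "1"

from vllm import LLM

llm = LLM(
    model="some-org/Llama-3.1-8B-Instruct-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
    dtype="float16",
)
```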
