
Waiting for output from MQLLMEngine #3

Open · ekmekovski opened this issue Nov 8, 2024 · 7 comments
Labels: question (Further information is requested)

Comments

@ekmekovski

Hello sir, firstly thanks for the repository.
I am trying to run a Llama 3.1 model with GPTQ-8 quantization on multiple GPUs using the repo. When I tried to load it, the first shard loaded (it took ~4 GB, but the model consumes ~10 GB on a single GPU), and then it entered a hanging state. While debugging I saw: "Waiting for output from MQLLMEngine". Do you have any idea?
Thanks in advance

@sasha0552
Owner

Hi! If you are using P40s, do NOT use GPTQ. Use literally anything else: AQLM, GGUF. Or load the models in fp16. GPTQ doesn't work adequately on the P40s; it will load (after a few hours) but will be very slow.
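As an illustration of the fp16 route (not part of the original thread), here is a minimal sketch using vLLM's offline Python API; the model name and tensor-parallel size are placeholders for whatever checkpoint and GPU count are actually in use:

```python
# Minimal sketch: load the model in fp16 instead of GPTQ on Pascal GPUs.
# "meta-llama/Llama-3.1-8B-Instruct" and tensor_parallel_size=2 are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: any non-GPTQ checkpoint
    dtype="float16",             # fp16 weights, avoiding the GPTQ kernels entirely
    tensor_parallel_size=2,      # shard across the available GPUs
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```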

@sasha0552 sasha0552 added the question Further information is requested label Nov 8, 2024
@ekmekovski
Author

ekmekovski commented Nov 8, 2024

Correction: it was a P4 😃 but I think the same case still applies, right? I will try a different quantization. Thanks again!

@sasha0552
Owner

Yeah, this applies to the P4 as well. GPTQ only works on P100.

@ekmekovski
Author

Thank you for the replies. I was only able to use GGUF. I tried BnB and AWQ too, but when I start the vLLM engine it says these quantization methods are supported for compute capability 7.0 and 8.0 respectively. I wanted to ask: is this about the vLLM implementation, or is it that the P4 / Pascal architecture cannot use these quantization methods? Thanks again.

@sasha0552
Owner

sasha0552 commented Nov 9, 2024

Theoretically, AWQ can be patched (you need at least something like vllm-project/vllm#1345), and it should have acceptable performance (even faster than fp16). Currently it generally works (i.e. it doesn't crash), but it generates gibberish (ref: vllm-project/vllm#5058). Anything INT8/FP8 related cannot run on Pascal due to architecture limitations.

[screenshot: table of quantization methods with checkmarks indicating hardware support]
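For context on the capability numbers mentioned above (this snippet is not from the thread), the following checks what the GPUs report; Pascal cards return 6.x (P100 is 6.0, P4/P40 are 6.1), which is below the 7.0/8.0 minimums the engine prints and below what the INT8/FP8 paths require:

```python
# Print the CUDA compute capability of each visible GPU.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i}: {name} -> compute capability {major}.{minor}")
```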

@ekmekovski
Author

What about BnB? I see that you put a checkmark for it, but I think it does not work well. Am I wrong?

@sasha0552
Owner

You should replace 70 with 60 in this file to make it work.

It worked at least three months ago.

Alternatively, you can try the Triton AWQ implementation that vLLM uses for AMD GPUs. It should work (in theory). You can enable it with the VLLM_USE_TRITON_AWQ environment variable (note that you may also need to replace 75 with 60 there too).
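A minimal sketch of that suggestion (untested on Pascal, as noted above, and assuming the environment variable behaves on CUDA builds as it does for the ROCm path); the AWQ model name is a placeholder:

```python
# Sketch: force vLLM's Triton AWQ path via the environment variable mentioned above.
# Set it before vLLM starts the engine.
import os
os.environ["VLLM_USE_TRITON_AWQ"] = "1"

from vllm import LLM

llm = LLM(
    model="some-org/Llama-3.1-8B-Instruct-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
    dtype="float16",
)
```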
