Waiting for output from MQLLMEngine #3
Hi! If you are using P40s, do NOT use GPTQ. Use literally anything else: AQLM, GGUF, or load the models in fp16. GPTQ doesn't work adequately on the P40s; it will load (after a few hours), but it will be very slow.
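For reference, a minimal sketch of what the fp16 or GGUF route might look like with vLLM's offline API. The model name and GGUF path are placeholders, not taken from this thread, and GGUF support depends on your vLLM version:

```python
from vllm import LLM, SamplingParams

# Load the model in fp16 instead of a GPTQ quant (placeholder model name).
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="float16")

# Or, if your vLLM build has GGUF support, point it at a GGUF file instead
# (hypothetical local path; the tokenizer usually has to be given explicitly):
# llm = LLM(model="/models/llama-3.1-8b-instruct-Q4_K_M.gguf",
#           tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=64)
print(llm.generate(["Hello, how are you?"], params)[0].outputs[0].text)
```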
Correction, it was a P4 😃 but I think the case still applies, right? I will try with a different quantization. Thanks again!
Yeah, this applies to the P4 as well. GPTQ only works on P100.
Thank you for the replies. I was only able to use GGUF. I tried BnB and AWQ too, but when I start the vLLM engine it says those quantization methods are supported for compute capability 7.0 and 8.0 respectively. I wanted to ask: is this a limitation of the vLLM implementation, or can the P4 (Pascal architecture) simply not use these quantization methods? Thanks again.
Theoretically, AWQ can be patched (you need at least something like vllm-project/vllm#1345), and it should have acceptable performance (even faster than fp16). Currently it generally works (i.e. it doesn't crash), but it generates gibberish (ref: vllm-project/vllm#5058). Anything INT8/FP8-related cannot run on Pascal due to architecture limitations.
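As a quick check of what a given card supports, a small PyTorch sketch that prints the compute capability (the 7.0/8.0 thresholds quoted above come from the vLLM error message in the previous comment):

```python
import torch

# Print the CUDA compute capability of each visible GPU. Pascal cards report
# 6.x (P4/P40 are 6.1, P100 is 6.0), below the 7.0/8.0 minimums that some
# vLLM quantization kernels require.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> compute capability {major}.{minor}")
```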
What about BnB? I see that you put a checkmark, but I think it does not work well. Am I wrong?
It worked at least three months ago. Alternatively, you can try the Triton AWQ implementation that vLLM uses for AMD GPUs. It should work (in theory). You can enable it with the corresponding environment variable.
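A hedged sketch of what enabling that path could look like; the environment variable name here is an assumption based on vLLM's Triton AWQ path for ROCm and is not confirmed in this thread, so verify it against your installed version. The model name is a placeholder:

```python
import os

# Assumed flag: vLLM's Triton AWQ kernels are normally used on ROCm; this
# variable name is not confirmed in the thread, so check your vLLM version.
os.environ["VLLM_USE_TRITON_AWQ"] = "1"

from vllm import LLM

# Placeholder AWQ checkpoint.
llm = LLM(model="some-org/Llama-3.1-8B-Instruct-AWQ", quantization="awq")
```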
Hello sir, firstly thanks for the repository.
I am trying to run a Llama 3.1 model with GPTQ-8 quantization on multiple GPUs using the repo. When I tried to load it, the first shard loaded (it took ~4 GB, although the model consumes ~10 GB on a single GPU), and then it entered a hanging state. While debugging I saw: "Waiting for output from MQLLMEngine". Do you have any idea?
Thanks in advance
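For context, a minimal sketch of how such a multi-GPU launch might look with vLLM's tensor parallelism; the model name and GPU count are placeholders, not taken from the issue:

```python
from vllm import LLM

# Placeholder settings: shard a GPTQ-quantized Llama 3.1 across 2 GPUs.
# With tensor_parallel_size > 1, vLLM spawns worker processes and the front end
# waits ("Waiting for output from MQLLMEngine") until every worker has loaded
# its shard, so a hang at this point usually means a worker is stalled.
llm = LLM(
    model="some-org/Llama-3.1-8B-Instruct-GPTQ-8bit",  # hypothetical repo id
    quantization="gptq",
    dtype="float16",
    tensor_parallel_size=2,
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```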