Add QuIP# support #4803
Conversation
The installation procedure is almost identical to the one for GPTQ-for-LLaMa; maybe @jllllll can come to the rescue and create wheels for this one as well.
Is there any minimum architecture requirement for this (for example, AWQ quantization requires Ampere or better on Nvidia cards)? I'm trying to figure out whether it will work on cards like the P40, which uses the Compute 6.1 architecture.
I don't know. Most of these custom CUDA kernels require Ampere cards, but my old fork of GPTQ-for-LLaMa has a custom kernel and works on Pascal cards. I guess it depends on the operations performed. Maybe @tsengalb99 can tell us what the requirements are.
Hi - a few things:
BTW our models should work with HF's AutoTokenizer. We have multiple places in our code where we just call AutoTokenizer and everything works fine.
Thanks for the reply @tsengalb99. Updates and eventual breaking changes are expected, and I'll make sure to update the code in this PR accordingly over time. About CUDA graphs and the HF
That doesn't work with a local copy of relaxml/Llama-2-70b-E8P-2Bit:
The problem is that the tokenizer files are not present in the repository. This can be easily fixed by uploading the tokenizer files here (or any other copy of the default Llama tokenizer) to that repository.
You need to extract the base model string (e.g. meta-llama/Llama-2-7b-hf) which
I had seen this, but this repository is based on loading from local copies of HF repositories stored in a local folder. This is very secondary and I wouldn't worry about it.
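For reference, a minimal sketch of the tokenizer-lookup approach described above (extract the base model string from the quantized checkpoint, then load that model's tokenizer). The assumption that the base model id is stored under "_name_or_path" in config.json is mine and may not match quip-sharp's actual config layout:

```python
import json
from pathlib import Path

from transformers import AutoTokenizer


def load_tokenizer_for_quip_model(model_dir: str):
    """Load the tokenizer of the base model a QuIP# checkpoint was quantized from.

    Assumes the checkpoint's config.json records the original model id under
    "_name_or_path" (e.g. "meta-llama/Llama-2-7b-hf"); adjust the key if the
    actual quip-sharp config uses a different field.
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    base_model = config.get("_name_or_path")
    if base_model is None:
        raise ValueError("Could not find the base model string in config.json")
    return AutoTokenizer.from_pretrained(base_model)


# Example (hypothetical local path):
# tokenizer = load_tokenizer_for_quip_model("models/Llama-2-70b-E8P-2Bit")
```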
Hi, another QuIP# author here. Depending on what you're interested in, we also have quantized 4-bit models in our huggingface repo (ex: relaxml/Llama-2-70b-chat-HI-4Bit-Packed) that have much smaller degradation from the fp16 model. We expect to have fast inference with these 4-bit models approximately by the end of the week; our current forward pass code is a slower, naive implementation of the codebook for this specific 4-bit quantization.
That's great to hear @jerry-chee, thanks for the information.
I spent a while trying to create GitHub Actions wheels for quip-sharp here and failed, so I gave up and instead just added an error message instructing the user to install manually. I also removed the usage of a default Llama tokenizer as this causes issues such as Cornell-RelaxML/quip-sharp#6. It would be good if the repositories were updated to include the corresponding tokenizer files -- every GPTQ, AWQ, and EXL2 repository on HF contains these. Hopefully the interest in quip-sharp will increase and someone will soon be able to find a solution to the CUDA graphs issue for better performance. I am personally already happy with the 8 tokens/second I am getting for 70b models.
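The fallback described above presumably amounts to a guarded import with a helpful message. A purely illustrative sketch follows; the module name quiptools_cuda is an assumption, and the real loader code in this PR may differ:

```python
import importlib.util


def ensure_quip_sharp_installed():
    # Check whether the QuIP# CUDA extension is importable; "quiptools_cuda"
    # is an assumed module name, adjust it to whatever quip-sharp actually exports.
    if importlib.util.find_spec("quiptools_cuda") is None:
        raise ImportError(
            "QuIP# support requires quip-sharp. Please install it manually from "
            "https://github.com/Cornell-RelaxML/quip-sharp"
        )
```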
Interesting, we can take a look at that later as a very low priority thing.
We will try to do that some time in the next few weeks.
I filed a ticket with huggingface huggingface/transformers#27837 and it's on their todo list. We have faster kernels in the pipeline so the speed will increase from those alone.
@tsengalb99 To make Pascal work fast, like your 1060, requires upcasting to FP32 math. Pascal also has no tensor cores and lacks fast atomicAdd, but there are functions for the latter that can be used in its place and they are reasonable. Compute 6.1 also has dp4a instructions that can be used to speed things up. Why would anyone bother? The P40 is prolific and is the only other 24GB card besides the 3090 with that much RAM. On top of that, it's $200. Otherwise people are stuck with janky 7b and 13b models, which are useful as simple tools and that's about it. If the goal is to run larger models, I think Pascal support is a good thing to have.
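To illustrate the upcasting idea in the comment above (this is not QuIP#'s actual kernel, just the general workaround), a minimal PyTorch sketch that keeps weights stored in FP16 but does the math in FP32:

```python
import torch


def matmul_fp32_upcast(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Multiply FP16 tensors by upcasting to FP32 for the math.

    On consumer Pascal cards (e.g. the P40, compute 6.1) native FP16 math is
    very slow, so storing weights in half precision but accumulating in float
    is the usual workaround.
    """
    return (x.float() @ w.float()).to(x.dtype)


x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
y = matmul_fp32_upcast(x, w)  # FP16 output, FP32 accumulation
```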
I'm getting this error after doing python setup.py install. I use Win11 with a 3090 Ti, and I have NVCC and Visual Studio 2022. @oobabooga any idea?
So I ran cmd_windows, copied and pasted the first command to install Quip manually, and it gave me an error. |
Can you paste the error? |
On WSL with Ubuntu LTS, quiptools-cuda compiled with CUDA 11.8, not 12.1. Edit: 9.2 GB of VRAM used for Llama-1-30b-E8P-2Bit at ~4.70 tokens/s on a 3060 12GB; it's bonkers.
I tried it but still get this error:
You still have CUDA 12.1 installed. At compilation, you might see this warning instead:
If nothing works, search for the text-gen-install folder in your WSL home directory, back up your files, delete the text-gen-install folder, and start fresh with CUDA 11.8 installed.
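If it helps with the 11.8 vs 12.1 confusion, here is a small sketch that prints which CUDA toolkit PyTorch was built against and which nvcc is on the PATH, so the two can be matched up before compiling:

```python
import shutil
import subprocess

import torch

# CUDA version that the installed PyTorch wheel was built against
print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)

# CUDA version of the nvcc that will be used to compile quiptools-cuda
nvcc = shutil.which("nvcc")
if nvcc is None:
    print("nvcc not found on PATH")
else:
    print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)
```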
Doesn't work on Windows 10 for me, here are my specs:
Here's my error:
I think we just need to download a pre-compiled wheel and use it instead of building it @BadisG |
Do we have such a wheel yet, @iChristGit?
Not yet, sadly. I also want to run it natively on Windows 11. It's the same errors; I suppose someone with a native Linux build could compile and upload a wheel, just like with the old GPTQ.
This can be fixed by disabling Ninja (see the sketch after this comment for one way to do that).
But then you'll also probably get this:
The Internet suggests we need the
Note that oobabooga already attempted to make wheels, so for Windows we might just need to wait for that to succeed, or for the QuIP# devs to give some pointers or fix their setup script.
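For anyone experimenting with the Windows build, disabling Ninja for a torch CUDA extension usually looks something like the setup.py below. This is a generic sketch, not quip-sharp's actual build script; the extension name and source list are placeholders:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="quiptools_cuda",  # placeholder name
    ext_modules=[
        CUDAExtension(
            name="quiptools_cuda",
            sources=["quiptools/quiptools.cu"],  # placeholder source list
        )
    ],
    # use_ninja=False falls back to the slower distutils build, which sometimes
    # behaves better with MSVC on Windows than the Ninja backend does.
    cmdclass={"build_ext": BuildExtension.with_options(use_ninja=False)},
)
```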
Man, I keep hoping that Quip will work out of the box with each new iteration of the webui, but so far, still no luck. It's still asking to install Quip manually.
It takes time. When GPTQ was first released I was pulling my hair out with each error trying to compile it; now it's a one-click install.
hahaha yeah I know. Same with Exllama 2. It wouldn't work at all when it was released. |
Sorry we (the QuIP# team) can't be of much help here since we don't have any access to Windows machines with NVIDIA GPUs. We're hoping to package quiptools into a wheel in the future when it becomes more mature, but as of now since QuIP# is a WIP the install process is a bit more involved (but hopefully not too involved). |
Okay... who is gonna quantize Mixtral-8x7B? And what VRAM/RAM requirements would that have? |
It has already been quantized by TheBloke.
I thought TheBloke still doesn't provide QuIP# quantizations?
You are right, I was thinking you wanted any kind of quant, not QuIP#.
@iChristGit As long as it still doesn't work on Windows, I don't see the incentive for it...
Oh wait, what? It doesn't work on Windows yet? That would explain a lot for me, because I haven't been able to get it running yet...
Yep, it's hard to figure out how to compile it on Windows, but on Linux it's easy, from what people say.
Yep... you can run WSL on Windows in the meantime, maybe.
As of the latest commits, QuIP# is marked as only available on Linux. Does this mean it's not possible to make it work on Windows at all? @oobabooga
I have tried to compile it for Windows using GitHub Actions and it fails with some vague errors. I think that there is something in the QuIP# code itself that prevents it from compiling on Windows.
Okay, I am on Windows WSL (Ubuntu) now, but I get this error when I try to install:
error: can't create or remove files in install directory
The following error occurred while trying to add or remove files in the
Is there any technical reason for it not working on Windows, or is it just "this is too new, and nobody really tried"? If someone bumped into a roadblock, it might be good to document it (some dependency not compiling?). For the few who managed to run it, is it really as good as the perplexity claims make it seem?
"I have tried to compile it for Windows using GitHub actions and it fails with some vague errors. I think that there is something in the quip# code itself that prevents it from compiling on Windows." A comment from ooba a couple weeks back, still same issue it wont compile on windows. |
Ok, I tried it for a bit now; the thing that hangs is the package
However,
That was as far as I got because I have no idea what to do next and Python befuddles me. |
@CamiloMM, I solved that problem by installing the current cuda-toolkit from the nvidia website (I'm on Linux Mint). @Nicoolodion2, it is a permission issue; I had to add
In any case, even though quip-sharp is right there, oobabooga still doesn't find it for me. I can't get past the error:
QuIP# is a novel quantization method. Its 2-bit performance is better than anything previously available.
Repository: https://github.com/Cornell-RelaxML/quip-sharp
Blog post: https://cornell-relaxml.github.io/quip-sharp/
Installation
The installation is currently manual, but later I will add it to the one-click installer.
You need to have a C++ compiler (like g++) and nvcc available in your environment for the command above.

4) Download my tokenizer (I'm using it as a placeholder for now, as the model above doesn't include a tokenizer):

Perplexity
On a small test that I have been running since the beginning of this year to compare different quantizations:
It's the same test as in the first table in this blog post, so the numbers are directly comparable.
This is the first time I have seen a quantized 70b model that fits in an RTX 3090 perform better than a q4_K_M 30b model, which is especially important nowadays since Meta never released a Llama-2 30b base model.
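For anyone wanting to run a comparable check themselves, here is a bare-bones sliding-window perplexity sketch with transformers. The model path, evaluation text, and window size are placeholders (and loading a QuIP# checkpoint requires quip-sharp's own loader), so the numbers it produces won't match the test above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # placeholder; QuIP# checkpoints need quip-sharp's loader
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

text = open("test.txt").read()  # placeholder evaluation text
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

window = 2048
nlls = []
for start in range(0, ids.size(1) - window, window):
    chunk = ids[:, start:start + window]
    with torch.no_grad():
        # HF causal LMs shift labels internally, so labels=chunk yields the
        # mean next-token negative log-likelihood over the chunk.
        nlls.append(model(chunk, labels=chunk).loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```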
Performance
I can get to 3042 context with 24GB VRAM. It generates at around 8 tokens/second when the context is small and 6 tokens/second when it is large.
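To reproduce rough tokens/second numbers like the ones above, one simple (if crude) approach is to time generate() directly. The prompt and generation length below are arbitrary placeholders, and `model`/`tokenizer` are assumed to be already loaded (e.g. as in the perplexity sketch above):

```python
import time

import torch

prompt = "Write a short story about a robot learning to paint."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.time()
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
torch.cuda.synchronize()

new_tokens = output.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens / (time.time() - start):.2f} tokens/second")
```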