Remove exllamav1 loaders #5128
Conversation
How does it compare when you don't have an Ampere card, though? Also when not using flash attention. I'm not using it often either, but I'm also not really using QuIP/AWQ or HQQ at all, if we're going by that. I think exllama 1 was also compatible with the old flash attention that ran on cards below Ampere, but I only have Pascal and Ampere so I can't really confirm. During the holidays, nobody who used it is likely to notice and complain. |
Tried to migrate from exllamav1 to exllamav2 with my AMD Instinct cards and got garbage output. |
My GPU performs much better on exllamav1 with 13B models. Disabling the 8-bit cache won't fix the performance issue. Performance is about 10x slower on exllamav2 compared to v1. Please bring back exllamav1 and exllamav1_hf. |
I tried exllamav2, but it doesn't feel perfect. It really is fast, but with the same parameters the replies are much shorter. |
Please bring back exllamav1 and exllamav1_hf! They let you load the 10.7B models completely, while exllamav2 gives an out-of-memory error on an 8 GB GPU. |
exllama2 sucks in some cases, and to get the previous one back you have to downgrade the version. Thanks for the new spokes in the wheels 🥰 |
@oobabooga pls revert |
If you have a performance problem with exllamav2 that was not present in exllamav1, you should open an issue in the exllamav2 repository. |
In my case it's not about performance: exllama and exllama2 produce different results. We run 1 million rows daily, so this is critical/noticeable for us. |
Damn, you can't roll back to the previous version :? It just won't start :/ Try rolling back to that commit and starting it from scratch yourself, and you'll understand why there is such a revert request. |
To be fair, it reverted fine for me. Need to check how well it works. |
ExLlamav1 hasn't received a commit in 3 months and does not support Mixtral.
The downsides of ExLlamav2 relative to v1 are slightly higher VRAM usage and slightly higher perplexity for the same GPTQ model:
The perplexity difference is not significant, and the VRAM usage can be reduced with `--cache_8bit`. So I see no point in keeping ExLlamav1.
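For context, the `--cache_8bit` flag selects ExLlamaV2's 8-bit KV cache instead of the default FP16 cache. Below is a minimal, hedged sketch of using that cache directly through the exllamav2 Python API (assuming its generator interface as of late 2023); the model path and sampling values are placeholders, not taken from this thread:

```python
# Sketch only: load a GPTQ/EXL2 model with ExLlamaV2 and an 8-bit KV cache
# to reduce VRAM usage. Paths and sampling settings are placeholders.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/quantized-model"  # placeholder model directory
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit cache, roughly what --cache_8bit enables
model.load_autosplit(cache)                    # split layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7  # example value

print(generator.generate_simple("Hello, my name is", settings, 50))
```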