
Remove exllamav1 loaders #5128

Merged 6 commits into dev from remove-exllamav1 on Dec 31, 2023
Conversation

@oobabooga (Owner) commented Dec 31, 2023

ExLlamav1 hasn't received a commit in 3 months and does not support Mixtral.

The downsides of ExLlamav2 relative to v1 are slightly higher VRAM usage and slightly higher perplexity for the same GPTQ model:

| | v1 | v2 | v2 (8-bit cache) |
| --- | --- | --- | --- |
| VRAM | 11295 MiB | 11653 MiB | 10133 MiB |
| 3200 tokens (prompt processing, seconds) | 1.7 | 1.5 | 1.51 |
| 512 tokens (generation, seconds) | 13.25 | 10.04 | 10.83 |
| Perplexity | 5.57350826 | 5.57457876 | 5.57457876 |

The perplexity difference is not significant and the VRAM usage can be reduced with --cache_8bit. So I see no point in keeping ExLlamav1.
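For reference, the 8-bit cache is a single launch flag. A minimal sketch, assuming a GPTQ model directory named `llama-13b-gptq` (a placeholder, not a real download) and the webui's flag names at the time:

```bash
# Sketch: launch with the ExLlamav2_HF loader and the 8-bit KV cache
# to recover most of the VRAM difference shown in the table above.
python server.py \
    --model llama-13b-gptq \
    --loader ExLlamav2_HF \
    --cache_8bit
```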

@oobabooga oobabooga merged commit 0e54a09 into dev Dec 31, 2023
@oobabooga oobabooga deleted the remove-exllamav1 branch December 31, 2023 05:08
@Ph0rk0z (Contributor) commented Jan 2, 2024

How does it compare without an Ampere card, though? Or without flash attention? I'm not using it often either, but by that standard I'm not really using QuIP, AWQ, or HQQ at all.

I think exllama 1 was also compatible with the old flash attention that ran on cards below Ampere, but I only have Pascal and Ampere, so I can't really confirm. With the holidays, nobody who used it is likely to notice and complain.

@ZanMax commented Jan 7, 2024

I tried to migrate from exllamav1 to exllamav2 with my AMD Instinct cards and got garbage output.
exllamav1 works perfectly instead.

@kelvincht commented

My GPU performs much better with exllamav1 on 13B models.

Disabling the 8-bit cache doesn't fix the performance issue.

Performance is about 10x slower on exllamav2 compared to v1.

Please bring back exllamav1 and exllamav1_hf.

@jianmomo commented

I tried exllamav2, but it doesn't feel perfect: the speed is indeed much faster, but with the same parameters the replies are much shorter.

@DmitryVN commented Jan 18, 2024

Please bring back exllamav1 and exllamav1_hf! They let you load 10.7B models completely, while exllamav2 runs out of memory on an 8 GB GPU.

PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Feb 22, 2024
@koplenov commented

exllama2 sucks in some cases,

and to get the previous one back, you have to downgrade to an older version.

thanks for putting a spoke in the wheels 🥰

@koplenov commented

@oobabooga pls revert

@oobabooga (Owner, Author) commented

If you have a performance problem with exllamav2 that was not present in exllamav1, you should open an issue in the exllamav2 repository.

@koplenov commented

in my case it's not about performance

exllama and exllama2 produce different outputs

// we run 1M rows daily, so this is critical/noticeable for us

@koplenov commented

damn, you can't roll back to the previous version :?

it just won't start :/
dependency hell, versions not specified, compilation errors

try to roll back to that commit and start it from scratch yourself - you'll understand why there is such a return request
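For anyone attempting this, a rough sketch of the rollback being described (0e54a09 is the merge commit named above; the unpinned dependency versions are the part this sketch can't fix):

```bash
# Sketch: check out the tree as it was just before the removal landed on dev.
# "0e54a09~1" is the first parent of the merge commit, i.e. the pre-merge state.
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
git checkout 0e54a09~1

# Fresh environment, so newer installed packages don't mask the old requirements.
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt   # this is where the unpinned versions bite
```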

@Ph0rk0z (Contributor) commented Feb 28, 2024

To be fair, it reverted fine for me. Need to check how well it works.
