Remove non-HF ExLlamaV2 loader #5431

Merged
merged 2 commits into from
Feb 4, 2024


oobabooga
Owner

Since PR #4814, the speed difference between ExLlamav2 and ExLlamav2_HF is zero. So I see no point in keeping the non-HF version, which is redundant and which samples in a way not guaranteed to be consistent with HF transformers sampling.

@oobabooga oobabooga merged commit cde000d into dev Feb 4, 2024
@oobabooga oobabooga deleted the remove-exllamav2 branch February 4, 2024 04:16
@sgsdxzy
Contributor

sgsdxzy commented Feb 4, 2024

Won't this cause problems for #5375 ?

@Ph0rk0z
Contributor

Ph0rk0z commented Feb 4, 2024

In this case we can't use native sampling of exllamav2 though.

@aikitoria

It's not true that there is zero speed difference. Non-HF loader is around 10% faster for goliath-120b.

@aikitoria

aikitoria commented Feb 6, 2024

Quick bench using ooba from before this commit and exllamav2 master branch from 5 minutes ago on runpod A100 80GB.
Using the new version here as that reverts the performance degradation that happened in 0.0.12.

HF:

Output generated in 8.76 seconds (14.50 tokens/s, 127 tokens, context 1728, seed 735928511)
Output generated in 9.22 seconds (13.77 tokens/s, 127 tokens, context 1728, seed 83286885)
Output generated in 8.99 seconds (14.13 tokens/s, 127 tokens, context 1728, seed 128023280)
Output generated in 8.78 seconds (14.47 tokens/s, 127 tokens, context 1728, seed 1418661767)

Non-HF:

Output generated in 8.17 seconds (15.67 tokens/s, 128 tokens, context 1728, seed 745431605)
Output generated in 8.18 seconds (15.65 tokens/s, 128 tokens, context 1728, seed 762707583)
Output generated in 8.18 seconds (15.64 tokens/s, 128 tokens, context 1728, seed 996129951)
Output generated in 8.18 seconds (15.64 tokens/s, 128 tokens, context 1728, seed 700382800)
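As a rough check of the "around 10% faster" figure, the tokens/s values in the two logs above can be averaged and compared. This is only a sketch: the helper name `tokens_per_second` and the regex are illustrative and not part of text-generation-webui, and four runs per loader on a single machine is a small sample.

```python
import re
from statistics import mean

# Log lines copied verbatim from the benchmark above.
HF_LOG = """\
Output generated in 8.76 seconds (14.50 tokens/s, 127 tokens, context 1728, seed 735928511)
Output generated in 9.22 seconds (13.77 tokens/s, 127 tokens, context 1728, seed 83286885)
Output generated in 8.99 seconds (14.13 tokens/s, 127 tokens, context 1728, seed 128023280)
Output generated in 8.78 seconds (14.47 tokens/s, 127 tokens, context 1728, seed 1418661767)
"""

NON_HF_LOG = """\
Output generated in 8.17 seconds (15.67 tokens/s, 128 tokens, context 1728, seed 745431605)
Output generated in 8.18 seconds (15.65 tokens/s, 128 tokens, context 1728, seed 762707583)
Output generated in 8.18 seconds (15.64 tokens/s, 128 tokens, context 1728, seed 996129951)
Output generated in 8.18 seconds (15.64 tokens/s, 128 tokens, context 1728, seed 700382800)
"""


def tokens_per_second(log: str) -> list[float]:
    """Extract the tokens/s figure from each 'Output generated' line."""
    return [float(value) for value in re.findall(r"\(([\d.]+) tokens/s", log)]


hf_mean = mean(tokens_per_second(HF_LOG))
non_hf_mean = mean(tokens_per_second(NON_HF_LOG))

# Relative speedup of the non-HF loader over the HF loader, in percent.
speedup_pct = (non_hf_mean / hf_mean - 1) * 100

print(f"HF mean:     {hf_mean:.2f} tokens/s")
print(f"Non-HF mean: {non_hf_mean:.2f} tokens/s")
print(f"Non-HF is {speedup_pct:.1f}% faster")
```

On these eight runs the means come out to roughly 14.22 and 15.65 tokens/s, which is consistent with the 10% figure quoted earlier in the thread.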

@aikitoria

> not guaranteed to be consistent with HF transformers sampling

Why is this important, if the builtin sampling in exllamav2 works fine?

@Ph0rk0z
Contributor

Ph0rk0z commented Feb 6, 2024

For some stuff I like HF samplers and for some stuff the native ones. I forgot about the extra 1 t/s. It happens in llama.cpp too, a tiny difference due to overhead from HF. Not to mention seeing the actual top speeds in .cpp. It also helps to troubleshoot issues with HF vs the original loader. There are like a million reasons to keep it.

oobabooga added a commit that referenced this pull request Feb 6, 2024
@aikitoria

Thanks for restoring it!

PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Feb 22, 2024
PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Feb 22, 2024