Remove non-HF ExLlamaV2 loader #5431
Won't this cause problems for #5375?
In this case we can't use native sampling of exllamav2 though.
It's not true that there is zero speed difference. The non-HF loader is around 10% faster for goliath-120b.
Quick bench using ooba from before this commit and the exllamav2 master branch from 5 minutes ago, on a runpod A100 80GB.
HF:
Non-HF:
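For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of how a tokens-per-second figure and a "percent faster" figure could be computed. It is not the script used for the bench above. The names `tokens_per_second` and `percent_faster` are hypothetical, and `generate` stands in for whatever callable wraps the loader under test; it is assumed to take a token budget and return the number of tokens actually produced.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[int], int], max_new_tokens: int) -> float:
    """Time one generation call and return its throughput in tokens/s.

    `generate` is a hypothetical wrapper around a loader: it receives the
    token budget and returns how many tokens were actually produced.
    """
    start = time.perf_counter()
    produced = generate(max_new_tokens)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise ValueError("elapsed time too small to measure")
    return produced / elapsed


def percent_faster(candidate_tps: float, baseline_tps: float) -> float:
    """Return how much faster the candidate is than the baseline, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tps / baseline_tps - 1.0) * 100.0
```

With placeholder throughputs of 11 t/s and 10 t/s (illustrative values, not measurements from this thread), `percent_faster(11.0, 10.0)` comes out at roughly 10. For a fair comparison, both loaders should be run with the same prompt, token budget, and hardware, ideally over several runs.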
Why is this important, if the builtin sampling in exllamav2 works fine?
For some things I like the HF samplers, and for others the native ones. I forgot about the extra 1 t/s. It happens in llama.cpp too: a tiny difference due to overhead from HF. Not to mention being able to see the actual top speeds in llama.cpp. It also helps to troubleshoot issues with HF vs. the original loader. There are like a million reasons to keep it.
Thanks for restoring it!
This reverts commit cde000d.
Since PR #4814, the speed difference between ExLlamav2 and ExLlamav2_HF is zero. So I see no point in keeping the non-HF version, which is redundant and which samples in a way not guaranteed to be consistent with HF transformers sampling.