
Quantization AWQ GEMM + GEMV #1727

Merged: 6 commits into OpenNMT:master on Jul 4, 2024

Conversation

minhthuc2502
Collaborator

Support 4-bit quantization with AWQ. There are two stable versions available: GEMM and GEMV.

Currently, I have only added AWQ to the Llama and Mistral converters. Other models could be added easily if they need AWQ quantization.
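For illustration, converting a checkpoint that was already quantized with AutoAWQ might look like the sketch below; the Hugging Face model name, the copy_files choice, and the output directory are placeholders, not part of this PR.

```python
# Hedged sketch: convert a pre-quantized AWQ checkpoint to the CTranslate2
# format with the Python converter API. Checkpoint name and output directory
# are placeholders.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder: an AWQ checkpoint
    copy_files=["tokenizer.json"],            # keep the tokenizer with the model
)
converter.convert("mistral-7b-awq-ct2", force=True)
```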

I did some benchmarks with it. With batch_size = 1, Mistral 7B:

Quant type   Speed (tok/s)   VRAM
int8         86.4            7722 MiB
awq gemm     73              4746 MiB
awq gemv     127             4746 MiB

@BBC-Esq

BBC-Esq commented Jun 19, 2024

> Support 4-bit quantization with AWQ. There are two stable versions available: GEMM and GEMV.
>
> Currently, I have only added AWQ to the Llama and Mistral converters. Other models could be added easily if they need AWQ quantization.
>
> I did some benchmarks with it. With batch_size = 1, Mistral 7B:
>
> Quant type   Speed (tok/s)   VRAM
> int8         86.4            7722 MiB
> awq gemm     73              4746 MiB
> awq gemv     127             4746 MiB

Wow, that is spooky... I just started benchmarking AWQ last night for the first time. Do you think that eventually you'd want to incorporate the "exllama" option as well? For more info see here:

https://github.com/huggingface/transformers/blob/547b5582ec85147492f2485dd8e9cbbeb1016fd8/src/transformers/utils/quantization_config.py#L47

Also, would you mind sharing the script you used to benchmark or perhaps just some snippets? I wouldn't mind downloading a development branch and trying my hand at it.

@minhthuc2502
Collaborator Author

Currently, I only support GEMM and GEMV, which are the most commonly used versions. It would be nice to support the others in the future.

I did the benchmarks in C++ only, so I think you would have to build this project in C++ first. BTW, the code I used to benchmark is quite dirty, but I will try to improve it and add it to the repo.

@BBC-Esq

BBC-Esq commented Jun 19, 2024

> Currently, I only support GEMM and GEMV, which are the most commonly used versions. It would be nice to support the others in the future.
>
> I did the benchmarks in C++ only, so I think you would have to build this project in C++ first. BTW, the code I used to benchmark is quite dirty, but I will try to improve it and add it to the repo.

Thanks. Believe it or not, I'm still learning to "build" anything (unsuccessfully as of yet...), but if you upload it I'll take a look.

@minhthuc2502 merged commit 39f48f2 into OpenNMT:master on Jul 4, 2024
17 checks passed
@BBC-Esq

BBC-Esq commented Sep 9, 2024

Can you share the code you used to benchmark?

@minhthuc2502
Collaborator Author

I used this. You can tweak it a bit to create the correct prompt for the model used.
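For readers who can't follow the link, a minimal Python sketch of this kind of tokens-per-second measurement might look like the following; this is not the author's script (theirs was C++), and the model directory, tokenizer name, and prompt are placeholders.

```python
# Hedged sketch of a simple tokens-per-second benchmark using the CTranslate2
# Python API. Paths and the prompt below are placeholders.
import time

import ctranslate2
from transformers import AutoTokenizer

model_dir = "mistral-7b-awq-ct2"  # placeholder: a converted model directory
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

generator = ctranslate2.Generator(model_dir, device="cuda")

# batch_size = 1, matching the numbers in the table above.
prompt = "Explain 4-bit quantization in one paragraph."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

start = time.perf_counter()
results = generator.generate_batch(
    [tokens],
    max_length=256,
    include_prompt_in_result=False,
)
elapsed = time.perf_counter() - start

new_tokens = len(results[0].sequences_ids[0])
print(f"{new_tokens / elapsed:.1f} tok/s")
```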
