AQLM CUDA support #3287
Conversation
Hello, trying to run it, but there seems to be an issue with

```python
# mode 0o666 is required for the filelock to be shared across users
lock = filelock.FileLock(os.path.join(lock_dir, lock_file_name),
                         mode=0o666)
```

(Traceback omitted.)
Hey @remiconnesson, that filelock issue seems to be unrelated to this PR and is addressed on main. Please try the updated version of this branch or main.
csrc/quantization/aqlm/LICENSE
Outdated
I think there's an Apache 2 LICENSE at the root; should the attribution (GitHub link) go at the top of aqlm_cuda_entry.cpp and aqlm_cuda_kernel.cu?
Oh, I see it is already there. I think this can be removed.
That makes sense, thanks. Removing now.
Please see awq and others for namespacing into vllm.
Changed to
"csrc/quantization/aqlm/cuda_entry.cpp"
"csrc/quantization/aqlm/gemm_kernels.cu"
Same comment about namespacing.
ditto done :)
vllm/config.py
Outdated
Change this to use the registry from #4098?
Updated with main.
vllm/model_executor/layers/linear.py
Outdated
params_dtype, linear_method, [
    self.num_heads * tp_size * self.head_size,
    self.num_kv_heads * tp_size * self.head_size,
    self.num_kv_heads * tp_size * self.head_size
])
Please move this to a variable for readability.
done
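For readers following along, here is a small standalone sketch of the kind of refactor requested above: pull the Q/K/V output widths into a named list before passing them on. This is not the actual vLLM code; the function and variable names are illustrative.

```python
# Standalone sketch of the readability refactor; not vLLM code, and the
# names (qkv_output_sizes, output_sizes) are illustrative.

def qkv_output_sizes(num_heads: int, num_kv_heads: int, head_size: int,
                     tp_size: int) -> list:
    """Return the concatenated Q, K, V output widths for a QKV projection."""
    return [
        num_heads * tp_size * head_size,     # Q projection width
        num_kv_heads * tp_size * head_size,  # K projection width
        num_kv_heads * tp_size * head_size,  # V projection width
    ]


if __name__ == "__main__":
    output_sizes = qkv_output_sizes(num_heads=32, num_kv_heads=8,
                                    head_size=128, tp_size=1)
    print(output_sizes, "total:", sum(output_sizes))
```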
Sorry, I should have clarified the comment about the namespace. The exposed torch binding should not be in the C++ vllm namespace (see vllm/csrc/quantization/awq/gemm_kernels.cu, lines 389 to 394 in 705578a); all other helpers should be included in it (see vllm/csrc/quantization/awq/gemm_kernels.cu, lines 19 to 20 in 705578a).
@simon-mo thanks for the clarification, I was thinking about the wrong namespace! I got rid of the cpp file and wrapped everything except the two external functions in the vllm namespace.
Thanks for doing all this work @mgoin, much appreciated.
Co-authored-by: mgoin <[email protected]>
SUMMARY:
Supports AQLM compressed inference, see
https://github.com/Vahe1994/AQLM
https://arxiv.org/pdf/2401.06118.pdf
Optimized supported formats are 1x16 and 2x8. Tensor parallelism is supported. Only CUDA kernels are provided. Formats other than 1x16 and 2x8 will run but at lower performance.
Also adds underlying support for all quantization schemes that require a separate fixed-size codebook per layer.
The only trickiness was that `QKVParallelLinear` concatenates the Q, K, and V tensors, whose sizes and offsets are determined by the number of heads, KV heads, and tensor parallelism. The corresponding codebooks all need to be present and concatenated for `apply_weights`. To support this we add the `is_metadata` attribute which, if present, concatenates the Q, K, and V tensors along the zeroth dimension, using just the size of the loaded tensor. A toy sketch of this loading behavior follows below.

Here's a benchmark server graph comparing 2-bit 1x16 and 2x8 against FP16, plotting mean TPOT vs. queries per second. At low query rates, the 1x16 format is 1.36x faster and the 2x8 format is 2.12x faster than FP16. By 15 queries per second, the 1x16 is 1.56x slower and the 2x8 is 1.16x slower. So either format is a good choice if memory is limited, especially if you are serving low QPS, but 2x8 is best if you can afford the slightly lower accuracy.
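A toy sketch of the `is_metadata` idea described above (this is not the PR's actual implementation; the helper name and shapes are made up): parameters marked as metadata, such as AQLM codebooks, are filled by concatenating the loaded Q, K, and V pieces along dim 0, using each loaded tensor's own size rather than offsets derived from head counts.

```python
import torch


def load_metadata_shard(param: torch.Tensor, loaded: torch.Tensor,
                        offset: int) -> int:
    """Copy `loaded` into `param` starting at row `offset` along dim 0,
    and return the offset for the next shard."""
    rows = loaded.shape[0]
    param.narrow(0, offset, rows).copy_(loaded)
    return offset + rows


if __name__ == "__main__":
    # Pretend per-projection codebooks: Q has 4 rows, K and V have 2 each.
    q_cb, k_cb, v_cb = torch.randn(4, 8), torch.randn(2, 8), torch.randn(2, 8)
    full = torch.empty(8, 8)  # destination for the concatenated codebooks
    offset = 0
    for shard in (q_cb, k_cb, v_cb):
        offset = load_metadata_shard(full, shard, offset)
    assert torch.equal(full, torch.cat([q_cb, k_cb, v_cb], dim=0))
```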
Tested on several models:
ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf
ISTA-DASLab/Llama-2-7b-AQLM-2Bit-2x8-hf
ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf
ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf
BlackSamorez/TinyLlama-1_1B-Chat-v1_0-AQLM-2Bit-1x16-hf
Including with single or multiple GPUs and the associated tensor parallelism.
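For reference, a model from the list above can be loaded through the standard vLLM Python API. This is a hedged usage sketch: the prompt and sampling settings are arbitrary, and `quantization="aqlm"` is passed explicitly for clarity.

```python
# Usage sketch: load one of the AQLM checkpoints listed above with the
# standard vLLM entry point and generate a short completion.
from vllm import LLM, SamplingParams

llm = LLM(model="ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",
          quantization="aqlm")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)
```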