AQLM CUDA support #3287
Conversation
Hello, trying to run it, but there seems to be an issue with

```python
# mode 0o666 is required for the filelock to be shared across users
lock = filelock.FileLock(os.path.join(lock_dir, lock_file_name),
                         mode=0o666)
```

(Traceback omitted.)
Hey @remiconnesson, that filelock issue seems to be unrelated to this PR and is addressed on main. Please try the updated version of this branch or main.
csrc/quantization/aqlm/LICENSE
Outdated
I think there's an Apache 2 LICENSE at the root; should the attribution (GitHub link) go at the top of aqlm_cuda_entry.cpp and aqlm_cuda_kernel.cu?
Oh, I see it is already there. I think this can be removed.
That makes sense, thanks. Removing now.
Please see awq and others for namespacing into vllm.
Changed to
"csrc/quantization/aqlm/cuda_entry.cpp"
"csrc/quantization/aqlm/gemm_kernels.cu"
Same comment about namespacing.
ditto done :)
vllm/config.py
Outdated
Change this to use the registry from #4098?
Updated with main.
vllm/model_executor/layers/linear.py
Outdated
params_dtype, linear_method, [
    self.num_heads * tp_size * self.head_size,
    self.num_kv_heads * tp_size * self.head_size,
    self.num_kv_heads * tp_size * self.head_size
])
Please move this to a variable for readability.
done
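For readers following along, here is a small standalone sketch of the kind of refactor requested above: pull the Q/K/V output widths into a named list before passing them on. This is not the actual vLLM code; the function and variable names are illustrative.

```python
# Standalone sketch of the readability refactor; not vLLM code, and the
# names (qkv_output_sizes, output_sizes) are illustrative.

def qkv_output_sizes(num_heads: int, num_kv_heads: int, head_size: int,
                     tp_size: int) -> list:
    """Return the concatenated Q, K, V output widths for a QKV projection."""
    return [
        num_heads * tp_size * head_size,     # Q projection width
        num_kv_heads * tp_size * head_size,  # K projection width
        num_kv_heads * tp_size * head_size,  # V projection width
    ]


if __name__ == "__main__":
    output_sizes = qkv_output_sizes(num_heads=32, num_kv_heads=8,
                                    head_size=128, tp_size=1)
    print(output_sizes, "total:", sum(output_sizes))
```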
Sorry, I should have clarified the comment about the namespace. The exposed torch binding should not be in the C++ vllm namespace (see vllm/csrc/quantization/awq/gemm_kernels.cu, lines 389 to 394 in 705578a); all other helpers should be included in it (see vllm/csrc/quantization/awq/gemm_kernels.cu, lines 19 to 20 in 705578a).
@simon-mo thanks for the clarification, I was thinking about the wrong namespace! I got rid of the cpp file and wrapped everything except the two external functions in the vllm namespace.
Thanks for doing all this work @mgoin, much appreciated.
Co-authored-by: mgoin <[email protected]>
SUMMARY:
Supports AQLM compressed inference, see
https://github.com/Vahe1994/AQLM
https://arxiv.org/pdf/2401.06118.pdf
Optimized supported formats are 1x16 and 2x8. Tensor parallelism is supported. Only CUDA kernels are provided. Formats other than 1x16 and 2x8 will run but at lower performance.
Also adds underlying support for all quantization schemes that require a separate fixed-size codebook per layer.
The only trickiness was that `QKVParallelLinear` concatenates the Q, K, and V tensors, whose sizes and offsets are determined by the number of heads, KV heads, and tensor parallelism. The corresponding codebooks all need to be present and concatenated for `apply_weights`. To support this we add the `is_metadata` attribute which, if present, concatenates the Q, K, and V tensors along the zeroth dimension, using just the size of the loaded tensor. A toy sketch of this loading behavior follows below.

Here's a benchmark server graph comparing 2-bit 1x16 and 2x8 against FP16, plotting mean TPOT vs. queries per second. At low query rates, the 1x16 format is 1.36x faster and the 2x8 format is 2.12x faster than FP16. By 15 queries per second, the 1x16 is 1.56x slower and the 2x8 is 1.16x slower. So either format is a good choice if memory is limited, especially if you are serving low QPS, but 2x8 is best if you can afford the slightly lower accuracy.
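A toy sketch of the `is_metadata` idea described above (this is not the PR's actual implementation; the helper name and shapes are made up): parameters marked as metadata, such as AQLM codebooks, are filled by concatenating the loaded Q, K, and V pieces along dim 0, using each loaded tensor's own size rather than offsets derived from head counts.

```python
import torch


def load_metadata_shard(param: torch.Tensor, loaded: torch.Tensor,
                        offset: int) -> int:
    """Copy `loaded` into `param` starting at row `offset` along dim 0,
    and return the offset for the next shard."""
    rows = loaded.shape[0]
    param.narrow(0, offset, rows).copy_(loaded)
    return offset + rows


if __name__ == "__main__":
    # Pretend per-projection codebooks: Q has 4 rows, K and V have 2 each.
    q_cb, k_cb, v_cb = torch.randn(4, 8), torch.randn(2, 8), torch.randn(2, 8)
    full = torch.empty(8, 8)  # destination for the concatenated codebooks
    offset = 0
    for shard in (q_cb, k_cb, v_cb):
        offset = load_metadata_shard(full, shard, offset)
    assert torch.equal(full, torch.cat([q_cb, k_cb, v_cb], dim=0))
```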
Tested on several models:
ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf
ISTA-DASLab/Llama-2-7b-AQLM-2Bit-2x8-hf
ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf
ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf
BlackSamorez/TinyLlama-1_1B-Chat-v1_0-AQLM-2Bit-1x16-hf
Including with single or multiple GPUs and the associated tensor parallelism.
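For reference, a model from the list above can be loaded through the standard vLLM Python API. This is a hedged usage sketch: the prompt and sampling settings are arbitrary, and `quantization="aqlm"` is passed explicitly for clarity.

```python
# Usage sketch: load one of the AQLM checkpoints listed above with the
# standard vLLM entry point and generate a short completion.
from vllm import LLM, SamplingParams

llm = LLM(model="ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",
          quantization="aqlm")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)
```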