AQLM CUDA support #3287

Merged on Apr 23, 2024 (116 commits)
Changes from 1 commit
Commits
079cba5
actual add kernel
jaemzfleming Feb 26, 2024
23c3f77
getting serious
jaemzfleming Feb 26, 2024
20a71fd
adding in mat mat, need to move the pytorch stuff, maybe add some aql…
jaemzfleming Feb 26, 2024
d0cf25a
load the codebooks, codes, and scales.
jaemzfleming Feb 27, 2024
40463e3
try to bind cpp aqlm entry point to python
jaemzfleming Feb 27, 2024
0e03c23
add aqlm
jaemzfleming Feb 27, 2024
26f8d83
fix print statements
jaemzfleming Feb 27, 2024
dad66ce
add comment
jaemzfleming Feb 28, 2024
77a8913
remove unused enum
jaemzfleming Feb 28, 2024
2bb6871
add a bunch of prints, add bias
jaemzfleming Feb 28, 2024
5f0c319
minor fix for scales
jaemzfleming Feb 28, 2024
024b54c
change
jaemzfleming Feb 28, 2024
84c2e2a
format
jaemzfleming Feb 29, 2024
8ea4d9d
try reversing some formatting changes
jaemzfleming Feb 29, 2024
b993971
restored
jaemzfleming Feb 29, 2024
1766886
add aqlm_cuda
jaemzfleming Feb 29, 2024
b673f47
restore formatting
jaemzfleming Feb 29, 2024
4e7d398
restore format
jaemzfleming Feb 29, 2024
4fc1426
more formatting
jaemzfleming Feb 29, 2024
ac2ef81
format
jaemzfleming Feb 29, 2024
30d2d42
restore formatting
jaemzfleming Feb 29, 2024
3fcb944
restore formatting
jaemzfleming Feb 29, 2024
4e7291a
formta
jaemzfleming Feb 29, 2024
39abbc0
first working aqlm
jaemzfleming Feb 29, 2024
8d7fa96
some improvements
jaemzfleming Feb 29, 2024
9a3dbe1
restore format
jaemzfleming Feb 29, 2024
e7c2601
make a central c++ aqlm entry point
jaemzfleming Feb 29, 2024
6eba035
add support for 2x8, worked shockingly easily
jaemzfleming Feb 29, 2024
604f66f
support more than one model
jaemzfleming Mar 1, 2024
ce63937
formatting
jaemzfleming Mar 1, 2024
6cbdff7
remove secondary aqlm loading
jaemzfleming Mar 1, 2024
a58d369
restore trailing space
jaemzfleming Mar 1, 2024
31f0ddc
remove some code
jaemzfleming Mar 1, 2024
edc80c6
remove some comments
jaemzfleming Mar 1, 2024
3253dc7
add some attributions
jaemzfleming Mar 1, 2024
fefe1c8
support 2 tp
jaemzfleming Mar 1, 2024
4b12ed6
better tp support
jaemzfleming Mar 1, 2024
e5c2010
format
jaemzfleming Mar 1, 2024
eef729f
comments
jaemzfleming Mar 1, 2024
d31241b
comments
jaemzfleming Mar 1, 2024
ba3c125
rename aqlm_test
jaemzfleming Mar 1, 2024
703fa79
better comments
jaemzfleming Mar 1, 2024
6e47ff6
better comment
jaemzfleming Mar 1, 2024
556178f
first attempt
jaemzfleming Mar 4, 2024
e23f1cd
got it working
jaemzfleming Mar 5, 2024
6253807
remove prints
jaemzfleming Mar 5, 2024
05ccd50
add arguments and options
jaemzfleming Mar 5, 2024
7b67492
rename shard_dim to just bool is_metadata
jaemzfleming Mar 5, 2024
0af6eb2
Merge branch 'jf/aqlm' into jf/aqlm-nosplit
jaemzfleming Mar 5, 2024
3aafb3c
use TORCH_CHECK
jaemzfleming Mar 5, 2024
ef608a6
cleanup aqlm_example
jaemzfleming Mar 5, 2024
3bf6e7e
Merge branch 'jf/aqlm' into jf/aqlm-nosplit
jaemzfleming Mar 5, 2024
5bacc9d
format
jaemzfleming Mar 5, 2024
2def434
some stuff
jaemzfleming Mar 5, 2024
821ee99
change 60 to 70 for min cap
jaemzfleming Mar 5, 2024
35eb873
Merge branch 'jf/aqlm' into jf/aqlm-nosplit
jaemzfleming Mar 5, 2024
d0816bf
format
jaemzfleming Mar 5, 2024
6372c64
make aqlm not rocm supported
jaemzfleming Mar 5, 2024
9f4d75f
Merge branch 'jf/aqlm-nosplit' into jf/aqlm
jaemzfleming Mar 5, 2024
83c2070
Add LICENSE file
jaemzfleming Mar 5, 2024
267b339
add reference
jaemzfleming Mar 5, 2024
0408789
add better license headers
jaemzfleming Mar 5, 2024
48838b8
add support for 2x8 optimization
jaemzfleming Mar 7, 2024
4822629
format
jaemzfleming Mar 7, 2024
c255f44
add better example models, and replace output_partition_size with sizes
jaemzfleming Mar 7, 2024
7acedee
Merge branch 'upstream-main' into jf/aqlm
jaemzfleming Mar 7, 2024
15d7206
format
jaemzfleming Mar 7, 2024
8df10d9
Add test_aqlm.py
mgoin Mar 7, 2024
a3039dd
remove comments
jaemzfleming Mar 7, 2024
84611e7
Merge branch 'jf/aqlm' of https://github.com/neuralmagic/nm-vllm into…
jaemzfleming Mar 7, 2024
2ecce81
put aqlm inside rocm block
jaemzfleming Mar 8, 2024
5864a00
add model to example
jaemzfleming Mar 8, 2024
58dbb01
remove comment
jaemzfleming Mar 8, 2024
7dc5f83
format
jaemzfleming Mar 8, 2024
8069375
fix test
jaemzfleming Mar 8, 2024
9891e22
Add dequantization kernel
jaemzfleming Mar 12, 2024
a51192f
Update csrc/quantization/aqlm/aqlm_cuda_entry.cpp
mgoin Mar 12, 2024
992d584
Update csrc/quantization/aqlm/aqlm_cuda_entry.cpp
mgoin Mar 12, 2024
9143b45
set gpu_memory_utilization
jaemzfleming Mar 12, 2024
5d24991
add benchmark and refactor a bit.
jaemzfleming Mar 14, 2024
5985acb
Merge branch 'upstream-main' into jf/aqlm
jaemzfleming Mar 15, 2024
c319d2a
Merge branch 'upstream-main' into jf/aqlm
jaemzfleming Mar 21, 2024
d9152e2
add aqlm
jaemzfleming Mar 21, 2024
0574dff
Add dequant methods
jaemzfleming Mar 21, 2024
39ca4a0
fix format
jaemzfleming Mar 21, 2024
522f990
formatA
jaemzfleming Mar 21, 2024
d2ac6b2
some format fixes
jaemzfleming Mar 21, 2024
bb66e3c
formatting
jaemzfleming Mar 21, 2024
11c7950
format
jaemzfleming Mar 21, 2024
fb78b95
remove dead space
jaemzfleming Mar 21, 2024
d73a92b
niceties for aqlm benchmark
jaemzfleming Mar 21, 2024
4406555
update the test file
jaemzfleming Mar 22, 2024
3622342
remove gpu_memory_utilization reduction
jaemzfleming Mar 22, 2024
3cf2a1b
Merge branch 'upstream-main' into jf/aqlm
jaemzfleming Mar 26, 2024
e2b3529
port over better dequant kernels from aqlm
jaemzfleming Mar 26, 2024
3d65a48
better threshold for aqlm
jaemzfleming Mar 26, 2024
421249c
Merge branch 'upstream-main' into jf/aqlm
jaemzfleming Mar 26, 2024
d033c85
format
jaemzfleming Mar 26, 2024
f950178
Merge remote-tracking branch 'upstream/main' into jf/aqlm
mgoin Apr 8, 2024
7c604fe
Merge remote-tracking branch 'upstream/main' into jf/aqlm
mgoin Apr 8, 2024
92206de
Update test point
mgoin Apr 9, 2024
811e2cc
Poke test again
mgoin Apr 9, 2024
a97353b
Merge remote-tracking branch 'upstream/main' into jf/aqlm
mgoin Apr 9, 2024
6ca51d4
Merge remote-tracking branch 'upstream/main' into jf/aqlm
mgoin Apr 10, 2024
22f7fae
Merge remote-tracking branch 'upstream/main' into jf/aqlm
mgoin Apr 11, 2024
2282157
Merge remote-tracking branch 'upstream/main' into jf/aqlm
mgoin Apr 15, 2024
d0e8d0c
Resolve create_weights updates
mgoin Apr 15, 2024
6bb89c0
Better test debug output (manually tested TP)
mgoin Apr 16, 2024
09d4a24
Merge branch 'vllm-project:main' into jf/aqlm
mgoin Apr 17, 2024
4d46f18
Delete csrc/quantization/aqlm/LICENSE
mgoin Apr 18, 2024
a29008d
Address comments
mgoin Apr 18, 2024
dacdb52
Merge remote-tracking branch 'upstream/main' into jf/aqlm
mgoin Apr 18, 2024
3852115
Update test
mgoin Apr 18, 2024
d367895
Cleanup namespaces
mgoin Apr 18, 2024
d34f23d
Merge remote-tracking branch 'upstream/main' into jf/aqlm
mgoin Apr 22, 2024
7283c23
Merge remote-tracking branch 'upstream/main' into jf/aqlm
mgoin Apr 22, 2024
remove some code
jaemzfleming committed Mar 1, 2024
commit 31f0ddc3899abda11a62a833bc99b2d1869d99b4
7 changes: 2 additions & 5 deletions vllm/config.py
Collaborator: change this to use the registry (#4098)?

Member: updated with main

@@ -115,7 +115,7 @@ def __init__(
                                                      max_model_len)
         self._verify_load_format()
         self._verify_tokenizer_mode()
-        self.hf_quant_config = self._get_and_verify_quantization()
+        self._verify_quantization()
         self._verify_cuda_graph()

     def _verify_load_format(self) -> None:
@@ -154,14 +154,13 @@ def _verify_tokenizer_mode(self) -> None:
                 "either 'auto' or 'slow'.")
         self.tokenizer_mode = tokenizer_mode

-    def _get_and_verify_quantization(self) -> Any | None:
+    def _verify_quantization(self) -> None:
         supported_quantization = ["aqlm", "awq", "gptq", "squeezellm"]
         rocm_not_supported_quantization = ["awq"]
         if self.quantization is not None:
             self.quantization = self.quantization.lower()

         # Parse quantization method from the HF model config, if available.
-        hf_quant_method = None
         hf_quant_config = getattr(self.hf_config, "quantization_config", None)
         if hf_quant_config is not None:
             hf_quant_method = str(hf_quant_config["quant_method"]).lower()
@@ -188,8 +187,6 @@ def _get_and_verify_quantization(self) -> Any | None:
                            "optimized yet. The speed can be slower than "
                            "non-quantized models.")

-        return hf_quant_config
-
     def _verify_cuda_graph(self) -> None:
         if self.max_context_len_to_capture is None:
             self.max_context_len_to_capture = self.max_model_len
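
For readers following the diff, here is a minimal standalone sketch of the check that `_verify_quantization` appears to perform. It is not the code from this PR: the free function `verify_quantization`, its `is_rocm` parameter, and the exact error messages are hypothetical, and the mismatch and support checks are assumptions about the part of the method that the diff elides between the second and third hunks. Only the two lists and the `quant_method` lookup are taken from the lines shown above.

```python
from typing import Any, Mapping, Optional

# Lists copied from the diff above.
SUPPORTED_QUANTIZATION = ["aqlm", "awq", "gptq", "squeezellm"]
ROCM_NOT_SUPPORTED_QUANTIZATION = ["awq"]


def verify_quantization(
    quantization: Optional[str],
    hf_quant_config: Optional[Mapping[str, Any]],
    is_rocm: bool = False,  # hypothetical flag standing in for a platform check
) -> Optional[str]:
    """Return the normalized quantization method, or raise ValueError."""
    if quantization is not None:
        quantization = quantization.lower()

    # Parse the quantization method from the HF-style config, if available.
    if hf_quant_config is not None:
        hf_quant_method = str(hf_quant_config["quant_method"]).lower()
        if quantization is None:
            # Assumed behavior: fall back to the method named in the config.
            quantization = hf_quant_method
        elif quantization != hf_quant_method:
            # Assumed behavior: reject a conflict between argument and config.
            raise ValueError(
                f"Config says {hf_quant_method!r} but {quantization!r} was requested.")

    if quantization is not None:
        if quantization not in SUPPORTED_QUANTIZATION:
            raise ValueError(f"Unknown quantization method: {quantization!r}.")
        if is_rocm and quantization in ROCM_NOT_SUPPORTED_QUANTIZATION:
            raise ValueError(f"{quantization!r} is not supported on ROCm.")
    return quantization
```

Because this commit drops the return value, the method in the diff only validates and normalizes `self.quantization` in place; the sketch returns the normalized string purely so it can be exercised without a config object.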