Support FP32 #141
Conversation
@zhuohan123 The PR is ready for review! I'd appreciate it if you could install and test this PR in your environment.
LGTM! In general, I feel we should comment these code paths out instead of deleting them, so that we can bring them back in the future.
@zhuohan123 It seems I made a mistake in measuring the compilation time. I found that this PR does NOT reduce the compilation time noticeably (i.e., the compilation still takes 6-8 minutes, and the end-to-end installation takes 8-10 minutes on my machine). Nevertheless, I'd like to merge this PR as it enables FP32.
Yes, I was about to discuss this with you and just found this reply. On my node, the compilation time is still around 7 minutes.
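For readers who want to try the FP32 support this PR enables, here is a minimal usage sketch. It assumes the current `vllm` Python API, which may postdate this PR, and the model name is only an example:

```python
from vllm import LLM, SamplingParams

# Request FP32 weights explicitly; the new default, "auto", instead
# follows the dtype recorded in the model's HuggingFace config.
llm = LLM(model="facebook/opt-125m", dtype="float32")

params = SamplingParams(temperature=0.8, max_tokens=16)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```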
Closes #72
This PR removes support for some head and block sizes to improve compilation speed. The removed sizes are not used by the models we currently support, and removing them also makes it possible to support FP32. On my machine, the compilation time dropped from 7.5 minutes to 1.5 minutes even with FP32 added. In addition, the PR changes the default `dtype` option to `auto`, to make its meaning clearer.
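To illustrate the `auto` semantics, here is a small, self-contained sketch of how such an option can be resolved. It is illustrative only and does not reproduce vLLM's actual implementation; the helper name and dtype table are made up for the example:

```python
import torch

# Hypothetical dtype table for this example; vLLM's real mapping may differ.
_STR_TO_DTYPE = {
    "float32": torch.float32,
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
}

def resolve_dtype(dtype_arg: str, config_dtype: torch.dtype) -> torch.dtype:
    """Resolve a user-facing dtype string to a torch.dtype.

    With "auto", the dtype stored in the model's config is used, which
    makes the default's behavior explicit instead of silently picking
    one dtype for every model.
    """
    if dtype_arg == "auto":
        return config_dtype
    try:
        return _STR_TO_DTYPE[dtype_arg]
    except KeyError:
        raise ValueError(f"Unsupported dtype: {dtype_arg!r}")

# Example: a model whose config says float32 now actually runs in FP32.
assert resolve_dtype("auto", torch.float32) is torch.float32
assert resolve_dtype("float16", torch.float32) is torch.float16
```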