Merging main #4

mht-sharma · 2024-08-21T09:06:09Z

No description provided.

* fix navi build * Created dummy kernels of unsupported on Navi to avoid function not found crashes at runtime * replacing ifdefs on host code with those on kernels * refactoring code to avoid unsupported call on Navi * syntactic change * import statements fix * moving env variables to envs.py * style fixes * cosmetic changes for isort * removed extra include * moving use_skinny to be member --------- Co-authored-by: lcskrishna <[email protected]> Co-authored-by: maleksan85 <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]>

* tightened atol for custom PA; enable supported head size, block sizes in testing * update num_blocks and num_iters in benchmark PA to realistic settings * move to generic b16 type * bf16 first port * enabled all bf16 tests, set atol for bf16 * enable custom PA for bf16 as well as block size 32 and head size 64 * fix cast to zero in custom PA reduce * py linter fixes * clang format fixes * div round up clang-format --------- Co-authored-by: Charlie Fu <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]>

@iotamudelta

* Per @iotamudelta suggestion until the deadlocks issue is better understood Revert "Make CAR ROCm 6.1 compatible. (#137)" This reverts commit 4d2dda6. * Per @iotamudelta suggestion until the deadlocks issue is better understood Revert "Optimize custom all reduce (#130)" This reverts commit 636ff01.

* Adding UNREACHABLE_CODE macro * clang format fixes * clang formatting fix * minor updates in syntax * clang format update * clang format fix one more try * clang format one more try * clang format fix one more try --------- Co-authored-by: Aleksandr Malyshev <[email protected]>

Co-authored-by: maleksan85 <[email protected]>

* improvements to wvSpltK * wvsplt gemm; better handle MI300 and large A[] sizes * lint fix * Adjustments to better handle small weights in TP8. * early-out bug fix * better wave load balancing in wvSplt * add missing skip for wvsplt_big * Bug fix for wvSplt_big in load balancing at M4, lint fix.

gshtras and others added 22 commits August 2, 2024 14:26

Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters (#114)

3e480e9

* Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters * Adding HTTP headers

Add distributed executor backend to benchmark scripts (#118)

42b1b9a

Add weight padding for moe (#119)

5fac73f

* add weight padding for moe * enable padding by default * fix linter * fix linter * fix linter * using envs.py * fix linter

add empty_cache() after each padding (#120)

98f31cd

[FIX] Gradlib OOM on Navi and sometimes on MI (#124)

30f12f0

* add memory clean up after every shape and parameter to reduce cache invalidation buffers * small typo * syntax change --------- Co-authored-by: maleksan85 <[email protected]>

save shape when fp8 solution not found (#123)

8608888

Co-authored-by: Gregory Shtrasberg <[email protected]>

Fix unit test for moe by adding padding (#128)

f49dff3

* fix test_moe * fix linter

Llama3.1 (#129)

dd1a208

* Add support for a rope extension method (vllm-project#6553) * [BugFix] Fix RoPE error in Llama 3.1 (vllm-project#6693) --------- Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]>

chat/completions endpoint (#121)

674da1d

* Initial implementation of chat/completions endpoint and its streaming variant * Reusing datatypes from the openai entrypoints * Response role from arg * Added models endpoint and model validation from the request

Optimize custom all reduce (#130)

636ff01

* First version * Revert error. While there, add missing finalize. * Use the correct defaults for ROCm. Increase sampling area to capture crossover. * Scope end_sync as well. * Guard only volatile keyword for ifndef USE_ROCM * Document crossover

Making check for output match in original types. It saves some memory. (#135)

4132cbe

Co-authored-by: maleksan85 <[email protected]>

Make CAR ROCm 6.1 compatible. (#137)

4d2dda6

* remove scoping * while there fix a typo * while there remove unused variable

Using the correct datatypes for streaming non-chat completions (#134)

5945822

gfx90a typo fix (#142)

7382dd5

Co-authored-by: maleksan85 <[email protected]>

[Bugfix] Dockerfile.rocm (#141)

c1860d6

* Dockerfile.rocm bug fix * naming preference --------- Co-authored-by: Gregory Shtrasberg <[email protected]>

Update test-template.j2 (#145)

7c5fd50

Adding Triton implementations awq_dequantize and awq_gemm to ROCm (#136)

aa36718

* basic support for AWQ added * awq_dequantize implementation in Triton * awq_gemm implementation in Triton * unit tests in tests/kernels/test_awq_triton.py

mht-sharma merged commit cec14e0 into mht-sharma:rocm-vllm-main Aug 21, 2024
