
Add a baseline with dynamic growing KV cache size for the paper #134

Closed

zhuohan123 opened this issue May 31, 2023 · 3 comments

zhuohan123 (Member)

No description provided.
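The issue title suggests comparing vLLM's paged, block-based KV cache against a baseline whose KV cache grows dynamically as the sequence lengthens. Since no description was provided, here is a minimal sketch of what such a baseline could look like, assuming a contiguous per-request buffer with doubling growth; the class name and growth policy are illustrative, not taken from the issue:

```python
import torch


class GrowingKVCache:
    """Illustrative baseline: a per-request KV cache that grows on demand.

    Unlike vLLM's paged cache (fixed-size blocks handed out by a block
    manager), this keeps one contiguous buffer per request and reallocates
    it with doubled capacity whenever it fills, amortizing the copy cost.
    """

    def __init__(self, num_heads: int, head_dim: int, dtype=torch.float16):
        self.num_heads, self.head_dim, self.dtype = num_heads, head_dim, dtype
        self.capacity = 0  # allocated slots
        self.length = 0    # filled slots
        self.k = torch.empty(0, num_heads, head_dim, dtype=dtype)
        self.v = torch.empty(0, num_heads, head_dim, dtype=dtype)

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        """Append [n, num_heads, head_dim] keys/values, growing if needed."""
        n = k_new.shape[0]
        if self.length + n > self.capacity:
            # Geometric growth: double capacity, copy the filled prefix over.
            new_cap = max(2 * self.capacity, self.length + n)
            for name in ("k", "v"):
                buf = torch.empty(new_cap, self.num_heads, self.head_dim,
                                  dtype=self.dtype)
                buf[:self.length] = getattr(self, name)[:self.length]
                setattr(self, name, buf)
            self.capacity = new_cap
        self.k[self.length:self.length + n] = k_new
        self.v[self.length:self.length + n] = v_new
        self.length += n

    def kv(self):
        """Return only the filled prefix for use in attention."""
        return self.k[:self.length], self.v[:self.length]
```

Relative to paged allocation, a cache like this avoids block-table indirection but pays for periodic reallocation copies and can hold up to roughly half its capacity as reserved-but-unused memory, which is presumably the trade-off such a baseline would let the paper quantify.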

zhuohan123 self-assigned this May 31, 2023
hmellor (Collaborator) commented Mar 8, 2024

@zhuohan123 is this still something you're interested in doing?

DarkLight1337 added the performance label May 31, 2024
yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024
SUMMARY:
* upstream merge (sync) up to `93348d9458af7517bb8c114611d438a1b4a2c3be`
* some minor changes related to `ruff` and `yapf`

NOTES: skipped flaky lora gemma test

TEST PLAN:
ran nightly, passed all except gemma
running now on remote push

---------

Signed-off-by: Tao He <[email protected]>
Signed-off-by: Yuan Tang <[email protected]>
Signed-off-by: Sherlock113 <[email protected]>
Co-authored-by: Ronen Schaffer <[email protected]>
Co-authored-by: Mustafa Eyceoz <[email protected]>
Co-authored-by: Roy <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Massimiliano Pronesti <[email protected]>
Co-authored-by: 44670 <[email protected]>
Co-authored-by: zhaoyang-star <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Jared Moore <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Cade Daniel <[email protected]>
Co-authored-by: 张大成 <[email protected]>
Co-authored-by: zhangdacheng <[email protected]>
Co-authored-by: Jingru <[email protected]>
Co-authored-by: Dylan Hawk <[email protected]>
Co-authored-by: Tao He <[email protected]>
Co-authored-by: Ganesh Jagadeesan <[email protected]>
Co-authored-by: Allen.Dou <[email protected]>
Co-authored-by: Liangfu Chen <[email protected]>
Co-authored-by: CHU Tianxiang <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Seonghyeon <[email protected]>
Co-authored-by: Billy Cao <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: felixzhu555 <[email protected]>
Co-authored-by: br3no <[email protected]>
Co-authored-by: simon-mo <[email protected]>
Co-authored-by: Sherry <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
Co-authored-by: Huarong <[email protected]>
Co-authored-by: huohuarong <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: alexm <[email protected]>
Co-authored-by: zixiao <[email protected]>
Co-authored-by: cloudhan <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Jason Cox <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: TianYu GUO <[email protected]>
Co-authored-by: Jialun Lyu <[email protected]>
Co-authored-by: ttbachyinsda <[email protected]>
Co-authored-by: guofangze <[email protected]>
Co-authored-by: Antoni Baum <[email protected]>
Co-authored-by: Avnish Narayan <[email protected]>
Co-authored-by: Chen Wang <[email protected]>
Co-authored-by: Hongxia Yang <[email protected]>
Co-authored-by: lcskrishna <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: Chujie Zheng <[email protected]>
Co-authored-by: TechxGenus <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: jacobthebanana <[email protected]>
Co-authored-by: whyiug <[email protected]>
Co-authored-by: Terry <[email protected]>
Co-authored-by: Douglas Lehr <[email protected]>
Co-authored-by: kliuae <[email protected]>
Co-authored-by: DAIZHENWEI <[email protected]>
Co-authored-by: Sherlock Xu <[email protected]>
Co-authored-by: Bo-Wen Wang <[email protected]>
Co-authored-by: Ronan McGovern <[email protected]>
Co-authored-by: Hui Liu <[email protected]>
Co-authored-by: 陈序 <[email protected]>
Co-authored-by: Or Sharir <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Thomas Parnell <[email protected]>
Co-authored-by: Dan Clark <[email protected]>
Co-authored-by: Daniel Clark <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Enrique Shockwave <[email protected]>
Co-authored-by: akhoroshev <[email protected]>
Co-authored-by: Dinghow Yang <[email protected]>
Co-authored-by: Junda Chen <[email protected]>
Co-authored-by: Yang Fan <[email protected]>
Co-authored-by: laneeee <[email protected]>
Xaenalt pushed a commit to Xaenalt/vllm that referenced this issue Aug 15, 2024
* formatting fixes

* Upstream CR update

* whitespace fix
mht-sharma added a commit to mht-sharma/vllm that referenced this issue Aug 21, 2024
* Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters (vllm-project#114)

* Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters

* Adding HTTP headers

* Add distributed executor backend to benchmark scripts (vllm-project#118)

* Add weight padding for moe (vllm-project#119)

* add weight padding for moe

* enable padding by default

* fix linter

* fix linter

* fix linter

* using envs.py

* fix linter

* [BugFix] Fix navi build after many custom for MI kernels added (vllm-project#116)

* fix navi build

* Created dummy kernels for ops unsupported on Navi to avoid function-not-found crashes at runtime

* replacing ifdefs on host code with those on kernels

* refactoring code to avoid unsupported call on Navi

* syntactic change

* import statements fix

* moving env variables to envs.py

* style fixes

* cosmetic changes for isort

* removed extra include

* moving use_skinny to be a member

---------

Co-authored-by: lcskrishna <[email protected]>
Co-authored-by: maleksan85 <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>

* add empty_cache() after each padding (vllm-project#120)

* [FIX] Gradlib OOM on Navi and sometimes on MI (vllm-project#124)

* add memory cleanup after every shape and parameter to reduce cache invalidation buffers

* small typo

* syntax change

---------

Co-authored-by: maleksan85 <[email protected]>

* save shape when fp8 solution not found (vllm-project#123)

Co-authored-by: Gregory Shtrasberg <[email protected]>

* Fix unit test for moe by adding padding (vllm-project#128)

* fix test_moe

* fix linter

* Llama3.1 (vllm-project#129)

* Add support for a rope extension method (vllm-project#6553)

* [BugFix] Fix RoPE error in Llama 3.1 (vllm-project#6693)

---------

Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>

* chat/completions endpoint (vllm-project#121)

* Initial implementation of chat/completions endpoint and its streaming variant

* Reusing datatypes from the openai entrypoints

* Response role from arg

* Added models endpoint and model validation from the request

* Optimize custom all reduce (vllm-project#130)

* First version

* Revert error.

While there, add missing finalize.

* Use the correct defaults for ROCm.

Increase sampling area to capture crossover.

* Scope end_sync as well.

* Guard only volatile keyword for ifndef USE_ROCM

* Document crossover

* Add BF16 support to custom PA (vllm-project#133)

* tightened atol for custom PA; enabled supported head sizes and block sizes in testing

* update num_blocks and num_iters in benchmark PA to realistic settings

* move to generic b16 type

* bf16 first port

* enabled all bf16 tests, set atol for bf16

* enable custom PA for bf16 as well as block size 32 and head size 64

* fix cast to zero in custom PA reduce

* py linter fixes

* clang format fixes

* div round up clang-format

---------

Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>

* Make the output-match check in the original types; it saves some memory. (vllm-project#135)

Co-authored-by: maleksan85 <[email protected]>

* Make CAR ROCm 6.1 compatible. (vllm-project#137)

* remove scoping
* while there, fix a typo
* while there, remove an unused variable

* Car revert (vllm-project#140)

* Per @iotamudelta's suggestion, until the deadlock issue is better understood:
Revert "Make CAR ROCm 6.1 compatible. (vllm-project#137)"

This reverts commit 4d2dda6.

* Per @iotamudelta's suggestion, until the deadlock issue is better understood:
Revert "Optimize custom all reduce (vllm-project#130)"

This reverts commit 636ff01.

* Using the correct datatypes for streaming non-chat completions (vllm-project#134)

* Adding UNREACHABLE_CODE macro for non MI300 and MI250 cards (vllm-project#138)

* Adding UNREACHABLE_CODE macro

* clang format fixes

* clang formatting fix

* minor updates in syntax

* clang format update

* clang format fix one more try

* clang format one more try

* clang format fix one more try

---------

Co-authored-by: Aleksandr Malyshev <[email protected]>

* gfx90a typo fix (vllm-project#142)

Co-authored-by: maleksan85 <[email protected]>

* wvsplitk templatized and better tuned for MI300 (vllm-project#132)

* improvements to wvSpltK

* wvsplt gemm; better handle MI300 and large A[] sizes

* lint fix

* Adjustments to better handle small weights in TP8.

* early-out bug fix

* better wave load balancing in wvSplt

* add missing skip for wvsplt_big

* Bug fix for wvSplt_big in load balancing at M4, lint fix.

* [Bugfix] Dockerfile.rocm (vllm-project#141)

* Dockerfile.rocm bug fix

* naming preference

---------

Co-authored-by: Gregory Shtrasberg <[email protected]>

* Update test-template.j2 (vllm-project#145)

* Adding Triton implementations awq_dequantize and awq_gemm to ROCm (vllm-project#136)

* basic support for AWQ added
* awq_dequantize implementation in Triton
* awq_gemm implementation in Triton
* unit tests in tests/kernels/test_awq_triton.py

---------

Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Matt Wong <[email protected]>
Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: Aleksandr Malyshev <[email protected]>
Co-authored-by: lcskrishna <[email protected]>
Co-authored-by: maleksan85 <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: iotamudelta <[email protected]>
Co-authored-by: sanyalington <[email protected]>
Co-authored-by: Hashem Hashemi <[email protected]>
Co-authored-by: Zachary Streeter <[email protected]>
Co-authored-by: omkar kakarparthi <[email protected]>
Co-authored-by: rasmith <[email protected]>
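For readers skimming the long commit message above, the weight padding referenced in vllm-project#119 is the trick of zero-padding GEMM operands to an aligned size so the GEMM library can pick a faster kernel. A minimal sketch under stated assumptions (the helper name and the 256-element alignment are illustrative, not the fork's actual implementation):

```python
import torch
import torch.nn.functional as F


def padded_matmul(x: torch.Tensor, w: torch.Tensor, align: int = 256):
    """Compute x @ w.T with the shared inner dim zero-padded to `align`.

    Hypothetical helper: zeros in the inner dimension add nothing to the
    dot products, so the result is unchanged, while the aligned size can
    steer the GEMM library onto a faster tiled code path.
    """
    pad = (-x.shape[-1]) % align
    if pad:
        x = F.pad(x, (0, pad))  # pad trailing columns of x with zeros
        w = F.pad(w, (0, pad))  # pad the matching columns of w
    return x @ w.transpose(-1, -2)
```

The companion change in vllm-project#120 calls empty_cache() after each padding step, presumably so the discarded unpadded buffers are returned by the caching allocator and do not inflate peak memory.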
wallashss pushed a commit to wallashss/vllm that referenced this issue Sep 2, 2024
Sync vllm with upstream/v0.5.5 to odh/main for 2.13
dtrifiro pushed a commit to dtrifiro/vllm that referenced this issue Sep 30, 2024
Sync vllm with upstream/v0.5.5 to odh/main for 2.13
github-actions bot commented

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Oct 31, 2024
github-actions bot commented

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions bot closed this as not planned Nov 30, 2024
Labels: performance (Performance-related issues), stale
Projects: none yet
Development: no branches or pull requests
Participants: 3