[Kernel][Model] Improve continuous batching for Jamba and Mamba #9189

mzusman · 2024-10-09T10:48:15Z

Mamba kernels now avoid reading from or writing to padded cache entries.
Jamba now uses the continuous batching functionality previously implemented in:
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching #8012
[Kernel] Change interface to Mamba selective_state_update for continuous batching #8039
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model #8533
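To illustrate the first point, here is a minimal pure-Python sketch of skipping padded entries during a state-cache update. It is not the CUDA kernel code from this PR: `update_states` and its arguments are hypothetical names, and the sentinel value -1 is assumed from the `PAD_SLOT_ID` mentioned in the review discussion.

```python
PAD_SLOT_ID = -1  # assumed sentinel marking a padded batch entry


def update_states(cache, cache_indices, new_states, pad_slot_id=PAD_SLOT_ID):
    """Write each new state into its cache slot, skipping padded entries.

    Hypothetical helper: cache_indices[i] is the cache slot for batch
    entry i, or pad_slot_id if entry i is padding.
    """
    for batch_idx, slot in enumerate(cache_indices):
        if slot == pad_slot_id:
            # Padded entry: touch nothing in the cache.
            continue
        cache[slot] = new_states[batch_idx]
    return cache


# Four cache slots, three batch entries, the middle one is padding.
cache = [[0.0, 0.0] for _ in range(4)]
cache_indices = [2, PAD_SLOT_ID, 0]
new_states = [[1.0, 1.0], [9.0, 9.0], [3.0, 3.0]]
update_states(cache, cache_indices, new_states)
```

Without the check, a Python list would silently treat -1 as "last slot" and overwrite it; in a GPU kernel a negative index would instead be an out-of-bounds access, so the early exit matters in both settings.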

github-actions · 2024-10-09T10:48:28Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

tlrmchlsmth

One suggestion: To mirror the PAD_SLOT_ID that's in backends/utils.py, we could #define PAD_SLOT_ID -1 in csrc/attention_generic.cuh, so that we name the constant.

And then we can add comments letting the reader know to keep them in sync

csrc/mamba/causal_conv1d/causal_conv1d.cu

tests/kernels/test_causal_conv1d.py

vllm/model_executor/models/jamba.py

mzusman · 2024-10-10T10:16:02Z

One suggestion: To mirror the PAD_SLOT_ID that's in backends/utils.py, we could #define PAD_SLOT_ID -1 in csrc/attention_generic.cuh, so that we name the constant.

And then we can add comments letting the reader know to keep them in sync

I've taken another approach: passing the pad_slot_id through the kernel params. This keeps it simpler, since we don't need to keep the C++ and Python variables in sync, and it keeps the Mamba kernels as standalone entities. WDYT?
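A minimal sketch of the design choice described here, written in Python rather than the actual C++/CUDA parameter structs: the sentinel travels with each call instead of living in a constant shared between two languages. `KernelParams` and `active_batch_entries` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KernelParams:
    """Hypothetical stand-in for the argument struct handed to a kernel."""
    cache_indices: List[int]
    pad_slot_id: int  # supplied by the caller on every launch


def active_batch_entries(params: KernelParams) -> List[int]:
    """Return the batch positions that are not padding."""
    return [
        i for i, slot in enumerate(params.cache_indices)
        if slot != params.pad_slot_id
    ]


# The caller decides which value marks padding.
active = active_batch_entries(
    KernelParams(cache_indices=[5, -1, 7, -1], pad_slot_id=-1))

# A different sentinel works without changing the "kernel" side.
other = active_batch_entries(
    KernelParams(cache_indices=[5, 99, 7], pad_slot_id=99))
```

Because the callee never hard-codes the sentinel, there is no second definition that could drift out of sync with the caller.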

mzusman · 2024-10-13T10:48:54Z

Pushed adaptations for #6484. The PR is ready again.

tlrmchlsmth · 2024-10-15T00:54:26Z

Changing the title since Jamba and Mamba already support continuous batching -- this just makes it better

tlrmchlsmth · 2024-10-15T00:55:52Z

One suggestion: To mirror the PAD_SLOT_ID that's in backends/utils.py, we could #define PAD_SLOT_ID -1 in csrc/attention_generic.cuh, so that we name the constant.
And then we can add comments letting the reader know to keep them in sync

I've taken another approach: passing the pad_slot_id through the kernel params. This keeps it simpler, since we don't need to keep the C++ and Python variables in sync, and it keeps the Mamba kernels as standalone entities. WDYT?

Sounds good to me

tlrmchlsmth

I tried out the PR and am currently seeing the following error:

E           RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241015-011501.pkl): _C::causal_conv1d_fwd() expected at most 8 argument(s) but received 9 argument(s). Declaration: _C::causal_conv1d_fwd(Tensor($0! -> ) x, Tensor($1! -> ) weight, Tensor? bias_, Tensor($2! -> )? conv_states, Tensor? query_start_loc, Tensor? cache_indices, Tensor? has_initial_state, bool silu_activation) -> Tensor

If you're seeing the same issue, could you LMK when it's fixed? Looks good otherwise, thanks!
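The message above reports a callee declared with eight parameters being invoked with nine. The thread does not establish the cause, but the mismatch itself can be sketched in plain Python. `causal_conv1d_fwd_old` is a hypothetical stand-in that copies the eight parameter names from the declaration in the error, and `try_call` is an illustrative helper, not part of vLLM or PyTorch.

```python
import inspect


def causal_conv1d_fwd_old(x, weight, bias_, conv_states, query_start_loc,
                          cache_indices, has_initial_state, silu_activation):
    """Hypothetical stand-in for an op declared with eight parameters."""
    return x


def try_call(fn, *args):
    """Return None if args bind to fn's signature, else the error text."""
    try:
        inspect.signature(fn).bind(*args)
    except TypeError as exc:
        return str(exc)
    return None


eight_args = ["x", "w", None, None, None, None, None, True]
nine_args = eight_args + [-1]  # one extra trailing argument

ok = try_call(causal_conv1d_fwd_old, *eight_args)
err = try_call(causal_conv1d_fwd_old, *nine_args)
```

Here `ok` is `None` and `err` holds a binding error, mirroring a caller and a declaration that disagree on the parameter list.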

csrc/mamba/causal_conv1d/causal_conv1d.cu

tests/kernels/test_causal_conv1d.py

vllm/model_executor/layers/mamba/ops/causal_conv1d.py

vllm/model_executor/layers/mamba/ops/mamba_ssm.py

mzusman · 2024-10-15T23:48:56Z

I tried out the PR and am currently seeing the following error:

E           RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241015-011501.pkl): _C::causal_conv1d_fwd() expected at most 8 argument(s) but received 9 argument(s). Declaration: _C::causal_conv1d_fwd(Tensor($0! -> ) x, Tensor($1! -> ) weight, Tensor? bias_, Tensor($2! -> )? conv_states, Tensor? query_start_loc, Tensor? cache_indices, Tensor? has_initial_state, bool silu_activation) -> Tensor

If you're seeing the same issue, could you LMK when it's fixed? Looks good otherwise, thanks!

@tlrmchlsmth I cannot reproduce this error. Could you let me know how you ran into it?

tlrmchlsmth · 2024-10-16T02:18:18Z

@tlrmchlsmth I cannot reproduce this error. Could you let me know how you ran into it?

Possibly user error, let's see if it goes through the CI ;)

tlrmchlsmth

Thanks for the great work, much nicer management of the Mamba Mixer state.

Now it's my turn to extend mamba_chunk_scan_combined to support this for Mamba2 :)

tlrmchlsmth · 2024-10-16T16:12:37Z

Merging now! According to @simon-mo, the readthedocs failure looks transient.

mzusman added 4 commits October 9, 2024 11:48

Do not read or write to padding

2109fbc

Padding support to mamba_ssm

8545ec4

continuous batching jamba

24758ed

format

385c257

mzusman requested review from tlrmchlsmth and WoosukKwon as code owners October 9, 2024 10:48

format

927be2c

tlrmchlsmth reviewed Oct 9, 2024

csrc/mamba/causal_conv1d/causal_conv1d.cu (outdated, resolved)

csrc/mamba/causal_conv1d/causal_conv1d.cu (outdated, resolved)

tlrmchlsmth reviewed Oct 9, 2024

mzusman added 4 commits October 10, 2024 17:36

Add pad_slot_id to kernels params

5bc07fb

revert merged column parallel

6c9b043

Add TP=2 test

3a4d02b

Add with padding params to mamba tests

c934c30

mzusman requested review from DarkLight1337 and ywang96 as code owners October 10, 2024 14:38

mzusman added 8 commits October 10, 2024 17:39

Format

8a6626c

causal_conv1d outputs to x inplace

a6eab1b

Fix tests and add with_padding test

f6d3a05

Fix tests

d96bd01

Return none

a663fa8

Merge remote-tracking branch 'github/main' into continous_batching_mamba_from_scratch

39776d5

fix typo

69ebab8

remove diffs and use pad_slot_id as var in tests

906379d

mzusman changed the title ~~[Kernel][Model] Continous batching for Jamba~~ [Kernel][Model] Continuous batching for Jamba Oct 13, 2024

mzusman added 4 commits October 13, 2024 13:16

Adaptiations to vllm-project#6484 and Merge remote-tracking branch 'github/main' into continous_batching_mamba_from_scratch

fa1162e

Fix tests

94fe819

format

1544231

Format

158f22d

mzusman changed the title ~~[Kernel][Model] Continuous batching for Jamba~~ [Kernel][Model] Continuous batching for Jamba and Mamba Oct 14, 2024

tlrmchlsmth changed the title ~~[Kernel][Model] Continuous batching for Jamba and Mamba~~ [Kernel][Model] Improve continuous batching for Jamba and Mamba Oct 15, 2024

tlrmchlsmth reviewed Oct 15, 2024

Merge remote-tracking branch 'github/main' into continous_batching_mamba_from_scratch

c893c68

mzusman added 5 commits October 16, 2024 08:01

Address review comments

40d14ee

Format

7325254

Fix mamba

f4c1198

add cache empty for consistency

d1c5c32

Format

9905319

tlrmchlsmth added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 16, 2024

tlrmchlsmth approved these changes Oct 16, 2024

tlrmchlsmth merged commit fb60ae9 into vllm-project:main Oct 16, 2024
87 of 88 checks passed

This was referenced Oct 16, 2024

Simplify Jamba state management #7428

Closed

[Model] Support Mamba2 (Codestral Mamba) #9292

Draft

charlifu pushed a commit to charlifu/vllm that referenced this pull request Oct 23, 2024

[Kernel][Model] Improve continuous batching for Jamba and Mamba (vllm-project#9189) Signed-off-by: charlifu <[email protected]>

39c9698

vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Oct 23, 2024

[Kernel][Model] Improve continuous batching for Jamba and Mamba (vllm-project#9189) Signed-off-by: Vinay Damodaran <[email protected]>

0a4287d

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

[Kernel][Model] Improve continuous batching for Jamba and Mamba (vllm-project#9189) Signed-off-by: Alvant <[email protected]>

988f9c1

garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024

[Kernel][Model] Improve continuous batching for Jamba and Mamba (vllm-project#9189) Signed-off-by: Amit Garg <[email protected]>

5ddff51

FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024

[Kernel][Model] Improve continuous batching for Jamba and Mamba (vllm-project#9189) Signed-off-by: qishuai <[email protected]>

e40eb40

mzusman mentioned this pull request Oct 30, 2024

[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 #9838

Merged

sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024

[Kernel][Model] Improve continuous batching for Jamba and Mamba (vllm-project#9189) Signed-off-by: Sumit Dubey <[email protected]>

e73ea6c
