Chunk prefill cache writes, remove div_i32 from insert_or_update_cache #289

Merged

kzawora-intel commented on Sep 17, 2024

Re-implements the following PRs for the current habana_main:
#102 (Removing div_i32 operations from each layer)
#115 (removing scatter for reshape&cache in case of prompt)
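
One plausible reading of the two PR titles above, sketched in plain Python rather than the actual habana_main code: the integer division that turns flat slot ids into block indices and offsets is done once per step and reused by every layer, and prompt (prefill) writes copy whole blocks instead of scattering single tokens. Every name here (`split_slots`, `write_tokens`, `write_prompt_chunks`, `BLOCK_SIZE`) is hypothetical, and the block-aligned prompt layout is an assumption.

```python
# Hypothetical sketch; none of these names come from vLLM or habana_main.
BLOCK_SIZE = 4  # assumed number of tokens per KV-cache block


def split_slots(slot_mapping, block_size=BLOCK_SIZE):
    """Turn flat slot ids into block indices and in-block offsets, once per step."""
    block_indices = [slot // block_size for slot in slot_mapping]
    block_offsets = [slot % block_size for slot in slot_mapping]
    return block_indices, block_offsets


def write_tokens(cache, values, block_indices, block_offsets):
    """Per-token write: one element per (block, offset) pair."""
    for value, block, offset in zip(values, block_indices, block_offsets):
        cache[block][offset] = value


def write_prompt_chunks(cache, values, block_indices, block_size=BLOCK_SIZE):
    """Prompt-style write: copy whole blocks, assuming block-aligned input."""
    for chunk_start in range(0, len(values), block_size):
        block = block_indices[chunk_start]
        cache[block] = list(values[chunk_start:chunk_start + block_size])


num_layers, num_blocks = 3, 4
caches = [[[None] * BLOCK_SIZE for _ in range(num_blocks)]
          for _ in range(num_layers)]

# Two full blocks of prompt tokens: block 2 first, then block 1.
slot_mapping = [8, 9, 10, 11, 4, 5, 6, 7]

# The division and modulo run once here, outside the per-layer loop.
block_indices, block_offsets = split_slots(slot_mapping)

for layer, cache in enumerate(caches):
    keys = [f"k{layer}_{i}" for i in range(len(slot_mapping))]
    write_prompt_chunks(cache, keys, block_indices)  # no per-layer division
```

In this sketch `write_tokens` and `write_prompt_chunks` fill the cache identically for block-aligned input; the chunked version simply does one copy per block instead of one write per token.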

Accuracy (GSM8K on Llama3.1-8B-Instruct):

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|--:|---|--:|---|---|--:|---|--:|
|gsm8k_cot_llama|3|flexible-extract|8|exact_match|↑|0.8415|±|0.0101|
| | |strict-match|8|exact_match|↑|0.8400|±|0.0101|

I've benchmarked this change on Llama3.1-8B-Instruct. On average, throughput improves by +2.50% (+558.14 tok/s, ~21594 tok/s -> ~22152 tok/s) across all prefill buckets on G2, with gains of up to +4.40% (+956.79 tok/s, ~25031 tok/s -> ~25988 tok/s) in compute-bound scenarios.

num_slots_available, block_indices,
block_offset)
def forward(self, input, cache, block_indices, block_offset):
    insert_or_update_cache(input, cache, block_indices, block_offset)

Isn't VLLMKVCache patched by INC/HQT? Will this work with fp8?

Author

This actually breaks FP8 ([rank0]: TypeError: PatchedVLLMKVCache.forward() missing 2 required positional arguments: 'block_indices' and 'block_offset').
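
The error above is the usual symptom of a wrapper whose `forward()` still expects a longer argument list than the caller now passes. A minimal sketch, assuming the patched class kept two extra leading parameters: `PatchedCache`, `older_arg`, and `call_like_updated_layer` are hypothetical stand-ins, not the INC/HQT code.

```python
class PatchedCache:
    """Stand-in for a wrapper still using an older, longer forward() signature."""

    def forward(self, input, cache, older_arg, num_slots_available,
                block_indices, block_offset):
        return cache


def call_like_updated_layer(module):
    """Call forward() with four positional arguments, as the updated code does."""
    return module.forward("input", "cache", [0], [0])


try:
    call_like_updated_layer(PatchedCache())
    error_message = None
except TypeError as exc:
    # The four values bind to the first four parameters, so the last two
    # parameters are the ones reported as missing.
    error_message = str(exc)

print(error_message)
```

Because the arguments bind positionally, the two trailing parameters are reported as missing even though the caller intended to pass exactly those. Keeping the wrapper's signature in step with the base class would avoid this; whether and how INC/HQT does that is not shown in this thread.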

kzawora-intel added the habana label (Issues or PRs submitted by Habana Labs) on Sep 20, 2024
libinta added a commit that referenced this pull request on Sep 24, 2024
michalkuligowski merged commit 1c6bada into habana_main on Sep 26, 2024
19 checks passed
michalkuligowski deleted the private/kzawora/insert_or_update_cache_opt branch on September 26, 2024 12:53
zhouyu5 pushed a commit to zhouyu5/vllm-fork that referenced this pull request Sep 27, 2024