Chunk prefill cache writes, remove div_i32 from insert_or_update_cache #289
Conversation
vllm/hpu/utils.py (Outdated)
                               num_slots_available, block_indices,
                               block_offset)

    def forward(self, input, cache, block_indices, block_offset):
        insert_or_update_cache(input, cache, block_indices, block_offset)
Isn't VLLMKVCache patched by INC/HQT? Will this work with fp8?
This actually breaks FP8 ([rank0]: TypeError: PatchedVLLMKVCache.forward() missing 2 required positional arguments: 'block_indices' and 'block_offset').
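For reference, a minimal sketch of the mismatch, assuming the INC/HQT patch still declares forward() with the old six-argument signature (the PatchedVLLMKVCache body below is illustrative, not the actual toolkit code):

class VLLMKVCache:
    # New four-argument signature introduced by this PR.
    def forward(self, input, cache, block_indices, block_offset):
        pass

class PatchedVLLMKVCache(VLLMKVCache):
    # Hypothetical stand-in for the quantization patch: written against the
    # old six-argument signature and swapped in for VLLMKVCache at runtime.
    def forward(self, input, cache, num_kv_cache_passes, num_slots_available,
                block_indices, block_offset):
        pass

if __name__ == "__main__":
    kv = PatchedVLLMKVCache()
    try:
        # Model code updated by this PR now passes only four arguments, so the
        # patched forward() binds them to its first four parameters and the
        # last two stay unfilled.
        kv.forward("key", "cache", "block_indices", "block_offset")
    except TypeError as e:
        # ... missing 2 required positional arguments: 'block_indices' and 'block_offset'
        print(e)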
Re-implements the following PRs for current habana_main:
#102 (Removing div_i32 operations from each layer)
#115 (removing scatter for reshape&cache in case of prompt)
Accuracy (GSM8K on Llama3.1-8B-Instruct):

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-----------------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k_cot_llama | 3 | flexible-extract | 8 | exact_match | ↑ | 0.8415 | ± | 0.0101 |
| | | strict-match | 8 | exact_match | ↑ | 0.8400 | ± | 0.0101 |
I've benchmarked this change on Llama3.1-8B-Instruct: on average, a +2.50% throughput gain (+558.14 tok/s, ~21594 tok/s -> ~22152 tok/s) is observed across all prefill buckets on G2, with up to a +4.40% (+956.79 tok/s, ~25031 -> ~25988 tok/s) throughput increase in compute-bound scenarios.
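For context, a rough sketch of the idea behind the div_i32 removal, using illustrative names, shapes, and block size rather than the exact habana_main code: the block-index / block-offset split of slot_mapping is computed once in the model runner and passed into every layer's cache write, instead of each layer's insert_or_update_cache repeating the division (one div_i32 per layer); for prompts, whole blocks can be written in one shot instead of scattering slot by slot.

import torch

BLOCK_SIZE = 128  # illustrative value

def precompute_cache_indices(slot_mapping: torch.Tensor):
    # Done once per batch in the model runner; previously this div/mod was
    # repeated inside every layer's cache write, showing up as a div_i32 op
    # per layer in the HPU graph.
    block_indices = torch.div(slot_mapping, BLOCK_SIZE, rounding_mode="floor")
    block_offset = torch.remainder(slot_mapping, BLOCK_SIZE)
    return block_indices, block_offset

def write_cache(input, cache, block_indices, block_offset):
    # Illustrative cache write using the precomputed indices. Passing
    # block_offset=None models the prompt path, where full blocks are copied
    # at once instead of scattered token by token.
    if block_offset is None:
        cache.index_copy_(0, block_indices, input)
    else:
        cache.index_put_((block_indices, block_offset), input)

if __name__ == "__main__":
    cache = torch.zeros(8, BLOCK_SIZE, 2, 4)   # [num_blocks, block_size, heads, head_dim]
    key = torch.randn(5, 2, 4)                 # 5 decode tokens
    slot_mapping = torch.tensor([3, 130, 131, 260, 7])
    block_indices, block_offset = precompute_cache_indices(slot_mapping)
    write_cache(key, cache, block_indices, block_offset)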