Iboiko/flatpa blocksnumber #8108
Conversation
* Fix setup.py for HPU
* Fix vllm._C import ops -> vllm.hpu import ops
* more of the same thing
* re-add hpex rmsnorm and rope; but rope is crashing
* remove unnecessary comments
* add vllm/hpu files
* add hpu autodetection
* Add HabanaAttention stub
* revert accidental changes
* revert non-habana backend attention changes
* add habana attention/worker/executor, sampling fails now
* Restore unnecessarily changed files
* enable HabanaMemoryProfiler
* Make sampler pass
* restore habana fused rope
* prefill is now working!!!
* fix prefill padding; decode is now working!!!!!
* revert accidental changes
* remove unused stuff in habana_paged_attn.py
* remove diagnostic stuff from llm_engine.py
* use HabanaExecutorAsync in async_llm_engine.py
* add habana copyright headers to habana_*.py files
* fix prefill attention conformance
* minor naming fixes
* remove naive attention from habana_attn (it never worked anyway)
* re-enable profile run
* Add fake HPUGraph support
* add more metrics
* indentation fix
* ~~recipe cache metrics don't work lalalala~~
* i'm done with metrics for now
* fix corner case in which hl-smi is not available but synapse is
* FIXME: temporary setup.py workaround
* WIP: add tensor parallelism stubs
* habana worker cleanup
* tensor parallelism is now working
* remove unused files
* remove unused func
* add hpugraphrunner
* improve hpu layernorm
* Port pipelined PA
* Port context length bucketing
* remove cudagraphrunner from hpu runner
* restore HPUGraphRunner back from FakeHPUGraphRunner
* handle rotary embeddings properly on gaudi3
* oopsie! captured_block_counts was incorrect!
* captured_block_counts.append doesn't do anything
* Restore habana_main KV cache memory layout
* fix memory profiler
* overhaul hpugraph capture
* memory profiling overhaul
* format memory properly in model warmup
* add graph compilation profiler for graph capture phase
* roll back log lvl on graph capture message
* Remove unnecessary view on residual connection in RMSNorm (#25)

Co-authored-by: madamczykhabana <[email protected]>
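The last commit in this series drops an unnecessary view on the residual connection in RMSNorm. As a rough illustration only, here is a minimal fused-add RMSNorm that operates on the tensors directly; the function name and shapes are hypothetical and this is not the actual vLLM kernel:

```python
import torch


def rms_norm_with_residual(x: torch.Tensor,
                           residual: torch.Tensor,
                           weight: torch.Tensor,
                           eps: float = 1e-6):
    """Add the residual, then apply RMSNorm to the sum.

    Working directly on (batch, seq, hidden) tensors avoids an extra
    .view() on the residual; shapes and names are illustrative only.
    """
    # Fused residual add: the updated residual is returned for the next layer.
    residual = residual + x
    # RMS statistic over the hidden dimension.
    variance = residual.pow(2).mean(dim=-1, keepdim=True)
    normed = residual * torch.rsqrt(variance + eps)
    return normed * weight, residual


if __name__ == "__main__":
    hidden = 16
    x = torch.randn(2, 4, hidden)
    res = torch.randn(2, 4, hidden)
    w = torch.ones(hidden)
    out, new_res = rms_norm_with_residual(x, res, w)
    print(out.shape, new_res.shape)
```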
Rebase habana_main up to cc466a3
WA: Disable cumsum in HPU _prepare_prompt
* Bucketing/Warmup WIP
* Cleanup
* Revert "Fix model_output_idx on HPU (#27)". This reverts commit 90dfa92.
* Rework selected_token_indices fix to also work with block_size padding
* Simple prompt attention POC
* Remove cumsum
* MQA/GQA support for simple prompt_attention
* Cleanup
* Fix typo
* Restore profiling runs
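The "Simple prompt attention POC" with MQA/GQA support can be pictured as expanding the shared key/value heads to match the query heads and then running plain causal attention. A minimal sketch under assumed shapes (not the actual vllm.hpu ops):

```python
import torch
import torch.nn.functional as F


def simple_prompt_attention(q, k, v, scale):
    """Naive prompt (prefill) attention with MQA/GQA support.

    q: (batch, q_heads, seq, head_dim)
    k, v: (batch, kv_heads, seq, head_dim) with q_heads % kv_heads == 0.
    Shapes and the repeat-based GQA expansion are illustrative only.
    """
    q_heads, kv_heads = q.shape[1], k.shape[1]
    if kv_heads != q_heads:
        # GQA/MQA: every group of query heads shares one KV head.
        group = q_heads // kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale
    # Causal mask so each token only attends to itself and earlier tokens.
    seq = q.shape[-2]
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.matmul(F.softmax(scores, dim=-1), v)


if __name__ == "__main__":
    b, qh, kvh, s, d = 1, 8, 2, 5, 16
    out = simple_prompt_attention(torch.randn(b, qh, s, d),
                                  torch.randn(b, kvh, s, d),
                                  torch.randn(b, kvh, s, d),
                                  scale=d ** -0.5)
    print(out.shape)  # (1, 8, 5, 16)
```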
* Fix HPU auto-detection in setup.py
* Update setup.py
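For context, HPU auto-detection in a setup script usually amounts to probing for the Habana PyTorch bridge. A minimal sketch of that idea; the VLLM_TARGET_DEVICE override and the exact logic are assumptions, not the actual setup.py code:

```python
import importlib.util
import os


def is_hpu_available() -> bool:
    """Heuristic HPU detection for a build script.

    Treat the build as an HPU build when the Habana PyTorch bridge is
    importable or an explicit override is set. This mirrors the idea of
    the setup.py auto-detection, not its exact implementation.
    """
    if os.environ.get("VLLM_TARGET_DEVICE", "") == "hpu":  # assumed override
        return True
    return importlib.util.find_spec("habana_frameworks") is not None


if __name__ == "__main__":
    print("Building for HPU" if is_hpu_available() else "Building for GPU/CPU")
```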
* add gaudi installation readme * readme writeup * Create README_GAUDI.md * Update README.md * Update README_GAUDI.md * Update README.md * Update readmes
* Fix mixtral hidden states layout to fit into habana model runner
* Add static moe op to mixtral
* Add mark_step to static_fused_moe
* Update __init__.py
* Fix code indentation
* Make code compatible with non HPU devices
* Move static_fused_moe to vllm.hpu.ops
* Update mixtral.py
* Move op import from forward to top of the file
* Remove circular import
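The static_fused_moe approach keeps tensor shapes fixed for HPU graphs by running every expert over every token and zeroing the routing weights outside each token's top-k. A simplified sketch with hypothetical names; the mark_step calls and the real vllm.hpu.ops kernel are omitted:

```python
import torch
import torch.nn.functional as F


def static_fused_moe(hidden, w_gate, w_up, w_down, router_logits, top_k):
    """Static-shape mixture-of-experts: every expert processes every token.

    hidden:        (tokens, hidden_dim)
    w_gate/w_up:   (experts, inter_dim, hidden_dim)
    w_down:        (experts, hidden_dim, inter_dim)
    Routing weights outside the per-token top-k are zeroed, so the result
    matches sparse routing while all shapes stay fixed (graph friendly).
    """
    probs = F.softmax(router_logits, dim=-1)                 # (tokens, experts)
    topk_vals, topk_idx = probs.topk(top_k, dim=-1)
    routing = torch.zeros_like(probs)
    routing.scatter_(-1, topk_idx, topk_vals)
    routing = routing / routing.sum(dim=-1, keepdim=True)    # renormalize top-k

    out = torch.zeros_like(hidden)
    for e in range(w_gate.shape[0]):
        # SwiGLU-style expert MLP; every token goes through every expert.
        x = F.silu(hidden @ w_gate[e].t()) * (hidden @ w_up[e].t())
        out = out + routing[:, e:e + 1] * (x @ w_down[e].t())
    return out


if __name__ == "__main__":
    t, h, i, e = 4, 32, 64, 8
    y = static_fused_moe(torch.randn(t, h),
                         torch.randn(e, i, h), torch.randn(e, i, h),
                         torch.randn(e, h, i), torch.randn(t, e), top_k=2)
    print(y.shape)  # (4, 32)
```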
* Use setuptools older than 70.0.0
* Delete pyproject.toml

Co-authored-by: Konrad Zawora <[email protected]>
* Trimmed metadata - part 1
* [WIP] HPU graphs for decode
* [WIP] Graph allocation algorithm reworked
* Cleanup
* Add graph memory estimations
* Fix multinode synchronization
* Create attn_bias inside HPU graph
* Cleanup after rebase
* Increase default VLLM_GRAPH_RESERVED_MEM to 0.3
* Remove obsolete class
* Tweak default HPU graph parameters
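The graph memory estimation reduces to reserving a fraction of the memory left after the profile run for captured HPU graphs, with VLLM_GRAPH_RESERVED_MEM defaulting to 0.3 as in the commit above. A back-of-the-envelope sketch of that budgeting; the formula is an illustration of the idea, not the exact implementation:

```python
import os


def graph_memory_budget(free_device_mem_bytes: int) -> int:
    """Return how many bytes may be spent on captured HPU graphs.

    A fraction of the memory left after weights and KV cache is reserved
    for graph capture; the 0.3 default matches the commit above, but this
    simplified formula is illustrative only.
    """
    reserved_frac = float(os.environ.get("VLLM_GRAPH_RESERVED_MEM", "0.3"))
    return int(free_device_mem_bytes * reserved_frac)


if __name__ == "__main__":
    free = 20 * 1024**3  # e.g. 20 GiB left after the profile run
    budget = graph_memory_budget(free)
    print(f"Graph capture budget: {budget / 1024**3:.1f} GiB")
```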
It causes OOM on 70b
Co-authored-by: Krzysztof Laskowski <[email protected]>
* Cleanup AttentionMetadata on HPU
* Flat PA - POC
* Decode warmup overhaul
* Debugging OOM
* Experimental profiling
* Fix input_hash calculation
* Block bucket size 32 -> 16
* Improve host time
* Skip UTs
* Add GQA/MQA
* Add mask instead of filling
* 2d block mapping
* Optional flipping in PA
* Runner updated for 2d block mapping
* Restore mark_step
* Eliminate physical transposes
* Disable warmup_mode
* Revert changes to test_attention.py
* POC: build block_bias on device
* Cleanup
* Fix seq_len calculation
* Experimental profiling
* Add missing call to kv_matmul_op
* Fix block_usage calculation
* Change default block bucket step for decode to 128
* Fix max decode block bucket calculation
* Fix block_usage calculations
* Cleanup
* Cleanup profiler code
* Print values for bucketing vars
* Pass block size to HpuModelAdapter

Co-authored-by: barak goldberg <[email protected]>
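Flat PA with a 2D block mapping flattens all KV-cache blocks referenced by the batch into one list and tracks, per flat block, which sequence it belongs to and how many of its slots are valid (block_usage). A toy sketch of that bookkeeping under assumed names and layout; it is not the actual habana_main tensors or kernel:

```python
from typing import List, Tuple

import torch


def flat_pa_metadata(block_tables: List[List[int]],
                     context_lens: List[int],
                     block_size: int) -> Tuple[torch.Tensor, ...]:
    """Build flattened metadata for a flat paged-attention decode step.

    block_tables: per-sequence physical block ids.
    context_lens: number of valid tokens per sequence.
    Returns (block_list, block_mapping, block_usage):
      block_list    - all blocks of the batch in one flat 1D tensor,
      block_mapping - which sequence each flat block belongs to,
      block_usage   - how many tokens of each flat block are valid.
    """
    block_list, block_mapping, block_usage = [], [], []
    for seq_idx, (table, ctx_len) in enumerate(zip(block_tables, context_lens)):
        num_blocks = (ctx_len + block_size - 1) // block_size
        for blk_idx in range(num_blocks):
            block_list.append(table[blk_idx])
            block_mapping.append(seq_idx)
            # The last block of a sequence may be only partially filled.
            used = min(block_size, ctx_len - blk_idx * block_size)
            block_usage.append(used)
    return (torch.tensor(block_list),
            torch.tensor(block_mapping),
            torch.tensor(block_usage))


if __name__ == "__main__":
    blocks, mapping, usage = flat_pa_metadata(
        block_tables=[[7, 3], [5]], context_lens=[20, 9], block_size=16)
    print(blocks.tolist())   # [7, 3, 5]
    print(mapping.tolist())  # [0, 0, 1]
    print(usage.tolist())    # [16, 4, 9]
```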
* Disable tokenizer
* Update protocol.py
* Update serving_completion.py
* Detect value of skip_tokenizer_init cmd arg
* support skipping tokenizer for streaming scenario
* remove debug print

Co-authored-by: Michał Kuligowski <[email protected]>
Co-authored-by: Krzysztof Laskowski <[email protected]>
* Disable tokenizer
* Update protocol.py
* Update serving_completion.py
* Detect value of skip_tokenizer_init cmd arg
* support skipping tokenizer for streaming scenario
* remove debug print
* Suppress None EOS token warning

Co-authored-by: Michał Kuligowski <[email protected]>
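Skipping tokenizer initialization can be exercised from the offline API roughly as follows; this is a minimal sketch assuming the skip_tokenizer_init engine argument, token-id prompts, and the detokenize sampling flag available in recent vLLM versions (the model name and token ids are placeholders). Without a tokenizer there is no EOS token to report, which is presumably what the "Suppress None EOS token warning" commit addresses.

```python
from vllm import LLM, SamplingParams

# Skip loading the tokenizer entirely; the engine then expects prompts to
# arrive as token ids and will not detokenize outputs. Model name, token ids
# and the exact input format are illustrative and may differ between versions.
llm = LLM(model="meta-llama/Llama-2-7b-hf", skip_tokenizer_init=True)

params = SamplingParams(max_tokens=16, detokenize=False)
outputs = llm.generate({"prompt_token_ids": [1, 3087, 372, 263]}, params)

for out in outputs:
    # Only token ids are available; text stays empty without a tokenizer.
    print(out.outputs[0].token_ids)
```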
Fix block number calculation for Flat PA by adding an empty table block (HabanaAI#158)
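One way to read this fix (a hypothetical illustration only, not the actual patch): when the per-sequence block tables are flattened for Flat PA, a sequence with an empty table should still contribute a padding entry so the total block count and the chosen bucket stay consistent. A sketch of that idea, with pad_block_id and the helper name invented for illustration:

```python
def flat_pa_num_blocks(block_tables, pad_block_id=0):
    """Count blocks for Flat PA bucketing, padding empty block tables.

    Hypothetical illustration of the 'empty table block' idea in the PR
    title, not the actual patch; pad_block_id is an assumed dummy id.
    """
    total = 0
    padded_tables = []
    for table in block_tables:
        if not table:
            # Give empty tables one dummy block so they are still counted
            # and the flattened layout never ends up with zero entries.
            table = [pad_block_id]
        padded_tables.append(table)
        total += len(table)
    return total, padded_tables


if __name__ == "__main__":
    total, tables = flat_pa_num_blocks([[7, 3], [], [5]])
    print(total, tables)  # 4 [[7, 3], [0], [5]]
```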