
Iboiko/flatpa blocksnumber #8108

Closed

Conversation

iboiko-habana
Contributor

Fix the block-number calculation for Flat PA by adding an empty table block (HabanaAI#158)
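For context, a minimal sketch of the idea behind the fix (the names, the padding-block id, and the bucketing policy are assumptions for illustration, not the actual vLLM/HabanaAI code): when per-sequence block tables are flattened for Flat PA, the total block count is padded up to a warmed-up bucket size using a reserved empty block, so the padded shape always matches a captured bucket.

```python
from typing import List

# Hypothetical helpers illustrating the block-count fix for Flat PA.
# PAD_BLOCK_ID is an assumed reserved "empty" block used only for padding.
PAD_BLOCK_ID = 0


def round_up(value: int, step: int) -> int:
    """Round value up to the nearest multiple of step."""
    return ((value + step - 1) // step) * step


def flatten_block_tables(block_tables: List[List[int]],
                         block_bucket_step: int = 128) -> List[int]:
    """Flatten per-sequence block tables and pad the total block count
    to the bucketed size with the empty padding block."""
    flat = [blk for table in block_tables for blk in table]
    # Keep at least one block so the padded bucket is never empty.
    target = round_up(max(len(flat), 1), block_bucket_step)
    flat.extend([PAD_BLOCK_ID] * (target - len(flat)))
    return flat


if __name__ == "__main__":
    tables = [[3, 7, 12], [5], [9, 10]]
    flat = flatten_block_tables(tables, block_bucket_step=16)
    print(len(flat), flat)  # 16 entries: 6 real blocks + 10 padding blocks
```

The default bucket step of 128 mirrors the "Change default block bucket step for decode to 128" commit below; everything else is a simplified stand-in for the runner's actual bucketing logic.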

kzawora-intel and others added 30 commits May 8, 2024 14:39
* Fix setup.py for HPU

* Fix vllm._C import ops -> vllm.hpu import ops

* more of the same thing

* re-add hpex rmsnorm and rope; but rope is crashing

* remove unnecessary comments

* add vllm/hpu files

* add hpu autodetection

* Add HabanaAttention stub

* revert accidental changes

* revert non-habana backend attention changes

* add habana attention/worker/executor, sampling fails now

* Restore unnecessarily changed files

* enable HabanaMemoryProfiler

* Make sampler pass

* restore habana fused rope

* prefill is now working!!!

* fix prefill padding; decode is now working!!!!!

* revert accidental changes

* remove unused stuff in habana_paged_attn.py

* remove diagnostic stuff from llm_engine.py

* use HabanaExecutorAsync in async_llm_engine.py

* add habana copyright headers to habana_*.py files

* fix prefill attention conformance

* minor naming fixes

* remove naive attention from habana_attn (it never worked anyway)

* re-enable profile run

* Add fake HPUGraph support

* add more metrics

* indentation fix

* ~~recipe cache metrics don't work lalalala~~

* i'm done with metrics for now

* fix corner case in which hl-smi is not available but synapse is

* FIXME: temporary setup.py workaround

* WIP: add tensor parallelism stubs

* habana worker cleanup

* tensor parallelism is now working

* remove unused files

* remove unused func

* add hpugraphrunner

* improve hpu layernorm

* Port pipelined PA

* Port context length bucketing

* remove cudagraphrunner from hpu runner

* restore HPUGraphRunner back from FakeHPUGraphRunner

* handle rotary embeddings properly on gaudi3

* oopsie! captured_block_counts was incorrect!

* captured_block_counts.append doesn't do anything

* Restore habana_main KV cache memory layout

* fix memory profiler

* overhaul hpugraph capture

* memory profiling overhaul

* format memory properly in model warmup

* add graph compilation profiler for graph capture phase

* roll back log level on graph capture message

* Remove unnecessary view on residual connection in RMSNorm (#25)

---------

Co-authored-by: madamczykhabana <[email protected]>
WA: Disable cumsum in HPU _prepare_prompt
* Bucketing/Warmup WIP

* Cleanup

* Revert "Fix model_output_idx on HPU (#27)"

This reverts commit 90dfa92.

* Rework selected_token_indices fix to also work with block_size padding

* Simple prompt attention POC

* Remove cumsum

* MQA/GQA support for simple prompt_attention

* Cleanup

* Fix typo

* Restore profiling runs
* Fix HPU auto-detection in setup.py

* Update setup.py
* add gaudi installation readme

* readme writeup

* Create README_GAUDI.md

* Update README.md

* Update README_GAUDI.md

* Update README.md

* Update readmes
* Fix mixtral hidden states layout to fit into habana model runner

* Add static moe op to mixtral

* Add mark_step to static_fused_moe

* Update __init__.py

* Fix code indentation

* Make code compatible with non HPU devices

* Move static_fused_moe to vllm.hpu.ops

* Update mixtral.py

* Move op import from forward to top of the file

* Remove circular import
* Use setuptools older than 70.0.0

* Delete pyproject.toml

---------

Co-authored-by: Konrad Zawora <[email protected]>
* Trimmed metadata - part 1

* [WIP] HPU graphs for decode

* [WIP] Graph allocation algorithm reworked

* Cleanup

* Add graph memory estimations

* Fix multinode synchronization

* Create attn_bias inside HPU graph

* Cleanup after rebase

* Increase default VLLM_GRAPH_RESERVED_MEM to 0.3

* Remove obsolete class

* Tweak default HPU graph parameters
madamczykhabana and others added 28 commits July 2, 2024 15:18
It causes OOM on 70b
* Cleanup AttentionMetadata on HPU

* Flat PA - POC

* Decode warmup overhaul

* Debugging OOM

* Experimental profiling

* Fix input_hash calculation

* Block bucket size 32 -> 16

* Improve host time

* Skip UTs

* Add GQA/MQA

* Add mask instead of filling

* 2d block mapping

* Optional flipping in PA

* Runner updated for 2d block mapping

* Restore mark_step

* Eliminate physical transposes

* Disable warmup_mode

* Revert changes to test_attention.py

* POC: build block_bias on device

* Cleanup

* Fix seq_len calculation

* Experimental profiling

* Add missing call to kv_matmul_op

* Fix block_usage calculation

* Change default block bucket step for decode to 128

* Fix max decode block bucket calculation

* Fix block_usage calculations

* Cleanup

* Cleanup profiler code

* Print values for bucketing vars

* Pass block size to HpuModelAdapter

---------

Co-authored-by: barak goldberg <[email protected]>
* Disable tokenizer

* Update protocol.py

* Update serving_completion.py

* Detect value of skip_tokenizer_init cmd arg

* support skipping tokenizer for streaming scenario

* remove debug print

---------

Co-authored-by: Michał Kuligowski <[email protected]>
* Disable tokenizer

* Update protocol.py

* Update serving_completion.py

* Detect value of skip_tokenizer_init cmd arg

* support skipping tokenizer for streaming scenario

* remove debug print

* Suppress None EOS token warning

---------

Co-authored-by: Michał Kuligowski <[email protected]>

github-actions bot commented Sep 3, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small but essential subset of CI tests to catch errors quickly. You can run other CI tests on top of the default ones by unblocking the steps in your fastcheck build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀
