Iboiko/flatpa blocksnumber #8108
Conversation
* Fix setup.py for HPU
* Fix vllm._C import ops -> vllm.hpu import ops
* more of the same thing
* re-add hpex rmsnorm and rope; but rope is crashing
* remove unnecessary comments
* add vllm/hpu files
* add hpu autodetection
* Add HabanaAttention stub
* revert accidental changes
* revert non-habana backend attention changes
* add habana attention/worker/executor, sampling fails now
* Restore unnecessarily changed files
* enable HabanaMemoryProfiler
* Make sampler pass
* restore habana fused rope
* prefill is now working!!!
* fix prefill padding; decode is now working!!!!!
* revert accidental changes
* remove unused stuff in habana_paged_attn.py
* remove diagnostic stuff from llm_engine.py
* use HabanaExecutorAsync in async_llm_engine.py
* add habana copyright headers to habana_*.py files
* fix prefill attention conformance
* minor naming fixes
* remove naive attention from habana_attn (it never worked anyway)
* re-enable profile run
* Add fake HPUGraph support
* add more metrics
* indentation fix
* ~~recipe cache metrics don't work lalalala~~
* i'm done with metrics for now
* fix corner case in which hl-smi is not available but synapse is
* FIXME: temporary setup.py workaround
* WIP: add tensor parallelism stubs
* habana worker cleanup
* tensor parallelism is now working
* remove unused files
* remove unused func
* add hpugraphrunner
* improve hpu layernorm
* Port pipelined PA
* Port context length bucketing
* remove cudagraphrunner from hpu runner
* restore HPUGraphRunner back from FakeHPUGraphRunner
* handle rotary embeddings properly on gaudi3
* oopsie! captured_block_counts was incorrect!
* captured_block_counts.append doesn't do anything
* Restore habana_main KV cache memory layout
* fix memory profiler
* overhaul hpugraph capture
* memory profiling overhaul
* format memory properly in model warmup
* add graph compilation profiler for graph capture phase
* roll back log lvl on graph capture message
* Remove unnecessary view on residual connection in RMSNorm (#25)

Co-authored-by: madamczykhabana <[email protected]>
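The last commit in this series drops an unnecessary view on the residual connection in RMSNorm. As a rough illustration only, here is a minimal fused-add RMSNorm that operates on the tensors directly; the function name and shapes are hypothetical and this is not the actual vLLM kernel:

```python
import torch


def rms_norm_with_residual(x: torch.Tensor,
                           residual: torch.Tensor,
                           weight: torch.Tensor,
                           eps: float = 1e-6):
    """Add the residual, then apply RMSNorm to the sum.

    Working directly on (batch, seq, hidden) tensors avoids an extra
    .view() on the residual; shapes and names are illustrative only.
    """
    # Fused residual add: the updated residual is returned for the next layer.
    residual = residual + x
    # RMS statistic over the hidden dimension.
    variance = residual.pow(2).mean(dim=-1, keepdim=True)
    normed = residual * torch.rsqrt(variance + eps)
    return normed * weight, residual


if __name__ == "__main__":
    hidden = 16
    x = torch.randn(2, 4, hidden)
    res = torch.randn(2, 4, hidden)
    w = torch.ones(hidden)
    out, new_res = rms_norm_with_residual(x, res, w)
    print(out.shape, new_res.shape)
```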
Rebase habana_main up to cc466a3
WA: Disable cumsum in HPU _prepare_prompt
* Bucketing/Warmup WIP
* Cleanup
* Revert "Fix model_output_idx on HPU (#27)". This reverts commit 90dfa92.
* Rework selected_token_indices fix to also work with block_size padding
* Simple prompt attention POC
* Remove cumsum
* MQA/GQA support for simple prompt_attention
* Cleanup
* Fix typo
* Restore profiling runs
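The "Simple prompt attention POC" with MQA/GQA support can be pictured as expanding the shared key/value heads to match the query heads and then running plain causal attention. A minimal sketch under assumed shapes (not the actual vllm.hpu ops):

```python
import torch
import torch.nn.functional as F


def simple_prompt_attention(q, k, v, scale):
    """Naive prompt (prefill) attention with MQA/GQA support.

    q: (batch, q_heads, seq, head_dim)
    k, v: (batch, kv_heads, seq, head_dim) with q_heads % kv_heads == 0.
    Shapes and the repeat-based GQA expansion are illustrative only.
    """
    q_heads, kv_heads = q.shape[1], k.shape[1]
    if kv_heads != q_heads:
        # GQA/MQA: every group of query heads shares one KV head.
        group = q_heads // kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale
    # Causal mask so each token only attends to itself and earlier tokens.
    seq = q.shape[-2]
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.matmul(F.softmax(scores, dim=-1), v)


if __name__ == "__main__":
    b, qh, kvh, s, d = 1, 8, 2, 5, 16
    out = simple_prompt_attention(torch.randn(b, qh, s, d),
                                  torch.randn(b, kvh, s, d),
                                  torch.randn(b, kvh, s, d),
                                  scale=d ** -0.5)
    print(out.shape)  # (1, 8, 5, 16)
```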
* Fix HPU auto-detection in setup.py
* Update setup.py
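For context, HPU auto-detection in a setup script usually amounts to probing for the Habana PyTorch bridge. A minimal sketch of that idea; the VLLM_TARGET_DEVICE override and the exact logic are assumptions, not the actual setup.py code:

```python
import importlib.util
import os


def is_hpu_available() -> bool:
    """Heuristic HPU detection for a build script.

    Treat the build as an HPU build when the Habana PyTorch bridge is
    importable or an explicit override is set. This mirrors the idea of
    the setup.py auto-detection, not its exact implementation.
    """
    if os.environ.get("VLLM_TARGET_DEVICE", "") == "hpu":  # assumed override
        return True
    return importlib.util.find_spec("habana_frameworks") is not None


if __name__ == "__main__":
    print("Building for HPU" if is_hpu_available() else "Building for GPU/CPU")
```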
* add gaudi installation readme * readme writeup * Create README_GAUDI.md * Update README.md * Update README_GAUDI.md * Update README.md * Update readmes
* Fix mixtral hidden states layout to fit into habana model runner
* Add static moe op to mixtral
* Add mark_step to static_fused_moe
* Update __init__.py
* Fix code indentation
* Make code compatible with non HPU devices
* Move static_fused_moe to vllm.hpu.ops
* Update mixtral.py
* Move op import from forward to top of the file
* Remove circular import
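The static_fused_moe approach keeps tensor shapes fixed for HPU graphs by running every expert over every token and zeroing the routing weights outside each token's top-k. A simplified sketch with hypothetical names; the mark_step calls and the real vllm.hpu.ops kernel are omitted:

```python
import torch
import torch.nn.functional as F


def static_fused_moe(hidden, w_gate, w_up, w_down, router_logits, top_k):
    """Static-shape mixture-of-experts: every expert processes every token.

    hidden:        (tokens, hidden_dim)
    w_gate/w_up:   (experts, inter_dim, hidden_dim)
    w_down:        (experts, hidden_dim, inter_dim)
    Routing weights outside the per-token top-k are zeroed, so the result
    matches sparse routing while all shapes stay fixed (graph friendly).
    """
    probs = F.softmax(router_logits, dim=-1)                 # (tokens, experts)
    topk_vals, topk_idx = probs.topk(top_k, dim=-1)
    routing = torch.zeros_like(probs)
    routing.scatter_(-1, topk_idx, topk_vals)
    routing = routing / routing.sum(dim=-1, keepdim=True)    # renormalize top-k

    out = torch.zeros_like(hidden)
    for e in range(w_gate.shape[0]):
        # SwiGLU-style expert MLP; every token goes through every expert.
        x = F.silu(hidden @ w_gate[e].t()) * (hidden @ w_up[e].t())
        out = out + routing[:, e:e + 1] * (x @ w_down[e].t())
    return out


if __name__ == "__main__":
    t, h, i, e = 4, 32, 64, 8
    y = static_fused_moe(torch.randn(t, h),
                         torch.randn(e, i, h), torch.randn(e, i, h),
                         torch.randn(e, h, i), torch.randn(t, e), top_k=2)
    print(y.shape)  # (4, 32)
```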
* Use setuptools older than 70.0.0
* Delete pyproject.toml

Co-authored-by: Konrad Zawora <[email protected]>
* Trimmed metadata - part 1
* [WIP] HPU graphs for decode
* [WIP] Graph allocation algorithm reworked
* Cleanup
* Add graph memory estimations
* Fix multinode synchronization
* Create attn_bias inside HPU graph
* Cleanup after rebase
* Increase default VLLM_GRAPH_RESERVED_MEM to 0.3
* Remove obsolete class
* Tweak default HPU graph parameters
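The graph memory estimation reduces to reserving a fraction of the memory left after the profile run for captured HPU graphs, with VLLM_GRAPH_RESERVED_MEM defaulting to 0.3 as in the commit above. A back-of-the-envelope sketch of that budgeting; the formula is an illustration of the idea, not the exact implementation:

```python
import os


def graph_memory_budget(free_device_mem_bytes: int) -> int:
    """Return how many bytes may be spent on captured HPU graphs.

    A fraction of the memory left after weights and KV cache is reserved
    for graph capture; the 0.3 default matches the commit above, but this
    simplified formula is illustrative only.
    """
    reserved_frac = float(os.environ.get("VLLM_GRAPH_RESERVED_MEM", "0.3"))
    return int(free_device_mem_bytes * reserved_frac)


if __name__ == "__main__":
    free = 20 * 1024**3  # e.g. 20 GiB left after the profile run
    budget = graph_memory_budget(free)
    print(f"Graph capture budget: {budget / 1024**3:.1f} GiB")
```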
It causes OOM on 70b
Co-authored-by: Krzysztof Laskowski <[email protected]>
* Cleanup AttentionMetadata on HPU
* Flat PA - POC
* Decode warmup overhaul
* Debugging OOM
* Experimental profiling
* Fix input_hash calculation
* Block bucket size 32 -> 16
* Improve host time
* Skip UTs
* Add GQA/MQA
* Add mask instead of filling
* 2d block mapping
* Optional flipping in PA
* Runner updated for 2d block mapping
* Restore mark_step
* Eliminate physical transposes
* Disable warmup_mode
* Revert changes to test_attention.py
* POC: build block_bias on device
* Cleanup
* Fix seq_len calculation
* Experimental profiling
* Add missing call to kv_matmul_op
* Fix block_usage calculation
* Change default block bucket step for decode to 128
* Fix max decode block bucket calculation
* Fix block_usage calculations
* Cleanup
* Cleanup profiler code
* Print values for bucketing vars
* Pass block size to HpuModelAdapter

Co-authored-by: barak goldberg <[email protected]>
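Flat PA with a 2D block mapping flattens all KV-cache blocks referenced by the batch into one list and tracks, per flat block, which sequence it belongs to and how many of its slots are valid (block_usage). A toy sketch of that bookkeeping under assumed names and layout; it is not the actual habana_main tensors or kernel:

```python
from typing import List, Tuple

import torch


def flat_pa_metadata(block_tables: List[List[int]],
                     context_lens: List[int],
                     block_size: int) -> Tuple[torch.Tensor, ...]:
    """Build flattened metadata for a flat paged-attention decode step.

    block_tables: per-sequence physical block ids.
    context_lens: number of valid tokens per sequence.
    Returns (block_list, block_mapping, block_usage):
      block_list    - all blocks of the batch in one flat 1D tensor,
      block_mapping - which sequence each flat block belongs to,
      block_usage   - how many tokens of each flat block are valid.
    """
    block_list, block_mapping, block_usage = [], [], []
    for seq_idx, (table, ctx_len) in enumerate(zip(block_tables, context_lens)):
        num_blocks = (ctx_len + block_size - 1) // block_size
        for blk_idx in range(num_blocks):
            block_list.append(table[blk_idx])
            block_mapping.append(seq_idx)
            # The last block of a sequence may be only partially filled.
            used = min(block_size, ctx_len - blk_idx * block_size)
            block_usage.append(used)
    return (torch.tensor(block_list),
            torch.tensor(block_mapping),
            torch.tensor(block_usage))


if __name__ == "__main__":
    blocks, mapping, usage = flat_pa_metadata(
        block_tables=[[7, 3], [5]], context_lens=[20, 9], block_size=16)
    print(blocks.tolist())   # [7, 3, 5]
    print(mapping.tolist())  # [0, 0, 1]
    print(usage.tolist())    # [16, 4, 9]
```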
* Disable tokenizer
* Update protocol.py
* Update serving_completion.py
* Detect value of skip_tokenizer_init cmd arg
* support skipping tokenizer for streaming scenario
* remove debug print

Co-authored-by: Michał Kuligowski <[email protected]>
Co-authored-by: Krzysztof Laskowski <[email protected]>
* Disable tokenizer
* Update protocol.py
* Update serving_completion.py
* Detect value of skip_tokenizer_init cmd arg
* support skipping tokenizer for streaming scenario
* remove debug print
* Suppress None EOS token warning

Co-authored-by: Michał Kuligowski <[email protected]>
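Skipping tokenizer initialization can be exercised from the offline API roughly as follows; this is a minimal sketch assuming the skip_tokenizer_init engine argument, token-id prompts, and the detokenize sampling flag available in recent vLLM versions (the model name and token ids are placeholders). Without a tokenizer there is no EOS token to report, which is presumably what the "Suppress None EOS token warning" commit addresses.

```python
from vllm import LLM, SamplingParams

# Skip loading the tokenizer entirely; the engine then expects prompts to
# arrive as token ids and will not detokenize outputs. Model name, token ids
# and the exact input format are illustrative and may differ between versions.
llm = LLM(model="meta-llama/Llama-2-7b-hf", skip_tokenizer_init=True)

params = SamplingParams(max_tokens=16, detokenize=False)
outputs = llm.generate({"prompt_token_ids": [1, 3087, 372, 263]}, params)

for out in outputs:
    # Only token ids are available; text stays empty without a tokenizer.
    print(out.outputs[0].token_ids)
```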
Fix block number calculation for Flat PA by adding an empty table block (HabanaAI#158)
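One way to read this fix (a hypothetical illustration only, not the actual patch): when the per-sequence block tables are flattened for Flat PA, a sequence with an empty table should still contribute a padding entry so the total block count and the chosen bucket stay consistent. A sketch of that idea, with pad_block_id and the helper name invented for illustration:

```python
def flat_pa_num_blocks(block_tables, pad_block_id=0):
    """Count blocks for Flat PA bucketing, padding empty block tables.

    Hypothetical illustration of the 'empty table block' idea in the PR
    title, not the actual patch; pad_block_id is an assumed dummy id.
    """
    total = 0
    padded_tables = []
    for table in block_tables:
        if not table:
            # Give empty tables one dummy block so they are still counted
            # and the flattened layout never ends up with zero entries.
            table = [pad_block_id]
        padded_tables.append(table)
        total += len(table)
    return total, padded_tables


if __name__ == "__main__":
    total, tables = flat_pa_num_blocks([[7, 3], [], [5]])
    print(total, tables)  # 4 [[7, 3], [0], [5]]
```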