Use flat block layout for PA #92

Merged: 36 commits merged into habana_next from flat_pa on Jul 10, 2024

Conversation

madamczykhabana

No description provided.

@madamczykhabana marked this pull request as ready for review July 9, 2024 15:12
@madamczykhabana changed the title from "[WIP] flat PA" to "Use flat block layout for PA" on Jul 10, 2024
@madamczykhabana merged commit 81a23a7 into habana_next Jul 10, 2024
@madamczykhabana deleted the flat_pa branch July 19, 2024 15:42
ssarkar2 pushed a commit that referenced this pull request Aug 12, 2024
* Cleanup AttentionMetadata on HPU

* Flat PA - POC

* Decode warmup overhaul

* Fix input_hash calculation

* Block bucket size 32 -> 16

* Improve host time

* Skip UTs

* Add GQA/MQA

* Add mask instead of filling

* 2d block mapping

* Optional flipping in PA

* Runner updated for 2d block mapping

* Eliminate physical transposes

* POC: build block_bias on device

* Cleanup

* Fix seq_len calculation

* Experimental profiling

* Add missing call to kv_matmul_op

* Fix block_usage calculation

* Change default block bucket step for decode to 128

* Fix max decode block bucket calculation

* Fix block_usage calculations

* Cleanup

* Print values for bucketing vars

* Pass block size to HpuModelAdapter

---------

Co-authored-by: barak goldberg <[email protected]>
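
The pull request has no description, but the squashed commit titles above outline the approach: decode-time KV-cache blocks are handled as one flat list, a block mapping ties each block back to the sequence that owns it (the commits mention a 2D block mapping; the sketch below uses a simpler 1D index), and padding is handled by adding a bias built from block_usage ("Add mask instead of filling") rather than by filling the cache. The sketch is an illustration only, not code from this PR: the tensor names (block_mapping, block_usage, block_bias) follow the commit titles, while the function name, shapes, and eager-PyTorch math are assumptions made for this example.

```python
# Illustrative only: a toy, eager-PyTorch version of decode attention over a
# flat block layout. Shapes, names, and the function itself are assumptions.
import torch


def flat_pa_decode(query, key_cache, value_cache, block_mapping, block_usage):
    """One decode step of paged attention over a flat list of KV-cache blocks.

    query:         [num_seqs, num_heads, head_dim]   (one new token per sequence)
    key_cache:     [num_blocks, block_size, num_heads, head_dim]
    value_cache:   [num_blocks, block_size, num_heads, head_dim]
    block_mapping: [num_blocks] int64, index of the sequence that owns each block
    block_usage:   [num_blocks] int64, number of valid tokens in each block
    """
    num_seqs, num_heads, head_dim = query.shape
    num_blocks, block_size = key_cache.shape[:2]
    scale = head_dim ** -0.5

    # Flat layout: every block of every sequence lives in one leading dimension,
    # so each sequence's query is broadcast onto its blocks via block_mapping.
    q_per_block = query[block_mapping]                                # [B, H, D]
    scores = torch.einsum('bhd,bshd->bhs', q_per_block, key_cache) * scale

    # "Mask instead of filling": positions past block_usage get a -inf bias
    # instead of the cache being padded with dummy tokens.
    positions = torch.arange(block_size)
    block_bias = torch.zeros(num_blocks, block_size)
    block_bias.masked_fill_(positions[None, :] >= block_usage[:, None], float('-inf'))
    scores = scores + block_bias[:, None, :]                          # [B, H, S]

    # The softmax has to span all blocks of a sequence, so reduce per block
    # first and then combine per sequence through the block mapping.
    idx = block_mapping[:, None].expand(-1, num_heads)                # [B, H]
    seq_max = torch.full((num_seqs, num_heads), float('-inf'))
    seq_max = seq_max.scatter_reduce(0, idx, scores.amax(dim=-1), reduce='amax')
    exp_scores = (scores - seq_max[block_mapping][:, :, None]).exp()

    seq_denom = torch.zeros(num_seqs, num_heads)
    seq_denom.index_add_(0, block_mapping, exp_scores.sum(dim=-1))

    block_out = torch.einsum('bhs,bshd->bhd', exp_scores, value_cache)
    out = torch.zeros(num_seqs, num_heads, head_dim)
    out.index_add_(0, block_mapping, block_out)
    return out / seq_denom[..., None]


if __name__ == "__main__":
    # Two sequences sharing a flat pool of three blocks of size four:
    # sequence 0 owns blocks 0 and 1 (4 + 2 valid tokens), sequence 1 owns block 2.
    torch.manual_seed(0)
    q = torch.randn(2, 4, 8)
    k = torch.randn(3, 4, 4, 8)
    v = torch.randn(3, 4, 4, 8)
    print(flat_pa_decode(q, k, v, torch.tensor([0, 0, 1]), torch.tensor([4, 2, 3])).shape)
```

Several commits also tune block bucketing ("Block bucket size 32 -> 16", "Change default block bucket step for decode to 128"). The helper below is likewise hypothetical and only shows the rounding idea: the flat block count is padded up to a fixed grid of sizes so that only a small set of shapes ever reaches warmup and graph compilation; the real parameter names and defaults in the PR may differ.

```python
import math


def round_up_to_block_bucket(num_blocks: int, step: int = 128, minimum: int = 128) -> int:
    # Hypothetical helper: pad the flat block count up to the next bucket
    # boundary so that warmed-up graph shapes can be reused across batches.
    return max(minimum, math.ceil(num_blocks / step) * step)


# 1..128 blocks all map to 128, 129..256 map to 256, and so on.
assert round_up_to_block_bucket(130) == 256
```
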
skaulintel pushed a commit that referenced this pull request Aug 20, 2024
@kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label Nov 8, 2024