
Graphs v2 #44

Merged · 11 commits merged into HabanaAI:habana_main on Jun 4, 2024
Conversation

madamczykhabana

Rework of HPU graphs. Now the flow looks like this:

  • VLLM_GRAPH_RESERVED_MEM determines how much of the free memory remaining after weight loading is used for HPU graphs (30% by default)
  • we allocate blocks according to gpu-memory-utilization
  • we warm up all shapes without HPU graphs
  • we calculate the remaining free memory and split it between prompt and decode graphs according to VLLM_GRAPH_PROMPT_RATIO (50%) and VLLM_GRAPH_MEM_MARGIN (5%); see the sketch after this list
  • we capture prompt graphs in the order defined by bs*seq_len, stopping when, according to heuristics, the next graph won't fit
  • we do the same for decode graphs
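
A minimal sketch of the budgeting and capture steps above. The helper names (`free_mem_after_warmup`, `estimate_graph_mem`, `capture_graph`) are hypothetical stand-ins, not the actual implementation in the HPU model runner:

```python
import os

def split_graph_memory(free_mem_after_warmup: int) -> tuple[int, int]:
    """Split the free memory left after warmup between prompt and decode graphs."""
    prompt_ratio = float(os.environ.get("VLLM_GRAPH_PROMPT_RATIO", "0.5"))
    mem_margin = float(os.environ.get("VLLM_GRAPH_MEM_MARGIN", "0.05"))
    # Keep a safety margin so graph capture never exhausts device memory.
    usable = free_mem_after_warmup * (1 - mem_margin)
    prompt_budget = int(usable * prompt_ratio)
    decode_budget = int(usable) - prompt_budget
    return prompt_budget, decode_budget

def capture_in_order(buckets, budget, estimate_graph_mem, capture_graph):
    """Capture graphs for (bs, seq_len) buckets, smallest bs*seq_len first,
    stopping when the heuristic estimate says the next graph won't fit."""
    used = 0
    for bs, seq_len in sorted(buckets, key=lambda b: b[0] * b[1]):
        estimated = estimate_graph_mem(bs, seq_len)  # heuristic cost estimate
        if used + estimated > budget:
            break
        capture_graph(bs, seq_len)
        used += estimated
```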

Other important changes:

  • selecting token IDs has been moved inside HPU graphs to limit memory usage
  • calculating attention_mask has been moved inside HPU graphs, but before model.forward, to limit memory usage
  • *AttentionMetadata objects are trimmed before going into HPU graphs for better control over parameters and to avoid recompilations due to changing constants; see the sketch below
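
A minimal sketch of the trimming idea, assuming an illustrative field set — the fields and the `trim_attn_metadata` helper below are hypothetical, not the actual *AttentionMetadata definition:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TrimmedAttentionMetadata:
    # Only the tensors the graphed region actually reads are kept; host-side
    # constants that change between calls are dropped so they cannot force
    # recompilation of the captured graph.
    block_tables: Any
    seq_lens_tensor: Any
    attn_bias: Any

def trim_attn_metadata(metadata: Any) -> TrimmedAttentionMetadata:
    # Hypothetical helper: copy over only the graph-visible fields.
    return TrimmedAttentionMetadata(
        block_tables=metadata.block_tables,
        seq_lens_tensor=metadata.seq_lens_tensor,
        attn_bias=metadata.attn_bias,
    )
```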

madamczykhabana (Author) commented:

@kzawora-intel Please review

kzawora-intel left a comment:

looks very good, thanks!

kzawora-intel merged commit b3617ee into HabanaAI:habana_main on Jun 4, 2024.
adobrzyniewicz-habana pushed a commit that referenced this pull request on Jun 25, 2024:
* Trimmed metadata - part 1

* [WIP] HPU graphs for decode

* [WIP] Graph allocation algorithm reworked

* Cleanup

* Add graph memory estimations

* Fix multinode synchronization

* Create attn_bias inside HPU graph

* Cleanup after rebase

* Increase default VLLM_GRAPH_RESERVED_MEM to 0.3

* Remove obsolete class

* Tweak default HPU graph parameters
kzawora-intel added the "habana" label (Issues or PRs submitted by Habana Labs) on Nov 8, 2024.