
Graphs v2 #44

Merged · 11 commits merged into HabanaAI:habana_main on Jun 4, 2024
Conversation

madamczykhabana

Rework of HPU graphs. Now the flow looks like this:

  • VLLM_GRAPH_RESERVED_MEM determines how much of the free memory remaining after weight loading is used for HPU graphs (30% by default)
  • we allocate blocks according to gpu-memory-utilization
  • we warm up all shapes without HPU graphs
  • we calculate the remaining free memory and split it between prompt and decode graphs according to VLLM_GRAPH_PROMPT_RATIO (50%) and VLLM_GRAPH_MEM_MARGIN (5%); see the sketch after this list
  • we capture prompt graphs in the order defined by bs*seq_len, stopping when, according to heuristics, the next graph won't fit
  • we do the same for decode graphs
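
A minimal sketch of the budgeting and capture steps above. The helper names (`free_mem_after_warmup`, `estimate_graph_mem`, `capture_graph`) are hypothetical stand-ins, not the actual implementation in the HPU model runner:

```python
import os

def split_graph_memory(free_mem_after_warmup: int) -> tuple[int, int]:
    """Split the free memory left after warmup between prompt and decode graphs."""
    prompt_ratio = float(os.environ.get("VLLM_GRAPH_PROMPT_RATIO", "0.5"))
    mem_margin = float(os.environ.get("VLLM_GRAPH_MEM_MARGIN", "0.05"))
    # Keep a safety margin so graph capture never exhausts device memory.
    usable = free_mem_after_warmup * (1 - mem_margin)
    prompt_budget = int(usable * prompt_ratio)
    decode_budget = int(usable) - prompt_budget
    return prompt_budget, decode_budget

def capture_in_order(buckets, budget, estimate_graph_mem, capture_graph):
    """Capture graphs for (bs, seq_len) buckets, smallest bs*seq_len first,
    stopping when the heuristic estimate says the next graph won't fit."""
    used = 0
    for bs, seq_len in sorted(buckets, key=lambda b: b[0] * b[1]):
        estimated = estimate_graph_mem(bs, seq_len)  # heuristic cost estimate
        if used + estimated > budget:
            break
        capture_graph(bs, seq_len)
        used += estimated
```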

Other important changes:

  • selecting token IDs has been moved inside HPU graphs to limit memory usage
  • calculating attention_mask has been moved inside HPU graphs, but before model.forward, to limit memory usage
  • *AttentionMetadata objects are trimmed before going into HPU graphs for better control over parameters and to avoid recompilations due to changing constants; see the sketch below
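
A minimal sketch of the trimming idea, assuming an illustrative field set — the fields and the `trim_attn_metadata` helper below are hypothetical, not the actual *AttentionMetadata definition:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TrimmedAttentionMetadata:
    # Only the tensors the graphed region actually reads are kept; host-side
    # constants that change between calls are dropped so they cannot force
    # recompilation of the captured graph.
    block_tables: Any
    seq_lens_tensor: Any
    attn_bias: Any

def trim_attn_metadata(metadata: Any) -> TrimmedAttentionMetadata:
    # Hypothetical helper: copy over only the graph-visible fields.
    return TrimmedAttentionMetadata(
        block_tables=metadata.block_tables,
        seq_lens_tensor=metadata.seq_lens_tensor,
        attn_bias=metadata.attn_bias,
    )
```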

madamczykhabana (Author) commented:

@kzawora-intel Please review

kzawora-intel left a comment:

looks very good, thanks!

kzawora-intel merged commit b3617ee into HabanaAI:habana_main on Jun 4, 2024.
adobrzyniewicz-habana pushed a commit that referenced this pull request on Jun 25, 2024:
* Trimmed metadata - part 1

* [WIP] HPU graphs for decode

* [WIP] Graph allocation algorithm reworked

* Cleanup

* Add graph memory estimations

* Fix multinode synchronization

* Create attn_bias inside HPU graph

* Cleanup after rebase

* Increase default VLLM_GRAPH_RESERVED_MEM to 0.3

* Remove obsolete class

* Tweak default HPU graph parameters
kzawora-intel added the "habana" label (Issues or PRs submitted by Habana Labs) on Nov 8, 2024.