Fix logic for determining the number of cache blocks #98
Conversation
Signed-off-by: Thomas Parnell <[email protected]>
Looks good, but a few comments before approval. Also, do we have an image available so we can try this out?
```python
nt_cache_block_ratio = cache_block_size / block_size / memory_scaling_model.next_token_params[1]
total_num_gpu_blocks = int(nt_cache_block_ratio * memory_scaling_model.free_memory // cache_block_size)
# we may need to increase the safety margin a bit to ensure that prefill forward does not run OOM
recommend_safety_margin = 5 + int(100*(1.0 - (1.0 - nt_cache_block_ratio)/(1.0 - pf_cache_block_ratio)))
```
Can we have a comment around this line explaining what is being done?
I will add something
I added some explanation now. This approach isn't ideal, and might affect the maximum throughput we can achieve with the server. However, I can't see any other way to ensure robustness without re-implementing the batching logic to interact with the KVCacheManager.
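To make the formula in the diff concrete, here is a small worked example with made-up numbers. All values below are hypothetical (this assumes `next_token_params[1]` is the learned per-token byte cost during the next-token phase, and similarly for the prefill coefficient); nothing here is taken from a real model.

```python
# All numbers below are hypothetical, chosen only to illustrate the formula.
MiB = 1024 * 1024

cache_block_size = 16 * MiB     # bytes per KV cache block
block_size = 16                 # tokens per block -> 1 MiB of KV cache per token
nt_coeff = 2 * MiB              # assumed learned next-token cost, bytes per token
pf_coeff = 4 * MiB              # assumed learned prefill cost, bytes per token
free_memory = 8 * 1024 * MiB    # 8 GiB free after loading weights + speculator

# Fraction of per-token memory that is KV cache, in each phase
nt_cache_block_ratio = cache_block_size / block_size / nt_coeff   # 0.5
pf_cache_block_ratio = cache_block_size / block_size / pf_coeff   # 0.25

# Set aside that fraction of free memory for cache blocks
total_num_gpu_blocks = int(nt_cache_block_ratio * free_memory // cache_block_size)

# Safety margin recommendation, as in the quoted diff
recommend_safety_margin = 5 + int(
    100 * (1.0 - (1.0 - nt_cache_block_ratio) / (1.0 - pf_cache_block_ratio))
)

print(total_num_gpu_blocks)     # 256
print(recommend_safety_margin)  # 38
```

With these numbers, KV cache accounts for half of per-token memory during decode but only a quarter during prefill, so the recommended margin grows to cover the gap.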
```python
SPECULATOR_NAME = os.getenv("SPECULATOR_NAME", None)

# speculator revision
SPECULATOR_REVISION = os.getenv("SPECULATOR_REVISION", None)
```
What is this used for when loading?
Looks like it's analogous to `MODEL_REVISION`: the specific commit hash of the model to load. Like one of these, I think: https://huggingface.co/ibm/granite-7b-lab-accelerator/commits/main
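For context, a sketch of how such a revision typically gets used when loading. The `load_speculator` wiring below is illustrative, not the repo's actual code; `revision` is a standard keyword of Hugging Face `from_pretrained` loaders, where `None` means the tip of the default branch.

```python
import os

# Pin both the speculator name and the exact commit to load,
# analogous to MODEL_NAME / MODEL_REVISION for the base model.
SPECULATOR_NAME = os.getenv("SPECULATOR_NAME", None)
SPECULATOR_REVISION = os.getenv("SPECULATOR_REVISION", None)  # e.g. a commit hash

def load_speculator(from_pretrained):
    # `from_pretrained` is any HF-style loader accepting a `revision` kwarg
    # (hypothetical wiring, for illustration only).
    return from_pretrained(SPECULATOR_NAME, revision=SPECULATOR_REVISION)
```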
```python
)
except:
    # if something goes wrong during forward, we still need to set the sequence ids
    batch.sequence_ids = cache_data.sequence_ids
```
Would it make sense to just move this from the bottom of the method to above the call to `self.model(...)`?
I think `cache_data.sequence_ids` only gets populated within the call to `self.model(...)`, so we can't move it beforehand.
This feels a bit fragile. I wonder if it would be better to revert to the prior state (if possible) when the call to `self.model` fails? Ideally within that call, i.e. avoid partial success.
I agree that it's fragile, and there might be a better way to address it from within the function. Not sure whether to prioritize that at this stage though.
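One pattern in the direction the review discusses is `try`/`finally`: the sequence ids get assigned whether or not the forward pass raises, without a bare `except:` swallowing the error. The classes and function below are a hypothetical, self-contained sketch, not the repo's code.

```python
# Minimal stand-ins for the real batch / cache-data objects (hypothetical).
class FakeCacheData:
    def __init__(self):
        self.sequence_ids = None

class FakeBatch:
    def __init__(self):
        self.sequence_ids = None

def forward_with_ids(batch, cache_data, model_fn):
    # `model_fn` stands in for self.model(...); it populates
    # cache_data.sequence_ids internally, possibly before raising.
    try:
        return model_fn(cache_data)
    finally:
        # Runs on success *and* on error, so the ids are never lost,
        # and any exception still propagates to the caller.
        batch.sequence_ids = cache_data.sequence_ids
```

This keeps the "ids must always be set" invariant local to one place while letting the caller decide how to handle the failure.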
Signed-off-by: Thomas Parnell <[email protected]>
lgtm
Signed-off-by: Nick Hill <[email protected]>
Motivation
When we deploy spec decoding in prod, we are frequently seeing the servers run out of free blocks. We have determined that this is due to two issues:
1. The constraint on `SPECULATOR_MAX_BATCH_SIZE` is not enough to avoid running into memory pressure due to speculation: we need to ensure that we do not speculate on batches that may have a small "size" but a very large weight.
2. The computation of the number of blocks is very wrong in most cases.
Modifications
1. I have introduced an additional constraint that says we should only speculate on batches with weight up to 75% of the weight limit. This should ensure that we never speculate when we are close to the memory limits.
2. I have written new code to calculate the number of KV cache blocks. This calculation uses the memory scaling coefficients that we have learned at startup. In particular, it uses the learned coefficients to figure out what percentage of the memory capacity needs to be set aside for cache blocks.
3. In the above calculation, I use the next-token coefficient rather than the prefill coefficient, since during the next-token phase the KV cache blocks typically comprise a relatively large percentage of total memory consumption, and we need to be able to handle this worst case. However, this means that during prefill steps we may not have enough memory left over to store the auxiliary data structures we need for a forward pass. There isn't really a clean way to handle this other than re-writing the router logic to be block-aware, but what we can do is recommend that the user increase the batch safety margin to a level that ensures prefills will not run OOM. I've added a print statement to provide this guidance.
4. I now load the speculator before learning the memory scaling model, since we also need to take it into account when measuring the amount of free memory.
Result
These changes, together with setting `BATCH_SAFETY_MARGIN=35`, seem to result in robust behaviour for both `llama3-8b` and `granite-20b`. We no longer need to manually set the number of KV cache blocks in the latter case.
Related Issues
n/a
---------
Signed-off-by: Thomas Parnell <[email protected]>