[pull] main from IBM:main #85

Merged 2 commits into opendatahub-io:main on Jun 18, 2024

Conversation

pull[bot] commented May 31, 2024

See Commits and Changes for more details.



tjohnson31415 and others added 2 commits May 30, 2024 12:01
The launch of `fit_memory_scaling_model` uses the values for `quantize`
and `dtype_str`, so those should be validated and defaulted before it is
launched.

Before this change, if `dtype_str` was set to `None`, it would be passed
to `fit_memory_scaling_model` as `None`, resulting in an error:
```
Shard 1: Process SpawnProcess-33:
Shard 1: Traceback (most recent call last):
Shard 1:   File "/opt/tgis/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
Shard 1:     self.run()
Shard 1:   File "/opt/tgis/lib/python3.11/multiprocessing/process.py", line 108, in run
Shard 1:     self._target(*self._args, **self._kwargs)
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/utils/paged.py", line 37, in fit_memory_scaling_model
Shard 1:     model = get_model(
Shard 1:             ^^^^^^^^^^
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 39, in get_model
Shard 1:     dtype = get_torch_dtype(dtype_str)
Shard 1:             ^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/utils/dist.py", line 64, in get_torch_dtype
Shard 1:     dt = getattr(torch, dtype_str, None)
Shard 1:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 1: TypeError: attribute name must be string, not 'NoneType'
```

After this change, a value will always be set before calling
`fit_memory_scaling_model`.
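
A minimal sketch of the defaulting described above; the helper name `resolve_dtype_str` and the `"float16"` default are illustrative assumptions, not the actual implementation in the server code:

```python
# Hypothetical sketch: resolve_dtype_str and the "float16" default are
# illustrative; the real validation/defaulting happens in the server code
# before fit_memory_scaling_model is launched in the shard process.
import torch


def resolve_dtype_str(dtype_str: str | None) -> str:
    """Return a concrete dtype string so that None never reaches the shard process."""
    if dtype_str is None:
        dtype_str = "float16"  # assumed default
    # Validate eagerly, mirroring the getattr lookup seen in the traceback above.
    if getattr(torch, dtype_str, None) is None:
        raise ValueError(f"unknown torch dtype: {dtype_str}")
    return dtype_str
```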

Signed-off-by: Travis Johnson <[email protected]>
#### Motivation

When we deploy speculative decoding in production, we frequently see the
servers running out of free blocks. We have determined that this is due
to two issues:
1. The constraint on `SPECULATOR_MAX_BATCH_SIZE` is not enough to avoid
running into memory pressure due to speculation: we need to be able to
ensure that we do not speculate on batches that may have a small "size"
but a very large weight.
2. The computation of the number of KV cache blocks is badly wrong in most cases.

#### Modifications

1. I have introduced an additional constraint: we only speculate on
batches whose weight is at most 75% of the weight limit. This should
ensure that we never speculate when we are close to the memory limits.
2. I have written new code to calculate the number of KV cache blocks.
This calculation uses the memory scaling coefficients that we learn at
startup; in particular, it uses the learned coefficients to figure out
what fraction of the memory capacity needs to be set aside for cache
blocks (a sketch of both of these changes follows this list).
3. In the above calculation, I use the next-token coefficient rather
than the prefill coefficient, since during the next-token phase the KV
cache blocks typically comprise a relatively large percentage of the
total memory consumption, and we need to be able to handle this worst
case. However, this means that during prefill steps we may not have
enough memory left over to store the auxiliary data structures needed
for a forward pass. There isn't a clean way to handle this short of
rewriting the router logic to be block-aware, but we can recommend that
the user increase the batch safety margin to a level that ensures
prefills will not run OOM. I've added a print statement to provide this
guidance.
4. I now load the speculator before learning the memory scaling model,
since we also need to take it into account when measuring the amount
of free memory.
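
The sketch below illustrates the two calculations described above; the names (`weight_limit`, `next_token_coeff`, `kv_cache_coeff`) and the exact formula are assumptions for illustration, not the code added in this PR:

```python
# Illustrative only: names and formula are assumptions, not the PR's implementation.

def should_speculate(batch_weight: float, weight_limit: float, frac: float = 0.75) -> bool:
    # Point 1: only speculate when the batch weight is comfortably below the limit.
    return batch_weight <= frac * weight_limit


def num_kv_cache_blocks(
    free_memory_bytes: float,
    next_token_coeff: float,  # learned per-token memory cost of a next-token step
    kv_cache_coeff: float,    # learned per-token memory cost of the KV cache itself
    bytes_per_block: float,
) -> int:
    # Points 2 and 3: derive the fraction of capacity to set aside for cache
    # blocks from the learned next-token coefficients, then convert to blocks.
    cache_fraction = kv_cache_coeff / (kv_cache_coeff + next_token_coeff)
    return int(free_memory_bytes * cache_fraction // bytes_per_block)
```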

#### Result

These changes, together with setting `BATCH_SAFETY_MARGIN=35`, seem to
result in robust behaviour for both `llama3-8b` and `granite-20b`. We
no longer need to manually set the number of KV cache blocks in the
latter case.

#### Related Issues

n/a

---------

Signed-off-by: Thomas Parnell <[email protected]>

openshift-ci bot commented May 31, 2024

Hi @pull[bot]. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@heyselbi

/ok-to-test

@heyselbi

/lgtm
/approve


openshift-ci bot commented Jun 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: heyselbi, pull[bot]

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-bot merged commit 43623db into opendatahub-io:main on Jun 18, 2024
3 checks passed
openshift-merge-bot pushed a commit to red-hat-data-services/text-generation-inference that referenced this pull request on Aug 13, 2024

#### Motivation

The `Calico` models set the MLP and attention bias to true, whereas
those biases were hard-coded to false in the flash and paged Llama
implementations. This change uses the config params added in
huggingface/transformers#30031 to set those values properly.

#### Modifications

- added `attention_bias` and `mlp_bias` to the config for the Flash and Paged Llama
implementations (default is False)
- set the bias in the attention and MLP layers to the config value (a minimal sketch follows this list)
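
A minimal sketch of the bias wiring, assuming config attributes named as in the upstream transformers Llama config; the actual tensor-parallel layer classes in this repo differ:

```python
# Sketch only: plain nn.Linear stands in for the repo's tensor-parallel linear layers.
import torch.nn as nn


def build_attention_and_mlp_projections(config, hidden_size: int, intermediate_size: int):
    # Read the bias flags from the model config, defaulting to False when absent.
    attention_bias = getattr(config, "attention_bias", False)
    mlp_bias = getattr(config, "mlp_bias", False)

    qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=attention_bias)
    gate_proj = nn.Linear(hidden_size, intermediate_size, bias=mlp_bias)
    down_proj = nn.Linear(intermediate_size, hidden_size, bias=mlp_bias)
    return qkv_proj, gate_proj, down_proj
```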

#### Result

Models that contain attention and MLP bias should now load properly.

---------

Signed-off-by: Joshua Rosenkranz <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
vaibhavjainwiz pushed a commit that referenced this pull request Aug 29, 2024