[pull] main from IBM:main #85
Conversation
The launch of `fit_memory_scaling_model` uses the values of `quantize` and `dtype_str`, so those should be validated and defaulted before it is run. Before this change, if `dtype_str` was set to `None` it would be passed to `fit_memory_scaling_model` as `None`, resulting in an error:

```
Shard 1: Process SpawnProcess-33:
Shard 1: Traceback (most recent call last):
Shard 1:   File "/opt/tgis/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
Shard 1:     self.run()
Shard 1:   File "/opt/tgis/lib/python3.11/multiprocessing/process.py", line 108, in run
Shard 1:     self._target(*self._args, **self._kwargs)
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/utils/paged.py", line 37, in fit_memory_scaling_model
Shard 1:     model = get_model(
Shard 1:             ^^^^^^^^^^
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 39, in get_model
Shard 1:     dtype = get_torch_dtype(dtype_str)
Shard 1:             ^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/utils/dist.py", line 64, in get_torch_dtype
Shard 1:     dt = getattr(torch, dtype_str, None)
Shard 1:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 1: TypeError: attribute name must be string, not 'NoneType'
```

After this change, a value will always be set before calling `fit_memory_scaling_model`.

Signed-off-by: Travis Johnson <[email protected]>
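A minimal sketch of the kind of defaulting and validation described above; the helper name and the `float16` fallback are illustrative assumptions, not the exact change in the PR:

```python
import torch


def resolve_dtype_str(dtype_str: str | None) -> str:
    """Default and validate dtype_str before it is handed to
    fit_memory_scaling_model (illustrative sketch only)."""
    if dtype_str is None:
        # Assumption: fall back to float16, a common default for GPU inference.
        dtype_str = "float16"
    if not hasattr(torch, dtype_str) or not isinstance(getattr(torch, dtype_str), torch.dtype):
        raise ValueError(f"unsupported dtype_str: {dtype_str}")
    return dtype_str
```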
#### Motivation

When we deploy speculative decoding in production, we frequently see the servers running out of free blocks. We have determined that this is due to two issues:
1. The constraint on `SPECULATOR_MAX_BATCH_SIZE` is not enough to avoid running into memory pressure due to speculation; we need to be able to ensure that we do not speculate on batches that may have a small "size" but a very large weight.
2. The computation of the number of blocks is very wrong in most cases.

#### Modifications

1. I have introduced an additional constraint that says we should only speculate on batches with weight up to 75% of the weight limit. This should ensure that we never speculate when we are close to the memory limits.
2. I have written new code to calculate the number of KV cache blocks. This calculation uses the memory scaling coefficients that we learn at startup; in particular, it uses the learned coefficients to figure out what percentage of the memory capacity needs to be set aside for cache blocks (see the sketch following this description).
3. In the above calculation, I use the next-token coefficient rather than the prefill coefficient, since during the next-token phase the KV cache blocks typically make up a relatively large percentage of the total memory consumption and we need to be able to handle this worst case. However, this means that during prefill steps we may not have enough memory left over to store the auxiliary data structures we need for a forward pass. There isn't really a clean way to handle this other than rewriting the router logic to be block-aware, but what we can do is recommend that the user increase the batch safety margin to a level that ensures prefills will not run OOM. I've added a print statement to provide this guidance.
4. I now load the speculator before learning the memory scaling model, since we also need to take it into account when measuring the amount of free memory.

#### Result

These changes, together with setting `BATCH_SAFETY_MARGIN=35`, seem to result in robust behaviour for both `llama3-8b` and `granite-20b`. We no longer need to manually set the number of KV cache blocks in the latter case.

#### Related Issues

n/a

---------

Signed-off-by: Thomas Parnell <[email protected]>
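The two constraints above can be pictured with a small sketch. All names, the wiring of the 0.75 fraction, and the block-count formula below are illustrative assumptions based on the description, not the actual implementation:

```python
SPECULATION_WEIGHT_FRACTION = 0.75  # assumption: only speculate while batch weight <= 75% of the limit


def should_speculate(batch_size: int, batch_weight: int,
                     max_batch_size: int, weight_limit: int) -> bool:
    """The batch must be both small enough and light enough to speculate on."""
    return (batch_size <= max_batch_size
            and batch_weight <= SPECULATION_WEIGHT_FRACTION * weight_limit)


def estimate_kv_cache_blocks(free_memory_bytes: int,
                             next_token_coefficient: float,
                             block_size_tokens: int,
                             cache_bytes_per_token: int) -> int:
    """Rough estimate of how many KV cache blocks to allocate.

    next_token_coefficient is taken here to be the learned bytes of non-cache
    memory consumed per unit of batch weight during the next-token phase, so
    the fraction of memory set aside for cache blocks follows from the ratio
    of cache bytes to total bytes needed per token.
    """
    cache_fraction = cache_bytes_per_token / (cache_bytes_per_token + next_token_coefficient)
    bytes_per_block = cache_bytes_per_token * block_size_tokens
    return int(free_memory_bytes * cache_fraction) // bytes_per_block
```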
Hi @pull[bot]. Thanks for your PR. I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: heyselbi, pull[bot]

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
#### Motivation

The `Calico` models currently set the mlp and attention bias to true, which was hard-coded to false in the flash and paged llama implementations. This will use the config params added in huggingface/transformers#30031 to set those values properly.

#### Modifications

- added `attention_bias` and `mlp_bias` to the config for the Flash and Paged Llama implementations (default is False); a sketch follows this description
- set the bias in attention and mlp to the config value

#### Result

Models containing attention and mlp bias should now load properly.

---------

Signed-off-by: Joshua Rosenkranz <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
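A minimal sketch of the pattern described above, using illustrative stub classes and field names rather than the actual TGIS module layout:

```python
import torch
from torch import nn


class LlamaConfigStub:
    """Stand-in for the model config; the real config comes from transformers
    and gains attention_bias / mlp_bias in huggingface/transformers#30031."""
    def __init__(self, hidden_size=4096, intermediate_size=11008,
                 attention_bias=False, mlp_bias=False):
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        # Default to False so existing bias-free Llama checkpoints are unaffected.
        self.attention_bias = attention_bias
        self.mlp_bias = mlp_bias


class LlamaAttentionProjStub(nn.Module):
    def __init__(self, config: LlamaConfigStub):
        super().__init__()
        # Bias follows the config instead of being hard-coded to False.
        self.qkv_proj = nn.Linear(config.hidden_size, 3 * config.hidden_size,
                                  bias=config.attention_bias)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size,
                                bias=config.attention_bias)


class LlamaMLPStub(nn.Module):
    def __init__(self, config: LlamaConfigStub):
        super().__init__()
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=config.mlp_bias)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=config.mlp_bias)
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=config.mlp_bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))
```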
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )