[pull] main from IBM:main #85

Merged 2 commits into opendatahub-io:main on Jun 18, 2024

Conversation

pull[bot] commented May 31, 2024

See Commits and Changes for more details.



tjohnson31415 and others added 2 commits May 30, 2024 12:01
The launch of `fit_memory_scaling_model` uses the values for `quantize`
and `dtype_str`, so those should be validated and defaulted before it is
launched.

Before this change, if `dtype_str` was set to `None`, it would be passed
to `fit_memory_scaling_model` as `None`, resulting in an error:
```
Shard 1: Process SpawnProcess-33:
Shard 1: Traceback (most recent call last):
Shard 1:   File "/opt/tgis/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
Shard 1:     self.run()
Shard 1:   File "/opt/tgis/lib/python3.11/multiprocessing/process.py", line 108, in run
Shard 1:     self._target(*self._args, **self._kwargs)
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/utils/paged.py", line 37, in fit_memory_scaling_model
Shard 1:     model = get_model(
Shard 1:             ^^^^^^^^^^
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 39, in get_model
Shard 1:     dtype = get_torch_dtype(dtype_str)
Shard 1:             ^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/utils/dist.py", line 64, in get_torch_dtype
Shard 1:     dt = getattr(torch, dtype_str, None)
Shard 1:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 1: TypeError: attribute name must be string, not 'NoneType'
```

After this change, a value will always be set before calling
`fit_memory_scaling_model`.
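
A minimal sketch of the defaulting described above; the helper name `resolve_dtype_str` and the `"float16"` default are illustrative assumptions, not the actual implementation in the server code:

```python
# Hypothetical sketch: resolve_dtype_str and the "float16" default are
# illustrative; the real validation/defaulting happens in the server code
# before fit_memory_scaling_model is launched in the shard process.
import torch


def resolve_dtype_str(dtype_str: str | None) -> str:
    """Return a concrete dtype string so that None never reaches the shard process."""
    if dtype_str is None:
        dtype_str = "float16"  # assumed default
    # Validate eagerly, mirroring the getattr lookup seen in the traceback above.
    if getattr(torch, dtype_str, None) is None:
        raise ValueError(f"unknown torch dtype: {dtype_str}")
    return dtype_str
```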

Signed-off-by: Travis Johnson <[email protected]>
#### Motivation

When we deploy speculative decoding in production, we frequently see the
servers running out of free blocks. We have determined that this is due
to two issues:
1. The constraint on `SPECULATOR_MAX_BATCH_SIZE` is not enough to avoid
running into memory pressure due to speculation: we need to be able to
ensure that we do not speculate on batches that may have a small "size"
but a very large weight.
2. The computation of the number of KV cache blocks is badly wrong in most cases.

#### Modifications

1. I have introduced an additional constraint: we only speculate on
batches whose weight is at most 75% of the weight limit. This should
ensure that we never speculate when we are close to the memory limits.
2. I have written new code to calculate the number of KV cache blocks.
This calculation uses the memory scaling coefficients that we learn at
startup; in particular, it uses the learned coefficients to figure out
what fraction of the memory capacity needs to be set aside for cache
blocks (a sketch of both of these changes follows this list).
3. In the above calculation, I use the next-token coefficient rather
than the prefill coefficient, since during the next-token phase the KV
cache blocks typically comprise a relatively large percentage of the
total memory consumption, and we need to be able to handle this worst
case. However, this means that during prefill steps we may not have
enough memory left over to store the auxiliary data structures needed
for a forward pass. There isn't a clean way to handle this short of
rewriting the router logic to be block-aware, but we can recommend that
the user increase the batch safety margin to a level that ensures
prefills will not run OOM. I've added a print statement to provide this
guidance.
4. I now load the speculator before learning the memory scaling model,
since we also need to take it into account when measuring the amount
of free memory.
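
The sketch below illustrates the two calculations described above; the names (`weight_limit`, `next_token_coeff`, `kv_cache_coeff`) and the exact formula are assumptions for illustration, not the code added in this PR:

```python
# Illustrative only: names and formula are assumptions, not the PR's implementation.

def should_speculate(batch_weight: float, weight_limit: float, frac: float = 0.75) -> bool:
    # Point 1: only speculate when the batch weight is comfortably below the limit.
    return batch_weight <= frac * weight_limit


def num_kv_cache_blocks(
    free_memory_bytes: float,
    next_token_coeff: float,  # learned per-token memory cost of a next-token step
    kv_cache_coeff: float,    # learned per-token memory cost of the KV cache itself
    bytes_per_block: float,
) -> int:
    # Points 2 and 3: derive the fraction of capacity to set aside for cache
    # blocks from the learned next-token coefficients, then convert to blocks.
    cache_fraction = kv_cache_coeff / (kv_cache_coeff + next_token_coeff)
    return int(free_memory_bytes * cache_fraction // bytes_per_block)
```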

#### Result

These changes, together with setting `BATCH_SAFETY_MARGIN=35`, seem to
result in robust behaviour for both `llama3-8b` and `granite-20b`. We
no longer need to manually set the number of KV cache blocks in the
latter case.

#### Related Issues

n/a

---------

Signed-off-by: Thomas Parnell <[email protected]>

openshift-ci bot commented May 31, 2024

Hi @pull[bot]. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@heyselbi

/ok-to-test

@heyselbi

/lgtm
/approve


openshift-ci bot commented Jun 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: heyselbi, pull[bot]

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-bot merged commit 43623db into opendatahub-io:main on Jun 18, 2024
3 checks passed
openshift-merge-bot pushed a commit to red-hat-data-services/text-generation-inference that referenced this pull request on Aug 13, 2024

#### Motivation

The `Calico` models set the MLP and attention bias to true, whereas
those biases were hard-coded to false in the flash and paged Llama
implementations. This change uses the config params added in
huggingface/transformers#30031 to set those values properly.

#### Modifications

- added `attention_bias` and `mlp_bias` to the config for the Flash and Paged Llama
implementations (default is False)
- set the bias in the attention and MLP layers to the config value (a minimal sketch follows this list)
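
A minimal sketch of the bias wiring, assuming config attributes named as in the upstream transformers Llama config; the actual tensor-parallel layer classes in this repo differ:

```python
# Sketch only: plain nn.Linear stands in for the repo's tensor-parallel linear layers.
import torch.nn as nn


def build_attention_and_mlp_projections(config, hidden_size: int, intermediate_size: int):
    # Read the bias flags from the model config, defaulting to False when absent.
    attention_bias = getattr(config, "attention_bias", False)
    mlp_bias = getattr(config, "mlp_bias", False)

    qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=attention_bias)
    gate_proj = nn.Linear(hidden_size, intermediate_size, bias=mlp_bias)
    down_proj = nn.Linear(intermediate_size, hidden_size, bias=mlp_bias)
    return qkv_proj, gate_proj, down_proj
```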

#### Result

Models that contain attention and MLP bias should now load properly.

---------

Signed-off-by: Joshua Rosenkranz <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
vaibhavjainwiz pushed a commit that referenced this pull request Aug 29, 2024