Sync main and release branches #58
Conversation
This handles the OOM problem with large prefixes by:
- Taking the max prefix cache size into account when running the memory usage estimator, so that a full prefix cache cannot cause an OOM
- Taking the prefix length into consideration when deciding whether a request fits into a batch, to avoid large prefixes triggering unexpectedly large memory allocations (see the sketch after this message)

This includes an API-breaking change to the config: the prefix cache will not be enabled unless a user explicitly sets PREFIX_STORE_PATH to a non-empty value.

Signed-off-by: Joe Runde <[email protected]>
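A minimal sketch of the batch-fit check described above. All names here (`Request`, `fits_in_batch`, `memory_budget_bytes`, `bytes_per_token`) are hypothetical and only illustrate the idea of counting prefix tokens toward a request's memory footprint; they are not the actual TGIS implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    input_length: int    # tokens in the user prompt
    prefix_length: int   # tokens in the attached prompt prefix (0 if none)

def fits_in_batch(batch: List[Request], new: Request,
                  memory_budget_bytes: int, bytes_per_token: int) -> bool:
    """Count prefix tokens toward each request's footprint so a large
    prefix cannot push the batch past the memory budget."""
    total_tokens = sum(r.input_length + r.prefix_length for r in batch)
    total_tokens += new.input_length + new.prefix_length
    return total_tokens * bytes_per_token <= memory_budget_bytes
```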
Removing flash-attention v1 from the server-release image to speed up build times.

Modifications:
- remove the flash-att-v1 build stage from the Dockerfile
- remove server/Makefile-flash-att
- create a GitHub package for the cache image on the GitHub container registry
- push the full cache image to ghcr.io on push to main (PR merged)
- use the cache image from ghcr.io for PR builds
- replace the build stages/steps with a single `build-and-push` action
- temporarily build dropout_layer_norm and rotary_emb from flash-attention v2

Signed-off-by: Christian Kadner <[email protected]>
Every successful push to main produces new build cache images on ghcr.io, which currently have to be cleaned up manually. Add actions/delete-package-versions after each successful push of a new cache image (implicit).

Signed-off-by: Christian Kadner <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
A previous OOM fix mistakenly left model.word_embeddings unset whenever the prompt cache was disabled, causing inference without a prompt cache to fail. Our tests always set up a prompt cache, so they did not catch this case.

Fix: always invoke self._setup_prompt_encoder() in model.py again (see the sketch after this message).

Result: inference without a prompt cache works.

Signed-off-by: Joe Runde <[email protected]>
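A simplified sketch of the fix, assuming a `Model` class shaped roughly like the one in model.py; the helper names below are illustrative, not the real code.

```python
class Model:
    def __init__(self, config, prompt_cache_enabled: bool = False):
        self.prompt_cache_enabled = prompt_cache_enabled
        self.word_embeddings = None
        # Before the fix, this call was skipped when the prompt cache was
        # disabled, leaving word_embeddings unset and breaking inference.
        # The fix is to always run the prompt-encoder setup.
        self._setup_prompt_encoder()

    def _setup_prompt_encoder(self):
        # Illustrative: load the embedding layer used to embed prompt tokens
        # (and prompt-cache prefixes when a cache is configured).
        self.word_embeddings = self._load_embedding_layer()

    def _load_embedding_layer(self):
        # Placeholder standing in for loading the real model weights.
        return object()
```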
[pull] main from IBM:main
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: heyselbi

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment
/override Build
@heyselbi: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/override build
@heyselbi: Overrode contexts on behalf of heyselbi: build In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/override branch-ci-opendatahub-io-text-generation-inference-main-fast-image-mirror branch-ci-opendatahub-io-text-generation-inference-main-images ci/prow/images ci/prow/pr-image-mirror
@heyselbi: Overrode contexts on behalf of heyselbi: branch-ci-opendatahub-io-text-generation-inference-main-fast-image-mirror, branch-ci-opendatahub-io-text-generation-inference-main-images, ci/prow/images, ci/prow/pr-image-mirror In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
#### Motivation

The ibm_fms engine is being too strict about the "variant" selection. The FMS get_model call requires the variant, but also supports overriding configuration parameters via kwargs.

#### Modifications

Allow support for additional FMS Calico models that may differ from the default variant config, such as in the size of the vocab (see the sketch after this message).

Signed-off-by: Travis Johnson <[email protected]>
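A rough sketch of the call pattern this change enables. The `get_model` signature, the architecture and variant names, and the `src_vocab_size` field are assumptions about the ibm-fms API, used here for illustration only.

```python
from fms.models import get_model

# Start from the closest default variant, then override the config fields
# that differ for this particular Calico checkpoint (e.g. a larger vocab).
model = get_model(
    architecture="llama",       # illustrative architecture name
    variant="7b",               # closest default variant
    src_vocab_size=49152,       # hypothetical override; depends on the model
)
```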
Motivation
[Describe why this change is needed]
Modifications
[Describe the code changes]
Result
[Describe how the changes affects existing behavior and how to test it]
Related Issues
[Resolves #123]