forked from IBM/text-generation-inference
Sync main and release branches #39
Merged
Conversation
Signed-off-by: Travis Johnson <[email protected]>
and ninja 1.11.1.1
Resolves CVE-2023-40217. Also updates the UBI version.
Addresses CVE-2023-45803. Also fixes a break due to the removed tests dir in newer miniconda.
When using the download-weights CLI command and specifying a single extension. This is used for slow tokenizers, which can subsequently be converted to fast tokenizers.
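As an illustration of the slow-to-fast conversion mentioned here, a minimal sketch using the standard transformers API; the model path is a placeholder, not something referenced in this PR.

```python
# Hedged sketch of converting downloaded slow-tokenizer files to a fast tokenizer.
# "path/to/model" is a placeholder for a directory populated by download-weights.
from transformers import AutoTokenizer

# Load the slow (Python/SentencePiece) tokenizer from the downloaded files.
slow = AutoTokenizer.from_pretrained("path/to/model", use_fast=False)

# Requesting a fast tokenizer triggers conversion to a Rust-backed tokenizer
# when no tokenizer.json is present yet.
fast = AutoTokenizer.from_pretrained("path/to/model", use_fast=True)

# Persist the converted tokenizer (writes tokenizer.json) next to the weights.
fast.save_pretrained("path/to/model")
```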
This runs a series of tests to ensure consistency of output when the same input is included in a (padded) batch, as well as when batches are modified via pruning and concatenation operations while requests are in progress.
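For illustration only, a hedged sketch of this kind of consistency check using plain transformers and a tiny public test model (not the tests added by this PR): the next-token logits for a prompt should match whether it is run alone or inside a left-padded batch.

```python
# Hedged sketch of a padded-batch consistency check; the real tests also cover
# pruning and concatenation of in-flight batches, which are not shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "sshleifer/tiny-gpt2"  # tiny placeholder model, not referenced in this PR
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token
tok.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(name).eval()

single = tok("hello world", return_tensors="pt")
batch = tok(
    ["hello world", "a much longer second prompt that forces padding"],
    return_tensors="pt", padding=True,
)
# Derive position ids from the attention mask so left padding does not shift
# the positions of the real tokens.
position_ids = (batch["attention_mask"].cumsum(-1) - 1).clamp(min=0)

with torch.no_grad():
    logits_single = model(**single).logits[0, -1]
    logits_batch = model(**batch, position_ids=position_ids).logits[0, -1]

# The same prompt should produce (numerically) the same next-token distribution.
assert torch.allclose(logits_single, logits_batch, atol=1e-4)
```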
Adapted from corresponding changes to HF TGI (pre license-change).
Co-authored-by: Nicolas Patry <[email protected]>
Co-authored-by: Jamie Yang <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Addresses vuln WS-2023-0366
Reported in Twistlock scan. CVE: GHSA-v8gr-m533-ghj9
Co-authored-by: Nick Hill <[email protected]>
This PR adds exllamav2 kernels. The added changes are adapted from two open source repositories:
- https://github.com/turboderp/exllamav2
- https://github.com/PanQiWei/AutoGPTQ
Co-authored-by: Nick Hill <[email protected]>
This pull request (mostly) ports the heterogeneous next token chooser, which is used for flash models in TGI, into Causal LM.
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Also updated patched transformers files with upstream updates
Inadvertently moved within the gptq-only block
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
From recent code observations
Signed-off-by: Travis Johnson <[email protected]>
Addresses pyarrow CVE
To avoid CPU-intensive tokenization on the async event loop. Determine the thread pool size based on the number of CPU cores and shard processes. Also validate stop sequence lengths based on the number of bytes rather than the number of tokens (the latter doesn't make sense since we don't do token-based matching). And add a couple of integration tests.
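A minimal sketch of the pattern described here, with placeholder names and limits; none of this is taken from the actual server code.

```python
# Hedged sketch: run CPU-heavy tokenization in a thread pool instead of on the
# event loop, sizing the pool from the CPU count and number of shard processes,
# and validate stop sequences by byte length rather than token count.
import asyncio
import os
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 1            # placeholder for the number of shard processes
MAX_STOP_SEQ_BYTES = 40   # placeholder limit, not taken from this PR

# Size the pool from the available cores divided across shard processes.
pool = ThreadPoolExecutor(max_workers=max(1, (os.cpu_count() or 1) // NUM_SHARDS))

def validate_stop_sequences(stop_sequences: list[str]) -> None:
    # Validate by bytes, since stop-sequence matching is done on decoded text,
    # not on token ids.
    for seq in stop_sequences:
        if len(seq.encode("utf-8")) > MAX_STOP_SEQ_BYTES:
            raise ValueError(f"stop sequence too long: {seq!r}")

async def tokenize_async(tokenizer, text: str):
    loop = asyncio.get_running_loop()
    # Offload the blocking tokenizer call so the event loop stays responsive.
    return await loop.run_in_executor(pool, tokenizer.encode, text)
```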
Inadvertently broken by new dtype positional arg added to Batch.from_pb()
It's unnecessary here
Since the decoding vectorization changes, the pad tokens are also passed in to the repetition penalty processor, which incorrectly penalizes the EOS token in the case where the pad token id is equal to the EOS token id. This bug was found when testing with the `EleutherAI/gpt-neox-20b` model in TGIS. Having pad token id == eos token id does not seem to be that common, but it is also the fallback if the pad token cannot be found another way. There's also a small optimization in this PR: pass a view over all_input_ids_tensor into `next_token_chooser` to avoid processing all of the pre-allocated output slots that hold the pad token.
Signed-off-by: Travis Johnson <[email protected]>
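To illustrate the failure mode, a hedged toy example (not the TGIS implementation): if padded positions reach a standard repetition penalty and the pad id equals the EOS id, the EOS logit gets penalized; slicing a view over only the real tokens avoids it.

```python
# Hedged illustration: with pad_token_id == eos_token_id, feeding padded
# positions to a standard repetition penalty penalizes EOS on every step,
# which can delay or prevent end-of-sequence.
import torch

def repetition_penalty(scores: torch.Tensor, input_ids: torch.Tensor, penalty: float):
    # Standard repetition penalty: divide positive scores of seen tokens by
    # `penalty`, multiply negative ones.
    seen = torch.gather(scores, 1, input_ids)
    seen = torch.where(seen < 0, seen * penalty, seen / penalty)
    return scores.scatter(1, input_ids, seen)

vocab, eos = 16, 5
scores = torch.zeros(1, vocab)
scores[0, eos] = 2.0

pad = eos
padded_ids = torch.tensor([[pad, pad, 3, 4]])   # includes pre-allocated pad slots
real_ids = padded_ids[:, 2:]                    # view over only the real tokens

with_pads = repetition_penalty(scores.clone(), padded_ids, penalty=2.0)
without_pads = repetition_penalty(scores.clone(), real_ids, penalty=2.0)

# EOS is penalized only when the pad slots are included.
print(with_pads[0, eos].item(), without_pads[0, eos].item())
```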
Signed-off-by: Vaibhav Jain <[email protected]>
Signed-off-by: jooho <[email protected]>
Signed-off-by: heyselbi <[email protected]>
Signed-off-by: Sean Pryor <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: heyselbi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Needs #40 to build successfully
deps: bump optimum to 1.16.1
openshift-merge-bot pushed a commit that referenced this pull request on Feb 29, 2024
This handles the OOM problem with large prefixes by both:
- Taking the max prefix cache size into account when running the memory usage estimator, to ensure a full prefix cache does not cause an OOM
- Taking the prefix length into consideration when deciding if a request will fit into a batch, to avoid large prefixes causing unexpected large memory allocations
This includes an API-breaking change to the config, as the prefix cache will not be enabled unless a user explicitly sets PREFIX_STORE_PATH to some non-empty value.
Signed-off-by: Joe Runde <[email protected]>
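A hedged sketch of the second idea (counting the prefix length when deciding whether a request fits); all names and the token-budget scheme are placeholders, not the actual router/server logic.

```python
# Hedged sketch: include the cached prefix length in a request's footprint so
# large prefixes cannot slip past the batch-admission check.
from dataclasses import dataclass

@dataclass
class Request:
    input_length: int     # tokens in the user prompt
    prefix_length: int    # tokens contributed by the cached prompt prefix
    max_new_tokens: int

def fits_in_batch(batch_tokens_in_use: int, req: Request, token_budget: int) -> bool:
    # Count the prefix tokens as part of the request's footprint, so a large
    # cached prefix cannot push the batch past the estimated memory budget.
    needed = req.prefix_length + req.input_length + req.max_new_tokens
    return batch_tokens_in_use + needed <= token_budget

# Example: a request with a 2000-token prefix no longer "looks" small.
print(fits_in_batch(3000, Request(input_length=20, prefix_length=2000,
                                  max_new_tokens=200), token_budget=4096))
```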
Description
Sync main and release to have the most up-to-date stable image.
How Has This Been Tested?
Merge criteria: