forked from IBM/text-generation-inference
Sync main and release branches #39
Merged
Conversation
Signed-off-by: Travis Johnson <[email protected]>
and ninja 1.11.1.1
Resolves CVE-2023-40217. Also updates the UBI version.
Addresses CVE-2023-45803. Also fixes a break due to the removed tests dir in newer miniconda.
When using the download-weights CLI command and specifying a single extension. This is used for slow tokenizers, which can subsequently be converted to fast tokenizers.
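As an illustration of the slow-to-fast conversion mentioned here, a minimal sketch using the standard transformers API; the model path is a placeholder, not something referenced in this PR.

```python
# Hedged sketch of converting downloaded slow-tokenizer files to a fast tokenizer.
# "path/to/model" is a placeholder for a directory populated by download-weights.
from transformers import AutoTokenizer

# Load the slow (Python/SentencePiece) tokenizer from the downloaded files.
slow = AutoTokenizer.from_pretrained("path/to/model", use_fast=False)

# Requesting a fast tokenizer triggers conversion to a Rust-backed tokenizer
# when no tokenizer.json is present yet.
fast = AutoTokenizer.from_pretrained("path/to/model", use_fast=True)

# Persist the converted tokenizer (writes tokenizer.json) next to the weights.
fast.save_pretrained("path/to/model")
```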
This runs a series of tests to ensure consistency of output when the same input is included in a (padded) batch, as well as when batches are modified via pruning and concatenation operations while requests are in progress.
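For illustration only, a hedged sketch of this kind of consistency check using plain transformers and a tiny public test model (not the tests added by this PR): the next-token logits for a prompt should match whether it is run alone or inside a left-padded batch.

```python
# Hedged sketch of a padded-batch consistency check; the real tests also cover
# pruning and concatenation of in-flight batches, which are not shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "sshleifer/tiny-gpt2"  # tiny placeholder model, not referenced in this PR
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token
tok.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(name).eval()

single = tok("hello world", return_tensors="pt")
batch = tok(
    ["hello world", "a much longer second prompt that forces padding"],
    return_tensors="pt", padding=True,
)
# Derive position ids from the attention mask so left padding does not shift
# the positions of the real tokens.
position_ids = (batch["attention_mask"].cumsum(-1) - 1).clamp(min=0)

with torch.no_grad():
    logits_single = model(**single).logits[0, -1]
    logits_batch = model(**batch, position_ids=position_ids).logits[0, -1]

# The same prompt should produce (numerically) the same next-token distribution.
assert torch.allclose(logits_single, logits_batch, atol=1e-4)
```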
Adapted from corresponding changes to HF TGI (pre license-change).
Co-authored-by: Nicolas Patry <[email protected]>
Co-authored-by: Jamie Yang <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Addresses vuln WS-2023-0366
Reported in Twistlock scan. CVE: GHSA-v8gr-m533-ghj9
Co-authored-by: Nick Hill <[email protected]>
This PR adds exllamav2 kernels. The added changes are adapted from two open source repositories:
- https://github.com/turboderp/exllamav2
- https://github.com/PanQiWei/AutoGPTQ
Co-authored-by: Nick Hill <[email protected]>
This pull request (mostly) ports the heterogeneous next token chooser, which is used for flash models in TGI, into Causal LM.
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Also updated patched transformers files with upstream updates
Inadvertently moved within the gptq-only block
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
From recent code observations
Signed-off-by: Travis Johnson <[email protected]>
Addresses pyarrow CVE
To avoid CPU-intensive tokenization on the async event loop. Determine the thread pool size based on the number of CPU cores and shard processes. Also validate stop sequence lengths based on the number of bytes rather than the number of tokens (the latter doesn't make sense since we don't do token-based matching). And add a couple of integration tests.
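A minimal sketch of the pattern described here, with placeholder names and limits; none of this is taken from the actual server code.

```python
# Hedged sketch: run CPU-heavy tokenization in a thread pool instead of on the
# event loop, sizing the pool from the CPU count and number of shard processes,
# and validate stop sequences by byte length rather than token count.
import asyncio
import os
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 1            # placeholder for the number of shard processes
MAX_STOP_SEQ_BYTES = 40   # placeholder limit, not taken from this PR

# Size the pool from the available cores divided across shard processes.
pool = ThreadPoolExecutor(max_workers=max(1, (os.cpu_count() or 1) // NUM_SHARDS))

def validate_stop_sequences(stop_sequences: list[str]) -> None:
    # Validate by bytes, since stop-sequence matching is done on decoded text,
    # not on token ids.
    for seq in stop_sequences:
        if len(seq.encode("utf-8")) > MAX_STOP_SEQ_BYTES:
            raise ValueError(f"stop sequence too long: {seq!r}")

async def tokenize_async(tokenizer, text: str):
    loop = asyncio.get_running_loop()
    # Offload the blocking tokenizer call so the event loop stays responsive.
    return await loop.run_in_executor(pool, tokenizer.encode, text)
```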
Inadvertently broken by new dtype positional arg added to Batch.from_pb()
It's unnecessary here
Since the decoding vectorization changes, the pad tokens are also passed in to the repetition penalty processor, which incorrectly penalizes the EOS token in the case where the pad token id is equal to the EOS token id. This bug was found when testing with the `EleutherAI/gpt-neox-20b` model in TGIS. Having pad token id == eos token id does not seem to be that common, but it is also the fallback if the pad token cannot be found another way. There's also a small optimization in this PR: pass a view over all_input_ids_tensor into `next_token_chooser` to avoid processing all of the pre-allocated output slots that hold the pad token.
Signed-off-by: Travis Johnson <[email protected]>
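To illustrate the failure mode, a hedged toy example (not the TGIS implementation): if padded positions reach a standard repetition penalty and the pad id equals the EOS id, the EOS logit gets penalized; slicing a view over only the real tokens avoids it.

```python
# Hedged illustration: with pad_token_id == eos_token_id, feeding padded
# positions to a standard repetition penalty penalizes EOS on every step,
# which can delay or prevent end-of-sequence.
import torch

def repetition_penalty(scores: torch.Tensor, input_ids: torch.Tensor, penalty: float):
    # Standard repetition penalty: divide positive scores of seen tokens by
    # `penalty`, multiply negative ones.
    seen = torch.gather(scores, 1, input_ids)
    seen = torch.where(seen < 0, seen * penalty, seen / penalty)
    return scores.scatter(1, input_ids, seen)

vocab, eos = 16, 5
scores = torch.zeros(1, vocab)
scores[0, eos] = 2.0

pad = eos
padded_ids = torch.tensor([[pad, pad, 3, 4]])   # includes pre-allocated pad slots
real_ids = padded_ids[:, 2:]                    # view over only the real tokens

with_pads = repetition_penalty(scores.clone(), padded_ids, penalty=2.0)
without_pads = repetition_penalty(scores.clone(), real_ids, penalty=2.0)

# EOS is penalized only when the pad slots are included.
print(with_pads[0, eos].item(), without_pads[0, eos].item())
```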
Signed-off-by: Vaibhav Jain <[email protected]>
Signed-off-by: jooho <[email protected]>
Signed-off-by: heyselbi <[email protected]>
Signed-off-by: Sean Pryor <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: heyselbi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Needs #40 to build successfully
deps: bump optimum to 1.16.1
openshift-merge-bot pushed a commit that referenced this pull request on Feb 29, 2024
This handles the OOM problem with large prefixes by both:
- Taking the max prefix cache size into account when running the memory usage estimator, to ensure a full prefix cache does not cause an OOM
- Taking the prefix length into consideration when deciding if a request will fit into a batch, to avoid large prefixes causing unexpected large memory allocations
This includes an API-breaking change to the config, as the prefix cache will not be enabled unless a user explicitly sets PREFIX_STORE_PATH to some non-empty value.
Signed-off-by: Joe Runde <[email protected]>
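A hedged sketch of the second idea (counting the prefix length when deciding whether a request fits); all names and the token-budget scheme are placeholders, not the actual router/server logic.

```python
# Hedged sketch: include the cached prefix length in a request's footprint so
# large prefixes cannot slip past the batch-admission check.
from dataclasses import dataclass

@dataclass
class Request:
    input_length: int     # tokens in the user prompt
    prefix_length: int    # tokens contributed by the cached prompt prefix
    max_new_tokens: int

def fits_in_batch(batch_tokens_in_use: int, req: Request, token_budget: int) -> bool:
    # Count the prefix tokens as part of the request's footprint, so a large
    # cached prefix cannot push the batch past the estimated memory budget.
    needed = req.prefix_length + req.input_length + req.max_new_tokens
    return batch_tokens_in_use + needed <= token_budget

# Example: a request with a 2000-token prefix no longer "looks" small.
print(fits_in_batch(3000, Request(input_length=20, prefix_length=2000,
                                  max_new_tokens=200), token_budget=4096))
```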
Description
Sync main and release to have the most up-to-date stable image.
How Has This Been Tested?
Merge criteria: