add cross-encoder tracing, config-generating, and uploading #375
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #375      +/-   ##
==========================================
- Coverage   91.53%   89.68%    -1.86%
==========================================
  Files          42       43        +1
  Lines        4395     4508      +113
==========================================
+ Hits         4023     4043       +20
- Misses        372      465       +93
```
☔ View full report in Codecov by Sentry.
Thanks for raising this PR. We need to add corresponding tests too. It seems like a few functions are mostly shared with the SentenceTransformerModel class.
Customers should have a way to specify the name of the generated zip file and the model file, like we have in our SentenceTransformerModel class.
```python
# save tokenizer file
tk_path = Path(f"/tmp/{mname}-tokenizer")
tk.save_pretrained(tk_path)
_fix_tokenizer(tk.model_max_length, tk_path)
```
`tk.model_max_length` --> not sure if we will get this value all the time. Please take a look at this PR: #219
OK, going with the strategy proposed in huggingface/transformers#14561: look for `max_position_embeddings` or `n_positions` in the config object, and if both are missing, set it to 32k and hope that's fine. (The fix using sentence_transformers' `model.get_max_seq_length()` also has the potential to return None.)
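A minimal sketch of that fallback strategy, assuming a standard transformers config object (this mirrors the approach described above, not the final implementation):

```python
from transformers import AutoConfig


def guess_max_length(hf_model_id: str) -> int:
    """Best-effort guess at a usable max sequence length (sketch only)."""
    config = AutoConfig.from_pretrained(hf_model_id)
    if getattr(config, "max_position_embeddings", None):
        return config.max_position_embeddings
    if getattr(config, "n_positions", None):
        return config.n_positions
    return 2**15  # no architectural bound found; fall back to 32768


# e.g. guess_max_length("BAAI/bge-reranker-base") -> 514
```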
Tests are failing.
"model_content_hash_value": hash_value, | ||
"model_config": { | ||
"model_type": model_type, | ||
"embedding_dimension": 1, |
Is this correct?
If you look here: https://huggingface.co/BAAI/bge-reranker-base/blob/main/config.json, `max_position_embeddings` = 514.
Yes, this is what I did. It's an artifact of the implementation (which depends heavily on the embedding model code). Can probably be cleaned up a bit.
```python
elif hasattr(model_config, "n_positions"):
    tk.model_max_length = model_config.n_positions
else:
    tk.model_max_length = 2**15  # =32768. Set to something big I guess
```
Setting an arbitrary value doesn't seem like a good solution. What do you think about following this: https://github.com/opensearch-project/opensearch-py-ml/blob/main/opensearch_py_ml/ml_models/sentencetransformermodel.py#L936-L942
Would love to. Unfortunately, `model.get_max_seq_length()` is not a method on most HF transformers `ModelForSequenceClassification` classes; it's a special thing from the sentence-transformers model interface, and I'm not sure it provides the guarantee it claims to: [implementation]
I'm seeing two problems here:

- `tk.model_max_length is None`: per [Bug] "`tokenizer.model_max_length` is different when loading model from shortcut or local path" (huggingface/transformers#14561), if you hit that bug, `model_max_length` will already be a large value, and we end up keeping a value the model can't process.
- I'm not a big fan of setting a static big value like `tk.model_max_length = 2**15`. What exactly are we achieving here?

It's important to align `model_max_length` with the model's actual capability (`max_position_embeddings`). Setting `model_max_length` to a higher value than `max_position_embeddings` could lead to errors or unexpected behavior, as the model won't be able to handle inputs longer than its architectural limit. Conversely, setting a lower `model_max_length` can be useful for optimizing performance or adhering to specific task requirements.
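A minimal sketch of what that alignment could look like (not the PR's code; it assumes a transformers-style config object and treats the huge sentinel value as "unset"):

```python
def align_max_length(tk, model_config) -> None:
    """Sketch: make tokenizer.model_max_length agree with the model's positional limit."""
    limit = getattr(model_config, "max_position_embeddings", None) or getattr(
        model_config, "n_positions", None
    )
    # transformers reports a huge sentinel (~1e30) for model_max_length when the
    # tokenizer config carries no real limit, so treat implausibly large values as unset.
    unset = tk.model_max_length is None or tk.model_max_length > int(1e12)
    if limit is not None and (unset or tk.model_max_length > limit):
        tk.model_max_length = limit
    elif unset:
        tk.model_max_length = 2**15  # no architectural bound found; fall back to 32768
```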
I don't understand these issues. `tk.model_max_length is None` is exactly the condition that triggers setting it to something reasonable. I agree that it's important to set it to a value that agrees with the model's capability: that's why the first condition here checks for and uses `max_position_embeddings`. We only pick the Very Large Number when we don't have a bound. For instance, suppose someone trained a Mamba-based cross-encoder with infinite context because it's an RNN.
Sorry for the confusion. What I meant is that `tk.model_max_length is None` will not even trigger. Without that triggering, `model_max_length` could be a large value that the model can't process.

```python
tokenizer = GPT2Tokenizer.from_pretrained("path/to/local/gpt2")
print(tokenizer.model_max_length)
# 1000000000000000019884624838656
```

So here `model_max_length` isn't None, right? But do we still want that?
Interesting. Can we just let huggingface/transformers fix this bug? It seems like their problem, and from what I can tell the only time we're going to hit it is if someone is trying to use a very old tokenizer file. At that point I hope we can assume the user is proficient enough with transformers to debug if necessary.
""" | ||
# export to onnx | ||
save_loc = Path(f"/tmp/{mname}.onnx") | ||
torch.onnx.export( |
I think we should add `onnx` to requirements.txt. In addition, can we also add it to requirements-dev.txt?
looks like it's there already?
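For context, since the export call is truncated in the diff above, a full `torch.onnx.export` for a sequence-classification cross-encoder typically looks roughly like this (the model id and input/output names are illustrative, not the PR's exact arguments):

```python
from pathlib import Path

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

hf_model_id = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # illustrative model id
tk = AutoTokenizer.from_pretrained(hf_model_id)
model = AutoModelForSequenceClassification.from_pretrained(hf_model_id)
model.eval()

# a query/passage pair, tokenized the way a cross-encoder expects
features = tk([["example query", "example passage"]], return_tensors="pt")
save_loc = Path("/tmp/example-cross-encoder.onnx")

torch.onnx.export(
    model,
    (features["input_ids"], features["attention_mask"], features["token_type_ids"]),
    str(save_loc),
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "token_type_ids": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size"},
    },
    opset_version=15,
)
```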
```python
if mname.startswith("bge"):
    features["token_type_ids"] = torch.zeros_like(features["input_ids"])

if framework == "pt":
```
I think we should accept "torch_script" as the framework parameter instead of `pt`. `pt` is a file extension, not a framework: we decided to save as .pt, but somebody else might decide on .pth or .ptc, and customers might not know the file format of the TorchScript artifact. So the framework choice should be `torch_script` or `onnx`.
yep
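A short sketch of the parameter validation being suggested here (names are illustrative, not the PR's API):

```python
VALID_FRAMEWORKS = ("torch_script", "onnx")


def _validate_framework(framework: str) -> str:
    """Normalize and validate the requested trace format (sketch only)."""
    framework = framework.lower().strip()
    if framework not in VALID_FRAMEWORKS:
        raise ValueError(
            f"framework must be one of {VALID_FRAMEWORKS}, got {framework!r}"
        )
    return framework
```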
"version": f"1.0.{version_number}", | ||
"description": description, | ||
"model_format": model_format, | ||
"function_name": "TEXT_SIMILARITY", |
We need `model_task_type`. Let's follow the SentenceTransformerModel example method.
```python
)
model_config_content = {
    "name": model_name,
    "version": f"1.0.{version_number}",
```
We don't need to hard-code the `1.0.` prefix.
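Putting these comments together, the generated config could end up looking roughly like this. Field values are placeholders, and `model_task_type` and `framework_type` are assumptions rather than confirmed field names:

```python
# Illustrative sketch of the config content after the review changes
# (version without the hard-coded "1.0." prefix, plus a model_task_type field).
model_config_content = {
    "name": "BAAI/bge-reranker-base",        # placeholder model name
    "version": "1",                          # no hard-coded "1.0." prefix
    "description": "Cross Encoder Model bge-reranker-base",
    "model_format": "TORCH_SCRIPT",          # or "ONNX"
    "function_name": "TEXT_SIMILARITY",
    "model_task_type": "TEXT_SIMILARITY",    # assumption: mirrors function_name
    "model_content_hash_value": "<sha256 of the zip>",
    "model_config": {
        "model_type": "roberta",             # taken from the HF config
        "embedding_dimension": 1,            # cross-encoders emit a single relevance score
        "framework_type": "huggingface_transformers",  # assumed value
    },
}
```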
```python
if model_name is None:
    model_name = Path(self._hf_model_id).name
if description is None:
    description = f"Cross Encoder Model {model_name}"
```
Can we follow the way we generate the description for embedding models?
Looks like linting and integration tests are failing; could you please take a look?
This PR fails with this promising rerank model: https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1. I tried both torch_script and onnx. The torch_script path failed while saving the trace, while the onnx path saved the onnx and zip files but failed during prediction.
Looks like huggingface/transformers#20815 might be related? Do you know if mxbai-rerank is based on DeBERTa? If so, try setting the model type to deberta.
Yes, it is based on DeBERTa.
I have set the model type to deberta and converted the model to onnx (torch_script did not work) and was able to deploy it. However, I get the same error as above when calling the _predict API.
```
:param folder_path: folder path to save the model
    default is /tmp/models/hf_model_id
:type folder_path: str
:param overwrite: whether to overwrite the existing model
```
"whether to overwrite the existing model at `folder_path`"?
```python
tk = AutoTokenizer.from_pretrained(self._hf_model_id)
model = AutoModelForSequenceClassification.from_pretrained(self._hf_model_id)
features = tk([["dummy sentence 1", "dummy sentence 2"]], return_tensors="pt")
mname = Path(self._hf_model_id).name
```
Let's follow snake casing: `model_name`?
```python
features = tk([["dummy sentence 1", "dummy sentence 2"]], return_tensors="pt")
mname = Path(self._hf_model_id).name

# bge models don't generate token type ids
```
Do we have any issue to reference here?
I arrived at this conclusion by trying to do it and failing, so there might be an issue out there somewhere, but it's more of a fundamental architectural feature than a bug.
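For reference, a model-agnostic variant of that workaround would key off the tokenizer output instead of the model name (a sketch, not what this PR does; the model id is illustrative):

```python
import torch
from transformers import AutoTokenizer

tk = AutoTokenizer.from_pretrained("BAAI/bge-reranker-base")  # illustrative model
features = tk([["dummy sentence 1", "dummy sentence 2"]], return_tensors="pt")

# Some tokenizers (e.g. the RoBERTa-family ones behind the bge rerankers) never
# emit token_type_ids, so fill them with zeros only when they are missing.
if "token_type_ids" not in features:
    features["token_type_ids"] = torch.zeros_like(features["input_ids"])
```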
Description
Adds capabilities for uploading cross encoders. Introduces a CrossEncoderModel class with methods to zip the traced model (either TorchScript or ONNX) and generate a JSON configuration (as in SentenceTransformerModel), along with a utility upload function.
Example
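A hypothetical usage sketch; the class name comes from this PR, but the import path, method names, and signatures below are assumptions and may not match the final implementation:

```python
# Hypothetical usage sketch; names and signatures are assumptions.
from opensearch_py_ml.ml_models import CrossEncoderModel  # assumed import path

model = CrossEncoderModel(
    "BAAI/bge-reranker-base",                 # illustrative HF model id
    folder_path="/tmp/models/bge-reranker-base",
)

# Trace and zip the model (torch_script or onnx), then generate the config json.
zip_path = model.zip_model(framework="torch_script")      # assumed method name
config_path = model.make_model_config_json()              # assumed method name

# Upload to an OpenSearch cluster via the utility function mentioned above.
# upload(client, zip_path, config_path)                   # signature assumed
```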
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.