
add cross-encoder tracing, config-generating, and uploading #375

Open
wants to merge 15 commits into main

Conversation

HenryL27

Description

Adds capabilities for uploading cross-encoders. Introduces a CrossEncoderModel class with methods to zip the model (as either TorchScript or ONNX) and generate a JSON configuration (as in SentenceTransformerModel), along with a utility upload function. Example:

from opensearch_py_ml.ml_models import CrossEncoderModel
from opensearchpy import OpenSearch

client = OpenSearch(**os_client_args)
model = CrossEncoderModel("BAAI/bge-reranker-base", overwrite=True)
model.upload(client, framework='pt')
model.upload(client, framework='onnx')

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Feb 14, 2024

Codecov Report

Attention: Patch coverage is 18.42105%, with 93 lines in your changes missing coverage. Please review.

Project coverage is 89.68%. Comparing base (529ee34) to head (f6551ef).
Report is 2 commits behind head on main.

Current head f6551ef differs from pull request most recent head 7a134b0

Please upload reports for the commit 7a134b0 to get more accurate results.

Files                                             Patch %   Lines
opensearch_py_ml/ml_models/crossencodermodel.py   16.96%    93 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #375      +/-   ##
==========================================
- Coverage   91.53%   89.68%   -1.86%     
==========================================
  Files          42       43       +1     
  Lines        4395     4508     +113     
==========================================
+ Hits         4023     4043      +20     
- Misses        372      465      +93     


@dhrubo-os
Collaborator

Thanks for raising this PR. We need to add corresponding tests too. It seems like a few functions are mostly shared with the SentenceTransformerModel class. Maybe we can move the common functionality into a separate utils class?

@dhrubo-os
Collaborator

The customer should have a way to specify the name of the generated zip file and the model file, as in our SentenceTransformerModel class.

# save tokenizer file
tk_path = Path(f"/tmp/{mname}-tokenizer")
tk.save_pretrained(tk_path)
_fix_tokenizer(tk.model_max_length, tk_path)
Collaborator

tk.model_max_length --> not sure if we will get this value all the time. Please take a look at this PR: #219

Author

OK, going with the strategy proposed in huggingface/transformers#14561: look for max_position_embeddings or n_positions in the config object, and if both are missing, set it to 32k and hope that's fine. (The fix using sentence_transformers' model.get_max_seq_length() also has the potential to return None.)
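
For reference, a minimal sketch of that fallback order, using the Hugging Face AutoConfig attribute names; this illustrates the strategy described above, not the PR's exact code:

from transformers import AutoConfig, AutoTokenizer

hf_model_id = "BAAI/bge-reranker-base"  # model from the PR description
model_config = AutoConfig.from_pretrained(hf_model_id)
tk = AutoTokenizer.from_pretrained(hf_model_id)

# Prefer the architectural limit from the config, then n_positions,
# and only then fall back to an arbitrary large cap.
if hasattr(model_config, "max_position_embeddings"):
    tk.model_max_length = model_config.max_position_embeddings
elif hasattr(model_config, "n_positions"):
    tk.model_max_length = model_config.n_positions
else:
    tk.model_max_length = 2**15  # = 32768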

@dhrubo-os
Collaborator

Tests are failing.

"model_content_hash_value": hash_value,
"model_config": {
"model_type": model_type,
"embedding_dimension": 1,
Collaborator

Is this correct?

Collaborator

If you look here: https://huggingface.co/BAAI/bge-reranker-base/blob/main/config.json

max_position_embeddings = 514

Author

Yes, this is what I did. It's an artifact of the implementation (which depends heavily on the embedding model code). It can probably be cleaned up a bit.

elif hasattr(model_config, "n_positions"):
    tk.model_max_length = model_config.n_positions
else:
    tk.model_max_length = 2**15  # = 32768. Set to something big I guess
Collaborator

Setting an arbitrary value doesn't seem like a good solution.

What do you think about following this: https://github.com/opensearch-project/opensearch-py-ml/blob/main/opensearch_py_ml/ml_models/sentencetransformermodel.py#L936-L942

Author

I would love to. Unfortunately, model.get_max_seq_length() is not a method on most HF transformers ModelForSequenceClassification classes; that's specific to the sentence-transformers model interface, and I'm not sure it provides the guarantee it claims to: [implementation]

Collaborator

I'm seeing two problems here:

  1. tk.model_max_length is None: see [Bug] tokenizer.model_max_length is different when loading model from shortcut or local path (huggingface/transformers#14561); in that case model_max_length can instead come back as a very large value, and we would end up setting a length the model can't process.

  2. I'm not a big fan of setting a static big value like tk.model_max_length = 2**15. What exactly are we achieving here?

It's important to align model_max_length with the model's actual capability (max_position_embeddings). Setting model_max_length to a higher value than max_position_embeddings could lead to errors or unexpected behavior, as the model won't be able to handle inputs longer than its architectural limit. Conversely, setting a lower model_max_length can be useful for optimizing performance or adhering to specific task requirements.

Author
HenryL27 commented Feb 28, 2024

I don't understand these issues. tk.model_max_length is None is exactly the condition that triggers setting it to something reasonable. I agree that it's important to set it to a value that agrees with the model's capability: that's why the first condition here checks for and sets max_position_embeddings. We only pick the Very Large Number when we don't have a bound. For instance, suppose someone trained a Mamba-based cross-encoder with infinite context because it's an RNN.

Collaborator

Sorry for the confusion. What I meant is that the tk.model_max_length is None condition will not even trigger. Without it triggering, model_max_length could be a large value that the model can't process.


tokenizer = GPT2Tokenizer.from_pretrained("path/to/local/gpt2")
print(tokenizer.model_max_length)
# 1000000000000000019884624838656

So here model_max_length isn't None, right? But do we still want that?

Author

Interesting. Can we just let huggingface/transformers fix this bug? It seems like a problem on their end, and from what I can tell the only time we're going to hit it is when someone uses a very old tokenizer file with their model. At that point I hope we can assume the user is proficient enough with transformers to debug if necessary.
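
For what it's worth, a hedged sketch of a guard for that case, treating an implausibly large tokenizer limit the same as a missing one and clamping it to the limit advertised by the config (the local path reuses the reviewer's hypothetical example above):

from transformers import AutoConfig, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("path/to/local/gpt2")
model_config = AutoConfig.from_pretrained("path/to/local/gpt2")

# Use the config's architectural limit when the tokenizer reports None or a
# sentinel-like huge number (e.g. 1000000000000000019884624838656).
limit = getattr(model_config, "max_position_embeddings", None) or getattr(
    model_config, "n_positions", None
)
if limit is not None and (
    tokenizer.model_max_length is None or tokenizer.model_max_length > limit
):
    tokenizer.model_max_length = limit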

"""
# export to onnx
save_loc = Path(f"/tmp/{mname}.onnx")
torch.onnx.export(
Collaborator

I think we should add onnx to requirements.txt.

In addition, can we also add it to requirements-dev.txt?

Author

looks like it's there already?
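
For context, a hedged sketch of what the ONNX export of a cross-encoder can look like; it mirrors the excerpt above but fills in illustrative arguments, and the input/output names and dynamic axes are assumptions rather than the PR's exact call:

from pathlib import Path

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

hf_model_id = "BAAI/bge-reranker-base"
tk = AutoTokenizer.from_pretrained(hf_model_id)
model = AutoModelForSequenceClassification.from_pretrained(hf_model_id, return_dict=False)
model.eval()

features = tk([["dummy sentence 1", "dummy sentence 2"]], return_tensors="pt")
mname = Path(hf_model_id).name

# export to onnx with named inputs so the serving side can match them
save_loc = Path(f"/tmp/{mname}.onnx")
torch.onnx.export(
    model,
    (features["input_ids"], features["attention_mask"]),
    str(save_loc),
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
)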

if mname.startswith("bge"):
    features["token_type_ids"] = torch.zeros_like(features["input_ids"])

if framework == "pt":
Collaborator

I think we should accept "torch_script" as the framework parameter instead of "pt". "pt" is a file extension, not a framework. We decided to save as .pt; somebody else might decide on .pth or .ptc. The customer might not know the file format of the TorchScript artifact.

So the framework choices should be: torch_script or onnx
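
A minimal sketch of that check (names here are illustrative; the PR's actual signature may differ):

def _validate_framework(framework: str) -> str:
    # Accept framework names ("torch_script", "onnx"), not file extensions like "pt".
    valid = ("torch_script", "onnx")
    if framework not in valid:
        raise ValueError(f"framework must be one of {valid}, got {framework!r}")
    return framework

A call would then look like model.upload(client, framework="torch_script").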

Author

yep

"version": f"1.0.{version_number}",
"description": description,
"model_format": model_format,
"function_name": "TEXT_SIMILARITY",
Collaborator

We need model_task_type. Let's follow the example of the SentenceTransformerModel method.
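
A hedged sketch of where model_task_type would sit relative to the fields shown in the excerpts; the surrounding values are placeholders, and the exact key set the plugin expects is an assumption here:

# Placeholder values; in the PR these come from the surrounding method.
model_name = "bge-reranker-base"
version_number = 1
description = f"Cross Encoder Model {model_name}"
model_format = "TORCH_SCRIPT"
model_type = "xlm-roberta"
hash_value = "<sha256 of the zip file>"

model_config_content = {
    "name": model_name,
    "version": f"1.0.{version_number}",
    "description": description,
    "model_task_type": "TEXT_SIMILARITY",  # requested field; value assumed to mirror function_name
    "model_format": model_format,
    "function_name": "TEXT_SIMILARITY",
    "model_content_hash_value": hash_value,
    "model_config": {
        "model_type": model_type,
        "embedding_dimension": 1,
    },
}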

)
model_config_content = {
    "name": model_name,
    "version": f"1.0.{version_number}",
Collaborator

We don't need to prepend 1.0. here.

if model_name is None:
    model_name = Path(self._hf_model_id).name
if description is None:
    description = f"Cross Encoder Model {model_name}"
Collaborator

Can we follow the way we write the description for embedding models?

@dhrubo-os
Collaborator

Looks like linting and integration tests are failing; could you please take a look?

Signed-off-by: HenryL27 <[email protected]>
@asfoorial

This PR fails with this promising rerank model https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1

I tried both torch_script and onnx. The torch_script export failed while saving the trace, giving:
Could not export Python function call 'XSoftmax'. Remove calls to Python functions before export. Did you forget to add @script or @script_method annotation? If this is a nn.ModuleList, add it to constants:

The onnx export saved the onnx and zip files but failed during prediction, giving:
{
  "error": {
    "root_cause": [
      {
        "type": "translate_exception",
        "reason": "translate_exception: java.lang.IllegalArgumentException: Input mismatch, looking for: [input_ids, attention_mask]"
      }
    ],
    "type": "m_l_exception",
    "reason": "m_l_exception: Failed to inference TEXT_SIMILARITY model: ZwQzA44BK0CtZG_zHaFw",
    "caused_by": {
      "type": "privileged_action_exception",
      "reason": "privileged_action_exception: null",
      "caused_by": {
        "type": "translate_exception",
        "reason": "translate_exception: java.lang.IllegalArgumentException: Input mismatch, looking for: [input_ids, attention_mask]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "Input mismatch, looking for: [input_ids, attention_mask]"
        }
      }
    }
  },
  "status": 500
}

@HenryL27
Author

HenryL27 commented Mar 4, 2024

Looks like huggingface/transformers#20815 may be related? Is mxbai-rerank based on DeBERTa, do you know? If so, try setting model_type='deberta' in the make_model_config_json call.
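
For example (assuming make_model_config_json accepts a model_type override as discussed here; other arguments omitted):

from opensearch_py_ml.ml_models import CrossEncoderModel

# Override the model type detected from the HF config with "deberta".
model = CrossEncoderModel("mixedbread-ai/mxbai-rerank-xsmall-v1", overwrite=True)
model.make_model_config_json(model_type="deberta")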

@asfoorial

looks like maybe huggingface/transformers#20815 is related? Is mxbai-rerank based on Deberta do you know? If so try setting model_type='deberta' in the make_model_config_json fn call

Yes, it is based on DeBERTa.

@asfoorial

looks like maybe huggingface/transformers#20815 is related? Is mxbai-rerank based on Deberta do you know? If so try setting model_type='deberta' in the make_model_config_json fn call

I have set the model type to deberta and converted the model to onnx (torch_script did not work) and was able to deploy it. However, I get the same error as above when calling the _predict API.

:param folder_path: folder path to save the model
default is /tmp/models/hf_model_id
:type folder_path: str
:param overwrite: whether to overwrite the existing model
Collaborator

Should this say "whether to overwrite the existing model at folder_path"?

tk = AutoTokenizer.from_pretrained(self._hf_model_id)
model = AutoModelForSequenceClassification.from_pretrained(self._hf_model_id)
features = tk([["dummy sentence 1", "dummy sentence 2"]], return_tensors="pt")
mname = Path(self._hf_model_id).name
Collaborator

Let's follow snake casing: model_name?

features = tk([["dummy sentence 1", "dummy sentence 2"]], return_tensors="pt")
mname = Path(self._hf_model_id).name

# bge models don't generate token type ids
Collaborator

Do we have any issue to reference here?

Author

I arrived at this conclusion by trying it and failing, so there might be an issue out there somewhere, but it's more of a fundamental architectural feature than a bug.
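
A hedged illustration of the workaround in the diff above: if the tokenizer doesn't emit token_type_ids, fill them with zeros shaped like input_ids so the traced model still receives a consistent input set (the membership check is a generalization of the PR's name-prefix check):

import torch
from transformers import AutoTokenizer

tk = AutoTokenizer.from_pretrained("BAAI/bge-reranker-base")
features = tk([["dummy sentence 1", "dummy sentence 2"]], return_tensors="pt")

# bge tokenizers don't generate token_type_ids; add an all-zero tensor so the
# traced/exported model sees the same set of inputs across model families.
if "token_type_ids" not in features:
    features["token_type_ids"] = torch.zeros_like(features["input_ids"])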

Signed-off-by: Henry Lindeman <[email protected]>