[FEATURE] Support for asymmetric embedding models #1799
Comments
@ylwu-amzn sorry for mentioning you directly on this issue, but you seem to be someone who knows what needs to happen for this to at least get triaged. I would implement the feature myself, but I don't want to start before I have a green light from the community that the approach has a chance of being pulled into the project.
@br3no Sorry it took so long to get to this. It would be great to support asymmetric embedding models, so definitely a green light if you're still interested in working on this. A couple of design things to think about:
Can I assign this to you?
@HenryL27, great. Yes, you can assign me.
@HenryL27 I've opened a PR for this feature. I've tested it with a self-packaged version of https://huggingface.co/intfloat/multilingual-e5-small/tree/main/onnx. I'm reluctant to add a dedicated integration test for this model because it weighs 270 MB. I only added one test case to validate that the input gets modified before the embedding computation. Let me know what you think.
Closing the issue, as the PR has been merged. |
All, it's not clear to me how this feature impacts APIs like the query interfaces and the ingestion processors. What's the impact of this feature on the query and processor interfaces? We're working on revamping and generalizing the framework so that we can integrate any ML model. The plan is to provide users with a way to define the model interface when models are registered. So, if you have an asymmetric text embedding model, you should be able to define a field like "mode" so that downstream dependencies like models/connectors can use the interface metadata to map/process/pass this data accordingly for invocation. We plan to create ML inference search processors that allow these models to be added into search pipelines. This will make it possible to support any ML model (not just asymmetric text embedding models) and make them usable in any ingest and search pipeline as well as in search queries.
@dylan-tong-aws the feature merged in the PR adds the possibility to register prefixes for models (analogous to chat templates in chat-based LLMs) and adds a parameter type to be used at inference time, so that OpenSearch knows which prefix to add. Asymmetric embedding models such as e5 require the text one wants to embed to "state" whether it is a query or a passage. This might be a subset of the features you are planning to introduce? Can you link an issue so that I can understand the context better?
Is your feature request related to a problem?
OpenSearch currently only supports symmetric text embedding models, such as sentence-transformers/all-MiniLM-L12-v2. These models treat queries and passages equally at inference time. While they can be trained on datasets of queries and passages in such a way that they learn similar representations for both (e.g. sentence-transformers/msmarco-MiniLM-L-12-v3), the best-performing embedding models on the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard) offer different inference "APIs" for queries and passages. To be able to support these models in OpenSearch, we need to be able to define different inference mechanisms for passage embedding and query embedding.
What solution would you like?
Prominent asymmetric models use string prefixes to prime the model to embed queries and passages differently. Cf. https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder#using-transformers or https://huggingface.co/intfloat/e5-large-v2#usage. To be able to support this kind of asymmetric model, I propose to introduce model-specific "embedding templates". These should be part of the model metadata and can be used to "format" the input for the model before running inference.
E.g. for the e5 family of models, the query and passage templates could look like this:
query: %s
passage: %s
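To make the mechanism concrete, here is a minimal sketch (illustrative only, not project code) of what applying such templates amounts to with plain Java format strings:

```java
// Minimal illustration: applying the proposed e5-style templates is a plain
// String.format call on the raw text before it is sent to the model.
public class TemplateFormatExample {
    public static void main(String[] args) {
        String queryTemplate = "query: %s";
        String passageTemplate = "passage: %s";

        // prints "query: how tall is the Eiffel Tower?"
        System.out.println(String.format(queryTemplate, "how tall is the Eiffel Tower?"));

        // prints "passage: The Eiffel Tower is 330 metres tall."
        System.out.println(String.format(passageTemplate, "The Eiffel Tower is 330 metres tall."));
    }
}
```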
I propose to add optional fields for these templates to the model configuration in the _register endpoint in ml-commons. ml-commons already distinguishes datasets by type (cf. SearchQueryInputDataset and TextDocsInputDataSet). At inference time, it should be possible to check whether a particular model has templates and, depending on the dataset type, apply the correct one using regular Java format strings.
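As a sketch of how the optional template fields and the content-type check could fit together (class and field names below are hypothetical; this is not existing ml-commons code, nor necessarily what the merged PR ended up implementing):

```java
import java.util.Optional;

// Hypothetical sketch only: these are not existing ml-commons classes, and the
// real configuration fields may differ from what is shown here.
public class AsymmetricTemplateSketch {

    enum EmbeddingContentType { QUERY, PASSAGE }

    static class AsymmetricModelConfig {
        // Optional templates registered with the model, e.g. "query: %s" / "passage: %s".
        private final String queryTemplate;    // null for symmetric models
        private final String passageTemplate;  // null for symmetric models

        AsymmetricModelConfig(String queryTemplate, String passageTemplate) {
            this.queryTemplate = queryTemplate;
            this.passageTemplate = passageTemplate;
        }

        // Applies the matching template via a regular Java format string, or leaves
        // the text untouched when no template was registered.
        String preprocess(String text, EmbeddingContentType type) {
            String template = (type == EmbeddingContentType.QUERY) ? queryTemplate : passageTemplate;
            return Optional.ofNullable(template)
                    .map(t -> String.format(t, text))
                    .orElse(text);
        }
    }

    public static void main(String[] args) {
        AsymmetricModelConfig e5 = new AsymmetricModelConfig("query: %s", "passage: %s");
        System.out.println(e5.preprocess("best hiking boots for winter", EmbeddingContentType.QUERY));
        System.out.println(e5.preprocess("These boots are insulated and waterproof.", EmbeddingContentType.PASSAGE));
    }
}
```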
OpenSearch neural-search would then need to be extended to make sure it uses the correct dataset type for queries and passages. Currently it uses the same type regardless of the use case: https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/ml/MLCommonsClientAccessor.java#L248.
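A rough sketch of the distinction neural-search would have to start making; none of the types below exist in neural-search or ml-commons (the current MLCommonsClientAccessor builds the same input type for both paths), they only illustrate tagging texts by their role so the matching template can be applied downstream:

```java
import java.util.List;

// Hypothetical sketch only: tagging texts by their role at the two call sites
// (search-time query building vs. ingest-time field embedding).
public class NeuralSearchCallSiteSketch {

    enum TextKind { SEARCH_QUERY, PASSAGE }

    record EmbeddingRequest(TextKind kind, List<String> texts) {}

    // What the neural query builder would produce at search time.
    static EmbeddingRequest forQuery(String queryText) {
        return new EmbeddingRequest(TextKind.SEARCH_QUERY, List.of(queryText));
    }

    // What the text-embedding ingest processor would produce at indexing time.
    static EmbeddingRequest forIngest(List<String> fieldValues) {
        return new EmbeddingRequest(TextKind.PASSAGE, fieldValues);
    }

    public static void main(String[] args) {
        System.out.println(forQuery("waterproof trail shoes"));
        System.out.println(forIngest(List.of("Our trail shoes are fully waterproof.")));
    }
}
```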
What alternatives have you considered?
The proposed change has a small surface and only extends the API. I couldn't think of any competitive alternative solution.
Do you have any additional context?
No.