
[FEATURE] Support for asymmetric embedding models #1799

Closed
br3no opened this issue Dec 21, 2023 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@br3no
Contributor

br3no commented Dec 21, 2023

Is your feature request related to a problem?
OpenSearch currently only supports symmetric text embedding models, such as sentence-transformers/all-MiniLM-L12-v2. These models treat queries and passages identically at inference time. While they can be trained on datasets of queries and passages so that they learn similar representations for both (e.g. sentence-transformers/msmarco-MiniLM-L-12-v3), the best-performing embedding models on the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard) are models that offer different inference "APIs" for queries and passages.

To be able to support these models in OpenSearch we need to be able to define different inference mechanisms for the passage embedding and the query embedding.

What solution would you like?
Prominent asymmetric models use string prefixes to prime the model to embed queries and passages differently. Cf. https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder#using-transformers or https://huggingface.co/intfloat/e5-large-v2#usage. To be able to support this kind of asymmetric model, I propose to introduce model-specific "embedding templates". These should be part of the model metadata and can be used to "format" the input for the model before running inference.

E.g. for the e5-family of models, the query template could look like this:

| type    | template      |
| ------- | ------------- |
| query   | `query: %s`   |
| passage | `passage: %s` |
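Applied mechanically, such a template reduces to a single `String.format` call. A minimal sketch (illustrative only, not ml-commons code):

```java
// Minimal illustration (not ml-commons code): applying an embedding
// template to raw input text with a standard Java format string.
public class TemplateExample {
    public static String applyTemplate(String template, String text) {
        // A null template means the model is symmetric; pass the text through.
        return template == null ? text : String.format(template, text);
    }

    public static void main(String[] args) {
        // Prints "query: how tall is mount everest"
        System.out.println(applyTemplate("query: %s", "how tall is mount everest"));
    }
}
```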

I propose to add optional fields to the model configuration in the _register endpoint in ml-commons. E.g.:

POST /_plugins/_ml/models/_register
{
  ...,
  "model_config": {
    ...,
    "query_template" : "query: %s",
    "passage_template" : "passage: %s",
    ...
  },
  ...
}

ml-commons already distinguishes datasets by type (cf. SearchQueryInputDataset and TextDocsInputDataSet). At inference time, it should be possible to check whether a particular model has templates and, depending on the dataset type, apply the correct one using regular Java format strings.
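The dispatch could look roughly like this; the enum and config class below are hypothetical names for illustration, not actual ml-commons types:

```java
// Hypothetical sketch of template dispatch by input type; the enum and
// config class are illustrative, not actual ml-commons APIs.
public class TemplateDispatch {
    public enum InputType { QUERY, PASSAGE }

    public static class ModelConfig {
        public String queryTemplate;    // e.g. "query: %s"; null if unset
        public String passageTemplate;  // e.g. "passage: %s"; null if unset
    }

    public static String format(ModelConfig config, InputType type, String text) {
        String template = (type == InputType.QUERY)
                ? config.queryTemplate : config.passageTemplate;
        // Models without templates fall back to the raw text.
        return template == null ? text : String.format(template, text);
    }
}
```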

OpenSearch neural-search would then need to be extended to make sure it uses the correct dataset type for queries and passages. Currently it uses the same type regardless of the use case: https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/ml/MLCommonsClientAccessor.java#L248.

What alternatives have you considered?
The proposed change has a small surface area and only extends the API. I couldn't think of a competitive alternative solution.

Do you have any additional context?
No.

@br3no br3no added enhancement New feature or request untriaged labels Dec 21, 2023
@br3no
Contributor Author

br3no commented Jan 29, 2024

@ylwu-amzn sorry for mentioning you directly on this issue, but you seem to be someone who knows what needs to be done for this to be at least triaged?

I would implement the feature myself, but don't want to start before I have a green light from the community that the approach has a chance of being pulled into the project.

@HenryL27
Collaborator

@br3no Sorry it took so long to get to this.

It would be great to support asymmetric embedding models, so definitely a green light if you're still interested in working on this.

A couple of design things to think about:

  1. The SearchQueryInputDataset doesn't really lend itself well to embedding - the term 'query' is overloaded in this weird search-engine-machine-learning space, but this dataset is representing a proper search-engine database-y query, not a natural language query. Could you extract a natural language query from the opensearch query? Sure. But I'll bet it would be extremely difficult to do that in any kind of robust way.
  2. We do still need a way to differentiate between query datasets and document datasets - my suggestion would be to simply add a boolean flag to the TextDocsInputDataset (or maybe just TextDocsInput) called something like asQuery. Default it to false, and then have neural search set it for queries or something.
  3. Templates - I worry a little bit about string formatting attacks, although maybe that's not a real concern. But in general I think asymmetric models are only asymmetric by a prefix; there's not a bunch of infixes or suffixes, right? In this case might we simplify the API just a smidge, call them "query_prefix" and "passage_prefix", and do away with the "%s"s entirely (just relying on string concatenation)?
  4. Is this going to entail an upgrade to the internal model metadata index? Not a problem if yes, I just want to keep that on your radar.
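Points 2 and 3 above boil down to something like the following sketch (names are hypothetical; plain concatenation replaces the "%s" templates):

```java
// Rough sketch of the prefix-based alternative from points 2 and 3:
// a boolean flag on the input dataset selects a prefix, and plain string
// concatenation replaces the "%s" format templates. Names are hypothetical.
public class PrefixSketch {
    public static String prepend(String text, boolean asQuery,
                                 String queryPrefix, String passagePrefix) {
        String prefix = asQuery ? queryPrefix : passagePrefix;
        return prefix == null ? text : prefix + text;
    }
}
```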

Can I assign this to you?

@br3no
Contributor Author

br3no commented Feb 14, 2024

@HenryL27, great. Yes you can assign me.

@br3no br3no mentioned this issue Feb 16, 2024
@br3no
Contributor Author

br3no commented Feb 16, 2024

@HenryL27 I've opened a PR for this feature. I've tested the feature with a self-packaged version of https://huggingface.co/intfloat/multilingual-e5-small/tree/main/onnx. I'm reluctant to add a dedicated integration test for this model because it weighs 270 MB. I only added one test case to validate that the input gets modified before the embedding computation. Let me know what you think.

@ylwu-amzn
Collaborator

@br3no Thanks a lot for the contribution! Sorry that I somehow missed your comment. Thanks @HenryL27 for helping.
Will take a look at your PR.

@br3no
Contributor Author

br3no commented Feb 29, 2024

Closing the issue, as the PR has been merged.

@dylan-tong-aws

dylan-tong-aws commented Mar 22, 2024

All, it's not clear to me on how this feature impacts the APIs like the query interfaces and the ingestion processors. What's the impact of this feature on the query and processor interfaces?

We're working on revamping and generalizing the framework so that we can integrate any ML model.

The plan is to provide users with a way to define the model interface when the models are registered. So, if you have an asymmetric text embedding model, you should be able to define a field like "mode" so that downstream dependencies like models/connectors can utilize the interface metadata to map/process/pass this data accordingly for invocation.

We plan to create ML inference search processors that allow these models to be added into search pipelines. Something like the following will be made possible, to support any ML model (not just asymmetric text embedding models) and make them usable in any ingest and search pipeline as well as in search queries.

GET my-knn-index-1/_search?search_pipeline=my_pipeline
{
  "size": 2,
  "query": {
    "knn": {
      "my_text_embeddings": {
        "vector": "ext.ml_models.text_embeddings",
        "k": 2
      },
      "my_image_embeddings": {
        "vector": "ext.ml_models.image_embeddings",
        "k": 2
      }
    }
  },
  "ext": {
    "ml_models": {  ## these inputs correspond to ML inference processors within the search pipeline
      "text_embeddings": {
        "text": "this is input for a text embedding model",
        "mode": "query"
      },
      "image_embeddings": {
        "image": "##### image bits ######",
        "format": "base64"
      }
    }
  }
}

@br3no
Contributor Author

br3no commented Apr 12, 2024

@dylan-tong-aws the feature merged in the PR adds the possibility to register prefixes for models (analogous to chat templates in chat-based LLMs) and adds a parameter type to be used at inference, so that OpenSearch knows which prefix to add. Asymmetric embedding models such as e5 require the text one wants to embed to "state" whether it is a query or a passage.
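For illustration, the inference-time behavior described above amounts to something like the following hedged sketch (not the merged implementation; class and method names are invented, e5-style prefixes assumed):

```java
// Hedged sketch of the described behavior, not the merged implementation:
// prefixes are registered with the model, and a content-type parameter
// supplied at inference time selects which prefix to prepend.
public class AsymmetricSketch {
    public enum ContentType { QUERY, PASSAGE }

    private final String queryPrefix;
    private final String passagePrefix;

    public AsymmetricSketch(String queryPrefix, String passagePrefix) {
        this.queryPrefix = queryPrefix;
        this.passagePrefix = passagePrefix;
    }

    public String prepareInput(String text, ContentType type) {
        String prefix = (type == ContentType.QUERY) ? queryPrefix : passagePrefix;
        // Symmetric models (no registered prefixes) pass the text through.
        return prefix == null ? text : prefix + text;
    }
}
```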

This might be a subset of the features you are planning to introduce? Can you link an issue so that I can understand the context better?
