
[FEATURE] Support for asymmetric embedding models #1799

Closed
br3no opened this issue Dec 21, 2023 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@br3no
Contributor

br3no commented Dec 21, 2023

Is your feature request related to a problem?
OpenSearch currently only supports symmetric text embedding models, such as sentence-transformers/all-MiniLM-L12-v2. These models treat queries and passages identically at inference time. While they can be trained on datasets of queries and passages so that they learn similar representations for both (e.g. sentence-transformers/msmarco-MiniLM-L-12-v3), the best-performing embedding models on the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard) are models that offer different inference "APIs" for queries and passages.

To be able to support these models in OpenSearch we need to be able to define different inference mechanisms for the passage embedding and the query embedding.

What solution would you like?
Prominent asymmetric models use string prefixes to prime the model to embed queries and passages differently. Cf. https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder#using-transformers or https://huggingface.co/intfloat/e5-large-v2#usage. To be able to support this kind of asymmetric model, I propose to introduce model-specific "embedding templates". These should be part of the model metadata and can be used to "format" the input for the model before running inference.

E.g. for the e5-family of models, the query template could look like this:

| type    | template      |
| ------- | ------------- |
| query   | `query: %s`   |
| passage | `passage: %s` |
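Applied mechanically, such a template reduces to a single `String.format` call. A minimal sketch (illustrative only, not ml-commons code):

```java
// Minimal illustration (not ml-commons code): applying an embedding
// template to raw input text with a standard Java format string.
public class TemplateExample {
    public static String applyTemplate(String template, String text) {
        // A null template means the model is symmetric; pass the text through.
        return template == null ? text : String.format(template, text);
    }

    public static void main(String[] args) {
        // Prints "query: how tall is mount everest"
        System.out.println(applyTemplate("query: %s", "how tall is mount everest"));
    }
}
```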

I propose to add optional fields to the model configuration in the _register endpoint in ml-commons. E.g.:

POST /_plugins/_ml/models/_register
{
  ...,
  "model_config": {
    ...,
    "query_template" : "query: %s",
    "passage_template" : "passage: %s",
    ...
  },
  ...
}

ml-commons already distinguishes datasets by type (cf. SearchQueryInputDataset and TextDocsInputDataSet). At inference time, it should be possible to check whether a particular model has templates and, depending on the dataset type, apply the correct one using regular Java format strings.
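The dispatch could look roughly like this; the enum and config class below are hypothetical names for illustration, not actual ml-commons types:

```java
// Hypothetical sketch of template dispatch by input type; the enum and
// config class are illustrative, not actual ml-commons APIs.
public class TemplateDispatch {
    public enum InputType { QUERY, PASSAGE }

    public static class ModelConfig {
        public String queryTemplate;    // e.g. "query: %s"; null if unset
        public String passageTemplate;  // e.g. "passage: %s"; null if unset
    }

    public static String format(ModelConfig config, InputType type, String text) {
        String template = (type == InputType.QUERY)
                ? config.queryTemplate : config.passageTemplate;
        // Models without templates fall back to the raw text.
        return template == null ? text : String.format(template, text);
    }
}
```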

OpenSearch neural-search would then need to be extended to make sure it uses the correct dataset type for queries and passages. Currently it uses the same type regardless of the use case: https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/ml/MLCommonsClientAccessor.java#L248.

What alternatives have you considered?
The proposed change has a small surface area and only extends the API. I couldn't think of a competitive alternative solution.

Do you have any additional context?
No.

@br3no br3no added enhancement New feature or request untriaged labels Dec 21, 2023
@br3no
Contributor Author

br3no commented Jan 29, 2024

@ylwu-amzn sorry for mentioning you directly on this issue, but you seem to be someone who knows what needs to be done for this to be at least triaged?

I would implement the feature myself, but don't want to start before I have a green light from the community that the approach has a chance of being pulled into the project.

@HenryL27
Collaborator

@br3no Sorry it took so long to get to this.

It would be great to support asymmetric embedding models, so definitely a green light if you're still interested in working on this.

A couple of design things to think about:

  1. The SearchQueryInputDataset doesn't really lend itself well to embedding - the term 'query' is overloaded in this weird search-engine-machine-learning space, but this dataset is representing a proper search-engine database-y query, not a natural language query. Could you extract a natural language query from the opensearch query? Sure. But I'll bet it would be extremely difficult to do that in any kind of robust way.
  2. We do still need a way to differentiate between query datasets and document datasets - my suggestion would be to simply add a boolean flag to the TextDocsInputDataset (or maybe just TextDocsInput) called something like asQuery. Default it to false, and then have neural search set it for queries or something.
  3. Templates - I worry a little bit about string formatting attacks, although maybe that's not a real concern. But in general I think asymmetric models are only asymmetric by a prefix; there's not a bunch of infixes or suffixes, right? In this case might we simplify the API just a smidge, call them "query_prefix" and "passage_prefix", and do away with the "%s"s entirely (just relying on string concatenation)?
  4. Is this going to entail an upgrade to the internal model metadata index? Not a problem if yes, I just want to keep that on your radar.
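Points 2 and 3 above boil down to something like the following sketch (names are hypothetical; plain concatenation replaces the "%s" templates):

```java
// Rough sketch of the prefix-based alternative from points 2 and 3:
// a boolean flag on the input dataset selects a prefix, and plain string
// concatenation replaces the "%s" format templates. Names are hypothetical.
public class PrefixSketch {
    public static String prepend(String text, boolean asQuery,
                                 String queryPrefix, String passagePrefix) {
        String prefix = asQuery ? queryPrefix : passagePrefix;
        return prefix == null ? text : prefix + text;
    }
}
```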

Can I assign this to you?

@br3no
Contributor Author

br3no commented Feb 14, 2024

@HenryL27, great. Yes you can assign me.

@br3no br3no mentioned this issue Feb 16, 2024
@br3no
Contributor Author

br3no commented Feb 16, 2024

@HenryL27 I've opened a PR for this feature. I've tested the feature with a self-packaged version of https://huggingface.co/intfloat/multilingual-e5-small/tree/main/onnx. I'm reluctant to add a dedicated integration test for this model because it weighs 270 MB. I only added one test case to validate that the input gets modified before the embedding computation. Let me know what you think.

@ylwu-amzn
Collaborator

@br3no Thanks a lot for the contribution! Sorry that I somehow missed your comment. Thanks @HenryL27 for helping.
Will take a look at your PR.

@br3no
Contributor Author

br3no commented Feb 29, 2024

Closing the issue, as the PR has been merged.

@dylan-tong-aws

dylan-tong-aws commented Mar 22, 2024

All, it's not clear to me on how this feature impacts the APIs like the query interfaces and the ingestion processors. What's the impact of this feature on the query and processor interfaces?

We're working on revamping and generalizing the framework so that we can integrate any ML model.

The plan is to provide users with a way to define the model interface when the models are registered. So, if you have an asymmetric text embedding model, you should be able to define a field like "mode" so that downstream dependencies like models/connectors can utilize the interface metadata to map/process/pass this data accordingly for invocation.

We plan to create ML inference search processors that allow these models to be added into search pipelines. Something like the following will be made possible, to support any ML model (not just asymmetric text embedding models) and make them usable in any ingest and search pipeline as well as in search queries.

GET my-knn-index-1/_search?search_pipeline=my_pipeline
{
  "size": 2,
  "query": {
    "knn": {
      "my_text_embeddings": {
        "vector": "ext.ml_models.text_embeddings",
        "k": 2
      },
      "my_image_embeddings": {
        "vector": "ext.ml_models.image_embeddings",
        "k": 2
      }
    }
  },
  "ext": {
    "ml_models": {  ## these inputs correspond to ML inference processors within the search pipeline
      "text_embeddings": {
        "text": "this is input for a text embedding model",
        "mode": "query"
      },
      "image_embeddings": {
        "image": "##### image bits ######",
        "format": "base64"
      }
    }
  }
}

@br3no
Contributor Author

br3no commented Apr 12, 2024

@dylan-tong-aws the feature merged in the PR adds the possibility to register prefixes for models (analogous to chat templates in chat-based LLMs) and adds a parameter type to be used at inference, so that OpenSearch knows which prefix to add. Asymmetric embedding models such as e5 require the text one wants to embed to "state" whether it is a query or a passage.
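For illustration, the inference-time behavior described above amounts to something like the following hedged sketch (not the merged implementation; class and method names are invented, e5-style prefixes assumed):

```java
// Hedged sketch of the described behavior, not the merged implementation:
// prefixes are registered with the model, and a content-type parameter
// supplied at inference time selects which prefix to prepend.
public class AsymmetricSketch {
    public enum ContentType { QUERY, PASSAGE }

    private final String queryPrefix;
    private final String passagePrefix;

    public AsymmetricSketch(String queryPrefix, String passagePrefix) {
        this.queryPrefix = queryPrefix;
        this.passagePrefix = passagePrefix;
    }

    public String prepareInput(String text, ContentType type) {
        String prefix = (type == ContentType.QUERY) ? queryPrefix : passagePrefix;
        // Symmetric models (no registered prefixes) pass the text through.
        return prefix == null ? text : prefix + text;
    }
}
```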

This might be a subset of the features you are planning to introduce? Can you link an issue so that I can understand the context better?
