
[Semantic Search] Support default models for index/fields #70

Closed
jmazanec15 opened this issue Nov 16, 2022 · 12 comments
Assignees
Labels
backlog All the backlog features should be marked with this label Enhancements Increases software capabilities beyond original client specifications neural-search semantic search v2.11.0 Issues targeting release v2.11.0

Comments

@jmazanec15
Member

jmazanec15 commented Nov 16, 2022

Problem Statement

Currently, the neural-search plugin search functionality relies on the user to pass the "model_id" with each query.

  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "hello world",
        "model_id": "csdsdcsdsadasdcsad",
        "k": int
      }
    }
  }
This offers a suboptimal user experience. The model IDs are randomized strings that add confusion to a given query. Additionally, search behavior has to change when the model is updated (the ID needs to be updated). While it may be possible to come up with some kind of alias scheme for the model ID (see opensearch-project/ml-commons#554), the best user experience would be for the user writing the query to not need to know any details about the model_id.
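For comparison, if an alias scheme like the one discussed in opensearch-project/ml-commons#554 were adopted, a query might look like the following. This is hypothetical syntax; the "model_alias" key and its value are illustrative only:

```json
  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "hello world",
        "model_alias": "my-sentence-transformer",
        "k": int
      }
    }
  }
```

Even with an alias, the query author still has to reference the model, which is what the solutions below try to avoid.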

Potential Solutions

Goal

We want to offer a user experience like this:

  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "hello world",
        "k": int
      }
    }
  }

Similarly, for indexing, the same information could be used if no model id is specified, so the experience would look like:

PUT _ingest/pipeline/<pipeline_name>
{
  "description": "string",
  "processors" : [
    {
      "text_embedding": {
        "field_map": {
           "<input_field_name>": "<output_field_name>",
           ...
        }
      }
    },
    ...
  ]
}

1. Rely on index meta field

In this option, we would associate the model mapping in a field in the _meta field of the index.

PUT my-neural-index
{
  "mappings": {
    "_meta": { 
      "neural_field_map": {
        "field_name_1": "<model_id>",
        "field_name_2": "<model_id>",
        "field_name_3": "<model_id>"
      }
    }
  }
}
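If this approach were adopted, the plugin (or a user) could read the map back through the standard mapping API; a sketch of the round trip, with the response abbreviated:

```json
GET my-neural-index/_mapping

{
  "my-neural-index": {
    "mappings": {
      "_meta": {
        "neural_field_map": {
          "field_name_1": "<model_id>",
          "field_name_2": "<model_id>",
          "field_name_3": "<model_id>"
        }
      }
    }
  }
}
```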

2. Make model map index settings

Similar to the _meta field, we could make the map an index setting (we would need to validate that settings can in fact be maps). Index settings would give us more control over validating input model IDs, as well as hooks to trigger actions when settings are updated.

PUT my-neural-index/_settings
{
    "index": { 
      "neural_field_map": {
        "field_name_1": "<model_id>",
        "field_name_2": "<model_id>",
        "field_name_3": "<model_id>"
      }
    }
}

3. Use system index

Using a system index is another approach to associating this model information with a given index. However, maintaining a system index is heavier than relying on a _meta field. This would require several APIs to manage the functionality. If we are to create a system index for model management, it would be better to group this functionality with ml-commons, which already has a model system index (see next option).

4. Rely on model index association during model management

Another alternative is to delegate the functionality of associating a model with an index/field/function to the model management APIs of ml-commons. In this solution, users would provide metadata during upload about which indices/fields/functions to associate a model with. This has the benefit of letting users abstract all model management (including association) behind the ml-commons APIs.

Requested Feedback

Currently, I am on the fence between the approaches of 1, 2 and 4 as my preferred solution and am looking for feedback on this. Additionally, if there are other alternative approaches we should consider, please post them here.

@jmazanec15 jmazanec15 added the Enhancements Increases software capabilities beyond original client specifications label Nov 16, 2022
@navneet1v
Collaborator

There is already an index settings object for every index. Is there a reason we are not considering that approach and instead falling back to something called _meta?

@jmazanec15
Member Author

@navneet1v good idea. Let me think about settings vs. _meta. I am not sure of the benefit of one over the other, but I will take a look.

@jmazanec15
Member Author

@navneet1v updated to include settings update.

@ylwu-amzn can you comment if approach 4 would be feasible or desirable in ml-commons?

@navneet1v
Collaborator

navneet1v commented Nov 17, 2022

Note: Below I have used model_id and model alias interchangeably. I am not favoring or saying that we should use model_id over model alias or vice-versa.

Some thoughts:

the best user experience would be for the user writing the query to not need to know any details about the model_id.

I don't feel the same on this point. The user who is writing the query should be aware of the model which is going to be used to convert the query string to embeddings.

As for whether that awareness should come via a model_id or some model alias: the model alias is a clear winner, as it is more user friendly.

@ylwu-amzn can you comment if approach 4 would be feasible or desirable in ml-commons?

Moving on to the solutions and alternatives: we should put some thought into how a model_id (or model alias) associates with the different components of a cluster, such as the index, processors, pipeline, fields, the models system index, the query, etc. Once we have clarity on that, we will be able to navigate to the right solution from the solutions listed above.

Example: if we say that the indexes a model can be used in should be a property of the model, then solution 4 (rely on model index association during model management) makes more sense. But if we say the model_id is a property/setting of an index, then it should live at the index level, and likewise for index fields.
One more dimension I can think of here is the usage of this model_id field. I think we should add some details around that too; it will help us make a better decision.

@jmazanec15
Member Author

I don't feel the same on this point. The user who is writing the query should be aware of the model which is going to be used to convert the query string to embeddings.

To clarify on this point: I think we would still allow model IDs to be passed with the query, but add a default.

we should put some thoughts around association of model_id(or model alias) with the different components of a cluster like index, processors, pipeline, fields, models system index, query etc.

Yes, good point; it's difficult to say. Thinking about this, I think on ingestion it's reasonable to expect the processor to own the model_id; this way, pipelines can be shared across indices. However, for search, in a lot of cases users will want to use the same model used for ingestion. The only connection I can think of between query and processor is to associate the model_id with an index/field combination in some way; this could be inside the mapping or outside of it. Alternatively, during the query we could try to read the model ID from the pipeline, but that tightly couples the pipeline with the query.
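For reference, the ingest-side ownership described above is how the text_embedding processor works: the processor itself carries the model_id, so the same pipeline can be attached to any index. A sketch (the pipeline name is illustrative):

```json
PUT _ingest/pipeline/nlp-ingest-pipeline
{
  "description": "Pipeline that owns the model_id, shareable across indices",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "<input_field_name>": "<output_field_name>"
        }
      }
    }
  ]
}
```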

One more dimension I can think of here is usage of this model_id field. I think we should add some details around that too. It will help us take better decision.

What do you mean by this?

@navneet1v
Collaborator

One more dimension I can think of here is usage of this model_id field. I think we should add some details around that too. It will help us take better decision.

What do you mean by this?

The idea here is that how the model_id gets used can also drive where the model_id should live as an attribute.

@navneet1v navneet1v added backlog All the backlog features should be marked with this label Features Introduces a new unit of functionality that satisfies a requirement v2.8.0 labels Mar 28, 2023
@vamshin vamshin removed the Features Introduces a new unit of functionality that satisfies a requirement label Mar 28, 2023
@navneet1v navneet1v removed the v2.8.0 label May 26, 2023
@vibrantvarun vibrantvarun self-assigned this Aug 1, 2023
@vamshin vamshin moved this to 2.10 (September 11th) in Vector Search RoadMap Aug 22, 2023
@vamshin vamshin moved this from 2.10 (September 11th) to 2.11(. ) in Vector Search RoadMap Aug 22, 2023
@vibrantvarun
Member

vibrantvarun commented Aug 31, 2023

Hi folks, I would like to propose three solutions for the above problem. Then we need to decide which option is most feasible and appropriate to implement, considering the pros and cons.

### Option 1:

A new search processor needs to be created and added to the search pipeline. When the user issues a search query with no model ID in the request, the search request processor is triggered and adds the model ID to the search request.

HLD: (diagram attached in the original issue)

LLD: (diagram attached in the original issue)

Solution Explanation:

  1. The client creates a search pipeline (e.g. nlp-search-pipeline) with a search request processor. (one-time process)

  2. The client updates the index settings with the default search pipeline, i.e. nlp-search-pipeline. (one-time process)

  3. The client issues a neural search query to OpenSearch. OpenSearch identifies it as a neural search query and invokes NeuralQueryBuilder. (If the model ID is not present, the model ID is set to N/A.)
    https://github.com/opensearch-project/OpenSearch/blob/bb7d23ca3c31140b03899d83648948dbe7229cf2/server/src/main/java/org/opensearch/search/builder/SearchSourceBuilder.java#L1183

    public static NeuralQueryBuilder fromXContent(XContentParser parser) throws IOException {

  4. At the transport layer, OpenSearch checks whether a default search pipeline exists; if it does, the processors are invoked.
    https://github.com/opensearch-project/OpenSearch/blob/bb7d23ca3c31140b03899d83648948dbe7229cf2/server/src/main/java/org/opensearch/action/search/TransportSearchAction.java#L396C11-L396C11

    The SearchRequestProcessor is applied and modifies the search request.
    https://github.com/opensearch-project/OpenSearch/blob/bb7d23ca3c31140b03899d83648948dbe7229cf2/server/src/main/java/org/opensearch/search/pipeline/Pipeline.java#L133C24-L133C24

Solution Cost:

Currently, when applying the default search pipeline:

  1. The cluster state API is invoked and the index metadata is fetched.
  2. From the index metadata, the index settings are fetched and checked for "index.search.default_pipeline".
    https://github.com/opensearch-project/OpenSearch/blob/bb7d23ca3c31140b03899d83648948dbe7229cf2/server/src/main/java/org/opensearch/search/pipeline/SearchPipelineService.java#L398C21-L398C21

After adding the default value processor:

  1. The default value processor is invoked.
  2. The processor updates the query.
  3. The query is executed.

Time Complexity:

For N queries against the cluster, the search request processor is called N times.

PROS:

  1. Search pipelines were created precisely to provide a way to modify the search request and search response as needed, so this use case fits them perfectly.
  2. The work to invoke the default search pipeline has already been done in TransportSearchAction.java, so no additional work is needed on the OpenSearch core side; the majority of the work will be on the plugin side.
  3. Currently, for N queries, the model ID is passed in N search requests. With this solution, the model ID needs to be passed only once, at the time of setting up the search pipeline.

Cons:

  1. The user has to do an initial setup of the search pipeline and update the index setting with the default search pipeline.
  2. The feasibility of the solution is the challenge. For example, if a user triggers a nested search query, OpenSearch core will convert it to a SearchSourceBuilder containing many internal low-level query builders (BoolQueryBuilder, NeuralQueryBuilder, etc.). On the plugin side, a SearchSourceBuilder will be passed in; the challenge is then to extract the right neural query builder and update its model ID.

### Option 2:

Add the default model ID and a field-level default model ID map in the index _meta field.

PUT my-neural-index
{
  "mappings": {
    "_meta": {
      "default_model_id":"<model_id>",
      "neural_field_map": {
        "field_name_1": "<model_id>",
        "field_name_2": "<model_id>",
        "field_name_3": "<model_id>"
      }
    }
  }
}

HLD: (diagram attached in the original issue)

LLD: (diagram attached in the original issue)

Solution Explanation:

  1. The client creates the index.
  2. The client adds the default model ID and the field default model ID map in the _meta field of the index.
  3. The client issues a neural search query.
  4. The OpenSearch REST handler calls the plugin to get the neural query builder. The builder-creation logic calls the GET /{index} API; the response's _meta field contains the default model ID, which is then added to the neural query builder.
  5. The neural query builder is returned to OpenSearch, which executes the search.

Cons:

  1. An additional limitation is that users may store their own metadata here, so it is potentially bad practice to store system info in the same place as user info.
  2. "_meta" is for application-specific data not related to OpenSearch.
  3. The _meta map is opaque to OpenSearch, so we have no way to enforce which keys should be present in it.
  4. We should also check feasibility here: we convert the string to embeddings in the doRewrite function (
    protected QueryBuilder doRewrite(QueryRewriteContext queryRewriteContext) {
    ), but in that function we have no access to the MapperService. If there is another way to get the MapperService to read this _meta map, it can work.

Time Complexity: For N queries, 2*N IndexMetadata calls will be made (one in OpenSearch core and one in the neural-search plugin).

### Option 3:

Add the default model ID and a field-level default model ID map in the index settings.

PUT my-neural-index/_settings
{
    "index": { 
      "neural_field_map": {
        "field_name_1": "<model_id>",
        "field_name_2": "<model_id>",
        "field_name_3": "<model_id>"
      }
    }
}

HLD: (diagram attached in the original issue)

LLD: (diagram attached in the original issue)

Solution Explanation:

  1. The client creates the index.
  2. The client adds the default model ID and the field default model ID map in the index settings.
  3. The client issues a neural search query.
  4. The OpenSearch REST handler calls the plugin to get the neural query builder. The builder-creation logic calls the GET /{index}/_settings API; the settings contain the default model ID, which is then added to the neural query builder.
  5. The neural query builder is returned to OpenSearch, which executes the search.

PROS:

  1. Easy to implement.
  2. The user experience is simple: just update the index settings, and everything else is taken care of internally when the search is executed.

CONS:

  1. I personally don't think index settings are the right place to keep model IDs.

Time Complexity: For N queries, 2*N getSettings calls will be made (one in OpenSearch core and one in the neural-search plugin).

@jmazanec15
Member Author

Some thoughts..

For option 1, could you include the API workflow for creating the default pipeline and then creating the index, i.e. the PUT and POST requests?

I am not inclined toward options 2 (as defined) and 3:
Option 3 (not inclined) will require us to create a setting that is a map; I cannot find any examples of this being done. Additionally, it puts values for a field not in the field definition but in the settings, which creates a weird interface:

PUT my-neural-index
{
  "settings": {
    "index": {
      "neural_field_map": {
        "field_name_1": "<model_id>",
        "field_name_2": "<model_id>",
        "field_name_3": "<model_id>"
      }
    }
  },
  "mappings": {
    "properties": {
      "field_name_1": {
        "type": "knn_vector",
        "dimension": 100
      },
      "field_name_2": {
        "type": "knn_vector",
        "dimension": 100
      },
      "field_name_3": {
        "type": "knn_vector",
        "dimension": 100
      }
    }
  }
}

This looks wrong. The default model should be defined where the field is defined.

Option 2 (not inclined): on similar grounds, I don't think we should do this at the _meta field level.

That being said, I think two candidates are possibilities:

  1. Option 1 - search pipeline
  2. Option 2 (modified) - use the meta field property inside knn fields:

  "mappings": {
    "properties": {
      "field_name_1": {
        "type": "knn_vector",
        "dimension": 100,
        "meta": {
          "neural.ingest_model": "sdjcvsdncksj",
          "neural.search_model": "sdjcvsdncksj"
        }
      },
      "field_name_2": {
        "type": "knn_vector",
        "dimension": 100,
        "meta": {
          "neural.ingest_model": "sdjcvsdncksj",
          "neural.search_model": "sdjcvsdncksj"
        }
      },
      "field_name_3": {
        "type": "knn_vector",
        "dimension": 100,
        "meta": {
          "neural.ingest_model": "sdjcvsdncksj",
          "neural.search_model": "sdjcvsdncksj"
        }
      }
    }
  }

For Option 2 (modified), you should look into the query rewrite logic to see whether we could inject the MapperService at rewrite time. Second, option 2 may conflict with other application types; I'm not sure how big a problem this is, so you might need to do more research here. That said, I like Option 2 (modified) the most from the user perspective.

@vibrantvarun
Member

vibrantvarun commented Aug 31, 2023

@jmazanec15

Extended part for Option 1, the API workflow:

1.

PUT /_search/pipeline/my_pipeline
{
  "request_processors": [
    {
      "default_query": {
        "tag": "tag1",
        "description": "This processor adds default model IDs to neural search requests",
        "modelId": "uaiidoashcfdpwscf",
        "neural_field_map": {
          "field_name_1": "<model_id>",
          "field_name_2": "<model_id>",
          "field_name_3": "<model_id>"
        }
      }
    }
  ]
}

2.

PUT /my-index/_settings
{
  "index": {
    "default_search_pipeline": "my_pipeline"
  }
}

Creating a search pipeline and creating an index are independent processes. However, a search pipeline is only executed if the index has the default search pipeline setting.
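Once that one-time setup is done, queries against my-index could omit the model ID entirely; for example (the field name is a placeholder):

```json
GET /my-index/_search
{
  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "hello world",
        "k": 10
      }
    }
  }
}
```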

@jmazanec15
Member Author

I see. I am probably in favor of Option 2 from the user perspective, but if everyone else is in favor of Option 1, I am okay with that. The good news is that we can always add the other option in the future if we get feedback on it, without breaking BWC.

@vibrantvarun
Member

@jmazanec15
I have checked the feasibility of the option where updating the meta mapping was discussed.

My findings are:

In order to fetch the meta mapping, we need a MapperService object from OpenSearch core in NeuralQueryBuilder.

Bringing a MapperService object into NeuralQueryBuilder is next to impossible, because to do that we would need to create one in createComponents, and creating a MapperService depends on QueryShardContext. QueryShardContext and all the search-related work live in OpenSearch core, so it is not possible to do this.

@vamshin vamshin changed the title Support default models for index/fields [Semantic Search] Support default models for index/fields Sep 7, 2023
@vibrantvarun
Member

The code is merged and the functionality will be released in 2.11.
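For readers landing here later: to my understanding, the shipped 2.11 implementation follows the search-pipeline approach, using a neural_query_enricher search request processor (verify the exact processor and parameter names against the 2.11 documentation). A setup along these lines:

```json
PUT /_search/pipeline/default_model_pipeline
{
  "request_processors": [
    {
      "neural_query_enricher": {
        "default_model_id": "<model_id>",
        "neural_field_default_id": {
          "<vector_field_1>": "<model_id>",
          "<vector_field_2>": "<model_id>"
        }
      }
    }
  ]
}

PUT /my-index/_settings
{
  "index": {
    "default_search_pipeline": "default_model_pipeline"
  }
}
```

With this in place, neural queries against my-index may omit model_id, and a model_id passed explicitly in the query still takes precedence over the defaults.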

@github-project-automation github-project-automation bot moved this from 2.11.0 (November 16th, 2023) to ✅ Done in Vector Search RoadMap Oct 3, 2023
@vibrantvarun vibrantvarun added the v2.11.0 Issues targeting release v2.11.0 label Oct 3, 2023
@vibrantvarun vibrantvarun reopened this Oct 3, 2023
@github-project-automation github-project-automation bot moved this from ✅ Done to 2.10 (September 11th, 2023) in Vector Search RoadMap Oct 3, 2023
@vibrantvarun vibrantvarun moved this from 2.10 (September 11th, 2023) to 2.11.0 (November 16th, 2023) in Vector Search RoadMap Oct 3, 2023
@heemin32 heemin32 closed this as completed Oct 4, 2023
@github-project-automation github-project-automation bot moved this from 2.11.0 (November 16th, 2023) to ✅ Done in Vector Search RoadMap Oct 4, 2023