
[Semantic Search] Support default models for index/fields #70

Closed
jmazanec15 opened this issue Nov 16, 2022 · 12 comments
Assignees
Labels
backlog All the backlog features should be marked with this label Enhancements Increases software capabilities beyond original client specifications neural-search semantic search v2.11.0 Issues targeting release v2.11.0

Comments

@jmazanec15
Member

jmazanec15 commented Nov 16, 2022

Problem Statement

Currently, the neural-search plugin search functionality relies on the user to pass the "model_id" with each query.

  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "hello world",
        "model_id": "csdsdcsdsadasdcsad",
        "k": int
      }
    }
  }
This offers a suboptimal user experience. The model IDs are randomized strings that add confusion to a given query. Additionally, search behavior has to change when the model is updated (the ID needs to be updated). While it may be possible to come up with some kind of alias scheme for the model ID (see opensearch-project/ml-commons#554), the best user experience would be for the user writing the query to not need to know any details about the model_id.
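For comparison, if an alias scheme like the one discussed in opensearch-project/ml-commons#554 were adopted, a query might look like the following. This is hypothetical syntax; the "model_alias" key and its value are illustrative only:

```json
  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "hello world",
        "model_alias": "my-sentence-transformer",
        "k": int
      }
    }
  }
```

Even with an alias, the query author still has to reference the model, which is what the solutions below try to avoid.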

Potential Solutions

Goal

We want to offer a user experience like this:

  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "hello world",
        "k": int
      }
    }
  }

Similarly, for indexing, the same information could be used if no model id is specified, so the experience would look like:

PUT _ingest/pipeline/<pipeline_name>
{
  "description": "string",
  "processors" : [
    {
      "text_embedding": {
        "field_map": {
           "<input_field_name>": "<output_field_name>",
           ...
        }
      }
    },
    ...
  ]
}

1. Rely on index meta field

In this option, we would associate the model mapping in a field in the _meta field of the index.

PUT my-neural-index
{
  "mappings": {
    "_meta": { 
      "neural_field_map": {
        "field_name_1": "<model_id>",
        "field_name_2": "<model_id>",
        "field_name_3": "<model_id>"
      }
    }
  }
}
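If this approach were adopted, the plugin (or a user) could read the map back through the standard mapping API; a sketch of the round trip, with the response abbreviated:

```json
GET my-neural-index/_mapping

{
  "my-neural-index": {
    "mappings": {
      "_meta": {
        "neural_field_map": {
          "field_name_1": "<model_id>",
          "field_name_2": "<model_id>",
          "field_name_3": "<model_id>"
        }
      }
    }
  }
}
```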

2. Make model map index settings

Similar to the _meta field, we could make the map an index setting (we would need to validate that settings can in fact be maps). Index settings would give us more control over validating input model IDs, as well as hooks to trigger actions when settings are updated.

PUT my-neural-index/_settings
{
    "index": { 
      "neural_field_map": {
        "field_name_1": "<model_id>",
        "field_name_2": "<model_id>",
        "field_name_3": "<model_id>"
      }
    }
}

3. Use system index

Using a system index is another approach to associating this model information with a given index. However, maintaining a system index is heavier than relying on a _meta field. This would require several APIs to manage the functionality. If we are to create a system index for model management, it would be better to group this functionality with ml-commons, which already has a model system index (see next option).

4. Rely on model index association during model management

Another alternative is to delegate the functionality of associating a model with an index/field/function to the model management APIs of ml-commons. In this solution, users would provide metadata during upload about which indices/fields/functions to associate a model with. This has the benefit of letting users abstract all model management (including association) behind the ml-commons APIs.

Requested Feedback

Currently, I am on the fence between the approaches of 1, 2 and 4 as my preferred solution and am looking for feedback on this. Additionally, if there are other alternative approaches we should consider, please post them here.

@jmazanec15 jmazanec15 added the Enhancements Increases software capabilities beyond original client specifications label Nov 16, 2022
@navneet1v
Collaborator

There is already an index settings object for every index. Is there a reason we are not considering that approach and instead falling back to something called _meta?

@jmazanec15
Member Author

@navneet1v good idea. Let me think about settings vs. _meta. I am not sure of the benefit of one over the other, but I will take a look.

@jmazanec15
Member Author

@navneet1v updated to include settings update.

@ylwu-amzn can you comment if approach 4 would be feasible or desirable in ml-commons?

@navneet1v
Collaborator

navneet1v commented Nov 17, 2022

Note: Below I have used model_id and model alias interchangeably. I am not favoring or saying that we should use model_id over model alias or vice-versa.

Some thoughts:

the best user experience would be for the user writing the query to not need to know any details about the model_id.

I don't feel the same on this point. The user who is writing the query should be aware of the model which is going to be used to convert the query string to embeddings.

As for whether that awareness should come via a model_id or some model alias: the model alias is a clear winner, as it is more user friendly.

@ylwu-amzn can you comment if approach 4 would be feasible or desirable in ml-commons?

Moving on to the solutions and alternatives: we should put some thought into how a model_id (or model alias) associates with the different components of a cluster, such as the index, processors, pipeline, fields, the models system index, the query, etc. Once we have clarity on that, we will be able to navigate to the right solution from the solutions listed above.

Example: if we say that the indexes a model can be used in should be a property of the model, then solution 4 (rely on model index association during model management) makes more sense. But if we say the model_id is a property/setting of an index, then it should live at the index level, and likewise for index fields.
One more dimension I can think of here is the usage of this model_id field. I think we should add some details around that too; it will help us make a better decision.

@jmazanec15
Member Author

I don't feel the same on this point. The user who is writing the query should be aware of the model which is going to be used to convert the query string to embeddings.

To clarify on this point: I think we would still allow model IDs to be passed with the query, but add a default.

we should put some thoughts around association of model_id(or model alias) with the different components of a cluster like index, processors, pipeline, fields, models system index, query etc.

Yes, good point; it's difficult to say. Thinking about this, I think on ingestion it's reasonable to expect the processor to own the model_id; this way, pipelines can be shared across indices. However, for search, in a lot of cases users will want to use the same model used for ingestion. The only connection I can think of between query and processor is to associate the model_id with an index/field combination in some way; this could be inside the mapping or outside of it. Alternatively, during the query we could try to read the model ID from the pipeline, but that tightly couples the pipeline with the query.
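For reference, the ingest-side ownership described above is how the text_embedding processor works: the processor itself carries the model_id, so the same pipeline can be attached to any index. A sketch (the pipeline name is illustrative):

```json
PUT _ingest/pipeline/nlp-ingest-pipeline
{
  "description": "Pipeline that owns the model_id, shareable across indices",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "<input_field_name>": "<output_field_name>"
        }
      }
    }
  ]
}
```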

One more dimension I can think of here is usage of this model_id field. I think we should add some details around that too. It will help us take better decision.

What do you mean by this?

@navneet1v
Collaborator

One more dimension I can think of here is usage of this model_id field. I think we should add some details around that too. It will help us take better decision.

What do you mean by this?

The idea here is that how the model_id gets used can also drive where the model_id should live as an attribute.

@navneet1v navneet1v added backlog All the backlog features should be marked with this label Features Introduces a new unit of functionality that satisfies a requirement v2.8.0 labels Mar 28, 2023
@vamshin vamshin removed the Features Introduces a new unit of functionality that satisfies a requirement label Mar 28, 2023
@navneet1v navneet1v removed the v2.8.0 label May 26, 2023
@vibrantvarun vibrantvarun self-assigned this Aug 1, 2023
@vamshin vamshin moved this to 2.10 (September 11th) in Vector Search RoadMap Aug 22, 2023
@vamshin vamshin moved this from 2.10 (September 11th) to 2.11(. ) in Vector Search RoadMap Aug 22, 2023
@vibrantvarun
Member

vibrantvarun commented Aug 31, 2023

Hi folks, I would like to propose three solutions for the above problem. Then we need to decide which option is most feasible and appropriate to implement, considering the pros and cons.

### Option 1:

A new search processor needs to be created and added to the search pipeline. When the user issues a search query with no model ID in the request, the search request processor is triggered and adds the model ID to the search request.

HLD: (diagram attached in the original issue)

LLD: (diagram attached in the original issue)

Solution Explanation:

  1. The client creates a search pipeline (e.g. nlp-search-pipeline) with a search request processor. (one-time process)

  2. The client updates the index settings with the default search pipeline, i.e. nlp-search-pipeline. (one-time process)

  3. The client issues a neural search query to OpenSearch. OpenSearch identifies it as a neural search query and invokes NeuralQueryBuilder. (If the model ID is not present, the model ID is set to N/A.)
    https://github.com/opensearch-project/OpenSearch/blob/bb7d23ca3c31140b03899d83648948dbe7229cf2/server/src/main/java/org/opensearch/search/builder/SearchSourceBuilder.java#L1183

    public static NeuralQueryBuilder fromXContent(XContentParser parser) throws IOException {

  4. At the transport layer, OpenSearch checks whether a default search pipeline exists; if it does, the processors are invoked.
    https://github.com/opensearch-project/OpenSearch/blob/bb7d23ca3c31140b03899d83648948dbe7229cf2/server/src/main/java/org/opensearch/action/search/TransportSearchAction.java#L396C11-L396C11

    The SearchRequestProcessor is applied and modifies the search request.
    https://github.com/opensearch-project/OpenSearch/blob/bb7d23ca3c31140b03899d83648948dbe7229cf2/server/src/main/java/org/opensearch/search/pipeline/Pipeline.java#L133C24-L133C24

Solution Cost:

Currently, when applying the default search pipeline:

  1. The cluster state API is invoked and the index metadata is fetched.
  2. From the index metadata, the index settings are fetched and checked for "index.search.default_pipeline".
    https://github.com/opensearch-project/OpenSearch/blob/bb7d23ca3c31140b03899d83648948dbe7229cf2/server/src/main/java/org/opensearch/search/pipeline/SearchPipelineService.java#L398C21-L398C21

After adding the default value processor:

  1. The default value processor is invoked.
  2. The processor updates the query.
  3. The query is executed.

Time Complexity:

For N queries against the cluster, the search request processor is called N times.

PROS:

  1. Search pipelines were created precisely to provide a way to modify the search request and search response as needed, so this use case fits them perfectly.
  2. The work to invoke the default search pipeline has already been done in TransportSearchAction.java, so no additional work is needed on the OpenSearch core side; the majority of the work will be on the plugin side.
  3. Currently, for N queries, the model ID is passed in N search requests. With this solution, the model ID needs to be passed only once, at the time of setting up the search pipeline.

Cons:

  1. The user has to do an initial setup of the search pipeline and update the index setting with the default search pipeline.
  2. The feasibility of the solution is the challenge. For example, if a user triggers a nested search query, OpenSearch core will convert it to a SearchSourceBuilder containing many internal low-level query builders (BoolQueryBuilder, NeuralQueryBuilder, etc.). On the plugin side, a SearchSourceBuilder will be passed in; the challenge is then to extract the right neural query builder and update its model ID.

### Option 2:

Add the default model ID and a field-level default model ID map in the index _meta field.

PUT my-neural-index
{
  "mappings": {
    "_meta": {
      "default_model_id":"<model_id>",
      "neural_field_map": {
        "field_name_1": "<model_id>",
        "field_name_2": "<model_id>",
        "field_name_3": "<model_id>"
      }
    }
  }
}

HLD: (diagram attached in the original issue)

LLD: (diagram attached in the original issue)

Solution Explanation:

  1. The client creates the index.
  2. The client adds the default model ID and the field default model ID map in the _meta field of the index.
  3. The client issues a neural search query.
  4. The OpenSearch REST handler calls the plugin to get the neural query builder. The builder-creation logic calls the GET /{index} API; the response's _meta field contains the default model ID, which is then added to the neural query builder.
  5. The neural query builder is returned to OpenSearch, which executes the search.

Cons:

  1. An additional limitation is that users may store their own metadata here, so it is potentially bad practice to store system info in the same place as user info.
  2. "_meta" is for application-specific data not related to OpenSearch.
  3. The _meta map is opaque to OpenSearch, so we have no way to enforce which keys should be present in it.
  4. We should also check feasibility here: we convert the string to embeddings in the doRewrite function (
    protected QueryBuilder doRewrite(QueryRewriteContext queryRewriteContext) {
    ), but in that function we have no access to the MapperService. If there is another way to get the MapperService to read this _meta map, it can work.

Time Complexity: For N queries, 2*N IndexMetadata calls will be made (one in OpenSearch core and one in the neural-search plugin).

### Option 3:

Add the default model ID and a field-level default model ID map in the index settings.

PUT my-neural-index/_settings
{
    "index": { 
      "neural_field_map": {
        "field_name_1": "<model_id>",
        "field_name_2": "<model_id>",
        "field_name_3": "<model_id>"
      }
    }
}

HLD: (diagram attached in the original issue)

LLD: (diagram attached in the original issue)

Solution Explanation:

  1. The client creates the index.
  2. The client adds the default model ID and the field default model ID map in the index settings.
  3. The client issues a neural search query.
  4. The OpenSearch REST handler calls the plugin to get the neural query builder. The builder-creation logic calls the GET /{index}/_settings API; the settings contain the default model ID, which is then added to the neural query builder.
  5. The neural query builder is returned to OpenSearch, which executes the search.

PROS:

  1. Easy to implement.
  2. The user experience is simple: just update the index settings, and everything else is taken care of internally when the search is executed.

CONS:

  1. I personally don't think index settings are the right place to keep model IDs.

Time Complexity: For N queries, 2*N getSettings calls will be made (one in OpenSearch core and one in the neural-search plugin).

@jmazanec15
Member Author

Some thoughts..

For option 1, could you include the API workflow for creating the default pipeline and then creating the index, i.e. the PUT and POST requests?

I am not inclined toward options 2 (as defined) and 3:
Option 3 (not inclined) will require us to create a setting that is a map; I cannot find any examples of this being done. Additionally, it puts values for a field not in the field definition but in the settings, which creates a weird interface:

PUT my-neural-index
{
  "settings": {
    "index": {
      "neural_field_map": {
        "field_name_1": "<model_id>",
        "field_name_2": "<model_id>",
        "field_name_3": "<model_id>"
      }
    }
  },
  "mappings": {
    "properties": {
      "field_name_1": {
        "type": "knn_vector",
        "dimension": 100
      },
      "field_name_2": {
        "type": "knn_vector",
        "dimension": 100
      },
      "field_name_3": {
        "type": "knn_vector",
        "dimension": 100
      }
    }
  }
}

This looks wrong. The default model should be defined where the field is defined.

Option 2 (not inclined): on similar grounds, I don't think we should do this at the _meta field level.

That being said, I think two candidates are possibilities:

  1. Option 1 - search pipeline
  2. Option 2 (modified) - use the meta field property inside knn fields:

  "mappings": {
    "properties": {
      "field_name_1": {
        "type": "knn_vector",
        "dimension": 100,
        "meta": {
          "neural.ingest_model": "sdjcvsdncksj",
          "neural.search_model": "sdjcvsdncksj"
        }
      },
      "field_name_2": {
        "type": "knn_vector",
        "dimension": 100,
        "meta": {
          "neural.ingest_model": "sdjcvsdncksj",
          "neural.search_model": "sdjcvsdncksj"
        }
      },
      "field_name_3": {
        "type": "knn_vector",
        "dimension": 100,
        "meta": {
          "neural.ingest_model": "sdjcvsdncksj",
          "neural.search_model": "sdjcvsdncksj"
        }
      }
    }
  }

For Option 2 (modified), you should look into the query rewrite logic to see whether we could inject the MapperService at rewrite time. Second, option 2 may conflict with other application types; I'm not sure how big a problem this is, so you might need to do more research here. That said, I like Option 2 (modified) the most from the user perspective.

@vibrantvarun
Member

vibrantvarun commented Aug 31, 2023

@jmazanec15

Extended part for Option 1, the API workflow:

1.

PUT /_search/pipeline/my_pipeline
{
  "request_processors": [
    {
      "default_query": {
        "tag": "tag1",
        "description": "This processor adds default model IDs to neural search requests",
        "modelId": "uaiidoashcfdpwscf",
        "neural_field_map": {
          "field_name_1": "<model_id>",
          "field_name_2": "<model_id>",
          "field_name_3": "<model_id>"
        }
      }
    }
  ]
}

2.

PUT /my-index/_settings
{
  "index": {
    "default_search_pipeline": "my_pipeline"
  }
}

Creating a search pipeline and creating an index are independent processes. However, a search pipeline is only executed if the index has the default search pipeline setting.
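Once that one-time setup is done, queries against my-index could omit the model ID entirely; for example (the field name is a placeholder):

```json
GET /my-index/_search
{
  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "hello world",
        "k": 10
      }
    }
  }
}
```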

@jmazanec15
Member Author

I see. I am probably in favor of Option 2 from the user perspective, but if everyone else is in favor of Option 1, I am okay with that. The good news is that we can always add the other option in the future if we get feedback on it, without breaking BWC.

@vibrantvarun
Member

@jmazanec15
I have checked the feasibility of the option where updating the meta mapping was discussed.

My findings are:

In order to fetch the meta mapping, we need a MapperService object from OpenSearch core in NeuralQueryBuilder.

Bringing a MapperService object into NeuralQueryBuilder is next to impossible, because to do that we would need to create one in createComponents, and creating a MapperService depends on QueryShardContext. QueryShardContext and all the search-related work live in OpenSearch core, so it is not possible to do this.

@vamshin vamshin changed the title Support default models for index/fields [Semantic Search] Support default models for index/fields Sep 7, 2023
@vibrantvarun
Member

The code is merged and the functionality will be released in 2.11.
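For readers landing here later: to my understanding, the shipped 2.11 implementation follows the search-pipeline approach, using a neural_query_enricher search request processor (verify the exact processor and parameter names against the 2.11 documentation). A setup along these lines:

```json
PUT /_search/pipeline/default_model_pipeline
{
  "request_processors": [
    {
      "neural_query_enricher": {
        "default_model_id": "<model_id>",
        "neural_field_default_id": {
          "<vector_field_1>": "<model_id>",
          "<vector_field_2>": "<model_id>"
        }
      }
    }
  ]
}

PUT /my-index/_settings
{
  "index": {
    "default_search_pipeline": "default_model_pipeline"
  }
}
```

With this in place, neural queries against my-index may omit model_id, and a model_id passed explicitly in the query still takes precedence over the defaults.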

@github-project-automation github-project-automation bot moved this from 2.11.0 (November 16th, 2023) to ✅ Done in Vector Search RoadMap Oct 3, 2023
@vibrantvarun vibrantvarun added the v2.11.0 Issues targeting release v2.11.0 label Oct 3, 2023
@vibrantvarun vibrantvarun reopened this Oct 3, 2023
@github-project-automation github-project-automation bot moved this from ✅ Done to 2.10 (September 11th, 2023) in Vector Search RoadMap Oct 3, 2023
@vibrantvarun vibrantvarun moved this from 2.10 (September 11th, 2023) to 2.11.0 (November 16th, 2023) in Vector Search RoadMap Oct 3, 2023
@heemin32 heemin32 closed this as completed Oct 4, 2023
@github-project-automation github-project-automation bot moved this from 2.11.0 (November 16th, 2023) to ✅ Done in Vector Search RoadMap Oct 4, 2023