[Semantic Search] Support default models for index/fields #70
Comments
An index settings object is already present for every index. Is there any reason we are not considering that approach and are instead falling back to something called _meta?
@navneet1v good idea. Let me think about settings vs. _meta. I am not sure of the benefit of one over the other, but I will take a look.
@navneet1v updated to include settings update. @ylwu-amzn can you comment on whether approach 4 would be feasible or desirable in ml-commons?
Note: Below I have used model_id and model alias interchangeably. I am not saying that we should favor model_id over model alias or vice versa. Some thoughts:
I don't feel the same on this point. The user who is writing the query should be aware of the model that is going to be used to convert the query string to embeddings. Whether that awareness comes through a model_id or a model alias is a separate question; there, a model alias is the clear winner because it is more user friendly.
Moving on to the solutions and the alternatives, we should put some thought into how model_id (or model alias) is associated with the different components of a cluster: index, processors, pipeline, fields, the models system index, query, etc. Once we have clarity on that, we will be able to navigate to the right solution from the ones listed above. Example: if we say that the indexes a model can be used with are a property of the model, then solution 4 (Rely on model index association during model management) makes more sense. But if we say model_id is a property/setting of an index, then it should be at the index level. The same reasoning applies to index fields.
To clarify on this point, I think we would still allow modelIDs to be passed with query, but add the default.
Yes, good point, it's difficult to say. Thinking about this, I think on ingestion it's reasonable to expect the processor to own the model_id. This way, pipelines can be shared across indices. However, for search, in a lot of cases users will want to use the same model that was used for ingestion. The only connection I can think of between query and processor is to associate model_id with the index/field combination in some way; this could be inside the mapping or outside of it. Alternatively, during query, we could try to read the model id from the pipeline, but this tightly couples the pipeline with the query.
What do you mean by this?
The idea here is that how the model_id is used can also drive where the model_id should live as an attribute.
Hi folks, I would like to propose 3 solutions for the above problem. We then need to decide which option is the most feasible and appropriate to implement, considering the pros and cons.
### Option 1:
A new search request processor is created and added to the search pipeline. When the user sends a search query with no model ID in the request, the search request processor is triggered and adds the model ID to the search request.
HLD:
LLD:
Solution Explanation:
Solution Cost: Currently when applying default search pipeline.
After adding default value processor
Time Complexity: For N queries against the cluster, the search request processor is called N times.
Pros:
Cons:
### Option 2:
Add the default model ID and a field-level default model ID map to the index _meta field.
Solution Explanation:
Cons:
Time Complexity: For N queries, 2*N IndexMetaData calls are made (one in OpenSearch Core and one in the Neural Search plugin).
### Option 3:
Add the default model ID and a field-level default model ID map to the index settings.
Solution Explanation:
Pros:
Cons:
Time Complexity: For N queries, 2*N getSettings calls are made (one in OpenSearch Core and one in the Neural Search plugin).
Some thoughts. For Option 1, could you include the API workflow for creating the default pipeline and then creating the index, i.e. the PUT and POST requests? I am not inclined toward Option 2 (as defined) or Option 3:
This looks wrong. The default model should be defined where the field is defined. Option 2 (Not inclined): on similar grounds, I don't think we should do this at the meta field level. That being said, I think 2 candidates are possibilities:
For Option 2 (modified), you should look into the query rewrite logic to see whether we could make a change to inject the MapperService at rewrite time. Second, Option 2 may conflict with other application types. I'm not sure how big a problem this is; you might need to do more research here. That being said, I like Option 2 (modified) the most from the user perspective.
Extended part for Option 1:
Creating a search pipeline and creating an index are independent processes. However, the search pipeline is only executed if the index has the default search pipeline setting pointing to it.
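To make the Option 1 flow concrete, here is a minimal sketch in Python rather than an actual OpenSearch processor. It assumes the processor only fills in `model_id` on `neural` clauses that lack one, and that it runs only when the index names a default search pipeline. The function names, the pipeline name, and the `default_model_id` config key are hypothetical; only the `index.search.default_pipeline` setting and the `neural` query clause are taken from OpenSearch.

```python
from copy import deepcopy
from typing import Any, Dict


def inject_default_model_id(query: Dict[str, Any], default_model_id: str) -> Dict[str, Any]:
    """Return a copy of `query` where neural clauses without model_id get the default."""
    result = deepcopy(query)  # never mutate the caller's request body

    def visit(node: Any) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "neural" and isinstance(value, dict):
                    # A neural clause maps field name -> parameters.
                    for params in value.values():
                        if isinstance(params, dict):
                            # An explicit model_id in the query always wins.
                            params.setdefault("model_id", default_model_id)
                else:
                    visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(result)
    return result


def run_search(index_settings: Dict[str, str],
               pipelines: Dict[str, Dict[str, str]],
               query: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the processor only if the index points at a default search pipeline."""
    pipeline_name = index_settings.get("index.search.default_pipeline")
    if pipeline_name is None:
        return query  # no default pipeline configured: request is left untouched
    # "default_model_id" is a hypothetical config key for this sketch.
    return inject_default_model_id(query, pipelines[pipeline_name]["default_model_id"])
```

With this shape, the pipeline can be created once and shared; an index opts in only by setting its default search pipeline, which matches the independence described above.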
I see. I am probably in favor of Option 2 from a user perspective. But if everyone else is in favor of Option 1, I am okay with that. The good news is that we can always add the other option in the future if we get feedback on it, without breaking BWC.
@jmazanec15 My findings are: to fetch the _meta mapping, we need a MapperService object from OpenSearch in NSQueryBuilder. Bringing a MapperService object into NSQueryBuilder is next to impossible, because to do that we would need to create the MapperService object in createComponents. Moreover, creating a MapperService object depends on QueryShardContext. QueryShardContext and all search-related work are handled in OpenSearch itself, so it is not possible to do this from the plugin.
The code is merged and the functionality will be released in 2.11.
Problem Statement
Currently, the neural-search plugin search functionality relies on the user to pass the "model_id" with each query.
This offers a suboptimal user experience. The model IDs are randomized strings that add confusion to a given query. Additionally, search behavior has to change when the model is updated (the ID needs to be updated). While it may be possible to come up with some kind of alias scheme for the model ID (see opensearch-project/ml-commons#554), the best user experience would be for the user writing the query to not need to know any details about the model_id.
Potential Solutions
Goal
We want to offer a user experience like this:
Similarly, for indexing, the same information could be used if no model id is specified, so the experience would look like:
1. Rely on index meta field
In this option, we would associate the model mapping in a field in the _meta field of the index.
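As a rough illustration of what a `_meta`-based association could look like, the sketch below stores an index-level default and a per-field map under `_meta` and resolves a model ID from it. The `neural_models` key, its layout, the model IDs, and the `resolve_model_id` helper are all assumptions made for this example; nothing here is an existing plugin API.

```python
from typing import Any, Dict, Optional

# Hypothetical mapping body. Only "_meta" and "properties" are standard mapping
# keys; the "neural_models" block and its layout are assumptions.
EXAMPLE_MAPPING: Dict[str, Any] = {
    "_meta": {
        "neural_models": {
            "default_model_id": "model-A",
            "field_model_ids": {"title_embedding": "model-B"},
        }
    },
    "properties": {
        "title_embedding": {"type": "knn_vector", "dimension": 384},
        "body_embedding": {"type": "knn_vector", "dimension": 384},
    },
}


def resolve_model_id(mapping: Dict[str, Any],
                     field: str,
                     query_model_id: Optional[str] = None) -> str:
    """Pick a model ID: explicit query value, then field default, then index default."""
    if query_model_id is not None:
        return query_model_id
    models = mapping.get("_meta", {}).get("neural_models", {})
    field_default = models.get("field_model_ids", {}).get(field)
    if field_default is not None:
        return field_default
    index_default = models.get("default_model_id")
    if index_default is None:
        raise ValueError(f"no model_id given and no default configured for field '{field}'")
    return index_default
```

Because `_meta` is free-form, nothing validates the IDs stored there; a lookup like this would only discover a stale or mistyped ID at query time.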
2. Make model map index settings
Similar to _meta field, we could make the map an index setting (would need to validate that settings can in fact be maps). Index settings would give us more control over validation of input model ids as well as some hooks to trigger actions when settings are updated.
3. Use system index
Using a system index is another approach to associating this model information with a given index. However, maintaining a system index is heavier than relying on a _meta field. This would require several APIs to manage the functionality. If we are to create a system index for model management, it would be better to group this functionality with ml-commons, which already has a model system index (see next option).
4. Rely on model index association during model management
Another alternative is to delegate the association of a model with an index/field/function to the model management APIs of ml-commons. In this solution, users would provide metadata during upload about which indices/fields/functions to associate a model with. This has the benefit of letting users handle all model management (including association) through the ml-commons APIs.
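A sketch of how the lookup might work if the association lived on the model side. It assumes each uploaded model carries a list of (index, fields) associations and that a field-specific association beats an index-wide one. The `ModelAssociation` type and `find_model` function are hypothetical and do not correspond to any existing ml-commons API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ModelAssociation:
    """Hypothetical metadata a user would supply when uploading a model."""
    model_id: str
    index: str
    fields: Tuple[str, ...] = ()  # empty tuple: applies to every field in the index


def find_model(associations: List[ModelAssociation],
               index: str,
               field_name: str) -> Optional[str]:
    """Return the model associated with index/field, or None if there is none."""
    in_index = [a for a in associations if a.index == index]
    specific = [a.model_id for a in in_index if field_name in a.fields]
    index_wide = [a.model_id for a in in_index if not a.fields]
    # Prefer field-level associations; fall back to index-wide ones.
    for candidates in (specific, index_wide):
        if len(candidates) > 1:
            raise ValueError(f"ambiguous models for {index}/{field_name}: {candidates}")
        if candidates:
            return candidates[0]
    return None
```

One consequence of this design is that two models can claim the same index/field, so some conflict rule (here, raising an error) has to be chosen at upload or lookup time.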
Requested Feedback
Currently, I am on the fence between approaches 1, 2, and 4 as my preferred solution and am looking for feedback on this. Additionally, if there are other alternative approaches we should consider, please post them here.