Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] MLInferenceIngestProcessor has xContentRegistry as null #3276

Open
brianf-aws opened this issue Dec 13, 2024 · 4 comments · May be fixed by #3281
Open

[BUG] MLInferenceIngestProcessor has xContentRegistry as null #3276

brianf-aws opened this issue Dec 13, 2024 · 4 comments · May be fixed by #3281
Assignees
Labels
bug Something isn't working

Comments

@brianf-aws
Copy link
Contributor

brianf-aws commented Dec 13, 2024

What is the bug?
Commit that introduces local models to MLInferenceIngestProcessor

On creation of the Ingest processor the MLInferenceIngestProcessor has no valid xContentRegistry which is used to create valid MLInput objects used for model prediction

Contrast this with the MLInferenceSearchResponseProcessor That has a valid contentRegistry, this processor is able to correctly make objects and can produce valid model input.

If you want to test this locally I reccomend you follow the steps to create a local embedding

Then run the following

PUT {{ _.domain }}/_ingest/pipeline/asymmetric_embedding_ingest_pipeline
{
	"description": "ingest passage text and generate a embedding using an asymmetric model",
	"processors": [
        {
	        "ml_inference": {
        
		        "model_input": "{\"text_docs\":[\"${input_map.text_docs}\"],\"target_response\":[\"sentence_embedding\"],\"parameters\":{\"content_type\":\"query\"}}",
		        "function_name": "text_embedding",
		        "model_id": "{{ _.model_id }}",
		        "input_map": [
			        {
				        "text_docs": "description"
			        }
		        ],
		        "output_map": [
			        {
				        "fact_embedding": "$.inference_results.*.output.*.data",
				        "embedding_size": "$.inference_results.*.output.*.shape[0]"
			        }
		        ]
	        }
        }
	]
}


### -------------------------- ack expected 
POST {{ _.domain }}/_ingest/pipeline/asymmetric_embedding_ingest_pipeline/_simulate
{
	"docs": [
          {
	          "_index": "my-index",
	          "_id": "1",
	          "_source": {
		          "title": "Central Park",
		          "description": "A large public park in the heart of New York City, offering a wide range of recreational activities.",
		          "content_type": "passage"
	          }
          }
	]
}

What is the expected behavior?
A proper embedding should appear as a result of the simulating the ingest pipeline.

What is your host/environment?

  • OS: 2.18
  • Version [e.g. 22]
  • Plugins
@brianf-aws brianf-aws added bug Something isn't working untriaged labels Dec 13, 2024
@brianf-aws
Copy link
Contributor Author

brianf-aws commented Dec 14, 2024

After investigating, I realized that this requires a open-search core code change along with a minor Ml-Commons change. More specifically, it requires that the IngestService class change its constructor to have a content registry object as a new parameter.
This will obviously cause a cascading amount of changes related to ingest service. However, this sort of code change was done previously as shown here which will pass indiceServices to ingest processors by modifying the node class.

new MLInferenceIngestProcessor.Factory(parameters.scriptService, parameters.client, xContentRegistry)

This is the most sensible solution as any new ingest processor can benefit from making objects using the xContentRegistry provided by the Node class. This also benefits the MlInferenceIngest processor to create custom ML Inputs based on any new model that comes up as shown by the Asymmetric Model Type .

Are there any easier ways to solve this issue?

Currently we can not circumvent this by using the createComponents method within MLPlugin

public Collection<Object> createComponents(
Client client,
ClusterService clusterService,
ThreadPool threadPool,
ResourceWatcherService resourceWatcherService,
ScriptService scriptService,
NamedXContentRegistry xContentRegistry,
Environment environment,
NodeEnvironment nodeEnvironment,
NamedWriteableRegistry namedWriteableRegistry,
IndexNameExpressionResolver indexNameExpressionResolver,
Supplier<RepositoriesService> repositoriesServiceSupplier
) {

as the creation of the ingestProcessor gets called before the createComponents method
public Map<String, org.opensearch.ingest.Processor.Factory> getProcessors(org.opensearch.ingest.Processor.Parameters parameters) {

One way to easily remove the need for the code change is to change the order of creation with the node class. But this is a terrible idea as we will depend on the order of code (This is a huge headache for future contributors who decide to modify the relavent code discussed) to make sure everything is correct see below.

This is the order the code executes which explains why the xContentRegistry is null when creating the ingest processor

  1. IngestService gets invoked (first )calling the factories of the processors this includes the MlInferenceIngestProcessor at which time in point the xContentRegistry is null
  2. The plugins load their configurations (more specifically the MLPlugin), which service the xContentRegistry therefore the processor doesnt have the valid object in time.

@austintlee
Copy link
Collaborator

I think you should start with a PR in core and get feedback. Just to be clear, you want to introduce a new xContentRegistry member in org.opensearch.ingest.Processor.Parameters class, right? That's exactly what SearchPipeline is doing.

@brianf-aws
Copy link
Contributor Author

Hey Austin! @austintlee, Yes thats part of it; this involves several code changes

  1. IngestService has a xContentRegistry paramater
  2. Give Ingest Processor Parameters a new xContentRegistry field, you might get mixed up since the Search pipeline has a same class name but it has the xContentRegistry which is why its able to work
  3. Now within the MLPlugins class we do the following (what you suggested)
    new MLInferenceIngestProcessor.Factory(parameters.scriptService, parameters.client, xContentRegistry)
 new MLInferenceIngestProcessor.Factory(parameters.scriptService, parameters.client, paramters.xContentRegistry) 

@brianf-aws brianf-aws linked a pull request Dec 16, 2024 that will close this issue
5 tasks
@brianf-aws
Copy link
Contributor Author

brianf-aws commented Dec 16, 2024

Currently since the implementation of ingestProcessors have a null content registry we can force the xContentRegistry to have the registry as specified by the getNamedXContent() method

public List<NamedXContentRegistry.Entry> getNamedXContent() {

this is a temporary ML-Commons unblocker for using local models but I would need to have a OpenSearch core fix which would take awhile.

Here is the plan of action to make this happen. Within the specified order.

  • ML-Commons Temporary unblocker for parametrized local models to be used for MLInferenceIngest Processor : Adds preset contentRegistry for IngestProcessors #3281
  • OpenSearch Core change : [PR to be shared]
  • ML-Commons change with contentRegistry coming from core. : [PR to be shared]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: In Progress
Development

Successfully merging a pull request may close this issue.

3 participants