[FEATURE] Refactor on data validation and extraction from customer's documents in several processors #660

zane-neo · 2024-04-02T02:45:57Z

Is your feature request related to a problem?

Background

Currently in neural search, we validate and extract data from user's index documents and send them to model for inference. We used a recursive approach to check the data structure based on user's configuration. For example, below is the user's configuration for text_embedding and the field_map is the important part which represents a mapping relation between the original key and the target key(embedding field key).

{
    "text_embedding": {
        "model_id": "WYjkv4MBHcWxVq8Jtc8U",
        "field_map": {
            "title": "title_knn",
            "favor_list": "favor_list_knn",
            "favorites": {
                "game": "game_knn",
                "movie": "movie_knn"
            }
        }
    }
}

Above configuration assume user has original document in below structure:

{
    "title": "content of title", // raw string
    "favor_list": ["content of each element", "content of each element", ...], // list of string
    "favorites": {
        "game": "game content", // map type with leaf string type
        "movie": "movie content"
    }
}

We support raw string, map type with leaf string type, and list of string, list of map with leaf string type.

Problem statement

Several processors are using the same configuration to validate and extract the field content: InferenceProcessor, TextImageEmbeddingProcessor and TextChunkingProcessor etc which causes duplicate code among these classes.
And a more critical issue is when we need to implement new features for this validation and extraction, we need to duplicate that as well, e.g.: #110 this issue requested to add dot support for user's configuration, and we need to implement for this in multiple places which is a bad smell and it's time to refactor our code.

Proposal

The proposal is that we can extract these code in a common place and by designing them reasonably we can reduce the code duplication and all the enhancement goes a single place in future.

We should note that not every piece can be made common since different processor has different logic, e.g. TextImageEmbeddingProcessor and InferenceProcessor's buildxxxKeyAndOriginalValue, since they support different type, it's not easy to make this part reusable, so in this case we will have to make some abstract methods and different processor has their own different implementation. This case we need to add an abstract class and different processors need to extend it.

Some code are almost same, e.g. TextImageEmbeddingProcessor and InferenceProcessor and TextChunkProcessor's validateEmbeddingFieldsValue/validateFieldsValue method, they're almost exactly same with only minor differences, in this case we can extract the code to a common class a common method with adding slightly change and with combination approach we can reduce the code duplication.

What solution would you like?

A clear and concise description of what you want to happen.

What alternatives have you considered?

A clear and concise description of any alternative solutions or features you've considered.

Do you have any additional context?

Add any other context or screenshots about the feature request here.

The text was updated successfully, but these errors were encountered:

zane-neo · 2024-04-02T02:46:54Z

This is a very early idea of refactoring code doesn't mean above statement will be the final change, let's have more discussion under this issue especially related code owners, would like to hear your voice on this.

navneet1v · 2024-04-02T02:58:03Z

@zane-neo can we have a utility exposed from opensearch-core for this? @mingshl was also doing something similar for MLInference processor. I would recommend we coming up with a common utility that we can leverage across different processors

zane-neo · 2024-04-02T04:02:46Z

@zane-neo can we have a utility exposed from opensearch-core for this? @mingshl was also doing something similar for MLInference processor. I would recommend we coming up with a common utility that we can leverage across different processors

That make sense since other processors might need same support for customer's configuration. We can consider this overall and make it more generic across different processors.

zane-neo · 2024-04-02T09:47:44Z

Checked on @mingshl 's issue and PR again and it seems her proposal is targeted to a different case, simply put:
processor based on remote model connector with user defined painless scripts, so only remote model is able to use that. Our case is different since we're support both remote model and local model and doesn't base on connector configuration nor painless scripts, which is a more generic case that can be applied to even non inference processors. @mingshl please correct me if any misunderstanding.

mingshl · 2024-04-03T16:54:36Z

@zane-neo the dot path notation for nested object would be supported in ml inference processors.

we first support remote models, but the future development will include local models. The reason behind that is the input datasets for local models are planning to be unified similar to remote models, and it will be convenient to be supported for many features, including ml inference processors.

mingshl · 2024-04-03T16:58:19Z

Checked on @mingshl 's issue and PR again and it seems her proposal is targeted to a different case, simply put: processor based on remote model connector with user defined painless scripts, so only remote model is able to use that. Our case is different since we're support both remote model and local model and doesn't base on connector configuration nor painless scripts, which is a more generic case that can be applied to even non inference processors. @mingshl please correct me if any misunderstanding.

I think there are some diverges here, so in ml inference processors , the main responsibility is to call inference and add the results to the documents or search responses. The only required field is model_id, so the use cases have to be using models to inference. And the ml inference processors itself doesn't need to write painless script.

zane-neo · 2024-04-07T02:13:45Z

Checked on @mingshl 's issue and PR again and it seems her proposal is targeted to a different case, simply put: processor based on remote model connector with user defined painless scripts, so only remote model is able to use that. Our case is different since we're support both remote model and local model and doesn't base on connector configuration nor painless scripts, which is a more generic case that can be applied to even non inference processors. @mingshl please correct me if any misunderstanding.

I think there are some diverges here, so in ml inference processors , the main responsibility is to call inference and add the results to the documents or search responses. The only required field is model_id, so the use cases have to be using models to inference. And the ml inference processors itself doesn't need to write painless script.

Yeah, painless scripts are part of the connector which is conforms to current remote inference implementation. The main point here is the ML inference processor is based on connector which is a ml-commons specific concept so this part isn't in the scope if we want a generic data structure parser in opensearch core.

naveentatikonda · 2024-09-18T00:56:51Z

Closing this issue...feel free to open if there are any concerns

zane-neo added untriaged enhancement labels Apr 2, 2024

navneet1v removed the untriaged label Apr 2, 2024

zane-neo changed the title ~~[FEATURE] Refactor on data validation and extraction from customer's index documents in multiple processors~~ [FEATURE] Refactor on data validation and extraction from customer's documents in several processors Apr 2, 2024

zane-neo mentioned this issue Apr 9, 2024

[Feature Request] Mapping fashion configuration for pipeline processors opensearch-project/OpenSearch#13128

Closed

naveentatikonda closed this as completed Sep 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] Refactor on data validation and extraction from customer's documents in several processors #660

[FEATURE] Refactor on data validation and extraction from customer's documents in several processors #660

zane-neo commented Apr 2, 2024

zane-neo commented Apr 2, 2024

navneet1v commented Apr 2, 2024

zane-neo commented Apr 2, 2024

zane-neo commented Apr 2, 2024

mingshl commented Apr 3, 2024

mingshl commented Apr 3, 2024

zane-neo commented Apr 7, 2024

naveentatikonda commented Sep 18, 2024

[FEATURE] Refactor on data validation and extraction from customer's documents in several processors #660

[FEATURE] Refactor on data validation and extraction from customer's documents in several processors #660

Comments

zane-neo commented Apr 2, 2024

Is your feature request related to a problem?

Background

Problem statement

Proposal

What solution would you like?

What alternatives have you considered?

Do you have any additional context?

zane-neo commented Apr 2, 2024

navneet1v commented Apr 2, 2024

zane-neo commented Apr 2, 2024

zane-neo commented Apr 2, 2024

mingshl commented Apr 3, 2024

mingshl commented Apr 3, 2024

zane-neo commented Apr 7, 2024

naveentatikonda commented Sep 18, 2024