[ML] Simplify the Inference Ingest Processor configuration #100205

Merged · 6 commits merged into elastic:main on Oct 3, 2023

Conversation

davidkyle (Member)

Introduces a new way to configure which fields are passed for inference and where the results are written.

  • Removes the need to use a field_map to rename a field
  • The output location is configured in one place, next to the input
  • Multiple fields from the ingest document can be selected for inference

The new configuration does not support data frame analytics; those models accept the full ingest document rather than specific fields.

Example: Process the field body and write the results to body_tokens

{
    "processors": [
        {
            "inference": {
                "model_id": "elser_model",
                "input_output": [
                    {
                        "input_field": "body",
                        "output_field": "body_tokens"
                    }
                ]
            }
        }
    ]
}
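
As a rough usage sketch (not part of this PR), the processor above can be exercised with the simulate pipeline API, assuming a trained model deployment named elser_model is running:

POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "processors": [
            {
                "inference": {
                    "model_id": "elser_model",
                    "input_output": [
                        {
                            "input_field": "body",
                            "output_field": "body_tokens"
                        }
                    ]
                }
            }
        ]
    },
    "docs": [
        { "_source": { "body": "a short piece of text to expand" } }
    ]
}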

Example: Multiple input fields can be specified:

{
    "processors": [
        {
            "inference": {
                "model_id": "elser_model",
                "input_output": [
                    {
                        "input_field": "body",
                        "output_field": "body_tokens"
                    },
                    {
                        "input_field": "title",
                        "output_field": "title_tokens"
                    }
                ]
            }
        }
    ]
}

github-actions bot (Contributor) commented Oct 3, 2023

Documentation preview:

elasticsearchmachine (Collaborator)

Hi @davidkyle, I've created a changelog YAML for you.

davidkyle marked this pull request as ready for review on October 3, 2023, 15:51
elasticsearchmachine added the Team:ML (Meta label for the ML team) label on Oct 3, 2023
elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

davidkyle added the cloud-deploy (Publish cloud docker image for Cloud-First-Testing) label on Oct 3, 2023
@jonathan-buttner (Contributor) left a comment:

Left a couple of comments; I'm not sure they're worth keeping the PR from being merged, though.

@@ -74,6 +74,12 @@ public Map<String, Object> asMap() {
return asMap;
}

@Override
public Map<String, Object> asMap(String outputField) {
// errors do not have a result
Contributor:

Should this throw like RawInferenceResults does?

davidkyle (Member, Author):

RawInferenceResults is used internally by the tree ensembles; it should never be seen by the outside world, which is why it throws. The comment here isn't very clear, but the idea is that we don't write the error message to the result field (which might be mapped to a dense vector when written to the index); instead it goes in a different field.
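
As a purely hypothetical sketch of that idea (the field name and variable here are invented, not the PR's code), an error result can simply ignore the requested output field so the message never lands in a field that might be mapped as a dense vector:

@Override
public Map<String, Object> asMap(String outputField) {
    // deliberately ignore outputField: the error text goes under its own key,
    // not into the field the model output would normally be written to
    return Map.of("error", exceptionMessage);
}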

* @param outputField Write the inference result to this field
* @return Map representation of the InferenceResult
*/
Map<String, Object> asMap(String outputField);
Contributor:

I'm probably missing the importance of why we need to pass in the outputField here, but looking at a few of the implementations of InferenceResults, most seem to have a results field that is already used by the existing Map<String, Object> asMap() version. Why couldn't the caller just instantiate the implementing class with the output field when constructing the object, instead of having to pass it in with this method?

Contributor:

Maybe this isn't feasible but the code here:

InferenceResults.writeResultToField(
                    response.getInferenceResults().get(i),
                    ingestDocument,
                    inputs.get(i).outputBasePath(),
                    inputs.get(i).outputField,
                    response.getId() != null ? response.getId() : modelId,
                    i == 0

So I think what I'm suggesting is that whoever creates the inference results class that is eventually placed in the response, and then accessed here with response.getInferenceResults().get(i), would need to pass in the corresponding output field.
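
Illustrating that suggestion with a rough, hypothetical sketch (the class and variable names are made up, not taken from the codebase): the output field would be supplied when the result object is constructed, so callers could keep using the parameterless asMap():

// hypothetical: the result object is constructed already knowing its destination field
InferenceResults result = new SomeTextExpansionResults(outputField, modelOutput);
// ...so it can serialise itself without being told the field again
ingestDocument.setFieldValue(outputBasePath, result.asMap());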

davidkyle (Member, Author):

The code is a bit back to front: InferenceResults has a result_field, and to change that field we have to send an inference config update with the inference request just to change the result_field in the response. The sender knows where they want the result written, but we have to use this verbose, long-winded way of changing result_field.

It all breaks down when you have multiple fields and you want the responses written to different output fields (e.g. title and body); if we used result_field we would have to create a separate request for each input field just to change the result_field in each one.

It's much simpler to tell the result to write itself to a given location, but we have to keep the legacy method, so the code isn't very neat. Hopefully this can be refactored away eventually.
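
For contrast, this is roughly what the legacy shape looks like (a sketch for illustration, not taken from this PR): the destination is spread across target_field, field_map and an inference_config override, rather than sitting next to the input:

{
    "inference": {
        "model_id": "elser_model",
        "target_field": "ml.inference",
        "field_map": {
            "body": "text_field"
        },
        "inference_config": {
            "text_expansion": {
                "results_field": "body_tokens"
            }
        }
    }
}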

Map<String, String> fieldMap
Map<String, String> fieldMap,
List<Factory.InputConfig> inputs,
boolean configuredWithInputsFields
Contributor:

This would probably be too big of a change now, but I wonder if in the future we could remove the need for a boolean here and pass in a small class that defines how the ingest processor behaves when it has multiple input fields. Or maybe split this into two separate classes 🤷‍♂️

That way we can avoid the if-else logic throughout this class.
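
A rough sketch of what that could look like (hypothetical names, not a concrete proposal from this PR): a small strategy interface replaces the configuredWithInputsFields flag, and the processor delegates to it instead of branching:

import java.util.List;
import java.util.Map;

// hypothetical: one implementation for the new input_output configuration,
// another for the legacy field_map / whole-document behaviour
interface InputHandling {
    // collect the text (or the whole document) to send for inference
    List<String> extractInputs(Map<String, Object> source);

    // write one inference result back into the document source
    void writeResult(Map<String, Object> result, Map<String, Object> target);
}

The processor would hold a single InputHandling instance chosen at construction time, so the if (configuredWithInputsFields) branches disappear.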

davidkyle (Member, Author):

++ good idea

InferModelAction.Request request;
try {
request = buildRequest(ingestDocument);
} catch (ElasticsearchStatusException e) {
Contributor:

Is it worth logging something here?
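
One possible shape for that (hypothetical, not code from the PR; it assumes a logger and a failure handler are in scope in the processor):

} catch (ElasticsearchStatusException e) {
    // hypothetical: record why the request could not be built before failing the document
    logger.debug("failed to build inference request for model [" + modelId + "]", e);
    handler.accept(ingestDocument, e);
    return;
}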

if (configuredWithInputsFields) {
List<String> requestInputs = new ArrayList<>();
for (var inputFields : inputs) {
var lookup = (String) fields.get(inputFields.inputField);
Contributor:

Do we need to do an instanceof or type check before casting?
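
A minimal sketch of that check (the handling of the non-string case is hypothetical, not what the PR decides to do):

Object value = fields.get(inputFields.inputField);
if (value instanceof String text) {
    requestInputs.add(text);
} else {
    // hypothetical: reject non-string values instead of blindly casting
    throw new IllegalArgumentException("input field [" + inputFields.inputField + "] is not a string");
}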

davidkyle merged commit b055204 into elastic:main on Oct 3, 2023
davidkyle added a commit that referenced this pull request Oct 6, 2023
…00335)

Following on from #100205 this PR adds more tests and checks 
for corner cases when parsing the configuration.
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Oct 6, 2023
…astic#100335)

Following on from elastic#100205 this PR adds more tests and checks 
for corner cases when parsing the configuration.
elasticsearchmachine pushed a commit that referenced this pull request Oct 6, 2023
…00335) (#100416)

Following on from #100205 this PR adds more tests and checks 
for corner cases when parsing the configuration.
Labels
cloud-deploy (Publish cloud docker image for Cloud-First-Testing), >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.11.0