Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit multiple fields from a runtime field script #75108

Merged
merged 31 commits into from
Aug 10, 2021

Conversation

javanna
Copy link
Member

@javanna javanna commented Jul 8, 2021

We have recently introduced support for grok and dissect to the runtime fields Painless context that allows to split a field into multiple fields. Though each runtime field can only emit values for a single field. This PR introduces support for emitting multiple fields from the same script.

The API call to define a runtime field that emits multiple fields is the following:

PUT localhost:9200/logs/_mappings
{
    "runtime" : {
      "log" : {
        "type" : "composite",
        "script" : "emit(grok(\"%{COMMONAPACHELOG}\").extract(doc[\"message.keyword\"].value))",
        "fields" : {
            "clientip" : {
                "type" : "ip"
            },
            "response" : {
                "type" : "long"
            }
        }
      }
    }
}

The script context for this new field type accepts two emit signatures:

emit(String, Object)
emit(Map)

Sub-fields need to be declared under fields in order to be discoverable through the field_caps API and accessible through the search API. In this first iteration, mapping additional sub-fields requires to provide the whole object with its script and all the existing plus new sub-fields. In a follow-up we will address this by allowing users to provide only what needs to be changed or added.

The way that it emits multiple fields is by returning multiple MappedFieldTypes from RuntimeField#asMappedFieldTypes. The sub-fields are instances of the runtime fields that are already supported, with a little tweak to adapt the script defined by their parent to an artificial script factory for each of the sub-fields that makes its corresponding sub-field accessible. This approach allows to reuse all of the existing runtime fields code for the sub-fields.

The runtime section has been flat so far as it has not supported objects until now. That stays the same, meaning that runtime fields can have dots in their names. Because there are though two ways to create the same field with the introduction of the ability to emit multiple fields, we have to make sure that a runtime field with a certain name cannot be defined twice, which is why the following mappings are rejected with the error Found two runtime fields with same name [log.response]:

PUT localhost:9200/logs/_mappings
{
    "runtime" : {
        "log.response" : {
            "type" : "keyword"
        },
        "log" : {
            "type" : "composite",
            "script" : "emit(\"response\", grok(\"%{COMMONAPACHELOG}\").extract(doc[\"message.keyword\"].value)?.response)",
            "fields" : {
                "response" : {
                    "type" : "long"
                }
            }
        }
    }
}

Closes #68203

@javanna javanna added >enhancement release highlight :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.15.0 labels Jul 8, 2021
builder.startObject("fields");
for (RuntimeField subfield : subfields) {
subfield.toXContent(builder, params);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we print out all the parameters above, and fields is also a parameter, but nothing gets printed out. It is probably good as we can only parse RuntimeFields at a later time, which is what we need to print out the subfields. Is what I am doing here ok? Do we have to somehow make sure that serializing the parameters does not serialize also the fields, why does nothing happen at the moment @romseygeek ?

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed this as someone not familiar with this part of the codebase (so needs another reviewer). I've left a few questions / comments but overall the PR looks in good shape to me.

new RuntimeField.Builder(name) {
private final FieldMapper.Parameter<Script> script = new FieldMapper.Parameter<>(
"script",
false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be true, i.e., updateable? Same question for fields below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime field definitions don't get merged, they get replaced entirely, so this parameter doesn't apply.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reusing the field mapper parameter parsing was a nice hack but this shows one of the places where its kind of confusing. I'm not sure what to do about it though.

* @return the values that were emitted for the provided field
*/
public final List<Object> getValues(String field) {
//TODO for now we re-run the script every time a leaf field is accessed, but we could cache the values?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is such a significant limitation that this behavior should at least be documented, so that it does not surprise users that are have composite fields with lots of subfields.

It means that the benefits of the PR are mostly of syntactic nature, allowing users not to have to copy paste the same script into two field definitions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely. There aren't any docs in this PR, they will be added as a follow-up in collaboration with the docs team, but we'll make sure to call out this limitation.

CompositeFieldScript.Factory objectFieldScript = (f, p, s) -> ctx -> new CompositeFieldScript(f, p, s, ctx) {
@Override
public void execute() {
emit("field", "value");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emit multiple fields here?

//TODO for now we re-run the script every time a leaf field is accessed, but we could cache the values?
fieldValues.clear();
execute();
return fieldValues.get(field);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should fieldValues be cleared before returning field so that it does not hold onto unnecessary state?

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy!

I wonder if we could have emit on the composite script bind its results into all of the "child" scripts immediately. Right now it looks like you can run a script and it'll collect a Map of values. And then, after the script is finished, you extract just the ones you need. That's good because you don't convert the ones you don't need. But its bad that you silently ignore when you emit invalid values that you aren't reading. It also feels bad that the error comes after the script instead of during it. If we throw the error while running the script we'd get the line it fails on. Painless has try/catch so you could even use that to dodge error cases or add extra information when an error is thrown.

A smaller thing, but now that there are three sorts of subclasses for stuff like DoubleFieldScript maybe it'd be a good idea to have a more natural shaped supertype. The design of these classes is very much around painless implementing them, but we have those two cases where we implement it by hand. It might make sense to make refactor the base classes so they are less "painless shaped" and make a "painless shaped" layer that it can implement. I'm not sure, and, regardless, I don't think that is a "for now" thing.

}

@Override
public Collection<MappedFieldType> asMappedFieldTypes() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe return a Stream from this if you are going to be flatMaping it. If called needs a list they can make one?

@romseygeek
Copy link
Contributor

I wonder if we could have emit on the composite script bind its results into all of the "child" scripts immediately. Right now it looks like you can run a script and it'll collect a Map of values. And then, after the script is finished, you extract just the ones you need. That's good because you don't convert the ones you don't need. But its bad that you silently ignore when you emit invalid values that you aren't reading. It also feels bad that the error comes after the script instead of during it. If we throw the error while running the script we'd get the line it fails on. Painless has try/catch so you could even use that to dodge error cases or add extra information when an error is thrown.

At the moment the script is actually run by the child field, and there's no caching or sharing of values between different child fields. So if you have a grok defined on a parent field 'message' and you search on 'message.status' and 'message.ip', both fields will run the painless script separately and then extract the part of the map that they need. This isn't ideal, as Yannick mentions, but it's the way runtime fields work now. It would be great to rework things so that different runtime fields could share a LeafSearchLookup, in which case we could cache these results there, but that's a much bigger project.

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me overall, I left some initial comments. The name composite has really grown on me!

@romseygeek
Copy link
Contributor

Thanks for the reviews everybody!

I have reworked things so that runtime builders now have two methods, one to create a simple leaf field and one to create a field with a parent. The parsing code can take an optional function that will choose which method to call when building fields, and it nicely separates out composite field construction from 'normal' field construction.

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me, just left some small comments. As a caveat, I am not deeply familiar with this code and its history (AbstractScriptFieldType, RuntimeField, etc.)

@romseygeek
Copy link
Contributor

@elasticmachine update branch

@romseygeek romseygeek merged commit 32d2f60 into master Aug 10, 2021
@romseygeek romseygeek deleted the feature/runtime_fields_multi_emit branch August 10, 2021 12:07
romseygeek pushed a commit to romseygeek/elasticsearch that referenced this pull request Aug 10, 2021
We have recently introduced support for grok and dissect to the runtime fields
Painless context that allows to split a field into multiple fields. However, each runtime
field can only emit values for a single field. This commit introduces support for emitting
multiple fields from the same script.

The API call to define a runtime field that emits multiple fields is the following:

```
PUT localhost:9200/logs/_mappings
{
    "runtime" : {
      "log" : {
        "type" : "composite",
        "script" : "emit(grok(\"%{COMMONAPACHELOG}\").extract(doc[\"message.keyword\"].value))",
        "fields" : {
            "clientip" : {
                "type" : "ip"
            },
            "response" : {
                "type" : "long"
            }
        }
      }
    }
}
```

The script context for this new field type accepts two emit signatures:

* `emit(String, Object)`
* `emit(Map)`

Sub-fields need to be declared under fields in order to be discoverable through
the field_caps API and accessible through the search API.

The way that it emits multiple fields is by returning multiple MappedFieldTypes
from RuntimeField#asMappedFieldTypes. The sub-fields are instances of the
runtime fields that are already supported, with a little tweak to adapt the script
defined by their parent to an artificial script factory for each of the sub-fields
that makes its corresponding sub-field accessible. This approach allows to reuse
all of the existing runtime fields code for the sub-fields.

The runtime section has been flat so far as it has not supported objects until now.
That stays the same, meaning that runtime fields can have dots in their names.
Because there are though two ways to create the same field with the introduction
of the ability to emit multiple fields, we have to make sure that a runtime field with
a certain name cannot be defined twice, which is why the following mappings are
rejected with the error `Found two runtime fields with same name [log.response]`:

```
PUT localhost:9200/logs/_mappings
{
    "runtime" : {
        "log.response" : {
            "type" : "keyword"
        },
        "log" : {
            "type" : "composite",
            "script" : "emit(\"response\", grok(\"%{COMMONAPACHELOG}\").extract(doc[\"message.keyword\"].value)?.response)",
            "fields" : {
                "response" : {
                    "type" : "long"
                }
            }
        }
    }
}
```

Closes elastic#68203
romseygeek added a commit that referenced this pull request Aug 10, 2021
We have recently introduced support for grok and dissect to the runtime fields
Painless context that allows to split a field into multiple fields. However, each runtime
field can only emit values for a single field. This commit introduces support for emitting
multiple fields from the same script.

The API call to define a runtime field that emits multiple fields is the following:

```
PUT localhost:9200/logs/_mappings
{
    "runtime" : {
      "log" : {
        "type" : "composite",
        "script" : "emit(grok(\"%{COMMONAPACHELOG}\").extract(doc[\"message.keyword\"].value))",
        "fields" : {
            "clientip" : {
                "type" : "ip"
            },
            "response" : {
                "type" : "long"
            }
        }
      }
    }
}
```

The script context for this new field type accepts two emit signatures:

* `emit(String, Object)`
* `emit(Map)`

Sub-fields need to be declared under fields in order to be discoverable through
the field_caps API and accessible through the search API.

The way that it emits multiple fields is by returning multiple MappedFieldTypes
from RuntimeField#asMappedFieldTypes. The sub-fields are instances of the
runtime fields that are already supported, with a little tweak to adapt the script
defined by their parent to an artificial script factory for each of the sub-fields
that makes its corresponding sub-field accessible. This approach allows to reuse
all of the existing runtime fields code for the sub-fields.

The runtime section has been flat so far as it has not supported objects until now.
That stays the same, meaning that runtime fields can have dots in their names.
Because there are though two ways to create the same field with the introduction
of the ability to emit multiple fields, we have to make sure that a runtime field with
a certain name cannot be defined twice, which is why the following mappings are
rejected with the error `Found two runtime fields with same name [log.response]`:

```
PUT localhost:9200/logs/_mappings
{
    "runtime" : {
        "log.response" : {
            "type" : "keyword"
        },
        "log" : {
            "type" : "composite",
            "script" : "emit(\"response\", grok(\"%{COMMONAPACHELOG}\").extract(doc[\"message.keyword\"].value)?.response)",
            "fields" : {
                "response" : {
                    "type" : "long"
                }
            }
        }
    }
}
```

Closes #68203

Co-authored-by: Luca Cavanna <[email protected]>
@markharwood
Copy link
Contributor

markharwood commented Aug 23, 2021

Hi @romseygeek I have a null pointer fix for this in master but when I tried backporting that fix it looks like this feature is not backported in 7.x? The label on this PR says 7.15 so I'm not sure if it's still just pending?

@markharwood
Copy link
Contributor

Ignore my last comment. I can see the NPE is in a different class in 7.15 - TransportFieldCapabilitiesIndexAction.

@mattkime
Copy link

@javanna - I'd like to confirm - composite fields can only have 'child' fields one level deep, correct?

@romseygeek
Copy link
Contributor

@mattkime yes that's correct

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement release highlight :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.15.0 v8.0.0-alpha2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for emitting multiple fields values from a script
9 participants