Add support for emitting multiple fields values from a script #68203

Closed
javanna opened this issue Jan 29, 2021 · 5 comments · Fixed by #75108
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@javanna
Member

javanna commented Jan 29, 2021

We recently added support for runtime fields, which are computed at search time by a Painless script. As of today, a runtime field script can emit values for a single field only: the one that the script is declared under.

We would like to add the ability for a script to emit values for multiple fields. This will be achieved by introducing support for a new field type (name to be defined) as part of the runtime section, whose script emits the fields that belong to it. This is particularly useful given that scripts support grok and dissect (#68088):

PUT localhost:9200/logs/_mappings
{
  "runtime" : {
    "log" : {
      "type" : "tbd",
      "script": '''
        emit(grok('%{COMMONAPACHELOG}').extract(doc["message"].value));
      ''',
      "fields" : {
        "clientip" : {
          "type" : "ip"
        },
        "verb" : {
          "type" : "keyword"
        },
        "request" : {
          "type" : "keyword"
        },
        "response" : {
          "type" : "long"
        }
      }
    }
  },
  "properties" : {
    "message" : {
      "type" : "keyword"
    }
  }
}

In the example above, the grok function splits the message field into sub-fields based on the provided grok pattern, and the resulting map is passed to emit, which emits one value per entry. The emitted fields need to be declared under fields in order to specify their type and make them searchable (and discoverable through field_caps) like any other field:

POST /logs*/_search
{
  "aggs": {
    "response_codes": {
      "range": {
        "field": "log.response",
        "ranges": [
          { "to": 300 },
          { "from": 300, "to": 400 },
          { "from": 500 }
        ]
      }
    }
  }
}
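
The splitting step described above can be sketched outside Elasticsearch. The following is a minimal Python illustration, not the grok implementation: the regular expression is a simplified stand-in for `%{COMMONAPACHELOG}`, and the names `COMMON_LOG`, `DECLARED`, `extract` and `emit_declared` are hypothetical. It also assumes that values for `long` sub-fields are converted to integers and that captures not declared under `fields` are not exposed.

```python
import re

# Simplified stand-in for %{COMMONAPACHELOG}; the real grok pattern is more permissive.
COMMON_LOG = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+)(?: HTTP/\S+)?" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)

# Sub-fields declared under "fields" in the mapping, with their types.
DECLARED = {"clientip": "ip", "verb": "keyword", "request": "keyword", "response": "long"}


def extract(message):
    """Return the named captures as a dict, or an empty dict if the line does not match."""
    match = COMMON_LOG.match(message)
    return match.groupdict() if match else {}


def emit_declared(message):
    """Keep only declared sub-fields, prefix them with the parent name, cast longs."""
    out = {}
    for name, value in extract(message).items():
        if name not in DECLARED:
            continue  # assumption: undeclared captures (timestamp, bytes) are dropped
        out["log." + name] = int(value) if DECLARED[name] == "long" else value
    return out


line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(emit_declared(line))
```

With this sample line, `log.response` comes out as the integer 200, which is what lets the range aggregation above treat it as a number.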
@javanna javanna added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Jan 29, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jan 29, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@jtibshirani
Contributor

An alternative would be to model this using multi-fields:

"message" : {
  "type" : "keyword",
  "script": '''
    Map fields = grok('%{COMMONAPACHELOG}').extract(doc["message"].value);
    for (Map.Entry field : fields.entrySet()) {
      emit(field.getKey(), field.getValue());
    }
  ''',
  "fields" : {
    "clientip" : {
      "type" : "ip"
    }
  }
}

I think it fits okay into the multi-fields concept: the parent field holds some value, and each subfield consults the same value but exposes it differently. A benefit of this approach is that there's only one way to define sub-fields, so the conflict in the example isn't possible. A more theoretical point, but I also like that it maintains an invariant: any field that stores + parses a document value is a leaf field type (not an object mapper). This includes fields that expose virtual subfields like flattened in addition to non-virtual ones like aggregate_metric_double.

@javanna
Member Author

javanna commented Feb 5, 2021

Thanks for the feedback!

One point of concern is how the runtime field will be shaped when it is made indexed. We'd love to be able to paste it under the properties section. That sounds harder if we follow the multi-fields approach, I'm afraid. Or maybe not, as long as scripts behave the same under both the runtime and the properties section.

I wonder whether it would be harder to follow if the script for a keyword field can emit multiple fields or not depending on whether it has sub-fields defined. The intention so far is to expose that key-value emit (or something along those lines) only to the object variant of a runtime field.

On the naming conflict, I believe that you could still define a message.clientip field with the dot in its name, outside of the message field of type keyword, and we would have to possibly forbid having both. It is still up for discussion what the right behaviour should be, my intention was only to point out the potential problem in the description above.

I was also under the impression that the invariant you mentioned is valid with both examples, because you still need to define the leaf fields. The script is meta: it tells what to emit and where to take it from, but the object is still a collection of other fields.

Does this make sense to you?

@javanna
Member Author

javanna commented Feb 6, 2021

One additional point that came to mind is that the message field itself may be defined under properties, and doc['message'] would refer to it. If the script to split the field is defined under a message field of type keyword, does that force the message field to be runtime too? That field, to my mind, is really just a container of other fields; on its own it has little meaning when referred to, for instance in a query.

@javanna javanna changed the title Add support for runtime object fields Add support for emitting multiple fields values from a script Jun 3, 2021
@javanna
Member Author

javanna commented Jun 3, 2021

Heads up: I have updated the description of the issue according to recent discussions and the draft PR I opened. The API has slightly changed, and we need to come up with a name for the new field type if we decide to not use object.

romseygeek pushed a commit that referenced this issue Aug 10, 2021
We have recently introduced support for grok and dissect in the runtime fields
Painless context, which makes it possible to split a field into multiple fields. However, each runtime
field can only emit values for a single field. This commit introduces support for emitting
multiple fields from the same script.

The API call to define a runtime field that emits multiple fields is the following:

```
PUT localhost:9200/logs/_mappings
{
    "runtime" : {
      "log" : {
        "type" : "composite",
        "script" : "emit(grok(\"%{COMMONAPACHELOG}\").extract(doc[\"message.keyword\"].value))",
        "fields" : {
            "clientip" : {
                "type" : "ip"
            },
            "response" : {
                "type" : "long"
            }
        }
      }
    }
}
```

The script context for this new field type accepts two emit signatures:

* `emit(String, Object)`
* `emit(Map)`

Sub-fields need to be declared under `fields` in order to be discoverable through
the field_caps API and accessible through the search API.
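
To make the two signatures concrete, here is a small Python sketch of an emitter that accepts either a single key-value pair or a whole map. The class name `CompositeEmitter` is hypothetical and this is not the Painless script context. The sketch also assumes that a later emit for the same key replaces the earlier value, which the description above does not specify.

```python
class CompositeEmitter:
    """Collects the values emitted by one script run, keyed by sub-field name."""

    def __init__(self):
        self.values = {}

    def emit(self, key_or_map, value=None):
        if isinstance(key_or_map, dict):
            # Counterpart of emit(Map): every entry becomes a sub-field value.
            self.values.update(key_or_map)
        else:
            # Counterpart of emit(String, Object): a single sub-field value.
            # Assumption: a repeated key overwrites the previous value.
            self.values[key_or_map] = value


emitter = CompositeEmitter()
emitter.emit({"clientip": "10.0.0.7", "verb": "GET"})
emitter.emit("response", 404)
print(emitter.values)
```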

Multiple fields are emitted by returning multiple MappedFieldTypes
from RuntimeField#asMappedFieldTypes. The sub-fields are instances of the
runtime fields that are already supported, with a small tweak: the script
defined by their parent is adapted into an artificial script factory for each
sub-field, which makes the corresponding sub-field accessible. This approach
makes it possible to reuse all of the existing runtime fields code for the sub-fields.
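
The adaptation described above can be pictured with a short Python sketch. It is an analogy only, not the Elasticsearch code: `parent_script` and `subfield_script` are hypothetical names, and the parent here parses a made-up two-token message rather than using grok. The idea it illustrates is that each sub-field gets its own derived script, which runs the shared parent script and keeps a single key.

```python
def parent_script(doc):
    """Stand-in for the composite script: returns every sub-field value at once."""
    clientip, response = doc["message"].split()
    return {"clientip": clientip, "response": int(response)}


def subfield_script(parent, name):
    """Build a per-sub-field script that runs the parent and keeps one key."""
    def script(doc):
        values = parent(doc)
        # A sub-field the parent did not emit simply has no values.
        return [values[name]] if name in values else []
    return script


# One derived script per declared sub-field, all sharing the same parent.
scripts = {
    "log." + name: subfield_script(parent_script, name)
    for name in ("clientip", "response")
}

doc = {"message": "10.0.0.7 404"}
print({field: run(doc) for field, run in scripts.items()})
```

Because every derived script has the shape of an ordinary single-field script, the code that already consumes such scripts needs no changes, which is the reuse the commit message points to.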

The runtime section has been flat so far, as it has not supported objects until now.
That stays the same, meaning that runtime fields can have dots in their names.
Because the ability to emit multiple fields introduces a second way to create the
same field, we have to make sure that a runtime field with
a certain name cannot be defined twice, which is why the following mappings are
rejected with the error `Found two runtime fields with same name [log.response]`:

```
PUT localhost:9200/logs/_mappings
{
    "runtime" : {
        "log.response" : {
            "type" : "keyword"
        },
        "log" : {
            "type" : "composite",
            "script" : "emit(\"response\", grok(\"%{COMMONAPACHELOG}\").extract(doc[\"message.keyword\"].value)?.response)",
            "fields" : {
                "response" : {
                    "type" : "long"
                }
            }
        }
    }
}
```
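
The name check can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual validation code: `flatten_runtime` is a hypothetical helper that expands each composite definition into full `parent.sub` names and raises as soon as a name is produced twice, reusing the error text quoted above.

```python
def flatten_runtime(runtime):
    """Return {full_field_name: type}, raising if a name is produced twice."""
    seen = {}

    def register(name, field_type):
        if name in seen:
            raise ValueError(f"Found two runtime fields with same name [{name}]")
        seen[name] = field_type

    for name, definition in runtime.items():
        if definition["type"] == "composite":
            # A composite contributes one leaf field per declared sub-field.
            for sub, sub_def in definition.get("fields", {}).items():
                register(f"{name}.{sub}", sub_def["type"])
        else:
            register(name, definition["type"])
    return seen


conflicting = {
    "log.response": {"type": "keyword"},
    "log": {"type": "composite", "fields": {"response": {"type": "long"}}},
}

try:
    flatten_runtime(conflicting)
except ValueError as err:
    print(err)
```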

Closes #68203
romseygeek pushed a commit to romseygeek/elasticsearch that referenced this issue Aug 10, 2021
romseygeek added a commit that referenced this issue Aug 10, 2021

Co-authored-by: Luca Cavanna <[email protected]>