diff --git a/docs/reference/scripting/common-script-uses.asciidoc b/docs/reference/scripting/common-script-uses.asciidoc index 3cf30b328d723..26f28439c64ae 100644 --- a/docs/reference/scripting/common-script-uses.asciidoc +++ b/docs/reference/scripting/common-script-uses.asciidoc @@ -17,15 +17,10 @@ There are two options at your disposal: * <> is a regular expression dialect that supports aliased expressions that you can reuse. Because Grok sits on top of regular expressions (regex), any regular expressions are valid in grok as well. -* <> extracts structured fields out of text, using +* <> extracts structured fields out of text, using delimiters to define the matching pattern. Unlike grok, dissect doesn't use regular expressions. -Regex is incredibly powerful but can be complicated. If you don't need the -power of regular expressions, use dissect patterns, which are simple and -often faster than grok patterns. Paying special attention to the parts of the string -you want to discard will help build successful dissect patterns. - Let's start with a simple example by adding the `@timestamp` and `message` fields to the `my-index` mapping as indexed fields. To remain flexible, use `wildcard` as the field type for `message`: diff --git a/docs/reference/scripting/dissect-syntax.asciidoc b/docs/reference/scripting/dissect-syntax.asciidoc new file mode 100644 index 0000000000000..8eee3acaac10c --- /dev/null +++ b/docs/reference/scripting/dissect-syntax.asciidoc @@ -0,0 +1,310 @@ +[[dissect]] +=== Dissecting data +Dissect matches a single text field against a defined pattern. A dissect +pattern is defined by the parts of the string you want to discard. Paying +special attention to each part of a string helps to build successful dissect +patterns. + +If you don't need the power of regular expressions, use dissect patterns instead +of grok. Dissect uses a much simpler syntax than grok and is typically faster +overall. 
The syntax for dissect is transparent: tell dissect what you want and +it will return those results to you. + +[[dissect-syntax]] +==== Dissect patterns +Dissect patterns are composed of _variables_ and _separators_. Anything +defined by a percent sign and curly braces `%{}` is considered a variable, +such as `%{clientip}`. You can assign variables to any part of data in a field, +and then return only the parts that you want. Separators are any values between +variables, which could be spaces, dashes, or other delimiters. + +For example, let's say you have log data with a `message` field that looks like +this: + +[source,js] +---- +"message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0" +---- +// NOTCONSOLE + +You assign variables to each part of the data to construct a successful +dissect pattern. Remember, tell dissect _exactly_ what you want to +match on. + +The first part of the data looks like an IP address, so you +can assign a variable like `%{clientip}`. The next two characters are dashes +with a space on either side. You can assign a variable for each dash, or a +single variable to represent the dashes and spaces. Next is a set of brackets +containing a timestamp. The brackets are a separator, so you include those in +the dissect pattern. Thus far, the data and matching dissect pattern look like +this: + +[source,js] +---- +247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] <1> + +%{clientip} %{ident} %{auth} [%{@timestamp}] <2> +---- +// NOTCONSOLE +<1> The first chunks of data from the `message` field +<2> Dissect pattern to match on the selected data chunks + +Using that same logic, you can create variables for the remaining chunks of +data. Double quotation marks are separators, so include those in your dissect +pattern. The pattern replaces `GET` with a `%{verb}` variable, but keeps `HTTP` +as part of the pattern. 
+ +[source,js] +---- +\"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0 + +"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size} +---- +// NOTCONSOLE + +Combining the two patterns results in a dissect pattern that looks like this: + +[source,js] +---- +%{clientip} %{ident} %{auth} [%{@timestamp}] \"%{verb} %{request} HTTP/%{httpversion}\" %{response} %{size} +---- +// NOTCONSOLE + +Now that you have a dissect pattern, how do you test and use it? + +[[dissect-patterns-test]] +==== Test dissect patterns with Painless +You can incorporate dissect patterns into Painless scripts to extract +data. To test your script, either use the {painless}/painless-execute-api.html#painless-execute-runtime-field-context[field contexts] of the Painless +execute API or create a runtime field that includes the script. Runtime fields +offer greater flexibility and accept multiple documents, but the Painless execute +API is a great option if you don't have write access on a cluster where you're +testing a script. + +For example, test your dissect pattern with the Painless execute API by +including your Painless script and a single document that matches your data. +Start by mapping the `message` field as a `wildcard` data type: + +[source,console] +---- +PUT my-index +{ + "mappings": { + "properties": { + "message": { + "type": "wildcard" + } + } + } +} +---- + +If you want to retrieve the HTTP response code, add your dissect pattern to a +Painless script that extracts the `response` value. 
To extract values from a +field, use this function: + +[source,painless] +---- +`<pattern>.extract(doc["<field_name>"].value)?.<field_value>` +---- + +In this example, `message` is the `<field_name>` and `response` is the +`<field_value>`: + +[source,console] +---- +POST /_scripts/painless/_execute +{ + "script": { + "source": """ + String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response; + if (response != null) emit(Integer.parseInt(response)); <1> + """ + }, + "context": "long_field", <2> + "context_setup": { + "index": "my-index", + "document": { <3> + "message": """247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0""" + } + } +} +---- +// TEST[continued] +<1> Runtime fields require the `emit` method to return values. +<2> Because the response code is an integer, use the `long_field` context. +<3> Include a sample document that matches your data. + +The result includes the HTTP response code: + +[source,console-result] +---- +{ + "result" : [ + 304 + ] +} +---- + +[[dissect-patterns-runtime]] +==== Use dissect patterns and scripts in runtime fields +If you have a functional dissect pattern, you can add it to a runtime field to +manipulate data. Because runtime fields don't require you to index fields, you +have incredible flexibility to modify your script and how it functions. If you +already <<dissect-patterns-test,tested your dissect pattern>> using the Painless +execute API, you can use that _exact_ Painless script in your runtime field. 
+ +To start, add the `message` field as a `wildcard` type like in the previous +section, but also add `@timestamp` as a `date` in case you want to operate on +that field for <>: + +[source,console] +---- +PUT /my-index/ +{ + "mappings": { + "properties": { + "@timestamp": { + "format": "strict_date_optional_time||epoch_second", + "type": "date" + }, + "message": { + "type": "wildcard" + } + } + } +} +---- + +If you want to extract the HTTP response code using your dissect pattern, you +can create a runtime field like `http.response`: + +[source,console] +---- +PUT my-index/_mappings +{ + "runtime": { + "http.response": { + "type": "long", + "script": """ + String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response; + if (response != null) emit(Integer.parseInt(response)); + """ + } + } +} +---- +// TEST[continued] + +After mapping the fields you want to retrieve, index a few records from +your log data into {es}. 
The following request uses the <<docs-bulk,bulk API>> +to index raw log data into `my-index`: + +[source,console] +---- +POST /my-index/_bulk?refresh=true +{"index":{}} +{"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"} +{"index":{}} +{"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"} +{"index":{}} +{"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"} +{"index":{}} +{"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"} +{"index":{}} +{"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"} +{"index":{}} +{"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"} +{"index":{}} +{"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"} +---- +// TEST[continued] + +You can define a simple query to run a search for a specific HTTP response and +return all related fields. Use the `fields` parameter of the search API to +retrieve the `http.response` runtime field. + +[source,console] +---- +GET my-index/_search +{ + "query": { + "match": { + "http.response": "304" + } + }, + "fields" : ["http.response"] +} +---- +// TEST[continued] + +Alternatively, you can define the same runtime field but in the context of a +search request. The runtime definition and the script are exactly the same as +the one defined previously in the index mapping. Just copy that definition into +the search request under the `runtime_mappings` section and include a query +that matches on the runtime field. 
This query returns the same results as the +search query previously defined for the `http.response` runtime field in your +index mappings, but only in the context of this specific search: + +[source,console] +---- +GET my-index/_search +{ + "runtime_mappings": { + "http.response": { + "type": "long", + "script": """ + String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response; + if (response != null) emit(Integer.parseInt(response)); + """ + } + }, + "query": { + "match": { + "http.response": "304" + } + }, + "fields" : ["http.response"] +} +---- +// TEST[continued] +// TEST[s/_search/_search\?filter_path=hits/] + +[source,console-result] +---- +{ + "hits" : { + "total" : { + "value" : 1, + "relation" : "eq" + }, + "max_score" : 1.0, + "hits" : [ + { + "_index" : "my-index", + "_type" : "_doc", + "_id" : "D47UqXkBByC8cgZrkbOm", + "_score" : 1.0, + "_source" : { + "timestamp" : "2020-04-30T14:31:22-05:00", + "message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0" + }, + "fields" : { + "http.response" : [ + 304 + ] + } + } + ] + } +} +---- +// TESTRESPONSE[s/"_id" : "D47UqXkBByC8cgZrkbOm"/"_id": $body.hits.hits.0._id/] \ No newline at end of file diff --git a/docs/reference/scripting/using.asciidoc b/docs/reference/scripting/using.asciidoc index 4f28bf0b6a074..c740ff3decc61 100644 --- a/docs/reference/scripting/using.asciidoc +++ b/docs/reference/scripting/using.asciidoc @@ -566,4 +566,5 @@ DELETE /_ingest/pipeline/my_test_scores_pipeline //// +include::dissect-syntax.asciidoc[] include::grok-syntax.asciidoc[]
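For readers of this change who want to sanity-check the dissect pattern from `dissect-syntax.asciidoc` without a cluster, the variables-and-separators idea can be sketched in plain Python. This is only an illustration under stated assumptions, not how Elasticsearch implements dissect: the helper names `compile_dissect` and `extract` are hypothetical, every `%{name}` variable is treated as a lazy capture, separators are matched literally, and dissect key modifiers are not handled.

```python
import re

# Matches a dissect-style variable such as %{clientip}.
VARIABLE = re.compile(r"%\{([^}]*)\}")


def compile_dissect(pattern):
    """Hypothetical helper: turn a dissect-style pattern into (regex, names).

    Text between variables is a separator and is matched literally;
    each variable becomes a lazy capture group.
    """
    parts, names, pos = [], [], 0
    for match in VARIABLE.finditer(pattern):
        parts.append(re.escape(pattern[pos:match.start()]))  # literal separator
        parts.append("(.*?)")                                # variable capture
        names.append(match.group(1))
        pos = match.end()
    parts.append(re.escape(pattern[pos:]))                   # trailing separator
    return re.compile("^" + "".join(parts) + "$"), names


def extract(pattern, text):
    """Return a dict of variable name to value, or None when the text does not match."""
    regex, names = compile_dissect(pattern)
    match = regex.match(text)
    if match is None:
        return None
    return dict(zip(names, match.groups()))


PATTERN = ('%{clientip} %{ident} %{auth} [%{@timestamp}] '
           '"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}')
MESSAGE = ('247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] '
           '"GET /images/hm_nbg.jpg HTTP/1.0" 304 0')

fields = extract(PATTERN, MESSAGE)
print(fields["response"])
print(extract(PATTERN, "not a valid apache log"))
```

Returning `None` for a non-matching line plays the same role as the null-safe `?.` operator and the `response != null` check in the Painless script, which is why the `not a valid apache log` document produces no `http.response` value.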