[DOCS] Create a new page for dissect content in scripting docs (#73437)

* [DOCS] Create a new page for dissect in scripting docs
* Expanding a bit more
* Adding a section for using dissect patterns
* Adding tests
* Fix test cases and other edits
elastic · May 27, 2021 · e64a029
1 parent 40a029f
commit e64a029

Showing 3 changed files with 303 additions and 6 deletions.
diff --git a/docs/reference/scripting/common-script-uses.asciidoc b/docs/reference/scripting/common-script-uses.asciidoc
@@ -17,15 +17,10 @@ There are two options at your disposal:
 * <<grok,Grok>> is a regular expression dialect that supports aliased
 expressions that you can reuse. Because Grok sits on top of regular expressions
 (regex), any regular expressions are valid in grok as well.
-* <<dissect-processor,Dissect>> extracts structured fields out of text, using
+* <<dissect,Dissect>> extracts structured fields out of text, using
 delimiters to define the matching pattern. Unlike grok, dissect doesn't use regular
 expressions.
 
-Regex is incredibly powerful but can be complicated. If you don't need the
-power of regular expressions, use dissect patterns, which are simple and
-often faster than grok patterns. Paying special attention to the parts of the string
-you want to discard will help build successful dissect patterns.
-
 Let's start with a simple example by adding the `@timestamp` and `message`
 fields to the `my-index` mapping as indexed fields. To remain flexible, use
 `wildcard` as the field type for `message`:

diff --git a/docs/reference/scripting/dissect-syntax.asciidoc b/docs/reference/scripting/dissect-syntax.asciidoc
@@ -0,0 +1,301 @@
+[[dissect]]
+=== Dissecting data
+Dissect matches a single text field against a defined pattern. A dissect
+pattern is defined by the parts of the string you want to discard. Paying
+special attention to each part of a string helps to build successful dissect
+patterns.
+
+If you don't need the power of regular expressions, use dissect patterns instead
+of grok. Dissect uses a much simpler syntax than grok and is typically faster
+overall. The syntax for dissect is transparent: tell dissect what you want and
+it will return those results to you.
+
+[[dissect-syntax]]
+==== Dissect patterns
+Dissect patterns consist of _variables_ and _separators_. Anything
+defined by a percent sign and curly braces `%{}` is considered a variable,
+such as `%{clientip}`. You can assign variables to any part of data in a field,
+and then return only the parts that you want. Separators are any values between
+variables, which could be spaces, dashes, or other delimiters.
+
+For example, let's say you have log data with a `message` field that looks like
+this:
+
+[source,js]
+----
+"message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
+----
+// NOTCONSOLE
+
+You assign variables to each part of the data to construct a successful
+dissect pattern. Remember, tell dissect _exactly_ what you want to match on.
+
+The first part of the data looks like an IP address, so you
+can assign a variable like `%{clientip}`. The next two characters are dashes
+with a space on either side. You can assign a variable for each dash, or a
+single variable to represent the dashes and spaces. Next is a set of brackets
+containing a timestamp. The brackets are a separator, so you include those in
+the dissect pattern. Thus far, the data and matching dissect pattern look like
+this:
+
+[source,js]
+----
+247.37.0.0 - - [30/Apr/2020:14:31:22 -0500]  <1>
+
+%{clientip} %{ident} %{auth} [%{@timestamp}] <2>
+----
+// NOTCONSOLE
+<1> The first chunks of data from the `message` field
+<2> Dissect pattern to match on the selected data chunks
+
+Using that same logic, you can create variables for the remaining chunks of
+data. Double quotation marks are separators, so include those in your dissect
+pattern. The pattern replaces `GET` with a `%{verb}` variable, but keeps `HTTP`
+as part of the pattern. 
+
+[source,js]
+----
+\"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0
+
+"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}
+----
+// NOTCONSOLE
+
+Combining the two patterns results in a dissect pattern that looks like this: 
+
+[source,js]
+----
+%{clientip} %{ident} %{auth} [%{@timestamp}] \"%{verb} %{request} HTTP/%{httpversion}\" %{response} %{size}
+----
+// NOTCONSOLE
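The following is a minimal Python sketch of delimiter-based matching, written to illustrate the idea behind the combined pattern. It is not how Elasticsearch implements dissect, and `dissect_sketch` is a hypothetical name. It assumes every pair of adjacent variables is separated by a non-empty literal and ignores dissect modifiers.

```python
import re

# Hypothetical, simplified matcher for illustration; not the Elasticsearch implementation.
def dissect_sketch(pattern, text):
    """Match text against a dissect-style pattern; return a dict or None."""
    keys = re.findall(r"%\{([^}]*)\}", pattern)
    seps = re.split(r"%\{[^}]*\}", pattern)
    # The text must start with the literal that precedes the first variable.
    if not text.startswith(seps[0]):
        return None
    pos = len(seps[0])
    fields = {}
    for key, sep in zip(keys, seps[1:]):
        if sep == "":
            # No trailing separator: the variable takes the rest of the text.
            end = len(text)
        else:
            end = text.find(sep, pos)
            if end == -1:
                return None  # separator missing, so the pattern does not match
        fields[key] = text[pos:end]
        pos = end + len(sep)
    # Reject leftover text that the pattern did not account for.
    return fields if pos == len(text) else None

pattern = ('%{clientip} %{ident} %{auth} [%{@timestamp}] '
           '"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}')
message = '247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0'

fields = dissect_sketch(pattern, message)
print(fields["clientip"], fields["verb"], fields["response"])  # 247.37.0.0 GET 304
print(dissect_sketch(pattern, "not a valid apache log"))       # None
```

No regular expression is applied to the message itself. The sketch only searches for the next literal separator, which is the property that keeps dissect patterns simple.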
+
+Now that you have a dissect pattern, how do you test and use it?
+
+[[dissect-patterns-test]]
+==== Test dissect patterns with Painless
+You can incorporate dissect patterns into Painless scripts to extract
+data. To test your script, use either the {painless}/painless-execute-api.html#painless-execute-runtime-field-context[field contexts] of the Painless
+execute API or create a runtime field that includes the script. Runtime fields
+offer greater flexibility and accept multiple documents, but the Painless execute
+API is a great option if you don't have write access on a cluster where you're
+testing a script.
+
+For example, test your dissect pattern with the Painless execute API by
+including your Painless script and a single document that matches your data.
+Start by indexing the `message` field as a `wildcard` data type:
+
+[source,console]
+----
+PUT my-index
+{
+  "mappings": {
+    "properties": {
+      "message": {
+        "type": "wildcard"
+      }
+    }
+  }
+}
+----
+
+If you want to retrieve the HTTP response code, add your dissect pattern to a
+Painless script that extracts the `response` value. To extract values from a
+field, use this function:
+
+[source,painless]
+----
+.extract(doc["<field_name>"].value)?.<field_value>
+----
+
+In this example, `message` is the `<field_name>` and `response` is the
+`<field_value>`:
+
+[source,console]
+----
+POST /_scripts/painless/_execute
+{
+  "script": {
+    "source": """
+      String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
+      if (response != null) emit(Integer.parseInt(response)); <1>
+    """
+  },
+  "context": "long_field", <2>
+  "context_setup": {
+    "index": "my-index",
+    "document": {          <3>
+      "message": """247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0"""
+    }
+  }
+}
+----
+// TEST[continued]
+<1> Runtime fields require the `emit` method to return values.
+<2> Because the response code is an integer, use the `long_field` context.
+<3> Include a sample document that matches your data.
+
+The result includes the HTTP response code:
+
+[source,console-result]
+----
+{
+  "result" : [
+    304
+  ]
+}
+----
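The control flow of the script above can be sketched in Python. This assumes that `extract` returns null when the message does not match the pattern, which is the case the `?.` operator guards against. The names `emit_response` and `emitted` are hypothetical and stand in for the script body and the runtime field's `emit` method.

```python
# Hypothetical stand-in for the Painless script body; not Elasticsearch code.
def emit_response(fields, emit):
    # Mirrors `?.response`: a failed match (None) yields None instead of an error.
    response = fields.get("response") if fields is not None else None
    if response is not None:
        emit(int(response))

emitted = []
emit_response({"response": "304", "size": "0"}, emitted.append)  # matching document
emit_response(None, emitted.append)                               # non-matching document
print(emitted)  # [304]
```

A document that does not match simply contributes no value to the field, rather than failing the request.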
+
+[[dissect-patterns-runtime]]
+==== Use dissect patterns and scripts in runtime fields
+If you have a functional dissect pattern, you can add it to a runtime field to
+manipulate data. Because runtime fields don't require you to index fields, you
+have incredible flexibility to modify your script and how it functions. If you
+already <<dissect-patterns-test,tested your dissect pattern>> using the Painless
+execute API, you can use that _exact_ Painless script in your runtime field. 
+
+To start, add the `message` field as a `wildcard` type as in the previous
+section, but also add `@timestamp` as a `date` in case you want to operate on
+that field for <<common-script-uses,other use cases>>:
+
+[source,console]
+----
+PUT /my-index/
+{
+  "mappings": {
+    "properties": {
+      "@timestamp": {
+        "format": "strict_date_optional_time||epoch_second",
+        "type": "date"
+      },
+      "message": {
+        "type": "wildcard"
+      }
+    }
+  }
+}
+----
+
+If you want to extract the HTTP response code using your dissect pattern, you
+can create a runtime field like `http.response`:
+
+[source,console]
+----
+PUT my-index/_mappings
+{
+  "runtime": {
+    "http.response": {
+      "type": "long",
+      "script": """
+        String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
+        if (response != null) emit(Integer.parseInt(response));
+      """
+    }
+  }
+}
+----
+// TEST[continued]
+
+After mapping the fields you want to retrieve, index a few records from
+your log data into {es}. The following request uses the <<docs-bulk,bulk API>>
+to index raw log data into `my-index`:
+
+[source,console]
+----
+POST /my-index/_bulk?refresh=true
+{"index":{}}
+{"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
+{"index":{}}
+{"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
+{"index":{}}
+{"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
+{"index":{}}
+{"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
+{"index":{}}
+{"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
+{"index":{}}
+{"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
+{"index":{}}
+{"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}
+----
+// TEST[continued]
+
+You can define a simple query to run a search for a specific HTTP response and
+return all related fields. Use the `fields` parameter of the search API to
+retrieve the `http.response` runtime field.
+
+[source,console]
+----
+GET my-index/_search
+{
+  "query": {
+    "match": {
+      "http.response": "304"
+    }
+  },
+  "fields" : ["http.response"]
+}
+----
+// TEST[continued]
+
+Alternatively, you can define the same runtime field in the context of a
+search request. The runtime definition and script are exactly the same as
+those defined previously in the index mapping. Copy that definition into the
+search request under the `runtime_mappings` section and include a query that
+matches on the runtime field. This query returns the same results as the
+previous search, but the runtime field exists only for the duration of this
+specific search:
+
+[source,console]
+----
+GET my-index/_search
+{
+  "runtime_mappings": {
+    "http.response": {
+      "type": "long",
+      "script": """
+        String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
+        if (response != null) emit(Integer.parseInt(response));
+      """
+    }
+  },
+  "query": {
+    "match": {
+      "http.response": "304"
+    }
+  },
+  "fields" : ["http.response"]
+}
+----
+// TEST[continued]
+// TEST[s/_search/_search\?filter_path=hits/]
+
+[source,console-result]
+----
+{
+  "hits" : {
+    "total" : {
+      "value" : 1,
+      "relation" : "eq"
+    },
+    "max_score" : 1.0,
+    "hits" : [
+      {
+        "_index" : "my-index",
+        "_id" : "D47UqXkBByC8cgZrkbOm",
+        "_score" : 1.0,
+        "_source" : {
+          "timestamp" : "2020-04-30T14:31:22-05:00",
+          "message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
+        },
+        "fields" : {
+          "http.response" : [
+            304
+          ]
+        }
+      }
+    ]
+  }
+}
+----
+// TESTRESPONSE[s/"_id" : "D47UqXkBByC8cgZrkbOm"/"_id": $body.hits.hits.0._id/]
diff --git a/docs/reference/scripting/using.asciidoc b/docs/reference/scripting/using.asciidoc
@@ -566,4 +566,5 @@ DELETE /_ingest/pipeline/my_test_scores_pipeline
 
 ////
 
+include::dissect-syntax.asciidoc[]
 include::grok-syntax.asciidoc[]