Endpoint to test Grok pattern (#104394)
* Add extract match ranges functionality to Grok.

* TestGrokPatternAction and Request

* TestGrokPattern response

* Update docs/changelog/104394.yaml

* Polish validation error message

* Improve test_grok_pattern API

* Add explicit CharSet

* Add endpoint to operator constants

* Add TransportTestGrokPatternActionTests

* REST API spec

* One more TransportTestGrokPatternActionTest

* Fix API spec

* Refactor REST API spec

* Polish code

* Replace TransportTestGrokPatternActionTests by a YAML REST test

* Add ecs_compatibility

* Always return arrays in the API

* Documentation

* YAML test for ecs_compatibility

* Rename doc file

* serverless scope

* Fix docs (hopefully)

* Update docs/reference/rest-api/index.asciidoc

Co-authored-by: István Zoltán Szabó <[email protected]>

* Add "text structure APIs" header in docs TOC

* Move file

* Remove test grok from main index

* typo

* Nested APIs underneath text structure

---------

Co-authored-by: István Zoltán Szabó <[email protected]>
jan-elastic and szabosteve authored Jan 24, 2024
1 parent cf67f5d commit 5dec83f
Showing 15 changed files with 614 additions and 13 deletions.
5 changes: 5 additions & 0 deletions docs/changelog/104394.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 104394
summary: Endpoint to find positions of Grok pattern matches
area: Machine Learning
type: enhancement
issues: []
4 changes: 2 additions & 2 deletions docs/reference/rest-api/index.asciidoc
@@ -25,7 +25,6 @@ not be included yet.
* <<enrich-apis,Enrich APIs>>
* <<eql-apis,EQL search APIs>>
* <<esql-apis,{esql} query APIs>>
- * <<find-structure,Find structure API>>
* <<fleet-apis,Fleet APIs>>
* <<graph-explore-api,Graph explore API>>
* <<indices, Index APIs>>
@@ -54,6 +53,7 @@ not be included yet.
* <<snapshot-lifecycle-management-api,Snapshot lifecycle management APIs>>
* <<sql-apis,SQL APIs>>
* <<synonyms-apis,Synonyms APIs>>
+ * <<text-structure-apis,Text structure APIs>>
* <<transform-apis,{transform-cap} APIs>>
* <<usage-api,Usage API>>
* <<watcher-api,Watcher APIs>>
@@ -75,7 +75,6 @@ include::{es-repo-dir}/eql/eql-apis.asciidoc[]
include::{es-repo-dir}/esql/esql-apis.asciidoc[]
include::{es-repo-dir}/features/apis/features-apis.asciidoc[]
include::{es-repo-dir}/fleet/index.asciidoc[]
- include::{es-repo-dir}/text-structure/apis/find-structure.asciidoc[leveloffset=+1]
include::{es-repo-dir}/graph/explore.asciidoc[]
include::{es-repo-dir}/indices.asciidoc[]
include::{es-repo-dir}/ilm/apis/ilm-api.asciidoc[]
@@ -103,6 +102,7 @@ include::{es-repo-dir}/snapshot-restore/apis/snapshot-restore-apis.asciidoc[]
include::{es-repo-dir}/slm/apis/slm-api.asciidoc[]
include::{es-repo-dir}/sql/apis/sql-apis.asciidoc[]
include::{es-repo-dir}/synonyms/apis/synonyms-apis.asciidoc[]
+ include::{es-repo-dir}/text-structure/apis/index.asciidoc[]
include::{es-repo-dir}/transform/apis/index.asciidoc[]
include::usage.asciidoc[]
include::{es-repo-dir}/rest-api/watcher.asciidoc[]
11 changes: 11 additions & 0 deletions docs/reference/text-structure/apis/index.asciidoc
@@ -0,0 +1,11 @@
[role="xpack"]
[[text-structure-apis]]
== Text structure APIs

You can use the following APIs to find text structures:

* <<find-structure>>
* <<test-grok-pattern>>

include::find-structure.asciidoc[leveloffset=+2]
include::test-grok-pattern.asciidoc[leveloffset=+2]
95 changes: 95 additions & 0 deletions docs/reference/text-structure/apis/test-grok-pattern.asciidoc
@@ -0,0 +1,95 @@
[role="xpack"]
[[test-grok-pattern]]
= Test Grok pattern API

++++
<titleabbrev>Test Grok pattern</titleabbrev>
++++

Tests a Grok pattern on lines of text. See also <<grok,Grokking grok>>.

[discrete]
[[test-grok-pattern-request]]
== {api-request-title}

`GET _text_structure/test_grok_pattern` +

`POST _text_structure/test_grok_pattern` +

[discrete]
[[test-grok-pattern-desc]]
== {api-description-title}

The test Grok pattern API allows you to execute a Grok pattern on one
or more lines of text. It returns whether the lines match the pattern
together with the offsets and lengths of the matched substrings.

[discrete]
[[test-grok-pattern-query-parms]]
== {api-query-parms-title}

`ecs_compatibility`::
(Optional, string) The mode of compatibility with ECS compliant Grok patterns.
Use this parameter to specify whether to use ECS Grok patterns instead of
legacy ones when the structure finder creates a Grok pattern. Valid values
are `disabled` and `v1`. The default value is `disabled`.

[discrete]
[[test-grok-pattern-request-body]]
== {api-request-body-title}

`grok_pattern`::
(Required, string)
The Grok pattern to run on the lines of text.

`text`::
(Required, array of strings)
The lines of text to run the Grok pattern on.

[discrete]
[[test-grok-pattern-example]]
== {api-examples-title}

[source,console]
--------------------------------------------------
GET _text_structure/test_grok_pattern
{
"grok_pattern": "Hello %{WORD:first_name} %{WORD:last_name}",
"text": [
"Hello John Doe",
"this does not match"
]
}
--------------------------------------------------

The API returns the following response:

[source,console-result]
----
{
"matches": [
{
"matched": true,
"fields": {
"first_name": [
{
"match": "John",
"offset": 6,
"length": 4
}
],
"last_name": [
{
"match": "Doe",
"offset": 11,
"length": 3
}
]
}
},
{
"matched": false
}
]
}
----
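The `offset` and `length` values in the response above can be reproduced with a minimal sketch that uses plain `java.util.regex` rather than the Grok library. The class `GrokRangeSketch`, its `Range` record, and the group names `firstName` and `lastName` are assumptions made for illustration (Java group names cannot contain underscores), and the regex only approximates `%{WORD}`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrokRangeSketch {
    // Hypothetical stand-in for one entry under "fields" in the response.
    record Range(String field, String match, int offset, int length) {}

    // Rough java.util.regex equivalent of "Hello %{WORD:first_name} %{WORD:last_name}".
    private static final Pattern PATTERN =
        Pattern.compile("Hello (?<firstName>\\b\\w+\\b) (?<lastName>\\b\\w+\\b)");

    // Returns one Range per named group, or an empty list if the line does not match.
    static List<Range> ranges(String line) {
        List<Range> result = new ArrayList<>();
        Matcher m = PATTERN.matcher(line);
        if (m.find()) {
            for (String name : List.of("firstName", "lastName")) {
                int start = m.start(name);
                result.add(new Range(name, m.group(name), start, m.end(name) - start));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : List.of("Hello John Doe", "this does not match")) {
            System.out.println(line + " -> " + ranges(line));
        }
    }
}
```

For `"Hello John Doe"` the sketch yields offset 6, length 4 for `John` and offset 11, length 3 for `Doe`, which agrees with the documented response. The non-matching line yields no ranges, mirroring `"matched": false`.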
24 changes: 21 additions & 3 deletions libs/grok/src/main/java/org/elasticsearch/grok/Grok.java
@@ -23,6 +23,7 @@
import java.util.Locale;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

public final class Grok {

@@ -86,7 +87,7 @@ private Grok(
expressionBytes.length,
Option.DEFAULT,
UTF8Encoding.INSTANCE,
- message -> logCallBack.accept(message)
+ logCallBack::accept
);

List<GrokCaptureConfig> grokCaptureConfigs = new ArrayList<>();
@@ -116,7 +117,7 @@ private static String groupMatch(String name, Region region, String pattern) {
*
* @return named regex expression
*/
- protected String toRegex(PatternBank patternBank, String grokPattern) {
+ String toRegex(PatternBank patternBank, String grokPattern) {
StringBuilder res = new StringBuilder();
for (int i = 0; i < MAX_TO_REGEX_ITERATIONS; i++) {
byte[] grokPatternBytes = grokPattern.getBytes(StandardCharsets.UTF_8);
@@ -189,8 +190,25 @@ public boolean match(String text) {
* @return a map containing field names and their respective coerced values that matched or null if the pattern didn't match
*/
public Map<String, Object> captures(String text) {
return innerCaptures(text, cfg -> cfg::objectExtracter);
}

/**
* Matches and returns the ranges of any named captures.
*
* @param text the text to match and extract values from.
* @return a map containing field names and their respective ranges that matched or null if the pattern didn't match
*/
public Map<String, Object> captureRanges(String text) {
return innerCaptures(text, cfg -> cfg::rangeExtracter);
}

private Map<String, Object> innerCaptures(
String text,
Function<GrokCaptureConfig, Function<Consumer<Object>, GrokCaptureExtracter>> getExtracter
) {
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
- GrokCaptureExtracter.MapExtracter extracter = new GrokCaptureExtracter.MapExtracter(captureConfig);
+ GrokCaptureExtracter.MapExtracter extracter = new GrokCaptureExtracter.MapExtracter(captureConfig, getExtracter);
if (match(utf8Bytes, 0, utf8Bytes.length, extracter)) {
return extracter.result();
}
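The `getExtracter` parameter above is a function that, given a capture config, returns one of that config's extracter factories as a bound method reference (`cfg -> cfg::objectExtracter` or `cfg -> cfg::rangeExtracter`). A minimal, self-contained sketch of that selection pattern follows. `Extracter`, `Config`, `valueExtracter`, and `lengthExtracter` are hypothetical stand-ins, not the real Grok types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

class ExtracterSelectionSketch {
    // Hypothetical stand-in for GrokCaptureExtracter: consumes one matched string.
    interface Extracter {
        void extract(String matched);
    }

    // Hypothetical stand-in for GrokCaptureConfig, offering two extracter factories.
    record Config(String name) {
        Extracter valueExtracter(Consumer<Object> emit) {
            return matched -> emit.accept(matched);
        }

        Extracter lengthExtracter(Consumer<Object> emit) {
            return matched -> emit.accept(matched.length());
        }
    }

    // The caller chooses the factory; the shared logic stays identical for both.
    static List<Object> run(
        Config config,
        String input,
        Function<Config, Function<Consumer<Object>, Extracter>> getExtracter
    ) {
        List<Object> out = new ArrayList<>();
        getExtracter.apply(config).apply(out::add).extract(input);
        return out;
    }

    public static void main(String[] args) {
        Config cfg = new Config("a");
        System.out.println(run(cfg, "abc", c -> c::valueExtracter));
        System.out.println(run(cfg, "abc", c -> c::lengthExtracter));
    }
}
```

Passing the factory as a parameter lets `captures` and `captureRanges` share one matching routine and differ only in what each capture emits.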
Original file line number Diff line number Diff line change
@@ -144,4 +144,21 @@ public interface NativeExtracterMap<T> {
*/
T forBoolean(Function<Consumer<Boolean>, GrokCaptureExtracter> buildExtracter);
}

/**
* Creates a {@linkplain GrokCaptureExtracter} that will call {@code emit} with the
* extracted range (offset and length) when it extracts text.
*/
public GrokCaptureExtracter rangeExtracter(Consumer<Object> emit) {
return (utf8Bytes, offset, region) -> {
for (int number : backRefs) {
if (region.beg[number] >= 0) {
int matchOffset = offset + region.beg[number];
int matchLength = region.end[number] - region.beg[number];
String match = new String(utf8Bytes, matchOffset, matchLength, StandardCharsets.UTF_8);
emit.accept(new GrokCaptureExtracter.Range(match, matchOffset, matchLength));
}
}
};
}
}
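Because `rangeExtracter` computes `matchOffset` and `matchLength` against `utf8Bytes`, the values it emits are positions in the UTF-8 encoding, which differ from Java `String` indices once the text contains non-ASCII characters. The sketch below illustrates that difference with plain JDK calls. `Utf8OffsetSketch` and its helpers are hypothetical names, and whether the REST layer converts these offsets is not visible in this excerpt.

```java
import java.nio.charset.StandardCharsets;

class Utf8OffsetSketch {
    // Offset of needle within the UTF-8 encoding of text, or -1 if absent.
    static int utf8Offset(String text, String needle) {
        int charIndex = text.indexOf(needle);
        if (charIndex < 0) {
            return -1;
        }
        return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    // Decodes a slice of the UTF-8 bytes, the same way rangeExtracter builds its match string.
    static String slice(String text, int byteOffset, int byteLength) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, byteOffset, byteLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo w\u00f6rld"; // two characters that need 2 bytes each in UTF-8
        String word = "w\u00f6rld";
        System.out.println("char index:  " + text.indexOf(word));
        System.out.println("UTF-8 offset: " + utf8Offset(text, word));
        System.out.println("UTF-8 length: " + word.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Here the second word starts at character index 6 but at UTF-8 offset 7, and it is 5 characters but 6 bytes long.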
Original file line number Diff line number Diff line change
@@ -11,9 +11,11 @@
import org.joni.Region;

import java.util.ArrayList;
- import java.util.HashMap;
+ import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

import static java.util.Collections.emptyMap;

@@ -22,6 +24,8 @@
*/
public interface GrokCaptureExtracter {

record Range(Object match, int offset, int length) {}

/**
* Extract {@link Map} results. This implementation of {@link GrokCaptureExtracter}
* is mutable and should be discarded after collecting a single result.
@@ -31,11 +35,14 @@ class MapExtracter implements GrokCaptureExtracter {
private final List<GrokCaptureExtracter> fieldExtracters;

@SuppressWarnings("unchecked")
- MapExtracter(List<GrokCaptureConfig> captureConfig) {
- result = captureConfig.isEmpty() ? emptyMap() : new HashMap<>();
+ MapExtracter(
+ List<GrokCaptureConfig> captureConfig,
+ Function<GrokCaptureConfig, Function<Consumer<Object>, GrokCaptureExtracter>> getExtracter
+ ) {
+ result = captureConfig.isEmpty() ? emptyMap() : new LinkedHashMap<>();
fieldExtracters = new ArrayList<>(captureConfig.size());
for (GrokCaptureConfig config : captureConfig) {
- fieldExtracters.add(config.objectExtracter(value -> {
+ fieldExtracters.add(getExtracter.apply(config).apply(value -> {
var key = config.name();

// Logstash's Grok processor flattens the list of values to a single value in case there's only 1 match,
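The rest of this method is not shown in the hunk, so the following is only a sketch of the behaviour the comment describes, not the actual implementation: values are grouped per capture name, and a group with exactly one value is flattened to that value. `FlattenSketch` and `Capture` are hypothetical names. The sketch uses a `LinkedHashMap`, which keeps keys in insertion order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FlattenSketch {
    // Hypothetical stand-in for one emitted capture: a field name plus its value.
    record Capture(String name, Object value) {}

    // Groups values by capture name, always as lists (the "always return arrays" shape).
    static Map<String, List<Object>> group(List<Capture> captures) {
        Map<String, List<Object>> grouped = new LinkedHashMap<>();
        for (Capture c : captures) {
            grouped.computeIfAbsent(c.name(), k -> new ArrayList<>()).add(c.value());
        }
        return grouped;
    }

    // Flattens single-element lists to the bare value, as the Logstash comment describes.
    static Map<String, Object> flatten(Map<String, List<Object>> grouped) {
        Map<String, Object> result = new LinkedHashMap<>();
        grouped.forEach((name, values) -> result.put(name, values.size() == 1 ? values.get(0) : values));
        return result;
    }

    public static void main(String[] args) {
        List<Capture> captures = List.of(
            new Capture("parts", "aa"),
            new Capture("parts", "bbb"),
            new Capture("other", "x")
        );
        System.out.println(group(captures));
        System.out.println(flatten(group(captures)));
    }
}
```

In this sketch a name captured twice, such as `parts`, stays a list, while a name captured once becomes a bare value. That matches the shape asserted by `testCaptureRanges_multipleNamedCapturesWithSameName` in the test file.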
50 changes: 49 additions & 1 deletion libs/grok/src/test/java/org/elasticsearch/grok/GrokTests.java
@@ -82,13 +82,61 @@ private void testCapturesBytes(boolean ecsCompatibility) {
}

private Map<String, Object> captureBytes(Grok grok, byte[] utf8, int offset, int length) {
- GrokCaptureExtracter.MapExtracter extracter = new GrokCaptureExtracter.MapExtracter(grok.captureConfig());
+ GrokCaptureExtracter.MapExtracter extracter = new GrokCaptureExtracter.MapExtracter(
+ grok.captureConfig(),
+ cfg -> cfg::objectExtracter
+ );
if (grok.match(utf8, offset, length, extracter)) {
return extracter.result();
}
return null;
}

public void testCaptureRanges() {
captureRanges(false);
captureRanges(true);
}

private void captureRanges(boolean ecsCompatibility) {
Grok grok = new Grok(GrokBuiltinPatterns.get(ecsCompatibility), "%{WORD:a} %{WORD:b} %{NUMBER:c:int}", logger::warn);
assertThat(
grok.captureRanges("xx aaaaa bbb 1234 yyy"),
equalTo(
Map.of(
"a",
new GrokCaptureExtracter.Range("aaaaa", 3, 5),
"b",
new GrokCaptureExtracter.Range("bbb", 9, 3),
"c",
new GrokCaptureExtracter.Range("1234", 13, 4)
)
)
);
}

public void testCaptureRanges_noMatch() {
captureRanges_noMatch(false);
captureRanges_noMatch(true);
}

private void captureRanges_noMatch(boolean ecsCompatibility) {
Grok grok = new Grok(GrokBuiltinPatterns.get(ecsCompatibility), "%{WORD:a} %{WORD:b} %{NUMBER:c:int}", logger::warn);
assertNull(grok.captureRanges("xx aaaaa bbb ccc yyy"));
}

public void testCaptureRanges_multipleNamedCapturesWithSameName() {
captureRanges_multipleNamedCapturesWithSameName(false);
captureRanges_multipleNamedCapturesWithSameName(true);
}

private void captureRanges_multipleNamedCapturesWithSameName(boolean ecsCompatibility) {
Grok grok = new Grok(GrokBuiltinPatterns.get(ecsCompatibility), "%{WORD:parts} %{WORD:parts}", logger::warn);
assertThat(
grok.captureRanges(" aa bbb c ddd e "),
equalTo(Map.of("parts", List.of(new GrokCaptureExtracter.Range("aa", 2, 2), new GrokCaptureExtracter.Range("bbb", 5, 3))))
);
}

public void testNoMatchingPatternInDictionary() {
Exception e = expectThrows(IllegalArgumentException.class, () -> new Grok(PatternBank.EMPTY, "%{NOTFOUND}", logger::warn));
assertThat(e.getMessage(), equalTo("Unable to find pattern [NOTFOUND] in Grok's pattern dictionary"));
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
{
"text_structure.test_grok_pattern": {
"documentation": {
"url": "https://www.elastic.co/guide/en/elasticsearch/reference/master/test-grok-pattern-api.html",
"description": "Tests a Grok pattern on some text."
},
"stability": "stable",
"visibility": "public",
"headers": {
"accept": ["application/json"],
"content_type": ["application/json"]
},
"url": {
"paths": [
{
"path": "/_text_structure/test_grok_pattern",
"methods": ["GET", "POST"]
}
]
},
"params": {
"ecs_compatibility": {
"type": "string",
"description": "Optional parameter to specify the compatibility mode with ECS Grok patterns - may be either 'v1' or 'disabled'"
}
},
"body": {
"description": "The Grok pattern and text.",
"required": true
}
}
}
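A request matching this spec could be assembled with the JDK's `java.net.http` API, as in the hedged sketch below. It only builds the request and does not send it. The class name `TestGrokPatternRequestSketch` and the base URL `http://localhost:9200` are assumptions. The path, the `ecs_compatibility` parameter with its two valid values, and the JSON headers come from the spec and docs above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class TestGrokPatternRequestSketch {
    // Builds (but does not send) a POST to the test_grok_pattern endpoint.
    static HttpRequest build(String baseUrl, String ecsCompatibility, String jsonBody) {
        // The docs list "disabled" and "v1" as the only valid values.
        if (!ecsCompatibility.equals("disabled") && !ecsCompatibility.equals("v1")) {
            throw new IllegalArgumentException("ecs_compatibility must be 'disabled' or 'v1'");
        }
        URI uri = URI.create(baseUrl + "/_text_structure/test_grok_pattern?ecs_compatibility=" + ecsCompatibility);
        return HttpRequest.newBuilder(uri)
            .header("Content-Type", "application/json")
            .header("Accept", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();
    }

    public static void main(String[] args) {
        String body = """
            {
              "grok_pattern": "Hello %{WORD:first_name} %{WORD:last_name}",
              "text": ["Hello John Doe", "this does not match"]
            }
            """;
        // Assumed local cluster address; adjust for a real deployment.
        HttpRequest request = build("http://localhost:9200", "v1", body);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would additionally need an `HttpClient` and whatever authentication the cluster requires, which is outside the scope of this sketch.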