Endpoint to test Grok pattern #104394

Merged
merged 29 commits into from
Jan 24, 2024
2284725
Add extract match ranges functionality to Grok.
jan-elastic Jan 15, 2024
3e2057e
TestGrokPatternAction and Request
jan-elastic Jan 16, 2024
4c925dd
TestGrokPattern response
jan-elastic Jan 16, 2024
84d5b77
Update docs/changelog/104394.yaml
jan-elastic Jan 16, 2024
6ed27d2
Polish validation error message
jan-elastic Jan 16, 2024
83db25c
Improve test_grok_pattern API
jan-elastic Jan 17, 2024
05f9658
Add explicit CharSet
jan-elastic Jan 17, 2024
9daa1b6
Add endpoint to operator constants
jan-elastic Jan 17, 2024
5d16a04
Add TransportTestGrokPatternActionTests
jan-elastic Jan 17, 2024
e1b7394
REST API spec
jan-elastic Jan 17, 2024
b18bd15
One more TransportTestGrokPatternActionTest
jan-elastic Jan 17, 2024
6c22cef
Fix API spec
jan-elastic Jan 17, 2024
21004c9
Refactor REST API spec
jan-elastic Jan 17, 2024
3ce30fc
Merge branch 'main' of github.com:elastic/elasticsearch into test_gro…
jan-elastic Jan 18, 2024
018614b
Polish code
jan-elastic Jan 18, 2024
4ee8ffb
Replace TransportTestGrokPatternActionTests by a YAML REST test
jan-elastic Jan 18, 2024
2092a89
Add ecs_compatibility
jan-elastic Jan 18, 2024
f5ca1bd
Always return arrays in the API
jan-elastic Jan 22, 2024
6ef2f1e
Documentation
jan-elastic Jan 22, 2024
71cea7f
YAML test for ecs_compatibility
jan-elastic Jan 22, 2024
a9e1fd1
Rename doc file
jan-elastic Jan 22, 2024
99c6efa
serverless scope
jan-elastic Jan 22, 2024
c7ad51c
Fix docs (hopefully)
jan-elastic Jan 22, 2024
e906fc8
Update docs/reference/rest-api/index.asciidoc
jan-elastic Jan 23, 2024
d0441e2
Add "text structure APIs" header in docs TOC
jan-elastic Jan 23, 2024
d16051d
Move file
jan-elastic Jan 23, 2024
a060008
Remove test grok from main index
jan-elastic Jan 23, 2024
8bdd927
typo
jan-elastic Jan 23, 2024
98953e2
Nested APIs underneath text structure
jan-elastic Jan 23, 2024
5 changes: 5 additions & 0 deletions docs/changelog/104394.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 104394
summary: Endpoint to find positions of Grok pattern matches
area: Machine Learning
type: enhancement
issues: []
4 changes: 2 additions & 2 deletions docs/reference/rest-api/index.asciidoc
@@ -25,7 +25,6 @@ not be included yet.
* <<enrich-apis,Enrich APIs>>
* <<eql-apis,EQL search APIs>>
* <<esql-apis,{esql} query APIs>>
* <<find-structure,Find structure API>>
* <<fleet-apis,Fleet APIs>>
* <<graph-explore-api,Graph explore API>>
* <<indices, Index APIs>>
@@ -54,6 +53,7 @@ not be included yet.
* <<snapshot-lifecycle-management-api,Snapshot lifecycle management APIs>>
* <<sql-apis,SQL APIs>>
* <<synonyms-apis,Synonyms APIs>>
* <<text-structure-apis,Text structure APIs>>
* <<transform-apis,{transform-cap} APIs>>
* <<usage-api,Usage API>>
* <<watcher-api,Watcher APIs>>
@@ -75,7 +75,6 @@ include::{es-repo-dir}/eql/eql-apis.asciidoc[]
include::{es-repo-dir}/esql/esql-apis.asciidoc[]
include::{es-repo-dir}/features/apis/features-apis.asciidoc[]
include::{es-repo-dir}/fleet/index.asciidoc[]
include::{es-repo-dir}/text-structure/apis/find-structure.asciidoc[leveloffset=+1]
include::{es-repo-dir}/graph/explore.asciidoc[]
include::{es-repo-dir}/indices.asciidoc[]
include::{es-repo-dir}/ilm/apis/ilm-api.asciidoc[]
@@ -103,6 +102,7 @@ include::{es-repo-dir}/snapshot-restore/apis/snapshot-restore-apis.asciidoc[]
include::{es-repo-dir}/slm/apis/slm-api.asciidoc[]
include::{es-repo-dir}/sql/apis/sql-apis.asciidoc[]
include::{es-repo-dir}/synonyms/apis/synonyms-apis.asciidoc[]
include::{es-repo-dir}/text-structure/apis/index.asciidoc[]
include::{es-repo-dir}/transform/apis/index.asciidoc[]
include::usage.asciidoc[]
include::{es-repo-dir}/rest-api/watcher.asciidoc[]
11 changes: 11 additions & 0 deletions docs/reference/text-structure/apis/index.asciidoc
@@ -0,0 +1,11 @@
[role="xpack"]
[[text-structure-apis]]
== Text structure APIs

You can use the following APIs to find text structures:

* <<find-structure>>
* <<test-grok-pattern>>

include::find-structure.asciidoc[leveloffset=+2]
include::test-grok-pattern.asciidoc[leveloffset=+2]
95 changes: 95 additions & 0 deletions docs/reference/text-structure/apis/test-grok-pattern.asciidoc
@@ -0,0 +1,95 @@
[role="xpack"]
[[test-grok-pattern]]
= Test Grok pattern API

++++
<titleabbrev>Test Grok pattern</titleabbrev>
++++

Tests a Grok pattern on one or more lines of text. See also <<grok,Grokking grok>>.

[discrete]
[[test-grok-pattern-request]]
== {api-request-title}

`GET _text_structure/test_grok_pattern` +

`POST _text_structure/test_grok_pattern` +

[discrete]
[[test-grok-pattern-desc]]
== {api-description-title}

The test Grok pattern API allows you to execute a Grok pattern on one
or more lines of text. It returns whether the lines match the pattern
together with the offsets and lengths of the matched substrings.

[discrete]
[[test-grok-pattern-query-parms]]
== {api-query-parms-title}

`ecs_compatibility`::
(Optional, string) The mode of compatibility with ECS-compliant Grok patterns.
Use this parameter to specify whether to use ECS Grok patterns instead of
legacy ones when testing the Grok pattern. Valid values are `disabled` and
`v1`. The default value is `disabled`.

[discrete]
[[test-grok-pattern-request-body]]
== {api-request-body-title}

`grok_pattern`::
(Required, string)
The Grok pattern to run on the lines of text.

`text`::
(Required, array of strings)
The lines of text to run the Grok pattern on.

[discrete]
[[test-grok-pattern-example]]
== {api-examples-title}

[source,console]
--------------------------------------------------
GET _text_structure/test_grok_pattern
{
"grok_pattern": "Hello %{WORD:first_name} %{WORD:last_name}",
"text": [
"Hello John Doe",
"this does not match"
]
}
--------------------------------------------------

The API returns the following response:

[source,console-result]
----
{
"matches": [
{
"matched": true,
"fields": {
"first_name": [
{
"match": "John",
"offset": 6,
"length": 4
}
],
"last_name": [
{
"match": "Doe",
"offset": 11,
"length": 3
}
]
}
},
{
"matched": false
}
]
}
----
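
To make the `offset` and `length` fields in the response concrete, here is a minimal Java sketch that reproduces the numbers for the first line using `java.util.regex` rather than Grok. The regular expression is an assumed stand-in for `Hello %{WORD:first_name} %{WORD:last_name}` (Java group names cannot contain underscores, hence `firstName` and `lastName`), and `OffsetSketch` and `ranges` are hypothetical names that are not part of this PR. Offsets here count Java `char`s, which agrees with the response above only because the text is ASCII.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OffsetSketch {
    // Assumed stand-in for the Grok pattern in the example request.
    static final Pattern PATTERN = Pattern.compile("Hello (?<firstName>\\w+) (?<lastName>\\w+)");

    /** Returns group name -> {offset, length}, or null if the line does not match. */
    static Map<String, int[]> ranges(String line) {
        Matcher m = PATTERN.matcher(line);
        if (!m.find()) {
            return null; // corresponds to "matched": false
        }
        Map<String, int[]> result = new LinkedHashMap<>();
        for (String name : List.of("firstName", "lastName")) {
            int start = m.start(name);
            result.put(name, new int[] { start, m.end(name) - start });
        }
        return result;
    }
}
```

Calling `OffsetSketch.ranges("Hello John Doe")` gives offset 6 and length 4 for the first group and offset 11 and length 3 for the second, while the non-matching line yields `null`.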
24 changes: 21 additions & 3 deletions libs/grok/src/main/java/org/elasticsearch/grok/Grok.java
@@ -23,6 +23,7 @@
import java.util.Locale;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

public final class Grok {

@@ -86,7 +87,7 @@ private Grok(
expressionBytes.length,
Option.DEFAULT,
UTF8Encoding.INSTANCE,
message -> logCallBack.accept(message)
logCallBack::accept
);

List<GrokCaptureConfig> grokCaptureConfigs = new ArrayList<>();
@@ -116,7 +117,7 @@ private static String groupMatch(String name, Region region, String pattern) {
*
* @return named regex expression
*/
protected String toRegex(PatternBank patternBank, String grokPattern) {
String toRegex(PatternBank patternBank, String grokPattern) {
StringBuilder res = new StringBuilder();
for (int i = 0; i < MAX_TO_REGEX_ITERATIONS; i++) {
byte[] grokPatternBytes = grokPattern.getBytes(StandardCharsets.UTF_8);
@@ -189,8 +190,25 @@ public boolean match(String text) {
* @return a map containing field names and their respective coerced values that matched or null if the pattern didn't match
*/
public Map<String, Object> captures(String text) {
return innerCaptures(text, cfg -> cfg::objectExtracter);
}

/**
* Matches and returns the ranges of any named captures.
*
* @param text the text to match and extract values from.
* @return a map containing field names and their respective ranges that matched or null if the pattern didn't match
*/
public Map<String, Object> captureRanges(String text) {
return innerCaptures(text, cfg -> cfg::rangeExtracter);
}

private Map<String, Object> innerCaptures(
String text,
Function<GrokCaptureConfig, Function<Consumer<Object>, GrokCaptureExtracter>> getExtracter
) {
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
GrokCaptureExtracter.MapExtracter extracter = new GrokCaptureExtracter.MapExtracter(captureConfig);
GrokCaptureExtracter.MapExtracter extracter = new GrokCaptureExtracter.MapExtracter(captureConfig, getExtracter);
if (match(utf8Bytes, 0, utf8Bytes.length, extracter)) {
return extracter.result();
}
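
The `getExtracter` parameter lets `captures` and `captureRanges` share one matching routine: the caller passes a function that picks which factory method of the capture config to use. The following self-contained sketch shows that shape with hypothetical stand-in types (`Config`, `Extracter`, `SelectorSketch`); it does not use the real Grok classes and is only an illustration of the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

final class SelectorSketch {
    /** Hypothetical stand-in for GrokCaptureExtracter: consumes a piece of matched text. */
    interface Extracter {
        void extract(String matched);
    }

    /** Hypothetical stand-in for GrokCaptureConfig with two ways of building an extracter. */
    static final class Config {
        Extracter valueExtracter(Consumer<Object> emit) {
            return matched -> emit.accept(matched);
        }

        Extracter lengthExtracter(Consumer<Object> emit) {
            return matched -> emit.accept(matched.length());
        }
    }

    /** Shared routine: the selector decides which factory method of Config is used. */
    static List<Object> run(String text, Function<Config, Function<Consumer<Object>, Extracter>> selector) {
        List<Object> out = new ArrayList<>();
        Config config = new Config();
        // selector.apply(config) yields a bound method reference such as config::valueExtracter.
        selector.apply(config).apply(out::add).extract(text);
        return out;
    }
}
```

With this shape, `SelectorSketch.run("abc", cfg -> cfg::valueExtracter)` and `SelectorSketch.run("abc", cfg -> cfg::lengthExtracter)` differ only in the selector, mirroring `cfg -> cfg::objectExtracter` and `cfg -> cfg::rangeExtracter` in the diff.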
Original file line number Diff line number Diff line change
@@ -144,4 +144,21 @@ public interface NativeExtracterMap<T> {
*/
T forBoolean(Function<Consumer<Boolean>, GrokCaptureExtracter> buildExtracter);
}

/**
* Creates a {@linkplain GrokCaptureExtracter} that will call {@code emit} with the
* extracted range (offset and length) when it extracts text.
*/
public GrokCaptureExtracter rangeExtracter(Consumer<Object> emit) {
return (utf8Bytes, offset, region) -> {
for (int number : backRefs) {
if (region.beg[number] >= 0) {
int matchOffset = offset + region.beg[number];
int matchLength = region.end[number] - region.beg[number];
String match = new String(utf8Bytes, matchOffset, matchLength, StandardCharsets.UTF_8);
emit.accept(new GrokCaptureExtracter.Range(match, matchOffset, matchLength));
}
}
};
}
}
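
Note that `matchOffset` and `matchLength` above index into `utf8Bytes`, so for non-ASCII input they count bytes rather than Java `char`s. As a sketch of what a caller could do if it needed `String` indices instead, the following converts a UTF-8 byte range into a char range. `Utf8RangeSketch` and `toCharRange` are hypothetical names, and nothing in this diff shows whether the endpoint performs such a conversion.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RangeSketch {
    /**
     * Converts a byte range in the UTF-8 encoding of text into {charOffset, charLength}.
     * Assumes both range boundaries fall between encoded code points.
     */
    static int[] toCharRange(String text, int byteOffset, int byteLength) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Decoding the prefix gives the number of chars that precede the match.
        int charOffset = new String(utf8, 0, byteOffset, StandardCharsets.UTF_8).length();
        // Decoding the match itself gives its length in chars.
        int charLength = new String(utf8, byteOffset, byteLength, StandardCharsets.UTF_8).length();
        return new int[] { charOffset, charLength };
    }
}
```

For pure ASCII text the two kinds of offsets coincide, which is why the tests in this PR do not distinguish them.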
Original file line number Diff line number Diff line change
@@ -11,9 +11,11 @@
import org.joni.Region;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

import static java.util.Collections.emptyMap;

@@ -22,6 +24,8 @@
*/
public interface GrokCaptureExtracter {

record Range(Object match, int offset, int length) {}

/**
* Extract {@link Map} results. This implementation of {@link GrokCaptureExtracter}
* is mutable and should be discarded after collecting a single result.
@@ -31,11 +35,14 @@ class MapExtracter implements GrokCaptureExtracter {
private final List<GrokCaptureExtracter> fieldExtracters;

@SuppressWarnings("unchecked")
MapExtracter(List<GrokCaptureConfig> captureConfig) {
result = captureConfig.isEmpty() ? emptyMap() : new HashMap<>();
MapExtracter(
List<GrokCaptureConfig> captureConfig,
Function<GrokCaptureConfig, Function<Consumer<Object>, GrokCaptureExtracter>> getExtracter
) {
result = captureConfig.isEmpty() ? emptyMap() : new LinkedHashMap<>();
fieldExtracters = new ArrayList<>(captureConfig.size());
for (GrokCaptureConfig config : captureConfig) {
fieldExtracters.add(config.objectExtracter(value -> {
fieldExtracters.add(getExtracter.apply(config).apply(value -> {
var key = config.name();

// Logstash's Grok processor flattens the list of values to a single value in case there's only 1 match,
50 changes: 49 additions & 1 deletion libs/grok/src/test/java/org/elasticsearch/grok/GrokTests.java
@@ -82,13 +82,61 @@ private void testCapturesBytes(boolean ecsCompatibility) {
}

private Map<String, Object> captureBytes(Grok grok, byte[] utf8, int offset, int length) {
GrokCaptureExtracter.MapExtracter extracter = new GrokCaptureExtracter.MapExtracter(grok.captureConfig());
GrokCaptureExtracter.MapExtracter extracter = new GrokCaptureExtracter.MapExtracter(
grok.captureConfig(),
cfg -> cfg::objectExtracter
);
if (grok.match(utf8, offset, length, extracter)) {
return extracter.result();
}
return null;
}

public void testCaptureRanges() {
captureRanges(false);
captureRanges(true);
}

private void captureRanges(boolean ecsCompatibility) {
Grok grok = new Grok(GrokBuiltinPatterns.get(ecsCompatibility), "%{WORD:a} %{WORD:b} %{NUMBER:c:int}", logger::warn);
assertThat(
grok.captureRanges("xx aaaaa bbb 1234 yyy"),
equalTo(
Map.of(
"a",
new GrokCaptureExtracter.Range("aaaaa", 3, 5),
"b",
new GrokCaptureExtracter.Range("bbb", 9, 3),
"c",
new GrokCaptureExtracter.Range("1234", 13, 4)
)
)
);
}

public void testCaptureRanges_noMatch() {
captureRanges_noMatch(false);
captureRanges_noMatch(true);
}

private void captureRanges_noMatch(boolean ecsCompatibility) {
Grok grok = new Grok(GrokBuiltinPatterns.get(ecsCompatibility), "%{WORD:a} %{WORD:b} %{NUMBER:c:int}", logger::warn);
assertNull(grok.captureRanges("xx aaaaa bbb ccc yyy"));
}

public void testCaptureRanges_multipleNamedCapturesWithSameName() {
captureRanges_multipleNamedCapturesWithSameName(false);
captureRanges_multipleNamedCapturesWithSameName(true);
}

private void captureRanges_multipleNamedCapturesWithSameName(boolean ecsCompatibility) {
Grok grok = new Grok(GrokBuiltinPatterns.get(ecsCompatibility), "%{WORD:parts} %{WORD:parts}", logger::warn);
assertThat(
grok.captureRanges("  aa bbb c ddd e "),
equalTo(Map.of("parts", List.of(new GrokCaptureExtracter.Range("aa", 2, 2), new GrokCaptureExtracter.Range("bbb", 5, 3))))
);
}
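
The expected value in this test is a list because both captures are named `parts`, whereas the earlier tests expect a bare `Range` per name. A minimal sketch of that single-value-or-list accumulation, assuming a plain `Map<String, Object>`, could look like the following. `FlattenSketch` and `put` are hypothetical names; this is not the library's actual `MapExtracter` code, and it would misbehave if a captured value were itself a `List`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class FlattenSketch {
    /** Stores the first value for a key as-is and switches to a list from the second value on. */
    @SuppressWarnings("unchecked")
    static void put(Map<String, Object> result, String key, Object value) {
        Object existing = result.get(key);
        if (existing == null) {
            // First capture with this name: keep the bare value.
            result.put(key, value);
        } else if (existing instanceof List) {
            // Third or later capture: append to the existing list.
            ((List<Object>) existing).add(value);
        } else {
            // Second capture: promote the bare value to a list.
            List<Object> values = new ArrayList<>();
            values.add(existing);
            values.add(value);
            result.put(key, values);
        }
    }
}
```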

public void testNoMatchingPatternInDictionary() {
Exception e = expectThrows(IllegalArgumentException.class, () -> new Grok(PatternBank.EMPTY, "%{NOTFOUND}", logger::warn));
assertThat(e.getMessage(), equalTo("Unable to find pattern [NOTFOUND] in Grok's pattern dictionary"));
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
{
"text_structure.test_grok_pattern": {
"documentation": {
"url": "https://www.elastic.co/guide/en/elasticsearch/reference/master/test-grok-pattern-api.html",
"description": "Tests a Grok pattern on some text."
},
"stability": "stable",
"visibility": "public",
"headers": {
"accept": ["application/json"],
"content_type": ["application/json"]
},
"url": {
"paths": [
{
"path": "/_text_structure/test_grok_pattern",
"methods": ["GET", "POST"]
}
]
},
"params": {
"ecs_compatibility": {
"type": "string",
"description": "Optional parameter to specify the compatibility mode with ECS Grok patterns - may be either 'v1' or 'disabled'"
}
},
"body": {
"description": "The Grok pattern and text.",
"required": true
}
}
}