[pkg/ottl] Support for grok patterns #32593

Closed
michalpristas opened this issue Apr 22, 2024 · 8 comments
Labels: discussion needed, enhancement, pkg/ottl

Comments

@michalpristas
Contributor

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

This is just a copy of what I wrote as a comment in another issue, as I thought we were discussing this there.

Why should we support grok?
In my opinion, grok is much more readable than raw regex, and it is very common among our users.
I also have custom pattern definitions in mind, so you could do something like the following.

With an ExtractGrokPattern signature like this:
ExtractGrokPattern(source, pattern, custom_patterns)

where custom_patterns is a map,

and the input string
my beagle is colored BLUE

you could do

ExtractGrokPattern(source, "my %{FAVORITE_DOG:dog} is colored %{RGB:color}", {
  "FAVORITE_DOG": "beagle",
  "RGB": "RED|GREEN|BLUE"
})

and this would result in

{
  "dog": "beagle",
  "color": "BLUE"
}
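
To make the proposal concrete, here is a minimal Python sketch of how a grok expression with custom patterns could be expanded into a plain regex and applied. It is only an illustration under my own assumptions: `extract_grok` and `expand` are hypothetical helper names, the input string is `my beagle is colored BLUE`, and none of this is the actual OTTL or go-grok implementation.

```python
import re

# Matches %{NAME} or %{NAME:field}; NAME refers to a pattern definition.
GROK_REF = re.compile(r"%\{(\w+)(?::([\w.]+))?\}")


def expand(pattern, definitions, fields, depth=0):
    """Recursively replace %{NAME:field} references with regex fragments."""
    if depth > 20:
        raise ValueError("pattern definitions are nested too deeply (cycle?)")

    def substitute(match):
        name, field = match.group(1), match.group(2)
        if name not in definitions:
            raise KeyError(f"unknown grok pattern: {name}")
        inner = expand(definitions[name], definitions, fields, depth + 1)
        if field is None:
            # No field name requested: keep the fragment non-capturing.
            return f"(?:{inner})"
        # Python group names must be identifiers, so use g0, g1, ... and
        # remember which field name each group stands for.
        group = f"g{len(fields)}"
        fields[group] = field
        return f"(?P<{group}>{inner})"

    return GROK_REF.sub(substitute, pattern)


def extract_grok(source, pattern, custom_patterns):
    """Hypothetical stand-in for the proposed ExtractGrokPattern."""
    fields = {}
    regex = re.compile(expand(pattern, custom_patterns, fields))
    match = regex.search(source)
    if match is None:
        return {}
    return {fields[g]: v for g, v in match.groupdict().items() if v is not None}


result = extract_grok(
    "my beagle is colored BLUE",
    "my %{FAVORITE_DOG:dog} is colored %{RGB:color}",
    {"FAVORITE_DOG": "beagle", "RGB": "RED|GREEN|BLUE"},
)
print(result)
```

The sketch returns an empty map when the input does not match, which is one possible design choice; a real converter might prefer to return an error instead.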

While this example is not very realistic, an nginx example from our pipeline shows the beauty of it:

patterns:
  - (%{NGINX_HOST} )?"?(?:%{NGINX_ADDRESS_LIST:result.access.remote_ip_list}|%{NOTSPACE:source.address})
    - (-|%{DATA:user.name}) \[%{HTTPDATE:result.access.time}\] "%{DATA:result.access.info}"
    %{NUMBER:http.response.status_code:long} %{NUMBER:http.response.body.bytes:long}
    "(-|%{DATA:http.request.referrer})" "(-|%{DATA:user_agent.original})" %{NUMBER:result.access.http.request.length:long}
    %{NUMBER:result.access.http.request.time:double} \[%{DATA:result.access.upstream.name}\]
    \[%{DATA:result.access.upstream.alternative_name}\] (%{UPSTREAM_ADDRESS_LIST:result.access.upstream_address_list}|-)
    (%{UPSTREAM_RESPONSE_LENGTH_LIST:result.access.upstream.response.length_list}|-) (%{UPSTREAM_RESPONSE_TIME_LIST:result.access.upstream.response.time_list}|-)
    (%{UPSTREAM_RESPONSE_STATUS_CODE_LIST:result.access.upstream.response.status_code_list}|-) %{GREEDYDATA:result.access.http.request.id}
pattern_definitions:
  NGINX_HOST: (?:%{IP:destination.ip}|%{NGINX_NOTSEPARATOR:destination.domain})(:%{NUMBER:destination.port})?
  NGINX_NOTSEPARATOR: "[^\t ,:]+"
  NGINX_ADDRESS_LIST: (?:%{IP}|%{WORD})("?,?\s*(?:%{IP}|%{WORD}))*
  UPSTREAM_ADDRESS_LIST: (?:%{IP}(:%{NUMBER})?)("?,?\s*(?:%{IP}(:%{NUMBER})?))*
  UPSTREAM_RESPONSE_LENGTH_LIST: (?:%{NUMBER})("?,?\s*(?:%{NUMBER}))*
  UPSTREAM_RESPONSE_TIME_LIST: (?:%{NUMBER})("?,?\s*(?:%{NUMBER}))*
  UPSTREAM_RESPONSE_STATUS_CODE_LIST: (?:%{NUMBER})("?,?\s*(?:%{NUMBER}))*
  IP: (?:\[?%{IPV6}\]?|%{IPV4})

This pattern is complex, and writing it as a single raw regex would be ugly.
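
Two features carry most of the weight in the nginx example: definitions that reference other definitions, and the `:long` / `:double` type suffixes. The Python sketch below illustrates both on a heavily reduced pattern. The `BASE` definitions are simplified stand-ins I made up for the built-in patterns, `compile_grok` and `parse` are hypothetical names, and the sample log line is invented; only `UPSTREAM_ADDRESS_LIST` and the field names are taken from the pipeline above.

```python
import re

# %{NAME}, %{NAME:field} or %{NAME:field:type}
GROK_REF = re.compile(r"%\{(\w+)(?::([\w.]+))?(?::(\w+))?\}")

# Simplified stand-ins for built-in patterns (assumption); real grok
# libraries ship much stricter definitions.
BASE = {
    "NUMBER": r"\d+(?:\.\d+)?",
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "DATA": r".*?",
}

# Type suffixes mapped to Python conversions; anything else stays a string.
CASTS = {"long": int, "double": float}


def compile_grok(pattern, definitions):
    captures = []  # (regex group name, field name, cast) per named capture

    def expand(text, depth=0):
        if depth > 20:
            raise ValueError("pattern definitions are nested too deeply")

        def substitute(m):
            name, field, kind = m.groups()
            inner = expand(definitions[name], depth + 1)
            if field is None:
                return f"(?:{inner})"
            group = f"g{len(captures)}"
            captures.append((group, field, CASTS.get(kind, str)))
            return f"(?P<{group}>{inner})"

        return GROK_REF.sub(substitute, text)

    return re.compile(expand(pattern)), captures


def parse(line, pattern, custom):
    regex, captures = compile_grok(pattern, {**BASE, **custom})
    m = regex.search(line)
    if m is None:
        return {}
    return {
        field: cast(m.group(group))
        for group, field, cast in captures
        if m.group(group) is not None
    }


custom = {
    # Copied from the pattern_definitions above; it references IP and NUMBER.
    "UPSTREAM_ADDRESS_LIST": r'(?:%{IP}(:%{NUMBER})?)("?,?\s*(?:%{IP}(:%{NUMBER})?))*',
}
pattern = (
    r"%{NUMBER:http.response.status_code:long} "
    r"%{NUMBER:result.access.http.request.time:double} "
    r"\[%{DATA:result.access.upstream.name}\] "
    r"(%{UPSTREAM_ADDRESS_LIST:result.access.upstream_address_list}|-)"
)
parsed = parse("200 0.123 [backend] 10.0.0.1:8080, 10.0.0.2:8080", pattern, custom)
print(parsed)
```

The point of the sketch is that each definition stays short and reusable, while the fully expanded regex grows quickly and becomes hard to read.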

Describe the solution you'd like

Add ExtractGrokPattern(source, pattern, custom_patterns) alongside ExtractPatterns to give users an option.

Grok uses regex under the hood anyway, but provides a better experience.

Describe alternatives you've considered

No response

Additional context

No response

@michalpristas added the enhancement and needs triage labels Apr 22, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@TylerHelmuth
Member

Which grok library do you recommend? Do you know the performance of grok vs regex?

@TylerHelmuth added the discussion needed label and removed the needs triage label Apr 22, 2024
@michalpristas
Contributor Author

michalpristas commented Apr 23, 2024

We currently use grok processing on our Java backend.
With Go, grok is a bit more complicated.
There are a few existing libraries, e.g.
https://github.com/vjeantet/grok (this is the closest to the original Elasticsearch implementation).
It has not been maintained for 7 years now. Based on it there is a fork, https://github.com/trivago/grok, which seems to be more performant/optimized.

There's also this one: https://github.com/logrusorgru/grokky,
which is also not maintained.

In my ideal world, Elastic would have its own Go grok library; I hope we will be able to prioritize this soon to make it a reality.
For the time being, I think vjeantet/grok or the trivago fork would be the best choice.
Using one of these in alpha and switching to a better-maintained library in beta would be nice.

@michalpristas
Contributor Author

@TylerHelmuth is there a path you prefer?
I could start by writing a grok parsing library, which I would then use in a separate PR for ExtractGrokPattern instead of an unmaintained one.

Or, if you're OK with an unmaintained library, I'd prefer to work on ExtractGrokPattern directly.

@TylerHelmuth
Member

I definitely don't like willingly taking a dependency on an unmaintained library, but I don't know how much effort it would take to create one for grok.

@michalpristas
Contributor Author

I have spent some time on this since last week. I now have something that we will host in a repo under elastic, and I will use that library.
Performance-wise, it's comparable to the existing solutions or to regex.

@andrzej-stencel
Member

Assigning to @michalpristas on his request.

evan-bradley added a commit that referenced this issue Aug 9, 2024
**Description:** 
Added converter to OTTL for parsing grok patterns

**Link to tracking Issue:**
#32593

**Testing:** 
Added unit tests and an e2e test.

For a manual test, use this config:

```yaml
receivers:
  filelog:
    include: [ demo.log ]
    start_at: beginning

exporters:
  debug:
    verbosity: detailed
    sampling_initial: 10000
    sampling_thereafter: 10000

processors:
  transform:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ExtractGrokPatterns(body, "%{WOOHOO}", true, ["WOOHOO=%{ELB_URI} otel"]), "insert")

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [transform]
      exporters:
        - debug
```

Add this line to `demo.log`:
```
http://user:password@example.com:80/path?query=string otel
```

Output should contain these attributes:
```
Attributes:
     -> log.file.name: Str(demo.log)
     -> url.username: Str(user)
     -> url.domain: Str(example.com)
     -> url.port: Int(80)
     -> url.path: Str(/path)
     -> url.query: Str(query=string)
     -> url.scheme: Str(http)
```
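
For readers unfamiliar with the statement in the config, the Python sketch below mimics what I understand the `"insert"` strategy of `merge_maps` to do: keys from the extracted map are added to `attributes` only where no such key exists yet. `merge_maps_insert` is a hypothetical name used for illustration, and the extracted map is typed in by hand rather than produced by a grok library.

```python
def merge_maps_insert(target, source):
    """Copy keys from source into target only where target lacks the key."""
    for key, value in source.items():
        if key not in target:
            target[key] = value
    return target


# Attributes already set by the filelog receiver before the transform runs.
attributes = {"log.file.name": "demo.log"}

# What the grok extraction is expected to return for the sample line.
extracted = {
    "url.scheme": "http",
    "url.username": "user",
    "url.domain": "example.com",
    "url.port": 80,
    "url.path": "/path",
    "url.query": "query=string",
}

merge_maps_insert(attributes, extracted)
for key, value in attributes.items():
    print(f"-> {key}: {value!r}")
```

With this strategy an attribute that already exists on the log record is left untouched, so the extraction cannot overwrite values set earlier in the pipeline.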

For default set of patterns check:
http://user:password@example.com:80/path?query=string
This implementation uses a complete set defined in this directory:
https://github.com/elastic/go-grok/tree/main/patterns


`%{ELB_URI}` comes from [AWS
set](https://github.com/elastic/go-grok/blob/main/patterns/aws.go) and
is equivalent to

`((?P<url.scheme>[A-Za-z][A-Za-z0-9+\.-]+)://(?:(?P<url.username>([a-zA-Z0-9._-]+))(?::[^@]*)?@)?(?:((?P<url.domain>(?:((?:(((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?)|((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))))|(\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b))))(?::(?P<url.port>\b[1-9][0-9]*\b))?))?(?:((?P<url.path>(/[A-Za-z0-9$.+!*'(){},~:;=@#%&_\-]+)+)(?:\?(?P<url.query>[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]<>]*))?))?)`
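
The full expansion is hard to read, so here is a much smaller hand-written Python regex that captures the same six fields from the sample line. It is a sketch, not the `ELB_URI` definition: the host part only handles plain domain names, the group names use underscores because Python does not accept dots in group names, `parse_uri` is a hypothetical helper, and the password in the sample URL is a placeholder.

```python
import re

# Simplified URI matcher; far less strict than the expanded ELB_URI pattern.
URI = re.compile(
    r"(?P<url_scheme>[A-Za-z][A-Za-z0-9+.-]+)://"
    r"(?:(?P<url_username>[A-Za-z0-9._-]+)(?::[^@/\s]*)?@)?"
    r"(?P<url_domain>[0-9A-Za-z][0-9A-Za-z.-]*)"
    r"(?::(?P<url_port>[1-9][0-9]*))?"
    r"(?P<url_path>(?:/[A-Za-z0-9$.+!*'(){},~:;=@#%&_-]+)+)?"
    r"(?:\?(?P<url_query>[^\s]*))?"
)


def parse_uri(line):
    """Return the captured URL parts keyed by dotted attribute names."""
    m = URI.search(line)
    if m is None:
        return {}
    # Turn url_scheme into url.scheme and drop groups that did not match.
    out = {
        name.replace("_", ".", 1): value
        for name, value in m.groupdict().items()
        if value is not None
    }
    # The expected output lists the port as an integer, so convert it here.
    if "url.port" in out:
        out["url.port"] = int(out["url.port"])
    return out


parts = parse_uri("http://user:password@example.com:80/path?query=string otel")
for key, value in parts.items():
    print(f"-> {key}: {value!r}")
```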

**Documentation:** 
updated ottl/readme

---------

Co-authored-by: Tyler Helmuth
Co-authored-by: Evan Bradley
@michalpristas
Contributor Author

closed via #34037

f7o pushed a commit to f7o/opentelemetry-collector-contrib that referenced this issue Sep 12, 2024