Operator: uri parser #244

Merged
merged 3 commits into from
Feb 2, 2021

Conversation

jsirianni (Member) commented Jan 31, 2021

Description of Changes

Added an operator for parsing absolute and relative URIs, and URI query strings.
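For the absolute-URI case, which the examples in this description do not show, here is a minimal sketch using Go's standard `net/url` package. The `uriParts` type, its field names, and `parseAbsoluteURI` are hypothetical names chosen for illustration, and the sample URL is made up. This is not taken from the PR's diff, and the operator's actual output keys may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// uriParts is a hypothetical container; the field names are assumptions,
// not the keys the uri_parser operator necessarily emits.
type uriParts struct {
	Scheme, Host, Port, Path string
	Query                    map[string][]string
}

// parseAbsoluteURI is a hypothetical helper that splits an absolute URI
// into its components with the standard library.
func parseAbsoluteURI(raw string) (uriParts, error) {
	// ParseRequestURI accepts absolute URIs and absolute paths only.
	u, err := url.ParseRequestURI(raw)
	if err != nil {
		return uriParts{}, err
	}
	// IsAbs reports whether the URL has a non-empty scheme.
	if !u.IsAbs() {
		return uriParts{}, fmt.Errorf("not an absolute URI: %q", raw)
	}
	return uriParts{
		Scheme: u.Scheme,
		Host:   u.Hostname(),
		Port:   u.Port(),
		Path:   u.Path,
		Query:  u.Query(), // repeated keys are collected into one slice
	}, nil
}

func main() {
	// The URL below is a made-up sample, not a log line from this PR.
	parts, err := parseAbsoluteURI("https://example.com:8443/app?env=prod&env=dev")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", parts)
}
```

Keeping the query as a map of string slices preserves repeated keys, which matches the list-valued `query` fields in the examples that follow.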

Examples

We have several immediate use-cases. I can provide sample logs and example plugin / pipeline configuration.

NGINX

The NGINX path field can be split into separate path and query fields. Before parsing:

{
  "timestamp": "2021-01-29T20:37:36-05:00",
  "severity": 50,
  "severity_text": "404",
  "labels": {
    "file_name": "nginx_access.log",
    "log_type": "nginx.access",
    "plugin_id": "nginx"
  },
  "record": {
    "body_bytes_sent": "178",
    "bytes_sent": "342",
    "http_referer": "-",
    "http_user_agent": "curl/7.58.0",
    "http_x_forwarded_for": "-",
    "method": "GET",
    "path": "/about-us?app=prod&user=james&app=stage",
    "protocol": "HTTP",
    "protocol_version": "1.1",
    "proxy_add_x_forwarded_for": "10.33.104.40",
    "remote_addr": "10.33.104.40",
    "remote_user": "-",
    "request": "GET /about-us?app=prod&user=james&app=stage HTTP/1.1",
    "request_length": "114",
    "request_time": "0.000",
    "status": "404",
    "time_iso8601": "2021-01-29T20:37:36-05:00",
    "upstream_addr": "-",
    "upstream_connect_time": "-",
    "upstream_header_time": "-",
    "upstream_response_length": "-",
    "upstream_response_time": "-",
    "upstream_status": "-"
  }
}

After parsing with uri_parser:

- id: relative_path_parser
  type: uri_parser
  parse_from: path
  output: {{ .output }}

{
  "timestamp": "2021-01-29T20:37:36-05:00",
  "severity": 50,
  "severity_text": "404",
  "labels": {
    "file_name": "nginx_access.log",
    "log_type": "nginx.access",
    "plugin_id": "nginx"
  },
  "record": {
    "body_bytes_sent": "178",
    "bytes_sent": "342",
    "http_referer": "-",
    "http_user_agent": "curl/7.58.0",
    "http_x_forwarded_for": "-",
    "method": "GET",
    "path": "/about-us",
    "protocol": "HTTP",
    "protocol_version": "1.1",
    "proxy_add_x_forwarded_for": "10.33.104.40",
    "query": {
      "app": [
        "prod",
        "stage"
      ],
      "user": [
        "james"
      ]
    },
    "remote_addr": "10.33.104.40",
    "remote_user": "-",
    "request": "GET /about-us?app=prod&user=james&app=stage HTTP/1.1",
    "request_length": "114",
    "request_time": "0.000",
    "status": "404",
    "time_iso8601": "2021-01-29T20:37:36-05:00",
    "upstream_addr": "-",
    "upstream_connect_time": "-",
    "upstream_header_time": "-",
    "upstream_response_length": "-",
    "upstream_response_time": "-",
    "upstream_status": "-"
  }
}
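The path/query split shown above can be reproduced with Go's standard library. This is a sketch only: `splitPath` is a hypothetical helper name, and the code illustrates the behavior of `net/url` rather than the operator's implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// splitPath is a hypothetical helper that separates a request path from
// its query parameters, as in the NGINX example.
func splitPath(raw string) (string, map[string][]string, error) {
	// ParseRequestURI handles path-only values that begin with "/".
	u, err := url.ParseRequestURI(raw)
	if err != nil {
		return "", nil, err
	}
	// Query groups repeated keys, keeping values in order of appearance.
	return u.Path, u.Query(), nil
}

func main() {
	path, query, err := splitPath("/about-us?app=prod&user=james&app=stage")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(path)          // /about-us
	fmt.Println(query["app"])  // [prod stage]
	fmt.Println(query["user"]) // [james]
}
```

Note that `ParseRequestURI` rejects values that lack a leading slash or a scheme, so a bare `about-us` would return an error in this sketch.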

Apache HTTP

Apache HTTP and Apache Tomcat both emit the query field as a raw string.

Parsing that string into key-value pairs makes the individual parameters searchable. For example, the Apache plugin produces this entry:

{
  "timestamp": "2021-01-29T18:04:01.790326-05:00",
  "severity": 50,
  "severity_text": "404",
  "labels": {
    "file_name": "apache_access.log",
    "log_type": "apache_http.access",
    "plugin_id": "apache_http"
  },
  "record": {
    "body_bytes_sent": "274",
    "http_referer": "-",
    "http_user_agent": "curl/7.58.0",
    "http_x_forwarded_for": "10.33.104.93",
    "method": "GET",
    "path": "/token/create",
    "protocol": "HTTP",
    "protocol_version": "1.0",
    "query": "?token=10495818&user=load&tier=prod&tier=api",
    "remote_addr": "127.0.0.1",
    "remote_user": "-",
    "request_time_microseconds": "132",
    "status": "404"
  }
}

The query field can be parsed further using the uri_parser operator:

- id: query_string_parser
  type: uri_parser
  parse_from: query
  output: access_protocol_parser

{
  "timestamp": "2021-01-29T18:04:01.790326-05:00",
  "severity": 50,
  "severity_text": "404",
  "labels": {
    "file_name": "apache_access.log",
    "log_type": "apache_http.access",
    "plugin_id": "apache_http"
  },
  "record": {
    "body_bytes_sent": "274",
    "http_referer": "-",
    "http_user_agent": "curl/7.58.0",
    "http_x_forwarded_for": "10.33.104.93",
    "method": "GET",
    "path": "/token/create",
    "protocol": "HTTP",
    "protocol_version": "1.0",
    "query": {
      "tier": [
        "prod",
        "api"
      ],
      "token": [
        "10495818"
      ],
      "user": [
        "load"
      ]
    },
    "remote_addr": "127.0.0.1",
    "remote_user": "-",
    "request_time_microseconds": "132",
    "status": "404"
  }
}
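A raw query string with a leading `?`, as in the Apache entry above, can be handled by trimming the prefix and calling `url.ParseQuery`. The helper name `parseRawQuery` is hypothetical, and this sketch makes no claim about how the operator itself is written.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseRawQuery is a hypothetical helper that turns a raw query string,
// with or without a leading "?", into a map of key to value list.
func parseRawQuery(raw string) (map[string][]string, error) {
	// ParseQuery expects the string without the "?" separator.
	values, err := url.ParseQuery(strings.TrimPrefix(raw, "?"))
	if err != nil {
		return nil, err
	}
	return values, nil
}

func main() {
	query, err := parseRawQuery("?token=10495818&user=load&tier=prod&tier=api")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(query["tier"])  // [prod api]
	fmt.Println(query["token"]) // [10495818]
	fmt.Println(query["user"])  // [load]
}
```

`url.ParseQuery` also percent-decodes keys and values, so downstream searches see the decoded text rather than sequences such as `%20`.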

Please check that the PR fulfills these requirements

  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)
  • Add a changelog entry (for non-trivial bug fixes / features)
  • CI passes

jsirianni requested a review from djaglowski January 31, 2021 20:21

codecov bot commented Jan 31, 2021

Codecov Report

Merging #244 (3b5ee46) into master (f03c848) will increase coverage by 0.16%.
The diff coverage is 92.16%.

@@            Coverage Diff             @@
##           master     #244      +/-   ##
==========================================
+ Coverage   71.21%   71.36%   +0.16%     
==========================================
  Files         101      102       +1     
  Lines        5547     5598      +51     
==========================================
+ Hits         3950     3995      +45     
- Misses       1171     1176       +5     
- Partials      426      427       +1     

| Impacted Files | Coverage Δ |
| --- | --- |
| operator/builtin/parser/uri/uri.go | 92.16% <92.16%> (ø) |
| operator/builtin/output/otlp/otlp.go | 61.73% <0.00%> (-3.70%) ⬇️ |
| operator/builtin/output/newrelic/newrelic.go | 71.96% <0.00%> (+0.93%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f03c848...3b5ee46.

djaglowski (Member) left a comment

This looks great. Just one suggestion on docs.

docs/operators/uri_parser.md (outdated, resolved)

djaglowski (Member) commented

| Log Files | Logs / Second | CPU Avg (%) | CPU Avg Δ (%) | Memory Avg (MB) | Memory Avg Δ (MB) |
| --- | --- | --- | --- | --- | --- |
| 1 | 1000 | 1.6896914 | -0.24137902 | 124.70515 | +0.03300476 |
| 1 | 5000 | 5.2759824 | -1.2758865 | 130.9379 | -0.37931824 |
| 1 | 10000 | 12.3625965 | +0.27615452 | 139.70676 | -0.004180908 |
| 1 | 50000 | 55.535557 | -5.672928 | 176.41164 | +1.2642822 |
| 1 | 100000 | 113.04365 | +0.94047546 | 229.9573 | +4.2640076 |
| 10 | 100 | 2.3449273 | -0.1552422 | 128.15396 | +0.8492737 |
| 10 | 500 | 7.086367 | +0.17239094 | 136.9181 | +3.051056 |
| 10 | 1000 | 13.414242 | +0.20709705 | 138.46133 | -3.167038 |
| 10 | 5000 | 60.210808 | -0.15136337 | 177.64183 | -4.184677 |
| 10 | 10000 | 122.69321 | +1.2335892 | 217.28691 | +1.2380219 |
