Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support JSON Path-like json parsing #1363

Closed
sparrc opened this issue Jun 10, 2016 · 39 comments · Fixed by #9246
Closed

Support JSON Path-like json parsing #1363

sparrc opened this issue Jun 10, 2016 · 39 comments · Fixed by #9246
Assignees
Labels
area/json json and json_v2 parser/serialiser related

Comments

@sparrc
Copy link
Contributor

sparrc commented Jun 10, 2016

Feature Request

Proposal:

Support something similar to JSONPath for customizing which fields/paths of json get picked up as tags/fields/types/etc.

Doesn't need to be JSONPath specifically, since TBH the official json path spec is fairly ugly. Something more simple like https://github.com/jmoiron/jsonq would work as well.

Current behavior:

flattens json into numerical fields only, grabs tags only from the top level.

Desired behavior:

more customization to deal better with diverse range of different json formats.

Use case: [Why is this important (helps with prioritizing requests)]

Telegraf frequently gets issues opened for users who want specific changes & modifications to the json parser for their particular use-case. These cannot all be satisfied without a truly generic solution.

@sparrc
Copy link
Contributor Author

sparrc commented Sep 5, 2016

this should also support parsing timestamps in the same way

@fromanator
Copy link

@sparrc response to your question in #1554 :
Q: "can you talk more about the drawbacks of using line-protocol? This protocol is ideal for working with influxdb and is able to be encoded & decoded much faster than JSON."

A: This is great for use cases where there's not a telegraf instance loading the data to a kafka topic for ingestion, hence it's not trivial to load the data with the influx line protocol. For my specific use case we are getting a stream of syslog lines from and external server. The syslog line is grokked into JSON and loaded into a Kafka topic. Then we have Telegraf read that kafka topic to load the metrics into a metric storage backend (influx, graphite, etc).

@kevinjos
Copy link

Hello, I opened #2066 and am taking a look at the more generic solution proposed here. What are you thinking in terms of configuration @sparrc ? Currently, I see that httpjson conf can..

  • set a global servers=$VALUE tag
  • specify a tag_keys array
  • parse top level arrays of objects to unique points in line protocol

What should a user of the generic JSON parser be able to configure in addition to the above? Should the new JSONParser be backwards compatible with the existing parser?

@sparrc
Copy link
Contributor Author

sparrc commented Nov 26, 2016

@kevinjos this issue is specific to JSON parsing, not the httpjson plugin. For that reason, there will not be a change to the first bullet-point.

Ideally it will be backwards-compatible, tag_keys will continue to be supported, and if the JSON parsing rules fail then it will fall back to the current behavior, which is to parse the top-level arrays.

I haven't thought too much about what this will look like as I haven't had time to work on it, but the idea would be to create a generic way for users to specify paths to nested objects and arrays, and have custom patterns for parsing different json object types.

@kevinjos
Copy link

I opened a PR that introduces jsonq to the json parser to support parsing nested objects and arrays of objects into unique metrics with default, top-level and object-specific tags:

#2101

Feedback is much appreciated.

@olsky
Copy link

olsky commented Jan 27, 2017

I'd like to add one more use-case here: Mqtt-consumer with Json payload, where Timestamp overwrite would be very useful.

@sparrc
Copy link
Contributor Author

sparrc commented Jan 27, 2017

@olsky what do you mean by "timestamp overwrite"?

@olsky
Copy link

olsky commented Jan 30, 2017

by "timestamp overwrite" I mean: use a custom timestamp (also in the past), not the instance when the message is received.

@asatriani
Copy link

Hi all,
I think this feature is very important because current input telegraf plugin behavior with json format is too poor. In addition to a configuration field to set the metric/event timestamp, another field to ignore numeric data in json message would be very useful

@danielnelson
Copy link
Contributor

Anyone have opinions on JSONPath vs jsonq vs json-ptr? I think the parser could work something like what we sketched up for a XML parser on #1758.

@freeseacher
Copy link
Contributor

any news ?

@fabianbaier
Copy link

has there been progress? This feature looks particular interesting and I agree with desired behavior mentioned by @dgarros in #1332

@danielnelson
Copy link
Contributor

I merged a bunch of new features into the existing JSON parser, I think it may meet the needs for many watching this issue. Check the JSON data format for details.

There is still a place for a new parser, so I will keep this issue open.

@conet
Copy link

conet commented Apr 27, 2020

It's a shame there is still no way to parse something like this:

[
    {
        "timestamp": "2020-01-01T00:00:00.000Z", 
        "events": [
            {
                "metric_type": "metric1",
                "value": 20
            },
            {
                "metric_type": "metric2",
                "value": 10,
            },
            {
                "metric_type": "metric3",
                "value": 5
            }
        ]
    }
]

jq syntax or transformation would be wonderfull.

@piyushsharma
Copy link

After closing issue #4260, any tags specified as tag_keys will be extracted from json nested fields. Current functionality will set any int or float value as a field, unless otherwise specified as a tag in tag_keys. To add more customization of metrics, I propose adding field_keys to the JSON parser config. If users specify field_keys on config, only values named in field_keys will be reported as fields in the returned metric. In this manner, there would be full customization of which json values are reported as tags or fields. If field_keys is not specified, the parser will have current functionality.

Would this suit the above use cases?

From a nested json structure, if I want to parse only specific numeric fields, this solution of "field_keys" would have helped a lot. The current implementation only works if all the fields I want are in one specific chunk of the json (by leveraging json_query). Would be great to have a support a list of json_query filters OR have a way to specify which fields should be measured.

@danielnelson
Copy link
Contributor

You may be able to do with GJSON's multipaths syntax. If you show an example document I'd have a better idea if this will work for you.

@piyushsharma
Copy link

piyushsharma commented Jun 17, 2020

@danielnelson : tried using the multipath syntax but cannot get it to work.

{
  "message": "abc",
  "fields": {
    "status": 200,
    "key": 1,
    "json": [
      {
        "duration": 100,
        "code": 1,
        "label": 2,
        "line": 3
      }
    ],
    "duration": 2
  }
}

Say I have a nested JSON like described above with many more numerical fields, and I am only interested in the following: "fields.duration", "fields.status", "fields.json#duration".

@danielnelson
Copy link
Contributor

When you say many more numerical fields, do you mean the like:

{
  "message": "abc",
  "fields": {
    "status": 200,
    "key": 1,
    "json": [
      {
        "duration": 100,
        "code": 1,
        "label": 2,
        "line": 3,
        "many": 4,
        "more": 5,
        "numerical": 6,
        "fields": 7
      }
    ],
    "duration": 2
  }
}

Or is it more like this?

{
  "message": "abc",
  "fields": {
    "status": 200,
    "key": 1,
    "json": [
      {
        "duration": 100,
        "code": 1,
        "label": 2,
        "line": 3,
      },
      {
        "many": 4,
        "more": 5,
        "numerical": 6,
        "fields": 7
      }
    ],
    "duration": 2
  }
}

@piyushsharma
Copy link

piyushsharma commented Jun 17, 2020

Like:

{
  "message": "abc",
  "fields": {
    "status": 200,
    "key": 1,
    "json": [
      {
        "duration": 100,
        "code": 1,
        "label": 2,
        "line": 3,
        "many": 4,
        "more": 5,
        "numerical": 6,
        "fields": 7,
        "nest": {
                "label": 2,
                "line": 3,
                "many": 4,
                "more": 5,
                "numerical": 6,
                "fields": 7
         }
      }
    ],
    "duration": 2
  }
}

For what I need, we can assume "fields.json" is an array of one item. I only care about some properties of "fields" and some properties of "fields.json", I don't need anything from the nested structure "fields.json.nest".

@danielnelson
Copy link
Contributor

How about something like this? It's a pretty basic multiselect but we need to provide a custom name for duration to avoid the naming conflict.

json_query = '{fields.duration,fields.status,"json_0_duration":fields.json.0.duration}'

@piyushsharma
Copy link

Sample Data:

{
  "@message": "Finished Request",
  "@timestamp": "2020-06-09T12:49:18.747Z",
  "@fields": {
    "event": "request",
    "method": "GET",
    "path": "xyz",
    "responseStatus": 200,
    "duration": 10.3845,
    "requestId": "ddd",
    "referer": "abc",
    "level": "info",
    "label": 100
  }
}

Telegraf Config:

tag_keys = ["@message"]
json_query = '{@timestamp,@message,@fields.duration,@fields.responseStatus}'
json_time_key = "@timestamp"
json_time_format = "2006-01-02T15:04:05Z07:00"

I get the following error:

panic: runtime error: index out of range [0] with length 0

goroutine 28 [running]:
github.com/tidwall/gjson.Get(0xc0005000f0, 0xe6, 0xc000052ec1, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/private/tmp/telegraf-20200429-52913-1ietmxf/telegraf-1.14.2/pkg/mod/github.com/tidwall/[email protected]/gjson.go:1722 +0xe24
github.com/tidwall/gjson.Get(0xc0005000f0, 0xe6, 0xc000052ec0, 0x3d, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/private/tmp/telegraf-20200429-52913-1ietmxf/telegraf-1.14.2/pkg/mod/github.com/tidwall/[email protected]/gjson.go:1734 +0x4ec
github.com/tidwall/gjson.getBytes(0xc0005000f0, 0xe6, 0xf0, 0xc000052ec0, 0x3d, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/private/tmp/telegraf-20200429-52913-1ietmxf/telegraf-1.14.2/pkg/mod/github.com/tidwall/[email protected]/gjson_ngae.go:18 +0x10d
github.com/tidwall/gjson.GetBytes(...)
	/private/tmp/telegraf-20200429-52913-1ietmxf/telegraf-1.14.2/pkg/mod/github.com/tidwall/[email protected]/gjson.go:1814
github.com/influxdata/telegraf/plugins/parsers/json.(*Parser).Parse(0xc000244320, 0xc0005000f0, 0xe6, 0xf0, 0xe6, 0xf0, 0xc0001f81e0, 0xe6, 0xc00048c080)
	/private/tmp/telegraf-20200429-52913-1ietmxf/telegraf-1.14.2/src/github.com/influxdata/telegraf/plugins/parsers/json/parser.go:189 +0x41b
github.com/influxdata/telegraf/plugins/inputs/tail.parseLine(0x68a8a80, 0xc000244320, 0xc0001f81e0, 0xe6, 0xc0001f8101, 0xe6, 0x0, 0x0, 0x0, 0x0)
	/private/tmp/telegraf-20200429-52913-1ietmxf/telegraf-1.14.2/src/github.com/influxdata/telegraf/plugins/inputs/tail/tail.go:206 +0x280
github.com/influxdata/telegraf/plugins/inputs/tail.(*Tail).receiver(0xc000120200, 0x68a8a80, 0xc000244320, 0xc0002e2bb0)
	/private/tmp/telegraf-20200429-52913-1ietmxf/telegraf-1.14.2/src/github.com/influxdata/telegraf/plugins/inputs/tail/tail.go:222 +0x25d
github.com/influxdata/telegraf/plugins/inputs/tail.(*Tail).tailNewFiles.func1(0xc000120200, 0x68a8a80, 0xc000244320, 0xc0002e2bb0)
	/private/tmp/telegraf-20200429-52913-1ietmxf/telegraf-1.14.2/src/github.com/influxdata/telegraf/plugins/inputs/tail/tail.go:177 +0x76
created by github.com/influxdata/telegraf/plugins/inputs/tail.(*Tail).tailNewFiles
	/private/tmp/telegraf-20200429-52913-1ietmxf/telegraf-1.14.2/src/github.com/influxdata/telegraf/plugins/inputs/tail/tail.go:175 +0x530

@danielnelson
Copy link
Contributor

Can you open a new bug issue for this and we can address it in 1.14.5. Probably related to the @ symbol since this somewhat special in GJSON.

@piyushsharma
Copy link

Thanks, will do. It does not work even if I remove the @ prefix from the properties. Tried many combinations -- seeing a lot of JSON time key could not be found at times even though my json_query has the property with timestamp in it.

@piyushsharma
Copy link

filed #7699

@sinistram
Copy link

+1 to the issue. It will be great to have an option parse tags like this:

{
    "tags": {
        "tag1" : "value1",
        "tag2" :  "value2"
     },
    "field1" : 2
}


tag_keys = [ "tags_*"]

@sjwang90
Copy link
Contributor

The new JSON_v2 parser (json_v2) has just been released in Telegraf 1.19 rc-1. We built it to address the use cases discussed in this thread so please download and test it out with your JSON files and open any bugs or feature gaps you may encounter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/json json and json_v2 parser/serialiser related
Projects
None yet
Development

Successfully merging a pull request may close this issue.