splunkmetric removal of "event" field seems to break Splunk heavy forwarder #8761

lneva-fastly · 2021-01-27T20:52:06Z

Relevant telegraf.conf:

[[outputs.http]]
   url = "<URL>"
   # insecure_skip_verify = false
   data_format = "splunkmetric"
   splunkmetric_hec_routing = true
   [outputs.http.headers]
      Content-Type = "application/json"
      Authorization = "Splunk <HEC TOKEN>"

[[inputs.cpu]]
  percpu = false
  totalcpu = true
  collect_cpu_time = false
  report_active = false

System info:

Telegraf 1.17.0
Splunk 8.0.2
Heavy forwarder
Indexer cluster

Steps to reproduce:

Use the splunkmetric data format + http output.

Expected behavior:

Splunk should receive the metrics via HEC with no problems. Worked fine on 1.15.3.

Actual behavior:

Splunk-forwarder hates the events produced. Metrics show up in Splunk, but soon the forwarder gives errors like this:

Jan 27 20:40:14 01-27-2021 20:40:13.925 +0000 WARN  TcpOutputProc - Read operation timed out expecting ACK from 10.0.1.26:29997 in 300 seconds.
Jan 27 20:40:14 01-27-2021 20:40:13.925 +0000 WARN  TcpOutputProc - Possible duplication of events with channel=source::http:telegraf|host::redacted|httpevent|, streamId=1618, offset=0 on host=10.0.1.26:29997

Soon after that, TcpOutputProc locks up and is unable to send anything to the indexers at all. Worse yet, because we have Splunk's persistentQueueSize option set on our HEC input, the problematic events stick around through a restart of the forwarder, even if new problematic events are not arriving. We had to wipe the forwarder out entirely and rebuild it to recover.

Additional info:

We carefully pared down variables until we arrived at the problem: the removal of the "event": "metric" field in #8039. Starting with a fresh, working forwarder, we can cause the above problems by sending events without the "event" field to the forwarder over HEC using curl. Sending the exact same events with "event": "metric" does not cause this problem.
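To make the comparison above concrete, here is a minimal sketch that builds the two payload variants and prepares (without sending) a HEC request for each. The host name, field values, and helper names are hypothetical. The payload shape is assumed from the example metric quoted later in this thread, and the headers mirror the telegraf.conf above. The URL and token are placeholders you would supply yourself.

```python
import json
import urllib.request


def build_hec_payload(include_event: bool) -> dict:
    """Build one HEC metric payload, with or without the "event" field."""
    payload = {
        "time": 1529708430,
        "host": "example-host",  # hypothetical host name
        "fields": {
            "_value": 0.6,
            "cpu": "cpu-total",
            "metric_name": "cpu.usage_user",
        },
    }
    if include_event:
        # This is the field that was removed in #8039.
        payload["event"] = "metric"
    return payload


def build_request(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying the payload."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Splunk {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Print both variants so they can be diffed or piped to another tool.
    for include_event in (True, False):
        print(json.dumps(build_hec_payload(include_event), sort_keys=True))
```

Passing the prepared request to `urllib.request.urlopen` would actually send it. Only do that against a disposable forwarder, given the persistent-queue behavior described above.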

I'm honestly not at all clear on why Splunk hates these events. I also don't have a good explanation for why Splunk Support said that the "event" field is unnecessary in #8039. Perhaps there's something else in the OP's configuration that obviates the need for the "event" field?

For now, we've reverted to 1.15.3, pending a fix to telegraf. Perhaps the "event" field should be optional, defaulting to present?
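As a rough illustration of the "optional, defaulting to present" idea, the sketch below uses a hypothetical `include_event` parameter on a hypothetical `serialize_metric` function. It is not Telegraf's serializer (which is written in Go) and does not correspond to any existing Telegraf option. It only shows the intended default behavior.

```python
def serialize_metric(name, value, timestamp, tags=None, include_event=True):
    """Return a HEC-style metric payload as a dict.

    `include_event` is a hypothetical switch: it defaults to True so that
    existing users keep getting "event": "metric" unless they opt out.
    """
    fields = dict(tags or {})
    fields["metric_name"] = name
    fields["_value"] = value

    payload = {"time": timestamp, "fields": fields}
    if include_event:
        payload["event"] = "metric"
    return payload
```

With this shape, callers who relied on the #8039 behavior could pass `include_event=False`, while everyone else would get the field without changing their configuration.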


powersj · 2022-02-22T21:32:45Z

Hi,

Sorry no one has gotten back to you on this.

Does the latest version of Telegraf + Splunk still cause this issue? I think we could add this back in, but I would want reference to some documentation saying it is required in case we break other existing users.

Thanks

lneva-fastly · 2022-02-22T21:36:41Z

We haven't tested it recently because we've been pinned to the version that works (1.15.3). I understand your concern about breaking things for other users, and your desire for concrete documentation. The thing is, we don't actually have any official documentation for the advice to drop the "event": "metric" field in the first place -- just a note in a support ticket. My testing documented above seems to indicate that the advice is not always correct.

We seem to be in a pretty tough place here. Maybe adding an option to include this field is the way to go?

pjain-fastly · 2022-04-05T16:35:53Z

We tested this again with the latest Telegraf version (1.22.0) and Splunk version 8.2.3. The issue still persists. To add to what Lex mentioned above, we are in a very difficult situation here. We can no longer pin to 1.15.3 since it is vulnerable to CVE-2020-26892. Upgrading is absolutely essential at this point, but that breaks the forwarders.

fastly-ffej · 2022-05-31T15:40:44Z

@powersj, here's a link to a Splunk document showing the expected format of a Splunk Telegraf metric: https://www.splunk.com/en_us/blog/it/splunk-metrics-via-telegraf.html

This setup will result in metrics that look like:
{ "time": 1529708430, "event": "metric", "host": "patas-mbp", "fields": { "_value": 0.6, "cpu": "cpu0", "dc": "mobile", "metric_name": "cpu.usage_user", "user": "ronnocol" } }

Contrary to influxdata#8039, splunk documentation does require the event tag with metric value. This reverts that previous change. fixes: influxdata#8761
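For anyone wanting to check serializer output before pointing it at a forwarder, here is a small sketch that scans a request body for payloads lacking "event": "metric". It assumes the body is a sequence of JSON objects optionally separated by whitespace, and the function name is hypothetical.

```python
import json


def find_payloads_missing_event(body: str) -> list:
    """Return the decoded payloads in `body` that lack "event": "metric"."""
    decoder = json.JSONDecoder()
    missing = []
    pos = 0
    while pos < len(body):
        # raw_decode does not accept leading whitespace, so skip it first.
        while pos < len(body) and body[pos].isspace():
            pos += 1
        if pos >= len(body):
            break
        # Decode one JSON object and advance to the index where it ended.
        obj, pos = decoder.raw_decode(body, pos)
        if obj.get("event") != "metric":
            missing.append(obj)
    return missing
```

An empty result means every payload in the body carried the field. Any returned objects are the ones that match the problematic shape described in this issue.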

powersj · 2022-06-01T16:59:39Z

@fastly-ffej thanks for the link! I think this is worth reverting based on that. About 20-30 minutes after I post this message, PR #11237 should have artifacts attached to it by the telegraf-tiger bot that you can try. Would one of you please give those a shot and confirm the revert works?

Thanks!

fastly-ffej · 2022-06-02T17:48:06Z

@powersj, I just tested the new binary on our systems and it worked like a champ!

Thanks for the swift response!

powersj · 2022-06-02T18:09:30Z

This should go out in v1.23.0 on or around June 13. It will be available in nightlies starting tomorrow.

Thanks!

lneva-fastly added the "bug" label (unexpected problem or unintended behavior) Jan 27, 2021

powersj added the "waiting for response" label (waiting for response from contributor) Feb 22, 2022

telegraf-tiger bot removed the "waiting for response" label (waiting for response from contributor) Feb 22, 2022

powersj added a commit to powersj/telegraf that referenced this issue Jun 1, 2022

fix: re-add event to splunk serializer

87620c4

Contrary to influxdata#8039, splunk documentation does require the event tag with metric value. This reverts that previous change. fixes: influxdata#8761

powersj mentioned this issue Jun 1, 2022

fix: re-add event to splunk serializer #11237

Merged

powersj closed this as completed in #11237 Jun 2, 2022

pgeler mentioned this issue Oct 13, 2022

Splunk serializer data_format=splunkmetric does not work in 1.23+ #12010

Closed
