
Starlark sometimes lets the agent crash #13148

Closed
knollet opened this issue Apr 26, 2023 · 5 comments · Fixed by #13156
Labels
bug unexpected problem or unintended behavior

Comments

@knollet
Contributor

knollet commented Apr 26, 2023

Relevant telegraf.conf

[[inputs.internal]]
  name_override = "atleastoneinput"

[[processors.starlark]]
  # namepass = no, we don't namepass, we want a runtime error
source = '''
def apply(metric):
  return metric.fields["submetric_that_doesnt_exist"]
'''

[[outputs.file]]
  files = ["stdout"]   ## testing ... testing ... one two three
  data_format = "influx"

Logs from Telegraf

me@host:/etc/telegraf/telegraf.d# telegraf --test
2023-04-26T16:14:47Z I! Loading config file: /etc/telegraf/telegraf.conf
[...]
2023-04-26T16:14:47Z I! Loading config file: /etc/telegraf/telegraf.d/sl-test.conf
panic: negative refcount

goroutine 65 [running]:
github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc000d2c240?)
        /go/src/github.com/influxdata/telegraf/metric/tracking.go:145 +0x90
github.com/influxdata/telegraf/metric.(*trackingMetric).Drop(0xc000d2c2a0?)
        /go/src/github.com/influxdata/telegraf/metric/tracking.go:139 +0x19
github.com/influxdata/telegraf/agent.(*Agent).runProcessors.func1(0xc000c0f5d8)
        /go/src/github.com/influxdata/telegraf/agent/agent.go:652 +0x16a
created by github.com/influxdata/telegraf/agent.(*Agent).runProcessors
        /go/src/github.com/influxdata/telegraf/agent/agent.go:645 +0x3c

me@host:/etc/telegraf/telegraf.d# less /var/log/telegraf/telegraf.log
2023-04-26T16:14:47Z E! [processors.starlark] Traceback (most recent call last):
2023-04-26T16:14:47Z E! [processors.starlark]   :2:23: in apply
2023-04-26T16:14:47Z E! [processors.starlark] Error: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Error in plugin: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Traceback (most recent call last):
2023-04-26T16:14:47Z E! [processors.starlark]   :2:23: in apply
2023-04-26T16:14:47Z E! [processors.starlark] Error: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Error in plugin: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Traceback (most recent call last):
2023-04-26T16:14:47Z E! [processors.starlark]   :2:23: in apply
2023-04-26T16:14:47Z E! [processors.starlark] Error: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Error in plugin: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Traceback (most recent call last):
2023-04-26T16:14:47Z E! [processors.starlark]   :2:23: in apply
2023-04-26T16:14:47Z E! [processors.starlark] Error: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Error in plugin: key "submetric_that_doesnt_exist" not in Fields
me@host:/etc/telegraf/telegraf.d#

System info

Telegraf 1.26.1, RHEL 8.7 (Ootpa)

Docker

No response

Steps to reproduce

  1. Create the config
  2. Run with --test
  3. Telegraf logs some runtime errors from the Starlark code, but then the memory management seems to go haywire. Also seen (albeit less often) when running as a service.

Expected behavior

Starlark stack traces for the Starlark errors in the log, but no Go-related memory-management errors or panics.

Actual behavior

Sometimes Telegraf panics.

Additional info

No response

@knollet knollet added the bug unexpected problem or unintended behavior label Apr 26, 2023
@powersj
Contributor

powersj commented Apr 26, 2023

Looking at the panic, Telegraf at the time is trying to call decr() on a tracking metric. Tracking metrics are a special kind of metric: once they are submitted to an output, Telegraf goes back to the input and acknowledges that the metric was processed. Only a tracking metric's Accept(), Drop(), and Reject() methods call the above decr().

The specific call that causes the panic is in the agent's runProcessors, during error handling after Add. Before that, the processor's Add() (Starlark's in this case) is called. The Starlark processor uses tracking metrics as well and will call Drop() on an error.

However, these are not the same metrics so it isn't like we are double decrementing.

I am not able to reproduce with the config you provided. I am wondering if there is something else required. Do note that if you were seeing this when running with --test, outputs are ignored and nothing is written to a file. The only output is to stdout.

Your logs also showed loading two or more files? Can you narrow it down to something we can reproduce?

Thanks!

@powersj powersj added the waiting for response waiting for response from contributor label Apr 26, 2023
@knollet
Contributor Author

knollet commented Apr 27, 2023

Hm, yes, there was more being loaded. I'll have to look into it. Thanks for the information.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Apr 27, 2023
@knollet
Contributor Author

knollet commented Apr 27, 2023

ok, wow, this is a little bit confusing...


root@host:/etc/telegraf/telegraf.d# telegraf --test --config sl-test.conf
2023-04-27T11:15:25Z I! Loading config file: sl-test.conf
panic: negative refcount

goroutine 28 [running]:
github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc000bcac90?)
        /go/src/github.com/influxdata/telegraf/metric/tracking.go:145 +0x90
github.com/influxdata/telegraf/metric.(*trackingMetric).Drop(0xc000bcacf0?)
        /go/src/github.com/influxdata/telegraf/metric/tracking.go:139 +0x19
github.com/influxdata/telegraf/agent.(*Agent).runProcessors.func1(0xc000969230)
        /go/src/github.com/influxdata/telegraf/agent/agent.go:652 +0x16a
created by github.com/influxdata/telegraf/agent.(*Agent).runProcessors
        /go/src/github.com/influxdata/telegraf/agent/agent.go:645 +0x3c
root@host:/etc/telegraf/telegraf.d#
$ cat sl-test.conf
### If I remove the agent-section with the logfile in it, it stops crashing
[agent]
  logfile = "/tmp/doesntexist"   ## I have write permission

### If I remove this input, it stops crashing
[[inputs.nstat]]

### If I remove this input, it stops crashing
[[inputs.sysstat]]
  ## If I don't set these parameters, it stops crashing
  sadc_path = "/usr/lib64/sa/sadc"
  sadf_path = "/bin/sadf"


#################################################################################
[[inputs.tail]]
  ### same as in [agent]-section,
  ### I have write permissions to /tmp
  files = [ "/tmp/doesntexist" ]
  data_format = "grok"

### If I remove the NOTSPACE-part of the grok pattern, **it stops crashing**
  grok_patterns = ['''
     ^%{TIMESTAMP_ISO8601:timestamp:ts-"2006-01-02T15:04:05Z"}\s+%{NOTSPACE:level}\s+''']

#################################################################################

[[processors.starlark]]
  ### minimal starlark that is intended to throw an error
source = '''
def apply(metric):
  return metric.fields["submetric_that_doesnt_exist"]
'''

powersj added a commit to powersj/telegraf that referenced this issue Apr 27, 2023
If an error occurs in the starlark processor, tracking metrics are
rejected, decrementing their reference count. However, immediately
afterwards, when the agent runs processors, an error from the processor
causes the agent to also attempt to drop the metric. This double
reject + drop results in a negative reference count, panicking Telegraf.

This did not affect non-tracking metrics. Hitting it requires an input
that produces tracking metrics as well as an error in the starlark
script.

fixes: influxdata#13148
@powersj
Contributor

powersj commented Apr 27, 2023

I have put up #13156, which removes the drop from the starlark processor on error and lets the agent handle the drop. Can you give that a try and ensure it works as expected?

Hm, yes, there was more, that was loaded. I'll have to look into it. Thanks for the information.

Thank you! I can confirm that the new config produces the panic and that we are in fact decrementing a tail tracking metric twice:

starlark.go:61 Reject()
reject tail [0xc000d78800 0xc000d787a0] [0xc000d78780]
v=0
agent.go:653 Drop()
drop tail [0xc000d78800 0xc000d787a0] [0xc000d78780]
v=-1
panic: negative refcount

goroutine 68 [running]:
github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc000a142a0)
	/home/powersj/telegraf/metric/tracking.go:150 +0xf0
github.com/influxdata/telegraf/metric.(*trackingMetric).Drop(0xc000a142a0)
	/home/powersj/telegraf/metric/tracking.go:143 +0x169
github.com/influxdata/telegraf/agent.(*Agent).runProcessors.func1(0xc000f96e70)
	/home/powersj/telegraf/agent/agent.go:653 +0x1b6
created by github.com/influxdata/telegraf/agent.(*Agent).runProcessors
	/home/powersj/telegraf/agent/agent.go:645 +0x3c

The first decrement occurs in the starlark processor after we call apply on the metric. If there is an error, Telegraf rejects the metric.

This returns an error to the caller, which happens to be the agent's `runProcessors()`. If an error occurred while running the processor, Telegraf drops the metric. This is the second decrement.

I am a little surprised we have not gotten an error report about this before. My only guess is that to hit this, your starlark processor has to return an error while handling a tracking metric, which may not occur all that often once scripts add some additional checks.

Thanks again!

@knollet
Contributor Author

knollet commented May 2, 2023

I tried the artifact, but it produced a problem generating uint fields and wouldn't recognize influx_uint_support = false. With the artifact from
#13164
everything looked good so far.
