
Starlark sometimes lets the agent crash #13148

Closed
knollet opened this issue Apr 26, 2023 · 5 comments · Fixed by #13156
Labels
bug unexpected problem or unintended behavior

Comments

@knollet
Contributor

knollet commented Apr 26, 2023

Relevant telegraf.conf

[[inputs.internal]]
  name_override = "atleastoneinput"

[[processors.starlark]]
  # namepass = no, we don't namepass, we want a runtime error
source = '''
def apply(metric):
  return metric.fields["submetric_that_doesnt_exist"]
'''

[[outputs.file]]
  files = ["stdout"]   ## testing ... testing ... one two three
  data_format = "influx"

Logs from Telegraf

me@host:/etc/telegraf/telegraf.d# telegraf --test
2023-04-26T16:14:47Z I! Loading config file: /etc/telegraf/telegraf.conf
[...]
2023-04-26T16:14:47Z I! Loading config file: /etc/telegraf/telegraf.d/sl-test.conf
panic: negative refcount

goroutine 65 [running]:
github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc000d2c240?)
        /go/src/github.com/influxdata/telegraf/metric/tracking.go:145 +0x90
github.com/influxdata/telegraf/metric.(*trackingMetric).Drop(0xc000d2c2a0?)
        /go/src/github.com/influxdata/telegraf/metric/tracking.go:139 +0x19
github.com/influxdata/telegraf/agent.(*Agent).runProcessors.func1(0xc000c0f5d8)
        /go/src/github.com/influxdata/telegraf/agent/agent.go:652 +0x16a
created by github.com/influxdata/telegraf/agent.(*Agent).runProcessors
        /go/src/github.com/influxdata/telegraf/agent/agent.go:645 +0x3c

me@host:/etc/telegraf/telegraf.d# less /var/log/telegraf/telegraf.log
2023-04-26T16:14:47Z E! [processors.starlark] Traceback (most recent call last):
2023-04-26T16:14:47Z E! [processors.starlark]   :2:23: in apply
2023-04-26T16:14:47Z E! [processors.starlark] Error: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Error in plugin: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Traceback (most recent call last):
2023-04-26T16:14:47Z E! [processors.starlark]   :2:23: in apply
2023-04-26T16:14:47Z E! [processors.starlark] Error: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Error in plugin: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Traceback (most recent call last):
2023-04-26T16:14:47Z E! [processors.starlark]   :2:23: in apply
2023-04-26T16:14:47Z E! [processors.starlark] Error: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Error in plugin: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Traceback (most recent call last):
2023-04-26T16:14:47Z E! [processors.starlark]   :2:23: in apply
2023-04-26T16:14:47Z E! [processors.starlark] Error: key "submetric_that_doesnt_exist" not in Fields
2023-04-26T16:14:47Z E! [processors.starlark] Error in plugin: key "submetric_that_doesnt_exist" not in Fields
me@host:/etc/telegraf/telegraf.d#

System info

Telegraf 1.26.1, RHEL 8.7 (Ootpa)

Docker

No response

Steps to reproduce

  1. Create the config
  2. Run with --test
  3. Telegraf logs some runtime errors from the Starlark code, but then the memory management seems to go haywire. Also seen (albeit less often) when running as a service.

Expected behavior

Starlark stack traces for the Starlark errors in the log, but no Go-related memory-management errors or panics.

Actual behavior

Sometimes Telegraf panics.

Additional info

No response

@knollet knollet added the bug unexpected problem or unintended behavior label Apr 26, 2023
@powersj
Contributor

powersj commented Apr 26, 2023

Looking at the panic, Telegraf at the time is trying to call decr() on a tracking metric. Tracking metrics are a special kind of metric: once they are submitted to an output, Telegraf goes back to the input and acknowledges that the metric was processed. Only a tracking metric's Accept(), Drop(), and Reject() methods call the above decr().

The specific call that causes the panic is in the agent's runProcessors, during error handling after Add. Before that, the processor's Add() (Starlark's in this case) is called. The Starlark processor uses tracking metrics as well and will call Drop() on an error.

However, these are not the same metrics so it isn't like we are double decrementing.

I am not able to reproduce with the config you provided. I am wondering if there is something else required. Do note that if you were seeing this when running with --test, outputs are ignored and nothing is written to a file. The only output is to stdout.

Your logs also showed loading two or more files? Can you narrow it down to something we can reproduce?

Thanks!

@powersj powersj added the waiting for response waiting for response from contributor label Apr 26, 2023
@knollet
Contributor Author

knollet commented Apr 27, 2023

Hm, yes, there was more being loaded. I'll have to look into it. Thanks for the information.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Apr 27, 2023
@knollet
Contributor Author

knollet commented Apr 27, 2023

ok, wow, this is a little bit confusing...


root@host:/etc/telegraf/telegraf.d# telegraf --test --config sl-test.conf
2023-04-27T11:15:25Z I! Loading config file: sl-test.conf
panic: negative refcount

goroutine 28 [running]:
github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc000bcac90?)
        /go/src/github.com/influxdata/telegraf/metric/tracking.go:145 +0x90
github.com/influxdata/telegraf/metric.(*trackingMetric).Drop(0xc000bcacf0?)
        /go/src/github.com/influxdata/telegraf/metric/tracking.go:139 +0x19
github.com/influxdata/telegraf/agent.(*Agent).runProcessors.func1(0xc000969230)
        /go/src/github.com/influxdata/telegraf/agent/agent.go:652 +0x16a
created by github.com/influxdata/telegraf/agent.(*Agent).runProcessors
        /go/src/github.com/influxdata/telegraf/agent/agent.go:645 +0x3c
root@host:/etc/telegraf/telegraf.d#
$ cat sl-test.conf
### If I remove the agent-section with the logfile in it, it stops crashing
[agent]
  logfile = "/tmp/doesntexist"   ## I have write permission

### If I remove this input, it stops crashing
[[inputs.nstat]]

### If I remove this input, it stops crashing
[[inputs.sysstat]]
  ## If I don't set these parameters, it stops crashing
  sadc_path = "/usr/lib64/sa/sadc"
  sadf_path = "/bin/sadf"


#################################################################################
[[inputs.tail]]
  ### same as in [agent]-section,
  ### I have write permissions to /tmp
  files = [ "/tmp/doesntexist" ]
  data_format = "grok"

### If I remove the NOTSPACE-part of the grok pattern, **it stops crashing**
  grok_patterns = ['''
     ^%{TIMESTAMP_ISO8601:timestamp:ts-"2006-01-02T15:04:05Z"}\s+%{NOTSPACE:level}\s+''']

#################################################################################

[[processors.starlark]]
  ### minimal starlark that is intended to throw an error
source = '''
def apply(metric):
  return metric.fields["submetric_that_doesnt_exist"]
'''

powersj added a commit to powersj/telegraf that referenced this issue Apr 27, 2023
If an error occurs in the starlark processor, tracking metrics are
rejected, decrementing their reference count. However, immediately
afterwards, when the agent runs processors, an error from the processor
causes the agent to also attempt to drop the metric. This double
reject + drop results in a negative reference count, panicking Telegraf.

This did not affect non-tracking metrics. Hitting it requires an input
that produces tracking metrics as well as an error in the starlark
script.

fixes: influxdata#13148
@powersj
Contributor

powersj commented Apr 27, 2023

I have put up #13156, which removes the drop from the starlark processor on error and lets the agent handle the drop. Can you give that a try and ensure it works as expected?

Hm, yes, there was more, that was loaded. I'll have to look into it. Thanks for the information.

Thank you! I can confirm that the new config produces the panic and that we are in fact decrementing a tail tracking metric twice:

starlark.go:61 Reject()
reject tail [0xc000d78800 0xc000d787a0] [0xc000d78780]
v=0
agent.go:653 Drop()
drop tail [0xc000d78800 0xc000d787a0] [0xc000d78780]
v=-1
panic: negative refcount

goroutine 68 [running]:
github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc000a142a0)
	/home/powersj/telegraf/metric/tracking.go:150 +0xf0
github.com/influxdata/telegraf/metric.(*trackingMetric).Drop(0xc000a142a0)
	/home/powersj/telegraf/metric/tracking.go:143 +0x169
github.com/influxdata/telegraf/agent.(*Agent).runProcessors.func1(0xc000f96e70)
	/home/powersj/telegraf/agent/agent.go:653 +0x1b6
created by github.com/influxdata/telegraf/agent.(*Agent).runProcessors
	/home/powersj/telegraf/agent/agent.go:645 +0x3c

The first decrement occurs in the starlark processor after we call apply on the metric. If there is an error, Telegraf rejects the metric.

This returns an error to the caller, which happens to be the agent's `runProcessors()`. If an error occurred while running the processor, Telegraf drops the metric. This is the second decrement.

I am a little surprised we have not gotten an error report about this before. My only guess is that to hit this, your starlark processor has to return an error while handling a tracking metric, which may not occur all that often once scripts add some additional checks.

Thanks again!

@knollet
Contributor Author

knollet commented May 2, 2023

I tried the artifact, but it produced a problem generating uint fields and wouldn't recognize influx_uint_support = false. With the artifact from
#13164
everything looked good so far.
