Starlark sometimes lets the agent crash #13148
Comments
Looking at the panic, Telegraf at the time is trying to call [...]. The specific call that causes the panic is in the agent's `runProcessors` during an error on Add. Before that, the processor's (starlark in this case) [...]. However, these are not the same metrics, so it isn't like we are double decrementing.

I am not able to reproduce with the config you provided, and I am wondering if something else is required. Do note that if you were seeing this when running with --test, outputs are ignored and nothing is written to a file; the only output is to stdout. Your logs also showed two or more files being loaded. Can you narrow it down to something we can reproduce?

Thanks!

Hm, yes, there was more that was loaded. I'll have to look into it. Thanks for the information.
ok, wow, this is a little bit confusing...
If an error occurs in the starlark processor, tracking metrics are rejected, decrementing their reference count. Immediately afterwards, the agent, which also drops a metric when a processor returns an error, decrements the same count again. This double reject + drop results in a negative reference count, panicking telegraf. Metrics that do not use tracking metrics were not affected: exposing the bug requires both a metric that uses a tracking metric and an error in starlark. fixes: influxdata#13148
I have put up #13156, which removes the drop from the starlark processor on error and lets the agent handle the drop. Can you give that a try and ensure it works as expected?
Thank you! I can confirm that the new config produces the panic and that we are in fact decrementing a tail tracking metric twice:

starlark.go:61 Reject()
reject tail [0xc000d78800 0xc000d787a0] [0xc000d78780]
v=0
agent.go:653 Drop()
drop tail [0xc000d78800 0xc000d787a0] [0xc000d78780]
v=-1
panic: negative refcount
goroutine 68 [running]:
github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc000a142a0)
/home/powersj/telegraf/metric/tracking.go:150 +0xf0
github.com/influxdata/telegraf/metric.(*trackingMetric).Drop(0xc000a142a0)
/home/powersj/telegraf/metric/tracking.go:143 +0x169
github.com/influxdata/telegraf/agent.(*Agent).runProcessors.func1(0xc000f96e70)
/home/powersj/telegraf/agent/agent.go:653 +0x1b6
created by github.com/influxdata/telegraf/agent.(*Agent).runProcessors
/home/powersj/telegraf/agent/agent.go:645 +0x3c

The first decrement occurs in the starlark processor after we call apply on the metric. If there is an error, telegraf rejects the metric. The processor then returns an error to the caller, which happens to be the agent's `runProcessors()`. If an error has occurred while running the processor, telegraf drops the metric. This is the second decrement.

I am a little surprised we have not gotten an error report about this before. My only guess is that in order to hit this, your starlark processor has to return an error when using a tracking metric, which may not occur all that often with some additional checks.

Thanks again!
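
To make the sequence above concrete, here is a minimal, self-contained sketch of the double decrement. It is not Telegraf's code: the `tracked` type and the `process` and `runOnce` functions are hypothetical stand-ins, and the starting count of 1 is an assumption chosen to match the `v=0` and `v=-1` values in the log output.

```go
package main

import (
	"errors"
	"fmt"
)

// tracked is a hypothetical stand-in for a tracking metric: it carries a
// reference count that must never fall below zero.
type tracked struct {
	refs int
}

// decr lowers the count and panics if it goes negative, similar in spirit
// to the check that fires in the stack trace.
func (t *tracked) decr() {
	t.refs--
	if t.refs < 0 {
		panic("negative refcount")
	}
}

// Reject and Drop both release one reference in this sketch.
func (t *tracked) Reject() { t.decr() }
func (t *tracked) Drop()   { t.decr() }

// process models a processor that rejects the metric itself on error
// and then also reports the error to its caller.
func process(m *tracked) error {
	m.Reject() // first decrement: refs goes from 1 to 0
	return errors.New("processor error")
}

// runOnce models the caller: on error it drops the metric as well.
// It returns the recovered panic message, or "" if nothing panicked.
func runOnce() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	m := &tracked{refs: 1} // assumed starting count
	if err := process(m); err != nil {
		m.Drop() // second decrement: refs goes from 0 to -1
	}
	return ""
}

func main() {
	fmt.Println(runOnce()) // prints "negative refcount"
}
```

In this sketch, deleting the `m.Reject()` call in `process` leaves the caller as the only place that releases the reference, so the count stops at zero and nothing panics. That is the same ownership choice described for the fix: the processor only reports the error, and the agent handles the drop.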
I tried the artifact, but it produced a problem with generating uint-fields and wouldn't recognize |
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.26.1, RHEL 8.7 (Ootpa)
Docker
No response
Steps to reproduce
...
Expected behavior
Starlark stack traces for the starlark errors in the log, but no Go-related memory-management errors with a panic.
Actual behavior
Sometimes telegraf panics.
Additional info
No response