
"panic: negative refcount" when mutating tags in starlark for metrics from a tail input #14484

Closed
paulbrimicombe opened this issue Dec 20, 2023 · 5 comments · Fixed by #14523

@paulbrimicombe

paulbrimicombe commented Dec 20, 2023

Relevant telegraf.conf

[[outputs.discard]]

[[inputs.tail]]
  files = ["/var/log/something.log"]
  data_format = "json"
  json_time_key = "Time"
  json_time_format = "unix"
  watch_method = "poll"

[[processors.starlark]]
  script = "/etc/telegraf/postprocess.star"

Logs from Telegraf

panic: negative refcount

goroutine 53 [running]:
github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc002608f00?)
/go/src/github.com/influxdata/telegraf/metric/tracking.go:145 +0x85
github.com/influxdata/telegraf/metric.(*trackingMetric).Drop(0xc0021b9180?)
/go/src/github.com/influxdata/telegraf/metric/tracking.go:139 +0x13
github.com/influxdata/telegraf/agent.(*Agent).runAggregators.func1()
/go/src/github.com/influxdata/telegraf/agent/agent.go:720 +0xee
created by github.com/influxdata/telegraf/agent.(*Agent).runAggregators in goroutine 114
/go/src/github.com/influxdata/telegraf/agent/agent.go:707 +0x329

System info

Telegraf >= 1.29.0

Docker

Here is a minimal way to reproduce the error:

  • docker-compose.yaml

    version: "3"
    
    services:
      telegraf:
        image: telegraf:1.29
        volumes:
          - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
          - ./postprocess.star:/etc/telegraf/postprocess.star:ro
          - ./something.log:/var/log/something.log:ro
        command: --debug
  • ./telegraf.conf is the configuration file posted above

  • ./postprocess.star

    def apply(metric):
        # Cloning the metric before mutating the tags *appears* to solve the issue
        # Would this have any unforeseen consequences? Is this a valid workaround?
        # metric = deepcopy(metric)
    
        print(metric)
        metric.tags.pop("path", None)
    
        # Other operations that cause the panic include setting tag values:
        # metric.tags["path"] = "some-override"
    
        return [metric]
  • ./something.log should start as an empty file

Steps to reproduce

  1. Run the above setup with docker compose up

  2. Write to something.log using something like

    echo '{ "Value": 1, "Time": "'$(date +%s)'" }' >> something.log
  3. You'll see the output of the print line in the Starlark script showing the metric being processed

  4. Wait until the metrics are flushed to the output — at that point you'll get a panic: negative refcount message and the docker container will exit

Expected behavior

  • No Go panics
  • It should be possible to mutate metric tags in the Starlark processor

Actual behavior

  • Go panics when metric tags are mutated in the Starlark processor

Additional info

  • This bug is only present from 1.29 onwards (you can easily see this by changing the image tag in the docker-compose.yaml file shown above)
  • Seems to be closely related to #13148 (Starlark sometimes lets the agent crash)
  • There may be a bunch of other metric mutation operations that also cause this panic but we have only been mutating tags
  • This only seems to happen for tail inputs — file inputs don't suffer from the same issue
@paulbrimicombe paulbrimicombe added the bug unexpected problem or unintended behavior label Dec 20, 2023
@powersj
Contributor

powersj commented Dec 20, 2023

Thanks for the issue; I can confirm I see the same behavior.

This only seems to happen for tail inputs — file inputs don't suffer from the same issue

This only affects plugins that use tracking metrics, which call back to the input once a metric has been delivered so that the input plugin can consume more metrics. This includes plugins like tail, mqtt, rabbitmq, etc.

There has been a stream of issues around tracking metrics and Starlark, and we have tried to ensure we can continue tracking metrics even after they leave the Starlark processor. This means we have to analyze what Starlark returns and attempt to determine whether the original metric was returned with no changes. One thought was to mark everything as accepted once it hits the Starlark plugin to avoid this analysis, similar to the execd processor.

In this case, you make a change to the metric, so we mark it as accepted since the original metric was not returned. However, the output then also marks the same metric as accepted, so it is accepted twice, hence the panic.

The use of deepcopy is a workaround for now, but it comes at the cost of the time needed to copy the entire metric.
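For reference, a minimal sketch of that workaround in postprocess.star, simply spelling out the commented-out deepcopy line from the original script (whether deepcopy has other consequences is the reporter's open question above):

    def apply(metric):
        # Work on a deep copy so the tracking metric handed in by the
        # agent is not mutated in place.
        m = deepcopy(metric)
        m.tags.pop("path", None)

        # Returning the copy instead of the mutated original appears to
        # avoid the panic, per the workaround noted in the report above.
        return [m]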

edit: we will jump on this in the new year.

@paulbrimicombe
Author

Thanks @powersj, this is all useful information. We'll use deepcopy as a workaround for the moment and keep an eye out for when this is fixed.

@srebhan
Member

srebhan commented Jan 4, 2024

@paulbrimicombe can you please test the binary in PR #14523 and let me know if this fixes the issue!?!

@paulbrimicombe
Author

@srebhan I've just tested this and it's definitely fixed our tag mutation issue (there are no panics and the published metrics have the correct tags after the starlark processing). Any idea when this might be released please?

@powersj
Contributor

powersj commented Jan 5, 2024

Monday!
