
"panic: negative refcount" when mutating tags in starlark for metrics from a tail input #14484

Closed
paulbrimicombe opened this issue Dec 20, 2023 · 5 comments · Fixed by #14523

@paulbrimicombe

paulbrimicombe commented Dec 20, 2023

Relevant telegraf.conf

[[outputs.discard]]

[[inputs.tail]]
  files = ["/var/log/something.log"]
  data_format = "json"
  json_time_key = "Time"
  json_time_format = "unix"
  watch_method = "poll"

[[processors.starlark]]
  script = "/etc/telegraf/postprocess.star"

Logs from Telegraf

panic: negative refcount

goroutine 53 [running]:
github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc002608f00?)
/go/src/github.com/influxdata/telegraf/metric/tracking.go:145 +0x85
github.com/influxdata/telegraf/metric.(*trackingMetric).Drop(0xc0021b9180?)
/go/src/github.com/influxdata/telegraf/metric/tracking.go:139 +0x13
github.com/influxdata/telegraf/agent.(*Agent).runAggregators.func1()
/go/src/github.com/influxdata/telegraf/agent/agent.go:720 +0xee
created by github.com/influxdata/telegraf/agent.(*Agent).runAggregators in goroutine 114
/go/src/github.com/influxdata/telegraf/agent/agent.go:707 +0x329

System info

Telegraf >= 1.29.0

Docker

Here is a minimal way to reproduce the error:

  • docker-compose.yaml

    version: "3"
    
    services:
      telegraf:
        image: telegraf:1.29
        volumes:
          - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
          - ./postprocess.star:/etc/telegraf/postprocess.star:ro
          - ./something.log:/var/log/something.log:ro
        command: --debug
  • ./telegraf.conf is the configuration file posted above

  • ./postprocess.star

    def apply(metric):
        # Cloning the metric before mutating the tags *appears* to solve the issue
        # Would this have any unforeseen consequences? Is this a valid workaround?
        # metric = deepcopy(metric)
    
        print(metric)
        metric.tags.pop("path", None)
    
        # Other operations that cause the panic include setting tag values:
        # metric.tags["path"] = "some-override"
    
        return [metric]
  • ./something.log should start as an empty file

Steps to reproduce

  1. Run the above setup with docker compose up

  2. Write to something.log using something like

    echo '{ "Value": 1, "Time": "'$(date +%s)'" }' >> something.log
  3. You'll see the output of the print line in the Starlark script showing the metric being processed

  4. Wait until the metrics are flushed to the output — at that point you'll get a panic: negative refcount message and the docker container will exit

Expected behavior

  • No Go panics
  • It should be possible to mutate metric tags in the Starlark processor

Actual behavior

  • Go panics when metric tags are mutated in the Starlark processor

Additional info

  • This bug is only present from 1.29 onwards (you can easily see this by changing the image tag in the docker-compose.yaml file shown above)
  • Seems to be closely related to #13148 (Starlark sometimes lets the agent crash)
  • There may be a bunch of other metric mutation operations that also cause this panic but we have only been mutating tags
  • This only seems to happen for tail inputs — file inputs don't suffer from the same issue
@paulbrimicombe paulbrimicombe added the bug unexpected problem or unintended behavior label Dec 20, 2023
@powersj
Contributor

powersj commented Dec 20, 2023

Thanks for the issue; I can confirm I see the same behavior.

This only seems to happen for tail inputs — file inputs don't suffer from the same issue

This only affects plugins that use tracking metrics, which call back to the input once a metric has been delivered so that the input plugin can consume more metrics. This includes plugins like tail, mqtt, rabbitmq, etc.

There has been a stream of issues around tracking metrics and Starlark, and we have tried to ensure we can continue tracking metrics even after they leave the Starlark processor. This means we have to analyze what Starlark returns and attempt to determine whether the original metric was returned with no changes. One thought was to mark everything as accepted once it hits the Starlark plugin to avoid this analysis, similar to the execd processor.

In this case, you make a change to the metric, so we mark it as accepted since the original metric was not returned. However, the output then also marks the same metric as accepted, so it is accepted twice, hence the panic.

The use of deepcopy is a workaround for now, but it comes at the cost of the time needed to copy the entire metric.
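For reference, a minimal sketch of that workaround in postprocess.star, simply spelling out the commented-out deepcopy line from the original script (whether deepcopy has other consequences is the reporter's open question above):

    def apply(metric):
        # Work on a deep copy so the tracking metric handed in by the
        # agent is not mutated in place.
        m = deepcopy(metric)
        m.tags.pop("path", None)

        # Returning the copy instead of the mutated original appears to
        # avoid the panic, per the workaround noted in the report above.
        return [m]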

edit: we will jump on this in the new year.

@paulbrimicombe
Author

Thanks @powersj, this is all useful information. We'll use deepcopy as a workaround for the moment and keep an eye out for when this is fixed.

@srebhan
Member

srebhan commented Jan 4, 2024

@paulbrimicombe can you please test the binary in PR #14523 and let me know if this fixes the issue!?!

@paulbrimicombe
Author

@srebhan I've just tested this and it's definitely fixed our tag mutation issue (there are no panics and the published metrics have the correct tags after the starlark processing). Any idea when this might be released please?

@powersj
Contributor

powersj commented Jan 5, 2024

Monday!
