Telegraf crashes when buffer_strategy = "disk" and more than one output plugin of the same type is configured #15876

Closed
dondiro opened this issue Sep 13, 2024 · 2 comments · Fixed by #15966
Labels
bug unexpected problem or unintended behavior

dondiro commented Sep 13, 2024

Relevant telegraf.conf

# Configuration for telegraf agent
[agent]
  debug = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  flush_interval = "10s"
  quiet = false
  omit_hostname = true
  buffer_strategy = "disk"
  buffer_directory = "var"

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  alias = "influxdb-data"
  namedrop = ["telegraf*"]
  urls = ["http://127.0.0.1:8086"]
  database = "telegraf"

[[outputs.influxdb]]
  alias = "influxdb-internal"
  namepass = ["telegraf*"]
  urls = ["http://127.0.0.1:8096"]
  database = "monitoring"
  
###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Collect statistics about itself
[[inputs.internal]]
  name_prefix = "telegraf_"

# Read metrics about cpu usage
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
  core_tags = false

Logs from Telegraf

2024-09-12T12:46:20Z I! Loading config: .\telegraf.conf
2024-09-12T12:46:20Z W! Using disk buffer strategy for plugin outputs.influxdb, this is an experimental feature
2024-09-12T12:46:20Z W! Using disk buffer strategy for plugin outputs.influxdb, this is an experimental feature
2024-09-12T12:46:20Z I! Starting Telegraf 1.32.0 brought to you by InfluxData the makers of InfluxDB
2024-09-12T12:46:20Z I! Available plugins: 235 inputs, 9 aggregators, 32 processors, 26 parsers, 62 outputs, 5 secret-stores
2024-09-12T12:46:20Z I! Loaded inputs: cpu internal
2024-09-12T12:46:20Z I! Loaded aggregators:
2024-09-12T12:46:20Z I! Loaded processors:
2024-09-12T12:46:20Z I! Loaded secretstores:
2024-09-12T12:46:20Z I! Loaded outputs: influxdb (2x)
2024-09-12T12:46:20Z I! Tags enabled:
2024-09-12T12:46:20Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2024-09-12T12:46:20Z D! [agent] Initializing plugins
2024-09-12T12:46:20Z D! [agent] Connecting outputs
2024-09-12T12:46:20Z D! [agent] Attempting connection to [outputs.influxdb::influxdb-data]
2024-09-12T12:46:20Z D! [agent] Successfully connected to outputs.influxdb::influxdb-data
2024-09-12T12:46:20Z D! [agent] Attempting connection to [outputs.influxdb::influxdb-internal]
2024-09-12T12:46:20Z D! [agent] Successfully connected to outputs.influxdb::influxdb-internal
2024-09-12T12:46:20Z D! [agent] Starting service inputs
2024-09-12T12:46:30Z D! [outputs.influxdb::influxdb-data] Buffer fullness: 0 metrics
2024-09-12T12:46:30Z D! [outputs.influxdb::influxdb-internal] Wrote batch of 8 metrics in 6.1253ms
2024-09-12T12:46:30Z D! [outputs.influxdb::influxdb-internal] Buffer fullness: 8 metrics
2024-09-12T12:46:40Z D! [outputs.influxdb::influxdb-data] Wrote batch of 9 metrics in 7.721ms
2024-09-12T12:46:40Z D! [outputs.influxdb::influxdb-internal] Wrote batch of 16 metrics in 6.6225ms
2024-09-12T12:46:40Z D! [outputs.influxdb::influxdb-internal] Buffer fullness: 22 metrics
2024-09-12T12:46:40Z D! [outputs.influxdb::influxdb-data] Buffer fullness: 22 metrics
2024-09-12T12:46:50Z E! raw metric data: []
2024-09-12T12:46:50Z E! raw metric data: []
panic: failed to decode metric from bytes: EOF

goroutine 57 [running]:
github.com/influxdata/telegraf/models.(*DiskBuffer).Batch(0xc00176f710, 0x3e8)
        /go/src/github.com/influxdata/telegraf/models/buffer_disk.go:146 +0x52f
github.com/influxdata/telegraf/models.(*RunningOutput).Write(0xc0023b0e70)
        /go/src/github.com/influxdata/telegraf/models/running_output.go:292 +0x3a4
github.com/influxdata/telegraf/agent.(*Agent).flushOnce.func1()
        /go/src/github.com/influxdata/telegraf/agent/agent.go:942 +0x23
created by github.com/influxdata/telegraf/agent.(*Agent).flushOnce in goroutine 32
        /go/src/github.com/influxdata/telegraf/agent/agent.go:941 +0xa6
panic: failed to decode metric from bytes: EOF

goroutine 98 [running]:
github.com/influxdata/telegraf/models.(*DiskBuffer).Batch(0xc00176f7a0, 0x3e8)
        /go/src/github.com/influxdata/telegraf/models/buffer_disk.go:146 +0x52f
github.com/influxdata/telegraf/models.(*RunningOutput).Write(0xc0023b0f20)
        /go/src/github.com/influxdata/telegraf/models/running_output.go:292 +0x3a4
github.com/influxdata/telegraf/agent.(*Agent).flushOnce.func1()
        /go/src/github.com/influxdata/telegraf/agent/agent.go:942 +0x23
created by github.com/influxdata/telegraf/agent.(*Agent).flushOnce in goroutine 33
        /go/src/github.com/influxdata/telegraf/agent/agent.go:941 +0xa6

System info

Telegraf 1.32.0 (reproduced on Windows and Kubernetes)

Docker

No response

Steps to reproduce

  1. Run two different InfluxDB instances
  2. Run Telegraf with the provided configuration: buffer_strategy = "disk", two InfluxDB output plugins configured, one input plugin to collect metrics, and the internal input plugin to collect Telegraf's own metrics.
  3. Wait a few seconds for the error to appear (possibly one or more of the configured flush intervals).

Expected behavior

Telegraf writes metrics to the configured InfluxDBs without crashes.

Actual behavior

Telegraf panics and stops working. Telegraf uses a folder named after the plugin to write the buffer data to disk. If the output plugins are of the same type, the concurrent goroutines may be trying to read and write the same data file inside this folder. Using the alias as the folder name might solve the issue.

Additional info

No response

@dondiro dondiro added the bug unexpected problem or unintended behavior label Sep 13, 2024
srebhan (Member) commented Oct 2, 2024

@dondiro please check the binary in PR #15966, available as soon as CI has finished the tests, and let us know if this fixes the issue!

dondiro (Author) commented Oct 3, 2024

@srebhan and @DStrand1 thanks for the fix. It works in my test case.
