Input plugins frequently exceed collection interval on large systems. #2183

Closed

jubalfh opened this issue Dec 20, 2016 · 7 comments

Comments

jubalfh commented Dec 20, 2016

Short problem description

We're running quite a number of systems with telegraf as the metric-gathering agent, and we keep seeing the same pattern of behaviour: gaps in the metrics that cannot be explained by the daemon dropping them, and that correlate extremely well with log messages announcing that an input took longer to collect than its defined interval. A fresh example:

2016/12/20 10:03:00 E! ERROR: input [inputs.disk] took longer to collect than collection interval (10s)
2016/12/20 10:03:00 E! ERROR: input [inputs.kernel] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.net] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.system] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.cpu] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.mem] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.processes] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.netstat] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.swap] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.diskio] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.memcached] took longer to collect than collection interval (10s)

The frequency of the metric collection failures seems to be related to the amount of memory and the number of CPUs on the system: more powerful systems see collection failures much more often than smaller ones. It does not seem to be related to actual load – the systems run well below 70% CPU utilisation and, usually, below 85% memory usage.

As you probably can guess, this behaviour is a little bit annoying.

I hope I've included as much information as needed, but if you need anything more specific, let me know.

Relevant telegraf.conf:

[agent]
  interval = "10s"
  round_interval = false

  metric_batch_size = 512
  metric_buffer_limit = 16384

  collection_jitter = "2s"

  flush_interval = "20s"
  flush_jitter = "5s"

System info:

Usually: telegraf 1.1.1 on CentOS 6; Amazon EC2 r3.8xlarge instances (32 vCPUs, 240 GB of memory) with EBS volumes attached. The instance types are mostly a mix of r3s and m3s.

Steps to reproduce:

  1. service telegraf start
  2. zgrep "took longer to collect" /var/log/telegraf/*

Expected behavior:

Metric collection finishes within the collection interval.

Actual behavior:

Metric collection exceeds the collection interval.

sparrc (Contributor) commented Dec 20, 2016

There's not much we can do programmatically to speed up metric collection on your system; if it's falling behind, then it's falling behind. Did you try collecting at a longer interval?

jubalfh (Author) commented Dec 20, 2016

sparrc (Contributor) commented Dec 20, 2016

That was a serious question.

Like I said, if your system can't deal with making a few HTTP requests and cat-ing files, there's not much we can do to help you.

jubalfh (Author) commented Dec 20, 2016

OK, once more with feeling: the issue happens frequently only on larger systems. Larger, as in 32 vCPUs, 240 GB of memory, and terabyte-sized block devices. It is not related to the system load, but rather to the system size.

I was wondering whether the system metrics are being collected sequentially (that would explain the serial failures) and where exactly the bottleneck could be.

sparrc (Contributor) commented Dec 20, 2016

OK, you're free to submit a PR but this isn't really an actionable issue worth keeping open.

Each plugin runs in parallel. Since your logs don't single out any particular plugin, it seems to me like your system is having trouble keeping up.
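
As a rough illustration of that model, here is a minimal sketch of concurrent collection with a per-interval check. This is a simplified model, not Telegraf's actual agent code; the plugin names and no-op gather functions are placeholders:

package main

import (
    "log"
    "time"
)

// inputPlugin stands in for a telegraf input; gather is a placeholder for
// the plugin's Gather() call (reading procfs files, querying memcached, ...).
type inputPlugin struct {
    name   string
    gather func() error
}

// collect runs one plugin on its own ticker, independently of the others.
func collect(in inputPlugin, interval time.Duration) {
    for range time.Tick(interval) {
        start := time.Now()
        if err := in.gather(); err != nil {
            log.Printf("E! ERROR in input [%s]: %v", in.name, err)
        }
        // Anything that stalls the whole process (CPU starvation, paging,
        // blocked I/O) delays every goroutine at once, which is consistent
        // with all plugins logging this in the same second.
        if time.Since(start) > interval {
            log.Printf("E! ERROR: input [%s] took longer to collect than collection interval (%s)",
                in.name, interval)
        }
    }
}

func main() {
    inputs := []inputPlugin{
        {name: "inputs.mem", gather: func() error { return nil }},
        {name: "inputs.cpu", gather: func() error { return nil }},
    }
    for _, in := range inputs {
        go collect(in, 10*time.Second) // each input gets its own goroutine
    }
    select {} // block forever; stop with Ctrl-C
}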

The mem plugin, for example, literally only reads a file called /proc/meminfo, which is 25 lines long, and parses numbers out of it. It's a bit hard to understand how that could take a well-behaved system longer than 10 seconds.
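
For a sense of how little work that is, here is roughly the same read as a standalone sketch (not the plugin's actual code; the sample line in the comment is illustrative):

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strconv"
    "strings"
)

func main() {
    // The whole "collection" is one small, sequential read of /proc/meminfo
    // plus some string splitting.
    f, err := os.Open("/proc/meminfo")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    fields := map[string]uint64{}
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // Lines look like "MemTotal:       251902912 kB".
        parts := strings.Fields(scanner.Text())
        if len(parts) < 2 {
            continue
        }
        value, err := strconv.ParseUint(parts[1], 10, 64)
        if err != nil {
            continue
        }
        fields[strings.TrimSuffix(parts[0], ":")] = value
    }
    fmt.Println("MemTotal (kB):", fields["MemTotal"])
}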

Since you're running Linux, almost all of the system metrics come from procfs. I don't remember the exact structure of all of those files, but most measurements gather their info from a single file rather than being spread out across multiple files, no matter the scale (/proc/meminfo, /proc/stat, /proc/cpuinfo, etc.). The disk and CPU plugins might be exceptions depending on which function calls the plugin makes, but I don't think reading one text file per CPU should be too much trouble either.

The code behind that is here: https://github.com/shirou/gopsutil
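
One way to narrow the bottleneck down on an affected host is to time the underlying gopsutil calls directly, outside of telegraf. A minimal sketch against the public gopsutil API; the selection of calls is illustrative, not exhaustive:

package main

import (
    "fmt"
    "time"

    "github.com/shirou/gopsutil/cpu"
    "github.com/shirou/gopsutil/disk"
    "github.com/shirou/gopsutil/mem"
)

// timeIt runs fn once and prints how long it took; if any of these calls
// gets anywhere near the 10s interval, the bottleneck is below telegraf.
func timeIt(name string, fn func() error) {
    start := time.Now()
    err := fn()
    fmt.Printf("%-22s %-12v err=%v\n", name, time.Since(start), err)
}

func main() {
    timeIt("mem.VirtualMemory", func() error {
        _, err := mem.VirtualMemory()
        return err
    })
    timeIt("cpu.Times(percpu)", func() error {
        _, err := cpu.Times(true)
        return err
    })
    timeIt("disk.Partitions", func() error {
        _, err := disk.Partitions(false)
        return err
    })
    timeIt("disk.Usage(\"/\")", func() error {
        _, err := disk.Usage("/")
        return err
    })
}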

sparrc closed this as completed Dec 20, 2016
jubalfh (Author) commented Dec 20, 2016

That was… most unhelpful. Thanks anyway.

sparrc (Contributor) commented Dec 20, 2016

You can always contact [email protected] if you need special assistance with your particular hardware/software setup. GitHub issues are generally reserved for more specific issues.

influxdata locked and limited conversation to collaborators Dec 20, 2016