Input plugins frequently exceed collection interval on large systems. #2183

Closed

jubalfh opened this issue Dec 20, 2016 · 7 comments

Comments

jubalfh commented Dec 20, 2016

Short problem description

We're running quite a number of systems with telegraf as the metric-gathering agent, and we keep seeing the same pattern of behaviour: gaps in the metrics that cannot be explained by the daemon dropping them, and that correlate extremely well with log messages announcing that an input took longer to collect than its defined interval. A fresh example:

2016/12/20 10:03:00 E! ERROR: input [inputs.disk] took longer to collect than collection interval (10s)
2016/12/20 10:03:00 E! ERROR: input [inputs.kernel] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.net] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.system] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.cpu] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.mem] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.processes] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.netstat] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.swap] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.diskio] took longer to collect than collection interval (10s)
2016/12/20 10:03:01 E! ERROR: input [inputs.memcached] took longer to collect than collection interval (10s)

The frequency of the metric collection failures seems to be related to the amount of memory and the number of CPUs on the system: more powerful systems see collection failures much more often than smaller ones. It does not seem to be related to actual load – the systems run well below 70% CPU utilisation and, usually, below 85% memory usage.

As you probably can guess, this behaviour is a little bit annoying.

I hope I've included as much information as needed, but if you need anything more specific, let me know.

Relevant telegraf.conf:

[agent]
  interval = "10s"
  round_interval = false

  metric_batch_size = 512
  metric_buffer_limit = 16384

  collection_jitter = "2s"

  flush_interval = "20s"
  flush_jitter = "5s"

System info:

Usually: telegraf 1.1.1 on CentOS 6; Amazon EC2 r3.8xlarge instances (32 vCPUs, 240 GB of memory) with EBS volumes attached. The instance types are mostly a mix of r3s and m3s.

Steps to reproduce:

  1. service telegraf start
  2. zgrep "took longer to collect" /var/log/telegraf/*

Expected behavior:

Metric collection finishes within the collection interval.

Actual behavior:

Metric collection exceeds the collection interval.

sparrc (Contributor) commented Dec 20, 2016

There's not much we can do programmatically to speed up metric collection on your system; if it's falling behind, then it's falling behind. Did you try collecting at a longer interval?

jubalfh (Author) commented Dec 20, 2016

sparrc (Contributor) commented Dec 20, 2016

That was a serious question.

Like I said, if your system can't deal with making a few HTTP requests and cat-ing files, there's not much we can do to help you.

jubalfh (Author) commented Dec 20, 2016

OK, once more with feeling: the issue happens frequently only on larger systems. Larger, as in 32 vCPUs, 240 GB of memory, and terabyte-sized block devices. It is not related to the system load, but rather to the system size.

I was wondering whether the system metrics are being collected sequentially (that would explain the serial failures) and where exactly the bottleneck could be.

sparrc (Contributor) commented Dec 20, 2016

OK, you're free to submit a PR but this isn't really an actionable issue worth keeping open.

Each plugin runs in parallel. Since your logs don't single out any particular plugin, it seems to me like your system is having trouble keeping up.
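
As a rough illustration of that model, here is a minimal sketch of concurrent collection with a per-interval check. This is a simplified model, not Telegraf's actual agent code; the plugin names and no-op gather functions are placeholders:

package main

import (
    "log"
    "time"
)

// inputPlugin stands in for a telegraf input; gather is a placeholder for
// the plugin's Gather() call (reading procfs files, querying memcached, ...).
type inputPlugin struct {
    name   string
    gather func() error
}

// collect runs one plugin on its own ticker, independently of the others.
func collect(in inputPlugin, interval time.Duration) {
    for range time.Tick(interval) {
        start := time.Now()
        if err := in.gather(); err != nil {
            log.Printf("E! ERROR in input [%s]: %v", in.name, err)
        }
        // Anything that stalls the whole process (CPU starvation, paging,
        // blocked I/O) delays every goroutine at once, which is consistent
        // with all plugins logging this in the same second.
        if time.Since(start) > interval {
            log.Printf("E! ERROR: input [%s] took longer to collect than collection interval (%s)",
                in.name, interval)
        }
    }
}

func main() {
    inputs := []inputPlugin{
        {name: "inputs.mem", gather: func() error { return nil }},
        {name: "inputs.cpu", gather: func() error { return nil }},
    }
    for _, in := range inputs {
        go collect(in, 10*time.Second) // each input gets its own goroutine
    }
    select {} // block forever; stop with Ctrl-C
}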

The mem plugin, for example, literally only reads a file called /proc/meminfo, which is 25 lines long, and parses numbers out of it. It's a bit hard to understand how that could take a well-behaved system longer than 10 seconds.
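
For a sense of how little work that is, here is roughly the same read as a standalone sketch (not the plugin's actual code; the sample line in the comment is illustrative):

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strconv"
    "strings"
)

func main() {
    // The whole "collection" is one small, sequential read of /proc/meminfo
    // plus some string splitting.
    f, err := os.Open("/proc/meminfo")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    fields := map[string]uint64{}
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // Lines look like "MemTotal:       251902912 kB".
        parts := strings.Fields(scanner.Text())
        if len(parts) < 2 {
            continue
        }
        value, err := strconv.ParseUint(parts[1], 10, 64)
        if err != nil {
            continue
        }
        fields[strings.TrimSuffix(parts[0], ":")] = value
    }
    fmt.Println("MemTotal (kB):", fields["MemTotal"])
}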

Since you're running Linux, almost all of the system metrics come from procfs. I don't remember the exact structure of all of those files, but most measurements gather their info from a single file rather than being spread out across multiple files, no matter the scale (/proc/meminfo, /proc/stat, /proc/cpuinfo, etc.). The disk and CPU plugins might be exceptions depending on which function calls the plugin makes, but I don't think reading one text file per CPU should be too much trouble either.

The code behind that is here: https://github.com/shirou/gopsutil
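
One way to narrow the bottleneck down on an affected host is to time the underlying gopsutil calls directly, outside of telegraf. A minimal sketch against the public gopsutil API; the selection of calls is illustrative, not exhaustive:

package main

import (
    "fmt"
    "time"

    "github.com/shirou/gopsutil/cpu"
    "github.com/shirou/gopsutil/disk"
    "github.com/shirou/gopsutil/mem"
)

// timeIt runs fn once and prints how long it took; if any of these calls
// gets anywhere near the 10s interval, the bottleneck is below telegraf.
func timeIt(name string, fn func() error) {
    start := time.Now()
    err := fn()
    fmt.Printf("%-22s %-12v err=%v\n", name, time.Since(start), err)
}

func main() {
    timeIt("mem.VirtualMemory", func() error {
        _, err := mem.VirtualMemory()
        return err
    })
    timeIt("cpu.Times(percpu)", func() error {
        _, err := cpu.Times(true)
        return err
    })
    timeIt("disk.Partitions", func() error {
        _, err := disk.Partitions(false)
        return err
    })
    timeIt("disk.Usage(\"/\")", func() error {
        _, err := disk.Usage("/")
        return err
    })
}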

sparrc closed this as completed Dec 20, 2016
jubalfh (Author) commented Dec 20, 2016

That was… most unhelpful. Thanks anyway.

sparrc (Contributor) commented Dec 20, 2016

You can always contact [email protected] if you need special assistance with your particular hardware/software setup. GitHub issues are generally reserved for more specific issues.

influxdata locked and limited conversation to collaborators Dec 20, 2016