Input plugins frequently exceed collection interval on large systems. #2183
Comments
There's not much we can do programmatically to speed up metric collection on your system; if it's falling behind, then it's falling behind. Did you try collecting at a longer interval?
…
That was a serious question. Like I said, if your system can't deal with making a few HTTP requests and catting files, there's not much we can do to help you.
OK, once more with feeling: the issue happens frequently only on larger systems. Larger, as in, 32 vCPUs, GBs of memory, terabyte-sized block devices. This is not related to the system load, but rather to the system size. I was wondering whether the system metrics are being collected sequentially (that would explain the serial failure in collecting them), and where exactly the bottleneck could be located.
OK, you're free to submit a PR, but this isn't really an actionable issue worth keeping open. Each plugin runs in parallel, and since your logs aren't pointing at any particular plugin(s), it seems to me like your system is having trouble keeping up. Since you're running Linux, almost all of the system metrics come from procfs. I don't remember the exact structure of all those files, but most measurements gather their info from a single file rather than being spread out between multiple files, no matter the scale (/proc/meminfo, /proc/stat, /proc/cpuinfo, etc.). The disk and CPU plugins might be exceptions depending on which function calls the plugin is running, but I don't think that reading a text file per CPU should be too much trouble either. The code behind that is here: https://github.com/shirou/gopsutil
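For reference, the cost of those single-file procfs reads can be measured directly. Below is a minimal standalone sketch (plain Go using only the standard library; it is not Telegraf or gopsutil code and assumes a standard Linux /proc layout):

```go
// procfs_timing.go - rough timing of the single-file procfs reads that back
// most of the system metrics mentioned above. Illustrative only; not
// Telegraf or gopsutil code.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// The files named here are standard on Linux; adjust if your layout differs.
	files := []string{"/proc/meminfo", "/proc/stat", "/proc/cpuinfo"}
	for _, f := range files {
		start := time.Now()
		data, err := os.ReadFile(f)
		if err != nil {
			fmt.Printf("%s: %v\n", f, err)
			continue
		}
		fmt.Printf("%s: %d bytes read in %s\n", f, len(data), time.Since(start))
	}
}
```

On a healthy system each of these reads typically completes in well under a millisecond, even on machines with many CPUs, which is the point being made here.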
That was… most unhelpful. Thanks anyway.
You can always contact [email protected] if you need special assistance with your particular hardware/software setup. GitHub issues are generally reserved for more specific issues.
Short problem description
We're running quite a number of systems with telegraf installed as the metric-gathering agent, and we're seeing a pattern of behaviour which is a bit annoying: gaps in the metrics that cannot be explained by the daemon dropping them, and that correlate extremely well with log messages announcing that a metric took longer to collect than its defined interval; see one of the fresh examples:

The frequency of the metric collection failures seems to be related to the general amount of memory and number of CPUs on the system, with more powerful systems seeing much more frequent collection failures than smaller ones. This does not seem to be related to actual load – the systems are operating well below 70% of CPU utilisation and, usually, below 85% of memory usage.
As you can probably guess, this behaviour is a little bit annoying.
I hope I attached as much information as needed, but if you need something more specific, let me know.
Relevant telegraf.conf:
System info:
Usually: telegraf 1.1.1 on CentOS 6; Amazon EC2 instance r3.8xlarge (32 vCPUs, 240 GB of memory) with EBS devices attached. The instances vary (mostly) between various r3 and m3 types.
Steps to reproduce:
service telegraf start
Expected behavior:
Metric collection finished within collection interval.
Actual behavior:
Metric collection exceeding collection interval.
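For context on what "exceeding collection interval" means mechanically: conceptually, each input is gathered on its own ticker, and a warning fires whenever a single gather runs past the configured interval. The sketch below is illustrative only (plain Go; the plugin names, timings, and log wording are assumptions, not Telegraf's actual implementation):

```go
// interval_check.go - illustrative sketch (not Telegraf's actual code) of how a
// collector that runs each input on its own ticker can detect that a single
// gather pass overran its collection interval.
package main

import (
	"log"
	"time"
)

// gather stands in for one input plugin's collection pass (hypothetical).
func gather(name string) {
	// Simulated work; a real plugin would read procfs, make HTTP calls, etc.
	time.Sleep(1500 * time.Millisecond)
}

func main() {
	interval := 1 * time.Second
	inputs := []string{"cpu", "mem", "disk"} // hypothetical plugin names

	// Each input runs in its own goroutine, mirroring the "each plugin runs
	// in parallel" behaviour described earlier in the thread.
	for _, name := range inputs {
		go func(name string) {
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			for range ticker.C {
				start := time.Now()
				gather(name)
				if elapsed := time.Since(start); elapsed > interval {
					log.Printf("input [%s] took longer to collect than interval (%s > %s)",
						name, elapsed, interval)
				}
			}
		}(name)
	}

	time.Sleep(5 * time.Second) // let a few intervals elapse for the demo
}
```

In this model a slow gather does not crash anything; it simply means the next scheduled collection is late or skipped, which would show up as exactly the kind of gaps described above.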