(a possibility of) regression between 1.0.1 and 1.1.0: input plugins exceeding collection time #2197
Comments
Hi, I'm observing the same behavior. I've increased the collection interval from 10s to 30s, but it still happens from time to time on all the different nodes running Telegraf. I'm running Telegraf inside a Docker container on a Kubernetes setup. I see these messages:
2017/01/11 14:24:21 E! ERROR: input [inputs.disk] took longer to collect than collection interval (30s)
I'll try to roll back to version 1.0.1 to see if it solves the issue.
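For context on what that log line means: Telegraf gathers each input on a fixed interval and emits this error when a plugin's Gather call has not finished within that interval. The following is a minimal, simplified sketch of that kind of check in Go; it is not Telegraf's actual scheduler, only an illustration of why raising the interval helps solely when collection reliably finishes inside it.

package main

import (
	"log"
	"time"
)

// gatherOnce stands in for a single input plugin's Gather call.
// Here it deliberately takes longer than the interval below.
func gatherOnce() {
	time.Sleep(12 * time.Second)
}

func main() {
	interval := 10 * time.Second
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		done := make(chan struct{})
		go func() {
			gatherOnce()
			close(done)
		}()
		select {
		case <-done:
			// collection finished in time
		case <-time.After(interval):
			log.Printf("E! ERROR: input took longer to collect than collection interval (%s)", interval)
			<-done // wait for the slow collection before starting the next round
		}
	}
}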
Well, I see exactly the same on telegraf 1.0.1, so it probably is a different issue.
Any update on this issue? We are also seeing this message for 'inputs.diskio' from time to time on a system that has lots of disks. However, even though I have devices = ["sda", "sdb"] in the config, it is taking longer than 10 seconds. Is it going through all the disks? Here's a test I just did now:
root# time telegraf -config /etc/telegraf/telegraf.conf --input-filter diskio --test
real 0m8.102s
Any idea why it is taking around 8 seconds and sometimes more than 10 seconds to collect stats from two disks? Any other debug option?
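One way to check whether the underlying collector really enumerates every block device regardless of the devices filter is to call the gopsutil disk package (which Telegraf builds on) directly. This is a standalone diagnostic sketch, not part of Telegraf, and the import path and signature of IOCounters may differ between gopsutil versions.

package main

import (
	"fmt"
	"time"

	"github.com/shirou/gopsutil/disk"
)

func main() {
	start := time.Now()
	// IOCounters reads per-device I/O stats for every device it can see;
	// timing it shows how much of the plugin's runtime is spent here.
	counters, err := disk.IOCounters()
	if err != nil {
		panic(err)
	}
	fmt.Printf("enumerated %d devices in %s\n", len(counters), time.Since(start))
	for name := range counters {
		fmt.Println(name)
	}
}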
Does this still occur with the 1.2.1 release and/or against master?
Yes, we are using this:
telegraf --version
Telegraf v1.2.1 (git: release-1.2 3b6ffb3)
We are also noticing high CPU usage when the diskio plugin is enabled. When I disable diskio the CPU usage goes down significantly (from 45% to 5%). I issued SIGQUIT and here is the output: https://gist.github.com/beheerderdag/06a052d5ab47f8b946e1f05634650bb7 and at a later time: https://gist.github.com/beheerderdag/835e93d7422a17186dadf64e98edae8a In the second instance I noticed this:
We found a workaround for our particular issue where the diskio plugin takes longer than 10 seconds and causes high CPU usage. A colleague of mine looked at the trace and the code and suggested this. We basically removed the udevadm call from the binary. The function to look at is Gather() and the DiskIOStats structure: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/disk.go#L127
The filtering for devices is taking place later in this function. https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/disk.go#L142
Our server's diskstats:
So for each and every disk it is calling udevadm to get the disk serial number, even though we have the default skip_serial_number option, which is set to true. To fix the issue for the time being, I made a quick binary patch. Here are the steps:
Test the diskio plugin, where you can see the original version was taking longer.
I resumed the service and there has been no error or high CPU usage so far. I hope this helps others and the developers.
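To make the ordering problem described above concrete, here is a minimal illustrative sketch in Go. It is not the actual Telegraf or gopsutil source; fetchSerialViaUdevadm and the exact udevadm arguments are stand-ins for the per-device lookup the commenter observed. The point is that the expensive lookup runs for every enumerated device before the devices filter can exclude anything, and that guarding it with a skip flag (roughly what the workaround achieves) removes the cost.

package main

import (
	"fmt"
	"os/exec"
)

// fetchSerialViaUdevadm stands in for the per-device serial-number lookup:
// one external process per device, so with thousands of devices this
// dominates the collection time.
func fetchSerialViaUdevadm(device string) string {
	out, err := exec.Command("udevadm", "info", "--query=property", "--name=/dev/"+device).Output()
	if err != nil {
		return ""
	}
	_ = out // parsing of the serial property omitted in this sketch
	return ""
}

// gather mimics the problematic ordering: the lookup happens for every
// enumerated device, and the devices filter is applied only afterwards.
func gather(all []string, wanted map[string]bool, skipSerial bool) {
	for _, dev := range all {
		serial := ""
		if !skipSerial { // the guard the workaround effectively restores
			serial = fetchSerialViaUdevadm(dev)
		}
		if len(wanted) > 0 && !wanted[dev] {
			continue // filtered out only after the lookup already ran
		}
		fmt.Println(dev, serial)
	}
}

func main() {
	gather(
		[]string{"sda", "sdb", "sdc"},
		map[string]bool{"sda": true, "sdb": true},
		true, // skip_serial_number = true
	)
}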
@beheerderdag Thanks for researching the root cause of this. I will write up an issue with the gopsutil project and we can begin discussing what the next step should be. In order to explain why we need to speed things up, can you add some info about the disk configuration of this system? I imagine it must have some special hardware or software setup to have 3680 disks.
Daniel, thanks for your reply and for following up on the issue. The server is part of a hierarchical storage management environment, so we have tons of tape storage devices and SAN disks.
If the issue is calling |
@phemmer I added that to the upstream issue, though in my mind ideally we can get the info without udev at all. I'm still hoping for a future with vdev ;)
The diskio change has been merged; I'm not sure if it was the cause of the original issue. If you are still having problems, we need to narrow it down to a minimal case on the latest version.
This is actually a follow-up to #2183 – I have verified today that with telegraf downgraded to 1.0.1 we don't see the
took longer to collect than collection interval
errors and the metrics are indeed being gathered, and that upgrading to 1.1.0 brings back the faulty behaviour. We have not changed the daemon's configuration between the 1.0 and 1.1 series. If you need me to assist you in any way, or you need any other details from my side, let me know.
Configuration details, error logs et al.
As in #2183; as it's locked, I can't copy the markdown here. Feel free to re-open the other ticket and I'll make edits to the title and contents accordingly and happily close this one.