
(a possibility of) regression between 1.0.1 and 1.1.0: input plugins exceeding collection time #2197

Closed
jubalfh opened this issue Dec 22, 2016 · 12 comments

jubalfh commented Dec 22, 2016

This is actually a follow-up to #2183. I have verified today that with Telegraf downgraded to 1.0.1 we no longer see the "took longer to collect than collection interval" errors and the metrics are indeed being gathered, and that upgrading to 1.1.0 brings back the faulty behaviour. We have not changed the daemon's configuration between the 1.0 and 1.1 series.

If you need any assistance or further details from my side, let me know.

Configuration details, error logs, etc.

As in #2183; since that issue is locked, I can't copy the markdown here. Feel free to re-open the other ticket and I'll update its title and contents accordingly and happily close this one.

@jubalfh changed the title from "(probable) regression between 1.0.1 and 1.1.0: input plugins exceeding collection time" to "(a possibility of) regression between 1.0.1 and 1.1.0: input plugins exceeding collection time" on Dec 22, 2016
@sparrc added this to the Future Milestone milestone on Jan 6, 2017
lpic10 commented Jan 11, 2017

Hi, I'm observing the same behavior. I've increased the collection interval from 10s to 30s, but it still happens from time to time on all the different nodes running Telegraf.

I'm running Telegraf inside a Docker container on a Kubernetes setup.

I see these messages:

2017/01/11 14:24:21 E! ERROR: input [inputs.disk] took longer to collect than collection interval (30s)
2017/01/11 14:24:21 E! ERROR: input [inputs.kubernetes] took longer to collect than collection interval (30s)
2017/01/11 14:24:22 E! ERROR: input [inputs.influxdb] took longer to collect than collection interval (30s)
2017/01/11 14:24:22 E! ERROR: input [inputs.net] took longer to collect than collection interval (30s)
2017/01/11 14:24:22 E! ERROR: input [inputs.diskio] took longer to collect than collection interval (30s)
2017/01/11 14:24:27 E! ERROR: input [inputs.netstat] took longer to collect than collection interval (30s)
2017/01/11 14:24:34 E! ERROR: input [inputs.kernel] took longer to collect than collection interval (30s)
2017/01/11 14:24:34 E! ERROR: input [inputs.mem] took longer to collect than collection interval (30s)
2017/01/11 14:24:39 E! ERROR: input [inputs.elasticsearch] took longer to collect than collection interval (30s)
2017/01/11 14:24:39 E! ERROR: input [inputs.docker] took longer to collect than collection interval (30s)
2017/01/11 14:24:39 E! ERROR: input [inputs.prometheus] took longer to collect than collection interval (30s)
2017/01/11 14:24:40 E! ERROR: input [inputs.swap] took longer to collect than collection interval (30s)
2017/01/11 14:24:40 E! ERROR: input [inputs.system] took longer to collect than collection interval (30s)
2017/01/11 14:24:51 E! ERROR: input [inputs.cpu] took longer to collect than collection interval (30s)
2017/01/11 14:24:51 E! ERROR: input [inputs.disk] took longer to collect than collection interval (30s)
2017/01/11 14:24:51 E! ERROR: input [inputs.kubernetes] took longer to collect than collection interval (30s)
2017/01/11 14:24:52 E! ERROR: input [inputs.influxdb] took longer to collect than collection interval (30s)
2017/01/11 14:24:52 E! ERROR: input [inputs.diskio] took longer to collect than collection interval (30s)
2017/01/11 14:24:57 E! ERROR: input [inputs.netstat] took longer to collect than collection interval (30s)
2017/01/11 14:24:59 E! ERROR: input [inputs.net] took longer to collect than collection interval (30s)
2017/01/11 14:25:04 E! ERROR: input [inputs.mem] took longer to collect than collection interval (30s)
2017/01/11 14:25:06 E! ERROR: input [inputs.kernel] took longer to collect than collection interval (30s)
2017/01/11 14:25:09 E! ERROR: input [inputs.elasticsearch] took longer to collect than collection interval (30s)
2017/01/11 14:25:09 E! ERROR: input [inputs.docker] took longer to collect than collection interval (30s)
2017/01/11 14:25:09 E! ERROR: input [inputs.prometheus] took longer to collect than collection interval (30s)
2017/01/11 14:25:10 E! ERROR: input [inputs.swap] took longer to collect than collection interval (30s)
2017/01/11 14:25:10 E! ERROR: input [inputs.system] took longer to collect than collection interval (30s)
2017/01/11 14:25:21 E! ERROR: input [inputs.cpu] took longer to collect than collection interval (30s)
2017/01/11 14:25:21 E! ERROR: input [inputs.disk] took longer to collect than collection interval (30s)
2017/01/11 14:25:21 E! ERROR: input [inputs.kubernetes] took longer to collect than collection interval (30s)
2017/01/11 14:25:22 E! ERROR: input [inputs.influxdb] took longer to collect than collection interval (30s)
2017/01/11 14:25:22 E! ERROR: input [inputs.diskio] took longer to collect than collection interval (30s)
2017/01/11 14:25:27 E! ERROR: input [inputs.netstat] took longer to collect than collection interval (30s)
2017/01/11 14:25:29 E! ERROR: input [inputs.net] took longer to collect than collection interval (30s)
2017/01/11 14:25:36 E! ERROR: input [inputs.kernel] took longer to collect than collection interval (30s)
2017/01/11 14:25:39 E! ERROR: input [inputs.elasticsearch] took longer to collect than collection interval (30s)
2017/01/11 14:25:39 E! ERROR: input [inputs.docker] took longer to collect than collection interval (30s)

I'll try to roll back to version 1.0.1 to see if it solves the issue.

lpic10 commented Jan 11, 2017

Well, I see exactly the same behavior on Telegraf 1.0.1, so it is probably a different issue.

@beheerderdag

Any update on this issue? We are also seeing this message for inputs.diskio from time to time on a system that has lots of disks. However, even though I have devices = ["sda", "sdb"] in the config, it is taking longer than 10 seconds. Is it going through all the disks?

Here's a test I just did now:

root# time telegraf -config /etc/telegraf/telegraf.conf --input-filter diskio --test

* Plugin: inputs.diskio, Collection 1

diskio,host=dmf,name=sda write_bytes=109697900544i,read_time=10920576i,write_time=305980776i,io_time=52000288i,iops_in_progress=0i,reads=845062i,writes=4253753i,read_bytes=42971568128i 1490621167000000000
diskio,name=sdb,host=dmf write_time=243444i,io_time=4216520i,iops_in_progress=0i,reads=794736i,writes=11531i,read_bytes=1801284096i,write_bytes=25260032i,read_time=4228108i 1490621167000000000

real 0m8.102s
user 0m0.840s
sys 0m2.748s

Any idea why it is taking around 8 seconds, and sometimes more than 10 seconds, to collect stats from two disks? Are there any other debug options?

@danielnelson

Does this still occur with the 1.2.1 release and/or against master?

@beheerderdag

Yes, we are using this:

telegraf --version

Telegraf v1.2.1 (git: release-1.2 3b6ffb3)

beheerderdag commented Mar 29, 2017

We are also noticing high CPU usage when the diskio plugin is enabled. When I disable diskio, the CPU usage goes down significantly (from 45% to 5%).


I issued SIGQUIT and here is the output:

https://gist.github.com/beheerderdag/06a052d5ab47f8b946e1f05634650bb7

and at a later time:

https://gist.github.com/beheerderdag/835e93d7422a17186dadf64e98edae8a

In the second instance I noticed this:

goroutine 30 [select, 31 minutes, locked to thread]:
runtime.gopark(0x11575f0, 0x0, 0x107ae04, 0x6, 0x18, 0x2)
        /usr/local/go/src/runtime/proc.go:259 +0x13a
runtime.selectgoImpl(0xc42004cf20, 0x0, 0x18)
        /usr/local/go/src/runtime/select.go:423 +0x1235
runtime.selectgo(0xc42004cf20)
        /usr/local/go/src/runtime/select.go:238 +0x1c
runtime.ensureSigM.func1()
        /usr/local/go/src/runtime/signal1_unix.go:304 +0x2f3
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1

@beheerderdag

We found a workaround for our particular issue, where the diskio plugin takes longer than 10 seconds and causes high CPU usage. A colleague of mine looked at the trace and the code and suggested it: we basically removed the udevadm call from the binary.

The places to look at are the Gather() function and the DiskIOStats structure:

https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/disk.go#L127

func (s *DiskIOStats) Gather(acc telegraf.Accumulator) error {
	diskio, err := s.ps.DiskIO()
	if err != nil {
		return fmt.Errorf("error getting disk io info: %s", err)
	}

The filtering for devices takes place later in the same function:

https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/disk.go#L142

	for _, io := range diskio {
		_, member := devices[io.Name]
		if restrictDevices && !member {
			continue
		}

Our server's diskstats:

 # wc -l /proc/diskstats 
3680 /proc/diskstats

So for each and every disk it is calling udevadm to get the disk serial number, even though we have the default skip_serial_number option, which is set to true.
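For illustration only, here is a minimal sketch (not the actual Telegraf or gopsutil code; the function name is made up) of a gather loop that parses /proc/diskstats itself and applies both the device filter and the skip_serial_number check before paying for any per-device udevadm call:

package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// gatherDiskIO is a hypothetical, simplified gather loop: it reads
// /proc/diskstats directly and only shells out to udevadm for devices that
// pass the filter and actually need a serial number.
func gatherDiskIO(devices map[string]bool, skipSerial bool) error {
	f, err := os.Open("/proc/diskstats")
	if err != nil {
		return fmt.Errorf("error getting disk io info: %s", err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 14 {
			continue
		}
		name := fields[2]

		// Apply the device filter before any expensive per-device work.
		if len(devices) > 0 && !devices[name] {
			continue
		}

		serial := "unknown"
		if !skipSerial {
			// Only now pay the cost of an external udevadm call.
			out, err := exec.Command("udevadm", "info", "--query=property",
				"--name=/dev/"+name).Output()
			if err == nil {
				for _, line := range strings.Split(string(out), "\n") {
					if strings.HasPrefix(line, "ID_SERIAL=") {
						serial = strings.TrimPrefix(line, "ID_SERIAL=")
					}
				}
			}
		}
		// fields[3] and fields[7] are reads and writes completed.
		fmt.Printf("diskio,name=%s serial=%s reads=%s writes=%s\n",
			name, serial, fields[3], fields[7])
	}
	return scanner.Err()
}

func main() {
	// With skip_serial_number honored, 3680 entries in /proc/diskstats cost
	// 3680 line parses, not 3680 process spawns.
	devices := map[string]bool{"sda": true, "sdb": true}
	if err := gatherDiskIO(devices, true); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}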

To fix the issue for the time being, I made a quick binary patch (I edited the binary with vim in binary mode: vim -b). Here are the steps:

# service telegraf stop
# cp telegraf.orig telegraf 
# xxd /usr/bin/telegraf > t1.hex
# xxd /usr/bin/telegraf.orig > t2.hex

# diff t1.hex t2.hex 
820489c820489
< 0c85080: 6269 6e71 7564 6576 6164 6d2f 7573 722f  binqudevadm/usr/
---
> 0c85080: 6269 6e2f 7564 6576 6164 6d2f 7573 722f  bin/udevadm/usr/

Testing the diskio plugin shows that the original binary takes much longer. The patch changes the embedded /usr/bin/udevadm path to /usr/binqudevadm, so the per-disk udevadm exec fails immediately.

# time telegraf -config /etc/telegraf/telegraf.conf -input-filter diskio -test
* Plugin: inputs.diskio, Collection 1
> diskio,name=sda,host=dmf io_time=63586748i,iops_in_progress=0i,reads=1010872i,writes=5192339i,read_bytes=52354902016i,write_bytes=146722795520i,read_time=13411860i,write_time=374396368i 1490866191000000000
> diskio,name=sdb,host=dmf read_time=5214468i,write_time=3957984i,io_time=5225480i,iops_in_progress=0i,reads=998227i,writes=19431i,read_bytes=2248228864i,write_bytes=46245888i 1490866191000000000

real    0m0.481s
user    0m0.104s
sys     0m0.388s


# time telegraf.orig -config /etc/telegraf/telegraf.conf -input-filter diskio -test
* Plugin: inputs.diskio, Collection 1
> diskio,name=sdb,host=dmf write_time=3957984i,io_time=5225480i,iops_in_progress=0i,reads=998227i,writes=19431i,read_bytes=2248228864i,write_bytes=46245888i,read_time=5214468i 1490866208000000000
> diskio,name=sda,host=dmf io_time=63587200i,iops_in_progress=0i,reads=1010872i,writes=5192372i,read_bytes=52354902016i,write_bytes=146723295232i,read_time=13411860i,write_time=374397612i 1490866208000000000

real    0m7.523s
user    0m0.796s
sys     0m2.732s

I resumed the service and have seen no errors or high CPU usage so far. I hope this helps others and the developers.

@danielnelson

@beheerderdag Thanks for researching the root cause of this. I will write up an issue with the gopsutil project and we can begin discussing what the next step should be.

In order to explain why we need to speed things up, can you add some info about the disk configuration of this system? I imagine it must have some special hardware or software setup to have 3680 disks.

@beheerderdag

Daniel,

Thanks for your reply and for following up on the issue. The server is part of a hierarchical storage management environment, so we have tons of tape storage devices and SAN disks.

phemmer commented Apr 4, 2017

If the issue is calling udevadm info, it might be worth proposing upstream that the udev files be read directly instead of calling udevadm. Telegraf does this here to be able to add disk info as tags to the diskio input. Reading the files should take no more than a few microseconds.
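For reference, a rough sketch of that approach, assuming the udev database lives under /run/udev/data/ in files named b<major>:<minor> with property lines prefixed by E: (this is how udev stores its database on typical Linux systems, but it is not a stable public interface, and this is not the actual Telegraf implementation):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

// readUdevProperties stats the block device node to get its major:minor pair
// and then reads the matching udev database file instead of exec'ing udevadm.
func readUdevProperties(device string) (map[string]string, error) {
	var st unix.Stat_t
	if err := unix.Stat("/dev/"+device, &st); err != nil {
		return nil, err
	}
	major := unix.Major(uint64(st.Rdev))
	minor := unix.Minor(uint64(st.Rdev))

	f, err := os.Open(fmt.Sprintf("/run/udev/data/b%d:%d", major, minor))
	if err != nil {
		return nil, err
	}
	defer f.Close()

	props := map[string]string{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		// Property lines look like "E:ID_SERIAL=<serial>".
		if strings.HasPrefix(line, "E:") {
			if kv := strings.SplitN(line[2:], "=", 2); len(kv) == 2 {
				props[kv[0]] = kv[1]
			}
		}
	}
	return props, scanner.Err()
}

func main() {
	props, err := readUdevProperties("sda")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("ID_SERIAL:", props["ID_SERIAL"])
}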

@danielnelson

@phemmer I added that to the upstream issue, though in my mind ideally we can get the info without udev at all. I'm still hoping for a future with vdev ;)

@danielnelson

The diskio change has been merged; I'm not sure if it was the cause of the original issue. If you are still having problems, we need to narrow it down to a minimal case on the latest version.
