(a possibility of) regression between 1.0.1 and 1.1.0: input plugins exceeding collection time #2197
Comments
Hi, I'm observing the same behavior. I've increased the collection interval from 10s to 30s, but it still happens from time to time on all the different nodes running Telegraf. I'm running Telegraf inside a Docker container on a Kubernetes setup. I see these messages:
2017/01/11 14:24:21 E! ERROR: input [inputs.disk] took longer to collect than collection interval (30s)
I'll try to roll back to version 1.0.1 to see if it solves the issue.
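For context on what that log line means: Telegraf gathers each input on a fixed interval and emits this error when a plugin's Gather call has not finished within that interval. The following is a minimal, simplified sketch of that kind of check in Go; it is not Telegraf's actual scheduler, only an illustration of why raising the interval helps solely when collection reliably finishes inside it.

package main

import (
	"log"
	"time"
)

// gatherOnce stands in for a single input plugin's Gather call.
// Here it deliberately takes longer than the interval below.
func gatherOnce() {
	time.Sleep(12 * time.Second)
}

func main() {
	interval := 10 * time.Second
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		done := make(chan struct{})
		go func() {
			gatherOnce()
			close(done)
		}()
		select {
		case <-done:
			// collection finished in time
		case <-time.After(interval):
			log.Printf("E! ERROR: input took longer to collect than collection interval (%s)", interval)
			<-done // wait for the slow collection before starting the next round
		}
	}
}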
Well, I see exactly the same on telegraf 1.0.1, so it probably is a different issue.
Any update on this issue? We are also seeing this message for 'inputs.diskio' from time to time on a system that has lots of disks. However, even though I have devices = ["sda", "sdb"] in the config, it is taking longer than 10 seconds. Is it going through all the disks? Here's a test I just did now:
root# time telegraf -config /etc/telegraf/telegraf.conf --input-filter diskio --test
real 0m8.102s
Any idea why it is taking around 8 seconds and sometimes more than 10 seconds to collect stats from two disks? Any other debug option?
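One way to check whether the underlying collector really enumerates every block device regardless of the devices filter is to call the gopsutil disk package (which Telegraf builds on) directly. This is a standalone diagnostic sketch, not part of Telegraf, and the import path and signature of IOCounters may differ between gopsutil versions.

package main

import (
	"fmt"
	"time"

	"github.com/shirou/gopsutil/disk"
)

func main() {
	start := time.Now()
	// IOCounters reads per-device I/O stats for every device it can see;
	// timing it shows how much of the plugin's runtime is spent here.
	counters, err := disk.IOCounters()
	if err != nil {
		panic(err)
	}
	fmt.Printf("enumerated %d devices in %s\n", len(counters), time.Since(start))
	for name := range counters {
		fmt.Println(name)
	}
}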
Does this still occur with the 1.2.1 release and/or against master?
Yes, we are using this:
telegraf --version
Telegraf v1.2.1 (git: release-1.2 3b6ffb3)
We are also noticing high CPU usage when the diskio plugin is enabled. When I disable diskio the CPU usage goes down significantly (from 45% to 5%). I issued SIGQUIT and here is the output: https://gist.github.com/beheerderdag/06a052d5ab47f8b946e1f05634650bb7 and at a later time: https://gist.github.com/beheerderdag/835e93d7422a17186dadf64e98edae8a In the second instance I noticed this:
We found a workaround for our particular issue where the diskio plugin takes longer than 10 seconds and causes high CPU usage. A colleague of mine looked at the trace and the code and suggested this. We basically removed the udevadm call from the binary. The function to look at is Gather() and the DiskIOStats structure: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/disk.go#L127
The filtering for devices is taking place later in this function. https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/disk.go#L142
Our server's diskstats:
So for each and every disk it is calling udevadm to get the disk serial number, even though we have the default skip_serial_number option, which is set to true. To fix the issue for the time being, I made a quick binary patch. Here are the steps:
Test the diskio plugin, where you can see the original version was taking longer.
I resumed the service and there has been no error or high CPU usage so far. I hope this helps others and the developers.
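To make the ordering problem described above concrete, here is a minimal illustrative sketch in Go. It is not the actual Telegraf or gopsutil source; fetchSerialViaUdevadm and the exact udevadm arguments are stand-ins for the per-device lookup the commenter observed. The point is that the expensive lookup runs for every enumerated device before the devices filter can exclude anything, and that guarding it with a skip flag (roughly what the workaround achieves) removes the cost.

package main

import (
	"fmt"
	"os/exec"
)

// fetchSerialViaUdevadm stands in for the per-device serial-number lookup:
// one external process per device, so with thousands of devices this
// dominates the collection time.
func fetchSerialViaUdevadm(device string) string {
	out, err := exec.Command("udevadm", "info", "--query=property", "--name=/dev/"+device).Output()
	if err != nil {
		return ""
	}
	_ = out // parsing of the serial property omitted in this sketch
	return ""
}

// gather mimics the problematic ordering: the lookup happens for every
// enumerated device, and the devices filter is applied only afterwards.
func gather(all []string, wanted map[string]bool, skipSerial bool) {
	for _, dev := range all {
		serial := ""
		if !skipSerial { // the guard the workaround effectively restores
			serial = fetchSerialViaUdevadm(dev)
		}
		if len(wanted) > 0 && !wanted[dev] {
			continue // filtered out only after the lookup already ran
		}
		fmt.Println(dev, serial)
	}
}

func main() {
	gather(
		[]string{"sda", "sdb", "sdc"},
		map[string]bool{"sda": true, "sdb": true},
		true, // skip_serial_number = true
	)
}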
@beheerderdag Thanks for researching the root cause of this. I will write up an issue with the gopsutil project and we can begin discussing what the next step should be. In order to explain why we need to speed things up, can you add some info about the disk configuration of this system? I imagine it must have some special hardware or software setup to have 3680 disks.
Daniel, thanks for your reply and for following up on the issue. The server is part of a hierarchical storage management environment, so we have tons of tape storage devices and SAN disks.
If the issue is calling |
@phemmer I added that to the upstream issue, though in my mind ideally we can get the info without udev at all. I'm still hoping for a future with vdev ;)
The diskio change has been merged; I'm not sure if it was the cause of the original issue. If you are still having problems, we need to narrow it down to a minimal case on the latest version.
This is actually a follow-up to #2183 – I have verified today that with telegraf downgraded to 1.0.1 we don't see the
took longer to collect than collection interval
errors and the metrics are indeed being gathered, and that upgrading to 1.1.0 brings back the faulty behaviour. We have not changed the daemon's configuration between the 1.0 and 1.1 series. If you need me to assist you in any way, or you need any other details from my side, let me know.
Configuration details, error logs et al.
As in #2183; as it's locked, I can't copy the markdown here. Feel free to re-open the other ticket and I'll make edits to the title and contents accordingly and happily close this one.