
Service stop after logrotate #685

Closed
wgrcunha opened this issue Feb 11, 2016 · 18 comments · Fixed by #728

Labels: bug (unexpected problem or unintended behavior)

@wgrcunha
Hello,

Sometimes I see the agent running but not sending metrics or even logging. When this happens the log file is empty, and I believe it is caused by logrotate. I added this to my logrotate config:

postrotate
        service telegraf restart > /dev/null
endscript

And now it's working fine. Is anyone else seeing the same problem?

(Using the Debian package.)
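
For reference, a complete stanza along these lines might look like the following. This is a sketch: the log path matches the Debian package, but the rotation settings are assumptions, not the shipped defaults.

/var/log/telegraf/telegraf.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    postrotate
        # restart so the agent reopens its log file after rotation
        service telegraf restart > /dev/null
    endscript
}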

Thanks!

@sparrc
Contributor

sparrc commented Feb 15, 2016

Certainly sounds like a bug, but I'd like to know if others are seeing it as well.

sparrc added the bug (unexpected problem or unintended behavior) label Feb 15, 2016
@wgrcunha
Author

@sparrc Would it be possible to log to syslog instead of a file? Then we could control rotation, retention, and so on with standard syslog tooling.

@sparrc
Contributor

sparrc commented Feb 16, 2016

Sure, but I won't have time to work on this for a while. In the meantime, you could also run Telegraf under systemd, which sends output to the journal (viewable with journalctl) instead of a file.
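
A minimal sketch of such a unit, for illustration only; the binary path and flags are taken from the process listing later in this thread, everything else is assumed:

[Unit]
Description=Telegraf agent
After=network.target

[Service]
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
# with no log file in the way, stdout/stderr go to the journal,
# so there is nothing for logrotate to truncate
Restart=on-failure

[Install]
WantedBy=multi-user.target

You can then follow the agent's output with journalctl -u telegraf -f.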

@wgrcunha
Author

@sparrc After more testing, the agent stops sending metrics even when running under systemd:

Sample log: the agent's last write is at 15:43:01, and it only starts reporting again when I restart the daemon.

2016/02/19 15:42:21 Wrote 0 metrics to output influxdb in 1.265238955s
2016/02/19 15:42:31 Wrote 0 metrics to output influxdb in 1.214569861s
2016/02/19 15:42:41 Wrote 0 metrics to output influxdb in 1.254252425s
2016/02/19 15:42:51 Wrote 0 metrics to output influxdb in 1.006064584s
2016/02/19 15:43:00 Gathered metrics, (1m0s interval), from 9 inputs in 127.702047ms
2016/02/19 15:43:01 Wrote 64 metrics to output influxdb in 1.119452438s
2016/02/19 15:44:00 Gathered metrics, (1m0s interval), from 9 inputs in 123.996917ms
2016/02/19 15:45:00 Gathered metrics, (1m0s interval), from 9 inputs in 144.462953ms
2016/02/19 15:46:00 Gathered metrics, (1m0s interval), from 9 inputs in 131.375083ms
2016/02/19 15:47:00 Gathered metrics, (1m0s interval), from 9 inputs in 122.787756ms
2016/02/19 15:48:00 Gathered metrics, (1m0s interval), from 9 inputs in 122.867636ms
2016/02/19 15:49:00 Gathered metrics, (1m0s interval), from 9 inputs in 123.180944ms
2016/02/19 15:50:00 Gathered metrics, (1m0s interval), from 9 inputs in 123.750488ms
2016/02/19 15:51:00 Gathered metrics, (1m0s interval), from 9 inputs in 136.962833ms
2016/02/19 15:52:00 Gathered metrics, (1m0s interval), from 9 inputs in 123.83557ms
2016/02/19 15:53:00 Gathered metrics, (1m0s interval), from 9 inputs in 152.097537ms
2016/02/19 15:54:00 Gathered metrics, (1m0s interval), from 9 inputs in 150.002685ms
2016/02/19 15:55:00 Gathered metrics, (1m0s interval), from 9 inputs in 125.264287ms
2016/02/19 15:56:00 Gathered metrics, (1m0s interval), from 9 inputs in 133.351036ms
2016/02/19 15:57:00 Gathered metrics, (1m0s interval), from 9 inputs in 120.823581ms
2016/02/19 15:58:00 Gathered metrics, (1m0s interval), from 9 inputs in 144.377225ms
2016/02/19 16:45:15 Starting Telegraf (version 0.10.2)
2016/02/19 16:45:15 Loaded outputs: influxdb
2016/02/19 16:45:15 Loaded inputs: swap system net netstat cpu disk diskio mem exec
2016/02/19 16:45:15 Tags enabled: cfdescriptions=* host=* team=* tool=telegraf
2016/02/19 16:45:15 Agent Config: Interval:1m0s, Debug:false, Quiet:false, Hostname:"*", Flush Interval:10s
2016/02/19 16:46:00 Gathered metrics, (1m0s interval), from 9 inputs in 175.587834ms
2016/02/19 16:46:10 Wrote 64 metrics to output influxdb in 66.437421ms
2016/02/19 16:46:20 Wrote 0 metrics to output influxdb in 20.614915ms
2016/02/19 16:46:30 Wrote 0 metrics to output influxdb in 22.260462ms
2016/02/19 16:46:40 Wrote 0 metrics to output influxdb in 18.838846ms
2016/02/19 16:46:50 Wrote 0 metrics to output influxdb in 20.438875ms
2016/02/19 16:47:00 Gathered metrics, (1m0s interval), from 9 inputs in 140.146721ms

This is version 0.10.2; I will try updating to 0.10.3 today to see if it helps. I have about 1000 hosts reporting, and this happens on random servers every day.

@sparrc
Contributor

sparrc commented Feb 19, 2016

@wgrcunha So you are seeing this issue even after adding the logrotate change? It looks like your instance is still logging but something is hung in the write path?

@wgrcunha
Author

@sparrc Yes, and when using systemd too.

@sparrc
Contributor

sparrc commented Feb 19, 2016

I'll try to reproduce; can you give your OS details?

If you are able to send the process a SIGQUIT (Ctrl-\) while this is happening, the resulting goroutine dump would help a lot.
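
For example, something like this should capture the dump without a terminal attached. A sketch, assuming stderr goes to the package's default log file (as the lsof output later in this thread shows):

# SIGQUIT makes the Go runtime print a full goroutine dump to stderr
# (and then exit); stderr is redirected to the agent's log file here
kill -QUIT "$(pidof telegraf)"
tail -n 200 /var/log/telegraf/telegraf.log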

@wgrcunha
Author

/etc/debian_version -> 7.8
uname -a -> Linux fring0013 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u1 x86_64 GNU/Linux

I tried running as root, but got the same problem:

root     21881  0.3  0.0 199708 12984 ?        Sl   06:25   2:17 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

Log

2016/02/19 15:42:01 Wrote 60 metrics to output influxdb in 1.333597222s
2016/02/19 15:42:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.921914358s
2016/02/19 15:42:10 Wrote 2 metrics to output influxdb in 116.105263ms
2016/02/19 15:42:20 Wrote 0 metrics to output influxdb in 41.026027ms
2016/02/19 15:42:30 Wrote 0 metrics to output influxdb in 58.730717ms
2016/02/19 15:42:40 Wrote 0 metrics to output influxdb in 76.754311ms
2016/02/19 15:42:50 Wrote 0 metrics to output influxdb in 58.222168ms
2016/02/19 15:43:01 Wrote 60 metrics to output influxdb in 1.101402649s
2016/02/19 15:43:02 Gathered metrics, (1m0s interval), from 8 inputs in 2.556971171s
2016/02/19 15:44:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.829782121s
2016/02/19 15:45:04 Gathered metrics, (1m0s interval), from 8 inputs in 4.069923161s
2016/02/19 15:46:02 Gathered metrics, (1m0s interval), from 8 inputs in 2.657605177s
2016/02/19 15:47:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.876193623s
2016/02/19 15:48:02 Gathered metrics, (1m0s interval), from 8 inputs in 2.772723071s
2016/02/19 15:49:04 Gathered metrics, (1m0s interval), from 8 inputs in 4.275398812s
2016/02/19 15:50:02 Gathered metrics, (1m0s interval), from 8 inputs in 2.990015137s
2016/02/19 15:51:04 Gathered metrics, (1m0s interval), from 8 inputs in 4.622869253s
2016/02/19 15:52:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.633749382s
2016/02/19 15:53:04 Gathered metrics, (1m0s interval), from 8 inputs in 4.105144697s
2016/02/19 15:54:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.150430417s
2016/02/19 15:55:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.628746145s
2016/02/19 15:56:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.755732228s
2016/02/19 15:57:04 Gathered metrics, (1m0s interval), from 8 inputs in 4.275485434s
2016/02/19 15:58:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.352019822s
2016/02/19 15:59:03 Gathered metrics, (1m0s interval), from 8 inputs in 3.810145085s

lsof

COMMAND    PID USER   FD   TYPE     DEVICE SIZE/OFF    NODE NAME
telegraf 21881 root  cwd    DIR      202,1     4096       2 /
telegraf 21881 root  rtd    DIR      202,1     4096       2 /
telegraf 21881 root  txt    REG      202,1 30978672 1712258 /usr/bin/telegraf
telegraf 21881 root  mem    REG      202,1  1599504  137234 /lib/x86_64-linux-gnu/libc-2.13.so
telegraf 21881 root  mem    REG      202,1   131107  137249 /lib/x86_64-linux-gnu/libpthread-2.13.so
telegraf 21881 root  mem    REG      202,1   136936  137246 /lib/x86_64-linux-gnu/ld-2.13.so
telegraf 21881 root    0r   CHR        1,3      0t0    1028 /dev/null
telegraf 21881 root    1w   CHR        1,3      0t0    1028 /dev/null
telegraf 21881 root    2w   REG      202,1   292540 1179684 /var/log/telegraf/telegraf.log
telegraf 21881 root    3r   CHR        1,9      0t0    1033 /dev/urandom
telegraf 21881 root    4u  IPv4 1724666747      0t0     TCP ********:60653->**********:https (ESTABLISHED)
telegraf 21881 root    6u  0000        0,9        0    3270 anon_inode

Reptyr

[-] Timed out waiting for child stop.
[+] Allocated scratch page: 7fee0ed9d000
[+] Opened the new tty in the child: 5
[+] Target is not a session leader, attempting to setsid.
[+] Forked a child: 23568
[+] Did setsid()
[+] Set the controlling tty
^\SIGQUIT: quit
PC=0x460cd1 m=0

goroutine 0 [idle]:
runtime.futex(0x18a20b0, 0x0, 0x0, 0x0, 0x0, 0x18a0fc0, 0x0, 0x0, 0x410b84, 0x18a20b0, ...)
    /usr/local/go/src/runtime/sys_linux_amd64.s:288 +0x21
runtime.futexsleep(0x18a20b0, 0x0, 0xffffffffffffffff)
    /usr/local/go/src/runtime/os1_linux.go:39 +0x53
runtime.notesleep(0x18a20b0)
    /usr/local/go/src/runtime/lock_futex.go:142 +0xa4
runtime.stopm()
    /usr/local/go/src/runtime/proc1.go:1136 +0x112
runtime.exitsyscall0(0xc820969680)
    /usr/local/go/src/runtime/proc1.go:2146 +0x16c
runtime.mcall(0x7fff56bb0f70)
    /usr/local/go/src/runtime/asm_amd64.s:204 +0x5b

goroutine 1 [semacquire, 94 minutes]:
sync.runtime_Semacquire(0xc8205c0afc)
    /usr/local/go/src/runtime/sema.go:43 +0x26
sync.(*WaitGroup).Wait(0xc8205c0af0)
    /usr/local/go/src/sync/waitgroup.go:126 +0xb4
github.com/influxdata/telegraf.(*Agent).gatherParallel(0xc82002a108, 0xc8205b2420, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:148 +0x145
github.com/influxdata/telegraf.(*Agent).Run(0xc82002a108, 0xc8205b2120, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:372 +0x8e6
main.main()
    /home/telebuild/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:234 +0x1ca8

goroutine 17 [syscall, 667 minutes, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

goroutine 5 [syscall, 667 minutes]:
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
    /usr/local/go/src/os/signal/signal_unix.go:28 +0x37

goroutine 11688 [chan send, 94 minutes]:
github.com/influxdata/telegraf.(*accumulator).AddFields(0xc820162600, 0x11121c8, 0x3, 0x0, 0xc8205487e0, 0xc8208ba5a0, 0x1, 0x1)
    /home/telebuild/go/src/github.com/influxdata/telegraf/accumulator.go:163 +0xcc8
github.com/influxdata/telegraf/plugins/inputs/system.(*CPUStats).Gather(0xc820159c80, 0x7fee0cda3748, 0xc820162600, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/inputs/system/cpu.go:99 +0x14aa
github.com/influxdata/telegraf.(*Agent).gatherParallel.func1(0xc8205c0af0, 0xc8205b2420, 0xc82002a108, 0x0, 0xc820159ce0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:137 +0x490
created by github.com/influxdata/telegraf.(*Agent).gatherParallel
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:141 +0x442

goroutine 35 [select, 667 minutes, locked to thread]:
runtime.gopark(0x13416f0, 0xc820511f28, 0x1117150, 0x6, 0xc820479018, 0x2)
    /usr/local/go/src/runtime/proc.go:185 +0x163
runtime.selectgoImpl(0xc820511f28, 0x0, 0x18)
    /usr/local/go/src/runtime/select.go:392 +0xa64
runtime.selectgo(0xc820511f28)
    /usr/local/go/src/runtime/select.go:212 +0x12
runtime.ensureSigM.func1()
    /usr/local/go/src/runtime/signal1_unix.go:227 +0x353
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

goroutine 36 [chan receive, 667 minutes]:
main.main.func1(0xc8205b2180, 0xc8205b2120, 0xc820504fc0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:206 +0x47
created by main.main
    /home/telebuild/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:216 +0x1575

goroutine 37 [semacquire, 109 minutes]:
sync.runtime_Semacquire(0xc82000db0c)
    /usr/local/go/src/runtime/sema.go:43 +0x26
sync.(*WaitGroup).Wait(0xc82000db00)
    /usr/local/go/src/sync/waitgroup.go:126 +0xb4
github.com/influxdata/telegraf.(*Agent).flush(0xc82002a108)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:263 +0xe9
github.com/influxdata/telegraf.(*Agent).flusher(0xc82002a108, 0xc8205b2120, 0xc8205b2420, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:281 +0x23f
github.com/influxdata/telegraf.(*Agent).Run.func1(0xc820236320, 0xc82002a108, 0xc8205b2120, 0xc8205b2420)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:337 +0x7f
created by github.com/influxdata/telegraf.(*Agent).Run
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:341 +0x4b3

goroutine 11219 [select, 109 minutes]:
net/http.(*persistConn).writeLoop(0xc82026e160)
    /usr/local/go/src/net/http/transport.go:1009 +0x40c
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:686 +0xc9d

goroutine 11685 [chan send, 94 minutes]:
github.com/influxdata/telegraf.(*accumulator).AddFields(0xc8205e2380, 0x1117bf0, 0x6, 0x0, 0xc8207f9a20, 0x0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/accumulator.go:163 +0xcc8
github.com/influxdata/telegraf/plugins/inputs/system.(*SystemStats).Gather(0x18c7c58, 0x7fee0cda3748, 0xc8205e2380, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/inputs/system/system.go:40 +0x416
github.com/influxdata/telegraf.(*Agent).gatherParallel.func1(0xc8205c0af0, 0xc8205b2420, 0xc82002a108, 0x0, 0xc820159b60)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:137 +0x490
created by github.com/influxdata/telegraf.(*Agent).gatherParallel
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:141 +0x442

goroutine 11690 [chan send, 94 minutes]:
github.com/influxdata/telegraf.(*accumulator).AddFields(0xc820640980, 0x1112708, 0x6, 0x0, 0xc8205d04b0, 0x0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/accumulator.go:163 +0xcc8
github.com/influxdata/telegraf/plugins/inputs/system.(*DiskIOStats).Gather(0xc820159d70, 0x7fee0cda3748, 0xc820640980, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/inputs/system/disk.go:132 +0x9c7
github.com/influxdata/telegraf.(*Agent).gatherParallel.func1(0xc8205c0af0, 0xc8205b2420, 0xc82002a108, 0x0, 0xc820159dd0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:137 +0x490
created by github.com/influxdata/telegraf.(*Agent).gatherParallel
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:141 +0x442

goroutine 11366 [select, 109 minutes]:
net/http.(*persistConn).roundTrip(0xc82026e160, 0xc82000dcf0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:1164 +0xb0d
net/http.(*Transport).RoundTrip(0xc82008a5a0, 0xc820348460, 0xa8, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:235 +0x530
net/http.send(0xc820348460, 0x7fee0cd9b4d8, 0xc82008a5a0, 0xc8205584b0, 0x0, 0x0)
    /usr/local/go/src/net/http/client.go:220 +0x52c
net/http.(*Client).send(0xc820159ec0, 0xc820348460, 0x4e, 0x0, 0x0)
    /usr/local/go/src/net/http/client.go:143 +0x15a
net/http.(*Client).doFollowingRedirects(0xc820159ec0, 0xc820348460, 0x13412a0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/client.go:380 +0xbc0
net/http.(*Client).Do(0xc820159ec0, 0xc820348460, 0x28, 0x0, 0x0)
    /usr/local/go/src/net/http/client.go:178 +0x188
github.com/influxdata/influxdb/client/v2.(*client).Write(0xc8201a6180, 0x7fee0cda3778, 0xc8200b80c0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/influxdb/client/v2/client.go:382 +0x8e5
github.com/influxdata/telegraf/plugins/outputs/influxdb.(*InfluxDB).Write(0xc820153180, 0xc82000daf0, 0x2, 0x2, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/outputs/influxdb/influxdb.go:148 +0x3aa
github.com/influxdata/telegraf/internal/models.(*RunningOutput).Write(0xc820014300, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/internal/models/running_output.go:60 +0xb0
github.com/influxdata/telegraf.(*Agent).flush.func1(0xc82000db00, 0xc820014300)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:255 +0x7a
created by github.com/influxdata/telegraf.(*Agent).flush
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:260 +0xc0

goroutine 11689 [chan send, 94 minutes]:
github.com/influxdata/telegraf.(*accumulator).AddFields(0xc82053d280, 0x1112700, 0x4, 0x0, 0xc8202489f0, 0x0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/accumulator.go:163 +0xcc8
github.com/influxdata/telegraf/plugins/inputs/system.(*DiskStats).Gather(0xc8201a6080, 0x7fee0cda3748, 0xc82053d280, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/inputs/system/disk.go:63 +0x8e8
github.com/influxdata/telegraf.(*Agent).gatherParallel.func1(0xc8205c0af0, 0xc8205b2420, 0xc82002a108, 0x0, 0xc820159d40)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:137 +0x490
created by github.com/influxdata/telegraf.(*Agent).gatherParallel
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:141 +0x442

goroutine 11687 [chan send, 92 minutes]:
github.com/influxdata/telegraf.(*accumulator).AddFields(0xc8201a62c0, 0x1115308, 0x7, 0x0, 0xc8203691d0, 0x0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/accumulator.go:163 +0xcc8
github.com/influxdata/telegraf/plugins/inputs/system.(*NetStats).Gather(0xc820434550, 0x7fee0cda3748, 0xc8201a62c0, 0x0, 0x0)
    /home/telebuild/go/src/github.com/influxdata/telegraf/plugins/inputs/system/netstat.go:61 +0x1063
github.com/influxdata/telegraf.(*Agent).gatherParallel.func1(0xc8205c0af0, 0xc8205b2420, 0xc82002a108, 0x0, 0xc820159c50)
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:137 +0x490
created by github.com/influxdata/telegraf.(*Agent).gatherParallel
    /home/telebuild/go/src/github.com/influxdata/telegraf/agent.go:141 +0x442

goroutine 11218 [IO wait, 110 minutes]:
net.runtime_pollWait(0x7fee09cd1f28, 0x72, 0xc82000c160)
    /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc8202ee530, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc8202ee530, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).Read(0xc8202ee4d0, 0xc820479000, 0x1000, 0x1000, 0x0, 0x7fee0cd97050, 0xc82000c160)
    /usr/local/go/src/net/fd_unix.go:232 +0x23a
net.(*conn).Read(0xc82011e028, 0xc820479000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:172 +0xe4
crypto/tls.(*block).readFromUntil(0xc82042c780, 0x7fee0cda1d18, 0xc82011e028, 0x5, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:455 +0xcc
crypto/tls.(*Conn).readRecord(0xc8200d1b80, 0x1341717, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:540 +0x2d1
crypto/tls.(*Conn).Read(0xc8200d1b80, 0xc820013000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:901 +0x167
net/http.noteEOFReader.Read(0x7fee0cda3598, 0xc8200d1b80, 0xc82026e1b8, 0xc820013000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:1370 +0x67
net/http.(*noteEOFReader).Read(0xc8205fac00, 0xc820013000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    <autogenerated>:126 +0xd0
bufio.(*Reader).fill(0xc8205b3560)
    /usr/local/go/src/bufio/bufio.go:97 +0x1e9
bufio.(*Reader).Peek(0xc8205b3560, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:132 +0xcc
net/http.(*persistConn).readLoop(0xc82026e160)
    /usr/local/go/src/net/http/transport.go:876 +0xf7
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:685 +0xc78

rax    0xca
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x18a20b0
rsi    0x0
rbp    0x1
rsp    0x7fff56bb0e80
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x9
r13    0x133ba7c
r14    0x30
r15    0x8
rip    0x460cd1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

@sparrc
Contributor

sparrc commented Feb 19, 2016

@wgrcunha what is your telegraf version?

@wgrcunha
Author

@sparrc 0.10.2

@wgrcunha
Author

@sparrc my bad, this one is 0.10.1

@sparrc
Contributor

sparrc commented Feb 19, 2016

Got it, that makes more sense. Thanks for the great debug output, BTW.

@sparrc
Contributor

sparrc commented Feb 19, 2016

Looks like I haven't set a timeout for the InfluxDB writes, which are hanging; that is definitely a bug.

As for the cause of the write hang, it could be that the database is overstressed with HTTP write requests.

@wgrcunha if you have 1000s of hosts, you probably want to be using the flush_jitter config option; are you doing this? Until I get the timeout fixed in a release, I think you have two options:

  1. Use UDP between Telegraf and InfluxDB
  2. Use the flush_jitter telegraf config option to spread your HTTP writes over a larger span of time (if you are not using it, all of your hosts are trying to write to InfluxDB at exactly the same moment); see the sketch after this list
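
A sketch of the agent section with jitter enabled; the values are illustrative, not tuned recommendations:

[agent]
  interval = "1m"
  flush_interval = "10s"
  # each flush is delayed by a random amount up to this value, so
  # thousands of hosts don't all hit InfluxDB at the same instant
  flush_jitter = "10s"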

@wgrcunha
Author

flush_jitter is currently 0s; I will change it and watch.
I'll send feedback in a few days.

Thanks!

@sparrc
Contributor

sparrc commented Feb 19, 2016

That was a little inaccurate: there is actually a timeout parameter for the influxdb output, so you should set it to something reasonable (say "5s" or "10s").

The current default of no timeout was a pretty dumb implementation choice on my part; I will change it to a reasonable default.
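
A sketch of that output section; the URL and database name are placeholders, and the timeout line is the part that matters:

[[outputs.influxdb]]
  urls = ["https://influxdb.example.com:8086"]  # placeholder
  database = "telegraf"
  # abort hung HTTP writes instead of blocking the flush loop forever
  timeout = "5s"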

@sparrc
Contributor

sparrc commented Feb 19, 2016

Just to be clear, I would recommend doing both:

  1. Set flush_jitter to at least 10s in your case (because you have 1000s of hosts)
  2. Set timeout under outputs.influxdb to "5s" so that hangs can't occur

@wgrcunha
Author

Changing it now; I can give feedback in a few days.

Thanks very much for the fast response :)

sparrc added a commit that referenced this issue Feb 19, 2016
default to 5s instead, since even if it times out we will cache the
points and move on

closes #685
@wgrcunha
Author

@sparrc The feedback is very positive: everything is working fine now with that configuration :)

geodimm pushed a commit to miketonks/telegraf that referenced this issue Mar 10, 2016
default to 5s instead, since even if it times out we will cache the
points and move on

closes influxdata#685