telegraf just stops working #1230

Closed
RainerW opened this issue May 19, 2016 · 19 comments · Fixed by #1235
Comments


RainerW commented May 19, 2016

I'm testing Telegraf on different systems and it seems to stop working after some time.
I suspect some kind of network hiccup, because it stops working on multiple servers at around the same time. But some servers just continued to work, so I'm fairly sure neither InfluxDB nor Grafana had a problem.

The systems in question are mostly Ubuntu 11, 14, or 16, but one of the Ubuntu 16 servers continued to work.
In all cases the log files just stopped getting new entries, while the process continued to run. My guess is that there is no safeguard around metric gathering, so when gathering hangs for whatever reason, Telegraf stops working.

The last log entries are:

2016/05/13 11:15:30 Wrote 21 metrics to output influxdb in 46.968928ms
2016/05/13 11:15:40 Gathered metrics, (10s interval), from 11 inputs in 33.493802ms

That is the point in time where Telegraf stopped reporting.
The process seems to still be running:

> ps aux | grep tel
telegraf  7095  0.0  0.0 129524  3372 ?        Sl   May13   0:25 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

After a service restart everything works fine again. But this has happened before, so I expect it will happen again.

I know this ticket is very broad, but I have nothing concrete to pin it down to. At minimum, there should be safeguards in place to prevent Telegraf from stopping completely.


RainerW commented May 19, 2016

Oh, I forgot: this happens with both 0.12.1 and 0.13.0.


sparrc commented May 19, 2016

can you provide your configuration?


sparrc commented May 19, 2016

Recently, in 0.13, some safeguards were added to prevent lockups when running exec commands, but not everything has been patched up yet: some of Telegraf's dependencies could still be running commands without timeouts.
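
For reference, the general Go pattern for guarding an external command with a timeout looks roughly like this (a generic sketch, not Telegraf's actual helper code):

package main

import (
	"log"
	"os/exec"
	"time"
)

// runWithTimeout starts cmd and kills it if it has not finished within
// timeout, so a wedged child process cannot hang the caller forever.
func runWithTimeout(cmd *exec.Cmd, timeout time.Duration) error {
	if err := cmd.Start(); err != nil {
		return err
	}
	timer := time.AfterFunc(timeout, func() {
		cmd.Process.Kill() // best effort; Wait below reports the outcome
	})
	defer timer.Stop()
	return cmd.Wait()
}

func main() {
	// Simulate a command that takes too long.
	err := runWithTimeout(exec.Command("sleep", "30"), 5*time.Second)
	log.Printf("command finished with: %v", err)
}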


RainerW commented May 19, 2016

This is basically the default config with:

  • NFS excluded
  • and apache monitoring enabled
  • telegraf 0.12.1 in that case

https://gist.github.com/RainerW/94c11ac7f4ce42ea17a12ee7e40257eb


RainerW commented May 19, 2016

On the server with 0.13.0:

  • NFS and glusterfs excluded
  • apache monitoring added via telegraf.d folder

https://gist.github.com/RainerW/586508bc90bca51e9d8cab0dfe98d97a


RainerW commented May 19, 2016

Hint: the Apache status check is not active on all instances. On one server I had the default 0.12.1 config (only nfs and glusterfs excluded) https://gist.github.com/RainerW/d3a5f38c1b69d13f69c75e1b1778ccee with the same effect.


sparrc commented May 19, 2016

If you could SIGQUIT (Ctrl-\) the process while it's hung and provide the stack trace, that would help a lot.
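
Something along these lines should work for a service (the log path and unit name below are just the usual package defaults, adjust to your setup):

# send SIGQUIT to the running telegraf process
kill -QUIT $(pgrep -x telegraf)
# the Go runtime writes the goroutine dump to stderr, so check wherever the
# service's stderr ends up, for example:
tail -n 200 /var/log/telegraf/telegraf.log
journalctl -u telegraf --since "5 minutes ago"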


RainerW commented May 19, 2016

Not totally sure how to do that with a service. "kill -3 6413" just quits the process, but the log does not contain a stack trace (at least on a restarted instance I tried, which was not hung; I would like to see a stack trace work first before using it on a hung instance):

2016/05/19 16:22:40 Gathered metrics, (10s interval), from 8 inputs in 54.154783ms
Verlassen ("German for exited")


sparrc commented May 19, 2016

That's odd; when I send a SIGQUIT I get a stack trace in the logs. I'm not sure then what's going on with yours: is the disk full or locked so that it can't even write? (Or maybe a permissions problem?)


sparrc commented May 19, 2016

since it's happening with the default config, it's likely related to #1215. It could even be the exact same problem.


sparrc commented May 19, 2016

You could check for hung child processes while the telegraf process is hung with something like:

ps aux -o ppid | grep `pgrep telegraf`
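
If the PPID-only output isn't enough to tell what the children are, a procps variant along these lines (illustrative, adjust as needed) lists them with their command lines:

ps -o pid,ppid,stat,cmd --ppid $(pgrep -x telegraf)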

sparrc added a commit that referenced this issue May 19, 2016
currently the input interface does not have any methods for killing a
running Gather call, so there is nothing we can do but log a "FATAL
ERROR" and move on. This will at least give some visibility into the
plugin that is acting up.

Open questions:

- should the telegraf process die and exit when this happens? This might
  be a better idea than leaving around the dead process.
- should the input interface have a Kill() method? I suspect not, since
  most inputs wouldn't have a way of killing themselves anyways.

closes #1230
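
In code, the approach described in that commit amounts to roughly the following (an illustrative sketch with made-up names, not the actual Telegraf change):

package main

import (
	"log"
	"time"
)

// Input mirrors the relevant part of Telegraf's input interface: a plugin
// exposes a Gather call that may block forever.
type Input interface {
	Name() string
	Gather() error
}

// gatherWithTimeout runs in.Gather in its own goroutine. If the call has not
// returned after timeout, all we can do is log the hung plugin and move on,
// because the interface offers no way to kill a running Gather.
func gatherWithTimeout(in Input, timeout time.Duration) {
	done := make(chan error, 1)
	go func() { done <- in.Gather() }()

	select {
	case err := <-done:
		if err != nil {
			log.Printf("ERROR in input [%s]: %s", in.Name(), err)
		}
	case <-time.After(timeout):
		log.Printf("FATAL ERROR: input [%s] did not complete within %s and is likely hung",
			in.Name(), timeout)
	}
}

// hungInput simulates a plugin whose Gather never returns.
type hungInput struct{}

func (hungInput) Name() string { return "hung_example" }
func (hungInput) Gather() error {
	time.Sleep(time.Hour) // pretend to hang
	return nil
}

func main() {
	gatherWithTimeout(hungInput{}, 2*time.Second)
}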

sparrc commented May 19, 2016

I've been kicking this around in the back of my head for a while, and this seems like as good a reason as any to get it implemented: #1235


RainerW commented May 19, 2016

lol, I got a stack trace after fixing a crashed NFS server/mount ... which simultaneously fixed the remaining servers that had not been restarted. So now all Telegraf instances are running again.

Sadly, the one server which had the problem but did not have that NFS mount had been restarted earlier, so I cannot provide a stack trace of a hung instance. But it seems to be, at least in part, a problem while accessing the disk statistics, even though I had excluded "nfs" (or because of that?).


RainerW commented May 20, 2016

Now one Telegraf instance that did not have that NFS mount has stopped:
https://gist.github.com/RainerW/20eb54d3eb4a28a2361d4ad27331610e

@jvalencia

I'm seeing this problem as well.

I am running Kubernetes 1.2, with Telegraf running as a DaemonSet (0.12.2-1). Last night 2 of 4 instances just stopped reporting. The logs are empty. Restarting them fixes the problem.

I kept one hung process hanging around to debug.

@jvalencia

My restarted pod died and threw the following:


2016/05/20 16:55:59 Starting Telegraf (version 0.12.1)
2016/05/20 16:55:59 Loaded outputs: influxdb
2016/05/20 16:55:59 Loaded inputs: statsd cpu docker mem
2016/05/20 16:55:59 Tags enabled: dc=us-east-1d env=prod host=ip-172-20-0-240 region=us-east-1
2016/05/20 16:55:59 Agent Config: Interval:10s, Debug:false, Quiet:false, Hostname:"ip-172-20-0-240", Flush Interval:10s 
2016/05/20 16:55:59 Started the statsd service on :8125
2016/05/20 16:55:59 Statsd listener listening on:  [::]:8125
2016/05/20 16:56:03 Gathered metrics, (10s interval), from 4 inputs in 3.069965841s
2016/05/20 16:56:10 Wrote 250 metrics to output influxdb in 88.993889ms
2016/05/20 16:56:12 Error decoding: EOF
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x583e6c]

goroutine 208 [running]:
panic(0x10241a0, 0xc8200140a0)
    /usr/local/go/src/runtime/panic.go:464 +0x3e6
github.com/influxdata/telegraf/plugins/inputs/docker.gatherContainerStats(0x0, 0x7fb116680a30, 0xc820a16040, 0xc820624000)
    /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:238 +0x352c
github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).gatherContainer(0xc82030e680, 0xc820648580, 0x40, 0xc8206485c0, 0x1, 0x4, 0xc820648600, 0x40, 0xc820648640, 0x40, ...)
    /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:228 +0x963
github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).Gather.func1(0xc8206bc810, 0xc82030e680, 0x7fb116680a30, 0xc820a16040, 0xc820648580, 0x40, 0xc8206485c0, 0x1, 0x4, 0xc820648600, ...)
    /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:112 +0x9e
created by github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).Gather
    /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:116 +0x58d


sparrc commented May 20, 2016

@jvalencia that problem is not related to this one; it was fixed in Telegraf 0.13.

@jvalencia

I updated, will see if I get hanging behaviour again.

sparrc added a commit that referenced this issue May 21, 2016
Changing the internal behavior around running plugins. Each plugin
will now have it's own goroutine with it's own ticker. This means that a
hung plugin will not block any other plugins. When a plugin is hung, we
will log an error message every interval, letting users know which
plugin is hung.

Currently the input interface does not have any methods for killing a
running Gather call, so there is nothing we can do but log an "ERROR"
and move on. This will give some visibility into the plugin that is
acting up.

closes #1230
fixes #479
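
Sketched out, the per-plugin scheduling described in that commit looks roughly like this (names and structure are illustrative, not the actual agent code):

package agent

import (
	"log"
	"sync"
	"time"
)

// Input stands in for a Telegraf input plugin.
type Input interface {
	Name() string
	Gather() error
}

// runInputs gives every plugin its own goroutine with its own ticker, so a
// hung Gather in one plugin cannot block the others. If a plugin is still
// busy when its next tick arrives, an error naming the plugin is logged
// every interval instead.
func runInputs(inputs []Input, interval time.Duration, shutdown chan struct{}) {
	var wg sync.WaitGroup
	for _, in := range inputs {
		wg.Add(1)
		go func(in Input) {
			defer wg.Done()
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			busy := make(chan struct{}, 1)
			for {
				select {
				case <-shutdown:
					return
				case <-ticker.C:
					select {
					case busy <- struct{}{}: // plugin is idle, start a new Gather
						go func() {
							if err := in.Gather(); err != nil {
								log.Printf("ERROR in input [%s]: %s", in.Name(), err)
							}
							<-busy
						}()
					default: // previous Gather has not returned yet
						log.Printf("ERROR: input [%s] did not complete within its interval (%s)",
							in.Name(), interval)
					}
				}
			}
		}(in)
	}
	wg.Wait()
}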

ar1a commented Nov 1, 2016

I'm experiencing this too, on version 1.0.1.

EDIT: Disregard; it appears to be a client networking issue.
