Service stop after logrotate #685
Certainly sounds like a bug, but I'd like to know if others are seeing it as well.
@sparrc Is it possible to switch to syslog instead of logging to a file? That way we could control rotation, logging, etc. using syslog programs.
Sure, but I won't have time to work on this for a while. I think you could also use systemd, which logs to journalctl instead of to a file.
@sparrc After doing more tests, the agent stops sending metrics even when using systemd. Sample log: the last line the agent wrote is at 15:43:01, and it only starts reporting again when I restart the daemon:
2016/02/19 15:42:21 Wrote 0 metrics to output influxdb in 1.265238955s
This is version 0.10.2; I will try updating to 0.10.3 today to see. I have about 1000 hosts reporting, and this happens on random servers every day.
@wgrcunha So you are seeing this issue even after adding the logrotate change? It looks like your instance is still logging, but something is hung in the write path?
@sparrc Yes, and using systemd too.
I'll try to reproduce; can you give your OS details? If you are able to ctrl-\ the process when this is happening, that would help a lot.
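For context, ctrl-\ sends SIGQUIT, which makes a Go process like telegraf dump its goroutine stacks before exiting; that dump is what shows where a write is hung. A rough equivalent without an attached terminal, assuming the process and systemd unit are both named telegraf (an assumption on my part):

```sh
# SIGQUIT makes the Go runtime print the goroutine stack traces to
# stderr and then exit the process (same effect as ctrl-\).
kill -QUIT "$(pidof telegraf)"

# Under systemd the dump ends up in the journal:
journalctl -u telegraf --no-pager | tail -n 200
```

Under the sysvinit script, the dump lands in the agent's log file instead of the journal.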
/etc/debian_version -> 7.8. Tried running as root, but got the same problem.
Log
lsof
Reptyr
@wgrcunha What is your telegraf version?
@sparrc 0.10.2
@sparrc My bad, this one is 0.10.1.
Got it, that makes more sense. Thanks for the great debug output, by the way.
Looks like I haven't set a timeout for the InfluxDB writes, which are hanging; that is definitely a bug. As for the cause of the write hang, it could be that the DB is overstressed with HTTP write requests. @wgrcunha, if you have 1000s of hosts, you probably want to be using the flush_jitter config option; are you doing this? Until I get the timeout fixed in a release, I think you have two options:
flush_jitter is now 0s; I will change this and see how it goes. Thanks!
That was a little inaccurate: there is actually a timeout option in the config. The current default of having no timeout was a pretty dumb default implementation on my part; I will change that to a reasonable default.
Just to be clear, I would recommend doing both (a sketch of the relevant config follows below):
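For reference, a minimal sketch of the two settings under discussion, assuming a 0.10.x-style TOML config; the URL and database name are placeholders, and exact option names may vary between releases:

```toml
[agent]
  flush_interval = "10s"
  # Add up to 5s of random jitter to each flush so a large fleet of
  # hosts does not hit InfluxDB at exactly the same moment.
  flush_jitter = "5s"

[[outputs.influxdb]]
  urls = ["http://influxdb.example.com:8086"]   # placeholder URL
  database = "telegraf"
  # Give up on a hung HTTP write instead of blocking forever; the
  # points are cached and retried on the next flush.
  timeout = "5s"
```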
Changing it now; I can give feedback in a few days. Thank you very much for the fast response :)
Referenced commit: default to 5s instead, since even if it times out we will cache the points and move on. Closes #685
@sparrc The feedback is very positive; everything is OK now using that configuration :)
Hello,
Sometimes I see the agent running but not sending metrics or even logging. When this happens the logfile is empty, and I believe it is caused by logrotate. I added a rule to logrotate (see the sketch below), and now it's working fine. Anyone with the same problem?
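A minimal sketch of the kind of logrotate rule described here, assuming the Debian package logs to /var/log/telegraf/telegraf.log (both the path and the exact directives are assumptions on my part):

```
/var/log/telegraf/telegraf.log {
    daily
    rotate 7
    missingok
    notifempty
    compress
    # copytruncate copies the log and truncates the original in place,
    # so telegraf keeps writing to the same open file instead of
    # logging into a deleted/rotated file.
    copytruncate
}
```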
**Using the Debian package**
Thanks!