Documentation for inputs.ping: DNS timeouts not counted with the 1s ping timeout #3333
Comments
Interesting, thanks for this info. Maybe we should do the DNS lookup before the call to the ping utility.
@danielnelson DNS lookup (with a timeout we control) before the ping command is an excellent idea. If we have a chance we will get a PR your way. How would you prefer we return the state of 'can't get IP'? packets_transmitted=0, packets_received=0, percent_packet_loss=0 seems misleading; I wonder if we want another error={ping-timeout,dns-timeout} output? We came across this because, of course, what you normally do is plot graphs looking for percent_packet_loss!=0, and the timeout being hit just caused no record at all on the Influx side (which was not obvious). I suspect others will also not check the Telegraf logs fleet-wide regularly!
It looks like there is an errors field (only shows up in Windows) that is the percentage of pings that returned errors (always 100.0), we could use it but I probably would want to do it differently. We did an update to the http_response input not long ago that is similar, we added a What do you think about if we put the
Somewhat related discussion on how to report the error: #2548 (comment)
I tried to use each of the different options, and here is what I found:
It is possible I just don't know of a good technique for monitoring enum type fields, but I think that the int field is probably the best solution.
@danielnelson yeah, certainly where we have had to do this, we have no better solution than an int status (either 0/1 or a more complex list, which as you point out means you really have to RTFM to know what you are plotting). How is it best to proceed: adjust #3345 to do this and give it a whirl?
We should probably use
Just putting this out there: is there any possibility that somebody else is going to find something else and we would want to make result_code an enum with more than 2 options? If it's just this very specific failure, would it be more self-explanatory to call it dns_lookup_fail={0,1}?
That is certainly more readable without checking the documentation. Other things could go wrong that prevent pinging though, for instance if I try to ping Google's IPv6 address I get this error because I don't have IPv6 set up:
Thanks, I think we have enough for somebody to put a PR together. @vlasad would you be so kind as to modify #3345 based on the above? I think the requirements are:
If I've missed anything, others please comment (here or in the eventual PR).
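As a rough illustration of the integer result_code idea discussed in this thread, the following Go sketch builds a field map for one ping attempt. The numeric values, the constant names, and the `pingFields` helper are assumptions made for this example, not the values or API of any merged change; the point is only that a failed lookup omits the packet counters rather than reporting a misleading percent_packet_loss=0.

```go
package main

import "fmt"

// Hypothetical result codes; the numeric values are assumptions for this sketch.
const (
	resultSuccess   int = 0
	resultDNSFailed int = 1
)

// pingFields builds the fields for one target.
// When resolution failed, only result_code is emitted so that a dashboard
// filtering on percent_packet_loss != 0 is not misled by zeroed counters.
func pingFields(resolved bool, transmitted, received int) map[string]interface{} {
	if !resolved {
		return map[string]interface{}{"result_code": resultDNSFailed}
	}
	fields := map[string]interface{}{
		"result_code":         resultSuccess,
		"packets_transmitted": transmitted,
		"packets_received":    received,
	}
	if transmitted > 0 {
		lost := float64(transmitted-received) / float64(transmitted) * 100.0
		fields["percent_packet_loss"] = lost
	}
	return fields
}

func main() {
	fmt.Println(pingFields(false, 0, 0))
	fmt.Println(pingFields(true, 4, 3))
}
```

An int field like this is easy to alert on (for example, any non-zero value), at the cost that readers need the documentation to know what each value means, which is the trade-off raised above.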
OK
We discovered this in production, and it might be useful to add to the docs. The "ping" command's timeout does not apply to the DNS lookup, so if your DNS is broken (meaning that the nameservers do not respond and the record isn't cached), you will find that ping (at least on RHEL 7) takes what seems to be exactly 11 seconds to return:
I don't think there is much Telegraf can do about this other than make a note of it in the documentation. If you want to capture all likely failures, you really need to set interval and timeout to 15s.