Documentation for inputs.ping: DNS timeouts not counted with the 1s ping timeout #3333

ghost · 2017-10-12T20:39:09Z

We discovered this in production, and it might be useful to add to the docs. The "ping" command's timeout does not apply to the DNS lookup, so if your DNS is broken (meaning that the nameservers do not respond and the record isn't cached), you will find that ping (at least on RHEL 7) takes what seems to be exactly 11 seconds to return:

[root@ljoit-adl1 ~]# time ping -W 1 -c 1 google2.com
ping: google2.com: Temporary failure in name resolution

real    0m11.016s
user    0m0.001s
sys     0m0.004s

I don't think there is much Telegraf can do about this other than note it in the documentation. If you want to capture all likely failures, you really need to set interval and timeout to 15s.

danielnelson · 2017-10-12T20:43:39Z

Interesting, thanks for this info. Maybe we should do the DNS lookup before the call to the ping utility.

danielnelson · 2017-10-12T20:45:04Z

@scjoiner This might explain why we still had timeouts testing #2849

ghost · 2017-10-12T20:53:42Z

@danielnelson Doing the DNS lookup (with a timeout we control) before the ping command is an excellent idea. If we have a chance we will get a PR your way.

How would you prefer we return the state of 'can't get IP'? packets_transmitted=0, packets_received=0, percent_packet_loss=0 seems misleading. I wonder if we want another output, such as error={ping-timeout,dns-timeout}?

We came across this because, of course, what you normally do is plot graphs looking for percent_packet_loss!=0, and hitting the timeout just caused no record at all on the Influx side (which was not obvious). I suspect others will also not check the Telegraf logs fleet-wide regularly!

danielnelson · 2017-10-12T21:38:51Z

It looks like there is an errors field (it only shows up on Windows) that is the percentage of pings that returned errors (always 100.0). We could use it, but I would probably want to do it differently.

We did an update to the http_response input not long ago that is similar: we added a result_type string field that is one of several result types. My only regrets are that it is perhaps wasteful to store a string field on every point, and that it is somewhat hard to display string fields in most visualization tools.

What do you think about putting the result_type into a tag instead?

phemmer · 2017-10-12T22:31:18Z

Somewhat related discussion on how to report the error: #2548 (comment)

danielnelson · 2017-10-17T21:05:25Z

I tried to use each of the different options, and here is what I found:

result_type as a string field: Hard to aggregate, can't be displayed as y-axis.
result_code as an int field: Also somewhat hard to aggregate (use mode?), but can be displayed on the y-axis. Must consult the documentation to determine the error type.
result_type as a tag: Can graph if you group by result_type, but must ensure there is another field available. It is possible we could use packets_transmitted=0 as the field if we used this, but this doesn't translate well to some other plugins.

It is possible I just don't know of a good technique for monitoring enum type fields, but I think that the int field is probably the best solution.

ghost · 2017-10-18T17:50:27Z

@danielnelson Yeah, certainly where we have had to do this, we have found no better solution than an int status (either 0/1 or a more complex list, which as you point out means you really have to RTFM to know what you are plotting).

What is the best way to proceed: adjust #3345 to do this and give it a whirl?

danielnelson · 2017-10-18T20:44:23Z

We should probably use net.LookupHost instead of trying to reuse dns_query, and ping only the first address.

result_code (int, 0: success; 1: no such host)
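As a sketch of how the two codes above could be represented in Go, the block below defines them as a small integer type with readable labels. The type and constant names (`ResultCode`, `Success`, `NoSuchHost`) are hypothetical; only the values 0 and 1 and their meanings come from this comment.

```go
package main

import "fmt"

// ResultCode is a hypothetical type for the integer result_code field.
type ResultCode int

const (
	Success    ResultCode = 0 // success
	NoSuchHost ResultCode = 1 // no such host
)

// String returns a readable label, so that log lines do not require a
// trip to the documentation to decode the number.
func (c ResultCode) String() string {
	switch c {
	case Success:
		return "success"
	case NoSuchHost:
		return "no such host"
	default:
		return fmt.Sprintf("unknown(%d)", int(c))
	}
}
```

Keeping the stored value an int preserves the ability to plot it on a y-axis, while the `String` method covers logging. New codes can be appended later without changing the field's type.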

ghost · 2017-10-19T11:09:16Z

Just putting this out there: is there any possibility that somebody else is going to find something else, and we would want to make result_code an enum with more than 2 options? If it's just this very specific failure, would it be more self-explanatory to call it dns_lookup_fail={0,1}?

danielnelson · 2017-10-19T18:46:08Z

That is certainly more readable without checking the documentation. Other things could go wrong that prevent pinging, though. For instance, if I try to ping Google's IPv6 address I get this error because I don't have IPv6 set up:

ping: unknown host 2607:f8b0:4005:80a::2004

daviesalex · 2017-10-23T20:23:04Z

Thanks, I think we have enough for somebody to put a PR together. @vlasad would you be so kind as to modify #3345 based on the above? I think the requirements are:

Do the DNS lookup in Go and catch errors; if there are any, return result_code=1 (as well as logging the details)
Do the ping and catch errors; if there are unparsable errors (like the one @danielnelson found with IPv6 above), return result_code=2 (as well as logging the details)
If everything is good, return result_code=0
Add in the documentation a table of result codes (starting with the two above).

If I've missed anything, please comment here (or on the eventual PR).

vlasad · 2017-10-24T05:36:59Z

OK

ghost changed the title ~~Documentation~~ Documentation for inputs.ping: DNS timeouts not counted with the 1s ping timeout Oct 12, 2017

ghost referenced this issue Oct 12, 2017

Use 5 second timeout overhead when waiting for ping to complete

61b0336

danielnelson added this to the 1.5.0 milestone Oct 12, 2017

vlasad mentioned this issue Oct 17, 2017

DNS lookup before ping #3345

Closed

vlasad mentioned this issue Oct 25, 2017

DNS lookup before ping #3385

Merged

danielnelson closed this as completed Oct 26, 2017

danielnelson mentioned this issue Nov 8, 2017

Add result_type field to net_response input plugin #2990

Merged

3 tasks

MicahZoltu mentioned this issue Nov 29, 2018

Ping timetouts not respected on DNS lookups. #5059

Closed
