Documentation for inputs.ping: DNS timeouts not counted with the 1s ping timeout #3333

Closed
ghost opened this issue Oct 12, 2017 · 12 comments
@ghost

ghost commented Oct 12, 2017

We discovered this in production, and it might be useful to add to the docs. The ping command's timeout does not apply to the DNS lookup, so if your DNS is broken (meaning that the nameservers do not respond and the record isn't cached), you will find that ping (at least on RHEL 7) takes what seems to be exactly 11 seconds to return:

[root@ljoit-adl1 ~]# time ping -W 1 -c 1 google2.com
ping: google2.com: Temporary failure in name resolution

real    0m11.016s
user    0m0.001s
sys     0m0.004s

I don't think there is much Telegraf can do about this other than note it in the documentation. If you want to capture all likely failures, you really need to set interval and timeout to 15s.

ghost changed the title from "Documentation" to "Documentation for inputs.ping: DNS timeouts not counted with the 1s ping timeout" on Oct 12, 2017
@danielnelson
Contributor

Interesting, thanks for this info. Maybe we should do the DNS lookup before the call to the ping utility.

danielnelson added this to the 1.5.0 milestone on Oct 12, 2017
@danielnelson
Contributor

@scjoiner This might explain why we still had timeouts testing #2849

@ghost
Author

ghost commented Oct 12, 2017

@danielnelson A DNS lookup (with a timeout we control) before the ping command is an excellent idea. If we have a chance, we will get a PR your way.

How would you prefer we return the state of 'can't get IP'? packets_transmitted=0, packets_received=0, percent_packet_loss=0 seems misleading. I wonder if we want another output, such as error={ping-timeout,dns-timeout}?

We came across this because, of course, what you normally do is plot graphs looking for percent_packet_loss!=0, and hitting the timeout just caused no record at all on the Influx side (which was not obvious). I suspect others will also not check the Telegraf logs fleet-wide regularly!

@danielnelson
Contributor

It looks like there is an errors field (it only shows up on Windows) that is the percentage of pings that returned errors (always 100.0). We could use it, but I would probably want to do it differently.

We made a similar update to the http_response input not long ago: we added a result_type string field that is one of several result types. My only regrets are that it is perhaps wasteful to store a string field on every point, and that string fields are somewhat hard to display in most visualization tools.

What do you think about putting the result_type into a tag instead?

@phemmer
Contributor

phemmer commented Oct 12, 2017

Somewhat related discussion on how to report the error: #2548 (comment)

@danielnelson
Contributor

I tried to use each of the different options, and here is what I found:

  • result_type as a string field: Hard to aggregate, can't be displayed as y-axis.
  • result_code as an int field: Also somewhat hard to aggregate (use mode?), but can display on y-axis. Must consult the documentation to determine the error type.
  • result_type as a tag: Can graph if you group by result_type, but must ensure there is another field available. It is possible we could use packets_transmitted=0 as the field if we used this, but this doesn't translate well to some other plugins.

It is possible I just don't know of a good technique for monitoring enum type fields, but I think that the int field is probably the best solution.

@ghost
Author

ghost commented Oct 18, 2017

@danielnelson Yeah, certainly where we have had to do this, we have found no better solution than an int status (either 0/1 or a more complex list, which, as you point out, means you really have to RTFM to know what you are plotting).

How best to proceed? Adjust #3345 to do this and give it a whirl?

@danielnelson
Contributor

We should probably use net.LookupHost instead of trying to reuse dns_query, and ping only the first address.

  • result_code (int, 0: success; 1: no such host)

@ghost
Author

ghost commented Oct 19, 2017

Just putting this out there: is there any possibility that somebody else is going to find something else and we would want to make result_code an enum with more than 2 options? If it's just this very specific failure, would it be more self-explanatory to call it dns_lookup_fail={0,1}?

@danielnelson
Contributor

That is certainly more readable without checking the documentation. Other things could go wrong that prevent pinging, though. For instance, if I try to ping Google's IPv6 address, I get this error because I don't have IPv6 set up:

ping: unknown host 2607:f8b0:4005:80a::2004

@daviesalex

Thanks, I think we have enough for somebody to put a PR together. @vlasad would you be so kind as to modify #3345 based on the above? I think the requirements are:

  • Do the DNS lookup in Go and catch errors; if there are any, return result_code=1 (as well as logging the details)
  • Do the ping and catch errors; if there are unparsable errors (like the one @danielnelson found with IPv6 above), return result_code=0 (as well as logging the details)
  • If everything is good, return result_code=0
  • Add in the documentation a table of result codes (starting with the two above).

If I've missed anything, others please comment (here or in the eventual PR).

@vlasad
Contributor

vlasad commented Oct 24, 2017

OK
