add retry on EAI_AGAIN to improve DNS lookup stability #34187
Does this actually help, i.e. have you seen it to reduce failure rates? My (wrong?) assumption was that the transience wasn't actually brief and such immediate retries wouldn't help. |
No, I don't have a good repro. However, when I was debugging DNS test failures before, I saw the resolver returning permanent failures even for our echo servers. A while back I was also experimenting with running a caching name server locally to avoid network failures, but I got distracted before collecting good statistical data. I'm trying to get access to the CI systems so I can collect some traces and evidence of whether it is an environmental problem or something else. |
@wfurt, can this be closed for now? Or do you have evidence this actually helps? Thanks. |
I was not able to reproduce it. I have access to new arm64 systems now and I can give it another try. However, you can also see that this bug is different from, for example, #21224. I also have a personal observation from running my tests locally that an immediate retry resolves the failure most of the time. A typical case would be "dotnet restore", which works when executed again after a failure. Another way to improve DNS stability would be adding an in-process cache. On Unix, the resolver and the functions we use do not cache, so every name resolution query leads to a new DNS transaction. With a cache we could decrease that, and for test runs we could prime it and then have conditional test runs based on the ability to resolve. If you have other suggestions for how to solve #32797, please let me know. |
This is what I'm questioning. Does it actually? My understanding of this error in this particular case is that it's describing a condition that won't be resolved soon, but the change is immediately retrying. If that's true, all we're going to end up doing is spend more cycles retrying something that won't succeed, and I would have concerns about that. Am I wrong? |
Yes, I think your description is right, @stephentoub. The man page describes EAI_AGAIN as a transient error, but there is no suggestion of whether the condition will or will not be resolved soon. I could run tests in a loop, but that may or may not simulate this situation. There may be a difference depending on whether the name server you are asking has the answer in its cache (and then we are talking about local connectivity problems) or whether it needs to do a recursive lookup from the root servers. I've been thinking about it more, and if you feel uncomfortable with the product change, I can modify the test to try again and fail while logging more details. |
That sounds good to me. Let's gather data :) |
added instrumentation in #32797. Closing this for now. |
correct PR number is #34934 |
This is an attempt to fix #32797 and improve DNS lookup stability. When a DNS lookup fails with a transient error, we currently return it as a failure. However, the upper layers (ping, sockets, HTTP) do not seem to retry as the error code suggests.
To make it more resilient, this change adds a retry. To avoid a possible endless loop, there is a retry maximum, so the call can still return the transient error.
Also note that MAX_RETRY_COUNT was chosen arbitrarily as a best guess.
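For readers who want to see the shape of the idea, here is a minimal sketch of a bounded retry around `getaddrinfo`. It is not the code from this PR: the helper name `lookup_with_retry`, the `attempts` out-parameter, and the value 3 for `MAX_RETRY_COUNT` are assumptions made for illustration. It retries immediately, which is exactly the behavior questioned in the conversation above, so treat it as a starting point for experiments rather than a recommendation.

```c
#define _POSIX_C_SOURCE 200112L

#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical limit; the PR itself describes its value as a best guess. */
#define MAX_RETRY_COUNT 3

/*
 * Calls getaddrinfo and repeats the call only while it reports EAI_AGAIN.
 * At most MAX_RETRY_COUNT retries follow the initial call, so the function
 * can still return EAI_AGAIN if the condition persists.
 * If attempts is not NULL, it receives the number of getaddrinfo calls made.
 */
static int lookup_with_retry(const char *node, const char *service,
                             const struct addrinfo *hints,
                             struct addrinfo **result, int *attempts)
{
    int rc;
    int tries = 0;

    do {
        *result = NULL;
        rc = getaddrinfo(node, service, hints, result);
        tries++;
    } while (rc == EAI_AGAIN && tries <= MAX_RETRY_COUNT);

    if (attempts != NULL) {
        *attempts = tries;
    }
    return rc;
}
```

Any other error code, including a permanent failure such as EAI_NONAME, is returned after the first call. On success the caller owns `*result` and releases it with `freeaddrinfo`. Reporting the attempt count makes it possible to log how often a retry actually changed the outcome, which is the kind of data asked for above.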