Bigtable timeouts lead to permanent disconnection (Continued) #29692
Comments
Any clue how this is possible? As this might be the root cause, it would make sense to try and fix that instead of the fallback mechanism.
I think the closed MR should be merged nevertheless, as there's no other solution yet and this is high priority for heavily used RPC nodes.
At Blockdaemon, we've encountered a recurring issue that necessitates the creation of a customer-specific version. This custom build combines the version detailed here with pull 29685.
The issue at hand pertains to token acquisition for BigTable. Occasionally, the process for requesting a new token commences as expected but then stalls, leading to a total failure in establishing a connection with BigTable. The process doesn't repeat after this failure. A normal request looks like:
Comparatively, a typical token request process proceeds smoothly and is reflected in the code found at this GitHub link:
What appears to be obstructing the goauth::get_token call is unknown, as it fails to return any result, be it successful or erroneous. Unfortunately, we are unable to insert custom logging into this underlying call, given it's a third-party library. While considering a blocking variant might be an option, it poses a certain risk.
With this adjustment, we can gain more insight into what happens during the token request process. We now have a bit more data:
If we look at the data when it's hung, with the extra logging, it looks like:
If we examine the data during the period it hangs with the additional logging, it mirrors the typical request process. However, a successful request proceeds further with additional steps and outputs.
And after a few hours:
After a few hours, the logging illustrates how the token request process commences as usual but eventually fails to recover.
Potential underlying issues that might contribute to this problem are being investigated and are related to specific issues on GitHub:
Fixes: Hopefully we can get this resolved so no one has to deal with this issue anymore.
I created #26217 long ago and have started patching my nodes again with my fix, which absolutely does work, and it works because of the Google auth issues listed above. It may be 'inelegant', but it works and has not caused any issues with sync, despite all words to the contrary. YMMV; sometimes a nail is just a nail and needs a hammer.
We can close this issue; #34213 was merged.
Problem
The Bigtable authentication token occasionally fails to refresh, causing nodes to disconnect from Bigtable and forcing a restart of the node in order to reconnect.
This is a continuation of an issue previously closed.
This PR mostly prevented the issue, but some of our nodes that receive significant RPC load still occasionally disconnect from Bigtable.
I have found that this disconnect is due to the existing timeout failing to actually resolve, so the refresh_active variable stays true. Thus, any future attempts to refresh the token are stopped because the code believes a refresh is still active.
Proposed Solution
A possible solution, which I have implemented and tested in this PR, is to add a second check after 2x the timeout period that sets refresh_active back to false. This allows future attempts to refresh the token to go through, and signifies that the timeout failed to stop the function get_token.
If the 1st (thought to be failing) timeout ends up succeeding after the 2nd get_token call starts, you will end up getting a 2nd, newer token once the 2nd call returns.
After applying this patch to 4 troubled nodes, it has prevented 20+ disconnects. We applied this patch onto 1.13.5 mainnet for this test.
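To make the idea concrete, here is a minimal sketch of the secondary check, not the code from the PR. Only refresh_active and the 2x timeout rule come from the description above; the type RefreshGuard, the methods try_begin and finish, and the started_at bookkeeping are hypothetical names chosen for illustration. The sketch also takes the current time as an explicit argument instead of using an async timer, so the behavior is easy to follow.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the token holder. Only `refresh_active`
/// is named in the issue; everything else is assumed for illustration.
pub struct RefreshGuard {
    refresh_active: AtomicBool,
    /// Start time of the refresh currently flagged as active, if any.
    started_at: Mutex<Option<Instant>>,
    /// The primary timeout applied to the token request.
    timeout: Duration,
}

impl RefreshGuard {
    pub fn new(timeout: Duration) -> Self {
        Self {
            refresh_active: AtomicBool::new(false),
            started_at: Mutex::new(None),
            timeout,
        }
    }

    /// Returns true if the caller may start a token refresh at `now`.
    pub fn try_begin(&self, now: Instant) -> bool {
        let mut started = self.started_at.lock().unwrap();

        // Secondary check: if a refresh has been flagged active for at
        // least 2x the timeout, assume the primary timeout never fired
        // and clear the flag so new attempts are not blocked forever.
        if let Some(t0) = *started {
            if now.saturating_duration_since(t0) >= self.timeout * 2 {
                self.refresh_active.store(false, Ordering::SeqCst);
            }
        }

        // Allow only one refresh at a time.
        if self
            .refresh_active
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .is_err()
        {
            return false; // a refresh is still considered pending
        }
        *started = Some(now);
        true
    }

    /// Called when a refresh completes, successfully or not.
    pub fn finish(&self) {
        *self.started_at.lock().unwrap() = None;
        self.refresh_active.store(false, Ordering::SeqCst);
    }
}
```

With a 5 second timeout, a refresh that never calls finish blocks new attempts for 10 seconds, after which try_begin clears the flag and lets the next attempt through. As noted above, a late return from the first call is harmless in this scheme: the second call simply yields a newer token when it returns.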