Fixes issue with node losing connection to BigTable and never regaining it #26217
Conversation
@CriesofCarrots Can you please check this when you get a chance? It solves a significant issue for RPC providers.
can you elaborate on the root cause and how this change fixes it?
seems like at best it allows multiple outstanding refresh requests, which could stomp on each other
storage-bigtable/src/access_token.rs (Outdated)
        return;
    }
    warn!("Token ready to be refreshed");
nit: warn is far too chatty for these new log messages. if they need to stay, please demote to debug
can you elaborate on the root cause and how this change fixes it?
seems like at best it allows multiple outstanding refresh requests, which could stomp on each other
The root cause appears to be that the request to get a new token sometimes neither succeeds nor fails: it continues effectively forever (or is garbage collected) and never reverts the refresh_active bool flag to false. This may indeed be a leak (requests continuing to build up with no end), but I'm not a Rust dev so I can't tell. Certainly if the GoAuth call were failing, it should be caught, and there are no error messages in the logs.
The current code, in the second test (after checking the token's elapsed time), falls through to renew the token and at the same time switches refresh_active to true (.compare_and_swap(false, true, Ordering::Relaxed)). Because the call never completes, the flag never gets set back to false later in the call as it should.
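The gate being described can be sketched like this (a minimal illustration with hypothetical names, not the actual access_token.rs code; `compare_exchange` is the non-deprecated equivalent of `compare_and_swap`):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical stand-in for the refresh_active flag discussed above.
static REFRESH_ACTIVE: AtomicBool = AtomicBool::new(false);

// Flip the flag from false to true; returns true only for the caller
// that wins the race and should perform the refresh.
fn try_begin_refresh() -> bool {
    REFRESH_ACTIVE
        .compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed)
        .is_ok()
}

fn end_refresh() {
    // If the refresh request hangs forever, this line is never reached,
    // and every later caller sees refresh_active == true -- the stuck
    // state described above.
    REFRESH_ACTIVE.store(false, Ordering::Relaxed);
}

fn main() {
    assert!(try_begin_refresh());  // first caller wins the flag
    assert!(!try_begin_refresh()); // concurrent caller is locked out
    end_refresh();
    assert!(try_begin_refresh()); // available again after reset
}
```

The key property is that the flag is only ever cleared by the code path that set it, so a request that never completes leaves it set forever.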
I first commented out both tests and determined that even requesting a new token on every single call on a live production RPC returned very quickly. Obviously that would not be ideal, but it told me that a successful auth request is very quick (less than 2 seconds, and I am being conservative here).
My fix changes the behaviour and I think is quite sane. It doesn't affect any successful calls, because if the token is changed by a parallel call, the token is updated anyway and refresh_active is set back to false anyway (second link). All my change does is say: if refresh_active is true, wait 2 seconds (by which time the in-flight request should have completed), set the value to false for the next call, and return without attempting to renew the token. This means that if the renew call has failed, it will not block the next renew call.
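As a rough illustration of that workaround (hypothetical names, and a much shorter sleep than the 2 seconds the description mentions):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

static REFRESH_ACTIVE: AtomicBool = AtomicBool::new(false);

// Sketch of the described behaviour: if another refresh appears to be
// in flight, wait long enough for a healthy request to finish, then
// force the flag back to false so a wedged request cannot block
// refreshes forever. Returns true if this call performed the refresh.
fn maybe_refresh() -> bool {
    if REFRESH_ACTIVE.swap(true, Ordering::SeqCst) {
        // A refresh was already marked active: give it time to finish,
        // then clear the flag so the *next* caller can refresh.
        thread::sleep(Duration::from_millis(20)); // 2s in the real patch
        REFRESH_ACTIVE.store(false, Ordering::SeqCst);
        return false;
    }
    // ... perform the token request here ...
    REFRESH_ACTIVE.store(false, Ordering::SeqCst);
    true
}

fn main() {
    assert!(maybe_refresh()); // no refresh in flight: refreshes normally
    REFRESH_ACTIVE.store(true, Ordering::SeqCst); // simulate a wedged refresh
    assert!(!maybe_refresh()); // this call only clears the stale flag
    assert!(maybe_refresh()); // the next call can refresh again
}
```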
As I wrote in the comments, I am sure there are better approaches, but this issue has been there since September at least and nobody had fixed it; my method does fix it. One better approach would be to make the renew request time out by itself: if there is no response, success or failure, within 2 seconds, it should gracefully exit and set refresh_active to false. But that is a lot more code, and as I said, I am not a Rust dev.
I would also like to note that this is a very serious issue with a very serious result: it requires restarting the node to get a new token. I discussed this with the triton.one guys, and they had set up monitoring so that as soon as a node cannot get info from BigTable, they restart it. Obviously this could lead to a cascading problem if BigTable were actually down, as all the nodes would be continuously restarting themselves. I think a quick, if not ideal, fix is therefore warranted.
Feel free to make a different fix for it; I have applied this fix to 25 active production RPC nodes for Ankr and it is working just fine.
The root cause appears to be that the request to get a new token sometimes does not succeed or fail and continues effectively forever (or is garbage collected) and does not revert the refresh_active bool flag to false as a result
This is not a correct statement. Even if the request to get a new token fails, the flag will still be set to false, because we do not unwrap/expect the result; we handle Ok/Err with a match. I also do not think the process continues effectively forever, and as you tested, a new token was returned very quickly each time.
The code looks completely OK. The only thing possible is multiple concurrent updates, but that should be fine and can be addressed with SeqCst memory ordering (Ordering::SeqCst).
--
I do not understand why getting a new token gets stuck 😐
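A minimal sketch of that point (illustrative names, not the actual solana code): because the result is handled with `match` rather than `unwrap`, the flag-reset runs on both the success and failure paths, so a request that merely *fails* cannot leave refresh_active stuck at true.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static REFRESH_ACTIVE: AtomicBool = AtomicBool::new(false);

// Finish a refresh attempt: match on the result instead of unwrapping,
// then clear the flag regardless of which branch was taken.
fn finish_refresh(result: Result<String, String>) -> Option<String> {
    let token = match result {
        Ok(token) => Some(token),
        Err(err) => {
            eprintln!("token refresh failed: {err}");
            None
        }
    };
    // Reached on both the Ok and the Err branch.
    REFRESH_ACTIVE.store(false, Ordering::SeqCst);
    token
}

fn main() {
    REFRESH_ACTIVE.store(true, Ordering::SeqCst);
    assert_eq!(finish_refresh(Err("timeout".into())), None);
    assert!(!REFRESH_ACTIVE.load(Ordering::SeqCst)); // cleared despite Err

    REFRESH_ACTIVE.store(true, Ordering::SeqCst);
    assert_eq!(finish_refresh(Ok("tok".into())), Some("tok".into()));
    assert!(!REFRESH_ACTIVE.load(Ordering::SeqCst));
}
```

What this pattern cannot protect against is a request that never returns at all: then neither branch runs and the flag stays set, which is the scenario the two sides of this thread disagree about.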
You are declaring this without testing, though, right? Because in checking the logging, the code is not getting past the second test. That indicates that the value is continuously true.
Also, in testing, the token does NOT get returned quickly every time; it can fail continuously for up to 34 seconds in a row. This is only visible if you add my logging code, so you can see it failing.
@PeaStew can you try this patch fanatid@171c9e3? It's on top of 1.10.28, but if you need help porting it to another version I can do that.
Hey @PeaStew, did you have a chance to test my patch?
@fanatid, thanks for the patch, looks pretty reasonable. I'm thinking of asking one of the foundation RPC nodes to try it out, since we haven't heard anything from PeaStew.
I've updated the code a little, so there is no unwrap now: fanatid@32afd20
Is this patch still good to try? We experience this same issue multiple times per day, so we could probably report back.
Co-authored-by: Trent Nelson <[email protected]>
error is: within `impl futures::Future<Output = [async output]>`, the trait `std::marker::Send` is not implemented for `std::sync::RwLockReadGuard<'_, (Token, std::time::Instant)>` note: future is not `Send` as this value is used across an await
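That compile error typically means a std::sync::RwLockReadGuard (which is not Send) is being held across an .await point, so the whole future stops being Send. The usual fix is to copy the needed data out and drop the guard before the await. A minimal sketch of the same pattern, shown with a thread (whose closure must be Send, just as a spawned future must be) instead of async code; all names here are illustrative:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Instant;

fn main() {
    // Stand-in for the (Token, Instant) pair from the error message.
    let token = Arc::new(RwLock::new((String::from("tok-1"), Instant::now())));

    let t = Arc::clone(&token);
    // thread::spawn requires everything the closure holds across the
    // "slow" work to be Send, analogous to tokio::spawn requiring a
    // Send future. Scoping the guard keeps it out of that state.
    let handle = thread::spawn(move || {
        let snapshot = {
            let guard = t.read().unwrap();
            guard.0.clone()
        }; // guard dropped here, before any await/blocking work
        snapshot
    });
    assert_eq!(handle.join().unwrap(), "tok-1");
}
```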
So I don't know if this is being looked at. I can say I have been patching every version up to 1.10.29 with this code, and the nodes are stable: no failures, and no dropping off blockheight because of pauses in the execution thread. The patch works, even if it may not be up to the elegance required.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This stale pull request has been automatically closed. Thank you for your contributions.
Your fixes still don't work, mine does, enjoy
Hi @PeaStew, can you please post your actual patch? Changes with
Fixes #20336 (comment)
Problem
BigTable connection lost and never regained
Summary of Changes
Set refresh_active to false in subsequent calls after waiting 2 seconds
Fixes #20336