Fixes issue with node losing connection to BigTable and never regaining it #26217
Conversation
@CriesofCarrots Can you please check this when you get a chance? It solves a significant issue for RPC providers.
can you elaborate on the root cause and how this change fixes it?
seems like at best it allows multiple outstanding refresh requests, which could stomp on each other
storage-bigtable/src/access_token.rs (Outdated)
        return;
    }
    warn!("Token ready to be refreshed");
nit: warn is far too chatty for these new log messages. if they need to stay, please demote to debug
can you elaborate on the root cause and how this change fixes it?
seems like at best it allows multiple outstanding refresh requests, which could stomp on each other
The root cause appears to be that the request to get a new token sometimes neither succeeds nor fails: it continues effectively forever (or is garbage collected) and never reverts the refresh_active bool flag to false. This may indeed be a leak (requests continuing to build up with no end), but I'm not a Rust dev so I can't tell. Certainly if the GoAuth call were failing, it should be caught, and there are no error messages in the logs.
The current code, in the second test (after checking the token's elapsed time), falls through to renew the token and at the same time switches refresh_active to true (.compare_and_swap(false, true, Ordering::Relaxed)). Because the call never completes, the flag never gets set back to false later in the call as it should.
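The gate being described can be sketched like this (a minimal illustration with hypothetical names, not the actual access_token.rs code; `compare_exchange` is the non-deprecated equivalent of `compare_and_swap`):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical stand-in for the refresh_active flag discussed above.
static REFRESH_ACTIVE: AtomicBool = AtomicBool::new(false);

// Flip the flag from false to true; returns true only for the caller
// that wins the race and should perform the refresh.
fn try_begin_refresh() -> bool {
    REFRESH_ACTIVE
        .compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed)
        .is_ok()
}

fn end_refresh() {
    // If the refresh request hangs forever, this line is never reached,
    // and every later caller sees refresh_active == true -- the stuck
    // state described above.
    REFRESH_ACTIVE.store(false, Ordering::Relaxed);
}

fn main() {
    assert!(try_begin_refresh());  // first caller wins the flag
    assert!(!try_begin_refresh()); // concurrent caller is locked out
    end_refresh();
    assert!(try_begin_refresh()); // available again after reset
}
```

The key property is that the flag is only ever cleared by the code path that set it, so a request that never completes leaves it set forever.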
I first commented out both tests and determined that even requesting a new token on every single call on a live production RPC returned very quickly. Obviously that would not be ideal, but it told me that a successful auth request is very quick (less than 2 seconds, and I am being conservative here).
My fix changes the behaviour and I think is quite sane. It doesn't affect any successful calls, because if the token is changed by a parallel call, the token is updated anyway and refresh_active is set back to false anyway (second link). All my change does is say: if refresh_active is true, wait 2 seconds (by which time the in-flight request should have completed), set the value to false for the next call, and return without attempting to renew the token. This means that if the renew call has failed, it will not block the next renew call.
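As a rough illustration of that workaround (hypothetical names, and a much shorter sleep than the 2 seconds the description mentions):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

static REFRESH_ACTIVE: AtomicBool = AtomicBool::new(false);

// Sketch of the described behaviour: if another refresh appears to be
// in flight, wait long enough for a healthy request to finish, then
// force the flag back to false so a wedged request cannot block
// refreshes forever. Returns true if this call performed the refresh.
fn maybe_refresh() -> bool {
    if REFRESH_ACTIVE.swap(true, Ordering::SeqCst) {
        // A refresh was already marked active: give it time to finish,
        // then clear the flag so the *next* caller can refresh.
        thread::sleep(Duration::from_millis(20)); // 2s in the real patch
        REFRESH_ACTIVE.store(false, Ordering::SeqCst);
        return false;
    }
    // ... perform the token request here ...
    REFRESH_ACTIVE.store(false, Ordering::SeqCst);
    true
}

fn main() {
    assert!(maybe_refresh()); // no refresh in flight: refreshes normally
    REFRESH_ACTIVE.store(true, Ordering::SeqCst); // simulate a wedged refresh
    assert!(!maybe_refresh()); // this call only clears the stale flag
    assert!(maybe_refresh()); // the next call can refresh again
}
```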
As I wrote in the comments, I am sure there are better approaches, but this issue has been there since September at least and nobody had fixed it; my method does fix it. One better approach would be to make the renew request time out by itself: if there is no response, success or failure, within 2 seconds, it should gracefully exit and set refresh_active to false. But that is a lot more code, and as I said, I am not a Rust dev.
I would also like to note that this is a very serious issue with a very serious result: it requires restarting the node to get a new token. I discussed this with the triton.one guys, and they had set up monitoring so that as soon as a node cannot get info from BigTable, they restart it. Obviously this could lead to a cascading problem if BigTable were actually down, as all the nodes would be continuously restarting themselves. I think a quick, if not ideal, fix is therefore warranted.
Feel free to make a different fix for it; I have applied this fix to 25 active production RPC nodes for Ankr and it is working just fine.
The root cause appears to be that the request to get a new token sometimes does not succeed or fail and continues effectively forever (or is garbage collected) and does not revert the refresh_active bool flag to false as a result
This is not a correct statement. Even if the request to get a new token fails, the flag will still be set to false, because we do not unwrap/expect the result; we handle Ok/Err with a match. I also do not think the process continues effectively forever, and as you tested, a new token was returned very quickly each time.
The code looks completely OK. The only thing possible is multiple concurrent updates, but that should be fine and can be addressed with SeqCst memory ordering (Ordering::SeqCst).
--
I do not understand why getting a new token gets stuck 😐
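A minimal sketch of that point (illustrative names, not the actual solana code): because the result is handled with `match` rather than `unwrap`, the flag-reset runs on both the success and failure paths, so a request that merely *fails* cannot leave refresh_active stuck at true.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static REFRESH_ACTIVE: AtomicBool = AtomicBool::new(false);

// Finish a refresh attempt: match on the result instead of unwrapping,
// then clear the flag regardless of which branch was taken.
fn finish_refresh(result: Result<String, String>) -> Option<String> {
    let token = match result {
        Ok(token) => Some(token),
        Err(err) => {
            eprintln!("token refresh failed: {err}");
            None
        }
    };
    // Reached on both the Ok and the Err branch.
    REFRESH_ACTIVE.store(false, Ordering::SeqCst);
    token
}

fn main() {
    REFRESH_ACTIVE.store(true, Ordering::SeqCst);
    assert_eq!(finish_refresh(Err("timeout".into())), None);
    assert!(!REFRESH_ACTIVE.load(Ordering::SeqCst)); // cleared despite Err

    REFRESH_ACTIVE.store(true, Ordering::SeqCst);
    assert_eq!(finish_refresh(Ok("tok".into())), Some("tok".into()));
    assert!(!REFRESH_ACTIVE.load(Ordering::SeqCst));
}
```

What this pattern cannot protect against is a request that never returns at all: then neither branch runs and the flag stays set, which is the scenario the two sides of this thread disagree about.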
You are declaring this without testing, though, right? Because in checking the logging, the code is not getting past the second test. That indicates that the value is continuously true.
Also, in testing, the token does NOT get returned quickly every time; it can fail continuously for up to 34 seconds in a row. This is only visible if you add my logging code, so you can see it failing.
@PeaStew can you try this patch fanatid@171c9e3? It's on top of 1.10.28, but if you need help porting it to another version I can do that.
Hey @PeaStew, did you have a chance to test my patch?
@fanatid, thanks for the patch, looks pretty reasonable. I'm thinking of asking one of the foundation RPC nodes to try it out, since we haven't heard anything from PeaStew.
I've updated the code a little, so there is no unwrap now: fanatid@32afd20
Is this patch still good to try? We experience this same issue multiple times per day, so we could probably report back.
Co-authored-by: Trent Nelson <[email protected]>
error is: within `impl futures::Future<Output = [async output]>`, the trait `std::marker::Send` is not implemented for `std::sync::RwLockReadGuard<'_, (Token, std::time::Instant)>` note: future is not `Send` as this value is used across an await
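That compile error typically means a std::sync::RwLockReadGuard (which is not Send) is being held across an .await point, so the whole future stops being Send. The usual fix is to copy the needed data out and drop the guard before the await. A minimal sketch of the same pattern, shown with a thread (whose closure must be Send, just as a spawned future must be) instead of async code; all names here are illustrative:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Instant;

fn main() {
    // Stand-in for the (Token, Instant) pair from the error message.
    let token = Arc::new(RwLock::new((String::from("tok-1"), Instant::now())));

    let t = Arc::clone(&token);
    // thread::spawn requires everything the closure holds across the
    // "slow" work to be Send, analogous to tokio::spawn requiring a
    // Send future. Scoping the guard keeps it out of that state.
    let handle = thread::spawn(move || {
        let snapshot = {
            let guard = t.read().unwrap();
            guard.0.clone()
        }; // guard dropped here, before any await/blocking work
        snapshot
    });
    assert_eq!(handle.join().unwrap(), "tok-1");
}
```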
So I don't know if this is being looked at. I can say I have been patching every version up to 1.10.29 with this code, and the nodes are stable: no failures, and no dropping off blockheight because of pauses in the execution thread. The patch works, even if it may not be up to the elegance required.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This stale pull request has been automatically closed. Thank you for your contributions.
Your fixes still don't work, mine does, enjoy
Hi @PeaStew, can you please post your actual patch? Changes with
Fixes #20336 (comment)
Problem
BigTable connection lost and never regained
Summary of Changes
Set refresh_active to false in subsequent calls after waiting 2 seconds
Fixes #20336