
Bigtable timeouts lead to permanent disconnection (Continued) #29692

Closed
brandon-j-roberts opened this issue Jan 13, 2023 · 6 comments
Labels: community (Community contribution)

Comments

@brandon-j-roberts (Contributor)

Problem

The Bigtable authentication token occasionally fails to refresh, which disconnects nodes from Bigtable and forces a node restart in order to reconnect.

This is a continuation of a previously closed issue.

This PR mostly prevented the issue, but some of our nodes under significant RPC load still occasionally disconnect from Bigtable.

I have found that the disconnect occurs because the existing timeout fails to actually resolve, so the refresh_active variable stays true. Any future attempt to refresh the token is then skipped, because the code believes a refresh is still in progress.
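For illustration, a minimal sketch of the guard pattern described above, assuming a tokio runtime; the names (REFRESH_ACTIVE, get_token) and the 30-second timeout are illustrative stand-ins, not the actual solana_storage_bigtable code:

use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

static REFRESH_ACTIVE: AtomicBool = AtomicBool::new(false);

// Stand-in for the real token fetch. In the failure mode described above,
// this future (or the timeout wrapping it) simply never resolves.
async fn get_token() -> Result<String, ()> {
    Ok("ya29.c...".to_string())
}

async fn refresh_token() {
    // Only one refresh may run at a time. If a previous refresh never
    // cleared the flag, every later attempt returns early here, forever.
    if REFRESH_ACTIVE.swap(true, Ordering::SeqCst) {
        return;
    }
    // If this await never completes, the store below never executes and
    // refresh_active is stuck at true.
    let _result = tokio::time::timeout(Duration::from_secs(30), get_token()).await;
    REFRESH_ACTIVE.store(false, Ordering::SeqCst);
}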

Proposed Solution

A possible solution, which I have implemented and tested in this PR, is to add a check that runs after 2x the timeout period and sets refresh_active back to false. This allows future attempts to refresh the token to go through, and signals that the timeout failed to stop the get_token function.

If the 1st (thought to be failing) timeout ends up succeeding after the 2nd get_token call has started, you simply end up with a second, newer token once that call returns.
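Continuing the illustrative sketch above (the shape is an assumption, not the PR's exact code), the backup check might look like:

const TOKEN_TIMEOUT: Duration = Duration::from_secs(30);

async fn refresh_token_with_backup_check() {
    if REFRESH_ACTIVE.swap(true, Ordering::SeqCst) {
        return;
    }
    // Backup check: after 2x the timeout period, force the flag back to
    // false so future refresh attempts can go through even if the first
    // timeout never resolves. If the first call does eventually succeed
    // after a second refresh has started, the second, newer token simply
    // overwrites the first.
    tokio::spawn(async {
        tokio::time::sleep(TOKEN_TIMEOUT * 2).await;
        if REFRESH_ACTIVE.swap(false, Ordering::SeqCst) {
            eprintln!("backup check fired: get_token never returned");
        }
    });
    if let Ok(Ok(_token)) = tokio::time::timeout(TOKEN_TIMEOUT, get_token()).await {
        // store the new token here
    }
    REFRESH_ACTIVE.store(false, Ordering::SeqCst);
}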

Applying this patch to 4 troubled nodes has prevented 20+ disconnects. For this test, we applied the patch on top of 1.13.5 mainnet.

@RdeWilde commented Mar 28, 2023

[..] due to the existing timeout failing to actually resolve [..]

Any clue how this is possible? As this might be the root cause, it would make sense to try to fix that instead of adding a fallback mechanism.

@RdeWilde

I think the closed MR should be merged nevertheless, as there is no other solution yet and this is high priority for heavily used RPC nodes.

@maestroi commented Jul 3, 2023

At Blockdaemon, we've encountered a recurring issue that has forced us to create a customer-specific version. This custom build combines the change detailed here with #29685.

[2023-06-17T14:05:46.087754187Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-17T14:05:46.087786118Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-17T14:05:46.090215850Z WARN  solana_storage_bigtable::access_token] Requesting token from endpoint

The issue concerns token acquisition for BigTable. Occasionally, the process of requesting a new token starts as expected but then stalls, leading to a total failure to establish a connection with BigTable. The process does not retry after this failure.

A normal request looks like:

[2023-06-17T13:35:46.840716176Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-17T13:35:46.849892464Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-17T13:35:46.858238558Z WARN  solana_storage_bigtable::access_token] Requesting token from endpoint
[2023-06-17T13:35:46.987365809Z WARN  solana_storage_bigtable::access_token] Token expires in 3599 seconds
[2023-06-17T13:35:46.987402028Z WARN  solana_storage_bigtable::access_token] Token write lock acquired:

For comparison, a typical token request proceeds smoothly. The call in question, from the linked code on GitHub, is:

let token = goauth::get_token(&jwt, credentials)

What is obstructing the goauth::get_token call is unknown; it fails to return any result, successful or erroneous. Unfortunately, we cannot insert custom logging into this underlying call, since it is a third-party library. A blocking variant might be an option, but it poses a certain risk.
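Since we cannot instrument the library internally, one option is to bound the call from the outside so a hang surfaces as an error instead of blocking forever. A sketch, reusing the call from the snippet above and an assumed 10-second deadline:

// Wrap the third-party call in an explicit timeout; on expiry the inner
// future is dropped and we get an Err(Elapsed) instead of waiting forever.
match tokio::time::timeout(
    std::time::Duration::from_secs(10),
    goauth::get_token(&jwt, credentials),
)
.await
{
    Ok(Ok(_token)) => { /* use the token */ }
    Ok(Err(err)) => eprintln!("get_token failed: {err}"),
    Err(_elapsed) => eprintln!("get_token hung past the deadline"),
}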

RUST_LOG="solana=warn,reqwest=debug,hyper=debug,solana_storage_bigtable::access_token=debug,solana::solana_metrics=error,solana_runtime::message_processor=error,solana_rpc::rpc_health=error,solana_core::warm_quic_cache_service=error"

With this adjustment, we can gain more insight into what happens during the token request process.

With the extra logging enabled, we have a bit more data; a hung request looks like:

[2023-06-18T15:13:56.381735288Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-18T15:13:56.381789903Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-18T15:13:56.384592197Z WARN  solana_storage_bigtable::access_token] Building client
[2023-06-18T15:13:56.391540226Z WARN  solana_storage_bigtable::access_token] Requesting token
[2023-06-18T15:13:56.393855912Z TRACE hyper::client::pool] checkout waiting for idle connection: ("https", oauth2.googleapis.com)
[2023-06-18T15:13:56.393898884Z DEBUG reqwest::connect] starting new connection: https://oauth2.googleapis.com/
[2023-06-18T15:13:56.393949441Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("oauth2.googleapis.com"), port=None
[2023-06-18T15:13:56.394127139Z TRACE hyper::client::pool] checkout dropped for ("https", oauth2.googleapis.com)

Up to this point, the hung request mirrors the typical process. A successful request, however, proceeds further with additional steps and output:

[2023-06-18T14:43:56.806509500Z WARN  solana_storage_bigtable::access_token] Refreshing token: Bearer ya29.c.b<data>
[2023-06-18T14:43:56.806881550Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-18T14:43:56.807559283Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-18T14:43:56.808976967Z WARN  solana_storage_bigtable::access_token] Building client
[2023-06-18T14:43:56.817984006Z WARN  solana_storage_bigtable::access_token] Requesting token
[2023-06-18T14:43:56.819288734Z TRACE hyper::client::pool] checkout waiting for idle connection: ("https", oauth2.googleapis.com)
[2023-06-18T14:43:56.822628384Z DEBUG reqwest::connect] starting new connection: https://oauth2.googleapis.com/
[2023-06-18T14:43:56.822663931Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("oauth2.googleapis.com"), port=None
[2023-06-18T14:43:56.822792877Z DEBUG hyper::client::connect::dns] resolving host="oauth2.googleapis.com"
[2023-06-18T14:43:56.823688405Z DEBUG hyper::client::connect::http] connecting to [2607:f8b0:4023:140d::5f]:443
[2023-06-18T14:43:56.849190175Z DEBUG hyper::client::connect::http] connected to [2607:f8b0:4023:140d::5f]:443
[2023-06-18T14:43:56.875990070Z TRACE hyper::client::conn] client handshake Http1
[2023-06-18T14:43:56.876001061Z TRACE hyper::client::client] handshake complete, spawning background dispatcher task
[2023-06-18T14:43:56.876026580Z TRACE hyper::client::pool] checkout dropped for ("https", oauth2.googleapis.com)
[2023-06-18T14:43:56.876051457Z TRACE reqwest::connect::verbose] 4b3a2478 write: b"POST /token HTTP/1.1\r\<data>
[2023-06-18T14:43:56.937929195Z TRACE reqwest::connect::verbose] 4b3a2478 read: b"HTTP/1.1 200 OK\r\nContent-Type: application/json; <data>
[2023-06-18T14:43:56.938142061Z WARN  solana_storage_bigtable::access_token] Token expires in 3599 seconds
[2023-06-18T14:43:56.938169413Z WARN  solana_storage_bigtable::access_token] Token write lock acquired
[2023-06-18T14:43:56.938171928Z WARN  solana_storage_bigtable::access_token] Token written: ya29.c<data>
[2023-06-18T14:43:56.938218267Z TRACE reqwest::connect::verbose] 4b3a2478 read: b"0\r\n\r\n"
[2023-06-18T14:43:56.938242413Z TRACE hyper::client::pool] pool dropped, dropping pooled (("https", oauth2.googleapis.com))
[2023-06-18T14:43:58.893502742Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("bigtable.googleapis.com"), port=None
[2023-06-18T14:43:58.893544892Z DEBUG hyper::client::connect::dns] resolving host="bigtable.googleapis.com"
[2023-06-18T14:43:58.894669177Z DEBUG hyper::client::connect::http] connecting to [2607:f8b0:4023:1401::5f]:443
[2023-06-18T14:43:58.920182358Z DEBUG hyper::client::connect::http] connected to [2607:f8b0:4023:1401::5f]:443
[2023-06-18T14:43:58.946487479Z TRACE hyper::client::conn] client handshake Http2

And after a few hours:

[2023-06-18T17:46:56.240413238Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-18T17:46:56.247280847Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-18T17:46:56.251116497Z WARN  solana_storage_bigtable::access_token] Building client
[2023-06-18T17:46:56.257533657Z WARN  solana_storage_bigtable::access_token] Requesting token
[2023-06-18T17:46:56.261613813Z TRACE hyper::client::pool] checkout waiting for idle connection: ("https", oauth2.googleapis.com)
[2023-06-18T17:46:56.263867168Z DEBUG reqwest::connect] starting new connection: https://oauth2.googleapis.com/
[2023-06-18T17:46:56.263906563Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("oauth2.googleapis.com"), port=None
[2023-06-18T17:46:56.264025800Z DEBUG hyper::client::connect::dns] resolving host="oauth2.googleapis.com"
[2023-06-18T17:46:56.264047952Z TRACE hyper::client::pool] checkout dropped for ("https", oauth2.googleapis.com)
[2023-06-18T17:47:04.589392869Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("bigtable.googleapis.com"), port=None
[2023-06-18T17:47:04.597589260Z DEBUG hyper::client::connect::dns] resolving host="bigtable.googleapis.com"
[2023-06-18T17:47:04.598939443Z DEBUG hyper::client::connect::http] connecting to [2607:f8b0:4023:1407::5f]:443
[2023-06-18T17:47:04.624507893Z DEBUG hyper::client::connect::http] connected to [2607:f8b0:4023:1407::5f]:443
[2023-06-18T17:47:04.650880576Z TRACE hyper::client::conn] client handshake Http2
[2023-06-18T17:47:06.168594464Z WARN  solana_core::replay_stage] PARTITION DETECTED waiting to join heaviest fork: 194633601 last vote: 194633592, reset slot: 194633593
[2023-06-18T17:47:06.702022422Z WARN  solana_core::replay_stage] PARTITION resolved heaviest fork: 194633594 last vote: 194633594, reset slot: 194633594
[2023-06-18T17:47:08.952291637Z TRACE hyper::client::dispatch] send_when canceled

After a few hours, the logging shows the token request process commencing as usual but then never recovering:

[2023-06-19T16:41:41.021936319Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-19T16:41:41.029164180Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-19T16:41:41.036498164Z WARN  solana_storage_bigtable::access_token] Building client
[2023-06-19T16:41:41.044716393Z WARN  solana_storage_bigtable::access_token] Requesting token
[2023-06-19T16:41:41.049209821Z DEBUG reqwest::connect] starting new connection: https://oauth2.googleapis.com/
[2023-06-19T16:41:41.049253705Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("oauth2.googleapis.com"), port=None
<dead>

Potential underlying issues that might contribute to this problem are under investigation; they relate to these upstream issues (a possible mitigation sketch follows the list):

  1. Requests can hang indefinitely when using HTTP/2 seanmonstar/reqwest#1323
  2. bug: connection pooling on http/2 breaks with high concurrency seanmonstar/reqwest#1276
  3. Program gets stuck forever if tcp connect error happens seanmonstar/reqwest#1256
  4. https://github.com/durch/rust-goauth/blob/ae52647f8f1a71c46d064fd3ff259a76cdbeec1d/src/lib.rs#L130
  5. https://github.com/durch/rust-goauth/blob/ae52647f8f1a71c46d064fd3ff259a76cdbeec1d/src/lib.rs#L174
  6. https://github.com/durch/rust-goauth/blob/ae52647f8f1a71c46d064fd3ff259a76cdbeec1d/src/lib.rs#L180
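If the reqwest HTTP/2 pooling issues above are indeed the cause, a commonly suggested mitigation is to build the client with HTTP/1.1 only and no idle connection reuse. A sketch using reqwest's ClientBuilder options; whether goauth can be made to use such a client here is an open question:

// Sidestep the HTTP/2 and connection-pooling code paths implicated above.
let client = reqwest::Client::builder()
    .http1_only()                  // never negotiate HTTP/2
    .pool_max_idle_per_host(0)     // do not keep idle connections around
    .timeout(std::time::Duration::from_secs(10)) // hard per-request deadline
    .build()?;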

Fixes:

  1. bigtable: Add token timeout backup check #29685

Hopefully we can get this resolved so no one has to deal with this issue anymore.

@joeaba (Contributor) commented Jul 13, 2023

Hey @maestroi, I've reopened #29685. Feel free to take it over so we can get it landed.

@PeaStew commented Sep 4, 2023

I created #26217 long ago, and I have started patching my nodes with my fix again. It absolutely does work, and it works because of the Google auth issues listed above. It may be 'inelegant', but it works and has not caused any issues with sync, despite all words to the contrary. YMMV; sometimes a nail is just a nail and needs a hammer.

@fanatid (Contributor) commented Nov 28, 2023

We can close this issue; #34213 has been merged.
