Too widespread read-consistency issue for JSON RPC with load balancers #15705
Comments
I think this is an additional consequence of the RPC node instability. It looks like Google Cloud makes a best effort at session affinity that is broken if a node becomes unhealthy. I expect that if we fix RPC stability, this will no longer be an issue. For that reason, I sort of prefer your client-side-only idea as a quick fix.
Sorry, forgot to paste the link: https://cloud.google.com/load-balancing/docs/backend-service#session_affinity
@jstarry Yeah, load-balancer session affinity is definitely a good consideration. :) I didn't mention it (my bad), but I rather wanted to avoid it because...:
Tagging @linuskendall & @stakeconomy, who are currently working on the HAProxy config for the new RPC pool.
The JSON RPC protocol already offers a solution: you can batch multiple requests in the same HTTP POST (until the load balancers get much smarter). For example, these two requests are guaranteed to be served by the same backend:

```sh
$ curl https://api.mainnet-beta.solana.com -X POST -H "Content-Type: application/json" -d '
[
  {"jsonrpc":"2.0", "id":2, "method":"getSlot"},
  {"jsonrpc":"2.0", "id":1, "method":"getRecentBlockhash"}
]
'
[{"jsonrpc":"2.0","result":67811128,"id":2},{"jsonrpc":"2.0","result":{"context":{"slot":67811128},"value":{"blockhash":"5XaNf8C3KK2gS6Ymn9AdZ6HBZCh412afNtWJfi7jf63o","feeCalculator":{"lamportsPerSignature":5000}}},"id":1}]
```
@mvines Yeah, good point. I again forgot to mention why I don't think jsonrpc's batching is an adequate solution for the problem...:
Alternatively, I think this problem can be solved if we use websockets for these use cases. But then, we don't provide all of the functionality via websockets (of course, we could). All that said, it still seems odd that we can't guarantee read consistency across plain old reqs (jsonrpc), while for websockets we have a nice synchronization primitive called slots. I think our discussion is maturing... Thanks!
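(For reference, a minimal sketch of that websocket side, using @solana/web3.js's slot subscription; the endpoint choice is just an example.)

```ts
import { Connection } from "@solana/web3.js";

const connection = new Connection("https://api.mainnet-beta.solana.com");

// Slot notifications arrive over a single websocket connection, so they
// give an ordered view of time; track the highest slot observed so far.
let lastSeenSlot = 0;
const subscriptionId = connection.onSlotChange((slotInfo) => {
  lastSeenSlot = Math.max(lastSeenSlot, slotInfo.slot);
});

// Later, when done:
// await connection.removeSlotChangeListener(subscriptionId);
```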
I've been setting up a proof-of-concept set of load balancers off of GCP for the public Solana RPC pool. We could potentially support the X-Solana-Slot header directly in the load balancer, with the load balancer having an idea of the last slot returned from any given backend (and routing requests accordingly). We already use the slot as part of the load balancer's health checks, so I think this could be extended to support something like the X-Solana-Slot header. Affinity is also something I have considered; it can be toggled via cookie or custom header for those clients who need it. This doesn't help if the backend RPC goes down, but our aim is for that to be a relatively rare occurrence (with the LBs protecting the backends to a greater degree than is currently done under GCP).
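(To make that routing idea concrete, here is a minimal sketch in TypeScript rather than actual HAProxy config: the balancer polls each backend's `getSlot` as a health check and only routes to backends that have reached the slot named in a hypothetical `X-Solana-Slot` request header. The backend URLs, poll interval, and port are illustrative assumptions.)

```ts
import http from "node:http";

const backends = ["http://rpc-a:8899", "http://rpc-b:8899"]; // placeholders
const lastSlot = new Map<string, number>(); // backend -> last observed slot

// Health check: poll getSlot on every backend and record the result.
async function pollSlots(): Promise<void> {
  for (const backend of backends) {
    try {
      const res = await fetch(backend, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ jsonrpc: "2.0", id: 1, method: "getSlot" }),
      });
      const body = (await res.json()) as { result: number };
      lastSlot.set(backend, body.result);
    } catch {
      lastSlot.delete(backend); // unreachable backends are unhealthy
    }
  }
}
setInterval(pollSlots, 5_000);

http
  .createServer((req, res) => {
    const raw = req.headers["x-solana-slot"];
    const minSlot = Number(Array.isArray(raw) ? raw[0] : raw ?? 0);
    // Only backends that have caught up to the client's slot are eligible.
    const eligible = backends.filter((b) => (lastSlot.get(b) ?? -1) >= minSlot);
    if (eligible.length === 0) {
      res.writeHead(503);
      res.end("no backend has reached the requested slot");
      return;
    }
    // ...proxy `req` to one of `eligible` (actual proxying omitted).
    res.writeHead(200);
    res.end(`would proxy to ${eligible[0]}`);
  })
  .listen(8080);
```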
Seems like this can still be done fully client side; no new header needed that all LBs may not add support for. The client just batches a `getSlot` with every request.
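(A sketch of that client-side batching: every call piggybacks a `getSlot` in the same batch, which is guaranteed to be served by the same backend, and the client rejects answers from a backend that is behind a slot it has already seen. The endpoint and error handling are illustrative.)

```ts
let highestSlot = 0; // highest slot this client has observed so far

async function rpcWithSlotCheck(request: object): Promise<unknown> {
  const res = await fetch("https://api.mainnet-beta.solana.com", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify([
      { jsonrpc: "2.0", id: "slot", method: "getSlot" },
      { jsonrpc: "2.0", id: "req", ...request },
    ]),
  });
  const replies = (await res.json()) as Array<{ id: string; result: unknown }>;
  const slot = replies.find((r) => r.id === "slot")!.result as number;
  if (slot < highestSlot) {
    // This backend is behind one we already talked to; caller can retry.
    throw new Error(`stale backend: slot ${slot} < ${highestSlot}`);
  }
  highestSlot = slot;
  return replies.find((r) => r.id === "req")!.result;
}

// e.g. await rpcWithSlotCheck({ method: "getRecentBlockhash" });
```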
Yeah, this sounds like a nice way to convey the slot information while avoiding the HTTP header. The only downside is...
Actually, I thought about this idea, but I failed to communicate it well. Let me elaborate on my too-abstract description. ;)

Say a user wants to send a transaction assuming the current slot is N, and its relevant account state is favourable, which can only be determined by a human at the client side by examining the state against a bank at slot N or newer. Also assume the on-chain program enforces some rule to reject/accept such a tx when processed older/newer than N (this will be pretty common for games). Then the user actually sends the transaction, batched as ...

(a) If the tx is a valid action at N-1 but isn't a valid action at N+1, the transaction will be recorded into the ledger as a failed tx, which could give an unfair edge to the user's counterparties as a public hint of the user's intention. Also, the user (browser) waits a rather long time for the ... If we add ...

(b) If the tx isn't valid at N-1 but is a valid action at N+1, then the user's browser immediately knows the tx failed and retries. But it's completely up to luck whether they reach a healthy node or the delayed node catches up next time. Meanwhile, another user can take the opportunity, so the original user misses the opportunity and feels sad in this case too. (This assumes disabling the tx simulation isn't good, both as general practice and to avoid needless information disclosure like the above.)

Also, for mobile clients, bandwidth and the client-side parsing (especially JSON) budget are rather scarce resources. So, I don't want to force those browsers to download a potentially large ...
Also, I think custom HTTP headers aren't uncommon; you can see various ones from well-known APIs too. LBs generally pass them through rather than discarding unknown ones, especially if they are prefixed with `X-`.
@linuskendall Yeah, some smart load-balancing possibility is a nice bonus of the `X-SOLANA-SLOT` header idea.
Hey, I'd like to bump this issue since I think it's fairly important. As devs, the problem we're facing is that we can't make sequential requests to an RPC server while guaranteeing that the slot number is monotonically increasing. I want to stress that this is a big problem for apps that are built on Solana. If not handled, it can lead to users thinking that they've lost their funds. We've seen this happen in the wild. You guys are best equipped to find the right solution, but my intuition is that this should be solved at the network level and not in the Javascript SDK. I think this is too critical to handle on the client side. Just my two cents. Thanks 😃
If you make a request into a single RPC server, you do have this guarantee. Your problem manifests only when multiple RPC servers are load balanced together. The quick solution is to run application-specific RPC servers so that you have more control over how, if any, load balancing occurs. It would be good to better understand the assumptions that your client is making that break when you aim it at a load-balanced RPC endpoint. Perhaps there are client changes we can make that'll be quicker than rebuilding RPC (which I'd love to do one day, but that's a large project). Happy to take this to Discord #orca as well!
I'm always trying to find some time to work on this, but have failed so far... Anyway, I've fixed/found read-consistency issues: solana-labs/solana-program-library#1528, #13987 (comment)
Another sad fact I recently noticed: https://dex.projectserum.com (and a bunch of other downstream projects, if they're using the same API backend) is slightly affected by this as well. I actually observed the orderbook's state revert a bit for a moment, for sure. I mean, an order's remaining amount increased for a bit while otherwise steadily decreasing most of the time.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This stale issue has been automatically closed. Thank you for your contributions.
I would like to add an RPC param called "minimumContextSlot" to have RPCs return an error if they are not caught up. Many RPC requests are called with the expectation that a node has progressed to a particular slot at some commitment. So I think that setting a minimum context slot and getting an error code that indicates the node isn't ready yet and that the client should retry later will be a better dev experience.
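(A sketch of what that flow could look like from a client. The parameter placement and the error code used below are assumptions illustrating the proposal, not a finalized API.)

```ts
// Hypothetical error code for "node has not reached the requested slot".
const MIN_CONTEXT_SLOT_NOT_REACHED = -32016;

async function getAccountInfoAtOrAfter(
  endpoint: string,
  pubkey: string,
  minimumContextSlot: number,
): Promise<unknown> {
  // Retry with exponential backoff while the node reports it is behind.
  for (let delayMs = 400; delayMs <= 3_200; delayMs *= 2) {
    const res = await fetch(endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        jsonrpc: "2.0",
        id: 1,
        method: "getAccountInfo",
        params: [pubkey, { minimumContextSlot }],
      }),
    });
    const body = (await res.json()) as {
      result?: unknown;
      error?: { code: number; message: string };
    };
    if (!body.error) return body.result;
    if (body.error.code !== MIN_CONTEXT_SLOT_NOT_REACHED) {
      throw new Error(body.error.message);
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("node never caught up to the requested slot");
}
```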
Problem
Validators can process slots at slightly different times. So, if there is a load balancer/reverse proxy (there really SHOULD be one in any reasonable setup...), web3 developers and web3's downstream libraries are forced to consider the reversion of any state, i.e. an inhumane burden on them. ;)
fun quiz time!
I implemented this wrongly; can you spot the bug? https://github.com/solana-labs/solana-program-library/pull/1364/files#diff-a89195d5c0dbd3f581bf7b62f88fe627d94149501fd9006ea5852eba7f3adda0R573
This is especially aggravating, and so frustrating to end users, under unstable cluster conditions.
Proposed Solution
Client-side-only idea: track the returned `context.slot` from every RPC response in `web3.Connection` (for `singleGossip` and stronger commitments) internally, and retry safe methods. => Well, this is kinda clumsy; we'd need to discern which methods are safe and which are not.

My current fav idea: introduce a custom universal HTTP header for all JSONRPC methods:

`X-SOLANA-SLOT: NNNN`

So, this works like this: once `web3.Connection` gets its first response from jsonrpc, it remembers the `X-SOLANA-SLOT` value, keyed by `commitment`. Then subsequent reqs carry the custom header, and the server checks the slot against the Bank fetched from `BankForks`:

- if `NNN <= bank.slot`: OK, time is moving forward; process the req. Then the client remembers the new NNN value.
- if `NNN > bank.slot`: nah, return 4XX because the client is from the future. Then the client retries some time later (like 400ms~3000ms, exponentially?)
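(For illustration, a client-side sketch of this header flow: remember the highest slot seen per commitment, echo it on every request, and back off on a 4XX from a lagging backend. The header name follows the proposal; the status handling and delays are illustrative assumptions.)

```ts
const slotByCommitment = new Map<string, number>();

async function rpcWithSlotHeader(
  endpoint: string,
  commitment: string,
  request: object,
): Promise<unknown> {
  for (let delayMs = 400; delayMs <= 3_200; delayMs *= 2) {
    const res = await fetch(endpoint, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Minimum slot the server's bank must have reached for this req.
        "X-SOLANA-SLOT": String(slotByCommitment.get(commitment) ?? 0),
      },
      body: JSON.stringify({ jsonrpc: "2.0", id: 1, ...request }),
    });
    if (res.ok) {
      // Remember the newest slot the server reported for this commitment.
      const slot = Number(res.headers.get("X-SOLANA-SLOT") ?? 0);
      const known = slotByCommitment.get(commitment) ?? 0;
      slotByCommitment.set(commitment, Math.max(known, slot));
      return res.json();
    }
    // 4XX: this backend's bank is behind a slot we already observed.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("no backend caught up to our last observed slot");
}
```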