Archive node missing blocks when under heavy gRPC pressure #8602
Comments
I believe this is a tendermint issue. The RPC is blocking and causes consensus to slow down. This is a known issue and why we recommend validators not expose their RPC to the public network.
Are you referring to the RPC or gRPC? Cause we noticed this problem only when querying using gRPC. When we only use RPC it has no problems
All requests in the SDK are routed through Tendermint. The request goes through the abci_query ABCI method.
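For illustration, a minimal sketch (my own, assuming a node with Tendermint RPC exposed on localhost:26657 and v0.40-era module query paths; the address is hypothetical) of issuing the same abci_query call that SDK gRPC requests are funneled through:

```go
package main

import (
	"context"
	"fmt"
	"log"

	banktypes "github.com/cosmos/cosmos-sdk/x/bank/types"
	tmbytes "github.com/tendermint/tendermint/libs/bytes"
	rpchttp "github.com/tendermint/tendermint/rpc/client/http"
)

func main() {
	// Connect to the node's Tendermint RPC endpoint (assumed to be local).
	client, err := rpchttp.New("tcp://localhost:26657", "/websocket")
	if err != nil {
		log.Fatal(err)
	}

	// Encode an SDK module query; the gRPC service method name doubles as the ABCI query path.
	req := banktypes.QueryAllBalancesRequest{Address: "cosmos1..."} // hypothetical address
	data, err := req.Marshal()
	if err != nil {
		log.Fatal(err)
	}

	// This is the abci_query method all such queries pass through.
	res, err := client.ABCIQuery(context.Background(),
		"/cosmos.bank.v1beta1.Query/AllBalances", tmbytes.HexBytes(data))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("height=%d value=%x\n", res.Response.Height, res.Response.Value)
}
```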
Ok thanks. Is there an issue opened in Tendermint about this? Maybe we can link it here for future reference
There doesn't seem to be one; it's also a mix of multiple issues. Do you want to open an issue that links to this one?
That still doesn't explain it @marbar3778. Why do RPC and legacy API endpoints work "fine", i.e. no regressions, yet gRPC slows down nodes considerably?
I can reproduce this on Tendermint RPC as well. It's a bit harder to trigger than with gRPC, but still present. gRPC was built to handle concurrent requests, but I don't think any of our stack can handle concurrent requests at high volume. To reproduce with Tendermint:
I'm curious why this is so exacerbated by gRPC then, which is supposed to be more efficient. Why did block explorers and clients never report such issues for RPC and the legacy API?
It would be more efficient in almost all possible ways if Tendermint were not being used as a global mutex. Right now all calls are routed through Tendermint, and the known mutex contention when using RPC is being felt.
I am guessing no one was making this many requests per block. This has been a known issue in Tendermint for as long as I can remember. This is one of the core reasons we tell people not to expose their RPC endpoints to the public.
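As a rough illustration of that effect (a self-contained sketch, not Tendermint's actual code): when every query has to take one global lock, requests arriving in parallel are still answered strictly one at a time, and anything else waiting on that lock queues behind them.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// globalMu stands in for the single lock that all queries contend on.
var globalMu sync.Mutex

// handleQuery simulates one state query that holds the lock while it reads.
func handleQuery(id int, wg *sync.WaitGroup) {
	defer wg.Done()
	globalMu.Lock()
	defer globalMu.Unlock()
	time.Sleep(10 * time.Millisecond) // pretend state read
	fmt.Printf("query %d done\n", id)
}

func main() {
	var wg sync.WaitGroup
	start := time.Now()

	// 100 "concurrent" queries still complete strictly one after another.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go handleQuery(i, &wg)
	}
	wg.Wait()

	// Roughly 100 * 10ms despite the goroutines; with a sync.RWMutex
	// (as mentioned below for Tendermint 0.34.13) read-only queries
	// could proceed in parallel instead of queuing.
	fmt.Println("elapsed:", time.Since(start))
}
```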
They were, though. Juno for example did this without slowing down the connected node at all. Block explorers continuously use and call the RPC to fetch and index data into external data sources.
Can someone from our team investigate whether there has indeed been a performance regression with gRPC related to these cases? My guess is that it's likely not gRPC per se, but something else in the query handling... Can you triage @clevinson?
I think #10045 may help out. gRPC is natively concurrent, but all the queries are queued behind a single mutex. 0.34.13 makes this mutex an RWMutex, but the mentioned PR should remove the need for gRPC requests to be routed through Tendermint at all. @RiccardoM would love to see if the PR helps
I am almost certain this is related in some way to cosmos/gaia#972 ...and I've definitely seen similar behavior to this on any node I've used for relaying.
Closing this for now since gRPC is no longer routed through Tendermint.
Summary of Bug
When under heavy gRPC pressure (a lot of requests being made), the full node can start lagging behind in block validation.
Version
v0.40.1
Steps to Reproduce
pruning = "nothing"
Context
We are currently developing BDJuno, a tool that listens to a chain's state and parses the data into a PostgreSQL database. In order to do so, it acts in two ways at the same time:
For each block, it then reads the different modules' states and stores them inside the PostgreSQL database. What we do is take a snapshot of the state at each block and store it. To do so, we use gRPC to get all the data that can change from one block to another (i.e. delegations, unbonding delegations, redelegations, staking commissions, etc.).
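For reference, a minimal sketch of the kind of per-height query involved, assuming the node exposes gRPC on localhost:9090 and that the x-cosmos-block-height metadata header is used to pin the query to a past block (address and height are hypothetical):

```go
package main

import (
	"context"
	"fmt"
	"log"

	grpctypes "github.com/cosmos/cosmos-sdk/types/grpc"
	stakingtypes "github.com/cosmos/cosmos-sdk/x/staking/types"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

func main() {
	// Connect to the node's gRPC endpoint (assumed to be localhost:9090).
	conn, err := grpc.Dial("localhost:9090", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Pin the query to a specific (old) height via the block-height header;
	// this is what requires an archive node with pruning = "nothing".
	ctx := metadata.AppendToOutgoingContext(
		context.Background(), grpctypes.GRPCBlockHeightHeader, "1000000", // hypothetical height
	)

	client := stakingtypes.NewQueryClient(conn)
	res, err := client.DelegatorDelegations(ctx, &stakingtypes.QueryDelegatorDelegationsRequest{
		DelegatorAddr: "cosmos1...", // hypothetical delegator address
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("delegations at that height:", len(res.DelegationResponses))
}
```

Doing this for every module, for every block, is what produces the request volume described above.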
As we also need to parse old blocks and get the state at very old heights, we set up an archive node with pruning = "nothing". When we first started our parser, everything was working properly. The node was able to keep up with syncing new blocks while answering gRPC calls properly.
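For completeness, a sketch of the relevant app.toml section (key names as found in v0.40-era configs; double-check them against your own app.toml):

```toml
# app.toml (excerpt) -- pruning strategy for application state.
# "nothing" keeps every historical state version, which is what makes
# per-height queries like the one above possible on an archive node.
pruning = "nothing"

# Only relevant when pruning = "custom"; ignored for "nothing"/"default"/"everything".
pruning-keep-recent = "0"
pruning-keep-every = "0"
pruning-interval = "0"
```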
Recently, however, we noticed that the node started to lag behind the chain state, at one point being over 500 blocks behind. So we stopped the parser and let the node catch up with the chain state again. Then we restarted the parser. One week later, the node is once again more than 1,000 blocks behind the current chain height.
Note
I have no idea if this happens only because pruning is set to nothing. However, I believe this should be investigated, as it might result in some tools (e.g. explorers) halting nodes in the future if too many requests are made to them. It could even be exploited as a DDoS attack against validator nodes, if it turns out to also affect nodes that have the pruning option set to default or everything.