Catchup & RPC getHealth Results Do Not Match #16957
Comments
I've been thinking that a better health check is to compare the highest slot that the validator has received data for and the current confirmed slot of the validator. These should be very close, within single digits of each other normally. Would you be able to prototype a health check based on these two values? If that looks better overall in the real world, I'm game to reimplement.
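A minimal sketch of that comparison over JSON-RPC, assuming a local endpoint on port 8899 and the public `getMaxRetransmitSlot` / `getSlot` methods (the 10-slot threshold is only illustrative, not part of the proposal):

```python
import requests

RPC_URL = "http://localhost:8899"  # assumed local JSON-RPC endpoint

def rpc(method, params=None):
    """Send one JSON-RPC request and return the `result` field."""
    resp = requests.post(RPC_URL, json={
        "jsonrpc": "2.0", "id": 1, "method": method, "params": params or []})
    resp.raise_for_status()
    return resp.json()["result"]

# Highest slot the node has received shreds for.
max_retransmit_slot = rpc("getMaxRetransmitSlot")

# Slot the node has reached at "confirmed" commitment.
confirmed_slot = rpc("getSlot", [{"commitment": "confirmed"}])

lag = max_retransmit_slot - confirmed_slot
print(f"maxRetransmitSlot={max_retransmit_slot} confirmedSlot={confirmed_slot} lag={lag}")

# Per the comment above, these should normally be within single digits of each other;
# 10 is an arbitrary illustrative threshold.
print("healthy" if lag < 10 else "behind")
```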
I added it to our health checks, but I am not sure how to make sense of it. From this it would look like our nodes are terribly behind, about a 20k difference from maxRetransmitSlot. The same shows up on the command line:
Hmm, I see the same here now too:
The leader for slot 76568995 is Dfc73Czi5K3xa6yKMq98LHJx69PDsSzUvSdCSSqr8NDQ, and that node is currently delinquent. I think what happened is that Dfc73C... forked off and is now running ahead of the cluster, transmitting shreds for the fork it's on whenever it encounters a slot that it's leader for. This generally means that my proposal at #16957 (comment) is 👎🏼
Note to self: perhaps something vote-related for a health check instead. A node should see vote transactions being sent over gossip, and then shortly after see most of those vote transactions land in blocks. If the vote transactions in gossip aren't observed to be landing in blocks promptly, then the node knows it's not at the tip of the cluster yet.
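A rough sketch of that vote-based idea, using hypothetical `votes_seen_in_gossip` / `vote_landed_in_block` stubs in place of data the validator already tracks internally (the window and fraction thresholds are likewise just illustrative):

```python
import time
from typing import List, Tuple

# Hypothetical stand-ins for data the validator tracks internally;
# a real implementation would read this from the node, not from stubs.
def votes_seen_in_gossip() -> List[Tuple[str, float]]:
    """Return (vote_signature, first_seen_unix_time) for recently observed gossip votes."""
    return []

def vote_landed_in_block(signature: str) -> bool:
    """Return True if the vote transaction appears in a block the node has replayed."""
    return False

LANDING_WINDOW_SECS = 15.0   # illustrative: how long a gossip vote gets to land in a block
MIN_LANDED_FRACTION = 0.8    # illustrative: fraction of mature votes that must have landed

def vote_based_health() -> bool:
    now = time.time()
    # Only judge votes that are old enough that they should have landed by now.
    mature = [(sig, seen) for sig, seen in votes_seen_in_gossip()
              if now - seen > LANDING_WINDOW_SECS]
    if not mature:
        return True  # nothing to judge yet; don't report unhealthy on a cold start
    landed = sum(1 for sig, _ in mature if vote_landed_in_block(sig))
    # If most gossip votes aren't showing up in blocks, the node is likely not at the tip.
    return landed / len(mature) >= MIN_LANDED_FRACTION
```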
We could maybe generalize the restart leader-slot-skipping logic Stephen added in #15607.
Just a quick note that I just ran into this issue while setting up a v1.6.7 non-voting validator on mainnet. I was getting RPC errors like "Node is behind by 375 slots", while getSlot showed my node to be at the same slot as api.mainnet-beta.solana.com. As a workaround, I added --health-check-slot-distance 1000, though I suppose that defeats the purpose of the health check.
seems this naughty error message (
We are also running into this, with
I wonder about this for the gossip network in general, because Solana relies on the latest snapshot in a few more places: what happens when a malicious node starts advertising a snapshot that is too high? Will all valid nodes report themselves as behind?
Here's my "attempt" to "fix" it. The implementation of this PR is tongue-in-cheek, but it does feel almost as arbitrary as the current version.
Perhaps the latest gossip vote from known validators could be used instead. That should give much better resolution than using the account hashes, and it's information already traveling over gossip (for now).
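One rough approximation of this from outside the node is `getVoteAccounts`, which reports each vote account's last vote slot as the node currently sees it (not the raw gossip votes themselves). A sketch under that assumption, with placeholder known-validator identity keys:

```python
import requests

RPC_URL = "http://localhost:8899"           # assumed local JSON-RPC endpoint
KNOWN_VALIDATORS = {                         # placeholder identity (node) pubkeys
    "KnownValidator1111111111111111111111111111",
    "KnownValidator2222222222222222222222222222",
}

def rpc(method, params=None):
    resp = requests.post(RPC_URL, json={
        "jsonrpc": "2.0", "id": 1, "method": method, "params": params or []})
    resp.raise_for_status()
    return resp.json()["result"]

vote_accounts = rpc("getVoteAccounts")
local_slot = rpc("getSlot", [{"commitment": "confirmed"}])

# Highest last-vote slot reported for any of the known validators.
known_last_votes = [
    acct["lastVote"]
    for acct in vote_accounts["current"] + vote_accounts["delinquent"]
    if acct["nodePubkey"] in KNOWN_VALIDATORS
]
if known_last_votes:
    lag = max(known_last_votes) - local_slot
    print(f"behind the known validators' latest vote by {lag} slots")
```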
So I think the current implementation is trying to approximate this. As for directly using the latest roots, how those latest roots are queried matters.
Arguably yes, and if something goes wrong, the operator probably needs to dig in deeper. But I think an ideal
Problem

Administrators of RPC pools depend on the `getHealth` RPC call to determine the health of a node. We dynamically move unhealthy nodes out of the pool when they fall behind. On v1.6.6, we see conflicting results between the CLI `catchup` command and RPC `getHealth`.

In the example above, it looks like we are both caught up and 213 slots behind at the same time. The bigger problem is that the 'unhealthy' result from `getHealth` might keep healthy RPC nodes out of the pool.

Proposed Solution

Please investigate the source of the problem. The `getHealth` endpoint is a critical RPC admin tool.
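For reference, a minimal sketch of the kind of check a pool might run against each node, assuming a local JSON-RPC endpoint; it only distinguishes an "ok" result from an error object and does not rely on the exact error message:

```python
import requests

RPC_URL = "http://localhost:8899"  # assumed JSON-RPC endpoint of the node being checked

def is_healthy() -> bool:
    """Return True if getHealth reports "ok"; otherwise log the error and return False."""
    resp = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1, "method": "getHealth"})
    resp.raise_for_status()
    body = resp.json()
    if body.get("result") == "ok":
        return True
    # Unhealthy nodes return a JSON-RPC error (e.g. "Node is behind by N slots");
    # the pool would pull such a node out of rotation.
    print("unhealthy:", body.get("error"))
    return False

if __name__ == "__main__":
    print("healthy" if is_healthy() else "removing node from pool")
```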