Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catchup & RPC getHealth Results Do Not Match #16957

Closed
brianlong opened this issue Apr 29, 2021 · 13 comments · Fixed by #33651
Closed

Catchup & RPC getHealth Results Do Not Match #16957

brianlong opened this issue Apr 29, 2021 · 13 comments · Fixed by #33651
Milestone

Comments

@brianlong
Copy link
Contributor

Problem

Administrators of RPC pools depend on the getHealth RPC call to determine the health of a node. We dynamically move unhealthy nodes out of the pool when they fall behind. On v1.6.6, we see errors between the CLI catchup command and RPC getHealth.

$ solana catchup ~/keypair.json http://127.0.0.1:8899/ &&  
         curl http://localhost:8899/ -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","id":1, "method":"getHealth"}' && 
        curl http://localhost:8899/ -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","id":1, "method":"getSlot"}'

4MCpdngneZhRWB1EzX1hcmnb3GPEAQSY6saqS4H1TmSa has caught up (us:76048793 them:76048769)
{"jsonrpc":"2.0","error":{"code":-32005,"message":"Node is behind by 213 slots","data":{"numSlotsBehind":213}},"id":1}
{"jsonrpc":"2.0","result":76048761,"id":1}

In the example above, it looks like we are both caught up and 213 slots behind at the same time. The bigger problem is that the 'unhealthy' result from getHealth might keep healthy RPC nodes out of the pool.

Proposed Solution

Please investigate the source of the problem. The getHealth endpoint is a critical RPC admin tool.

@mvines
Copy link
Member

mvines commented Apr 30, 2021

The current getHealth result is very coarse, it compares the latest slot in the local validator against the highest snapshot slot over gossip, which is typically ~100-~200 slots behind.

I've been thinking that a better health check is to compare the highest slot that the validator has received data for and the current confirmed slot of the validator:

$ curl -s http://localhost:8899 -X POST -H "Content-Type: application/json" -d '
  [{"jsonrpc":"2.0","id":1, "method":"getMaxRetransmitSlot"},
   {"jsonrpc":"2.0","id":1, "method":"getSlot", "params":[{"commitment": "confirmed"}]}]
' | jq
[
  {
    "jsonrpc": "2.0",
    "result": 76095287,
    "id": 1
  },
  {
    "jsonrpc": "2.0",
    "result": 76095285,
    "id": 1
  }
]

These should be very close, within single digits of each other normally. Would you be able to prototype a health check based on these two values? If that looks better overall in the real world, I'm game to reimplement getHealth with this logic instead of what it does currently.

@mvines mvines added this to the v1.6.6 milestone Apr 30, 2021
@linuskendall
Copy link

linuskendall commented May 1, 2021

I added it to our health checks, but I am not sure how to make sense of it?

From this it would look like our nodes are terribly behind, about 20k differnce from maxretransmitslot.

id  rpcNode                          minSlot   curSlot   maxRetransmitSlot  slotsStored  prevEpochBlocks  curEpochBlocks
0   http://fra2.rpcpool.com:8899     75154550  76335047  76315666           1161116      0                0
1   http://api.stakeconomy.com       76161940  76335047  76315850           153910       0                0
2   http://fsn1.rpcpool.com:8899     76161940  76335047  76315850           153910       0                0
3   http://ams1.rpcpool.com:8899     75743484  76335047  76315829           572345       0                0
4   http://fra3.rpcpool.com:8899     75153906  76335047  76315693           1161787      0                0
5   http://nyc1.rpcpool.com:8899     74909394  76335047  76315664           1406270      0                0
6   http://mia1.rpcpool.com:8899     75743977  76335047  76315850           571873       0                0
7   http://nyc1.rpcpool.com:8899     74909394  76335047  76315664           1406270      0                0
8   http://mia2.rpcpool.com:8899     75743485  76335047  76315834           572349       0                0
9   http://bl-lax1.rpcpool.com:8899  75743485  76335047  76315850           572365       0                0
10  http://lax2.rpcpool.com:8899     75743487  76335047  76315772           572285       0                0
11  http://sfo1.rpcpool.com:8899     75320189  76335047  76315850           995661       0                0
12  http://hk1.rpcpool.com:8899      75743346  76335047  76315784           572438       0                0

Follows also on command line:

linuskendall@workstation:~/work/solana-rpc-node$ curl -s http://mia2.rpcpool.com:8899 -X POST -H "Content-Type: application/json" -d '
  [{"jsonrpc":"2.0","id":1, "method":"getMaxRetransmitSlot"},
   {"jsonrpc":"2.0","id":1, "method":"getSlot", "params":[{"commitment": "confirmed"}]}]'
[{"jsonrpc":"2.0","result":76335047,"id":1},{"jsonrpc":"2.0","result":76316891,"id":1}]

@mvines
Copy link
Member

mvines commented May 2, 2021

Hmm, I see the same here now too:

$ curl -s http://localhost:8899 -X POST -H "Content-Type: application/json" -d '
  [{"jsonrpc":"2.0","id":1, "method":"getMaxRetransmitSlot"},
   {"jsonrpc":"2.0","id":1, "method":"getMaxShredInsertSlot"},
   {"jsonrpc":"2.0","id":1, "method":"getSlot", "params":[{"commitment": "confirmed"}]}]
' | jq
[
  {
    "jsonrpc": "2.0",
    "result": 76568995,
    "id": 1
  },
  {
    "jsonrpc": "2.0",
    "result": 76568995,
    "id": 1
  },
  {
    "jsonrpc": "2.0",
    "result": 76530218,
    "id": 1
  }
]

The leader for slot 76568995 is Dfc73Czi5K3xa6yKMq98LHJx69PDsSzUvSdCSSqr8NDQ, and that node is currently delinquent. I think what happened is Dfc73C... forked off and is now running ahead of the cluster, and transmitting shreds for the fork it's on whenever it encounters a slot that it's leader for.

This generally means that my proposal at #16957 (comment) is 👎🏼

@mvines mvines self-assigned this May 2, 2021
@mvines mvines modified the milestones: v1.6.7, v1.6.8 May 4, 2021
@mvines
Copy link
Member

mvines commented May 4, 2021

Note to self: perhaps something vote related for a health check instead. A node should see vote transactions being sent over gossip, and then shortly after see most of those vote transactions land in blocks. If the vote transactions in gossip aren't observed to be landing in blocks promptly then the node knows it's not at the tip of the cluster yet.

@t-nelson
Copy link
Contributor

t-nelson commented May 5, 2021

Note to self: perhaps something vote related for a health check instead. A node should see vote transactions being sent over gossip, and then shortly after see most of those vote transactions land in blocks. If the vote transactions in gossip aren't observed to be landing in blocks promptly then the node knows it's not at the tip of the cluster yet.

Could maybe generalize the restart leader slot skipping logic Stephen added in #15607

@machacekch
Copy link
Contributor

Just a quick note that I just ran into this issue, setting up a v1.6.7 non-voting validator on mainnet. I was getting RPC errors ala "Node is behind by 375 slots", while getSlot showed my node to be at the same slot as api.mainnet-beta.solana.com.
(I'm using the trusted validators from https://docs.solana.com/clusters )

As a workaround, I added "--health-check-slot-distance 1000", though I suppose that defeats the purpose of the health check.

@CriesofCarrots CriesofCarrots modified the milestones: v1.6.9, v1.6.10 May 15, 2021
@CriesofCarrots CriesofCarrots modified the milestones: v1.6.10, v1.6.11, v1.6.12 May 26, 2021
@mvines mvines removed their assignment Jun 7, 2021
@mvines mvines modified the milestones: v1.6.12, v1.8.0 Jun 7, 2021
@ryoqun
Copy link
Contributor

ryoqun commented Oct 25, 2021

seems this naughty error message (Node is behind by {100-1000} slots) could also be seen from sendTransaction rpc method, which is kinda important rpc method. (i got a support inquiry about this)

@ruuda
Copy link
Contributor

ruuda commented Mar 22, 2023

We are also running into this, with --health-check-slot-distance 100, nodes regularly report that they are behind, even though from the metrics that we added in #27665 we can tell that they are perfectly in sync.

The current getHealth result is very coarse, it compares the latest slot in the local validator against the highest snapshot slot over gossip, which is typically ~100-~200 slots behind.

I wonder this about the gossip network in general because Solana relies on the latest snapshot in a few more places, but what happens when a malicious node starts advertising snapshot that is too high? Will all valid nodes report themselves as behind?

@mvines
Copy link
Member

mvines commented Mar 22, 2023

but what happens when a malicious node starts advertising snapshot that is too high

The --known-validator flag prevents this, as when specified only the snapshot hashes from the whitelisted identities are considered

@brooksprumo
Copy link
Contributor

brooksprumo commented Jun 6, 2023

Here's my "attempt" to "fix" getHealth: #31989

The implementation of this PR is tongue-in-cheek, but it does feel almost as arbitrary as the current version.

getHealth feels pretty optimistic, and mostly trusts the known validators already; can we query our latest rooted slot and the known validator's latest rooted slot and call it sufficient? To really know a validator's health, the operator has to do more monitoring anyway, right?

@mvines
Copy link
Member

mvines commented Jun 6, 2023

can we query our latest rooted slot and the known validator's latest rooted slot and call it sufficient

Perhaps the latest gossip vote from known validators could be used instead. That should much better resolution than using the account hashes, and is information traveling over gossip already (for now)

1 similar comment
@mvines
Copy link
Member

mvines commented Jun 6, 2023

can we query our latest rooted slot and the known validator's latest rooted slot and call it sufficient

Perhaps the latest gossip vote from known validators could be used instead. That should much better resolution than using the account hashes, and is information traveling over gossip already (for now)

@steviez
Copy link
Contributor

steviez commented Jun 6, 2023

can we query our latest rooted slot and the known validator's latest rooted slot and call it sufficient?

So I think the current implementation is trying to approximate this. As for directly using the latest roots, how those latest roots are queried matters.

  • If we grab the latest roots from our latest Bank such as how getVoteAccounts does so, we're completely dependent on our local view of the network. For example, during node catchup after a restart, our node would be recognizing roots as we proceed through blocks. Those roots would likely only be a few slots off of our known validators, even if we are 10k+ slots behind the tip of the cluster.
  • I think I mentioned it in my other issue and mvines just reiterated, but I agree with him that looking at gossip votes is seemingly a more robust option with much better resolution. As Brooks and I discussed in DMs, relying on any item that is published at a set interval (snapshot / bank / etc hashes) is prone to these sampling glitches and non-ideal

To really know a validator's health, the operator has to do more monitoring anyway, right?

Arguably yes, and if something goes wrong, the operator probably needs to dig in deeper. But, I think an ideal getHealth provides an indication of "is my node keeping up with the cluster".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging a pull request may close this issue.

10 participants