Stuck subnet node unable to recover after validator re-stake #2594

jpop32 · 2024-01-09T09:36:08Z

Describe the bug
If a node receives a transaction on its RPC port for a synced subnet that currently has no validators, it enters a faulty state because there are no validators to finalize a block. The node then starts failing its health check with:
snowman consensus is not healthy reason: block processing time 11h5m29.98576991s > 30s
It also warns about the cause of the issue:
validator set is empty

But once the validators have been re-added and the subnet starts processing blocks again, the node doesn't recover. Instead, it stays stuck indefinitely with the same failing health check:
snowman consensus is not healthy reason: block processing time 12h44m13.985749723s > 30s

The only workaround to resume normal operation is to restart the node, after which it correctly syncs to the chain and resumes processing transactions. The stuck block presumably gets dropped.
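
As an illustration of automating that workaround, here is a minimal sketch of how an operator-side watchdog might recognise the stuck state from the health check reason quoted above. It is not part of the node software: the function names and the restart threshold are hypothetical, and it assumes the reason string always has the "block processing time X > Y" shape shown in this report, with durations printed in Go's duration format.

```python
import re

# Seconds per unit for Go-style duration strings such as "11h5m29.98576991s".
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}
# "ms" is listed before "s" and "m" so that "5ms" is not read as "5m" + "s".
_COMPONENT = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
# Assumed shape of the reason text, taken from the messages in this report.
_REASON = re.compile(r"block processing time (\S+) > (\S+)")


def parse_go_duration(text):
    """Convert a Go-style duration string to seconds."""
    parts = _COMPONENT.findall(text)
    # Reject input that is not made up entirely of number+unit components.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"not a duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def processing_time(reason):
    """Return (elapsed_s, limit_s) from a health reason, or None if absent."""
    match = _REASON.search(reason)
    if match is None:
        return None
    return parse_go_duration(match.group(1)), parse_go_duration(match.group(2))


def looks_stuck(reason, restart_after_s=600.0):
    """Hypothetical policy: flag the node once processing time is far past the limit."""
    parsed = processing_time(reason)
    if parsed is None:
        return False
    elapsed_s, limit_s = parsed
    return elapsed_s > max(limit_s, restart_after_s)
```

A watchdog could feed `looks_stuck` the reason text from the node's health output and trigger a restart when it returns True. The 600 second default is an arbitrary illustration, not a recommended value.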

To Reproduce
1. Spin up a subnet and sync one non-validating node.
2. Post a transaction to the node's RPC port. The node starts failing its health check.
3. Add a validator to the subnet. Observe that the failing node doesn't recover.
4. Restart the failing node. The node comes up healthy.

Expected behavior
Once validators have been added (or re-added) to the subnet, the node should recover to a healthy state and resume normal operation without any intervention. I'm not sure what the correct or desirable behaviour would be for transactions that were submitted while validators were absent. They could be dropped, or possibly re-entered into the mempool to be included in a future block. At a minimum, the node could reproduce the behaviour it shows when it is restarted while stuck. The main idea is for nodes to be able to auto-recover, without external intervention (a restart).


jpop32 added the "bug" (Something isn't working) label Jan 9, 2024

StephenButtolph added the "enhancement" (New feature or request) label and removed the "bug" (Something isn't working) label Jan 30, 2024

StephenButtolph added this to the v1.10.20 milestone Jan 30, 2024

StephenButtolph self-assigned this Jan 30, 2024

StephenButtolph mentioned this issue Jan 30, 2024 in "Unblock misconfigured subnets" #2679 (merged)

StephenButtolph closed this as completed in #2679 Jan 30, 2024
