Stuck subnet node unable to recover after validator re-stake #2594

Closed
jpop32 opened this issue Jan 9, 2024 · 0 comments · Fixed by #2679

jpop32 commented Jan 9, 2024

Describe the bug
If a node receives a transaction on its RPC port for a synced subnet that currently has no validators, it enters a faulty state because there are no validators to finalize a block. The node starts failing its health check with:
snowman consensus is not healthy reason: block processing time 11h5m29.98576991s > 30s
It also warns about the cause of the issue:
validator set is empty

But once the validators have been re-added and the subnet starts processing blocks again, the node doesn't recover. Instead, it stays stuck indefinitely with the same failing health check:
snowman consensus is not healthy reason: block processing time 12h44m13.985749723s > 30s

The only workaround to resume normal operation is to restart the node, after which it correctly syncs to the chain and resumes processing transactions. The stuck block presumably gets dropped.
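Until the node can recover on its own, an operator-side watchdog could detect this state from the health-check message itself. Below is a minimal sketch in Python, not anything shipped with the node: the function names (`parse_go_duration`, `processing_stall`, `needs_restart`) and the 10-minute grace period are assumptions made for illustration, and the only input format it relies on is the message quoted above. How the message is obtained and how the node is restarted are left out on purpose.

```python
import re
from datetime import timedelta

# Matches the health-check message quoted in this issue, e.g.
# "... block processing time 11h5m29.98576991s > 30s"
_MESSAGE = re.compile(r"block processing time (\S+) > (\S+)")

# One component of a Go-style duration string such as "11h", "5m" or "29.98s".
# Only the units needed for messages like the ones above are handled here.
_COMPONENT = re.compile(r"(\d+(?:\.\d+)?)(ns|us|ms|s|m|h)")
_SECONDS_PER_UNIT = {
    "ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0,
}


def parse_go_duration(text: str) -> timedelta:
    """Convert a Go-style duration string (as printed in the log) to a timedelta."""
    pos = 0
    seconds = 0.0
    for match in _COMPONENT.finditer(text):
        if match.start() != pos:  # unparsed characters between components
            raise ValueError(f"cannot parse duration: {text!r}")
        seconds += float(match.group(1)) * _SECONDS_PER_UNIT[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):  # empty input or trailing garbage
        raise ValueError(f"cannot parse duration: {text!r}")
    return timedelta(seconds=seconds)


def processing_stall(message: str):
    """Return (elapsed, limit) from a health message, or None if it does not match."""
    match = _MESSAGE.search(message)
    if match is None:
        return None
    return parse_go_duration(match.group(1)), parse_go_duration(match.group(2))


def needs_restart(message: str, grace: timedelta = timedelta(minutes=10)) -> bool:
    """Hypothetical policy: restart once the stall exceeds a grace period."""
    stall = processing_stall(message)
    return stall is not None and stall[0] > grace
```

A restart only helps once the subnet has validators again, so a real watchdog would presumably also check that before acting; that check is outside this sketch.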

To Reproduce
Spin up a subnet and sync one non-validating node. Post a transaction to the node's RPC port. The node starts failing its health check. Add a validator to the subnet and observe that the failing node doesn't recover. Restart the failing node; it comes up healthy.

Expected behavior
Once validators have been added (or re-added) to the subnet, the node should recover to a healthy state and resume normal operation without any intervention. I'm not sure what the correct or desirable behaviour would be for transactions that were submitted while validators were absent. They could be dropped, or possibly re-entered into the mempool to be included in a future block. At a minimum, the node could reproduce what happens when a stuck node is restarted. The main idea is for nodes to auto-recover without external intervention (a restart).

@jpop32 jpop32 added the bug Something isn't working label Jan 9, 2024
@StephenButtolph StephenButtolph added enhancement New feature or request and removed bug Something isn't working labels Jan 30, 2024
@StephenButtolph StephenButtolph added this to the v1.10.20 milestone Jan 30, 2024
@StephenButtolph StephenButtolph self-assigned this Jan 30, 2024