Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove syncing tower from bank on old local tower #32894

Closed
wants to merge 1 commit into from

Conversation

AshwinSekar
Copy link
Contributor

@AshwinSekar AshwinSekar commented Aug 18, 2023

Problem

This workflow was intended to stop accidental slashable voting on server crash + restart. It detected if local state was behind on chain state and adopted on chain state if that was the case.

This was not designed with secondary non-voting validators in mind. This completely breaks secondary validators because their tower is constantly diverged from their primary validator, due to their freedom to freeze and vote on banks differently. Because of this, their fork choice could be behind the primary validators causing errors like the following to occur:

thread 'solReplayStage' panicked at 'a bank at last_voted_slot((212118126, 2Vz5TaJK6SLCGaV5VV7m3ASXXxHGPrXX9xCWir6t3UMs)) is a frozen bank so must have been added to heaviest_subtree_fork_choice at time of freezing', core/src/heaviest_subtree_fork_choice.rs:874:29

Summary of Changes

Remove the sync until a proper fix can be designed.

Fixes #

@AshwinSekar AshwinSekar added v1.14 v1.16 PRs that should be backported to v1.16 labels Aug 18, 2023
@AshwinSekar AshwinSekar changed the title Removing syncing tower from bank on old local tower Remove syncing tower from bank on old local tower Aug 18, 2023
@diman-io
Copy link
Contributor

@AshwinSekar Will deleting the tower file prevent an error?

@AshwinSekar
Copy link
Contributor Author

@AshwinSekar Will deleting the tower file prevent an error?

No. This sync only happens when local tower is behind on chain tower which will always be the case if your local tower is deleted.

@diman-io
Copy link
Contributor

No. This sync only happens when local tower is behind on chain tower which will always be the case if your local tower is deleted.

What if we start without the local tower at all? I've restarted it, and the node has been running for 15 minutes already.

@AshwinSekar
Copy link
Contributor Author

No. This sync only happens when local tower is behind on chain tower which will always be the case if your local tower is deleted.

What if we start without the local tower at all? I've restarted it, and the node has been running for 15 minutes already.

as long as your secondary is still producing and updating a local tower, there is a chance that the sync condition will be triggered which has a chance to lead to the panic mentioned above.

@diman-io
Copy link
Contributor

No. This sync only happens when local tower is behind on chain tower which will always be the case if your local tower is deleted.

What if we start without the local tower at all? I've restarted it, and the node has been running for 15 minutes already.

Ah, I looked into the code, and it seems I misunderstood you. By "local tower," you meant the structure that stores the tower's state in the validator's runtime, not the local tower file. (I hope I didn't say something even more foolish than when I asked the previous question 🤦‍♀️).

@AshwinSekar
Copy link
Contributor Author

i don't think it's worth rolling this back at the detriment of voting validators. secondary validators can cherry-pick this if needed until we have a proper fix in place.

@ruuda
Copy link
Contributor

ruuda commented Aug 21, 2023

until we have a proper fix in place

What would be the timeline for a proper fix? We cannot run different binaries for the primary and secondary because their roles switch at failover.

@AshwinSekar
Copy link
Contributor Author

I'm working on a fix at the moment, should have it in time for this weeks release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.16 PRs that should be backported to v1.16
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants