-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TESTING] - Fix Startup / Restart Test Failures #3669
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks right to me overall
one question: if I understand correctly, I think the failure check for a round needs to be fixed too. it seems like we currently fail a view as soon as 1/3 of the quorum says it fails. but that's probably not correct if nodes can send a quorum vote, then a timeout vote, but have the view eventually pass. (maybe the call to view.update_status
just needs to lag 3 views behind?)
I don't think a node should ever send a vote and a timeout vote for the same view. If a node votes and then doesn't see the QC , it times out the next view. The only scenario where a node fails a view, then succeeds (by the testing logic) is if the other nodes vote for the view and it's eventually decided. if 1/3 of the nodes timeout then the view will actually fail and never be decided (unless there is a bug and we want to catch that with a test failure). |
Ah yes, I was mixing up the views -- agree with @bfish713 that this should be fine! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
Closes (#3651)
This PR:
We noticed with certain flaky catchup tests that a node may come up before other nodes and wait for a proposal. The node wont receive the proposal in time then timeout the view, and add it to
failed_views
in our test harness. Later, the node will receive the proposal and decide upon the view, but in our test harness we never remove the previously timed out view fromfailed_views
. So this PR will add some logic to do so.This PR does not:
Key places to review: