Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TESTING] - Fix Startup / Restart Test Failures #3669

Merged
merged 6 commits into from
Sep 12, 2024

Conversation

lukeiannucci
Copy link
Contributor

@lukeiannucci lukeiannucci commented Sep 11, 2024

Closes (#3651)

This PR:

We noticed with certain flaky catchup tests that a node may come up before other nodes and wait for a proposal. The node wont receive the proposal in time then timeout the view, and add it to failed_views in our test harness. Later, the node will receive the proposal and decide upon the view, but in our test harness we never remove the previously timed out view from failed_views. So this PR will add some logic to do so.

This PR does not:

Key places to review:

@lukeiannucci lukeiannucci changed the title add logic to remove from failed views if node decides on view after t… [TESTING] - Investigate Restart Test Failures Sep 11, 2024
@lukeiannucci lukeiannucci marked this pull request as ready for review September 11, 2024 19:43
@lukeiannucci lukeiannucci changed the title [TESTING] - Investigate Restart Test Failures [TESTING] - Fix Startup / Restart Test Failures Sep 11, 2024
Copy link
Contributor

@ss-es ss-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks right to me overall

one question: if I understand correctly, I think the failure check for a round needs to be fixed too. it seems like we currently fail a view as soon as 1/3 of the quorum says it fails. but that's probably not correct if nodes can send a quorum vote, then a timeout vote, but have the view eventually pass. (maybe the call to view.update_status just needs to lag 3 views behind?)

crates/testing/src/overall_safety_task.rs Outdated Show resolved Hide resolved
@lukeiannucci lukeiannucci linked an issue Sep 11, 2024 that may be closed by this pull request
@bfish713
Copy link
Collaborator

bfish713 commented Sep 12, 2024

I think this looks right to me overall

one question: if I understand correctly, I think the failure check for a round needs to be fixed too. it seems like we currently fail a view as soon as 1/3 of the quorum says it fails. but that's probably not correct if nodes can send a quorum vote, then a timeout vote, but have the view eventually pass. (maybe the call to view.update_status just needs to lag 3 views behind?)

I think this looks right to me overall

one question: if I understand correctly, I think the failure check for a round needs to be fixed too. it seems like we currently fail a view as soon as 1/3 of the quorum says it fails. but that's probably not correct if nodes can send a quorum vote, then a timeout vote, but have the view eventually pass. (maybe the call to view.update_status just needs to lag 3 views behind?)

I don't think a node should ever send a vote and a timeout vote for the same view. If a node votes and then doesn't see the QC , it times out the next view. The only scenario where a node fails a view, then succeeds (by the testing logic) is if the other nodes vote for the view and it's eventually decided. if 1/3 of the nodes timeout then the view will actually fail and never be decided (unless there is a bug and we want to catch that with a test failure).

@ss-es
Copy link
Contributor

ss-es commented Sep 12, 2024

Ah yes, I was mixing up the views -- agree with @bfish713 that this should be fine!

Copy link
Contributor

@ss-es ss-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@lukeiannucci lukeiannucci merged commit 7879bf0 into main Sep 12, 2024
36 checks passed
@lukeiannucci lukeiannucci deleted the li/node-timing-startup-fix branch September 12, 2024 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[TESTING] - Investigate Restart Test Failures
4 participants