-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CATCHUP] Generate Network on the fly for restart nodes #3723
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes look good and make sense overall, just had a few questions (maybe just for my own understanding)
crates/testing/src/test_runner.rs
Outdated
@@ -481,7 +482,7 @@ where | |||
self.late_start.insert( | |||
node_id, | |||
LateStartNode { | |||
network, | |||
network: Some(network), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused here -- I would've assumed skip_late == true
means we want to skip initializing the nodes. Shouldn't we set network: None
in that case?
(I don't actually know -- genuinely just wondering)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this was part of the plan to eventually have all the LateStartNode
s actually generate and start the network on their proper view. I have fixed this scenario I also think we can fix the other ones in a future PR. The main issue seems to be with restarting nodes and generating nodes * 2 networks before the test started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make a separate issue for someone to pick it up
@@ -520,7 +521,7 @@ where | |||
self.late_start.insert( | |||
node_id, | |||
LateStartNode { | |||
network, | |||
network: Some(network), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this one makes complete sense
@@ -210,13 +220,15 @@ where | |||
if let Some(node) = self.handles.write().await.get_mut(idx) { | |||
tracing::error!("Node {} shutting down", idx); | |||
node.handle.shut_down().await; | |||
// For restarted nodes generate the network on correct view | |||
let generated_network = (self.channel_generator)(node_id).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we generate the network on RestartDown
rather than RestartUp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be fine to do either way, the main thing is to ensure we shut down the old nodes network before we generate the new network
Closes #<ISSUE_NUMBER>
This PR:
Instead of starting the network for the Restart Nodes at test startup, we are pushing it to when we create the node and replace old one.
Also, broadcast the
QuorumProposalRequestSend
to all nodes. Previously we would send to leader, wait, then do the DA committee. Sometimes this would take too long and theREQUEST_TIMEOUT
would timeout before we received the proposal causing some nodes to never receive it, leading to flaky UT's.This PR does not:
Key places to review: