Ed/timeout #1794
Conversation
crates/task-impls/src/consensus.rs
Outdated
let Some(parent) = parent else {
    error!(
        "Proposal's parent missing from storage with commitment: {:?}",
        justify_qc.leaf_commitment()
    );
    return;
This will reintroduce the error we fixed last week: the replica will not be able to vote and will time out the next view because the proposal isn't stored.
The latest push should fix this. Instead of returning right away here, it first updates the state map and then returns. Let me know if it's still incorrect.
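A minimal, self-contained sketch of that ordering, with a plain BTreeMap standing in for the consensus state map; every name below is an illustrative assumption, not code from this PR:

use std::collections::BTreeMap;

// Illustrative stand-in for a state-map entry; not HotShot's actual type.
#[derive(Debug)]
enum ViewEntry {
    Leaf(String),
}

fn handle_proposal(
    state_map: &mut BTreeMap<u64, ViewEntry>,
    view: u64,
    parent: Option<String>,
    leaf: String,
) {
    let Some(_parent) = parent else {
        eprintln!("proposal's parent missing from storage");
        // Record the view BEFORE returning, so the replica can still
        // time out this view instead of stalling with no entry for it.
        state_map.insert(view, ViewEntry::Leaf(leaf));
        return;
    };
    // ...normal vote path; the happy path also records the view...
    state_map.insert(view, ViewEntry::Leaf(leaf));
}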
crates/testing/tests/timeout.rs
Outdated
// TODO ED Reduce down to 5 nodes once memory network issues are resolved
// https://github.com/EspressoSystems/HotShot/issues/1790
let mut metadata = TestMetadata {
    total_nodes: 10,
    start_nodes: 10,
    ..Default::default()
};
I think these issues are fixed? Can we run this with 5 nodes or are you still hitting an issue?
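If the memory network issue really is fixed, resolving the TODO is presumably just shrinking the counts in the snippet above:

let mut metadata = TestMetadata {
    total_nodes: 5,
    start_nodes: 5,
    ..Default::default()
};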
I still get the same error. I actually ran the timeout test on the develop branch, and the same thing happens: the test works with the web server network but fails with the memory network. Looking at the logs, it seems some messages just don't make it to the nodes the way they're supposed to.
Hmm, ok, I'll look into this one.
I did update the timeout test to run with both libp2p and the web server, though, just to ensure something weird wasn't also happening with one of them.
/// respectively.
///
/// TODO GG used only in election.rs; move this to there and make it private?
pub struct VoteAccumulator<TOKEN, COMMITMENT: CommitmentBounds, TYPES: NodeType> {
Love getting rid of this!
    TOKEN: Clone + VoteToken,
{
    #![allow(clippy::too_many_lines)]
    fn append(
I see append implemented for TimeCertificate. Did we already have it for the other certificate types? I can't find it in this PR and am confused, since it obviously works.
Yes, they were in a previous PR!
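For readers outside this thread, the general shape of such an append is accumulate-until-threshold: collect distinct votes and emit a certificate once enough arrive. A generic, self-contained illustration with placeholder names, not HotShot's actual signature:

use std::collections::HashSet;

struct Accumulator {
    voters: HashSet<u64>, // node ids that have already voted
    threshold: usize,
}

enum Outcome {
    // Not enough votes yet; hand the accumulator back.
    Pending(Accumulator),
    // Threshold reached; the voter set stands in for a certificate.
    Certificate(Vec<u64>),
}

impl Accumulator {
    fn append(mut self, voter_id: u64) -> Outcome {
        self.voters.insert(voter_id);
        if self.voters.len() >= self.threshold {
            Outcome::Certificate(self.voters.into_iter().collect())
        } else {
            Outcome::Pending(self)
        }
    }
}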
Except for some dead code, all LGTM!
@@ -54,6 +54,7 @@ impl BLSPrivKey {
    }
}

// #[allow(clippy::incorrect_partial_ord_impl_on_ord_type)]
Do we want to get rid of this dead code?
I kept this in because my local clippy needed it. I expect once CI's clippy is upgraded we'll need to uncomment these. :)
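For context, the lint in question concerns PartialOrd on a type that also implements Ord. A self-contained illustration of the canonical form newer clippy expects; this is my example, not code from this PR:

use std::cmp::Ordering;

#[derive(PartialEq, Eq)]
struct ViewNumber(u64);

impl Ord for ViewNumber {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.cmp(&other.0)
    }
}

// Newer clippy wants partial_cmp on an Ord type to delegate to cmp;
// a hand-rolled comparison here trips the lint, while older clippy
// doesn't recognize the lint name inside #[allow], hence keeping the
// attribute commented out until CI's clippy is upgraded.
impl PartialOrd for ViewNumber {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}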
@@ -27,6 +27,7 @@ pub struct BLSPubKey {
    pub_key: VerKey,
}

// #[allow(clippy::incorrect_partial_ord_impl_on_ord_type)]
Also dead code here?
@@ -136,6 +136,9 @@ impl<S: Default + Debug> NetworkNodeHandle<S> {
    ///
    /// Will panic if a handler is already spawned
    #[allow(clippy::unused_async)]
    // // Tokio and async_std disagree how this function should be linted
    // #[allow(clippy::ignored_unit_patterns)]
Also dead code here?
    error!(?proposal.signature, "Could not verify proposal.");
    return;
}
return;
In the fix branch we were updating the view here, to avoid timing out in a view that might actually have succeeded. I'm not 100% sure either way, so I'll leave it up to you. If we don't update the view, as in this PR, then the view will time out and the network might move on to future views without us, leaving us stuck trying to catch back up.
If we do update the view, then we might time out on a future view when everyone else times out, because this is a bogus QC that nobody voted on.
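A hedged sketch of the two behaviors being weighed; all types and names here are placeholders, not this PR's API:

// Placeholder proposal type for illustration only.
struct Proposal {
    view_number: u64,
    signature_valid: bool,
}

fn on_proposal(current_view: &mut u64, p: &Proposal, advance_on_bogus: bool) {
    if !p.signature_valid {
        if advance_on_bogus {
            // Fix-branch behavior: advance anyway, so we time out the same
            // view as everyone else, at the cost that a bogus QC nobody
            // voted on can drag us into a future view early.
            *current_view = p.view_number;
        }
        // This PR's behavior: keep the old view, at the cost that if the
        // network moves on, this node falls behind and must catch back up.
        return;
    }
    // ...verify and vote as usual...
}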
If, after this is deployed, we see one node constantly behind, this is probably the culprit. On the other hand, if we see nodes entering view sync at different times in the current deployment, this change would fix it.
This PR:
- Uses the QCFormed event instead of the ViewChange event
- Extends the VoteType trait with more getter functions
- Renames from_signatures_and_commitment to create_certificate
- Adds a TimeoutExchange to facilitate TimeoutVotes, but does not expose it outside of NodeImplementation. We can do the same for the other exchanges in this issue: Remove Exchanges #1799

This PR does not:

Use just async_std test_timeout to run the timeout test.