
Ed/timeout #1794

Merged 59 commits into develop on Oct 12, 2023

Conversation

@elliedavidson (Member) commented Sep 21, 2023

This PR:

  • Starts HotShot with a QCFormed event instead of a ViewChange event
  • Updates view logic so that replicas enter a new view when they see a QC, TC, or ViewSyncCert for the previous view. They no longer increment their view after successfully voting on a proposal (see this thread for more detail)
  • Updates view sync logic to handle the new view change logic: view sync is now triggered after 3 timeouts. The first timeout indicates we didn't see a QC for the next view and should send a timeout certificate; the second timeout means we are still waiting for that timeout certificate; the third timeout means a second leader in a row is unresponsive, so we should enter view sync (see the sketch after this list)
  • Removes lots of dead code
  • Replaces "2" types with their native types
  • Updates the VoteType trait with more getter functions
  • Renames from_signatures_and_commitment to create_certificate
  • Adds a TimeoutExchange to facilitate TimeoutVotes but does not expose it outside of NodeImplementation. We can do the same for the other exchanges in this issue: Remove Exchanges #1799
  • Adds timeout tests for both the web server and libp2p networks
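
To make the new view-change and timeout rules concrete, here is a minimal Rust sketch. All names here (ViewNumber, Cert, Replica) are hypothetical simplifications for illustration, not HotShot's actual API:

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ViewNumber(u64);

enum Cert {
    Quorum(ViewNumber),
    Timeout(ViewNumber),
    ViewSync(ViewNumber),
}

struct Replica {
    current_view: ViewNumber,
    consecutive_timeouts: u32,
}

impl Replica {
    // Advance the view only on a certificate for the current (or a later)
    // view; voting on a proposal no longer increments the view.
    fn on_certificate(&mut self, cert: Cert) {
        let for_view = match cert {
            Cert::Quorum(v) | Cert::Timeout(v) | Cert::ViewSync(v) => v,
        };
        if for_view >= self.current_view {
            self.current_view = ViewNumber(for_view.0 + 1);
            self.consecutive_timeouts = 0;
        }
    }

    // Returns true when the third consecutive timeout should trigger view sync.
    fn on_timeout(&mut self) -> bool {
        self.consecutive_timeouts += 1;
        match self.consecutive_timeouts {
            1 => false, // no QC for the next view: send a timeout certificate
            2 => false, // still waiting for that timeout certificate
            _ => true,  // second leader in a row unresponsive: enter view sync
        }
    }
}

The key point is that voting alone never advances the view; only evidence that the previous view concluded (a QC, TC, or ViewSyncCert) does.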

This PR does not:

  • Use "just async_std test_timeout" to run the timeout test

Comment on lines 721 to 726
let Some(parent) = parent else {
    error!(
        "Proposal's parent missing from storage with commitment: {:?}",
        justify_qc.leaf_commitment()
    );
    return;
Collaborator:

This will reintroduce the error we fixed last week: the replica will not be able to vote and will time out in the next view because the proposal isn't stored.

Member Author:

The latest push should fix this. Instead of returning right away here, it first updates the state map and then returns. Let me know if it's still incorrect.
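
A minimal, self-contained sketch of the fix described above, using hypothetical types rather than HotShot's actual state-map API: the failed view is recorded before returning, so later view-change logic still knows about it.

use std::collections::BTreeMap;

type ViewNumber = u64;

#[derive(Debug)]
enum ViewState {
    // We saw a proposal for this view but couldn't fully validate it.
    Failed,
}

// Hypothetical handler: instead of bailing out as soon as the parent is
// missing, record the view in the state map first, then return.
fn handle_missing_parent(state_map: &mut BTreeMap<ViewNumber, ViewState>, view: ViewNumber) {
    eprintln!("Proposal's parent missing from storage for view {view}");
    state_map.insert(view, ViewState::Failed);
    // Returning here is now safe: the view is tracked, so the replica can
    // still time out of it and move on rather than getting stuck.
}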

crates/task-impls/src/da.rs (outdated; conversation resolved)
Comment on lines 27 to 33
// TODO ED Reduce down to 5 nodes once memory network issues are resolved
// https://github.com/EspressoSystems/HotShot/issues/1790
let mut metadata = TestMetadata {
    total_nodes: 10,
    start_nodes: 10,
    ..Default::default()
};
Collaborator:

I think these issues are fixed? Can we run this with 5 nodes or are you still hitting an issue?

Member Author:

I still get the same error. I actually ran the timeout test on the develop branch and the same thing happens: the test works with the web server network but fails with the memory network. Looking at the logs, it seems some messages just don't make it to the nodes as they should.

Collaborator:

Hmm ok I'll look into this one

Member Author:

I did update the timeout_test to run on both libp2p and the web server, though, just to ensure something weird wasn't also happening with one of them.
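
As an illustration of running the same test over both networks, here is a rough sketch of parameterizing one timeout test over two backends. The marker types and runner are hypothetical placeholders, not HotShot's actual testing harness (the real test uses the TestMetadata builder shown above):

#[derive(Clone, Copy)]
struct TestMetadata {
    total_nodes: usize,
    start_nodes: usize,
}

trait NetworkBackend {
    const NAME: &'static str;
}

struct WebServerNetwork;
struct Libp2pNetwork;

impl NetworkBackend for WebServerNetwork {
    const NAME: &'static str = "web server";
}
impl NetworkBackend for Libp2pNetwork {
    const NAME: &'static str = "libp2p";
}

// Hypothetical runner: launch the test and check that the network
// recovers after a leader times out.
fn run_timeout_test<N: NetworkBackend>(metadata: TestMetadata) {
    println!(
        "timeout test on {} with {} of {} nodes",
        N::NAME, metadata.start_nodes, metadata.total_nodes
    );
}

fn main() {
    let metadata = TestMetadata { total_nodes: 10, start_nodes: 10 };
    run_timeout_test::<WebServerNetwork>(metadata);
    run_timeout_test::<Libp2pNetwork>(metadata);
}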

crates/testing/tests/timeout.rs (conversation resolved)
/// respectively.
///
/// TODO GG used only in election.rs; move this to there and make it private?
pub struct VoteAccumulator<TOKEN, COMMITMENT: CommitmentBounds, TYPES: NodeType> {
Collaborator:

Love getting rid of this!

    TOKEN: Clone + VoteToken,
{
    #![allow(clippy::too_many_lines)]
    fn append(
Collaborator:

I see append implemented for TimeCertificate. Did we already have it for the other certificate types? I can't find it in this PR and am confused, since it obviously works.

Member Author:

Yes, they were in a previous PR!
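
For context, a simplified sketch of the accumulator pattern being discussed; the types and stake arithmetic here are hypothetical, not the actual VoteAccumulator API. append folds votes in until the accumulated stake crosses a threshold, at which point a certificate is produced:

enum Outcome<A, C> {
    // Not enough stake yet; keep accumulating.
    Pending(A),
    // Quorum reached; a certificate can be formed.
    Done(C),
}

struct Certificate {
    total_stake: u64,
}

struct Accumulator {
    votes: Vec<u64>, // stake weight of each collected vote
    threshold: u64,
}

impl Accumulator {
    // Consumes the accumulator, matching the "append until certificate" flow.
    fn append(mut self, stake: u64) -> Outcome<Accumulator, Certificate> {
        self.votes.push(stake);
        let total: u64 = self.votes.iter().sum();
        if total >= self.threshold {
            Outcome::Done(Certificate { total_stake: total })
        } else {
            Outcome::Pending(self)
        }
    }
}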

@dailinsubjam (Contributor) left a comment:

Except for some dead code, all LGTM!

@@ -54,6 +54,7 @@ impl BLSPrivKey {
}
}

// #[allow(clippy::incorrect_partial_ord_impl_on_ord_type)]
Contributor:

Do we want to get rid of this dead code?

Member Author:

I kept this in because my local clippy needed it. I expect once CI's clippy is upgraded we'll need to uncomment these. :)
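
To illustrate the situation with a hypothetical type: newer clippy versions flag a manual PartialOrd impl on a type that also implements Ord, so the allow stays commented out until CI's clippy recognizes the lint name.

use std::cmp::Ordering;

#[derive(PartialEq, Eq, Ord)]
struct Key(u64);

// Commented out until CI's clippy is new enough to know this lint name;
// a newer local clippy wants it enabled.
// #[allow(clippy::incorrect_partial_ord_impl_on_ord_type)]
impl PartialOrd for Key {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}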

@@ -27,6 +27,7 @@ pub struct BLSPubKey {
pub_key: VerKey,
}

// #[allow(clippy::incorrect_partial_ord_impl_on_ord_type)]
Contributor:

Also dead code here?

@@ -136,6 +136,9 @@ impl<S: Default + Debug> NetworkNodeHandle<S> {
///
/// Will panic if a handler is already spawned
#[allow(clippy::unused_async)]
// // Tokio and async_std disagree how this function should be linted
// #[allow(clippy::ignored_unit_patterns)]
Contributor:

Also dead code here?

@elliedavidson elliedavidson merged commit 9c5c83c into develop Oct 12, 2023
@elliedavidson elliedavidson deleted the ed/timeout branch October 12, 2023 19:36
    error!(?proposal.signature, "Could not verify proposal.");
    return;
}
return;
Collaborator:

In the fix branch we were updating the view here, to avoid timing out in a view that might have actually been successful. I'm not 100% sure either way, so I'll leave it up to you. If we don't update the view (as in this PR), then the view will time out, and the network might move on to future views without us, leaving us stuck trying to catch back up.

If we do update the view, then we might time out on a future view when everyone else times out, because this is a bogus QC that nobody voted on.
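
A schematic sketch of the two options, with hypothetical types; neither branch is claimed to be what HotShot actually does:

struct Replica {
    current_view: u64,
}

impl Replica {
    fn on_invalid_proposal(&mut self, proposal_view: u64, advance_on_invalid: bool) {
        if advance_on_invalid {
            // Fix-branch behavior: advance anyway, so we don't time out of a
            // view that may have succeeded for everyone else. Risk: a bogus
            // QC that nobody voted on drags us into a future view that then
            // times out.
            self.current_view = self.current_view.max(proposal_view);
        }
        // This PR's behavior: don't advance. Risk: the network moves on to
        // future views without us, and we fall behind trying to catch up.
    }
}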

Collaborator:

If, after this is deployed, we see one node constantly behind, this is probably the culprit. On the other hand, if we see nodes entering view sync at different times in the current deployment, this change would fix it.
