
Fix Hanging CI #3697

Merged: 47 commits merged into main from bf/test-hang-fix on Sep 27, 2024

Conversation

bfish713 (Collaborator) commented Sep 23, 2024

This PR:

Finally fixes the test hanging issue. The main problem was that the test task implementation could enter an infinite loop. The issue was exposed because one of the test tasks takes the internal event streams. The task is just a while loop that selects the next event from whichever event receiver it holds. In the restart test we create new internal senders for the new nodes and kill all of the senders of the old nodes, so the receivers held by the test task are all closed. Each time we select the next event from one of those receivers we immediately get an error, so the loop becomes a hot loop that constantly processes the error and never actually awaits. This spin loop starves the executor, so none of the other async code can run. It was easy to reproduce by limiting the tokio executor to a single thread.
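
Below is a minimal, self-contained sketch of that failure mode, not the actual test-task code; it uses the async_broadcast crate purely for illustration. Once every sender is dropped, recv() resolves immediately with a Closed error, so the .await never suspends and nothing else on the executor gets a chance to run.

// Illustrative only; assumed deps: async-broadcast and futures.
use async_broadcast::{broadcast, RecvError};
use futures::executor::block_on;

fn main() {
    block_on(async {
        // Mimic the restart test: all of the old nodes' senders get dropped.
        let (tx, mut rx) = broadcast::<u64>(16);
        drop(tx);

        let mut spins: u64 = 0;
        loop {
            match rx.recv().await {
                Ok(event) => println!("event: {event}"),
                Err(RecvError::Closed) => {
                    // With the sender gone this future is immediately ready on
                    // every iteration, so the loop never yields: a hot loop.
                    spins += 1;
                    if spins == 1_000_000 {
                        println!("spun {spins} times without ever suspending");
                        break; // bounded only so the example terminates
                    }
                }
                Err(e) => eprintln!("recv error: {e:?}"),
            }
        }
    });
}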

In the process of debugging this I found a few other issues that I fixed:

  • In the spinning test task I now spawn all the new nodes in parallel, so they start at roughly the same time (a small sketch of this follows the list).
  • When we create consensus, we spawned the first timeout task before start_consensus was called; if it took a while to call start_consensus, we could time out before even getting the genesis event. I fixed that by starting the first timeout task when we start consensus.
  • We remove a failed view from the failed-view list if every node that failed it eventually succeeds it, but in some tests nodes might go down or not decide the view before the test completes, so I changed it to remove the failed view once a quorum of nodes succeed it. (This scenario should not be possible, but I think it may also have been fixed by the timeout change.)
  • The completion task was trying to shut down nodes, but that logic is already handled when the test finishes, so it was duplicated.
  • Rebalances the CI jobs; they should each take about 10 minutes now.
  • Updates the justfile to log everything at info. That's the level we log at in production, so it shouldn't be too spammy. The previous justfile only showed logs from consensus starting up, which was usually not helpful.
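
Here is a minimal sketch of the parallel-spawn idea from the first bullet; start_node is a hypothetical stand-in, not the actual spinning_task code or a HotShot API.

// Illustrative only; assumed dep: futures.
use futures::executor::block_on;
use futures::future::join_all;

// Hypothetical stand-in for bringing a restarted node back up.
async fn start_node(id: usize) {
    println!("starting node {id}");
}

fn main() {
    block_on(async {
        let ids = vec![0, 1, 2, 3];
        // A sequential loop would stagger the startups:
        //     for id in ids { start_node(id).await; }
        // Joining the futures runs all the startups concurrently, so the
        // restarted nodes come up at roughly the same time.
        join_all(ids.into_iter().map(start_node)).await;
    });
}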

This PR does not:

Fix the underlying issue that some test tasks are holding dead receivers when all nodes are restarted. I think it's just the view sync task and we should probably not spawn it for the restart tests.

Key places to review:

Changes to spinning_task, the new initial timeout logic, and changes to test_task

bfish713 changed the title from Bf/test hang fix to Fix Hanging CI on Sep 26, 2024
bfish713 marked this pull request as ready for review on September 26, 2024 03:28
bfish713 (Collaborator, Author):

This has gone through at least 5 or 6 consecutive CI successes without a failure, and more like 10 if you ignore the marketplace test failures, which I believe were unrelated.

lukeiannucci (Contributor) left a comment:

lgtm

ss-es (Contributor) left a comment:

can't imagine how annoying this must've been to track down! left just a few comments, I'll try to take a closer look tomorrow

@@ -53,31 +53,31 @@ test *ARGS:

test-ci *ARGS:
echo Testing {{ARGS}}
RUST_LOG=error,hotshot=debug,libp2p-networking=debug cargo test --lib --bins --tests --benches --workspace --no-fail-fast {{ARGS}} -- --test-threads=1
Contributor:

was the log level change here intentional or leftover from debugging?

bfish713 (Collaborator, Author):

Sorry, ignore the now-deleted comment. This change was intentional. The issue was that before, we'd get a ton of logs from network and hotshot startup and no logs from the actual consensus impl once the nodes started (I think "Starting HotShot" was usually the last log), so it wasn't very helpful. There is probably a better version we could do, but I think info level for everything is a good balance of information and noise for debugging.

.inspect_err(|e| tracing::error!("{e}"));
}
Ok((Err(e), _id, _)) => {
error!("Error from one channel in test task {:?}", e);
Contributor:

not sure if this would be better, but a thought: could we match and drop the receiver here? e.g. Ok((Err(RecvError::Closed), id, _)) => { self.receivers.remove(id) }

(this might require also breaking at the start if self.receivers.is_empty())

bfish713 (Collaborator, Author):

yeah I want to do that, but it would change the index of the receiver and might lead to some weird behaviours. I plan to follow this up with a few more cleanups and will store the receivers with their node_id somehow in this task (or replace them with empty or new receivers if possible)
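
One possible shape for that follow-up, sketched with made-up names (the real test task stores its receivers differently): keep each receiver in an Option slot so a closed receiver can be dropped without shifting the indices of the others, and let the loop break once every slot is empty.

// Illustrative only; assumed dep: async-broadcast.
use async_broadcast::Receiver;

// Hypothetical container for the test task's per-node receivers.
struct NodeReceivers<E> {
    slots: Vec<Option<Receiver<E>>>,
}

impl<E> NodeReceivers<E> {
    // Drop the receiver for node `id` without shifting anyone else's index.
    fn drop_receiver(&mut self, id: usize) {
        self.slots[id] = None;
    }

    // Once every slot is empty the task loop can break instead of spinning
    // on closed channels.
    fn all_closed(&self) -> bool {
        self.slots.iter().all(Option::is_none)
    }
}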

Comment on lines +379 to +390
// Spawn a task that will sleep for the next view timeout and then send a timeout event
// if not cancelled
async_spawn({
    async move {
        async_sleep(Duration::from_millis(next_view_timeout)).await;
        broadcast_event(
            Arc::new(HotShotEvent::Timeout(start_view + 1)),
            &event_stream,
        )
        .await;
    }
});
Contributor:

How do we cancel the task spawned here? Probably I'm missing something.

bfish713 (Collaborator, Author):

We don't, but it can only run for the timeout duration; then it'll just send an event into a closed stream (and error) if the node is shut down before the timeout. The timeout itself will be ignored if progress is made. I think cleaning this up so we can cancel on view change is a good idea though.
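
For reference, here is one way that cancel-on-view-change cleanup could look; this is a sketch under assumed types (a plain tokio mpsc channel and a made-up Timeout event standing in for HotShotEvent::Timeout), not the current HotShot code.

// Illustrative only; assumed dep: tokio with the rt, macros, sync, and time features.
use std::time::Duration;
use tokio::{select, sync::oneshot, time::sleep};

// Hypothetical event standing in for HotShotEvent::Timeout(view).
#[derive(Debug)]
struct Timeout(u64);

// Spawn a timeout task that can be cancelled: firing (or dropping) the returned
// sender, e.g. on a view change or shutdown, stops the timeout from being sent.
fn spawn_cancellable_timeout(
    view: u64,
    duration: Duration,
    events: tokio::sync::mpsc::Sender<Timeout>,
) -> oneshot::Sender<()> {
    let (cancel_tx, cancel_rx) = oneshot::channel::<()>();
    tokio::spawn(async move {
        select! {
            // The view is still pending after the timeout duration: emit the event.
            _ = sleep(duration) => { let _ = events.send(Timeout(view)).await; }
            // The view changed (or the node shut down) first: do nothing.
            _ = cancel_rx => {}
        }
    });
    cancel_tx
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = tokio::sync::mpsc::channel(8);
    let cancel = spawn_cancellable_timeout(1, Duration::from_millis(50), tx);
    // Pretend the view advanced before the timeout fired.
    let _ = cancel.send(());
    match rx.recv().await {
        Some(t) => println!("timeout fired anyway: {t:?}"),
        None => println!("timeout was cancelled before it fired"),
    }
}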

bfish713 merged commit ce7c0a3 into main on Sep 27, 2024
36 checks passed
bfish713 deleted the bf/test-hang-fix branch on September 27, 2024 13:46
rob-maron pushed a commit that referenced this pull request Sep 27, 2024
* shutdown completion task last

* fix shutdown order

* fmt

* log everything at info when test fails

* clear failed, logging

* fix build

* different log level

* no capture again

* typo

* move logging + do startups in parallel

* fmt

* change initial timeout

* remove nocapture

* nocapture again

* fix

* only log nodes started when there are nodes starting

* log exit

* log when timeout starts

* log id and view

* only shutdown from 1 place

* fix build, remove handles from completetion task

* leave one up in cdn test

* more logs, less threads, maybe fix?

* actual fix

* lint fmt

* allow more than 1 thread, tweaks

* remove nocapture

* move byzantine tests to ci-3

* rebalance tests more

* one more test to 4

* only spawn first timeout when starting consensus

* cleanup

* fix justfile lint tokio

* fix justfil

* sleep longer, nocapture to debug

* info

* fix another hot loop maybe

* don't spawn r/r tasks for cdn that do nothing

* lint no sleep

* lower log level in libp2p

* lower builder test threshold

* remove nocapture for the last time, hopefully

* remove cleanup_previous_timeouts_on_view
Comment on lines -113 to -139
HotShotEvent::QuorumProposalRequestRecv(req, signature) => {
    // Make sure that this request came from who we think it did
    ensure!(
        req.key.validate(signature, req.commit().as_ref()),
        "Invalid signature key on proposal request."
    );

    if let Some(quorum_proposal) = self
        .state
        .read()
        .await
        .last_proposals()
        .get(&req.view_number)
    {
        broadcast_event(
            HotShotEvent::QuorumProposalResponseSend(
                req.key.clone(),
                quorum_proposal.clone(),
            )
            .into(),
            sender,
        )
        .await;
    }

    Ok(())
}
Collaborator:

Here
