Fix Hanging CI #3697
Conversation
This has gone through at least 5 or 6 consecutive CI successes without a failure, and more like 10 if you ignore the marketplace test failures, which I believe were not related.
lgtm
can't imagine how annoying this must've been to track down! left just a few comments, I'll try to take a closer look tomorrow
@@ -53,31 +53,31 @@ test *ARGS:

test-ci *ARGS:
    echo Testing {{ARGS}}
    RUST_LOG=error,hotshot=debug,libp2p-networking=debug cargo test --lib --bins --tests --benches --workspace --no-fail-fast {{ARGS}} -- --test-threads=1
was the log level change here intentional or leftover from debugging?
Sorry, ignore the now-deleted comment. This change was intentional. The issue was that before, we'd get a ton of logs from network and HotShot startup and no logs from the actual consensus impl once the nodes started (I think "Starting HotShot" was usually the last log), so it wasn't very helpful. There is probably some better version we can do, but I do think info level for everything is a good balance of info and noise for debugging.
            .inspect_err(|e| tracing::error!("{e}"));
        }
        Ok((Err(e), _id, _)) => {
            error!("Error from one channel in test task {:?}", e);
not sure if this would be better, but a thought: could we match and drop the receiver here? e.g. Ok((Err(RecvError::Closed), id, _)) => { self.receivers.remove(id) } (this might also require breaking at the start if self.receivers.is_empty())
yeah I want to do that, but it would change the idx of the receivers and might lead to some weird behaviours. I plan to follow this up with a few more cleanups and will store the receivers with their node_id somehow in this task (or replace them with empty or new receivers if possible)
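A rough sketch of what that follow-up might look like (purely hypothetical, not the current implementation; it assumes the async_broadcast crate, and TestTask, Event, and poll_node are made-up names): keying the receivers by node id lets a closed one be dropped without shifting anyone else's index.

```rust
// Hypothetical sketch: receivers keyed by node id so a closed channel can be
// removed without disturbing the other receivers.
use std::collections::HashMap;

use async_broadcast::{broadcast, Receiver, RecvError};

type Event = u64;

struct TestTask {
    receivers: HashMap<u64, Receiver<Event>>,
}

impl TestTask {
    // Poll one node's receiver; if its channel is closed, drop it by key so
    // every other receiver keeps its id.
    async fn poll_node(&mut self, node_id: u64) {
        let Some(rx) = self.receivers.get_mut(&node_id) else { return };
        match rx.recv().await {
            Ok(event) => println!("node {node_id}: event {event}"),
            Err(RecvError::Closed) => {
                self.receivers.remove(&node_id);
            }
            Err(e) => eprintln!("node {node_id}: recv error {e:?}"),
        }
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, rx) = broadcast::<Event>(8);
    let mut task = TestTask {
        receivers: HashMap::from([(0, rx)]),
    };
    drop(tx); // simulate the node being shut down
    task.poll_node(0).await;
    assert!(task.receivers.is_empty());
}
```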
// Spawn a task that will sleep for the next view timeout and then send a timeout event
// if not cancelled
async_spawn({
    async move {
        async_sleep(Duration::from_millis(next_view_timeout)).await;
        broadcast_event(
            Arc::new(HotShotEvent::Timeout(start_view + 1)),
            &event_stream,
        )
        .await;
    }
});
How do we cancel the task spawned here? Probably I'm missing something.
We don't, but it can only run for the timeout duration; then it'll just send an event into a closed stream (and error) if the node is shut down before the timeout. The timeout itself will be ignored if progress is made. I think cleaning this up so we can cancel on view change is a good idea though.
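One possible shape for that cleanup (a sketch, not what this PR does): keep the spawned task's JoinHandle and abort it on view change. Plain tokio stands in for the async_spawn/async_sleep wrappers here, ViewTimeoutTask is a made-up name, and the real task would broadcast HotShotEvent::Timeout instead of printing.

```rust
// Sketch: a cancellable view timeout built on tokio's JoinHandle::abort.
use std::time::Duration;

use tokio::{task::JoinHandle, time::sleep};

struct ViewTimeoutTask {
    handle: Option<JoinHandle<()>>,
}

impl ViewTimeoutTask {
    // Spawn a task that fires a timeout for `view + 1` after the duration.
    fn spawn(next_view_timeout_ms: u64, view: u64) -> Self {
        let handle = tokio::spawn(async move {
            sleep(Duration::from_millis(next_view_timeout_ms)).await;
            println!("timeout fired for view {}", view + 1);
        });
        Self {
            handle: Some(handle),
        }
    }

    // Call on view change so a stale timeout never fires.
    fn cancel(&mut self) {
        if let Some(handle) = self.handle.take() {
            handle.abort();
        }
    }
}

#[tokio::main]
async fn main() {
    let mut timeout = ViewTimeoutTask::spawn(1_000, 1);
    // Pretend the view advanced before the timeout elapsed.
    timeout.cancel();
    sleep(Duration::from_millis(1_100)).await;
    // Nothing is printed: the aborted task never fired.
}
```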
* shutdown completion task last
* fix shutdown order
* fmt
* log everything at info when test fails
* clear failed, logging
* fix build
* different log level
* no capture again
* typo
* move logging + do startups in parallel
* fmt
* change initial timeout
* remove nocapture
* nocapture again
* fix
* only log nodes started when there are nodes starting
* log exit
* log when timeout starts
* log id and view
* only shutdown from 1 place
* fix build, remove handles from completetion task
* leave one up in cdn test
* more logs, less threads, maybe fix?
* actual fix
* lint fmt
* allow more than 1 thread, tweaks
* remove nocapture
* move byzantine tests to ci-3
* rebalance tests more
* one more test to 4
* only spawn first timeout when starting consensus
* cleanup
* fix justfile lint tokio
* fix justfil
* sleep longer, nocapture to debug
* info
* fix another hot loop maybe
* don't spawn r/r tasks for cdn that do nothing
* lint no sleep
* lower log level in libp2p
* lower builder test threshold
* remove nocapture for the last time, hopefully
* remove cleanup_previous_timeouts_on_view
HotShotEvent::QuorumProposalRequestRecv(req, signature) => {
    // Make sure that this request came from who we think it did
    ensure!(
        req.key.validate(signature, req.commit().as_ref()),
        "Invalid signature key on proposal request."
    );

    if let Some(quorum_proposal) = self
        .state
        .read()
        .await
        .last_proposals()
        .get(&req.view_number)
    {
        broadcast_event(
            HotShotEvent::QuorumProposalResponseSend(
                req.key.clone(),
                quorum_proposal.clone(),
            )
            .into(),
            sender,
        )
        .await;
    }

    Ok(())
}
This PR:
Finally fix the test hanging issue. The main problem was that the test task implementation could enter an infinite loop. It was exposed because one of the test tasks takes the internal event streams. The task is just a while loop that selects the next event from any event receiver it has. In the restart test we create new internal senders for the new nodes and drop all of the senders for the old nodes, so the receivers held by the test task are all closed. Each time we select one of the receivers' next events we immediately get an error, and the loop becomes a hot loop that constantly processes the error without ever awaiting. This spin loop starves the executor, so none of the other async code can run. It was easy to reproduce by limiting the tokio executor to a single thread.
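To make the failure mode concrete, here is a minimal sketch (assuming the async_broadcast crate and a current-thread tokio runtime; this is not the actual test task code): once every sender is dropped, recv() resolves immediately with an error, so a loop that only logs the error never hits a real await point unless it breaks or drops the receiver.

```rust
// Minimal reproduction of the hot loop: a closed channel makes recv()
// return Ready immediately, so the loop never yields to the executor.
use async_broadcast::{broadcast, RecvError};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, mut rx) = broadcast::<u64>(16);

    // Simulate the restart test: the old nodes' senders are dropped, so the
    // receiver held by the test task is closed.
    drop(tx);

    loop {
        match rx.recv().await {
            Ok(event) => println!("got event {event}"),
            Err(RecvError::Closed) => {
                // Without this break the loop spins forever, starving the
                // single-threaded executor, because recv() on a closed
                // channel resolves without ever yielding.
                println!("all senders dropped, exiting");
                break;
            }
            Err(e) => eprintln!("recv error: {e:?}"),
        }
    }
}
```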
In the process of debugging this I found a few other issues that I fixed:
Before we start consensus we spawn the first timeout tasks; if it takes a while to call start_consensus we could get a timeout before even getting the event for genesis. I fixed that by starting the first timeout task when we start consensus.

This PR does not:
Fix the underlying issue that some test tasks are holding dead receivers when all nodes are restarted. I think it's just the view sync task and we should probably not spawn it for the restart tests.
Key places to review:
Changes to spinning_task, the new initial timeout logic, and changes to test_task.