
Fix Hanging CI #3697

Merged: 47 commits merged into main from bf/test-hang-fix on Sep 27, 2024

Conversation

bfish713 (Collaborator) commented Sep 23, 2024

This PR:

Finally fixes the test hanging issue. The main problem was that the test task implementation could enter an infinite loop. The issue was exposed because one of the test tasks takes the internal event streams. The task is just a while loop that selects the next event from whichever event receiver it holds. In the restart test we create new internal senders for the new nodes and kill all of the senders of the old nodes, so the receivers held by the test task are all closed. Each time we select the next event from one of those receivers we immediately get an error, so the loop becomes a hot loop that constantly processes the error and never actually awaits. This spin loop starves the executor, so none of the other async code can run. It was easy to reproduce by limiting the tokio executor to a single thread.
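
Below is a minimal, self-contained sketch of that failure mode, not the actual test-task code; it uses the async_broadcast crate purely for illustration. Once every sender is dropped, recv() resolves immediately with a Closed error, so the .await never suspends and nothing else on the executor gets a chance to run.

// Illustrative only; assumed deps: async-broadcast and futures.
use async_broadcast::{broadcast, RecvError};
use futures::executor::block_on;

fn main() {
    block_on(async {
        // Mimic the restart test: all of the old nodes' senders get dropped.
        let (tx, mut rx) = broadcast::<u64>(16);
        drop(tx);

        let mut spins: u64 = 0;
        loop {
            match rx.recv().await {
                Ok(event) => println!("event: {event}"),
                Err(RecvError::Closed) => {
                    // With the sender gone this future is immediately ready on
                    // every iteration, so the loop never yields: a hot loop.
                    spins += 1;
                    if spins == 1_000_000 {
                        println!("spun {spins} times without ever suspending");
                        break; // bounded only so the example terminates
                    }
                }
                Err(e) => eprintln!("recv error: {e:?}"),
            }
        }
    });
}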

In the process of debugging this I found a few other issues that I fixed:

  • In the spinning test task I now spawn all the new nodes in parallel, so they start at roughly the same time (a small sketch of this follows the list).
  • When we create consensus, we spawned the first timeout task before start_consensus was called; if it took a while to call start_consensus, we could time out before even getting the genesis event. I fixed that by starting the first timeout task when we start consensus.
  • We remove a failed view from the failed-view list if every node that failed it eventually succeeds it, but in some tests nodes might go down or not decide the view before the test completes, so I changed it to remove the failed view once a quorum of nodes succeed it. (This scenario should not be possible, but I think it may also have been fixed by the timeout change.)
  • The completion task was trying to shut down nodes, but that logic is already handled when the test finishes, so it was duplicated.
  • Rebalances the CI jobs; they should each take about 10 minutes now.
  • Updates the justfile to log everything at info. That's the level we log at in production, so it shouldn't be too spammy. The previous justfile only showed logs from consensus starting up, which was usually not helpful.
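
Here is a minimal sketch of the parallel-spawn idea from the first bullet; start_node is a hypothetical stand-in, not the actual spinning_task code or a HotShot API.

// Illustrative only; assumed dep: futures.
use futures::executor::block_on;
use futures::future::join_all;

// Hypothetical stand-in for bringing a restarted node back up.
async fn start_node(id: usize) {
    println!("starting node {id}");
}

fn main() {
    block_on(async {
        let ids = vec![0, 1, 2, 3];
        // A sequential loop would stagger the startups:
        //     for id in ids { start_node(id).await; }
        // Joining the futures runs all the startups concurrently, so the
        // restarted nodes come up at roughly the same time.
        join_all(ids.into_iter().map(start_node)).await;
    });
}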

This PR does not:

Fix the underlying issue that some test tasks are holding dead receivers when all nodes are restarted. I think it's just the view sync task and we should probably not spawn it for the restart tests.

Key places to review:

Changes to spinning_task, the new initial timeout logic, and changes to test_task

bfish713 changed the title from Bf/test hang fix to Fix Hanging CI on Sep 26, 2024
bfish713 marked this pull request as ready for review on September 26, 2024 03:28
bfish713 (Collaborator, Author):

This has gone through at least 5 or 6 consecutive CI successes without a failure, and more like 10 if you ignore the marketplace test failures, which I believe were unrelated.

lukeiannucci (Contributor) left a comment:

lgtm

ss-es (Contributor) left a comment:

can't imagine how annoying this must've been to track down! left just a few comments, I'll try to take a closer look tomorrow

@@ -53,31 +53,31 @@ test *ARGS:

test-ci *ARGS:
echo Testing {{ARGS}}
RUST_LOG=error,hotshot=debug,libp2p-networking=debug cargo test --lib --bins --tests --benches --workspace --no-fail-fast {{ARGS}} -- --test-threads=1
Contributor:

was the log level change here intentional or leftover from debugging?

bfish713 (Collaborator, Author):

Sorry, ignore the now-deleted comment. This change was intentional. The issue was that before, we'd get a ton of logs from network and hotshot startup and no logs from the actual consensus impl once the nodes started (I think "Starting HotShot" was usually the last log), so it wasn't very helpful. There is probably a better version we could do, but I think info level for everything is a good balance of information and noise for debugging.

.inspect_err(|e| tracing::error!("{e}"));
}
Ok((Err(e), _id, _)) => {
error!("Error from one channel in test task {:?}", e);
Contributor:

not sure if this would be better, but a thought: could we match and drop the receiver here? e.g. Ok((Err(RecvError::Closed), id, _)) => { self.receivers.remove(id) }

(this might require also breaking at the start if self.receivers.is_empty())

bfish713 (Collaborator, Author):

yeah I want to do that, but it would change the index of the receiver and might lead to some weird behaviours. I plan to follow this up with a few more cleanups and will store the receivers with their node_id somehow in this task (or replace them with empty or new receivers if possible)
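
One possible shape for that follow-up, sketched with made-up names (the real test task stores its receivers differently): keep each receiver in an Option slot so a closed receiver can be dropped without shifting the indices of the others, and let the loop break once every slot is empty.

// Illustrative only; assumed dep: async-broadcast.
use async_broadcast::Receiver;

// Hypothetical container for the test task's per-node receivers.
struct NodeReceivers<E> {
    slots: Vec<Option<Receiver<E>>>,
}

impl<E> NodeReceivers<E> {
    // Drop the receiver for node `id` without shifting anyone else's index.
    fn drop_receiver(&mut self, id: usize) {
        self.slots[id] = None;
    }

    // Once every slot is empty the task loop can break instead of spinning
    // on closed channels.
    fn all_closed(&self) -> bool {
        self.slots.iter().all(Option::is_none)
    }
}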

Comment on lines +379 to +390
// Spawn a task that will sleep for the next view timeout and then send a timeout event
// if not cancelled
async_spawn({
    async move {
        async_sleep(Duration::from_millis(next_view_timeout)).await;
        broadcast_event(
            Arc::new(HotShotEvent::Timeout(start_view + 1)),
            &event_stream,
        )
        .await;
    }
});
Contributor:

How do we cancel the task spawned here? Probably I'm missing something.

bfish713 (Collaborator, Author):

We don't, but it can only run for the timeout duration; then it'll just send an event into a closed stream (and error) if the node is shut down before the timeout. The timeout itself will be ignored if progress is made. I think cleaning this up so we can cancel on view change is a good idea though.
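
For reference, here is one way that cancel-on-view-change cleanup could look; this is a sketch under assumed types (a plain tokio mpsc channel and a made-up Timeout event standing in for HotShotEvent::Timeout), not the current HotShot code.

// Illustrative only; assumed dep: tokio with the rt, macros, sync, and time features.
use std::time::Duration;
use tokio::{select, sync::oneshot, time::sleep};

// Hypothetical event standing in for HotShotEvent::Timeout(view).
#[derive(Debug)]
struct Timeout(u64);

// Spawn a timeout task that can be cancelled: firing (or dropping) the returned
// sender, e.g. on a view change or shutdown, stops the timeout from being sent.
fn spawn_cancellable_timeout(
    view: u64,
    duration: Duration,
    events: tokio::sync::mpsc::Sender<Timeout>,
) -> oneshot::Sender<()> {
    let (cancel_tx, cancel_rx) = oneshot::channel::<()>();
    tokio::spawn(async move {
        select! {
            // The view is still pending after the timeout duration: emit the event.
            _ = sleep(duration) => { let _ = events.send(Timeout(view)).await; }
            // The view changed (or the node shut down) first: do nothing.
            _ = cancel_rx => {}
        }
    });
    cancel_tx
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = tokio::sync::mpsc::channel(8);
    let cancel = spawn_cancellable_timeout(1, Duration::from_millis(50), tx);
    // Pretend the view advanced before the timeout fired.
    let _ = cancel.send(());
    match rx.recv().await {
        Some(t) => println!("timeout fired anyway: {t:?}"),
        None => println!("timeout was cancelled before it fired"),
    }
}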

bfish713 merged commit ce7c0a3 into main on Sep 27, 2024
36 checks passed
bfish713 deleted the bf/test-hang-fix branch on September 27, 2024 13:46
rob-maron pushed a commit that referenced this pull request Sep 27, 2024
* shutdown completion task last

* fix shutdown order

* fmt

* log everything at info when test fails

* clear failed, logging

* fix build

* different log level

* no capture again

* typo

* move logging + do startups in parallel

* fmt

* change initial timeout

* remove nocapture

* nocapture again

* fix

* only log nodes started when there are nodes starting

* log exit

* log when timeout starts

* log id and view

* only shutdown from 1 place

* fix build, remove handles from completetion task

* leave one up in cdn test

* more logs, less threads, maybe fix?

* actual fix

* lint fmt

* allow more than 1 thread, tweaks

* remove nocapture

* move byzantine tests to ci-3

* rebalance tests more

* one more test to 4

* only spawn first timeout when starting consensus

* cleanup

* fix justfile lint tokio

* fix justfil

* sleep longer, nocapture to debug

* info

* fix another hot loop maybe

* don't spawn r/r tasks for cdn that do nothing

* lint no sleep

* lower log level in libp2p

* lower builder test threshold

* remove nocapture for the last time, hopefully

* remove cleanup_previous_timeouts_on_view
Comment on lines -113 to -139
HotShotEvent::QuorumProposalRequestRecv(req, signature) => {
    // Make sure that this request came from who we think it did
    ensure!(
        req.key.validate(signature, req.commit().as_ref()),
        "Invalid signature key on proposal request."
    );

    if let Some(quorum_proposal) = self
        .state
        .read()
        .await
        .last_proposals()
        .get(&req.view_number)
    {
        broadcast_event(
            HotShotEvent::QuorumProposalResponseSend(
                req.key.clone(),
                quorum_proposal.clone(),
            )
            .into(),
            sender,
        )
        .await;
    }

    Ok(())
}
Collaborator:

Here
