
CI failure after upgrading actix #3925

Closed
bowenwang1996 opened this issue Feb 7, 2021 · 9 comments

@bowenwang1996
Collaborator

We occasionally see a CI failure like the following after #3869:


[2021-02-05T06:41:18Z] test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

[2021-02-05T06:41:18Z]
[2021-02-05T06:41:18Z] pure virtual method called
[2021-02-05T06:41:18Z] terminate called without an active exception
[2021-02-05T06:41:18Z] error: test failed, to rerun pass '-p neard --test run_nodes'

It looks like even though the tests passed, something went wrong afterwards.

@matklad
Contributor

matklad commented Feb 8, 2021

Oh dear.

pure virtual method called

That's ... some C++ code?

@matklad
Contributor

matklad commented Feb 8, 2021

@bowenwang1996 could you link a specific build where this happens? I've downloaded a couple of recent build logs, but didn't find the error

@bowenwang1996
Collaborator Author

@matklad sorry I somehow forgot to include the link, but here it is https://buildkite.com/nearprotocol/nearcore/builds/5548#c1917194-dff1-423d-ba38-930896e2046c

@matklad
Contributor

matklad commented Feb 8, 2021

Might be facebook/rocksdb#649?

Local repro: cargo t -p neard --test run_nodes -q -- run_nodes_1_2_2

@matklad
Contributor

matklad commented Feb 11, 2021

Ok, I think I've diagnosed the root cause: we didn't properly wait for
PeerManager actors to join. If I apply the following diff

diff --git a/chain/network/src/peer_manager.rs b/chain/network/src/peer_manager.rs
index efb35cd7..2fb16197 100644
--- a/chain/network/src/peer_manager.rs
+++ b/chain/network/src/peer_manager.rs
@@ -1233,6 +1233,12 @@ impl Actor for PeerManagerActor {

         Running::Stop
     }
+
+    fn stopped(&mut self, ctx: &mut Self::Context) {
+        eprintln!("WILL STOP PEER MANAGER");
+        std::thread::sleep_ms(250);
+        eprintln!("DID STOP PEER MANAGER");
+    }
 }

 impl Handler<NetworkRequests> for PeerManagerActor {

and run

cargo t  -p neard --test run_nodes -- run_nodes_1_2_1 --nocapture --test-threads 1

I get the following output:

running 1 test
test run_nodes_1_2_1 ... Feb 11 16:09:20.071  INFO near: Did not find "/run/user/1000/run_nodes_2_1_0DFlMiO/data" path, will be creating new store database
Feb 11 16:09:20.279  INFO runtime: Genesis state has 6 records, computing state roots
Feb 11 16:09:20.280  INFO runtime: Tracking shards: {}
Feb 11 16:09:20.291  INFO actix_server::builder: Starting 4 workers
Feb 11 16:09:20.292  INFO actix_server::builder: Starting "actix-web-service-0.0.0.0:38841" service on 0.0.0.0:38841
Feb 11 16:09:20.292  INFO near: Did not find "/run/user/1000/run_nodes_2_1_1DDiQJv/data" path, will be creating new store database
Feb 11 16:09:20.296  INFO stats: Server listening at ed25519:[email protected]:40657
Feb 11 16:09:20.511  INFO runtime: Genesis state has 6 records, computing state roots
Feb 11 16:09:20.512  INFO runtime: Tracking shards: {0}
Feb 11 16:09:20.515  INFO actix_server::builder: Starting 4 workers
Feb 11 16:09:20.516  INFO actix_server::builder: Starting "actix-web-service-0.0.0.0:41963" service on 0.0.0.0:41963
Feb 11 16:09:20.516  INFO client: Starting validator node: near.0
Feb 11 16:09:20.517  INFO chain: Init: saved genesis: `AkkFBh9g1fq94MsLGy8uJmHJUJxvYWA3wMQCuyiZ1Ah2` / [`3UPJSSM6qstkqpRgL1suqMh2c6yeGkxihMBzrn9yH5rc`]
Feb 11 16:09:20.518  INFO chain: Init: head @ 4326 [AkkFBh9g1fq94MsLGy8uJmHJUJxvYWA3wMQCuyiZ1Ah2]
Feb 11 16:09:20.518  INFO stats: Server listening at ed25519:[email protected]:42235
Feb 11 16:09:20.519  INFO chain: Init: saved genesis: `AkkFBh9g1fq94MsLGy8uJmHJUJxvYWA3wMQCuyiZ1Ah2` / [`3UPJSSM6qstkqpRgL1suqMh2c6yeGkxihMBzrn9yH5rc`]
Feb 11 16:09:20.519  INFO chain: Init: head @ 4326 [AkkFBh9g1fq94MsLGy8uJmHJUJxvYWA3wMQCuyiZ1Ah2]
WILL STOP PEER MANAGER
WILL STOP PEER MANAGER
Feb 11 16:09:20.522 ERROR network: Failed sending GetChain to client: Mailbox has closed
ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 3 filtered out

That is, at the end of the test PeerManager isn't dropped, because the thread
that owns it is abruptly terminated. I observe the same behavior even with the old
actix, so this bug is pre-existing.

Why does it manifest as that pure virtual function message? That's because
PeerManager's Drop might or might not run, depending on timing. If the drop
runs just as the main thread exits, rocksdb might observe partially-destroyed
C++ globals and die in flames. Note that with the sleep in there, we actually don't
get a crash, as we simply never get to destroying rocksdb.
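
To make the failure mode concrete, here is a minimal, self-contained sketch of the hazard (plain std only, no nearcore or actix code; the Noisy type is made up to stand in for something with a non-trivial destructor, like a rocksdb handle). Whether its Drop runs before, during, or never relative to process teardown depends purely on timing:

use std::thread;
use std::time::Duration;

// Stand-in for a resource with a non-trivial destructor (e.g. a rocksdb handle).
struct Noisy;

impl Drop for Noisy {
    fn drop(&mut self) {
        eprintln!("running Drop on a leaked thread");
    }
}

fn main() {
    // Detached thread: nothing ever joins it, mirroring how Arbiter threads leak.
    thread::spawn(|| {
        let _guard = Noisy;
        thread::sleep(Duration::from_millis(10));
        // Drop for `_guard` runs here -- maybe before main exits, maybe
        // while the process is already tearing down, maybe never.
    });

    // main returns immediately; the detached thread is killed abruptly at
    // process exit, at whatever point it happens to have reached.
}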

Why does this only manifest with the new actix? It doesn't, actually! I managed to
reproduce this on ffe8f67 by running

cargo t --release  -p neard --test run_nodes -- run_nodes_1_2_1 --nocapture --test-threads 1

while hammering my CPU with rust compilation to create more erratic behavior. It
took 9 runs for pure virtual method called to appear.

The problem here, I think, is slightly bigger than just a weird message in tests:

  • we have a HEAVY_TESTS_LOCK to prevent concurrent execution of the tests. As
    threads leak beyond the test's extent, this doesn't eliminate the
    interference completely (see the sketch after this list).

  • I wonder if we may fail to properly close the database on a running instance?
    This shouldn't be a huge problem (as it should always be safe to just kill the
    instance), but it is still worth fixing.
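
A minimal sketch of the first point (the lock name mirrors the HEAVY_TESTS_LOCK mentioned above; the heavy_test helper and test body are hypothetical stand-ins). The guard only covers the test function itself, so a thread spawned inside the test keeps touching shared resources after the lock is released and the next heavy test has started:

use std::sync::Mutex;
use std::thread;
use std::time::Duration;

// Serializes "heavy" tests, mirroring the HEAVY_TESTS_LOCK idea.
static HEAVY_TESTS_LOCK: Mutex<()> = Mutex::new(());

fn heavy_test(body: impl FnOnce()) {
    let _guard = HEAVY_TESTS_LOCK.lock().unwrap();
    body();
    // guard is dropped here -- but any threads `body` spawned are still alive
}

#[test]
fn run_nodes_hypothetical() {
    heavy_test(|| {
        // Hypothetical stand-in for starting a node: the spawned thread
        // outlives the closure, and therefore outlives the lock guard,
        // so it can still be binding ports / touching the store while the
        // next heavy test already holds HEAVY_TESTS_LOCK.
        thread::spawn(|| {
            thread::sleep(Duration::from_secs(1));
        });
    });
}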

@matklad
Contributor

matklad commented Feb 11, 2021

I've looked into fixing this by joining all the threads, and it seems like this is
going to be a yak shave.

To recap, the problem is that the Actix runtime leaks threads. Any time
Arbiter::new is called, a new native thread is spawned and, by default,
nothing joins that thread. That means that the thread might leak out of
#[test] or main, and get abruptly killed. This is problematic because
destructors might not get called, or might get called after main returns
(which causes the original problem observed in this issue).

Here's a short demo:

matklad@68b8800

I believe this is going to be hard to fix, as the actix actor framework isn't
really designed for this. Take a look at how the Arbiter::join method was added:

https://github.com/actix/actix-net/pull/60/files#diff-fe78f08c4b652654276b5894fc7f458014b714857e8f4f23313399f9ca71019cR51-R55

Note that the Clone impl forgets the associated thread, so ::join only
actually works on the original instance of the Arbiter, and no concerns about
that were raised in the PR. This feels like a "not really designed" API to me
:(

When I tried to write a simple fix here, manually joining the Arbiters, I got
weird panics inside actix from run_later about a missing tokio runtime.
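
For reference, the shape of that manual fix is roughly the following. This is a sketch against the actix-rt 2.x API as I understand it, not the actual nearcore change; method names and signatures differ between actix-rt versions, so treat the exact calls as assumptions:

use actix_rt::{Arbiter, System};

fn main() {
    // A System must exist for Arbiter::new to register against.
    let _system = System::new();

    // Each Arbiter owns a dedicated OS thread running its own event loop.
    let arbiter = Arbiter::new();

    arbiter.spawn(async {
        println!("work running on the arbiter thread");
    });

    // Ask the arbiter's event loop to shut down...
    arbiter.stop();
    // ...and block until its thread has actually exited, so anything it
    // owns is dropped before `main` returns.
    arbiter.join().unwrap();
}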

I am not sure how to proceed further, as it feels like fixing this would touch the
core of our approach to concurrency.

It might also be that everything is fine, and I am just overly sensitive to
cancellation and shutdown issues? Every concurrent application I've looked
at seems to be broken around causality, cancellation & shutdown. At the same time,
the apps generally work most of the time? So maybe fixing this isn't worth it?

@bowenwang1996
Collaborator Author

I think a practical concern for us is that this causes CI to fail randomly from time to time, which is fairly annoying. Do you think it could be fixed inside actix, @matklad? Should we submit an issue there?

@matklad
Contributor

matklad commented Feb 15, 2021

Submitted a "fix" in #3958 and opened an issue aginst the actix repository. My current thinking is that actix actor framework (not the actix-web http framework) isn't really needed, and we are better of by using something like https://ryhl.io/blog/actors-with-tokio/

near-bulldozer bot pushed a commit that referenced this issue Feb 16, 2021
Co-authored-by: Vlad Frolov <[email protected]>
@frol
Collaborator

frol commented Mar 2, 2021

I will close this for now. I think #3958 and #4013 address it (by working around colliding nodes), and given that running multiple nodes at once or concurrently is not a common scenario, we can deal with it later when we address #3962.
