
CI failure after upgrading actix #3925

Closed
bowenwang1996 opened this issue Feb 7, 2021 · 9 comments

@bowenwang1996
Collaborator

We occasionally see a CI failure like the following after #3869:


[2021-02-05T06:41:18Z] test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

[2021-02-05T06:41:18Z]
[2021-02-05T06:41:18Z] pure virtual method called
[2021-02-05T06:41:18Z] terminate called without an active exception
[2021-02-05T06:41:18Z] error: test failed, to rerun pass '-p neard --test run_nodes'

It looks like even though the tests passed, something went wrong afterwards.

@matklad
Contributor

matklad commented Feb 8, 2021

Oh dear.

pure virtual method called

That's ... some C++ code?

@matklad
Contributor

matklad commented Feb 8, 2021

@bowenwang1996 could you link a specific build where this happens? I've downloaded a couple of recent build logs, but didn't find the error

@bowenwang1996
Collaborator Author

@matklad sorry I somehow forgot to include the link, but here it is https://buildkite.com/nearprotocol/nearcore/builds/5548#c1917194-dff1-423d-ba38-930896e2046c

@matklad
Contributor

matklad commented Feb 8, 2021

Might be facebook/rocksdb#649?

Local repro: cargo t -p neard --test run_nodes -q -- run_nodes_1_2_2

@matklad
Contributor

matklad commented Feb 11, 2021

Ok, I think I've diagnosed the root cause: we didn't properly wait for
PeerManager actors to join. If I apply the following diff

diff --git a/chain/network/src/peer_manager.rs b/chain/network/src/peer_manager.rs
index efb35cd7..2fb16197 100644
--- a/chain/network/src/peer_manager.rs
+++ b/chain/network/src/peer_manager.rs
@@ -1233,6 +1233,12 @@ impl Actor for PeerManagerActor {

         Running::Stop
     }
+
+    fn stopped(&mut self, ctx: &mut Self::Context) {
+        eprintln!("WILL STOP PEER MANAGER");
+        std::thread::sleep_ms(250);
+        eprintln!("DID STOP PEER MANAGER");
+    }
 }

 impl Handler<NetworkRequests> for PeerManagerActor {

and run

cargo t  -p neard --test run_nodes -- run_nodes_1_2_1 --nocapture --test-threads 1

I get the following output:

running 1 test
test run_nodes_1_2_1 ... Feb 11 16:09:20.071  INFO near: Did not find "/run/user/1000/run_nodes_2_1_0DFlMiO/data" path, will be creating new store database
Feb 11 16:09:20.279  INFO runtime: Genesis state has 6 records, computing state roots
Feb 11 16:09:20.280  INFO runtime: Tracking shards: {}
Feb 11 16:09:20.291  INFO actix_server::builder: Starting 4 workers
Feb 11 16:09:20.292  INFO actix_server::builder: Starting "actix-web-service-0.0.0.0:38841" service on 0.0.0.0:38841
Feb 11 16:09:20.292  INFO near: Did not find "/run/user/1000/run_nodes_2_1_1DDiQJv/data" path, will be creating new store database
Feb 11 16:09:20.296  INFO stats: Server listening at ed25519:[email protected]:40657
Feb 11 16:09:20.511  INFO runtime: Genesis state has 6 records, computing state roots
Feb 11 16:09:20.512  INFO runtime: Tracking shards: {0}
Feb 11 16:09:20.515  INFO actix_server::builder: Starting 4 workers
Feb 11 16:09:20.516  INFO actix_server::builder: Starting "actix-web-service-0.0.0.0:41963" service on 0.0.0.0:41963
Feb 11 16:09:20.516  INFO client: Starting validator node: near.0
Feb 11 16:09:20.517  INFO chain: Init: saved genesis: `AkkFBh9g1fq94MsLGy8uJmHJUJxvYWA3wMQCuyiZ1Ah2` / [`3UPJSSM6qstkqpRgL1suqMh2c6yeGkxihMBzrn9yH5rc`]
Feb 11 16:09:20.518  INFO chain: Init: head @ 4326 [AkkFBh9g1fq94MsLGy8uJmHJUJxvYWA3wMQCuyiZ1Ah2]
Feb 11 16:09:20.518  INFO stats: Server listening at ed25519:[email protected]:42235
Feb 11 16:09:20.519  INFO chain: Init: saved genesis: `AkkFBh9g1fq94MsLGy8uJmHJUJxvYWA3wMQCuyiZ1Ah2` / [`3UPJSSM6qstkqpRgL1suqMh2c6yeGkxihMBzrn9yH5rc`]
Feb 11 16:09:20.519  INFO chain: Init: head @ 4326 [AkkFBh9g1fq94MsLGy8uJmHJUJxvYWA3wMQCuyiZ1Ah2]
WILL STOP PEER MANAGER
WILL STOP PEER MANAGER
Feb 11 16:09:20.522 ERROR network: Failed sending GetChain to client: Mailbox has closed
ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 3 filtered out

That is, at the end of the test PeerManager isn't dropped, because the thread
that owns it is abruptly terminated. I observe the same behavior even with the old
actix, so this bug is pre-existing.

Why does it manifest as that pure virtual function message? That's because
PeerManager's Drop might or might not run, depending on timing. If the drop
runs just as the main thread exits, rocksdb might observe partially-destroyed
C++ globals and die in flames. Note that with the sleep in there, we actually don't
get a crash, as we simply never get to destroying rocksdb.
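
To make the failure mode concrete, here is a minimal, self-contained sketch of the hazard (plain std only, no nearcore or actix code; the Noisy type is made up to stand in for something with a non-trivial destructor, like a rocksdb handle). Whether its Drop runs before, during, or never relative to process teardown depends purely on timing:

use std::thread;
use std::time::Duration;

// Stand-in for a resource with a non-trivial destructor (e.g. a rocksdb handle).
struct Noisy;

impl Drop for Noisy {
    fn drop(&mut self) {
        eprintln!("running Drop on a leaked thread");
    }
}

fn main() {
    // Detached thread: nothing ever joins it, mirroring how Arbiter threads leak.
    thread::spawn(|| {
        let _guard = Noisy;
        thread::sleep(Duration::from_millis(10));
        // Drop for `_guard` runs here -- maybe before main exits, maybe
        // while the process is already tearing down, maybe never.
    });

    // main returns immediately; the detached thread is killed abruptly at
    // process exit, at whatever point it happens to have reached.
}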

Why does this only manifest with the new actix? It doesn't, actually! I managed to
reproduce this on ffe8f67 by running

cargo t --release  -p neard --test run_nodes -- run_nodes_1_2_1 --nocapture --test-threads 1

while hammering my CPU with rust compilation to create more erratic behavior. It
took 9 runs for pure virtual method called to appear.

The problem here, I think, is slightly bigger than just a weird message in tests:

  • we have a HEAVY_TESTS_LOCK to prevent concurrent execution of the tests. As
    threads leak beyond the test's extent, this doesn't eliminate the
    interference completely (see the sketch after this list).

  • I wonder if we may fail to properly close the database on a running instance?
    This shouldn't be a huge problem (as it should always be safe to just kill the
    instance), but it is still worth fixing.
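
A minimal sketch of the first point (the lock name mirrors the HEAVY_TESTS_LOCK mentioned above; the heavy_test helper and test body are hypothetical stand-ins). The guard only covers the test function itself, so a thread spawned inside the test keeps touching shared resources after the lock is released and the next heavy test has started:

use std::sync::Mutex;
use std::thread;
use std::time::Duration;

// Serializes "heavy" tests, mirroring the HEAVY_TESTS_LOCK idea.
static HEAVY_TESTS_LOCK: Mutex<()> = Mutex::new(());

fn heavy_test(body: impl FnOnce()) {
    let _guard = HEAVY_TESTS_LOCK.lock().unwrap();
    body();
    // guard is dropped here -- but any threads `body` spawned are still alive
}

#[test]
fn run_nodes_hypothetical() {
    heavy_test(|| {
        // Hypothetical stand-in for starting a node: the spawned thread
        // outlives the closure, and therefore outlives the lock guard,
        // so it can still be binding ports / touching the store while the
        // next heavy test already holds HEAVY_TESTS_LOCK.
        thread::spawn(|| {
            thread::sleep(Duration::from_secs(1));
        });
    });
}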

@matklad
Contributor

matklad commented Feb 11, 2021

I've looked into fixing this by joining all the threads, and it seems like this is
going to be a yak shave.

To recap, the problem is that the Actix runtime leaks threads. Any time
Arbiter::new is called, a new native thread is spawned and, by default,
nothing joins that thread. That means that the thread might leak out of
#[test] or main, and get abruptly killed. This is problematic because
destructors might not get called, or might get called after main returns
(which causes the original problem observed in this issue).

Here's a short demo:

matklad@68b8800

I believe this is going to be hard to fix, as the actix actor framework isn't
really designed for this. Take a look at how the Arbiter::join method was added:

https://github.com/actix/actix-net/pull/60/files#diff-fe78f08c4b652654276b5894fc7f458014b714857e8f4f23313399f9ca71019cR51-R55

Note that the Clone impl forgets the associated thread, so ::join only
actually works on the original instance of the Arbiter, and no concerns about
that were raised in the PR. This feels like a "not really designed" API to me
:(

When I tried to write a simple fix here, manually joining the Arbiters, I got
weird panics inside actix from run_later about a missing tokio runtime.
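
For reference, the shape of that manual fix is roughly the following. This is a sketch against the actix-rt 2.x API as I understand it, not the actual nearcore change; method names and signatures differ between actix-rt versions, so treat the exact calls as assumptions:

use actix_rt::{Arbiter, System};

fn main() {
    // A System must exist for Arbiter::new to register against.
    let _system = System::new();

    // Each Arbiter owns a dedicated OS thread running its own event loop.
    let arbiter = Arbiter::new();

    arbiter.spawn(async {
        println!("work running on the arbiter thread");
    });

    // Ask the arbiter's event loop to shut down...
    arbiter.stop();
    // ...and block until its thread has actually exited, so anything it
    // owns is dropped before `main` returns.
    arbiter.join().unwrap();
}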

I am not sure how to proceed further, as it feels like fixing this would touch the
core of our approach to concurrency.

It might also be that everything is fine, and I am just overly sensitive to
cancellation and shutdown issues? Every concurrent application I've looked
at seems to be broken around causality, cancellation & shutdown. At the same time,
the apps generally work most of the time? So maybe fixing this isn't worth it?

@bowenwang1996
Collaborator Author

I think a practical concern for us is that this causes CI to fail randomly from time to time, which is fairly annoying. Do you think it could be fixed inside actix, @matklad? Should we submit an issue there?

@matklad
Contributor

matklad commented Feb 15, 2021

Submitted a "fix" in #3958 and opened an issue aginst the actix repository. My current thinking is that actix actor framework (not the actix-web http framework) isn't really needed, and we are better of by using something like https://ryhl.io/blog/actors-with-tokio/

near-bulldozer bot pushed a commit that referenced this issue Feb 16, 2021
Co-authored-by: Vlad Frolov <[email protected]>
@frol
Collaborator

frol commented Mar 2, 2021

I will close this for now. I think #3958 and #4013 address it (by working around colliding nodes), and given that running multiple nodes at once or concurrently is not a common scenario, we can deal with it later when we address #3962.
