Handle node disconnections #2745

Anviking · 2021-06-30T11:21:35Z

Issue Number

ADP-871

Overview

Ensure follow catches asyncronous exceptions from connectClient,
such that it can restart with a new connection and cursor.
Keep Local State Queries and to-be-submitted Txs queued until their
requests finish, not just when they start. If a query is interrupted by
the node being disconnected, it will block until a connection is
re-established, and then retry.
Remove need to catch async exceptions
- Think this is a good opportunity to refactor / simplify all the chain-following code.
Verify ouroboros-network errors are logged

Comments

Integration tests could probably be written, but it would:

take some time better spend on other tasks
add complexity to the already complex cluster setup

so I'd lean towards it not being worth it at this moment.

I tested (see below) manually though, which could be added as a documented .md test if desired.

Manual testing

Interrupted chain-sync & wallet listing

start node and wallet
list wallets => [wallet1]
kill node
list wallets => [wallet1] (previously [], or crash)
start node
wallet resumes syncing
list wallets => [wallet1]

Interrupted Local State Query

Start node and wallet
cardano-wallet stake-pool list --stake 0
kill node
observe that nothing happens
re-start node
stake-pool list request finishes after a while

1. Ensure follow catches asyncronous exceptions from connectClient, such that it can restart with a new connection and cursor. 2. Keep Local State Queries and to-be-submitted Txs queued until their requests finish, not just when they start. If a query is interrupted by the node being disconnected, it will block until a connection is re-established, and then retry.

rvl

Looks promising...

rvl · 2021-06-30T11:46:03Z

lib/core/src/Cardano/Wallet/Network.hs

+        -- Node disconnections are seen as async exceptions from here. By
+        -- catching them, `follow` will try to establish a new connection
+        -- depending on the `FollowExceptionRecovery`.
+        handleSyncOrAsync (traceException *> const (pure FollowFailure))


I really doubt that we need to handle asynchronous exceptions.
Just UnliftIO.Exception.handleAny (any synchronous exception) should be enough.

We already used to catch synchronous exceptions.

Note that chain-sync is run on a separate thread from follow, connected via link, which seems to indeed cause async exceptions to be thrown:

λ> f = async (threadDelay 1000000 >> throwIO (userError "hi")) >>= link >> threadDelay 3000000 λ> f `catchSyncOrAsync` (\(e::SomeException) -> putStrLn "Caught") Caught λ> f `catchAny` (\_e -> putStrLn "Caught") *** Exception: ExceptionInLinkedThread (ThreadId 988) user error (hi)

OK, then we should either be catching the exceptions in the chain sync thread, or not using link. Trying to catch exceptions from other threads is a recipe for disaster.

rvl · 2021-06-30T11:54:45Z

lib/core/src/Ouroboros/Network/Client/Wallet.hs

@@ -595,9 +609,14 @@ localTxSubmission queue =
  where
    clientStIdle
        :: m (LocalTxClientStIdle tx err m Void)
-    clientStIdle = atomically (readTQueue queue) <&> \case
+    clientStIdle = atomically (peekTQueue queue) <&> \case


It might be easier to keep using readTQueue but then put an unGetTQueue in an onException handler.

piotr-iohk · 2021-07-01T11:27:06Z

This looks to be improving also ADP-1013 / #2730 (Wallet workers die while syncing when there are many wallets).

Having 240+ wallets I don't see workers dying. Although there are in the log progress decreased entires:

bed93eb5: Syncing (1.64% percent) Applied 0 blocks, 0 rollbacks in the last 30.213246312s. Currently at slot 50031. (progress decreased)

Perhaps it is due to the fact there are a lot of wallets syncing at the same time and therefore syncing is much slower.
There are also entries saying wallet looses connection with the node:

[cardano-wallet.network:Warning:9858] [2021-07-01 11:17:33.39 UTC] Connection lost with the node.

However wallets seem to be operational.

Checked on rev: v2021-06-11 (git revision: 0b5a228).

piotr-iohk · 2021-07-01T13:33:58Z

However, one thing I've noticed is deleting a wallet (even when there's only one wallet) results in DB "lock" pretty much every time:

[cardano-wallet.wallet-db:Info:174] [2021-07-01 13:29:42.12 UTC] Waiting for 1 withDatabase 075fde25...7f11b717 call(s) to finish
[cardano-wallet.wallet-db:Info:174] [2021-07-01 13:29:42.17 UTC] Waiting for 1 withDatabase 075fde25...7f11b717 call(s) to finish
[cardano-wallet.wallet-db:Info:174] [2021-07-01 13:29:42.22 UTC] Waiting for 1 withDatabase 075fde25...7f11b717 call(s) to finish
...
[cardano-wallet.wallet-db:Notice:174] [2021-07-01 13:31:38.75 UTC] Timed out waiting for 1 withDatabase 075fde25...7f11b717 call(s) to finish. Attempting to remove the database anyway.
[cardano-wallet.wallet-db:Info:174] [2021-07-01 13:31:38.75 UTC] Removing wallet's database. Wallet id was 075fde25...7f11b717
[cardano-wallet.api-server:Error:174] [2021-07-01 13:31:38.75 UTC] [RequestId 8] DELETE /v2/wallets/075fde25bfe36db533c3f91ad1ecbc657f11b717 204 No Content in 116.628722893s

Anviking · 2021-07-01T13:36:02Z

However, one thing I've noticed is deleting a wallet (even when there's only one wallet) results in DB "lock" pretty much every time:

In what context? Disconnected node or lots of wallets?

piotr-iohk · 2021-07-01T13:37:45Z

However, one thing I've noticed is deleting a wallet (even when there's only one wallet) results in DB "lock" pretty much every time:

In what context? Disconnected node or lots of wallets?

Nope, if you just:

create single wallet
delete wallet

and wallet deletion takes very long time with above entires in the log.

Anviking · 2021-07-01T13:40:42Z

Ah, might be the concrete problem of #2745 (comment) ! I had tried killing cardano-wallet (which worked fine), but not deleting a single wallet!

Anviking · 2021-07-01T13:52:27Z

There are also entries saying wallet looses connection with the node:

Yeah, even if re-opening connections help, it's interesting why they are closed to begin with...

There should be a more detailed error message from ouroboros-network (but it might be "lost" / not shown)

can try to reproduce / look over logging if needed after sorting out the need-to-catch-async problem

Anviking · 2021-07-07T12:47:35Z

Closing in favour of #2750, at least for now.

2750: Refactor chain following r=piotr-iohk a=Anviking ### Issue Number Based on #2745 ADP-871 ### Motivation Simplify and clean up the chain following code, in the hope of making the wallet more robust against node disconnects. In particular, reduce the number of threads used to run the ChainSync protocol. ### Overview * The `chainSync` function now takes a record of callbacks, `ChainFollower`, as an argument. These callbacks are used to request the current intersection, as well as react to roll forward and roll backwards messages. * Remove the old `Cursor` type. * Reduce the number of threads used for implementing the above. ### Progress - [x] Integration tests run without errors - [ ] Clean up / review more closely - [x] Corner cases - [x] Logging - [x] Aggregation of sync progress works again - [x] Ensure that all `ChainFollowLog` messages are traced or removed if redundant. - [x] `withFollowStatsMonitoring` runs in a separate thread so that it can compute statistics in regular time intervals. - [ ] Node disconnect errors - [x] … are thrown as exceptions in `connectTo` and are handled by `recoveringNodeConnection` - [ ] Check again at the end that the PR solves the bug - [x] Can we property-test this? (Probably no benefit.) - [x] Check for performance regressions (running the chain-sync and db operations on the same thread might make it slightly slower) - Actually seemed 10% faster running integration tests locally with -j 8. - ef3fdc3 -> 274s (n=2) - master -> 303s (n=1) <- but I only ran once - Speedup _could be_ related to chain-following now being able to rollback without necessarily re-negotiating the intersection. ### Comments   Co-authored-by: Heinrich Apfelmus <[email protected]> Co-authored-by: Johannes Lund <[email protected]>

Anviking force-pushed the anviking/ADP-871/node-connections branch from a0dacaa to f308410 Compare June 30, 2021 11:25

Anviking self-assigned this Jun 30, 2021

Anviking force-pushed the anviking/ADP-871/node-connections branch from cd4776c to 2738c71 Compare June 30, 2021 11:38

Regenerate nix

0b5a228

rvl reviewed Jun 30, 2021

View reviewed changes

Anviking mentioned this pull request Jul 1, 2021

Refactor chain following #2750

Merged

12 tasks

rvl changed the title ~~Handle node disconnecions~~ Handle node disconnections Jul 6, 2021

Anviking closed this Jul 7, 2021

Anviking deleted the anviking/ADP-871/node-connections branch August 1, 2022 20:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle node disconnections #2745

Handle node disconnections #2745

Anviking commented Jun 30, 2021 •

edited

Loading

rvl left a comment

rvl Jun 30, 2021

Anviking Jun 30, 2021

rvl Jun 30, 2021

Anviking Jun 30, 2021

rvl Jun 30, 2021

piotr-iohk commented Jul 1, 2021

piotr-iohk commented Jul 1, 2021

Anviking commented Jul 1, 2021

piotr-iohk commented Jul 1, 2021

Anviking commented Jul 1, 2021

Anviking commented Jul 1, 2021

Anviking commented Jul 7, 2021

Handle node disconnections #2745

Handle node disconnections #2745

Conversation

Anviking commented Jun 30, 2021 • edited Loading

Issue Number

Overview

Comments

Manual testing

Interrupted chain-sync & wallet listing

Interrupted Local State Query

rvl left a comment

Choose a reason for hiding this comment

rvl Jun 30, 2021

Choose a reason for hiding this comment

Anviking Jun 30, 2021

Choose a reason for hiding this comment

rvl Jun 30, 2021

Choose a reason for hiding this comment

Anviking Jun 30, 2021

Choose a reason for hiding this comment

rvl Jun 30, 2021

Choose a reason for hiding this comment

piotr-iohk commented Jul 1, 2021

piotr-iohk commented Jul 1, 2021

Anviking commented Jul 1, 2021

piotr-iohk commented Jul 1, 2021

Anviking commented Jul 1, 2021

Anviking commented Jul 1, 2021

Anviking commented Jul 7, 2021

Anviking commented Jun 30, 2021 •

edited

Loading