-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle node disconnections #2745
Conversation
a0dacaa
to
f308410
Compare
1. Ensure follow catches asyncronous exceptions from connectClient, such that it can restart with a new connection and cursor. 2. Keep Local State Queries and to-be-submitted Txs queued until their requests finish, not just when they start. If a query is interrupted by the node being disconnected, it will block until a connection is re-established, and then retry.
cd4776c
to
2738c71
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks promising...
-- Node disconnections are seen as async exceptions from here. By | ||
-- catching them, `follow` will try to establish a new connection | ||
-- depending on the `FollowExceptionRecovery`. | ||
handleSyncOrAsync (traceException *> const (pure FollowFailure)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really doubt that we need to handle asynchronous exceptions.
Just UnliftIO.Exception.handleAny
(any synchronous exception) should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already used to catch synchronous exceptions.
Note that chain-sync is run on a separate thread from follow
, connected via link
, which seems to indeed cause async exceptions to be thrown:
λ> f = async (threadDelay 1000000 >> throwIO (userError "hi")) >>= link >> threadDelay 3000000
λ> f `catchSyncOrAsync` (\(e::SomeException) -> putStrLn "Caught")
Caught
λ> f `catchAny` (\_e -> putStrLn "Caught")
*** Exception: ExceptionInLinkedThread (ThreadId 988) user error (hi)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then we should either be catching the exceptions in the chain sync thread, or not using link
. Trying to catch exceptions from other threads is a recipe for disaster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok 👍
@@ -595,9 +609,14 @@ localTxSubmission queue = | |||
where | |||
clientStIdle | |||
:: m (LocalTxClientStIdle tx err m Void) | |||
clientStIdle = atomically (readTQueue queue) <&> \case | |||
clientStIdle = atomically (peekTQueue queue) <&> \case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be easier to keep using readTQueue
but then put an unGetTQueue
in an onException
handler.
This looks to be improving also ADP-1013 / #2730 (Wallet workers die while syncing when there are many wallets). Having 240+ wallets I don't see workers dying. Although there are in the log
Perhaps it is due to the fact there are a lot of wallets syncing at the same time and therefore syncing is much slower.
However wallets seem to be operational. Checked on rev: v2021-06-11 (git revision: 0b5a228). |
However, one thing I've noticed is deleting a wallet (even when there's only one wallet) results in DB "lock" pretty much every time:
|
In what context? Disconnected node or lots of wallets? |
Nope, if you just:
and wallet deletion takes very long time with above entires in the log. |
Ah, might be the concrete problem of #2745 (comment) ! I had tried killing cardano-wallet (which worked fine), but not deleting a single wallet! |
Yeah, even if re-opening connections help, it's interesting why they are closed to begin with... There should be a more detailed error message from ouroboros-network (but it might be "lost" / not shown) can try to reproduce / look over logging if needed after sorting out the need-to-catch-async problem |
Closing in favour of #2750, at least for now. |
2750: Refactor chain following r=piotr-iohk a=Anviking ### Issue Number Based on #2745 ADP-871 ### Motivation Simplify and clean up the chain following code, in the hope of making the wallet more robust against node disconnects. In particular, reduce the number of threads used to run the ChainSync protocol. ### Overview * The `chainSync` function now takes a record of callbacks, `ChainFollower`, as an argument. These callbacks are used to request the current intersection, as well as react to roll forward and roll backwards messages. * Remove the old `Cursor` type. * Reduce the number of threads used for implementing the above. ### Progress - [x] Integration tests run without errors - [ ] Clean up / review more closely - [x] Corner cases - [x] Logging - [x] Aggregation of sync progress works again - [x] Ensure that all `ChainFollowLog` messages are traced or removed if redundant. - [x] `withFollowStatsMonitoring` runs in a separate thread so that it can compute statistics in regular time intervals. - [ ] Node disconnect errors - [x] … are thrown as exceptions in `connectTo` and are handled by `recoveringNodeConnection` - [ ] Check again at the end that the PR solves the bug - [x] Can we property-test this? (Probably no benefit.) - [x] Check for performance regressions (running the chain-sync and db operations on the same thread might make it slightly slower) - Actually seemed 10% faster running integration tests locally with -j 8. - ef3fdc3 -> 274s (n=2) - master -> 303s (n=1) <- but I only ran once - Speedup _could be_ related to chain-following now being able to rollback without necessarily re-negotiating the intersection. ### Comments <!-- Additional comments or screenshots to attach if any --> <!-- Don't forget to: ✓ Self-review your changes to make sure nothing unexpected slipped through ✓ Assign yourself to the PR ✓ Assign one or several reviewer(s) ✓ Jira will detect and link to this PR once created, but you can also link this PR in the description of the corresponding ticket ✓ Acknowledge any changes required to the Wiki ✓ Finally, in the PR description delete any empty sections and all text commented in <!--, so that this text does not appear in merge commit messages. --> Co-authored-by: Heinrich Apfelmus <[email protected]> Co-authored-by: Johannes Lund <[email protected]>
Issue Number
ADP-871
Overview
Ensure follow catches asyncronous exceptions from connectClient,
such that it can restart with a new connection and cursor.
Keep Local State Queries and to-be-submitted Txs queued until their
requests finish, not just when they start. If a query is interrupted by
the node being disconnected, it will block until a connection is
re-established, and then retry.
Remove need to catch async exceptions
Verify ouroboros-network errors are logged
Comments
Integration tests could probably be written, but it would:
so I'd lean towards it not being worth it at this moment.
I tested (see below) manually though, which could be added as a documented .md test if desired.
Manual testing
Interrupted chain-sync & wallet listing
Interrupted Local State Query
cardano-wallet stake-pool list --stake 0