Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle node disconnections #2745

Closed
wants to merge 2 commits into from
Closed

Conversation

Anviking
Copy link
Member

@Anviking Anviking commented Jun 30, 2021

Issue Number

ADP-871

Overview

  • Ensure follow catches asyncronous exceptions from connectClient,
    such that it can restart with a new connection and cursor.

  • Keep Local State Queries and to-be-submitted Txs queued until their
    requests finish, not just when they start. If a query is interrupted by
    the node being disconnected, it will block until a connection is
    re-established, and then retry.

  • Remove need to catch async exceptions

    • Think this is a good opportunity to refactor / simplify all the chain-following code.
  • Verify ouroboros-network errors are logged

Comments

Integration tests could probably be written, but it would:

  • take some time better spend on other tasks
  • add complexity to the already complex cluster setup

so I'd lean towards it not being worth it at this moment.

I tested (see below) manually though, which could be added as a documented .md test if desired.

Manual testing

Interrupted chain-sync & wallet listing

  1. start node and wallet
  2. list wallets => [wallet1]
  3. kill node
  4. list wallets => [wallet1] (previously [], or crash)
  5. start node
  6. wallet resumes syncing
  7. list wallets => [wallet1]

Interrupted Local State Query

  1. Start node and wallet
  2. cardano-wallet stake-pool list --stake 0
  3. kill node
  4. observe that nothing happens
  5. re-start node
  6. stake-pool list request finishes after a while

@Anviking Anviking force-pushed the anviking/ADP-871/node-connections branch from a0dacaa to f308410 Compare June 30, 2021 11:25
@Anviking Anviking self-assigned this Jun 30, 2021
1. Ensure follow catches asyncronous exceptions from connectClient,
such that it can restart with a new connection and cursor.

2. Keep Local State Queries and to-be-submitted Txs queued until their
requests finish, not just when they start. If a query is interrupted by
the node being disconnected, it will block until a connection is
re-established, and then retry.
@Anviking Anviking force-pushed the anviking/ADP-871/node-connections branch from cd4776c to 2738c71 Compare June 30, 2021 11:38
Copy link
Contributor

@rvl rvl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks promising...

-- Node disconnections are seen as async exceptions from here. By
-- catching them, `follow` will try to establish a new connection
-- depending on the `FollowExceptionRecovery`.
handleSyncOrAsync (traceException *> const (pure FollowFailure))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really doubt that we need to handle asynchronous exceptions.
Just UnliftIO.Exception.handleAny (any synchronous exception) should be enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already used to catch synchronous exceptions.

Note that chain-sync is run on a separate thread from follow, connected via link, which seems to indeed cause async exceptions to be thrown:

λ> f = async (threadDelay 1000000 >> throwIO (userError "hi")) >>= link >> threadDelay 3000000

λ> f `catchSyncOrAsync` (\(e::SomeException) -> putStrLn "Caught")
Caught

λ>  f `catchAny` (\_e -> putStrLn "Caught")
*** Exception: ExceptionInLinkedThread (ThreadId 988) user error (hi)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then we should either be catching the exceptions in the chain sync thread, or not using link. Trying to catch exceptions from other threads is a recipe for disaster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok 👍

@@ -595,9 +609,14 @@ localTxSubmission queue =
where
clientStIdle
:: m (LocalTxClientStIdle tx err m Void)
clientStIdle = atomically (readTQueue queue) <&> \case
clientStIdle = atomically (peekTQueue queue) <&> \case
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier to keep using readTQueue but then put an unGetTQueue in an onException handler.

@piotr-iohk
Copy link
Contributor

This looks to be improving also ADP-1013 / #2730 (Wallet workers die while syncing when there are many wallets).

Having 240+ wallets I don't see workers dying. Although there are in the log progress decreased entires:

bed93eb5: Syncing (1.64% percent) Applied 0 blocks, 0 rollbacks in the last 30.213246312s. Currently at slot 50031. (progress decreased)

Perhaps it is due to the fact there are a lot of wallets syncing at the same time and therefore syncing is much slower.
There are also entries saying wallet looses connection with the node:

[cardano-wallet.network:Warning:9858] [2021-07-01 11:17:33.39 UTC] Connection lost with the node.

However wallets seem to be operational.

Checked on rev: v2021-06-11 (git revision: 0b5a228).

@piotr-iohk
Copy link
Contributor

However, one thing I've noticed is deleting a wallet (even when there's only one wallet) results in DB "lock" pretty much every time:

[cardano-wallet.wallet-db:Info:174] [2021-07-01 13:29:42.12 UTC] Waiting for 1 withDatabase 075fde25...7f11b717 call(s) to finish
[cardano-wallet.wallet-db:Info:174] [2021-07-01 13:29:42.17 UTC] Waiting for 1 withDatabase 075fde25...7f11b717 call(s) to finish
[cardano-wallet.wallet-db:Info:174] [2021-07-01 13:29:42.22 UTC] Waiting for 1 withDatabase 075fde25...7f11b717 call(s) to finish
...
[cardano-wallet.wallet-db:Notice:174] [2021-07-01 13:31:38.75 UTC] Timed out waiting for 1 withDatabase 075fde25...7f11b717 call(s) to finish. Attempting to remove the database anyway.
[cardano-wallet.wallet-db:Info:174] [2021-07-01 13:31:38.75 UTC] Removing wallet's database. Wallet id was 075fde25...7f11b717
[cardano-wallet.api-server:Error:174] [2021-07-01 13:31:38.75 UTC] [RequestId 8] DELETE /v2/wallets/075fde25bfe36db533c3f91ad1ecbc657f11b717 204 No Content in 116.628722893s

@Anviking
Copy link
Member Author

Anviking commented Jul 1, 2021

However, one thing I've noticed is deleting a wallet (even when there's only one wallet) results in DB "lock" pretty much every time:

In what context? Disconnected node or lots of wallets?

@piotr-iohk
Copy link
Contributor

However, one thing I've noticed is deleting a wallet (even when there's only one wallet) results in DB "lock" pretty much every time:

In what context? Disconnected node or lots of wallets?

Nope, if you just:

  • create single wallet
  • delete wallet

and wallet deletion takes very long time with above entires in the log.

@Anviking
Copy link
Member Author

Anviking commented Jul 1, 2021

Ah, might be the concrete problem of #2745 (comment) ! I had tried killing cardano-wallet (which worked fine), but not deleting a single wallet!

@Anviking
Copy link
Member Author

Anviking commented Jul 1, 2021

There are also entries saying wallet looses connection with the node:

Yeah, even if re-opening connections help, it's interesting why they are closed to begin with...

There should be a more detailed error message from ouroboros-network (but it might be "lost" / not shown)

can try to reproduce / look over logging if needed after sorting out the need-to-catch-async problem

@Anviking Anviking mentioned this pull request Jul 1, 2021
12 tasks
@rvl rvl changed the title Handle node disconnecions Handle node disconnections Jul 6, 2021
@Anviking
Copy link
Member Author

Anviking commented Jul 7, 2021

Closing in favour of #2750, at least for now.

@Anviking Anviking closed this Jul 7, 2021
iohk-bors bot added a commit that referenced this pull request Nov 4, 2021
2750: Refactor chain following r=piotr-iohk a=Anviking

### Issue Number

Based on #2745 
ADP-871

### Motivation

Simplify and clean up the chain following code, in the hope of making the wallet more robust against node disconnects. In particular, reduce the number of threads used to run the ChainSync protocol.

### Overview

* The `chainSync` function now takes a record of callbacks, `ChainFollower`, as an argument. These callbacks are used to request the current intersection, as well as react to roll forward and roll backwards messages.
* Remove the old `Cursor` type.
* Reduce the number of threads used for implementing the above.

### Progress

- [x] Integration tests run without errors
- [ ] Clean up / review more closely
    - [x] Corner cases
    - [x] Logging 
        - [x] Aggregation of sync progress works again
        - [x] Ensure that all `ChainFollowLog` messages are traced or removed if redundant.
        - [x] `withFollowStatsMonitoring` runs in a separate thread so that it can compute statistics in regular time intervals.
    - [ ] Node disconnect errors
        - [x] … are thrown as exceptions in `connectTo` and are handled by `recoveringNodeConnection`
        - [ ] Check again at the end that the PR solves the bug
    - [x] Can we property-test this? (Probably no benefit.)
    - [x] Check for performance regressions (running the chain-sync and db operations on the same thread might make it slightly slower)
        - Actually seemed 10% faster running integration tests locally with -j 8.
            - ef3fdc3 -> 274s (n=2)
            - master -> 303s (n=1) <- but I only ran once
            - Speedup _could be_ related to chain-following now being able to rollback without necessarily re-negotiating the intersection.

### Comments


<!-- Additional comments or screenshots to attach if any -->

<!--
Don't forget to:

 ✓ Self-review your changes to make sure nothing unexpected slipped through
 ✓ Assign yourself to the PR
 ✓ Assign one or several reviewer(s)
 ✓ Jira will detect and link to this PR once created, but you can also link this PR in the description of the corresponding ticket
 ✓ Acknowledge any changes required to the Wiki
 ✓ Finally, in the PR description delete any empty sections and all text commented in <!--, so that this text does not appear in merge commit messages.
-->


Co-authored-by: Heinrich Apfelmus <[email protected]>
Co-authored-by: Johannes Lund <[email protected]>
@Anviking Anviking deleted the anviking/ADP-871/node-connections branch August 1, 2022 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants