Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The TxGraph::missing_heights implementation is problematic. #1354

Closed
evanlinjin opened this issue Feb 20, 2024 · 8 comments · Fixed by #1369 or #1380
Closed

The TxGraph::missing_heights implementation is problematic. #1354

evanlinjin opened this issue Feb 20, 2024 · 8 comments · Fixed by #1369 or #1380
Assignees
Labels
api A breaking API change discussion There's still a discussion ongoing module-blockchain
Milestone

Comments

@evanlinjin
Copy link
Member

The logic of missing_heights is explained clearly in this comment:

// Usually, if a height of a tx anchor is missing from the chain, we would want to return
// this height in the iterator. The exception is when the tx is confirmed in chain. All the
// other missing-height anchors of this tx can be skipped.

However, if a transaction is confirmed in a block that is about to turn stale (get reorged out), we
still want to fetch the new block hash. I suspect this is where our "transaction is never confirmed"
bug comes from.

My proposed solution:

Change the exception to:

// ... The exception is when the tx is confirmed in chain with a depth greater than the provided
// `max_reorg_depth`.

And we change the signature of the method to be:

pub fn missing_heights<'a>(
    &'a self,
    chain: &'a LocalChain,
    max_reorg_depth: u32,
) -> impl Iterator<Item = u32> + 'a { todo!() }

With some nice documentation for max_reorg_depth.

An alternative solution:

Remove the stated "exception". We can continue to consider a height "missing" when there is a
floating anchor at that height (anchor that does not point to a block in the best chain).

The downside of this approach is that there are situations where certain heights will always be
considered "missing". I think this is somewhat acceptable as reorgs are rare and this situation only
happens due to reorgs.

Because of the behavior change, if we do this solution, we should rename the method to
floating_anchor_heights.

How to reproduce with bdk_esplora:

The following should be written as a test.

  • Wallet receives money via transaction (A).
  • Transaction gets confirmed at confirmation height (h).
  • (Sync wallet with bdk_esplora),
  • A reorg happens, and (A) gets re-confirmed at height (h') where h' >= h.
  • (Sync wallet with bdk_esplora).
  • At this point onwards, transaction (A) is forever stuck as unconfirmed!

Going forward:

The test should be written first while we wait for the discussion on which solution to do matures.

@evanlinjin
Copy link
Member Author

Tests written can be based on #1171

@yanganto
Copy link
Contributor

I do not know what the root cause for the forever unconfirmed one, but I think I can make the test first, and then fix the issue after the test.

@notmandatory notmandatory moved this to Todo in BDK Feb 22, 2024
@LLFourn LLFourn added discussion There's still a discussion ongoing and removed bug Something isn't working labels Feb 28, 2024
@LLFourn LLFourn moved this from Todo to Discussion in BDK Feb 28, 2024
@LLFourn
Copy link
Contributor

LLFourn commented Feb 28, 2024

First I'll respond inline

The logic of missing_heights is explained clearly in this comment:

// Usually, if a height of a tx anchor is missing from the chain, we would want to return
// this height in the iterator. The exception is when the tx is confirmed in chain. All the
// other missing-height anchors of this tx can be skipped.

However, if a transaction is confirmed in a block that is about to turn stale (get reorged out), we still want to fetch the new block hash.

Thanks for figuring this out. Let me restate the key observation. Imagine a chain with tip t and a transaction anchored at h and h'. If h is in the chain with t then missing heights will not return h' as "missing", however, if h is re-org'd out and the tip replaced with t' all of a sudden h' would be "missing" and returned. The problem is we are making the transition from t -> t' after we've called missing_heights and so we manage to transition to t' and but still never notice that h' is missing.

Then, because the person may be using missing_heights on the update transaction graph (which is not what I had in mind with this method but certainly a mistake you could make) they never notice that h' is missing from their main tx graph because they only ever call it on the update graph.

I suspect this is where our "transaction is never confirmed" bug comes from.

I think this is extremely unlikely to be the source of any real world observed bug. Reorgs simply don't happen often enough. it's also easy to verify this theory -- take a look at the tx anchors when the bug occurs. Does it have two?

My proposed solution:

Change the exception to:

// ... The exception is when the tx is confirmed in chain with a depth greater than the provided
// `max_reorg_depth`.

And we change the signature of the method to be:

pub fn missing_heights<'a>(
    &'a self,
    chain: &'a LocalChain,
    max_reorg_depth: u32,
) -> impl Iterator<Item = u32> + 'a { todo!() }

With some nice documentation for max_reorg_depth.

An alternative solution:

Remove the stated "exception". We can continue to consider a height "missing" when there is a floating anchor at that height (anchor that does not point to a block in the best chain).

The downside of this approach is that there are situations where certain heights will always be considered "missing". I think this is somewhat acceptable as reorgs are rare and this situation only happens due to reorgs.

Because of the behavior change, if we do this solution, we should rename the method to floating_anchor_heights.

The method does what it says it's just not a very useful method. You even made this observation in #1126 but I NACK'd it because I thought this method could have a use outside of typical syns.c I didn't think we were actually still using this method anywhere but it looks like it is used in the wallet examples somehow. These should be changed to the ChangeSet::missing_heights_from like the other example. After understanding what @evanlinjin has discovered here I think he's right in #1126 -- we should just delete this method.

BTW I think it is always the wrong idea to have methods that require the user to specify a "max_reorg_depth". Our algorithms should be able to logically tolerate any re-org depth without exception.

How to reproduce with bdk_esplora:

The following should be written as a test.

I don't think so. Is the point of the test to make sure bdk_esplora returns the right results or missing_heights returns the right heights? Neither of them are wrong, missing_heights is just being used in the wrong way (arguably there's no correct way to use it with the current blockchain APIs we have). You can't test for API bugs you need to remove them by designing a better one!

@yanganto don't waste your time writing a test here yet.

@evanlinjin
Copy link
Member Author

@LLFourn thank you for the response. I think we can still write tests with the ChangeSet::missing_heights_from method.

@LLFourn
Copy link
Contributor

LLFourn commented Feb 29, 2024

@LLFourn thank you for the response. I think we can still write tests with the ChangeSet::missing_heights_from method.

Yes. I am objecting to these tests being integration tests that actually make esplora calls. You can simulate re-orgs without running esplora.

We have a issue of understanding the implications of our design. I think it's wasteful to try and tackle those with integration tests which are hard to maintain and will quickly go stale. Gaining understanding about design is a painful process because it exposes our flaws and may cause us to lose confidence along the way. But you have to try and resist the urge to call for integration tests to address bugs in our understanding -- this can only make the situation worse. If we can't understand it, it's probably too complicated. I think #1194 is the right approach because it redesigns the process that doesn't require this call at all so the user can't even make this mistake. In theory this is possible because in eslplora/electrum you can always assume the "missing heights" is the heights of all the transactions after what we call the "point of agreement".

To demonstrate that the problem is lack of understanding here -- I think we're wonrg. I've changed my mind: there is nothing problematic about the missing_heights API. I claim that the way it's being used right now is totally fine. This claim is wrong:

However, if a transaction is confirmed in a block that is about to turn stale (get reorged out), we
still want to fetch the new block hash.

We will always fetch the new block hash if missing_heights is called on the update TxGraph because it will have all the new anchors. Each tx will have at most one anchor. TxGraph::missing_heights will return the heights of the anchors is doesn't have which will be all the new ones. The problem we are imagining here is actually only applicable to when you call missing_heights on the existing TxGraph after you've updated it. But this will only be a problem for the first call, it will heal itself on the next call.

I think the only work to do here is to delete TxGraph::missing_heights since it just doesn't need to exist in any workflow we're aware of and it requires too much insight to guarantee you're using it correctly (and the docs don't provide this). It's also hard to implement correctly.

@evanlinjin
Copy link
Member Author

I'll work on getting rid of TxGraph::missing_heights.

@evanlinjin
Copy link
Member Author

I think that tx_graph::ChangeSet::missing_heights_from is not the perfect API either. We are only fetching checkpoints for newly introduced anchors. However, if fetching checkpoints fail, and the newly introduced anchors are already persisted, we will never fetch the required checkpoints again.

As an example, let's say the caller is syncing with Esplora. EsploraExt::full_scan succeeds and it turns out we need a checkpoint at height 100 to connect anchor 100:A. However, the network fails so the call to EsploraExt::update_local_chain fails. Based on how our esplora_examples are written, this is okay since we return early without persisting if anything fails. However, if the caller persists after the full_scan update, they may be left with a situation where a corresponding checkpoint is never fetched.

Another problem is that tx_graph::ChangeSet::missing_heights_from does not work well with bdk_wallet since we don't actually return the changeset in wallet methods. A solution to this would be to provide a method that calls ::missing_height_from on the staged changeset. However, what if the caller commits the staged changeset before fetching missing heights? (we'll have the same problem as stated in the aforementioned example). Another solution would be to have the method output missing heights after applying update. However, if fetching relevant checkpoints fail, and we don't somehow record the missing heights, we won't have another opportunity to fetch corresponding checkpoints again.

Because relying on a returned changeset to find missing checkpoints is problematic, I think the safest solution would be to determine missing heights from the wallet's TxGraph (however not with TxGraph::missing_heights, as it was problematic as stated in this PR). Therefore I think my second proposal - TxGraph::floating_anchor_heights - which fetches all heights that contains anchors that do not point to the best chain, makes the most sense).

@evanlinjin
Copy link
Member Author

After chatting with @LLFourn a couple of days ago, I am convinced by his idea to remove the need to lock the wallet multiple times. @LLFourn suggested that since we are already passing a linked list of checkpoints to the chain source, we can use that to determine missing heights instead of locking LocalChain. This will require iterating backwards to determine whether update anchors connect with our chain - which is expensive. However, we can use a skip list structure to make checkpoint traversal more efficient. I think this should be our first step and that means we can get rid of EsploraExt::update_local_chain.

@nondiremanuel nondiremanuel moved this from Discussion to Needs Review in BDK Mar 12, 2024
@notmandatory notmandatory added module-blockchain api A breaking API change labels Mar 16, 2024
@github-project-automation github-project-automation bot moved this from Needs Review to Done in BDK Apr 6, 2024
evanlinjin added a commit that referenced this issue Apr 22, 2024
96a9aa6 feat(chain): refactor `merge_chains` (志宇)
2f22987 chore(chain): fix comment (志宇)
daf588f feat(chain): optimize `merge_chains` (志宇)
77d3595 feat(chain)!: rm `local_chain::Update` (志宇)
1269b06 test(chain): fix incorrect test case (志宇)
72fe65b feat(esplora)!: simplify chain update logic (志宇)
eded1a7 feat(chain): introduce `CheckPoint::insert` (志宇)
519cd75 test(esplora): move esplora tests into src files (志宇)
a6e613e test(esplora): add `test_finalize_chain_update` (志宇)
494d253 feat(testenv): add `genesis_hash` method (志宇)
886d72e chore(chain)!: rm `missing_heights` and `missing_heights_from` methods (志宇)
bd62aa0 feat(esplora)!: remove `EsploraExt::update_local_chain` (志宇)
1e99793 feat(testenv): add `make_checkpoint_tip` (志宇)

Pull request description:

  Fixes #1354

  ### Description

  Built on top of both #1369 and #1373, we simplify the `EsploraExt` API by removing the `update_local_chain` method and having `full_scan` and `sync` update the local chain in the same call. The `full_scan` and `sync` methods now takes in an additional input (`local_tip`) which provides us with the view of the `LocalChain` before the update. These methods now return structs `FullScanUpdate` and `SyncUpdate`.

  The examples are updated to use this new API. `TxGraph::missing_heights` and `tx_graph::ChangeSet::missing_heights_from` are no longer needed, therefore they are removed.

  Additionally, we used this opportunity to simplify the logic which updates `LocalChain`. We got rid of the `local_chain::Update` struct (which contained the update `CheckPoint` tip and a `bool` which signaled whether we want to introduce blocks below point of agreement). It turns out we can use something like `CheckPoint::insert` so the chain source can craft an update based on the old tip. This way, we can make better use of `merge_chains`' optimization that compares the `Arc` pointers of the local and update chain (before we were crafting the update chain NOT based on top of the previous local chain). With this, we no longer need the `Update::introduce_older_block` field since the logic will naturally break when we reach a matching `Arc` pointer.

  ### Notes to the reviewers

  * Obtaining the `LocalChain`'s update now happens within `EsploraExt::full_scan` and `EsploraExt::sync`. Creating the `LocalChain` update is now split into two methods (`fetch_latest_blocks` and `chain_update`) that are called before and after fetching transactions and anchors.
  * We need to duplicate code for `bdk_esplora`. One for blocking and one for async.

  ### Changelog notice

  * Changed `EsploraExt` API so that sync only requires one round of fetching data. The `local_chain_update` method is removed and the `local_tip` parameter is added to the `full_scan` and `sync` methods.
  * Removed `TxGraph::missing_heights` and `tx_graph::ChangeSet::missing_heights_from` methods.
  * Introduced `CheckPoint::insert` which allows convenient checkpoint-insertion. This is intended for use by chain-sources when crafting an update.
  * Refactored `merge_chains` to also return the resultant `CheckPoint` tip.
  * Optimized the update `LocalChain` logic - use the update `CheckPoint` as the new `CheckPoint` tip when possible.

  ### Checklists

  #### All Submissions:

  * [x] I've signed all my commits
  * [x] I followed the [contribution guidelines](https://github.com/bitcoindevkit/bdk/blob/master/CONTRIBUTING.md)
  * [x] I ran `cargo fmt` and `cargo clippy` before committing

  #### New Features:

  * [x] I've added tests for the new feature
  * [x] I've added docs for the new feature

ACKs for top commit:
  LLFourn:
    ACK 96a9aa6

Tree-SHA512: 3d4f2eab08a1fe94eb578c594126e99679f72e231680b2edd4bfb018ba1d998ca123b07acb2d19c644d5887fc36b8e42badba91cd09853df421ded04de45bf69
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api A breaking API change discussion There's still a discussion ongoing module-blockchain
Projects
Archived in project
4 participants