
[bug]: Channel is stuck in pending state. #8028

Closed
ziggie1984 opened this issue Sep 24, 2023 · 23 comments
Labels
bug Unintended code behaviour force closes P2 should be fixed if one has time utxo sweeping

Comments

@ziggie1984
Collaborator

Reported by several node runners: channels whose on-chain contracts are all resolved are still stuck in the pending state.

The problem stems from the behaviour of the sweeper, especially when a sweep is not resolved (pending in the mempool) and lnd restarts.

Example: a force close leads to an output (a sweeper input) which is registered with the sweeper engine. So far so good. Let's say the fee is not sufficient and the node runner restarts the node after a while. lnd does not remember the old sweeps, so on restart it relaunches the contract resolvers, including the output from the force close. Very likely lnd tries to publish a sweep for this input again, but it is rejected by the mempool because the old sweep of this output is already there. The problem is that lnd will keep trying to sweep this input until MaxSweepAttempts is reached, at which point it removes the spend notifier for this input. Now imagine the old sweep tx confirms: lnd is not able to register this spend, and the channel stays in the pending state forever (the user can still abandon the channel, but that's another topic).

The problem lies here: https://github.com/lightningnetwork/lnd/blob/master/sweep/sweeper.go#L1373-L1382 => we basically remove the channel that notifies about a spend because we remove the input altogether.

So we could fix this quick and dirty, but I think this should be taken care of as part of the sweeper refactoring, wdyt @yyforyongyu? I think the best strategy for bitcoind backends is to check whether the input is already spent when registering it with the sweeper (the gettxout RPC call from bitcoind, which accounts for spends in the mempool). For other backends I still have to think about a solution.

I am wondering why the rescan when registering the input with the sweeper did not signal "already spent" when calling waitForSpend; maybe the look-ahead is too short?
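The gettxout-based check proposed above could look roughly like the following sketch. This is illustrative only, not lnd's actual code: `StubRPC` and `input_already_spent` are hypothetical names, and `rpc` stands in for any bitcoind JSON-RPC client. The key property used here is that bitcoind's `gettxout` returns null for an output that is already spent, including spends that are only in the mempool when `include_mempool` is true.

```python
def input_already_spent(rpc, txid: str, vout: int) -> bool:
    """Return True if the outpoint is spent on-chain or in the mempool.

    gettxout returns None (JSON null) for spent outputs; with
    include_mempool=True, outputs spent by mempool transactions also
    come back as None.
    """
    result = rpc.gettxout(txid, vout, True)  # include_mempool=True
    return result is None


class StubRPC:
    """Minimal stand-in for a real bitcoind RPC client (illustration only)."""

    def __init__(self, unspent_outputs):
        self.unspent = unspent_outputs

    def gettxout(self, txid, vout, include_mempool):
        # A real client would issue a JSON-RPC call here.
        return self.unspent.get((txid, vout))


rpc = StubRPC({("aa" * 32, 0): {"value": 0.001}})
print(input_already_spent(rpc, "aa" * 32, 0))  # still unspent -> False
print(input_already_spent(rpc, "bb" * 32, 1))  # spent/unknown -> True
```

A sweeper doing this check at input-registration time could skip the publish-retry loop entirely for inputs whose old sweep is already in the mempool.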

@ziggie1984 ziggie1984 added bug Unintended code behaviour needs triage labels Sep 24, 2023
@bitromortac bitromortac added recovery Related to the backup/restoration of LND data (e.g. wallet seeds) bug Unintended code behaviour needs triage force closes and removed bug Unintended code behaviour needs triage recovery Related to the backup/restoration of LND data (e.g. wallet seeds) labels Sep 25, 2023
@saubyk saubyk added P2 should be fixed if one has time and removed needs triage labels Sep 26, 2023
@guggero
Collaborator

guggero commented Sep 26, 2023

Not sure if this is related or not, but I noticed this from the log in #8007:

Using height hint 809349 retrieved from cache for outpoint=7e26f401a44a5398529032ac50e9a6182a5719fdc0f5630451c2aa7c6c49d585:0, script=0 6eec80ba4a7d32058575c31c24a29946ca03a46c4829be5801c6f5a6b88ae7bd instead of 724383 for spend subscription

This is a channel whose force-close TX has 50k confirmations on-chain, but lnd still sees it as pending. Could it be that our own sweep TX "poisons" our height hint cache and causes us to miss spend notifications?

I see that we try to re-publish the force close TX which gets refused:

2023-09-25 16:47:31.217 [INF] CNCT: Re-publishing force close tx(286d1a06df7ed99dae48636f42ad9bee7d1fcd15e0e31db09517bf68b634e2d1) for channel 7e26f401a44a5398529032ac50e9a6182a5719fdc0f5630451c2aa7c6c49d585:0
...
2023-09-25 16:47:31.232 [INF] LNWL: Inserting unconfirmed transaction 286d1a06df7ed99dae48636f42ad9bee7d1fcd15e0e31db09517bf68b634e2d1
2023-09-25 16:47:31.304 [INF] LNWL: 286d1a06df7ed99dae48636f42ad9bee7d1fcd15e0e31db09517bf68b634e2d1: tx already confirmed
...
2023-09-25 16:47:31.435 [INF] NTFN: New confirmation subscription: conf_id=2, txid=286d1a06df7ed99dae48636f42ad9bee7d1fcd15e0e31db09517bf68b634e2d1, num_confs=6 height_hint=809351
..
2023-09-25 16:47:31.449 [DBG] NTFN: Dispatching historical confirmation rescan for txid=286d1a06df7ed99dae48636f42ad9bee7d1fcd15e0e31db09517bf68b634e2d1

@sanjay-shah

Thanks @guggero and @ziggie1984 for your help in resolving this issue. If this gets implemented in any pre-release branch, I can test it in my current environment with pending channels and let you know the results.

@yyforyongyu
Collaborator

Based on the description above, there are a few questions here.

Now Lnd is relaunching the contract resolvers including the Output from the Force-Close.

lnd shouldn't relaunch the resolvers if they are resolved on-chain. So the sweep tx should still live in the mempool.

I am wondering why the rescan when registering the input with the sweeper did not signal "already spent" when calling waitforspend maybe the look ahead is too short ?

Since the tx is not confirmed, waitForSpend won't return. I think the look ahead we use here is the broadcast height so it should be fine? Do you have logs showing the output that's causing the issue?

On the sweeper front, we store the sweep tx via NotifyPublishTx so I think we should leverage the sweeper store during startup to reconstruct the pending inputs state.
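The idea of reconstructing pending-input state from the sweeper store could be sketched as below. All names here are hypothetical (this is not lnd's actual store API): the point is simply to index every previously published sweep by the outpoints it spends, so that on startup a re-offered input can be matched to the sweep tx that already spends it instead of being re-broadcast.

```python
def index_sweeps(published_sweeps):
    """Build a map from input outpoint -> sweep txid that spends it.

    published_sweeps: list of (txid, [outpoints]) records, standing in
    for whatever the sweeper store persisted via NotifyPublishTx.
    """
    spent_by = {}
    for txid, outpoints in published_sweeps:
        for op in outpoints:
            spent_by[op] = txid
    return spent_by


# Hypothetical store contents: two sweeps published before the restart.
store = [
    ("sweep_tx_1", [("chan_point_a", 0)]),
    ("sweep_tx_2", [("chan_point_b", 1), ("chan_point_c", 0)]),
]
index = index_sweeps(store)
print(index[("chan_point_b", 1)])  # -> sweep_tx_2
```

On startup the sweeper could then wait for confirmation of the known sweep tx rather than burning through MaxSweepAttempts on a doomed re-publish.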

@ziggie1984
Collaborator Author

lnd shouldn't relauch the resolvers if they are resolved onchain. So the sweep tx should still live in the mempool.

lnd will restart the resolvers as long as they are not confirmed, so when restarting lnd those outpoints will be re-offered. Maybe I did not 100% understand the question?

Since the tx is not confirmed, waitForSpend won't return. I think the look ahead we use here is the broadcast height so it should be fine? Do you have logs showing the output that's causing the issue?

You can see in the logs in #8007 that, for example, the force-close transaction is already confirmed, but the height hint got updated and now lnd will not rescan/find the spend. So, as guggero said, something is poisoning our height hint.

On the sweeper front, we store the sweep tx via NotifyPublishTx so I think we should leverage the sweeper store during startup to reconstruct the pending inputs state.

Good point, I think we could do that, or maybe restructure the db so that we have a kind of map from an outpoint to its potential spending transaction.

@ziggie1984
Collaborator Author

@guggero I think a quick fix for @sanjay-shah would be to add a command in chantools that drops the spend and conf height hint caches, so that a restart would trigger re-resolving all channels? Not a long-term fix, but something that can be done quickly?

@ziggie1984
Collaborator Author

Tried to analyse why our height hint cache is filled with garbage data, but could not find anything interesting that could cause this behaviour. Normally the height hint is only updated when a historical rescan does not reveal the transaction in the blockchain, or when the transaction confirms.

So my suggestion is to also reset the spend/conf height hint caches when starting lnd with --reset-wallet-transactions=true, so that we can at least recover the funds if we run into this situation again. Currently we only drop the transactions, so I think it makes sense to include the caches as well.

@ziggie1984
Collaborator Author

@sanjay-shah while fixing another bug I found the following setting already present in lnd:

--height-hint-cache-query-disable, try starting lnd with this setting, and your funds should be recovered.

@sanjay-shah

sanjay-shah commented Oct 25, 2023

@ziggie1984 I ran lnd --height-hint-cache-query-disable.

I did see one transaction being sent to the mempool, and it's also confirmed. It was related to an expired HTLC of a channel closed around one year ago: https://mempool.space/tx/0ce6fe4890977aa4e253e0c3eb2ef3288b5b7ca028eef906c98e5223348a3535

However, my lnd has been crashing since I ran it with the --height-hint-cache-query-disable option.
Now I can't even run lnd normally. Please see the attached lnd.log. And please help urgently, as we were scheduled to release our app tomorrow :(

lnd.log

After this log I get a runtime error that is not logged in the log file:
panic: runtime error: index out of range [1] with length 0

goroutine 4051 [running]:
github.com/lightningnetwork/lnd/contractcourt.isPreimageSpend(...)
github.com/lightningnetwork/lnd/contractcourt/htlc_timeout_resolver.go:255
github.com/lightningnetwork/lnd/contractcourt.(*htlcTimeoutResolver).Resolve(0xc002c6b000)
github.com/lightningnetwork/lnd/contractcourt/htlc_timeout_resolver.go:281 +0x2a9
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).resolveContract(0xc001da0840, {0x2258fe0?, 0xc002c6b000})
github.com/lightningnetwork/lnd/contractcourt/channel_arbitrator.go:2371 +0x2a9
created by github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).launchResolvers
github.com/lightningnetwork/lnd/contractcourt/channel_arbitrator.go:1376 +0xd2


@ziggie1984
Collaborator Author

Try updating to lnd 0.17.0, which has a bugfix for this case.

@ziggie1984
Collaborator Author

Ok, will take a look; it also crashes with 0.17.0.

@ziggie1984
Collaborator Author

ziggie1984 commented Oct 25, 2023

I do not remember exactly, but why do we not also check for robustness (the length of the witness) here? Maybe you could help, @yyforyongyu: https://github.com/lightningnetwork/lnd/blob/master/contractcourt/htlc_timeout_resolver.go#L377

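The robustness check discussed here amounts to never indexing into a witness stack without first verifying its length, which is exactly what the panic above (index out of range [1] with length 0) violated. A hedged sketch, not lnd's actual isPreimageSpend logic; the element position and 32-byte preimage size are illustrative of the non-taproot HTLC path only:

```python
def looks_like_preimage_spend(witness: list[bytes]) -> bool:
    """Guess whether a witness spends via preimage reveal.

    Checks the stack length before touching witness[1], so a witness
    that was stripped by the backend (length 0) cannot cause an
    index-out-of-range crash.
    """
    if len(witness) < 2:
        # Witness data missing or malformed: report "no" instead of
        # panicking; the caller can treat this as a critical error.
        return False
    # On the assumed script layout, a 32-byte second element is the preimage.
    return len(witness[1]) == 32


print(looks_like_preimage_spend([]))                       # empty stack -> False
print(looks_like_preimage_spend([b"\x01", b"\x00" * 32]))  # 32-byte elem -> True
```

The same guard pattern applies to every witness index the resolver reads, since a backend can hand back transactions without witness data (as turned out to be the case here).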

@ziggie1984
Collaborator Author

Evaluating the situation on his bitcoind node reveals the following tx information:

bitcoin-cli getrawtransaction 0ce6fe4890977aa4e253e0c3eb2ef3288b5b7ca028eef906c98e5223348a3535
0200000002a72b35f9bcd46bb78c0cf63c6cd2bdc467044d4837ea27f8b905b51d20a6b2be020000000001000000f3f51c7750cbb11033e3b61e6de295ba6f1f0846fd256038e4dcb3c5cf26889f00000000000000000002127a000000000000220020ead078680d77a590ac3f065ce64ce8d9dbe4462b857857fe39d2f6df5ddf7c61410e30000000000022512089a3a0047f3733310a43fa77c55acd855bcec63040d090a055a28ef3377e8acedf960b00

It's missing all the witness data.

My full node, and also mempool.space, shows this instead:

02000000000102a72b35f9bcd46bb78c0cf63c6cd2bdc467044d4837ea27f8b905b51d20a6b2be020000000001000000f3f51c7750cbb11033e3b61e6de295ba6f1f0846fd256038e4dcb3c5cf26889f00000000000000000002127a000000000000220020ead078680d77a590ac3f065ce64ce8d9dbe4462b857857fe39d2f6df5ddf7c61410e30000000000022512089a3a0047f3733310a43fa77c55acd855bcec63040d090a055a28ef3377e8ace0500483045022100a1fe9f370ac1fde977ec6a147d8e32f07c8b6f6548667bcc0f6cf5b6a10e4dde02202484ab1a9ec46a075c209c3f1127948e0baa26137b0501a6a51df64ba1cfebbd8347304402204a46f33d8fbe7e51ce462d0099caf66fa2c4a94128d7f0e5d4828de25d13b7f002205ab9a9f8d997033e500ce81bdf6ca21a07262311517266f61e6e452992e91ff401008876a9145a9526206556cafdda02fc05c1da0aa49d2f17bb8763ac67210332bdfbea33e7be2d3d8873b144b401c02aedb13ee2211138406c4d574dc4685c7c820120876475527c210334888d0231807688c2cdb942fb61605b460789605c734539e7d749b652527bbf52ae67a91453d453c8befad357c7de27a98858165e2b7a152388ac6851b27568014011de1eab981fc10d5efcb0543890b9b80851eeba7627226601d2a2ad9675542abad51eb8b1f22b8140eb9db9eccae9a0190e17fb9656b14b8cb7c0c86e98be85df960b00

So we also need to add more robustness for the case where it's a local commitment (non-taproot).
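The two hex dumps above differ right after the 4-byte version field: the segwit serialization (BIP 144) inserts a 0x00 marker byte and 0x01 flag byte before the input count, while the witness-stripped dump goes straight to the input count. A quick sanity check on a raw transaction hex string (the truncated prefixes below are taken from the dumps above):

```python
def has_witness_serialization(raw_tx_hex: str) -> bool:
    """Detect the BIP 144 marker/flag bytes in a raw transaction hex dump.

    Bytes 4 and 5 (hex chars 8..12) are 0x00 0x01 in the segwit
    encoding; a legacy encoding has a nonzero input-count byte there.
    """
    return raw_tx_hex[8:12] == "0001"


print(has_witness_serialization("0200000002a72b35f9bcd46b"))    # legacy -> False
print(has_witness_serialization("02000000000102a72b35f9bc"))    # segwit -> True
```

A check like this on getrawtransaction output could flag a backend that strips witness data before the resolvers ever try to index into an empty witness stack.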

@guggero
Collaborator

guggero commented Oct 25, 2023

Doesn't the lack of witness data indicate an issue with the bitcoind node itself? I agree we shouldn't panic, but this would still warrant a critical error (meaning it would trigger a safe shutdown of lnd).

@ziggie1984
Collaborator Author

Agree with you (we should definitely shut down, but shut down safely). His bitcoind is behaving very strangely. He is updating now (he was on 21, and the node is not pruned); it will be interesting to see whether an update fixes the issue:

bitcoin-cli getblockchaininfo
{
  "chain": "main",
  "blocks": 813751,
  "headers": 813751,
  "bestblockhash": "00000000000000000002240d0ee6598bae540d6c2a2552967cf3a586fa9edc26",
  "difficulty": 61030681983175.59,
  "mediantime": 1698220005,
  "verificationprogress": 0.9999996602748938,
  "initialblockdownload": false,
  "chainwork": "00000000000000000000000000000000000000005a5a43f64d82a9739fd4cf18",
  "size_on_disk": 591392393331,
  "pruned": false,
  "softforks": {
    "bip34": {
      "type": "buried",
      "active": true,
      "height": 227931
    },
    "bip66": {
      "type": "buried",
      "active": true,
      "height": 363725
    },
    "bip65": {
      "type": "buried",
      "active": true,
      "height": 388381
    },
    "csv": {
      "type": "buried",
      "active": true,
      "height": 419328
    },
    "segwit": {
      "type": "buried",
      "active": true,
      "height": 481824
    },
    "taproot": {
      "type": "bip9",
      "bip9": {
        "status": "active",
        "start_time": 1619222400,
        "timeout": 1628640000,
        "since": 709632,
        "min_activation_height": 709632
      },
      "height": 709632,
      "active": true
    }
  },
  "warnings": ""
}

@guggero
Collaborator

guggero commented Oct 25, 2023

I think we've seen this before with a node but couldn't find the issue anymore. I think if the update doesn't work there might be a need to reindex the chain. Not sure if -reindex or -reindex-chainstate is enough, worst case a full reset and starting from an empty data directory for bitcoind might be needed.

@yyforyongyu
Collaborator

#7811 catches this case and gracefully shuts down lnd, as it's been reported before that occasionally bitcoind won't return the witness.

@sanjay-shah

I think we've seen this before with a node but couldn't find the issue anymore. I think if the update doesn't work there might be a need to reindex the chain. Not sure if -reindex or -reindex-chainstate is enough, worst case a full reset and starting from an empty data directory for bitcoind might be needed.

Thanks @guggero. After updating to 25.1, I'm still seeing the same, i.e. no witness data.

I will just spin up a new v25.1 bitcoind full node and see if that fixes the problem.

@sanjay-shah

bitcoind debug.log from while lnd crashed:

debug.log

@ziggie1984
Collaborator Author

With the help of bitcoin-core dev @maflcko, we were able to identify the issue.

When the bitcoind config option rpcserialversion=0 is set, the RPC command only provides the non-segwit serialization of the transaction, which leads to the behaviour above (no witness data).

According to maflcko, this setting will be deprecated in the next release (26), but we will definitely need to highlight it in the docs, or even check for it if that's possible.
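For anyone hitting this, the fix on the bitcoind side is to remove the offending option (or set it explicitly to the witness-capable value) in bitcoin.conf and restart bitcoind:

```
# bitcoin.conf: make getrawtransaction return the full segwit
# serialization, including witness data. This is also the default
# when the option is left unset, so simply deleting any
# rpcserialversion=0 line works too.
rpcserialversion=1
```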

@guggero
Collaborator

guggero commented Oct 26, 2023

Aaah, that makes a lot of sense... Thank you so much @ziggie1984 for digging into this! Should we close this issue, or do you want to link it to the PR that updates the docs and then close it when that's merged?

@ziggie1984
Collaborator Author

Yes let's wait until docs are updated.

@ziggie1984
Collaborator Author

I am not sure whether we can close this issue, though. Sanjay's main problem was that the height hint cache was updated even though the funds were not recovered. I looked into the sweeper code to see whether it somehow poisons the height hint cache, but could not reproduce this behaviour. I have a feeling, though, that the --height-hint-cache-query-disable setting was introduced because of exactly such behaviour in the past.

Maybe close this issue and open a more narrow one describing the problem with the height hint cache?

@guggero
Collaborator

guggero commented Oct 27, 2023

Okay, closing this issue then. @ziggie1984 would you mind creating the issue for the height hint cache? It sounds to me like you might have the most up-to-date context on that. I did a quick scan over our existing issues and there doesn't seem to be one for it yet.
