-
Notifications
You must be signed in to change notification settings - Fork 911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle blockheight disagreement between us and payee #3376
Handle blockheight disagreement between us and payee #3376
Conversation
@ZmnSCPxj you want me to test now or wait a bit it is more stable? |
It should be stable at this point. I just need to create a reproducible test, which requires me to dredge through our codebase for how we do this kind of testing where we have to pause forwarding temporarily. You can evaluate with your test suite now in parallel to confirm the fix. |
@rustyrussell @cdecker @niftynei the CI "just" needs some re-kicking, it seems failure is only timeout due to valgrind being slow. For some reason I can no longer restart the travis-ci for any ElementsProject/lightning PRs myself, I used to be able to do that, maybe some permissions problem? |
@NicolasDorier there are 4 commits, you have to cherry-pick them in order.
The |
@ZmnSCPxj yes sorry, by the time you replied, I removed the comment ;) |
For the repro of the test, it is simple enough. Create two nodes, sharing the same bitcoind, connect them together. Then create 6 block and imediately after create an invoice and pay it. If you repeat that enough time, eventually you'll get the repro. (10 times should be enough) |
Got the error again. But I think you improved the situation, it took a while before happening again.
|
Unfortunately our test environment tends to strongly synchronize all nodes to Hmmm looks like it would be more reliable to parse |
Other error, seem slightly different?
|
So in summary, still got errors with the following messages: |
The changes in the last bytes of the messages are just the blockheight at which the payee is. The intended way to detect the blockheight disagreement is to check if the payer local blockheight at the time the |
The previous error I reported was |
0x123 == 291 for example, meaning the payee was at height 291 when it got the payment. In your log we then see the payer actually process blocks 290 and 291. Similarly, 0xfb == 251, we see that the log shows the payer only begins to process blocks 250 and 251 after the payment has been returned and failed. So it is still the same error, blockheight disagreement. |
@ZmnSCPxj it is not same error than |
The The initial |
No I did not, but it is possible as there are two places where I send different amount and it randomly fail another place. I found |
This message does not appear anywhere |
here full log of an attempt which happened faster:
|
The message will only appear in |
Let me know if I can do anything else. |
Well I am reimplementing it yet again, thanks anyway for the quick feedback. |
1545350
to
1e43e71
Compare
Hi @NicolasDorier another set of commits for evaluation, note that you need to remove the previous patch completely and re-apply this set. |
@ZmnSCPxj It seems to have fixed it. Ran the test in a 30 times loop and it passed. |
Thanks for your patience and cooperation. Hopefully that unblocks 0.8.0 for your software. For now I will be unavailable for some time so will take this out of draft status even if I have not been able to derive a decent test. |
I wait for ACK and merge, and will add that to my branch! EDIT: Actually I will just push that to my branch now, thanks a lot for fixing it! |
Slowly going through this thread 😉
Could be an issue with Travis migrating some things lately, I think they're
We can do this slightly better: in the test framework the
See above 😉 |
Strange, it seems I got this problem again. So I have two layers of tests, on on my lightning client library, where I could reproduce the error which seemed to be fixed by this PR, and I can't reproduce it there anymore. The other layer of tests is at btcpayserver level, and it seems I still have the error... It does not make sense this happen only on btcpayserver tests. |
1e43e71
to
c658d9e
Compare
Rebased and added new test as per @cdecker recommendation. The test reliably fails before the fix introduced in this PR: https://travis-ci.org/ZmnSCPxj/lightning/jobs/632554191#L1343 and is thus presumed to replicate the reported scenario. @NicolasDorier how is replication of this test/fix at btcpayserver level? |
@ZmnSCPxj I executed the tests again. Does not happen, it is weird, I will blame some caching or timing issue on my side. Did not code review, but this fixed my issue. |
Okay, pending ACK from @cdecker @niftynei or @rustyrussell then. |
@ZmnSCPxj so this bug still appear but very rarely. It happens sometimes when paying an invoice when I create a channel between two peers before running my actual tests. The specificity of this step is that a bunch of block may be mined in relatively short period or time (say 6 blocks at once or more). So I imagine you are retrying only a single time. Would it be a bad idea for me to just retry on my client a payment when I see this error? I don't think this is a blocker for this PR. This case probably don't happen in mainnet. |
The payer should wait until it reaches what the payee reports as blockheight in the failure message (i.e. |
@ZmnSCPxj then something strange. I saw it happening in some of my tests. I ended up doing this hack https://github.com/btcpayserver/BTCPayServer.Lightning/blob/master/src/BTCPayServer.Lightning.All/Tests/ConnectChannels.cs#L93 in my code where I open a channel for my tests. Now everything works fine and reliably. This PR is fixing things for sure as this was happening before even during 1 block disagreement. |
The code seems pretty straightforward, but I'm wondering if a more direct approach isn't more desirable: return an error indicating that the user may need to wait instead of automatically waiting for the blockheight to be reached. The In addition I don't think that |
Presumably the user is interested in getting the payment through, and rare and random failures pointing to If the user wanted to handle retrying by themselves, they are welcome to use In any case, before lightning/bolts@6729755 this would not have been an issue, since that failure would have been a temporary failure at the payee and we would have just spun using the classic loop. The Real Issue is the merging of a temporary failure into a permanent one at the BOLT spec, which leads to this problem. Since there are good reasons for that merge, we need to adapt to the BOLT spec change. Otherwise this is a real regression, one that used not to happen before lightning/bolts@6729755 .
I suggest still exposing Worse, due to how |
getstartblockheight_error(struct command *cmd, | ||
const char *buf, | ||
const jsmntok_t *error, | ||
struct pay_command *pc) | ||
{ | ||
/* Should never happen. */ | ||
plugin_err("getstartblockheight: getinfo failed!? '%.*s'", | ||
error->end - error->start, buf); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This likely needs to fail the pay_attempt
otherwise we're leaking memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plugin_err
leads to the plugin dying, meaning all its memory will disappear.
c658d9e
to
63e6055
Compare
This is needed to fully implement handling of blockheight disagreements between us and payee. If payee believes the blockheight is higher than ours, then `pay` should wait for our node to achieve that blockheight. Changelog-Add: Implement `waitblockheight` to wait for a specific blockheight.
Changelog-Fixed: Detect a previously non-permanent error (`final_cltv_too_soon`) that has been merged into a permanent error (`incorrect_or_unknown_payment_details`), and retry that failure case in `pay`.
63e6055
to
5b7c215
Compare
In that case we should address the root issue and amend the spec to introduce a In the meantime I think it's ok to work around the issue like it is, but we need to address the cause, not the symptoms :-) |
That is what I initially wanted to do, but after a little digging I realized we simply cannot have a non- This is a tradeoff, basically, we are closing a privacy-leaking case, at the cost of increased complexity at our implementation (but which should not reach our users, hence why we do not propagate the failure in this case and special-case its detection). I believe that privacy-preservation is important enough that we should accept complexity in our implementation: it is always easier to break privacy than to preserve it. |
PING! |
ACK 5b7c215 |
Fixes #3367
Due to the recent BOLT change that merges
final_expiry_too_soon
intoincorrect_or_unknown_payment_details
, the previously-retried transient failurefinal_expiry_too_soon
(17), which used to be how blockheight disagreements between payee and payer were signalled, is now a permanent failureincorrect_or_unknown_payment_details
(PERM|15). Since it is a permanent failure at the payee end,pay
did not retry it. (The commit linked is factually incorrect as well: it claimsfinal_expiry_too_soon
to be PERM|17, but looking at its actual diff shows it was 17 before the commit).Blockheight disagreements between payer and payee are transient and should be retried. This patch is parses the returned
height
from the payee and checks if it is the same as the height at which we started the attempt; if different, we wait until we reach the payee blockheight using the newwaitblockheight
command (the first commit), then retry.@NicolasDorier please notice this PR.
Currently draft since I still have to make a proper reproducible test for this. PRed now so that @NicolasDorier can go patch it in his software.
Test plan:
line_graph
ofl1
,l2
,l3
.l2
so it stops before it can forward any payment (I believe we have this feature in our test suite somehow, just need to review it again).l3
.l1
pay the invoice.l2
to stop.l2
, setting it up so it will now correctly forward payments.