Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TLV failure message and length relaxation #1021

Merged
merged 4 commits into from
Dec 5, 2022

Conversation

joostjager
Copy link
Collaborator

Extend the failure message specification with a tlv stream in line with the onion tlv payload and p2p messages.

@t-bast
Copy link
Collaborator

t-bast commented Sep 2, 2022

Concept ACK. Note that this doesn't provide a lot of bytes for the tlv stream though, since we're limited to 256 bytes for the failure payload, but it's better than nothing.

An important point to note is that the tlv stream mustn't be length-prefixed (same as our tlv extensions at the end of each lightning message). It's worth adding a test vector to make sure people get that right.

@joostjager
Copy link
Collaborator Author

Yes, I can add a test vector. But I think I'll do it just for the failure message bytes excluding padding and encryption.

04-onion-routing.md Outdated Show resolved Hide resolved
@joostjager joostjager force-pushed the failure-message-tlv branch 2 times, most recently from 3860351 to ff5b364 Compare September 12, 2022 18:17
@joostjager
Copy link
Collaborator Author

Added test vector to ensure that implementers won't prepend the tlv stream with a length.

@t-bast
Copy link
Collaborator

t-bast commented Sep 12, 2022

IMHO it would really be more helpful to future readers to update the existing e2e test vector...
I can take a stab at it when implementing this change for eclair.

@joostjager
Copy link
Collaborator Author

Discussed in spec meeting: pair this change with an increase in the recommended failure message size. Any thoughts on the new size? I was thinking 1024, to bring it roughly in line with the forward pass payload?

@t-bast
Copy link
Collaborator

t-bast commented Sep 13, 2022

1024 sounds good to me since we don't have a real use-case yet, as long as all implementations change their behavior to handle potentially longer messages in the future (this way we can change the spec recommendation later if a use-case comes up that requires a bigger length without having to update intermediate nodes).

@joostjager
Copy link
Collaborator Author

So just to get it right, you're saying we don't change the recommendation yet? If that's indeed the intention, we have to be aware that a later increase allows for fingerprinting of the destination node. Perhaps this needs an activation date... All implementations switch to 1024 byte failure messages on Jan 1, 2023 :)

@t-bast
Copy link
Collaborator

t-bast commented Sep 13, 2022

So just to get it right, you're saying we don't change the recommendation yet?

No, I think it's a good idea to change the recommendation to something bigger, and 1024 sounds good. I'm just saying we can potentially go much bigger (we're only restricted by the whole message being 65kB), but since we don't know yet how we're going to use it, a first step to 1024 is probably a good idea.

But we should also change the spec to say that forwarding nodes should be able to handle messages with a different length than the recommended one, to give us freedom to change it in the future without requiring a network-wide upgrade. Does that make sense?

@joostjager
Copy link
Collaborator Author

Yes. Will also check to see what lnd does and does not support currently.

@joostjager
Copy link
Collaborator Author

It seems that lnd forwards the failure message fine. The only problem is that the payment sender isn't able to handle any other length than 256. So if we'd add a long but optional tlv extension to the failure message today, senders running lnd would interpret that as an unreadable failure message with all the undesired consequences for reputation tracking.

The fix lightningnetwork/lnd#6913 is small, but critical to unblock the usage of longer tlv failures.

@joostjager joostjager force-pushed the failure-message-tlv branch 4 times, most recently from 82e35fd to 7baffed Compare September 14, 2022 07:36
@joostjager
Copy link
Collaborator Author

joostjager commented Sep 14, 2022

Updated PR with 1024 byte tlv failure test vector.

For my own reference: patch used to generate vector lightningnetwork/lnd@master...bottlepay:lnd:failure-message-test-vector and lightningnetwork/lightning-onion@master...joostjager:lightning-onion:failure-message-test-vector

@joostjager joostjager changed the title Onion failure message may be followed by a tlv stream TLV failure message and recommended length to 1024 Sep 14, 2022
@joostjager joostjager force-pushed the failure-message-tlv branch 2 times, most recently from 51cc167 to 344440a Compare September 14, 2022 10:24
@@ -823,9 +823,8 @@ channel.
### Requirements

The _erring node_:
- SHOULD set `pad` such that the `failure_len` plus `pad_len` is equal to 256.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this change a bit more: I think we might need either a new feature bit or a new "empty" TLV key in the onion for the switch over to work properly. Without some sort of signal, then a destination or intermediate node doesn't know if the sender can actually handle/decrypt errors larger than 256 bytes. This is effectively hard coded in the spec and most implementations, so just changing it here without any additional context-specific signalling won't allow for a smooth switchover.

As an example, on the lnd side we may end up black listing a node for sending a larger errors (we have a hard check, and would discard it as internal network funny business or w/e). Instead, if we had the sender signal in the onion to each node (intermediate and also final) that they understand larger errors, then the error'ing node will know if they can attach additional context or not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. The empty TLV key option looks to be the smallest in size. But if we invent a new failure message format in the future that for example fixes error attribution, we'd need another empty TLV key, or use the original TLV with a non-empty value. In the latter case, we need to specify it so that no implementation will trip over a non-empty value from the start.

Perhaps defining a TLV record for sender feature bits from the get go is the cleanest solution. Trade-off is obviously the added size, which I think is one byte extra per hop compared to an empty tlv. This assumes we start fresh with the assignment of failure message feature bits, to not exceed one byte initially.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feedback from the spec meeting on this: since we don't have a short-term use-case for longer error messages, we can take the slow approach that requires less changes and no signaling:

  1. Implementations add support now for reading failure messages of arbitrary length
  2. In 1 year or so, we can consider that senders do support it and may start writing failure messages of arbitrary length
  3. If a specific feature requires arbitrary length messages before that, the usage of that feature is an implicit signal that the sender knows how to read failure messages of arbitrary length, so it's ok to generate them

Copy link
Collaborator Author

@joostjager joostjager Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a short term use case: user defined extensions to the incorrect_details failure. For example for hodl invoices that are cancelled, more details could be added as to why it was cancelled. Or (just brainstorming now) return a "redirect" invoice in a tlv field.

Given the right extension points in a node implementation (fail hodl invoice with reason and expose failure tlv to user of the sending node), users could add and read their own data and this may exceed the current 256 byte limit. I don't think 3) applies to this scenario? LND's htlc interceptor API also allows users to return custom failure messages today, so one receiver-side extension point already exists.

You could argue that it is not only up to the lightning devs to decide that there is no current use case.

Regarding 2): Is it safe to assume that almost all senders support the feature after one year? I don't think there is a direct indication of the proportion of nodes that have upgraded, and this isn't a feature bit that is part of the node announcement either? It could be part of it, but it doesn't seem to be useful because nodes in a path don't know the key of the origin node anyway.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For inbound fees, we probably need this too. See lightningnetwork/lnd#6967, commit "signal failure message format".

t-bast added a commit to ACINQ/eclair that referenced this pull request Sep 29, 2022
The specification recommends using a length of 256 for onion errors, but
it doesn't say that we should reject errors that use a different length.

We may want to start creating errors with a bigger length than 256 if we
need to transmit more data to the sender. In order to prepare for this,
we keep creating 256-bytes onion errors, but allow receiving errors of
arbitrary length.

See the specification here: lightning/bolts#1021

Fixes #2438
@TheBlueMatt
Copy link
Collaborator

Right, that's what I meant by "no failure reason", sorry it was unclear. My point is that if you don't upgrade for two years suffering from degraded route selection performance seems like a perfectly fine outcome.

@joostjager
Copy link
Collaborator Author

Maybe inbound routing fees become a success, and then we need to wait for two years before we can return a complete fee_insufficient failure that exceeds 256 bytes. Or it does not become a success, exactly because the required failure cannot be returned. No fee update failure isn't ideal given the quirks that we sometimes see with gossip propagation.

Of course everyone can have their own opinion about the viability of protocol extensions, the value of experimentation and whether that's worth the trade-off with privacy.

@TheBlueMatt
Copy link
Collaborator

I'm kinda confused why a fee_insufficient failure is > 256 bytes? There's a ton of headroom above existing error messages to play with.

@joostjager
Copy link
Collaborator Author

If the fee is insufficient, it could be that the sender either used an outdated inbound fee policy, an outdated outbound fee policy or both. The routing node only sees a total fee paid, and doesn't know which one of the cases applies.

To cover all cases and ensure that the next attempt is successful, the routing node would need to return both the incoming channel update and the outgoing channel update with the fee_insufficient failure. 256 bytes is not enough for two channel updates.

Of course signed fee information can be returned in a more compact way, but that would deviate from the standard channel update format. I expect that to lead to lots of extra work and maintenance. The graph database for example would also need to support this new compact format.

@t-bast
Copy link
Collaborator

t-bast commented Nov 2, 2022

Of course signed fee information can be returned in a more compact way, but that would deviate from the standard channel update format. I expect that to lead to lots of extra work and maintenance. The graph database for example would also need to support this new compact format.

That sounds like the way forward if you want to do experimentation though? You really don't need to return complete channel updates, and returning only the values you're interested in lets you experiment today, without changes from any implementation nor the spec required.

@joostjager
Copy link
Collaborator Author

joostjager commented Nov 2, 2022

Yes, could do that. But the quicker way to go forward with the experiment is to just signal using a custom onion payload tlv record. Long failures are already relayed back, so no implementation changes required. And for the rest it's just between sender and routing node.

I thought that it would be nice to make that signaling official if we see a future need for long failure messages anyway. But if not, I'll just go ahead and do my own thing.

In that case, I can remove the increase of the recommended failure length to 1024 from this PR and note that longer failure messages are a risk because senders may not be able to parse.

t-bast added a commit to ACINQ/eclair that referenced this pull request Nov 4, 2022
The specification recommends using a length of 256 for onion errors, but
it doesn't say that we should reject errors that use a different length.

We may want to start creating errors with a bigger length than 256 if we
need to transmit more data to the sender. In order to prepare for this,
we keep creating 256-bytes onion errors, but allow receiving errors of
arbitrary length.

See the specification here: lightning/bolts#1021

Fixes #2438
@joostjager
Copy link
Collaborator Author

Removed increase of the recommended message length to 1024 from PR and added comment about old nodes not being able to interpret long messages.

t-bast added a commit to ACINQ/eclair that referenced this pull request Nov 22, 2022
Extend every onion failure with an optional tlv stream.

See the specification here: lightning/bolts#1021
Copy link
Collaborator

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 1a48cdd

The test vectors match with eclair (see ACINQ/eclair#2455).

Copy link
Contributor

@halseth halseth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 1a48cdd

Copy link
Collaborator

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🛹

lnd v0.16.0 is scheduled to ship with the relaxed behavior for the sender (the one decrypting the final payload).

@t-bast
Copy link
Collaborator

t-bast commented Dec 5, 2022

@joostjager do you have a branch of lnd where I can force a node to emit an error bigger than 256 bytes that contains dummy tlvs to test cross-compatibility?

@joostjager
Copy link
Collaborator Author

If you run an lnd routing node with lightningnetwork/lnd#6967 and the sender does not include enough fee, you will get back a long failure message with two channel updates. The sender does need to signal long failure compatibility through a tlv record though.

If this is all a bit too much for the purpose of interop testing, I can prep a simpler branch with a long failure for you.

@t-bast
Copy link
Collaborator

t-bast commented Dec 5, 2022

If this is all a bit too much for the purpose of interop testing, I can prep a simpler branch with a long failure for you.

Don't worry, this should work well enough for my tests. Can you simply update the description of lightningnetwork/lnd#6967 with the details of the tlv that needs to be set in the onion to opt-in to receive the long failure?

@joostjager
Copy link
Collaborator Author

Description updated.

@Roasbeef Roasbeef merged commit a0bbe47 into lightning:master Dec 5, 2022
@joostjager joostjager changed the title TLV failure message and recommended length to 1024 TLV failure message and length relaxation Dec 14, 2022
t-bast added a commit to ACINQ/eclair that referenced this pull request Dec 16, 2022
Extend every onion failure with an optional tlv stream.

See the specification here: lightning/bolts#1021
t-bast added a commit to ACINQ/eclair that referenced this pull request Jan 3, 2023
Extend every onion failure with an optional tlv stream.

See the specification here: lightning/bolts#1021
t-bast added a commit to ACINQ/eclair that referenced this pull request Jan 6, 2023
Extend every onion failure with an optional tlv stream.
Added to the specification by: lightning/bolts#1021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants