MP_FAIL support #52

matttbe · 2020-07-13T14:47:39Z

As defined in RFC8684:

When multiple subflows are in use, the data in flight on a subflow will likely involve data that is not contiguously part of the connection-level stream, since segments will be spread across the multiple subflows. Due to the problems identified above, it is not possible to determine what adjustments have been done to the data (notably, any changes to the subflow sequence numbering). Therefore, it is not possible to recover the subflow, and the affected subflow must be immediately closed with a RST that includes an MP_FAIL option (Figure 16), which defines the data sequence number at the start of the segment (defined by the Data Sequence Mapping) that had the checksum failure. Note that the MP_FAIL option requires the use of the full 64-bit sequence number, even if 32-bit sequence numbers are normally in use in the DSS signals on the path.
                       1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +---------------+---------------+-------+----------------------+
  |     Kind      |   Length=12   |Subtype|      (reserved)      |
  +---------------+---------------+-------+----------------------+
  |                                                              |
  |                 Data Sequence Number (8 octets)              |
  |                                                              |
  +--------------------------------------------------------------+
Figure 16: Fallback (MP_FAIL) Option

The receiver of this option MUST discard all data following the data sequence number specified. Failed data MUST NOT be DATA_ACKed and so will be retransmitted on other subflows (Section 3.3.6).
A special case is when there is a single subflow and it fails with a checksum error. If it is known that all unacknowledged data in flight is contiguous (which will usually be the case with a single subflow), an infinite mapping can be applied to the subflow without the need to close it first, essentially turning off all further MPTCP signaling. In this case, if a receiver identifies a checksum failure when there is only one path, it will send back an MP_FAIL option on the subflow-level ACK, referring to the data-level sequence number of the start of the segment on which the checksum error was detected. The sender will receive this information and, if all unacknowledged data in flight is contiguous, will signal an infinite mapping. This infinite mapping will be a DSS option (Section 3.3) on the first new packet, containing a Data Sequence Mapping that acts retroactively, referring to the start of the subflow sequence number of the most recent segment that was known to be delivered intact (i.e., was successfully DATA_ACKed). From that point onward, data can be altered by a middlebox without affecting MPTCP, as the data stream is equivalent to a regular, legacy TCP session. While in theory paths may only be damaged in one direction -- and the MP_FAIL signal affects only one direction of traffic -- for simplicity of implementation, the receiver of an MP_FAIL MUST also respond with an MP_FAIL in the reverse direction and entirely revert to a regular TCP session.
In the rare case that the data is not contiguous (which could happen when there is only one subflow but it is retransmitting data from a subflow that has recently been uncleanly closed), the receiver MUST close the subflow with a RST with MP_FAIL. The receiver MUST discard all data that follows the data sequence number specified. The sender MAY attempt to create a new subflow belonging to the same connection and, if it chooses to do so, SHOULD immediately place the single subflow in single-path mode by setting an infinite Data Sequence Mapping. This mapping will begin from the data-level sequence number that was declared in the MP_FAIL.
After a sender signals an infinite mapping, it MUST only use subflow ACKs to clear its send buffer. This is because Data ACKs may become misaligned with the subflow ACKs when middleboxes insert or delete data. The receiver SHOULD stop generating Data ACKs after it receives an infinite mapping.

https://www.rfc-editor.org/rfc/rfc8684.html#name-fallback-mp_fail-option

If needed, new tickets can be created to track: receiver/sender side only and packetdrill support.

(Feature from the initial roadmap)

The text was updated successfully, but these errors were encountered:

geliangtang · 2021-01-14T08:32:13Z

Note that if checksum usage has not been negotiated, this fallback mechanism cannot be used unless there is some higher-layer or lower‑layer signal to inform the MPTCP implementation that the payload has been tampered with.

Syzkaller was able to trigger again the following splat: WARNING: CPU: 1 PID: 12512 at net/mptcp/protocol.c:761 mptcp_reset_timer+0x12a/0x160 net/mptcp/protocol.c:761 Modules linked in: CPU: 1 PID: 12512 Comm: kworker/1:6 Not tainted 5.10.0-rc6 #52 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:mptcp_reset_timer+0x12a/0x160 net/mptcp/protocol.c:761 Code: e8 4b 0c ad ff e8 56 21 88 fe 48 b8 00 00 00 00 00 fc ff df 48 c7 04 03 00 00 00 00 48 83 c4 40 5b 5d 41 5c c3 e8 36 21 88 fe <0f> 0b 41 bc c8 00 00 00 eb 98 e8 e7 b1 af fe e9 30 ff ff ff 48 c7 RSP: 0018:ffffc900018c7c68 EFLAGS: 00010293 RAX: ffff888108cb1c80 RBX: 1ffff92000318f8d RCX: ffffffff82ad0307 RDX: 0000000000000000 RSI: ffffffff82ad036a RDI: 0000000000000007 RBP: ffff888113e2d000 R08: ffff888108cb1c80 R09: ffffed10227c5ab7 R10: ffff888113e2d5b7 R11: ffffed10227c5ab6 R12: 0000000000000000 R13: ffff88801f100000 R14: ffff888113e2d5b0 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff88811b500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd76a874ef8 CR3: 000000001689c005 CR4: 0000000000170ee0 Call Trace: mptcp_worker+0xaa4/0x1560 net/mptcp/protocol.c:2334 process_one_work+0x8d3/0x1200 kernel/workqueue.c:2272 worker_thread+0x9c/0x1090 kernel/workqueue.c:2418 kthread+0x303/0x410 kernel/kthread.c:292 ret_from_fork+0x22/0x30 arch/x86/entry/entry_64.S:296 The mptcp_worker tries to update the MPTCP retransmission timer even if such timer is not currently scheduled. The mptcp_rtx_head() return value is bogus: we can have enqueued data not yet transmitted. The above may additionally cause spurious, unneeded MPTCP-level retransmissions. Fix the issue adding an explicit clearing the rtx queue before trying to retransmit and checking for unacked data Additionally drop an unneeded timer stop call and the unused mptcp_rtx_tail() helper. Reported-by: Christoph Paasch <[email protected]> Fixes: 6e628cd ("mptcp: use mptcp release_cb for delayed tasks") Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>

Syzkaller was able to trigger the following splat again: WARNING: CPU: 1 PID: 12512 at net/mptcp/protocol.c:761 mptcp_reset_timer+0x12a/0x160 net/mptcp/protocol.c:761 Modules linked in: CPU: 1 PID: 12512 Comm: kworker/1:6 Not tainted 5.10.0-rc6 #52 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:mptcp_reset_timer+0x12a/0x160 net/mptcp/protocol.c:761 Code: e8 4b 0c ad ff e8 56 21 88 fe 48 b8 00 00 00 00 00 fc ff df 48 c7 04 03 00 00 00 00 48 83 c4 40 5b 5d 41 5c c3 e8 36 21 88 fe <0f> 0b 41 bc c8 00 00 00 eb 98 e8 e7 b1 af fe e9 30 ff ff ff 48 c7 RSP: 0018:ffffc900018c7c68 EFLAGS: 00010293 RAX: ffff888108cb1c80 RBX: 1ffff92000318f8d RCX: ffffffff82ad0307 RDX: 0000000000000000 RSI: ffffffff82ad036a RDI: 0000000000000007 RBP: ffff888113e2d000 R08: ffff888108cb1c80 R09: ffffed10227c5ab7 R10: ffff888113e2d5b7 R11: ffffed10227c5ab6 R12: 0000000000000000 R13: ffff88801f100000 R14: ffff888113e2d5b0 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff88811b500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd76a874ef8 CR3: 000000001689c005 CR4: 0000000000170ee0 Call Trace: mptcp_worker+0xaa4/0x1560 net/mptcp/protocol.c:2334 process_one_work+0x8d3/0x1200 kernel/workqueue.c:2272 worker_thread+0x9c/0x1090 kernel/workqueue.c:2418 kthread+0x303/0x410 kernel/kthread.c:292 ret_from_fork+0x22/0x30 arch/x86/entry/entry_64.S:296 The mptcp_worker tries to update the MPTCP retransmission timer even if such timer is not currently scheduled. The mptcp_rtx_head() return value is bogus: we can have enqueued data not yet transmitted. The above may additionally cause spurious, unneeded MPTCP-level retransmissions. Fix the issue adding an explicit clearing of the rtx queue before trying to retransmit and checking for unacked data. Additionally drop an unneeded timer stop call and the unused mptcp_rtx_tail() helper. Reported-by: Christoph Paasch <[email protected]> Fixes: 6e628cd ("mptcp: use mptcp release_cb for delayed tasks") Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>

mjmartineau · 2021-07-28T23:39:42Z

mp_join-01-ns1-0-D2LlXf.pcap.gz

Attached pcap (running v6 patch set with @geliangtang test code) shows:

The checksum failure is detected, a packet is sent with MP_FAIL.
However, the packet also has data payload and no DSS option.
The peer receives MP_FAIL, and echoes back. But it sends two TCP RST +
MP_FAIL packets back-to-back.

Cross reference to mailing list: https://lore.kernel.org/mptcp/[email protected]/

matttbe · 2021-08-04T10:02:50Z

Can be closed thanks to @geliangtang 's patches now in our tree:

6bf0897: mptcp: MP_FAIL suboption sending
913f5ee: mptcp: MP_FAIL suboption receiving
cce924a: mptcp: send out MP_FAIL when data checksum fails
e143565: mptcp: add the mibs for MP_FAIL
7740516: selftests: mptcp: add MP_FAIL mibs check
Results: 7706daf..d2d5ed4

cifsd-fixes

The inline assembly for arm64's cmpxchg_double*() implementations use a +Q constraint to hazard against other accesses to the memory location being exchanged. However, the pointer passed to the constraint is a pointer to unsigned long, and thus the hazard only applies to the first 8 bytes of the location. GCC can take advantage of this, assuming that other portions of the location are unchanged, leading to a number of potential problems. This is similar to what we fixed back in commit: fee960b ("arm64: xchg: hazard against entire exchange variable") ... but we forgot to adjust cmpxchg_double*() similarly at the same time. The same problem applies, as demonstrated with the following test: | struct big { | u64 lo, hi; | } __aligned(128); | | unsigned long foo(struct big *b) | { | u64 hi_old, hi_new; | | hi_old = b->hi; | cmpxchg_double_local(&b->lo, &b->hi, 0x12, 0x34, 0x56, 0x78); | hi_new = b->hi; | | return hi_old ^ hi_new; | } ... which GCC 12.1.0 compiles as: | 0000000000000000 <foo>: | 0: d503233f paciasp | 4: aa0003e4 mov x4, x0 | 8: 1400000e b 40 <foo+0x40> | c: d2800240 mov x0, #0x12 // #18 | 10: d2800681 mov x1, #0x34 // #52 | 14: aa0003e5 mov x5, x0 | 18: aa0103e6 mov x6, x1 | 1c: d2800ac2 mov x2, #0x56 // #86 | 20: d2800f03 mov x3, #0x78 // #120 | 24: 48207c82 casp x0, x1, x2, x3, [x4] | 28: ca050000 eor x0, x0, x5 | 2c: ca060021 eor x1, x1, x6 | 30: aa010000 orr x0, x0, x1 | 34: d2800000 mov x0, #0x0 // #0 <--- BANG | 38: d50323bf autiasp | 3c: d65f03c0 ret | 40: d2800240 mov x0, #0x12 // #18 | 44: d2800681 mov x1, #0x34 // #52 | 48: d2800ac2 mov x2, #0x56 // #86 | 4c: d2800f03 mov x3, #0x78 // #120 | 50: f9800091 prfm pstl1strm, [x4] | 54: c87f1885 ldxp x5, x6, [x4] | 58: ca0000a5 eor x5, x5, x0 | 5c: ca0100c6 eor x6, x6, x1 | 60: aa0600a6 orr x6, x5, x6 | 64: b5000066 cbnz x6, 70 <foo+0x70> | 68: c8250c82 stxp w5, x2, x3, [x4] | 6c: 35ffff45 cbnz w5, 54 <foo+0x54> | 70: d2800000 mov x0, #0x0 // #0 <--- BANG | 74: d50323bf autiasp | 78: d65f03c0 ret Notice that at the lines with "BANG" comments, GCC has assumed that the higher 8 bytes are unchanged by the cmpxchg_double() call, and that `hi_old ^ hi_new` can be reduced to a constant zero, for both LSE and LL/SC versions of cmpxchg_double(). This patch fixes the issue by passing a pointer to __uint128_t into the +Q constraint, ensuring that the compiler hazards against the entire 16 bytes being modified. With this change, GCC 12.1.0 compiles the above test as: | 0000000000000000 <foo>: | 0: f9400407 ldr x7, [x0, #8] | 4: d503233f paciasp | 8: aa0003e4 mov x4, x0 | c: 1400000f b 48 <foo+0x48> | 10: d2800240 mov x0, #0x12 // #18 | 14: d2800681 mov x1, #0x34 // #52 | 18: aa0003e5 mov x5, x0 | 1c: aa0103e6 mov x6, x1 | 20: d2800ac2 mov x2, #0x56 // #86 | 24: d2800f03 mov x3, #0x78 // #120 | 28: 48207c82 casp x0, x1, x2, x3, [x4] | 2c: ca050000 eor x0, x0, x5 | 30: ca060021 eor x1, x1, x6 | 34: aa010000 orr x0, x0, x1 | 38: f9400480 ldr x0, [x4, #8] | 3c: d50323bf autiasp | 40: ca0000e0 eor x0, x7, x0 | 44: d65f03c0 ret | 48: d2800240 mov x0, #0x12 // #18 | 4c: d2800681 mov x1, #0x34 // #52 | 50: d2800ac2 mov x2, #0x56 // #86 | 54: d2800f03 mov x3, #0x78 // #120 | 58: f9800091 prfm pstl1strm, [x4] | 5c: c87f1885 ldxp x5, x6, [x4] | 60: ca0000a5 eor x5, x5, x0 | 64: ca0100c6 eor x6, x6, x1 | 68: aa0600a6 orr x6, x5, x6 | 6c: b5000066 cbnz x6, 78 <foo+0x78> | 70: c8250c82 stxp w5, x2, x3, [x4] | 74: 35ffff45 cbnz w5, 5c <foo+0x5c> | 78: f9400480 ldr x0, [x4, #8] | 7c: d50323bf autiasp | 80: ca0000e0 eor x0, x7, x0 | 84: d65f03c0 ret ... sampling the high 8 bytes before and after the cmpxchg, and performing an EOR, as we'd expect. For backporting, I've tested this atop linux-4.9.y with GCC 5.5.0. Note that linux-4.9.y is oldest currently supported stable release, and mandates GCC 5.1+. Unfortunately I couldn't get a GCC 5.1 binary to run on my machines due to library incompatibilities. I've also used a standalone test to check that we can use a __uint128_t pointer in a +Q constraint at least as far back as GCC 4.8.5 and LLVM 3.9.1. Fixes: 5284e1b ("arm64: xchg: Implement cmpxchg_double") Fixes: e9a4b79 ("arm64: cmpxchg_dbl: patch in lse instructions when supported by the CPU") Reported-by: Boqun Feng <[email protected]> Link: https://lore.kernel.org/lkml/Y6DEfQXymYVgL3oJ@boqun-archlinux/ Reported-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Mark Rutland <[email protected]> Cc: [email protected] Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Steve Capper <[email protected]> Cc: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>

matttbe added the enhancement label Jul 13, 2020

matttbe mentioned this issue Jan 13, 2021

Checksum support #134

Closed

geliangtang closed this as completed Jan 14, 2021

geliangtang reopened this Jan 14, 2021

geliangtang self-assigned this Feb 25, 2021

geliangtang mentioned this issue Jul 15, 2021

The infinite mapping support: kernel side #216

Closed

15 tasks

matttbe closed this as completed Aug 4, 2021

dcaratti pushed a commit to dcaratti/mptcp_net-next that referenced this issue Sep 2, 2021

Merge pull request multipath-tcp#52 from namjaejeon/cifsd-for-next

55b4fcd

cifsd-fixes

FoRward-999 mentioned this issue Jan 16, 2023

Issues with BPF packet scheduler #336

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MP_FAIL support #52

MP_FAIL support #52

matttbe commented Jul 13, 2020

geliangtang commented Jan 14, 2021

mjmartineau commented Jul 28, 2021

matttbe commented Aug 4, 2021

MP_FAIL support #52

MP_FAIL support #52

Comments

matttbe commented Jul 13, 2020

geliangtang commented Jan 14, 2021

mjmartineau commented Jul 28, 2021

matttbe commented Aug 4, 2021