pasta: UDP/IPv4 large transfer, tap: mismatch #24147
...only in parallel tests, and it happened only twice so far, correct?
Once in my own parallel testing on my f40 laptop, in the …
I instrumented the test, and got a hit (flake) on my first try:
Thanks a lot, that's really helpful. 4096 bytes makes it look like it's not actually an issue in pasta itself, rather a race in the tests (that's the size of a page, not a packet or any buffer we use). I'll have another look at the test and try to make this more likely to happen somehow and hopefully reproduce it...
I'm able to reproduce this easily (by simply running that test in a loop). I'm still clueless as to what's going on.
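Running the test in a loop can be as simple as re-running the bats file until it fails; a minimal sketch, assuming the pasta UDP tests live in Podman's test/system/505-networking-pasta.bats (that path is an assumption, not quoted from this thread):

    # re-run the suspect test file until the flake shows up
    n=0
    while bats test/system/505-networking-pasta.bats; do
        n=$((n + 1)); echo "pass #$n"
    done
    echo "flaked after $n clean passes"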
With commit edsantiago@a659d9f and packet captures:
In my case, 8192 bytes are missing from somewhere in the middle of the file. From the capture, it looks like the kernel drops frame #4 (the second UDP data packet pasta is sending, range 8192-16384): there's an ICMP port unreachable message referring to that data. Curiously, in this case, we also see an ARP request for 88.198.0.161 (the default gateway in the container) and pasta replying to it (frames 6 and 7). Overview using tshark(1) output:
- Expected.
- Start of transfer.
- ARP request and reply for default gateway, only visible in the bad case.
- Transfer continues.
- Completely unexpected: this refers to frame #4.
- Transfer goes on.

This starts looking like a kernel issue to me: if we send UDP messages shortly after we set up the namespace, one message (and not the first one!) gets a port unreachable reply (this happens after we check, via procfs, that the port is bound).

By the way, this looks similar to a failure I've noticed with a pending series (not merged yet): https://archives.passt.top/passt-dev/20241009224433.7fc28fc7@elisabeth/. That's from passt's tests (no Podman): on a spliced connection (loopback path, not tap) we get a "host unreachable" error from the kernel for the tap (local, non-loopback) address.

I'll add some prints to the kernel and try to figure out what's going on there. This doesn't appear to be a race in the tests (data in the middle of a transfer is missing) and it doesn't look like an issue in pasta (we're sending all the packets we need to send, and the kernel drops one without any apparent reason).
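For reference, a capture like the one described here can be taken with pasta's -p option and summarized with tshark; the commands below are lifted from the reproducer quoted later in this thread (port 1337 and the file names come from that reproducer, not from this particular debugging session):

    # have pasta write a packet capture of the traffic it handles to pasta.pcap
    ./pasta --config-net -p pasta.pcap -u 1337 \
        socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
    # ...run the UDP transfer, then get a one-line-per-frame overview
    tshark -r pasta.pcap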
I can't tell if that's a really really fun issue to debug, or a really really horrible one, or both. Whatever the case, thank you for looking into it.
...for debugging containers#24147, because "md5sum mismatch" is not the best way to troubleshoot bytestream differences.

socat is run in the container, so this requires building a new testimage (20241011). Bump to new CI VMs[1] which include it.

[1] containers/automation_images#389

Signed-off-by: Ed Santiago <[email protected]>
So, I was trying to find out why the kernel can't match the bound port with a UDP socket (which is the reason why that ICMP message is generated, and the corresponding datagram dropped)... and, since yesterday, I can't reproduce this anymore. Same kernel, same exact build of Podman and pasta, same machine, even the same terminal (I still had it open). Earlier it would happen once out of ~10 times; since yesterday it's completely gone, and today it didn't fail once in a loop of 1000 runs. I'm completely puzzled.

Now I'm trying to increase/decrease memory pressure on that machine. So, well, I don't really have an update yet. It's just turning out to be much subtler than I thought. I don't think it depends on the timing of the test, by the way.
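As for generating memory pressure, one simple possibility (an illustration only, assuming stress-ng is installed; these are not the knobs actually used here):

    # hypothetical example: hold roughly 75% of RAM in anonymous mappings
    # across 4 workers for ten minutes
    stress-ng --vm 4 --vm-bytes 75% --timeout 600s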
...for debugging containers#24147, because "md5sum mismatch" is not the best way to troubleshoot bytestream differences.

Because socat is run in the container, we need to build a new testimage (20241009).

Signed-off-by: Ed Santiago <[email protected]>
I've managed to reproduce this (or at least something which looks very much like it), but it took quite a long time (finding and fixing #24342 along the way).
I think I now have an improved reproducer, though it's certainly not instant. I can get it reasonably quickly if I simultaneously run the podman pasta tests in a loop, and run this script:
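A rough sketch of what such a loop can look like, modelled on the socat reproducer from the commit message later in this thread (file names, sizes, and timings here are guesses, not the original script):

    # hedged sketch: pasta forwards UDP port 1337 into the namespace, a client
    # on the host sends a file over loopback, and the result is checked each round
    dd if=/dev/urandom bs=1k count=512 of=tmp.in
    while :; do
        ./pasta --config-net -u 1337 \
            socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
        sleep 0.2
        socat OPEN:tmp.in UDP4:localhost:1337,shut-null
        wait
        cmp tmp.in tmp.out || break
    done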
Realized, of course, that the script above exercises a "spliced" connection rather than a tap one. I have also reproduced this with a "tap" connection (replacing the loopback destination with a non-loopback one). I suspect this problem affects both "splice" and "tap" flows. That's unusual, but makes some sense based on the packet capture that @sbrivio-rh obtained (and I have similar ones). It looks like something on the container/namespace side is causing an ICMP to be generated for one of the packets, and that could happen whether the packet is travelling over the tuntap link or over the namespace's loopback interface.
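Concretely, the reproducer quoted later in the thread switches the client's destination from a loopback address to the host's global address, so the datagrams take the tap path instead of the spliced one:

    # destination is the host's global IPv4 address rather than localhost,
    # so pasta forwards the datagrams over the tap device instead of splicing
    LOCAL_ADDR="$(ip -j -4 addr show | jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
    socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null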
I concur with @sbrivio-rh that this looks like a kernel bug. There only seems to be one place that sends a port unreachable ICMP, which is when the kernel can't find a socket. I've confirmed with systemtap that we are indeed hitting that path. So for some reason the kernel momentarily can't locate the listening socket, which is quite odd.
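A hypothetical way to watch for that from outside the kernel is to probe the lookup function named later in this thread and dump a backtrace whenever it returns no socket; this is a sketch, not the probe actually used, and it is noisy, since any datagram to an unbound port is also a miss:

    # hypothetical probe: print a kernel backtrace whenever the IPv4 UDP
    # socket lookup comes back empty (requires kernel debuginfo)
    sudo stap -e 'probe kernel.function("__udp4_lib_lookup").return {
        if ($return == 0) print_backtrace()
    }'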
@sbrivio-rh and I had speculated that this might have something to do with migration of processes and/or sockets between CPUs, so I tried a couple of tests based on that theory:
Not really sure what that means, yet, but maybe it's a clue.
There are two paths to the …
As far as I could tell from systemtap, we're coming through path (1) - I only checked on one occasion, though, so I don't know for certain if that's true all the time.
@sbrivio-rh, further systemtap investigation has shown something I suspect is pretty useful, though I can't immediately interpret it. AFAICT there are two different success paths through …. From systemtap traces, each failure case seems to go:
I don't really know what the …
The basic idea, introduced by kernel commit 23542618deb7 ("inet: Don't lookup the socket if there's a socket attached to the skb"), is that if a socket is attached to a skb, we can avoid an actual socket lookup, and just grab ("steal") a reference to it.
That logic was slightly modified by kernel commit 9c02bec95954 ("bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign") last year. I couldn't spot any obvious issue with it, but let me try to reproduce this with and without that commit.
I can reproduce the issue with the Podman test after, and also before (not a revert), that commit, even though before that commit this is more likely to happen:
and I never observed this with a recent net.git kernel.
Context: back to UDP, memory barriers were dropped by commit 88ab1932eac7 ("udp: Use hlist_nulls in UDP RCU code") because of the switch to the "hlist_nulls" implementation, in 2008. Then, 8 years later, the switch was reverted by commit ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU"), but the barriers were not added back. Especially relevant here: the one in udp_lib_get_port(). Sure, it's faster, but we might miss sockets on lookup. I'm trying to add back the barriers that appear to make sense and see if this fixes the issue...
Not so easy; I tried a bunch of things:
but no luck so far. I didn't try reintroducing an equivalent of hlist_for_each_entry_rcu_safenext(), as removed by commit 88ab1932eac7. I guess the problem is more subtle than I expected at this point.
Youch, it really is. Should we be trying to bring in some kernel networking experts at this point?
I was kind of maybe supposed to be one. :) But yes, I think it would be fair to ask the author of the changes around this stuff, at least. We should just come up with a more contained reproducer, possibly without Podman and bats. |
Just adding these prints makes the shell script reproducer fail by itself reliably on the first loop (no need for bats and Podman in parallel). Unless I …
Based on further investigation, the issue seems to be the same as what was fixed 14 years ago by commit 719f835853a9 ("udp: add rehash on connect()"). The packet drop happens when we start looking up a socket and its connect() call completes in the meantime: udp_lib_rehash() is called before __udp4_lib_lookup() but hasn't completed yet, or it's called just after. The local address changes, but the hash wasn't updated yet, so the socket can't be found.

That is, the fix is not complete, because it leaves a window of opportunity for the socket to be missed. We can probably split the rehash operation into two atomic operations, the first one adding a new entry in the secondary hash table, before we assign the new …
… lookup

If a UDP socket changes its local address while it's receiving datagrams, as a result of connect(), there is a period during which a lookup operation might fail to find it, after the address is changed but before the secondary hash (port and address) is updated.

Secondary hash chains were introduced by commit 30fff92 ("udp: bind() optimisation") and, as a result, a rehash operation became needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835 ("udp: add rehash on connect()") which isn't however a complete fix: the socket will be found once the rehashing completes, but not while it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a client sending datagrams to it. After the server receives the first datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to send a (subsequent) datagram while the server's socket changes its address, but is not rehashed yet, this will result in a failed lookup and a port unreachable error delivered to the client, as apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in
  while :; do
    taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
    sleep 0.1 || sleep 1
    taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
    wait
  done

where the client will eventually get ECONNREFUSED on a write() (typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a seldom failure in Podman's tests checking UDP functionality while using pasta(1) to connect the container's network namespace, which leads us to a reproducer with the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
  while :; do
    ./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
    sleep 0.2 || sleep 1
    socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
    wait
    cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1  0.000000            :: → ff02::16          ICMPv6 110 Multicast Listener Report Message v2
      2  0.168690  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      3  0.168767  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      4  0.168806  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      5  0.168827  c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6  0.168851  9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7  0.168875  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      8  0.168896  88.198.0.164 → 88.198.0.161      ICMP 590 Destination unreachable (Port unreachable)
      9  0.168926  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
     10  0.168959  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
     11  0.168989  88.198.0.161 → 88.198.0.164      UDP 4138 60260 → 1337 Len=4096
     12  0.169010  88.198.0.161 → 88.198.0.164      UDP   42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container initiates an ARP lookup to deliver the ICMP message.
In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket succeeded.

To fix this, replace the rehash operation by a set_rcv_saddr() callback holding the spinlock on the primary hash chain, just like the rehash operation used to do, but also setting the address while holding the spinlock. To make this atomic against the lookup operation, also acquire the spinlock on the primary chain there.

This results in some awkwardness at a caller site, specifically sock_bindtoindex_locked(), where we really just need to rehash the socket without changing its address. With the new operation, we now need to forcibly set the current address again.

On the other hand, this appears more elegant than alternatives such as fetching the spinlock reference in ip4_datagram_connect() and ip6_datagram_connect(), and keeping the rehash operation around for a single user also seems a tad overkill.

Reported-by: Ed Santiago <[email protected]>
Link: containers/podman#24147
Analysed-by: David Gibson <[email protected]>
Fixes: 30fff92 ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <[email protected]>
… lookup

If a UDP socket changes its local address while it's receiving datagrams, as a result of connect(), there is a period during which a lookup operation might fail to find it, after the address is changed but before the secondary hash (port and address) and the four-tuple hash (local and remote ports and addresses) are updated.

Secondary hash chains were introduced by commit 30fff92 ("udp: bind() optimisation") and, as a result, a rehash operation became needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835 ("udp: add rehash on connect()") which isn't however a complete fix: the socket will be found once the rehashing completes, but not while it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a client sending datagrams to it. After the server receives the first datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to send a (subsequent) datagram while the server's socket changes its address, but is not rehashed yet, this will result in a failed lookup and a port unreachable error delivered to the client, as apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in
  while :; do
    taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
    sleep 0.1 || sleep 1
    taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
    wait
  done

where the client will eventually get ECONNREFUSED on a write() (typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a seldom failure in Podman's tests checking UDP functionality while using pasta(1) to connect the container's network namespace, which leads us to a reproducer with the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
  while :; do
    ./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
    sleep 0.2 || sleep 1
    socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
    wait
    cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1  0.000000            :: → ff02::16          ICMPv6 110 Multicast Listener Report Message v2
      2  0.168690  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      3  0.168767  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      4  0.168806  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      5  0.168827  c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6  0.168851  9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7  0.168875  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      8  0.168896  88.198.0.164 → 88.198.0.161      ICMP 590 Destination unreachable (Port unreachable)
      9  0.168926  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
     10  0.168959  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
     11  0.168989  88.198.0.161 → 88.198.0.164      UDP 4138 60260 → 1337 Len=4096
     12  0.169010  88.198.0.161 → 88.198.0.164      UDP   42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container initiates an ARP lookup to deliver the ICMP message.
In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket succeeded.

To fix this, replace the rehash operation by a set_rcv_saddr() callback holding the spinlock on the primary hash chain, just like the rehash operation used to do, but also setting the address (via inet_update_saddr(), moved to headers) while holding the spinlock. To make this atomic against the lookup operation, also acquire the spinlock on the primary chain there.

This results in some awkwardness at a caller site, specifically sock_bindtoindex_locked(), where we really just need to rehash the socket without changing its address. With the new operation, we now need to forcibly set the current address again.

On the other hand, this appears more elegant than alternatives such as fetching the spinlock reference in ip4_datagram_connect() and ip6_datagram_connect(), and keeping the rehash operation around for a single user also seems a tad overkill.

v1:
  - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr usage (Kuniyuki Iwashima)
  - directly use sk_rcv_saddr for IPv4 receive addresses instead of fetching inet_rcv_saddr (Kuniyuki Iwashima)
  - move inet_update_saddr() to inet_hashtables.h and use that to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
  - rebase onto net-next, update commit message accordingly

Reported-by: Ed Santiago <[email protected]>
Link: containers/podman#24147
Analysed-by: David Gibson <[email protected]>
Fixes: 30fff92 ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <[email protected]>
So far Debian-only, in parallel system tests