pasta: UDP/IPv4 large transfer, tap: mismatch #24147

Open · edsantiago opened this issue Oct 3, 2024 · 24 comments

Labels: flakes (Flakes from Continuous Integration), pasta (pasta(1) bugs or features)

Comments

@edsantiago
Member

So far debian-only, in parallel system tests

not ok 379 |505| UDP/IPv4 large transfer, tap in 2576ms
...
$ podman logs --follow c-socat-t379-i0yzkalc
f7f2c12c2da26c7fe3701753a8a3167b  -
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: Mismatch between data sent and received
#| expected: = 16650cce72835ebf29698339858b4296  -
#|   actual:   f7f2c12c2da26c7fe3701753a8a3167b  -
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sys(2) podman(2) debian-13(2) rootless(2) host(2) sqlite(2)
edsantiago added the flakes and pasta labels on Oct 3, 2024
@sbrivio-rh
Collaborator

...only in parallel tests, and it happened only twice in a while, correct?

@edsantiago
Member Author

Once in my own parallel testing on my f40 laptop, in the small transfer, tap test. Most of the easily-reproduced parallel flakes have been addressed, which unfortunately leaves us with the more challenging ones.

@edsantiago
Member Author

I instrumented the test, and got a hit (flake) on my first try:

# #|     FAIL: Mismatch between data sent and received
# #| expected: = size=53248 hash=61514771ee5e67d618f0fe13e97bfa08 - head= df 08 a1 d0 ca c4 87 0e tail= e8 f2 d3 0f 74 0e 1e 13
# #|   actual:   size=49152 hash=23a9a6b6368ffb95ecdc9116944fc53b - head= df 08 a1 d0 ca c4 87 0e tail= 86 21 ae d8 31 48 ec 3a

53248 - 49152 = 4096, and it looks like the missing chunk is at the end (not surprising)

@sbrivio-rh
Collaborator

Thanks a lot, that's really helpful. 4096 bytes makes it look like it's not actually an issue in pasta itself, but rather a race in the tests (that's the size of a page, not a packet or any buffer we use). I'll have another look at the test and try to make this more likely to happen somehow, and hopefully reproduce it...

@sbrivio-rh
Collaborator

I'm able to reproduce this easily (by simply running that test in a loop). I'm still clueless as to what's going on.

@sbrivio-rh
Collaborator

With commit edsantiago@a659d9f and packet captures:

   [19:51:22.959789450] size=1040384 hash=7fded94f95042fdbb9027b9d32ea715c - head= 62 cf fa 91 8e cb b3 15 tail= 90 40 43 0a af 83 d6 f1
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: Mismatch between data sent and received
   #| expected: = size=1048576 hash=30cb39a1743c7cfd386320f1a378ccea - head= 62 cf fa 91 8e cb b3 15 tail= 90 40 43 0a af 83 d6 f1
   #|   actual:   size=1040384 hash=7fded94f95042fdbb9027b9d32ea715c - head= 62 cf fa 91 8e cb b3 15 tail= 90 40 43 0a af 83 d6 f1
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In my case, 8192 bytes are missing from somewhere in the middle of the file. From the capture, it looks like the kernel drops frame #4 (the second UDP data packet pasta is sending, range 8192-16384): there's an ICMP port unreachable message referring to that data (d3 2c f7 11...) in frame #9.

Curiously, in this case, we also see an ARP request for 88.198.0.161 (default gateway in the container) and pasta replying to it (frames 6 and 7). Overview using tshark(1) output:

    1   0.000000 fe80::ccc9:bfff:fe5b:b4cb → ff02::2      ICMPv6 70 Router Solicitation from ce:c9:bf:5b:b4:cb
    2   0.005934 fe80::ccc9:bfff:fe5b:b4cb → ff02::16     ICMPv6 90 Multicast Listener Report Message v2

Expected.

    3   0.139513  169.254.1.2 → 88.198.0.164 UDP 8234 43171 → 5189 Len=8192
    4   0.139560  169.254.1.2 → 88.198.0.164 UDP 8234 43171 → 5189 Len=8192
    5   0.139576  169.254.1.2 → 88.198.0.164 UDP 8234 43171 → 5189 Len=8192

Start of transfer.

    6   0.139587 ce:c9:bf:5b:b4:cb → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
    7   0.139601 9a:55:9a:55:9a:55 → ce:c9:bf:5b:b4:cb ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55

ARP request and reply for default gateway, only visible in the bad case.

    8   0.139611  169.254.1.2 → 88.198.0.164 UDP 8234 43171 → 5189 Len=8192

Transfer continues.

    9   0.139620 88.198.0.164 → 169.254.1.2  ICMP 590 Destination unreachable (Port unreachable)

Completely unexpected: this refers to frame #4.

   10   0.139634  169.254.1.2 → 88.198.0.164 UDP 8234 43171 → 5189 Len=8192
   11   0.139652  169.254.1.2 → 88.198.0.164 UDP 8234 43171 → 5189 Len=8192
   12   0.139667  169.254.1.2 → 88.198.0.164 UDP 8234 43171 → 5189 Len=8192
[...]

Transfer goes on.

This starts looking like a kernel issue to me: if we send UDP messages shortly after setting up the namespace, one message (and not the first one!) gets a port unreachable response (this happens after we check, via procfs, that the port is bound).

By the way, this looks similar to a failure I've noticed with a pending series (not merged yet): https://archives.passt.top/passt-dev/20241009224433.7fc28fc7@elisabeth/. That's from passt's tests (no Podman): on a spliced connection (loopback path, not tap) we get a "host unreachable" error from the kernel for the tap (local, non-loopback) address.

I'll add some prints to the kernel and try to figure out what's going on there. This doesn't appear to be a race in the tests (data in the middle of a transfer is missing) and it doesn't look like an issue in pasta (we're sending all the packets we need to send, and the kernel drops one for no apparent reason).

@edsantiago
Member Author

I can't tell if that's a really really fun issue to debug, or a really really horrible one, or both. Whatever the case, thank you for looking into it.

edsantiago added a commit to edsantiago/libpod that referenced this issue Oct 16, 2024
...for debugging containers#24147, because "md5sum mismatch" is not
the best way to troubleshoot bytestream differences.

socat is run on the container, so this requires building a
new testimage (20241011). Bump to new CI VMs[1] which include it.

 [1] containers/automation_images#389

Signed-off-by: Ed Santiago <[email protected]>
@sbrivio-rh
Collaborator

So, I was trying to find out why the kernel can't match the bound port with a UDP socket (which is the reason why that ICMP message is generated, and the corresponding datagram dropped)... and, since yesterday, I can't reproduce this anymore. Same kernel, same exact build of Podman and pasta, same machine and terminal (I still had the same one open), even.

Earlier it would happen once out of ~10 times. Since yesterday it's completely gone, and today it didn't fail once in a loop of 1000 runs. I'm completely puzzled. Now I'm trying to increase/decrease memory pressure on that machine.

So, well, I don't really have an update yet. It's just turning out to be much subtler than I thought. I don't think it depends on the timing of the test, by the way.

edsantiago added a commit to edsantiago/libpod that referenced this issue Oct 22, 2024
...for debugging containers#24147, because "md5sum mismatch" is not
the best way to troubleshoot bytestream differences.

Because socat is run on the container, we need to build
a new testimage (20241009).

Signed-off-by: Ed Santiago <[email protected]>
@dgibson
Collaborator

dgibson commented Oct 23, 2024

I've managed to reproduce this (or at least something which looks very much like it), but it took quite a long time (finding and fixing #24342 along the way).

@dgibson
Collaborator

dgibson commented Oct 23, 2024

I think I now have an improved reproducer, though it's certainly not instant. I can get it reasonably quickly if I simultaneously run the podman pasta tests in a loop, and run this script:

#! /bin/sh

set -ex

BASE="$(dirname $0)/.."
PASTA=$BASE/pasta
PORT=23967

LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))

IN=$BASE/test/udp.bin
OUT=$BASE/test/out.bin

dd if=/dev/urandom bs=1 count=$LEN of=$IN

while :; do
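        # Listener: socat under pasta, with UDP port $PORT forwarded into the
        # namespace (-u); the host-side socat below then sends the file to that
        # port, and cmp checks what arrived against what was sent.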
        $PASTA --config-net -u $PORT -- socat -u UDP4-LISTEN:$PORT,null-eof OPEN:$OUT,create,trunc &
        sleep 1
        socat -u OPEN:$IN UDP4:localhost:$PORT,shut-null
        wait
        cmp $IN $OUT
done

@dgibson
Collaborator

dgibson commented Oct 23, 2024

Realized, of course, that the script above exercises a "spliced" connection rather than a tap one. I have also reproduced with a "tap" connection (replacing localhost with the host's global address), although it did seem to take longer.

I suspect this problem affects both "splice" and "tap" flows. That's unusual, but makes some sense based on the packet capture that @sbrivio-rh obtained (and I have similar ones). It looks like something on the container/namespace side is causing an ICMP to be generated for one of the packets, and that could happen whether the packet is travelling over the tuntap link or over the namespace's loopback interface.

@dgibson
Collaborator

dgibson commented Oct 23, 2024

I concur with @sbrivio-rh that this looks like a kernel bug. There only seems to be one place that sends a port unreachable ICMP, which is when it can't find a socket. I've confirmed with systemtap that we are indeed hitting that path. So for some reason the kernel momentarily can't locate the listening socket, which is quite odd.
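
For reference, this is roughly what that path (the no_sk fallback at the end of __udp4_lib_rcv() in net/ipv4/udp.c) does, paraphrased and trimmed here (XFRM policy and checksum checks omitted; details vary by kernel version):

    no_sk:
            /* No socket matched the destination address/port: count a "no port"
             * event, tell the sender, and drop the datagram. */
            __UDP_INC_STATS(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE);
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
            sk_skb_reason_drop(sk, skb, drop_reason);
            return 0;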

@dgibson
Collaborator

dgibson commented Oct 23, 2024

@sbrivio-rh and I had speculated that this might have something to do with migration of processes and/or sockets between CPUs, so I tried a couple of tests based on that theory:

  • If I taskset the pasta command line (both pasta and the socat server) to a single CPU, I wasn't able to reproduce. This isn't conclusive, since the fault is intermittent.
  • I was able to reproduce with only the socat taskset to a single CPU.
  • I was able to reproduce with pasta and socat each taskset to different single CPUs.

Not really sure what that means, yet, but maybe it's a clue.

@dgibson
Collaborator

dgibson commented Oct 23, 2024

There are two paths to the no_sk label in __udp4_lib_rcv() which we seem to be reaching.

  1. Simply falling through after failing to find a socket from __udp4_lib_rcv()
  2. From a goto after getting an error from inet_steal_sock()

As far as I could tell from systemtap, we're coming through path (1) - I only checked on one occasion, though, so I don't know for certain if that's true all the time.

@dgibson
Collaborator

dgibson commented Oct 23, 2024

@sbrivio-rh, further systemtap investigation has shown something I suspect is pretty useful, though I can't immediately interpret it.

AFAICT there are two different success paths through __udp4_lib_rcv(). The first is if we get a socket from inet_steal_sock(). The second is if that returns no socket, and we instead look one up via __udp4_lib_lookup_skb().

From systemtap traces, each failure case seems to go:

  1. Several successful lookups via __udp4_lib_lookup_skb()
  2. A failed lookup via __udp4_lib_lookup_skb(), leading to the ICMP port unreachable
  3. Successful lookups via inet_steal_sock() for the remaining packets

I don't really know what the inet_steal_sock() mechanism is about, but it kinda looks like there's a race in the transition from one lookup mechanism to the other which causes the socket to be unfindable.
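
Schematically, the two mechanisms are tried in this order in __udp4_lib_rcv() (a simplified paraphrase, not the exact upstream code):

    /* 1) Fast path: a socket reference was already attached to the skb
     *    (e.g. by early demux or bpf_sk_assign), so "steal" it instead
     *    of doing a hash lookup. */
    sk = inet_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source,
                         daddr, uh->dest, &refcounted, udp_ehashfn);
    if (IS_ERR(sk))
            goto no_sk;
    if (sk)
            return udp_unicast_rcv_skb(sk, skb, uh);

    /* 2) Slow path: look the socket up in the UDP hash tables by address
     *    and port; this is the lookup that intermittently fails in the
     *    traces above and leads to the ICMP port unreachable. */
    sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
    if (sk)
            return udp_unicast_rcv_skb(sk, skb, uh);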

@sbrivio-rh
Collaborator

I don't really know what the inet_steal_sock() mechanism is about

The basic idea, introduced by kernel commit 23542618deb7 ("inet: Don't lookup the socket if there's a socket attached to the skb"), is that if a socket is attached to a skb, we can avoid an actual socket lookup, and just grab ("steal") a reference to it.

but it kinda looks like there's a race in the transition from one lookup mechanism to the other which causes the socket to be unfindable

That logic was slightly modified by kernel commit 9c02bec95954 ("bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign") last year. I couldn't spot any obvious issue with it, but let me try to reproduce this with and without that commit.

@sbrivio-rh
Collaborator

let me try to reproduce this with and without that commit.

I can reproduce the issue with the Podman test both after and before that commit (a build from before it, not a revert), although before that commit it's more likely to happen:

+ :
+ sleep 1
+ ./pasta --config-net -u 23967 -- socat -u UDP4-LISTEN:23967,null-eof OPEN:./test/out.bin,create,trunc
+ socat -u OPEN:./test/udp.bin UDP4:localhost:23967,shut-null
+ wait
+ cmp ./test/udp.bin ./test/out.bin
./test/udp.bin ./test/out.bin differ: char 32769, line 102

whereas with a recent net.git kernel I never observed this.

@sbrivio-rh
Collaborator

Context: Documentation/RCU/rculist_nulls.rst, commit da82af04352b ("doc: Update and wordsmith rculist_nulls.rst"), commit 47d63d7a391 ("Docs/RCU/rculist_nulls: Fix text about atomic_set_release()"), and this discussion.

Back to UDP: memory barriers were dropped by commit 88ab1932eac7 ("udp: Use hlist_nulls in UDP RCU code") because of the switch to the "hlist_nulls" implementation, in 2008. Then, 8 years later, the switch was reverted by commit ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU"), but the barriers were not added back. Especially relevant here: the one in udp_lib_get_port(). Sure, it's faster, but we might miss sockets on lookup.

I'm trying to add again barriers that appeared to make sense and see if this is fixed...
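
For context, the lookup pattern that rculist_nulls.rst describes looks roughly like this; this is a sketch of the documented hlist_nulls approach (the pre-ca065d0cf80f world), not the current udp.c code, and match_key() / expected_slot are placeholders:

    rcu_read_lock();
    begin:
            sk_nulls_for_each_rcu(sk, node, &hslot->head) {
                    if (match_key(sk)) {
                            if (!refcount_inc_not_zero(&sk->sk_refcnt))
                                    goto begin;     /* socket being freed: retry */
                            if (!match_key(sk)) {   /* re-check after taking a ref:    */
                                    sock_put(sk);   /* the object may have been reused */
                                    goto begin;
                            }
                            goto found;
                    }
            }
            /* The "nulls" end marker encodes which chain we should have walked;
             * a mismatch means the entry moved chains mid-walk, so restart. */
            if (get_nulls_value(node) != expected_slot)
                    goto begin;
            sk = NULL;
    found:
    rcu_read_unlock();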

@sbrivio-rh
Collaborator

I'm trying to add again barriers that appeared to make sense and see if this is fixed...

Not so easy; I tried a bunch of things:

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8accbf4cb295..4f9c585aec1f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -327,6 +327,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
                        goto fail_unlock;
                }
 
+               smp_wmb();
                sock_set_flag(sk, SOCK_RCU_FREE);
 
                sk_add_node_rcu(sk, &hslot->head);
@@ -434,6 +435,10 @@ static struct sock *udp4_lib_lookup2(const struct net *net,
 
        result = NULL;
        badness = 0;
+
+       rcu_read_lock();
+
+       smp_rmb();
        udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
                need_rescore = false;
 rescore:
@@ -458,8 +463,10 @@ static struct sock *udp4_lib_lookup2(const struct net *net,
                        }
 
                        /* Fall back to scoring if group has connections */
-                       if (!reuseport_has_conns(sk))
+                       if (!reuseport_has_conns(sk)) {
+                               rcu_read_unlock();
                                return result;
+                       }
 
                        /* Reuseport logic returned an error, keep original score. */
                        if (IS_ERR(result))
@@ -475,6 +482,9 @@ static struct sock *udp4_lib_lookup2(const struct net *net,
                        goto rescore;
                }
        }
+
+       rcu_read_unlock();
+
        return result;
 }
 

but no luck so far. I didn't try reintroducing an equivalent of hlist_for_each_entry_rcu_safenext() as removed by commit 88ab1932eac7. I guess the problem is more subtle than I expected at this point.

@dgibson
Collaborator

dgibson commented Oct 24, 2024

I guess the problem is more subtle than I expected at this point.

Youch, it really is. Should we be trying to bring in some kernel networking experts at this point?

@sbrivio-rh
Collaborator

I guess the problem is more subtle than I expected at this point.

Youch, it really is. Should we be trying to bring in some kernel networking experts at this point?

I was kind of maybe supposed to be one. :) But yes, I think it would be fair to ask the author of the changes around this stuff, at least. We should just come up with a more contained reproducer, possibly without Podman and bats.

@sbrivio-rh
Collaborator

sbrivio-rh commented Oct 25, 2024

Just adding these prints makes the shell script reproducer fail by itself reliably on the first loop (no need for bats and Podman in parallel). Unless I taskset -c 1 it, of course: in that case, it never fails.

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8accbf4cb295..9ddf42e35335 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -429,12 +429,14 @@ static struct sock *udp4_lib_lookup2(const struct net *net,
                                     struct sk_buff *skb)
 {
        struct sock *sk, *result;
-       int score, badness;
+       int score, badness, count = 0, ne = 0;
        bool need_rescore;
 
        result = NULL;
        badness = 0;
+
        udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
+               count++;
                need_rescore = false;
 rescore:
                score = compute_score(need_rescore ? result : sk, net, saddr,
@@ -450,6 +452,8 @@ static struct sock *udp4_lib_lookup2(const struct net *net,
                                continue;
                        }
 
+                       ne = 1;
+
                        result = inet_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
                                                       saddr, sport, daddr, hnum, udp_ehashfn);
                        if (!result) {
@@ -475,6 +479,12 @@ static struct sock *udp4_lib_lookup2(const struct net *net,
                        goto rescore;
                }
        }
+
+       if (!result) {
+               pr_err("%s:%i, score: %i, badness: %i, entries: %i, ne: %i",
+                       __func__, __LINE__, score, badness, count, ne);
+       }
+
        return result;
 }
 
@@ -2422,8 +2432,10 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 
        sk = inet_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
                             &refcounted, udp_ehashfn);
-       if (IS_ERR(sk))
+       if (IS_ERR(sk)) {
+               pr_err("%s:%i, steal error", __func__, __LINE__);
                goto no_sk;
+       }
 
        if (sk) {
                struct dst_entry *dst = skb_dst(skb);
@@ -2435,6 +2447,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
                ret = udp_unicast_rcv_skb(sk, skb, uh);
                if (refcounted)
                        sock_put(sk);
+               pr_err("%s:%i, steal found", __func__, __LINE__);
                return ret;
        }
 
@@ -2445,6 +2458,9 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
        sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
        if (sk)
                return udp_unicast_rcv_skb(sk, skb, uh);
+
+       pr_err("%s:%i lib_lookup_skb error", __func__, __LINE__);
+
 no_sk:
        if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
                goto drop;
@@ -2463,6 +2479,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
         * don't wanna listen.  Ignore it.
         */
        sk_skb_reason_drop(sk, skb, drop_reason);
+       pr_err("=== %s:%i drop", __func__, __LINE__);
        return 0;
 
 short_packet:

@sbrivio-rh
Collaborator

sbrivio-rh commented Oct 25, 2024

Based on further investigation, the issue seems to be the same as what was fixed 14 years ago by commit 719f835853a9 ("udp: add rehash on connect()").

The packet drop happens when we start looking up a socket and its connect() call completes in the meantime: udp_lib_rehash() is called before __udp4_lib_lookup() but hasn't completed yet, or is called just after it. The local address changes, but the hash hasn't been updated yet, so the socket can't be found.

That is, the fix is not complete, because it leaves a window of opportunity for the socket to be missed.

We can probably split the rehash operation into two atomic operations, the first one adding a new entry in the secondary hash table, before we assign the new inet->inet_rcv_saddr in ip4_datagram_connect(), and the second one removing the old entry, after that address is assigned and any possible lookup has completed. Insertions and removals are protected by RCU, so locking is mostly sorted. As to the assignment itself, its ordering, and its visibility across CPUs, I still need to reach some conclusions.
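
A rough outline of that idea, with invented helper names (this is just a sketch of the proposal above, not the patch that was eventually posted):

    static void udp_set_rcv_saddr(struct sock *sk, __be32 new_saddr)
    {
            /* 1) Add the socket to the secondary hash chain for the new
             *    address while the old entry is still in place.           */
            udp_hash2_insert(sk, new_saddr);                /* invented helper */

            /* 2) Publish the new local address: from here on, lookups score
             *    this socket against new_saddr.                            */
            WRITE_ONCE(inet_sk(sk)->inet_rcv_saddr, new_saddr);

            /* 3) Only then remove the entry for the old address, so there is
             *    no window where a concurrent lookup finds neither entry.   */
            udp_hash2_remove_old(sk);                       /* invented helper */
    }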

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Nov 14, 2024
… lookup

If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) is updated.

Secondary hash chains were introduced by commit 30fff92 ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in

  while :; do
  	taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.1 || sleep 1
  	taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
  	wait
  done

where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a seldom failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"

  while :; do
  	./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.2 || sleep 1
  	socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
  	wait
  	cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1   0.000000           :: → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
      2   0.168690 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      3   0.168767 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      4   0.168806 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      5   0.168827 c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6   0.168851 9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7   0.168875 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      8   0.168896 88.198.0.164 → 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
      9   0.168926 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     10   0.168959 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     11   0.168989 88.198.0.161 → 88.198.0.164 UDP 4138 60260 → 1337 Len=4096
     12   0.169010 88.198.0.161 → 88.198.0.164 UDP 42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.

In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket
succeeded.

To fix this, replace the rehash operation by a set_rcv_saddr()
callback holding the spinlock on the primary hash chain, just like
the rehash operation used to do, but also setting the address while
holding the spinlock.

To make this atomic against the lookup operation, also acquire the
spinlock on the primary chain there.

This results in some awkwardness at a caller site, specifically
sock_bindtoindex_locked(), where we really just need to rehash the
socket without changing its address. With the new operation, we now
need to forcibly set the current address again.

On the other hand, this appears more elegant than alternatives such
as fetching the spinlock reference in ip4_datagram_connect() and
ip6_datagram_conect(), and keeping the rehash operation around for
a single user also seems a tad overkill.

Reported-by: Ed Santiago <[email protected]>
Link: containers/podman#24147
Analysed-by: David Gibson <[email protected]>
Fixes: 30fff92 ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <[email protected]>

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Dec 4, 2024
… lookup

If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.

Secondary hash chains were introduced by commit 30fff92 ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in

  while :; do
  	taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.1 || sleep 1
  	taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
  	wait
  done

where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a seldom failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"

  while :; do
  	./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.2 || sleep 1
  	socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
  	wait
  	cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1   0.000000           :: → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
      2   0.168690 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      3   0.168767 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      4   0.168806 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      5   0.168827 c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6   0.168851 9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7   0.168875 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      8   0.168896 88.198.0.164 → 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
      9   0.168926 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     10   0.168959 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     11   0.168989 88.198.0.161 → 88.198.0.164 UDP 4138 60260 → 1337 Len=4096
     12   0.169010 88.198.0.161 → 88.198.0.164 UDP 42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.

In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket
succeeded.

To fix this, replace the rehash operation by a set_rcv_saddr()
callback holding the spinlock on the primary hash chain, just like
the rehash operation used to do, but also setting the address (via
inet_update_saddr(), moved to headers) while holding the spinlock.

To make this atomic against the lookup operation, also acquire the
spinlock on the primary chain there.

This results in some awkwardness at a caller site, specifically
sock_bindtoindex_locked(), where we really just need to rehash the
socket without changing its address. With the new operation, we now
need to forcibly set the current address again.

On the other hand, this appears more elegant than alternatives such
as fetching the spinlock reference in ip4_datagram_connect() and
ip6_datagram_connect(), and keeping the rehash operation around for
a single user also seems a tad overkill.

v1:
  - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
    usage (Kuniyuki Iwashima)
  - directly use sk_rcv_saddr for IPv4 receive addresses instead of
    fetching inet_rcv_saddr (Kuniyuki Iwashima)
  - move inet_update_saddr() to inet_hashtables.h and use that
    to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
  - rebase onto net-next, update commit message accordingly

Reported-by: Ed Santiago <[email protected]>
Link: containers/podman#24147
Analysed-by: David Gibson <[email protected]>
Fixes: 30fff92 ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <[email protected]>