pasta: UDP/IPv4 large transfer, tap: mismatch #24147
...only in parallel tests, and it happened only twice so far, correct?
Once in my own parallel testing on my f40 laptop, in the …
I instrumented the test, and got a hit (flake) on my first try:
Thanks a lot, that's really helpful. 4096 bytes makes it look like it's not actually an issue in pasta itself, rather a race in the tests (that's the size of a page, not a packet or any buffer we use). I'll have another look at the test and try to make this more likely to happen somehow and hopefully reproduce it...
I'm able to reproduce this easily (by simply running that test in a loop). I'm still clueless as to what's going on.
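Running the test in a loop can be as simple as re-running the bats file until it fails; a minimal sketch, assuming the pasta UDP tests live in Podman's test/system/505-networking-pasta.bats (that path is an assumption, not quoted from this thread):

    # re-run the suspect test file until the flake shows up
    n=0
    while bats test/system/505-networking-pasta.bats; do
        n=$((n + 1)); echo "pass #$n"
    done
    echo "flaked after $n clean passes"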
With commit edsantiago@a659d9f and packet captures:
In my case, 8192 bytes are missing from somewhere in the middle of the file. From the capture, it looks like the kernel drops frame #4 (the second UDP data packet pasta is sending, range 8192-16384): there's an ICMP port unreachable message referring to that data. Curiously, in this case, we also see an ARP request for 88.198.0.161 (the default gateway in the container) and pasta replying to it (frames 6 and 7). Overview using tshark(1) output:
- Expected.
- Start of transfer.
- ARP request and reply for default gateway, only visible in the bad case.
- Transfer continues.
- Completely unexpected: this refers to frame #4.
- Transfer goes on.

This starts looking like a kernel issue to me: if we send UDP messages shortly after we set up the namespace, one message (and not the first one!) gets a port unreachable reply (this happens after we check, via procfs, that the port is bound).

By the way, this looks similar to a failure I've noticed with a pending series (not merged yet): https://archives.passt.top/passt-dev/20241009224433.7fc28fc7@elisabeth/. That's from passt's tests (no Podman): on a spliced connection (loopback path, not tap) we get a "host unreachable" error from the kernel for the tap (local, non-loopback) address.

I'll add some prints to the kernel and try to figure out what's going on there. This doesn't appear to be a race in the tests (data in the middle of a transfer is missing) and it doesn't look like an issue in pasta (we're sending all the packets we need to send, and the kernel drops one without any apparent reason).
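For reference, a capture like the one described here can be taken with pasta's -p option and summarized with tshark; the commands below are lifted from the reproducer quoted later in this thread (port 1337 and the file names come from that reproducer, not from this particular debugging session):

    # have pasta write a packet capture of the traffic it handles to pasta.pcap
    ./pasta --config-net -p pasta.pcap -u 1337 \
        socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
    # ...run the UDP transfer, then get a one-line-per-frame overview
    tshark -r pasta.pcap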
I can't tell if that's a really really fun issue to debug, or a really really horrible one, or both. Whatever the case, thank you for looking into it.
...for debugging containers#24147, because "md5sum mismatch" is not the best way to troubleshoot bytestream differences.

socat is run in the container, so this requires building a new testimage (20241011). Bump to new CI VMs[1] which include it.

[1] containers/automation_images#389

Signed-off-by: Ed Santiago <[email protected]>
So, I was trying to find out why the kernel can't match the bound port with a UDP socket (which is the reason why that ICMP message is generated, and the corresponding datagram dropped)... and, since yesterday, I can't reproduce this anymore. Same kernel, same exact build of Podman and pasta, same machine, even the same terminal (I still had it open). Earlier it would happen once out of ~10 times; since yesterday it's completely gone, and today it didn't fail once in a loop of 1000 runs. I'm completely puzzled.

Now I'm trying to increase/decrease memory pressure on that machine. So, well, I don't really have an update yet. It's just turning out to be much subtler than I thought. I don't think it depends on the timing of the test, by the way.
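As for generating memory pressure, one simple possibility (an illustration only, assuming stress-ng is installed; these are not the knobs actually used here):

    # hypothetical example: hold roughly 75% of RAM in anonymous mappings
    # across 4 workers for ten minutes
    stress-ng --vm 4 --vm-bytes 75% --timeout 600s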
...for debugging containers#24147, because "md5sum mismatch" is not the best way to troubleshoot bytestream differences.

Because socat is run in the container, we need to build a new testimage (20241009).

Signed-off-by: Ed Santiago <[email protected]>
I've managed to reproduce this (or at least something which looks very much like it), but it took quite a long time (finding and fixing #24342 along the way).
I think I now have an improved reproducer, though it's certainly not instant. I can get it reasonably quickly if I simultaneously run the podman pasta tests in a loop, and run this script:
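A rough sketch of what such a loop can look like, modelled on the socat reproducer from the commit message later in this thread (file names, sizes, and timings here are guesses, not the original script):

    # hedged sketch: pasta forwards UDP port 1337 into the namespace, a client
    # on the host sends a file over loopback, and the result is checked each round
    dd if=/dev/urandom bs=1k count=512 of=tmp.in
    while :; do
        ./pasta --config-net -u 1337 \
            socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
        sleep 0.2
        socat OPEN:tmp.in UDP4:localhost:1337,shut-null
        wait
        cmp tmp.in tmp.out || break
    done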
Realized, of course, that the script above exercises a "spliced" connection rather than a tap one. I have also reproduced this with a "tap" connection (replacing the loopback destination with a non-loopback one). I suspect this problem affects both "splice" and "tap" flows. That's unusual, but makes some sense based on the packet capture that @sbrivio-rh obtained (and I have similar ones). It looks like something on the container/namespace side is causing an ICMP to be generated for one of the packets, and that could happen whether the packet is travelling over the tuntap link or over the namespace's loopback interface.
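Concretely, the reproducer quoted later in the thread switches the client's destination from a loopback address to the host's global address, so the datagrams take the tap path instead of the spliced one:

    # destination is the host's global IPv4 address rather than localhost,
    # so pasta forwards the datagrams over the tap device instead of splicing
    LOCAL_ADDR="$(ip -j -4 addr show | jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
    socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null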
I concur with @sbrivio-rh that this looks like a kernel bug. There only seems to be one place that sends a port unreachable ICMP, which is when the kernel can't find a socket. I've confirmed with systemtap that we are indeed hitting that path. So for some reason the kernel momentarily can't locate the listening socket, which is quite odd.
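A hypothetical way to watch for that from outside the kernel is to probe the lookup function named later in this thread and dump a backtrace whenever it returns no socket; this is a sketch, not the probe actually used, and it is noisy, since any datagram to an unbound port is also a miss:

    # hypothetical probe: print a kernel backtrace whenever the IPv4 UDP
    # socket lookup comes back empty (requires kernel debuginfo)
    sudo stap -e 'probe kernel.function("__udp4_lib_lookup").return {
        if ($return == 0) print_backtrace()
    }'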
@sbrivio-rh and I had speculated that this might have something to do with migration of processes and/or sockets between CPUs, so I tried a couple of tests based on that theory:
Not really sure what that means, yet, but maybe it's a clue.
There are two paths to the …
As far as I could tell from systemtap, we're coming through path (1) - I only checked on one occasion, though, so I don't know for certain if that's true all the time.
@sbrivio-rh, further systemtap investigation has shown something I suspect is pretty useful, though I can't immediately interpret it. AFAICT there are two different success paths through …. From systemtap traces, each failure case seems to go:
I don't really know what the …
The basic idea, introduced by kernel commit 23542618deb7 ("inet: Don't lookup the socket if there's a socket attached to the skb"), is that if a socket is attached to a skb, we can avoid an actual socket lookup, and just grab ("steal") a reference to it.
That logic was slightly modified by kernel commit 9c02bec95954 ("bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign") last year. I couldn't spot any obvious issue with it, but let me try to reproduce this with and without that commit.
I can reproduce the issue with the Podman test after, and also before (not a revert), that commit, even though before that commit this is more likely to happen:
and I never observed this with a recent net.git kernel.
Context: back to UDP, memory barriers were dropped by commit 88ab1932eac7 ("udp: Use hlist_nulls in UDP RCU code") because of the switch to the "hlist_nulls" implementation, in 2008. Then, 8 years later, the switch was reverted by commit ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU"), but the barriers were not added back. Especially relevant here: the one in udp_lib_get_port(). Sure, it's faster, but we might miss sockets on lookup. I'm trying to add back the barriers that appear to make sense and see if this fixes the issue...
Not so easy; I tried a bunch of things:
but no luck so far. I didn't try reintroducing an equivalent of hlist_for_each_entry_rcu_safenext(), as removed by commit 88ab1932eac7. I guess the problem is more subtle than I expected at this point.
Youch, it really is. Should we be trying to bring in some kernel networking experts at this point?
I was kind of maybe supposed to be one. :) But yes, I think it would be fair to ask the author of the changes around this stuff, at least. We should just come up with a more contained reproducer, possibly without Podman and bats. |
Just adding these prints makes the shell script reproducer fail by itself reliably on the first loop (no need for bats and Podman in parallel). Unless I …
Based on further investigation, the issue seems to be the same as what was fixed 14 years ago by commit 719f835853a9 ("udp: add rehash on connect()"). The packet drop happens when we start looking up a socket and its connect() call completes in the meantime: udp_lib_rehash() is called before __udp4_lib_lookup() but hasn't completed yet, or it's called just after. The local address changes, but the hash wasn't updated yet, so the socket can't be found.

That is, the fix is not complete, because it leaves a window of opportunity for the socket to be missed. We can probably split the rehash operation into two atomic operations, the first one adding a new entry in the secondary hash table, before we assign the new …
… lookup

If a UDP socket changes its local address while it's receiving datagrams, as a result of connect(), there is a period during which a lookup operation might fail to find it, after the address is changed but before the secondary hash (port and address) is updated.

Secondary hash chains were introduced by commit 30fff92 ("udp: bind() optimisation") and, as a result, a rehash operation became needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835 ("udp: add rehash on connect()") which isn't however a complete fix: the socket will be found once the rehashing completes, but not while it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a client sending datagrams to it. After the server receives the first datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to send a (subsequent) datagram while the server's socket changes its address, but is not rehashed yet, this will result in a failed lookup and a port unreachable error delivered to the client, as apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in
  while :; do
    taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
    sleep 0.1 || sleep 1
    taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
    wait
  done

where the client will eventually get ECONNREFUSED on a write() (typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a seldom failure in Podman's tests checking UDP functionality while using pasta(1) to connect the container's network namespace, which leads us to a reproducer with the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
  while :; do
    ./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
    sleep 0.2 || sleep 1
    socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
    wait
    cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1  0.000000            :: → ff02::16          ICMPv6 110 Multicast Listener Report Message v2
      2  0.168690  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      3  0.168767  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      4  0.168806  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      5  0.168827  c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6  0.168851  9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7  0.168875  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      8  0.168896  88.198.0.164 → 88.198.0.161      ICMP 590 Destination unreachable (Port unreachable)
      9  0.168926  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
     10  0.168959  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
     11  0.168989  88.198.0.161 → 88.198.0.164      UDP 4138 60260 → 1337 Len=4096
     12  0.169010  88.198.0.161 → 88.198.0.164      UDP   42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container initiates an ARP lookup to deliver the ICMP message.
In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket succeeded.

To fix this, replace the rehash operation by a set_rcv_saddr() callback holding the spinlock on the primary hash chain, just like the rehash operation used to do, but also setting the address while holding the spinlock. To make this atomic against the lookup operation, also acquire the spinlock on the primary chain there.

This results in some awkwardness at a caller site, specifically sock_bindtoindex_locked(), where we really just need to rehash the socket without changing its address. With the new operation, we now need to forcibly set the current address again.

On the other hand, this appears more elegant than alternatives such as fetching the spinlock reference in ip4_datagram_connect() and ip6_datagram_connect(), and keeping the rehash operation around for a single user also seems a tad overkill.

Reported-by: Ed Santiago <[email protected]>
Link: containers/podman#24147
Analysed-by: David Gibson <[email protected]>
Fixes: 30fff92 ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <[email protected]>
… lookup

If a UDP socket changes its local address while it's receiving datagrams, as a result of connect(), there is a period during which a lookup operation might fail to find it, after the address is changed but before the secondary hash (port and address) and the four-tuple hash (local and remote ports and addresses) are updated.

Secondary hash chains were introduced by commit 30fff92 ("udp: bind() optimisation") and, as a result, a rehash operation became needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835 ("udp: add rehash on connect()") which isn't however a complete fix: the socket will be found once the rehashing completes, but not while it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a client sending datagrams to it. After the server receives the first datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to send a (subsequent) datagram while the server's socket changes its address, but is not rehashed yet, this will result in a failed lookup and a port unreachable error delivered to the client, as apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in
  while :; do
    taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
    sleep 0.1 || sleep 1
    taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
    wait
  done

where the client will eventually get ECONNREFUSED on a write() (typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a seldom failure in Podman's tests checking UDP functionality while using pasta(1) to connect the container's network namespace, which leads us to a reproducer with the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
  while :; do
    ./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
    sleep 0.2 || sleep 1
    socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
    wait
    cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1  0.000000            :: → ff02::16          ICMPv6 110 Multicast Listener Report Message v2
      2  0.168690  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      3  0.168767  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      4  0.168806  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      5  0.168827  c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6  0.168851  9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7  0.168875  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
      8  0.168896  88.198.0.164 → 88.198.0.161      ICMP 590 Destination unreachable (Port unreachable)
      9  0.168926  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
     10  0.168959  88.198.0.161 → 88.198.0.164      UDP 8234 60260 → 1337 Len=8192
     11  0.168989  88.198.0.161 → 88.198.0.164      UDP 4138 60260 → 1337 Len=4096
     12  0.169010  88.198.0.161 → 88.198.0.164      UDP   42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container initiates an ARP lookup to deliver the ICMP message.
In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket succeeded.

To fix this, replace the rehash operation by a set_rcv_saddr() callback holding the spinlock on the primary hash chain, just like the rehash operation used to do, but also setting the address (via inet_update_saddr(), moved to headers) while holding the spinlock. To make this atomic against the lookup operation, also acquire the spinlock on the primary chain there.

This results in some awkwardness at a caller site, specifically sock_bindtoindex_locked(), where we really just need to rehash the socket without changing its address. With the new operation, we now need to forcibly set the current address again.

On the other hand, this appears more elegant than alternatives such as fetching the spinlock reference in ip4_datagram_connect() and ip6_datagram_connect(), and keeping the rehash operation around for a single user also seems a tad overkill.

v1:
  - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr usage (Kuniyuki Iwashima)
  - directly use sk_rcv_saddr for IPv4 receive addresses instead of fetching inet_rcv_saddr (Kuniyuki Iwashima)
  - move inet_update_saddr() to inet_hashtables.h and use that to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
  - rebase onto net-next, update commit message accordingly

Reported-by: Ed Santiago <[email protected]>
Link: containers/podman#24147
Analysed-by: David Gibson <[email protected]>
Fixes: 30fff92 ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <[email protected]>
So far Debian-only, in parallel system tests