CI: e2e: pull/run and decrypt from local registry: timeout #24219
I saw this before as well; I am somewhat convinced that this is a pasta issue in the local port splice bypass. For example, these are the exact same set of tests that were broken in #23517. Also #22575 might be related. What happens there is we spawn a registry container (with the default network) and then forward the port. As we use 127.0.0.1 as the registry address, it takes the splice bypass path in pasta, and I assume that might cause hangs in some way, maybe because the connection is not closed properly? From the stack traces we get in CI on command timeout it is very obvious that these hangs all happen while we pull the image, i.e. a normal HTTP TCP connection. In particular this seems to be the goroutine hanging
And this seems to wait on the reading side, but it is hard to tell whether more traffic is expected or we are just waiting for the server to close the connection. I guess we have some options here: use host network like we do for other registry setups, or switch to slirp4netns (revert 4dc5708). Not sure if there is a way to avoid the splice path for local connections? I guess one could bind 0.0.0.0 and then use the host IP as the registry IP instead of localhost, but this seems far more complicated. |
@sbrivio-rh @dgibson Are you guys already aware of any issues in the local splice path? |
Gosh, no, not really. No chances to run this with pasta's --trace option? |
Right, that would be pretty much the only way I can think of. |
We can certainly try adding this, but the question is how much the slowdown of the writes is going to impact this race condition here; I guess there is only one way to find out. |
Run pasta with --trace and a log file to see if the hangs are caused by pasta not correctly closing connections as assumed in containers#24219. As the log is super verbose, do not log it by default, so I added some extra logic to make sure it is only logged when the test fails. Signed-off-by: Paul Holzinger <[email protected]>
#24225 should add --trace so hopefully it still reproduces with that |
Ok it finally flaked: pasting the full log here as CI logs get deleted after some time
|
After that there doesn't seem to be anything happening anymore with the flow, until the end where the connection gets closed (I assume because we kill the client after 90s, which matches the times in the log). |
First off, it's quite impressive that #24225 worked.
It's weird that we get no (socket) events at all after that. I'm wondering if the fact that we create a bunch of pipes after (and set their size), to refill the pool of pre-created pipes, might cause some issue. By the way, these messages:
are bogus. We can set the pipe size, it's just that I couldn't read the documentation of fcntl(2) when I added that check. I'll fix that in a bit. Perhaps we can reproduce this by initiating a transfer early enough to be before that refilling operation, but making it last long enough to complete after that. |
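For reference, a small sketch of what setting the pipe size looks like (assuming the "bogus" warnings come from treating a non-zero fcntl(2) return as a failure, which is only a guess based on the comment above): on success, F_SETPIPE_SZ returns the capacity actually set, possibly rounded up, rather than 0.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fds[2];

	if (pipe(fds))
		return 1;

	/* F_SETPIPE_SZ returns the capacity actually set (possibly
	 * rounded up), not 0: a positive return is success, not an error.
	 */
	int ret = fcntl(fds[0], F_SETPIPE_SZ, 512 * 1024);
	if (ret < 0)
		perror("F_SETPIPE_SZ");	/* a real failure, e.g. over the pipe-max-size limit */
	else
		printf("pipe capacity is now %d bytes\n", ret);

	close(fds[0]);
	close(fds[1]);
	return 0;
}
```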
Well, I did of course test that it works locally first by introducing bogus errors, so I am not that surprised it worked.
Well, as shown in the go stack above it does a read(), so it blocks as it waits for more data (not quite a read, as the go runtime uses async io via epoll, so it is not a blocking read() call but just waiting for the epoll event). One thing that is not clear from the go stack trace is where it is stuck reading, i.e. at the start/middle/end of a stream? I could maybe see the case where the server closed the connection and somehow pasta did not carry this close through to the client, assuming that is possible? |
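As a side note, a minimal sketch of that "waiting on the epoll event" behaviour (a self-contained illustration, not podman or pasta code; the socketpair and the 5-second timeout are made up for the demo): an epoll-based reader only wakes up when the peer delivers more data or a FIN, so if a relay in the middle forwards neither, the wait never ends.

```c
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/socket.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return 1;

	int ep = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP, .data.fd = sv[0] };
	epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev);

	/* Nothing is ever written to sv[1] and it is never shut down: this
	 * mimics a relay that forwards neither data nor the FIN. epoll_wait()
	 * only returns on data (EPOLLIN) or peer close (EPOLLRDHUP/EPOLLHUP),
	 * so with an infinite timeout this would hang forever, which is what
	 * the Go runtime's netpoller experiences in the stack trace above.
	 */
	struct epoll_event got;
	int n = epoll_wait(ep, &got, 1, 5000);
	printf("epoll_wait returned %d (%s)\n", n,
	       n == 0 ? "timeout: no data, no FIN" : "event");

	close(sv[0]);
	close(sv[1]);
	close(ep);
	return 0;
}
```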
I'm not sure either. One interesting thing is that we transfer these 1320 bytes:
then we don't get any event for 90 seconds, until the client times out as you suggested, we signal the server that we are closing the connection (FIN/SHUT_WR), and then:
we try again to move bytes from the server (socket 54 is server-side) to the client (socket 49) and suddenly we find out that we have those 97457 bytes pending on the pipe, which we write to the client (too late). The issue might come from the fact that, earlier, we had data to transfer to the client but we couldn't for a while:
as we filled up its buffer. So we waited (
and we read 1320 bytes, so we also tried to write 1320 bytes, which worked. But meanwhile, we read much more from the server (socket to pipe), and as a result we had more than those 1320 bytes. In this case, we should try to write more. And we keep counters, so we know how much, but actually who cares, the kernel knows as well, so we might just as well write out whatever we have on the pipe. I might have an idea for a reproducer, I'll try that next. The fix (or at least a possible one) would be kind of obvious:
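The actual patch isn't quoted above; what follows is only a hedged sketch of the idea with made-up names (relay_step() is not pasta's tcp_splice_sock_handler()): instead of limiting the pipe-to-socket splice() to the amount just read, pass the whole pipe size and let the kernel drain whatever is actually pending.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified socket -> pipe -> socket relay step. The function and
 * parameter names are hypothetical, not pasta's data structures.
 */
static int relay_step(int from_sock, int pipe_r, int pipe_w, int to_sock,
		      size_t pipe_size)
{
	ssize_t rd, wr;

	/* Socket -> pipe: this may add to data already sitting in the pipe */
	rd = splice(from_sock, NULL, pipe_w, NULL, pipe_size,
		    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);

	/* Pipe -> socket: don't limit this to 'rd'. Earlier rounds may have
	 * left bytes in the pipe (e.g. while the receiver's buffer was full
	 * and we were waiting for EPOLLOUT), so ask for the whole pipe size
	 * and let the kernel write out whatever is really there.
	 */
	wr = splice(pipe_r, NULL, to_sock, NULL, pipe_size,
		    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);

	(void)rd;
	return wr < 0 ? -1 : 0;
}
```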
It would be possible, yes, but the fact that we suddenly realise that we have bytes to transfer as the client closes the connection indicates that the client is waiting for some data before closing it (HTTP 1.1 connections are persistent by default, that is, clients normally need to close them). |
And by the way that's 98304 - 33615 + 32768, that is:
plus:
so I'm fairly confident that the issue is exactly what I described. |
Patch posted. |
@sbrivio-rh I see the patch is applied. Can you tag a new release soon so we can start getting this into the fedora/debian packages we use in CI? |
Yes, I wanted to do that yesterday/today (also to finally close #24045) but I've been waiting to see if I could merge https://archives.passt.top/passt-dev/[email protected]/ as well. Well, let me make a new release early (my time) tomorrow regardless of that. |
This is now fixed in updates for:
|
Thanks, AFAIK there is still a kernel regression that will prevent CI from working anyway if we update our CI images now, #24374. |
Right, yes, I just checked: commit 306ed1728e84 ("netfilter: xtables: fix typo causing some targets not to load on IPv6") is not included in the 6.11.y branch yet, but it's in the patch queue, so it should end up in 6.11.6 (and Fedora 40's kernel-6.11.6-*.fc40). By the way, I'm still working on the kernel fix for #24147. At the moment the patch looks a bit too big for -stable, though. |
…t read

In tcp_splice_sock_handler(), we try to calculate how much we can move from the pipe to the writing socket: if we just read some bytes, we'll use that amount, but if we haven't, we just try to empty the pipe.

However, if we just read something, that doesn't mean that that's all the data we have on the pipe, as it's obvious from this sequence, where:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 98304 from read-side call
  Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
  Flow 0 (TCP connection (spliced)): -1 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
  Flow 0 (TCP connection (spliced)): OUT_WAIT_0

we first pile up 98304 - 33615 = 64689 pending bytes, that we read but couldn't write, as the receiver buffer is full, and we set the corresponding OUT_WAIT flag. Then:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 32768 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580

we splice() 32768 more bytes from our receiving side to the pipe. At some point:

  pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
  Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
  Flow 0 (TCP connection (spliced)): 1320 from read-side call
  Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)

the receiver is signalling to us that it's ready for more data (EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from our receiving socket into the pipe, and that's what we write to the receiver, forgetting about the pending 97457 bytes we had, which the receiver might never get (not the same 97457 bytes: we'll actually send 1320 of those).

This condition is rather hard to reproduce, and it was observed with Podman pulling container images via HTTPS. In the traces above, the client is side 0 (the initiating peer), and the server is sending the whole data.

Instead of splicing from pipe to socket the amount of data we just read, we need to splice all the pending data we piled up until that point. We could do that using 'read' and 'written' counters, but there's actually no need, as the kernel also keeps track of how much data is available on the pipe.

So, to make this simple and more robust, just give the whole pipe size as length to splice(). The kernel knows what to do with it.

Later in the function, we used 'to_write' for an optimisation meant to reduce wakeups, which retries right away to splice() in both directions if we couldn't write to the receiver the whole amount of pending data. Calculate a 'pending' value instead, only if we reach that point.

Now that we check for the actual amount of pending data in that optimisation, we need to make sure we don't compare a zero or negative 'written' value: if we met that, it means that the receiver signalled end-of-file, an error, or to try again later. In those three cases, the optimisation doesn't make any sense, so skip it.
Reported-by: Ed Santiago <[email protected]>
Reported-by: Paul Holzinger <[email protected]>
Analysed-by: Paul Holzinger <[email protected]>
Link: containers/podman#24219
Signed-off-by: Stefano Brivio <[email protected]>
Reviewed-by: David Gibson <[email protected]>
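The last two paragraphs of the commit message describe a second, smaller change; below is a hedged sketch of that check with hypothetical names (should_retry_now() and its parameters are not the actual tcp_splice.c code): the retry-right-away optimisation compares what is still pending on the pipe, and is skipped entirely when the write-side result is zero or negative.

```c
#include <sys/types.h>

/* Hypothetical post-write check, mirroring the commit's description. */
static int should_retry_now(ssize_t written, size_t read_total, size_t written_total)
{
	/* written <= 0 means the write-side splice() reported end-of-file (0),
	 * an error, or try-again-later (-1): retrying immediately makes no sense.
	 */
	if (written <= 0)
		return 0;

	/* Bytes still sitting on the pipe after this round */
	size_t pending = read_total - written_total;

	/* Only splice() again right away if we couldn't flush everything */
	return pending > 0;
}
```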
…cceeds

Don't report bogus failures (with --trace) just because the return value is not zero.

Link: containers/podman#24219
Signed-off-by: Stefano Brivio <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Finally available now. |
passt, for containers/podman#24219
crun, for containers/podman#24327

Do not merge, because:
1. crun 1.18.1 is buggy, we really want 1.18.2
2. we really need to focus on containers#390 (bumping to f41)

Signed-off-by: Ed Santiago <[email protected]>
As of #24485 debian and f41 are on pasta 10-30 but f40 (prior-fedora) is still on 09-06. Closing in hopes that the flake will eventually fade away. |
Placeholder for a flake I've been seeing in e2e. So far, only rootless, but that might be happenstance.
--mount
#24124