io_uring is slower than epoll #189
Comments
That's definitely a great thing to do. I don't know whether anybody has time for that at the moment, though.
Maybe your benchmark is full of errors
It's not my benchmark, and the benchmark I reference is the one @axboe himself has referenced on Twitter, showing that he interprets it as true. That's why I would like @axboe himself to write a benchmark without issues so that we can actually prove this thing works better.
For what it's worth, there is a whole lot more to it than a trivial one-to-one comparison.
- You save many syscalls that would otherwise be required to modify the epoll kernel state (adding and removing monitored FDs, updating per-FD state, etc.). This becomes extremely important as the number of managed FDs increases - exponentially so.
- You can register FDs (e.g. listener FDs and long-lived connection FDs), which means you don't have to incur the cost of the kernel looking up the file struct and checking access on every operation (see the sketch after this list).
- And, obviously, you don't just get to monitor network sockets. You can do so much more.
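To make the second point concrete, here is a minimal sketch with liburing (my own illustration, not code from this thread; the fd, buffer, and length are placeholders) of registering a long-lived fd once and then referring to it by index with IOSQE_FIXED_FILE, so the kernel skips the per-operation fd lookup:

```c
#include <liburing.h>

/* Sketch: register a long-lived fd once, then reference it by index
 * with IOSQE_FIXED_FILE instead of paying the per-op file lookup.
 * conn_fd, buf and len are placeholders. */
static int queue_fixed_recv(struct io_uring *ring, int conn_fd,
                            void *buf, unsigned len)
{
    int fds[1] = { conn_fd };
    int ret = io_uring_register_files(ring, fds, 1);    /* one-time cost */
    if (ret < 0)
        return ret;

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, 0 /* index into the registered table */, buf, len, 0);
    sqe->flags |= IOSQE_FIXED_FILE;                      /* use the fixed file */
    return io_uring_submit(ring);
}
```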
@markpapadakis
… On 30 Aug 2020, at 1:45 PM, Alex Hultman ***@***.***> wrote:
Don't get me wrong, since I heard about io_uring I've been all about trying to scientifically reproduce any claim that it improves performance (theoretically it should, since you can have fewer syscalls, and I like that idea). I would love to add support for io_uring in my projects, but I won't touch it until someone can scientifically show me proof of it outperforming epoll (significantly).
I've tested three different examples claiming to be faster than epoll, tested on Linux 5.7 without Spectre mitigations and on Linux 5.8 with Spectre mitigations. Tested Clear Linux, tested Fedora. All tests point towards epoll being faster for my kind of test.
What I have is a simple TCP echo server echoing small chunks of data for 40-400 clients. In all tests epoll performs better by a measurable amount. In no single test have I seen io_uring beat epoll in this kind of test.
Some io_uring examples are horribly slow, while others are quite close to epoll performance. The closest I have seen it get is this one: https://github.com/frevib/io_uring-echo-server compared to this one: https://github.com/frevib/epoll-echo-server
I have tested on two separate and different machines, both with the same outcome: epoll wins.
Can someone enlighten me on how I can get io_uring to outperform epoll in this case?
@markpapadakis Have you taken a look at the benchmarks that are quoted above? They use many connections at once, thus many FDs at once. uring still consistently performs worse, in conditions that mimic production usage well enough without actually being production. Part of the scientific process is being able to reproduce claims like those being made (a 60%+ increase in performance over epoll; apparently that was 99% at one point, but bugs were found). These are metrics that @axboe has not refuted and has seemingly even confirmed, especially by promoting it on Twitter and urging others to buy in (e.g. Netty). Even if this were a 5% increase, I'd be for it. However, I simply do not understand where these results are coming from, and everyone on Twitter seems to be more interested in patting themselves on the back rather than addressing criticism (sorry if that sounds harsh). Outrageous claims require outrageous evidence, especially when it comes to claims that could completely renovate the space.
@markpapadakis We all know the theory. It is not that hard to understand, quite basic actually. But theory means nothing if actual reality mismatches the theorized conclusions (any scientist ever). I would be happy and excited if io_uring could improve performance, but so far no benchmark can show this. All I'm asking for is scientific proof. I am a scientist, not a believer.
I think it is time to take this to the Linux kernel mailing list since @axboe has ignored this criticism entirely.
@alexhultman Check netty/netty#10622 - I won't have any bandwidth for the next day or two, but maybe that will pique your interest.
I think that @axboe knows that uring is slower. Otherwise he would answer.
I have not ignored any of this, but I've been way too busy with other items. And frankly, the way the tone has shifted in here, it's not really providing much impetus to engage. I did run the echo server benchmarks back then, and it did reproduce for me. I haven't looked at it since; it's not like I run echo server benchmarks daily. So in short, I'm of course interested in cases where io_uring isn't performing to its full potential, it most certainly should. I have a feeling that #215 might explain some of these. Once I get myself out from under the pressure I'm at now, I'll re-run the benchmarks myself.
I can testify that io_uring is much faster than epoll. Please use kernel 5.7.15 for your benchmarks.
I have made available a new, detailed benchmark that shows io_uring is reliably slower than epoll: https://github.com/alexhultman/io_uring_epoll_benchmark
And where is your 1-to-1 comparison with epoll? Or do you just go by a feeling of "high numbers"? See my benchmark for a 1-to-1 comparison.
And where is your benchmark for this claim? I know the theory very well, but you're just assuming the theory is correct here because it must be. See my posted benchmark - it shows the complete opposite of what you claim. The benchmark I have posted performs ZERO syscalls and does ZERO copies, yet epoll wins reliably despite doing millions of syscalls and performing copies in every syscall.
Would you look at my new benchmark? I have eliminated everything but epoll and io_uring, and on both my machines epoll wins despite io_uring being SQ-polled with 0 syscalls and using pre-registered file descriptors and buffers. I'm not involving any networking at all. strace shows the epoll case making millions of syscalls while the io_uring case is entirely silent in the syscalling department. What am I doing wrong / why is io_uring not performing?
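For reference, here is a minimal sketch of that kind of setup with liburing (an assumption about how such a benchmark is configured, not its actual code; queue depth and idle time are made up): an SQPOLL ring plus registered files and buffers, so submissions need no syscalls while the kernel poller thread is awake.

```c
#include <string.h>
#include <sys/uio.h>
#include <liburing.h>

/* Sketch: SQPOLL ring with pre-registered files and buffers.
 * Queue size and sq_thread_idle are arbitrary placeholders. */
static int setup_sqpoll_ring(struct io_uring *ring, int fd,
                             struct iovec *iov, unsigned nr_iov)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;   /* kernel thread polls the SQ */
    p.sq_thread_idle = 2000;         /* ms before the poller goes to sleep */

    int ret = io_uring_queue_init_params(256, ring, &p);
    if (ret < 0)
        return ret;

    /* Pre-register the fd and the I/O buffers once, up front. */
    ret = io_uring_register_files(ring, &fd, 1);
    if (ret < 0)
        return ret;
    return io_uring_register_buffers(ring, iov, nr_iov);
}
```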
The fact that a virtualized, garbage-collected, JIT-stuttery Java project can see performance improvements when swapping from epoll to io_uring is not viable proof of io_uring itself (as a kernel feature) being more performant than epoll itself (as a kernel feature). It really just proves that writing systems software in non-systems programming languages is going to get you poor results. io_uring does more things in the kernel, meaning a swap from epoll to io_uring leads to fewer things happening in Java. As a general rule of thumb: the less you do in high-level, garbage-collected, virtualized code, the better.
@alexhultman which kernel version did you use?
It is clearly stated in the posted text: 5.9.9
@alexhultman By running your tests locally with a
First of all, there is a bug that hurts io_uring performance. Its fix will be merged into 5.11. Secondly, I've looked at your benchmark code and you test something that is not necessarily relevant nor optimal for the networking use case.
Finally, I've never tried using pipes in my tests. io_uring essentially delegates requests to their corresponding APIs. So if, for example, the pipe kernel code takes most of the CPU, you may not see much difference between io_uring and epoll.
Also, just to add (note that I was a skeptic above as well): Hypervisors tend to level out epoll vs io_uring performance benchmarks in my findings - if you're within a VM, expect io_uring to have less than ideal performance increases against epoll. Testing on bare metal seems to make quite a bit of difference depending on how you're using io_uring.
@romange Is that already on the io_uring branch of mainline? Just curious.
Here are the results for @alexhultman's benchmark on my machine (Tuxedo laptop, Intel i7-7700HQ (8) @ 3.800GHz, 32GB RAM, Ubuntu 20.10 x86_64, kernel 5.10.0-051000rc6-generic)
io_uring gives better results!
I do not think it's on the io_uring branch, because the fix does not reside in io_uring code.
$ make
gcc -O3 epoll.c -o epoll
gcc -O3 io_uring.c /usr/lib/liburing.a -o io_uring
gcc: error: /usr/lib/liburing.a: No such file or directory
make: *** [Makefile:3: default] Error 1
@martin-g @santigimeno That's very interesting - thanks for reporting! I will do some more testing on newer kernels and see if I can finally get to see this supposed io_uring wonder myself.
@romange You do a lot of confident talking but you still refuse to follow up with any actual testing of your claims.
Please do testing before making assumptions about everything. This entire thread is about this exact behavior - show results with actual numbers like @santigimeno and @martin-g did.
@alexhultman Please tone down the snark. FWIW, @romange has done plenty of testing in the past, and was instrumental in finding the task-work-related signal slowdown for threads. Questioning the validity of a test is perfectly valid; in fact, it's the very first thing that should be done before potentially wasting any time on examining results from said test. Nobody is in a position to be making any demands in here; in my experience you get a lot further by being welcoming and courteous.
The above is not a question, it is a confident statement without any backing proof other than "it will be the case". This is the issue here - blindly making claims without any backing proof other than "listen to my assumption".
I can recompile with debug symbols. These are very small copies; like we have talked about already, it's 32 bytes per pipe, so the overhead of the copy is very likely not the bottleneck. And again, if you look in this thread you can see that many people with AMD CPUs see big wins with io_uring, and my own Raspberry Pi is much faster with io_uring. They all do the same copying, yet my shitty Intel machines don't perform any better with io_uring.
At this point the bug report is more about Intel machines not seeing any gains while pretty much all other ones see big gains.
Ok, still curious to see relative overheads. If you're willing to recompile, there is a list of kernel options to check
Yep, I'll come back in a few days; I have other things to do as well. Will be interesting to see.
@alexhultman The io_uring variant of your benchmark completes consistently faster in my tests when I change the
@axboe Is this style of brute-force CQ polling supposed to work well? In my tests, especially when I add
I'm on a somewhat older Intel CPU, and adding the wait as noted above makes the io_uring test consistently faster than epoll. From where I'm sitting right now, the io_uring benchmark seems to simply be incorrectly implemented. And judging from the code quality in general, that would not surprise me one bit.
@vcaputo The benchmark started like that; io_uring_submit_and_wait and io_uring_wait_cqe were how the original version of the benchmark ran. At that point (Linux 5.8) it was still not faster than epoll, so the benchmark changed to make use of fixed files, fixed buffers and polling. That was faster, so that solution remained. I can re-check with io_uring_wait_cqe on my Linux 5.16. Thanks for testing and reporting.
Well, you still keep the opportunistic batched consumption and only resort to waiting when unlucky. It's a sort of hybrid, akin to how mutexes are often tweaked to spin a little before going to sleep (involving the kernel) when contended. But what you have currently is written more like a spinlock in userspace, which is generally A Bad Idea unless you're exerting very fine control over which threads are running on which cores. If you don't even attempt the batched wait-less consumption you'll definitely go slower, as the wait interface just gets a single cqe IIRC.
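A sketch of that hybrid (my own illustration under the description above, not the benchmark's actual code): drain whatever has already completed in a batch, and only block in the kernel when the CQ turns out to be empty.

```c
#include <liburing.h>

/* Hybrid CQ consumption: batched wait-less drain first, blocking wait
 * only when nothing is ready. Returns the number of CQEs handled. */
static int drain_cqes(struct io_uring *ring)
{
    struct io_uring_cqe *cqes[64];
    unsigned got = io_uring_peek_batch_cqe(ring, cqes, 64);

    if (got == 0) {
        /* Unlucky: nothing ready, wait for at least one completion. */
        struct io_uring_cqe *cqe;
        int ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        cqes[0] = cqe;
        got = 1;
    }

    for (unsigned i = 0; i < got; i++) {
        /* handle cqes[i]->res / cqes[i]->user_data here */
    }
    io_uring_cq_advance(ring, got);
    return (int)got;
}
```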
I'm definitely open to whatever solution is the best as of Linux 5.16. My goal is really just to have io_uring beat epoll on all my machines, including the shitty Intel ones. I've tested 10-or-so solutions by now. Will test yours when I have more time for this.
@vcaputo That change doesn't make any difference for me. It never even runs that path.
@lnicola, I just noticed that it enables SQPOLL, and the program doesn't handle it right, though it should work fine with normal rings. Anyone benchmarking it should disable SQPOLL unless willing to tinker appropriately and invest more time in the comparison. It's not magic; it can degrade performance and add variability to results.
Hi, I've also been playing with io_uring lately, trying to match epoll performance in an IPC use case with pipes: one server and n clients, and a simple request-reply protocol. I modified the benchmark mentioned here and elsewhere for my tests, see the

What I could see is that epoll currently always beats io_uring, with any number of clients. My use of io_uring is pretty simple: I used the same base as the existing code with fixed fds and buffers (although that did not make a difference), and I'm submitting write-read linked sqes, with the cqe_skip_success flag set on the write. I've made some gpuvis traces, which I think are interesting to understand the reasons. I used a build of the Linux kernel from git (79a72162048e42a677bc7336a9f5d86fc3ff9558), with the patches from [1] on top, and I verified that there's no cqe returned for the write completion. (I'm not sure if this is the latest version of the patches, and I'm happy to try something else if there is.)

With epoll, with 1 client thread, each thread wakes the other one when it has completed its write, and they alternate very quickly. With 4 client threads, the server is fully busy (in red) and never gets interrupted, serving requests one after another, and the processing time is still very small and roughly the same as with one client.

With io_uring, however, things are quite different, even with 1 client. For a start, there's a kernel worker involved, so a third thread. Then, there seems to be at least one spurious wakeup of the server. I would expect that the client write would only wake up the worker, copying the data and waking up the server, which would process it, submit the next sqes, waking the worker and starting to wait for the next one. I verified that the only cqe the server receives is for the read completion, but maybe it resumes from io_uring_submit_and_wait for nothing once. The number of spurious wakeups doesn't seem to depend on the number of clients, as can be seen here, with 2 clients. However, there's an additional worker for each client; it doesn't really seem necessary?

About the io_uring workers, I think there's something fishy: whenever I stopped the measuring, it all went crazy for a bit, as can be seen here (note that the scale is much larger, and every small block is also an io_uring worker working).

Edit: That last part may just be because the code doesn't handle errors and just started submitting a lot of invalid sqes when interrupted.

Edit 2: It's actually not; I checked again and I don't see any completion with error status, or any error from the client read or write when I interrupt the process, it just shuts down immediately. The iou_worker frenzy is still there as soon as there's more than 1 client, and already there on 5.15 without the SKIP_SUCCESS flag. It's possible there's something wrong with my code but I don't see where?
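For readers unfamiliar with the submission pattern described there, a minimal sketch (my own illustration, not the poster's code; fd, buffers, and lengths are placeholders, and IOSQE_CQE_SKIP_SUCCESS requires a kernel carrying those patches): a write linked to a read, with the write's successful CQE suppressed.

```c
#include <liburing.h>

/* Sketch: link a write to a read so the read starts only after the write
 * completes, and skip the CQE for a successful write. */
static int queue_request(struct io_uring *ring, int fd,
                         const void *req, unsigned req_len,
                         void *reply, unsigned reply_len)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, req, req_len, 0);
    sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, reply, reply_len, 0);
    io_uring_sqe_set_data(sqe, reply);   /* identify the completion later */

    /* Submit both and wait for the read's completion. */
    return io_uring_submit_and_wait(ring, 1);
}
```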
@rbernon just out of curiosity, which tool is that?
If you see any iou-wrk workers, then something isn't working right. That's the slower async path, and should not get hit unless IOSQE_ASYNC is being explicitly set. What code is being run?
Took a quick look at that server, and it looks like it has two modes:
Neither is a great solution; the most performant approach will generally be to just issue the request and have the internal poll handle it. Otherwise you're just trading an epoll readiness-based model for the ditto on io_uring, which doesn't make a lot of sense. It can be useful for converting existing code, but for new code it isn't really advisable.
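A sketch of what "just issue the request" might look like (an illustration under the advice above; fd, buffer, and length are placeholders): submit the recv itself rather than an explicit poll followed by a read, and let io_uring's internal poll arm when the socket isn't ready.

```c
#include <liburing.h>

/* Sketch: no readiness step. The completion fires once data has actually
 * been received into buf; io_uring handles the not-ready case internally. */
static void queue_recv_directly(struct io_uring *ring, int fd,
                                void *buf, unsigned len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, fd, buf, len, 0);
    io_uring_sqe_set_data(sqe, buf);
    io_uring_submit(ring);
}
```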
Interesting, I added the flag on the read sqe because it's not expected to succeed immediately after a write from the server (the clients are not generally spamming requests). It didn't seem to make a difference at first, but I'll try without it.
FWIW the server code is https://github.com/rbernon/io_uring-echo-server/blob/master/io_uring_ipc_server.c, and it only used the IOSQE_ASYNC flag on the read sqe. I tried again without this flag, and with IOSQE_CQE_SKIP_SUCCESS, and now the spurious wakeup is gone but there are still iou workers involved, as well as the worker frenzy when exiting the test.

With 4 clients, something weird started to happen, with periods of "normal" latency and periods of very high latency. Looking at gpuvis I can see that the "normal" latency periods involve server, clients, and workers talking to each other. But after a while there's only the server left, seemingly doing nothing? Plus a huge frenzy at the end, but that is kind of expected.

Unrelated note: by default gpuvis uses millisecond resolution; I modified it to round the timestamps to µs because I wanted to be able to see the latency more accurately.
@rbernon, pipes, right? The pipe internals don't support any sane nowait behaviour, so we have to force any I/O against them to io-wq (the slow path). It's just a one-line change on the io_uring side, but then submission may end up waiting for I/O potentially unboundedly. Hopefully, one day we'll push hard enough for a change in the pipe code. FWIW, the benchmark up in the thread doesn't go through io-wq only because of O_NONBLOCK, and there is a special case in io_uring for that. Don't know what that "frenzy" with lots of rescheduling exactly is, though that's interesting.
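A small sketch of the O_NONBLOCK special case mentioned there (my own illustration and an assumption to verify, not code from this thread): create the pipe nonblocking from the start so reads submitted to io_uring are not punted to the iou-wrk slow path.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>

/* Sketch: nonblocking pipe from the start (no later fcntl()), then a
 * plain read SQE on the read end. buf and len are placeholders. */
static int queue_pipe_read(struct io_uring *ring, int fds[2],
                           void *buf, unsigned len)
{
    if (pipe2(fds, O_NONBLOCK) < 0)
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fds[0], buf, len, 0);
    return io_uring_submit(ring);
}
```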
Can't we register a poll on the pipe, and issue the write from the poll callback? The same holds for other socket-like files that support poll() but aren't sockets. They can be handled in the same way as recv.
Now I'm not sure that recv completes without a workqueue. Looking at the code, it ends up in io_apoll_task_func, which punts via io_req_task_submit. Is recv indeed using a workqueue?
I'm probably lost in the maze.
We do use poll for any file type that supports it, but since we can't issue a nonblocking read attempt on a pipe, it still has to be done from a worker. What needs to happen here is just converting pipes from using struct file_operations->read to ->read_iter. The solution is known, and patches do exist. This isn't specific to pipes, but obviously they are one of the more important file types. Thankfully most files use ->read_iter and ->write_iter these days.
It is not; this is task_work. That's different from thread offload. For the latter, look for io_queue_async_work().
Yes, sorry.
I'm trying to fork() the io_uring simulation by creating a parent and a child process, each with its own rings: the parent writes to the pipe, the child reads from it. However, I stumbled upon an error: when the reader side reads an empty pipe, instead of waiting for content in the pipe (if there's no error the simulation indeed runs faster), the CQE comes back with "Async task failed". To overcome the problem, I didn't set O_NONBLOCK in pipe2(); this makes the reader wait until there's content in the pipe before unblocking. But on the downside it ends up calling fcntl() and making syscalls, and the performance of the simulation is slower than the original simulation file. I wanted to know how to make the reader side of the pipe wait for data to be ready without syscalls.
This is a general question. Does replacing epoll with io_uring reduce the CPU usage for a given workload, say an nginx-like HTTP gateway? Reducing CPU usage in our production environment is the primary goal, rather than achieving higher throughput or RPS in a benchmark environment. This question comes to me because, IMO, io_uring makes socket IO purely asynchronous, but the data copying between user and kernel is preserved, assuming zero copy (IORING_OP_SEND_ZC, IORING_OP_SENDMSG_ZC) is not used.
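For context on the zero-copy send the question mentions, a minimal sketch using liburing's io_uring_prep_send_zc (requires a recent kernel and liburing; fd, buffer, and length are placeholders). Note that a zero-copy send posts two CQEs: the send result (flagged IORING_CQE_F_MORE) and a later notification (flagged IORING_CQE_F_NOTIF), and the buffer must stay untouched until the notification arrives.

```c
#include <liburing.h>

/* Sketch: queue a zero-copy send and wait for both the result CQE and the
 * buffer-release notification before reusing buf. */
static int queue_send_zc(struct io_uring *ring, int sockfd,
                         const void *buf, size_t len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_send_zc(sqe, sockfd, buf, len, 0 /* msg flags */, 0 /* zc flags */);
    io_uring_sqe_set_data(sqe, (void *)buf);   /* find the buffer on completion */
    return io_uring_submit(ring);
}
```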
EDIT: I have made available a detailed benchmark with epoll that shows this in a reliable way: https://github.com/alexhultman/io_uring_epoll_benchmark
Don't get me wrong, since I heard about io_uring I've been all about trying to scientifically reproduce any claim that it improves performance (theoretically it should, since you can have fewer syscalls, and I like that idea). I would love to add support for io_uring in my projects, but I won't touch it until someone can scientifically show me proof of it outperforming epoll (significantly).
I've tested three different examples claiming to be faster than epoll, tested on Linux 5.7 without Spectre mitigations and on Linux 5.8 with Spectre mitigations. Tested Clear Linux, tested Fedora. All tests point towards epoll being faster for my kind of test.
What I have is a simple TCP echo server echoing small chunks of data for 40-400 clients. In all tests epoll performs better by a measurable amount. In no single test have I seen io_uring beat epoll in this kind of test.
Some io_uring examples are horribly slow, while others are quite close to epoll performance. The closest I have seen it get is this one: https://github.com/frevib/io_uring-echo-server compared to this one: https://github.com/frevib/epoll-echo-server
I have tested on two separate and different machines, both with the same outcome: epoll wins.
Can someone enlighten me on how I can get io_uring to outperform epoll in this case?