io_uring is slower than epoll #189
Comments
That's definitely a great thing to do. I don't know whether anybody has time for that at the moment, though.
Maybe your benchmark is full of errors
It's not my benchmark, and the benchmark I reference is the one @axboe himself has referenced on Twitter, showing that he interprets it as true. That's why I would like @axboe himself to write a benchmark without issues so that we can actually prove this thing works better.
For what it's worth, there is a whole lot more to it than a trivial one-to-one comparison.
- You save many syscalls that would otherwise be required to modify the epoll kernel state (adding and removing monitored FDs, updating per-FD state, etc.). This becomes extremely important as the number of managed FDs increases - exponentially so.
- You can register FDs (e.g. listener FDs and long-lived connection FDs), which means you don't have to incur the cost of the kernel looking up the file struct and checking access on every operation (see the sketch after this list).
- And, obviously, you don't just get to monitor network sockets. You can do so much more.
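To make the second point concrete, here is a minimal sketch with liburing (my own illustration, not code from this thread; the fd, buffer, and length are placeholders) of registering a long-lived fd once and then referring to it by index with IOSQE_FIXED_FILE, so the kernel skips the per-operation fd lookup:

```c
#include <liburing.h>

/* Sketch: register a long-lived fd once, then reference it by index
 * with IOSQE_FIXED_FILE instead of paying the per-op file lookup.
 * conn_fd, buf and len are placeholders. */
static int queue_fixed_recv(struct io_uring *ring, int conn_fd,
                            void *buf, unsigned len)
{
    int fds[1] = { conn_fd };
    int ret = io_uring_register_files(ring, fds, 1);    /* one-time cost */
    if (ret < 0)
        return ret;

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, 0 /* index into the registered table */, buf, len, 0);
    sqe->flags |= IOSQE_FIXED_FILE;                      /* use the fixed file */
    return io_uring_submit(ring);
}
```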
@markpapadakis
… On 30 Aug 2020, at 1:45 PM, Alex Hultman ***@***.***> wrote:
Don't get me wrong, since I heard about io_uring I've been all about trying to scientifically reproduce any claim that it improves performance (theoretically it should, since you can have fewer syscalls, and I like that idea). I would love to add support for io_uring in my projects, but I won't touch it until someone can scientifically show me proof of it outperforming epoll (significantly).
I've tested three different examples claiming to be faster than epoll, tested on Linux 5.7 without Spectre mitigations and on Linux 5.8 with Spectre mitigations. Tested Clear Linux, tested Fedora. All tests point towards epoll being faster for my kind of test.
What I have is a simple TCP echo server echoing small chunks of data for 40-400 clients. In all tests epoll performs better by a measurable amount. In no single test have I seen io_uring beat epoll in this kind of test.
Some io_uring examples are horribly slow, while others are quite close to epoll performance. The closest I have seen it get is this one: https://github.com/frevib/io_uring-echo-server compared to this one: https://github.com/frevib/epoll-echo-server
I have tested on two separate and different machines, both with the same outcome: epoll wins.
Can someone enlighten me on how I can get io_uring to outperform epoll in this case?
@markpapadakis Have you taken a look at the benchmarks that are quoted above? They use many connections at once, thus many FDs at once. uring still consistently performs worse, in conditions that mimic production usage well enough without actually being production. Part of the scientific process is being able to reproduce claims like those being made (a 60%+ increase in performance over epoll; apparently that was 99% at one point, but bugs were found). These are metrics that @axboe has not refuted and has seemingly even confirmed, especially by promoting it on Twitter and urging others to buy in (e.g. Netty). Even if this were a 5% increase, I'd be for it. However, I simply do not understand where these results are coming from, and everyone on Twitter seems to be more interested in patting themselves on the back rather than addressing criticism (sorry if that sounds harsh). Outrageous claims require outrageous evidence, especially when it comes to claims that could completely renovate the space.
@markpapadakis We all know the theory. It is not that hard to understand, quite basic actually. But theory means nothing if actual reality mismatches the theorized conclusions (any scientist ever). I would be happy and excited if io_uring could improve performance, but so far no benchmark can show this. All I'm asking for is scientific proof. I am a scientist, not a believer.
I think it is time to take this to the Linux kernel mailing list since @axboe has ignored this criticism entirely.
@alexhultman Check netty/netty#10622 - I won't have any bandwidth for the next day or two, but maybe that will pique your interest.
I think that @axboe knows that uring is slower. Otherwise he would answer.
I have not ignored any of this, but I've been way too busy with other items. And frankly, the way the tone has shifted in here, it's not really providing much impetus to engage. I did run the echo server benchmarks back then, and it did reproduce for me. I haven't looked at it since; it's not like I run echo server benchmarks daily. So in short, I'm of course interested in cases where io_uring isn't performing to its full potential, it most certainly should. I have a feeling that #215 might explain some of these. Once I get myself out from under the pressure I'm at now, I'll re-run the benchmarks myself.
I can testify that io_uring is much faster than epoll. Please use kernel 5.7.15 for your benchmarks.
I have made available a new, detailed benchmark that shows io_uring is reliably slower than epoll: https://github.com/alexhultman/io_uring_epoll_benchmark
And where is your 1-to-1 comparison with epoll? Or do you just go by a feeling of "high numbers"? See my benchmark for a 1-to-1 comparison.
And where is your benchmark for this claim? I know the theory very well, but you're just assuming the theory is correct here because it must be. See my posted benchmark - it shows the complete opposite of what you claim. The benchmark I have posted performs ZERO syscalls and does ZERO copies, yet epoll wins reliably despite doing millions of syscalls and performing copies in every syscall.
Would you look at my new benchmark? I have eliminated everything but epoll and io_uring, and on both my machines epoll wins despite io_uring being SQ-polled with 0 syscalls and using pre-registered file descriptors and buffers. I'm not involving any networking at all. strace shows the epoll case making millions of syscalls while the io_uring case is entirely silent in the syscalling department. What am I doing wrong / why is io_uring not performing?
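For reference, here is a minimal sketch of that kind of setup with liburing (an assumption about how such a benchmark is configured, not its actual code; queue depth and idle time are made up): an SQPOLL ring plus registered files and buffers, so submissions need no syscalls while the kernel poller thread is awake.

```c
#include <string.h>
#include <sys/uio.h>
#include <liburing.h>

/* Sketch: SQPOLL ring with pre-registered files and buffers.
 * Queue size and sq_thread_idle are arbitrary placeholders. */
static int setup_sqpoll_ring(struct io_uring *ring, int fd,
                             struct iovec *iov, unsigned nr_iov)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;   /* kernel thread polls the SQ */
    p.sq_thread_idle = 2000;         /* ms before the poller goes to sleep */

    int ret = io_uring_queue_init_params(256, ring, &p);
    if (ret < 0)
        return ret;

    /* Pre-register the fd and the I/O buffers once, up front. */
    ret = io_uring_register_files(ring, &fd, 1);
    if (ret < 0)
        return ret;
    return io_uring_register_buffers(ring, iov, nr_iov);
}
```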
The fact that a virtualized, garbage-collected, JIT-stuttery Java project can see performance improvements when swapping from epoll to io_uring is not viable proof of io_uring itself (as a kernel feature) being more performant than epoll itself (as a kernel feature). It really just proves that writing systems software in non-systems programming languages is going to get you poor results. io_uring does more things in the kernel, meaning a swap from epoll to io_uring leads to fewer things happening in Java. As a general rule of thumb: the less you do in high-level, garbage-collected, virtualized code, the better.
@alexhultman which kernel version did you use?
It is clearly stated in the posted text: 5.9.9
@alexhultman By running your tests locally with a
First of all, there is a bug that hurts io_uring performance. Its fix will be merged into 5.11. Secondly, I've looked at your benchmark code and you test something that is not necessarily relevant nor optimal for the networking use case.
Finally, I've never tried using pipes in my tests. io_uring essentially delegates requests to their corresponding APIs. So if, for example, the pipe kernel code takes most of the CPU, you may not see much difference between io_uring and epoll.
Also, just to add (note that I was a skeptic above as well): Hypervisors tend to level out epoll vs io_uring performance benchmarks in my findings - if you're within a VM, expect io_uring to have less than ideal performance increases against epoll. Testing on bare metal seems to make quite a bit of difference depending on how you're using io_uring.
@romange Is that already on the io_uring branch of mainline? Just curious.
Here are the results for @alexhultman's benchmark on my machine (Tuxedo laptop, Intel i7-7700HQ (8) @ 3.800GHz, 32GB RAM, Ubuntu 20.10 x86_64, kernel 5.10.0-051000rc6-generic)
io_uring gives better results!
I do not think it's on the io_uring branch, because the fix does not reside in io_uring code.
$ make
gcc -O3 epoll.c -o epoll
gcc -O3 io_uring.c /usr/lib/liburing.a -o io_uring
gcc: error: /usr/lib/liburing.a: No such file or directory
make: *** [Makefile:3: default] Error 1
@martin-g @santigimeno That's very interesting - thanks for reporting! I will do some more testing on newer kernels and see if I can finally get to see this supposed io_uring wonder myself.
@romange You do a lot of confident talking but you still refuse to follow up with any actual testing of your claims.
Please do testing before making assumptions about everything. This entire thread is about this exact behavior - show results with actual numbers like @santigimeno and @martin-g did.
@alexhultman Please tone down the snark. FWIW, @romange has done plenty of testing in the past, and was instrumental in finding the task-work-related signal slowdown for threads. Questioning the validity of a test is perfectly valid; in fact, it's the very first thing that should be done before potentially wasting any time on examining results from said test. Nobody is in a position to be making any demands in here; in my experience you get a lot further by being welcoming and courteous.
The above is not a question, it is a confident statement without any backing proof other than "it will be the case". This is the issue here - blindly making claims without any backing proof other than "listen to my assumption".
I can recompile with debug symbols. These are very small copies; like we have talked about already, it's 32 bytes per pipe, so the overhead of the copy is very likely not the bottleneck. And again, if you look in this thread you can see that many people with AMD CPUs see big wins with io_uring, and my own Raspberry Pi is much faster with io_uring. They all do the same copying, yet my shitty Intel machines don't perform any better with io_uring.
At this point the bug report is more about Intel machines not seeing any gains while pretty much all other ones see big gains.
Ok, still curious to see relative overheads. If you're willing to recompile, there is a list of kernel options to check
Yep, I'll come back in a few days; I have other things to do as well. Will be interesting to see.
@alexhultman The io_uring variant of your benchmark completes consistently faster in my tests when I change the
@axboe Is this style of brute-force CQ polling supposed to work well? In my tests, especially when I add
I'm on a somewhat older Intel CPU, and adding the wait as noted above makes the io_uring test consistently faster than epoll. From where I'm sitting right now, the io_uring benchmark seems to simply be incorrectly implemented. And judging from the code quality in general, that would not surprise me one bit.
@vcaputo The benchmark started like that; io_uring_submit_and_wait and io_uring_wait_cqe were how the original version of the benchmark ran. At that point (Linux 5.8) it was still not faster than epoll, so the benchmark changed to make use of fixed files, fixed buffers and polling. That was faster, so that solution remained. I can re-check with io_uring_wait_cqe on my Linux 5.16. Thanks for testing and reporting.
Well, you still keep the opportunistic batched consumption and only resort to waiting when unlucky. It's a sort of hybrid, akin to how mutexes are often tweaked to spin a little before going to sleep (involving the kernel) when contended. But what you have currently is written more like a spinlock in userspace, which is generally A Bad Idea unless you're exerting very fine control over which threads are running on which cores. If you don't even attempt the batched wait-less consumption you'll definitely go slower, as the wait interface just gets a single cqe IIRC.
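A sketch of that hybrid (my own illustration under the description above, not the benchmark's actual code): drain whatever has already completed in a batch, and only block in the kernel when the CQ turns out to be empty.

```c
#include <liburing.h>

/* Hybrid CQ consumption: batched wait-less drain first, blocking wait
 * only when nothing is ready. Returns the number of CQEs handled. */
static int drain_cqes(struct io_uring *ring)
{
    struct io_uring_cqe *cqes[64];
    unsigned got = io_uring_peek_batch_cqe(ring, cqes, 64);

    if (got == 0) {
        /* Unlucky: nothing ready, wait for at least one completion. */
        struct io_uring_cqe *cqe;
        int ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        cqes[0] = cqe;
        got = 1;
    }

    for (unsigned i = 0; i < got; i++) {
        /* handle cqes[i]->res / cqes[i]->user_data here */
    }
    io_uring_cq_advance(ring, got);
    return (int)got;
}
```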
I'm definitely open to whatever solution is the best as of Linux 5.16. My goal is really just to have io_uring beat epoll on all my machines, including the shitty Intel ones. I've tested 10-or-so solutions by now. Will test yours when I have more time for this.
@vcaputo That change doesn't make any difference for me. It never even runs that path.
@lnicola, I just noticed that it enables SQPOLL, and the program doesn't handle it right, though it should work fine with normal rings. Anyone benchmarking it should disable SQPOLL unless willing to tinker appropriately and invest more time in the comparison. It's not magic; it can degrade performance and add variability to results.
Hi, I've also been playing with io_uring lately, trying to match epoll performance in an IPC use case with pipes: one server and n clients, and a simple request-reply protocol. I modified the benchmark mentioned here and elsewhere for my tests, see the

What I could see is that epoll currently always beats io_uring, with any number of clients. My use of io_uring is pretty simple: I used the same base as the existing code with fixed fds and buffers (although that did not make a difference), and I'm submitting write-read linked sqes, with the cqe_skip_success flag set on the write. I've made some gpuvis traces, which I think are interesting to understand the reasons. I used a build of the Linux kernel from git (79a72162048e42a677bc7336a9f5d86fc3ff9558), with the patches from [1] on top, and I verified that there's no cqe returned for the write completion. (I'm not sure if this is the latest version of the patches, and I'm happy to try something else if there is.)

With epoll, with 1 client thread, each thread wakes the other one when it has completed its write, and they alternate very quickly. With 4 client threads, the server is fully busy (in red) and never gets interrupted, serving requests one after another, and the processing time is still very small and roughly the same as with one client.

With io_uring, however, things are quite different, even with 1 client. For a start, there's a kernel worker involved, so a third thread. Then, there seems to be at least one spurious wakeup of the server. I would expect that the client write would only wake up the worker, copying the data and waking up the server, which would process it, submit the next sqes, waking the worker and starting to wait for the next one. I verified that the only cqe the server receives is for the read completion, but maybe it resumes from io_uring_submit_and_wait for nothing once. The number of spurious wakeups doesn't seem to depend on the number of clients, as can be seen here, with 2 clients. However, there's an additional worker for each client; it doesn't really seem necessary?

About the io_uring workers, I think there's something fishy: whenever I stopped the measuring, it all went crazy for a bit, as can be seen here (note that the scale is much larger, and every small block is also an io_uring worker working).

Edit: That last part may just be because the code doesn't handle errors and just started submitting a lot of invalid sqes when interrupted.

Edit 2: It's actually not; I checked again and I don't see any completion with error status, or any error from the client read or write when I interrupt the process, it just shuts down immediately. The iou_worker frenzy is still there as soon as there's more than 1 client, and already there on 5.15 without the SKIP_SUCCESS flag. It's possible there's something wrong with my code but I don't see where?
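For readers unfamiliar with the submission pattern described there, a minimal sketch (my own illustration, not the poster's code; fd, buffers, and lengths are placeholders, and IOSQE_CQE_SKIP_SUCCESS requires a kernel carrying those patches): a write linked to a read, with the write's successful CQE suppressed.

```c
#include <liburing.h>

/* Sketch: link a write to a read so the read starts only after the write
 * completes, and skip the CQE for a successful write. */
static int queue_request(struct io_uring *ring, int fd,
                         const void *req, unsigned req_len,
                         void *reply, unsigned reply_len)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, req, req_len, 0);
    sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, reply, reply_len, 0);
    io_uring_sqe_set_data(sqe, reply);   /* identify the completion later */

    /* Submit both and wait for the read's completion. */
    return io_uring_submit_and_wait(ring, 1);
}
```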
@rbernon just out of curiosity, which tool is that?
If you see any iou-wrk workers, then something isn't working right. That's the slower async path, and should not get hit unless IOSQE_ASYNC is being explicitly set. What code is being run?
Took a quick look at that server, and it looks like it has two modes:
Neither is a great solution; the most performant approach will generally be to just issue the request and have the internal poll handle it. Otherwise you're just trading an epoll readiness-based model for the ditto on io_uring, which doesn't make a lot of sense. It can be useful for converting existing code, but for new code it isn't really advisable.
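A sketch of what "just issue the request" might look like (an illustration under the advice above; fd, buffer, and length are placeholders): submit the recv itself rather than an explicit poll followed by a read, and let io_uring's internal poll arm when the socket isn't ready.

```c
#include <liburing.h>

/* Sketch: no readiness step. The completion fires once data has actually
 * been received into buf; io_uring handles the not-ready case internally. */
static void queue_recv_directly(struct io_uring *ring, int fd,
                                void *buf, unsigned len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, fd, buf, len, 0);
    io_uring_sqe_set_data(sqe, buf);
    io_uring_submit(ring);
}
```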
Interesting, I added the flag on the read sqe because it's not expected to succeed immediately after a write from the server (the clients are not generally spamming requests). It didn't seem to make a difference at first, but I'll try without it.
FWIW the server code is https://github.com/rbernon/io_uring-echo-server/blob/master/io_uring_ipc_server.c, and it only used the IOSQE_ASYNC flag on the read sqe. I tried again without this flag, and with IOSQE_CQE_SKIP_SUCCESS, and now the spurious wakeup is gone but there are still iou workers involved, as well as the worker frenzy when exiting the test.

With 4 clients, something weird started to happen, with periods of "normal" latency and periods of very high latency. Looking at gpuvis I can see that the "normal" latency periods involve server, clients, and workers talking to each other. But after a while there's only the server left, seemingly doing nothing? Plus a huge frenzy at the end, but that is kind of expected.

Unrelated note: by default gpuvis uses millisecond resolution; I modified it to round the timestamps to µs because I wanted to be able to see the latency more accurately.
@rbernon, pipes, right? The pipe internals don't support any sane nowait behaviour, so we have to force any I/O against them to io-wq (the slow path). It's just a one-line change on the io_uring side, but then submission may end up waiting for I/O potentially unboundedly. Hopefully, one day we'll push hard enough for a change in the pipe code. FWIW, the benchmark up in the thread doesn't go through io-wq only because of O_NONBLOCK, and there is a special case in io_uring for that. Don't know what that "frenzy" with lots of rescheduling exactly is, though that's interesting.
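A small sketch of the O_NONBLOCK special case mentioned there (my own illustration and an assumption to verify, not code from this thread): create the pipe nonblocking from the start so reads submitted to io_uring are not punted to the iou-wrk slow path.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>

/* Sketch: nonblocking pipe from the start (no later fcntl()), then a
 * plain read SQE on the read end. buf and len are placeholders. */
static int queue_pipe_read(struct io_uring *ring, int fds[2],
                           void *buf, unsigned len)
{
    if (pipe2(fds, O_NONBLOCK) < 0)
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fds[0], buf, len, 0);
    return io_uring_submit(ring);
}
```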
Can't we register a poll on the pipe, and issue the write from the poll callback? The same holds for other socket-like files that support poll() but aren't sockets. They can be handled in the same way as recv.
Now I'm not sure that recv completes without a workqueue. Looking at the code, it ends up in io_apoll_task_func, which punts via io_req_task_submit. Is recv indeed using a workqueue?
I'm probably lost in the maze.
We do use poll for any file type that supports it, but since we can't issue a nonblocking read attempt on a pipe, it still has to be done from a worker. What needs to happen here is just converting pipes from using struct file_operations->read to ->read_iter. The solution is known, and patches do exist. This isn't specific to pipes, but obviously they are one of the more important file types. Thankfully most files use ->read_iter and ->write_iter these days.
It is not; this is task_work. That's different from thread offload. For the latter, look for io_queue_async_work().
Yes, sorry.
I'm trying to fork() the io_uring simulation by creating a parent and a child process, each with its own rings: the parent writes to the pipe, the child reads from it. However, I stumbled upon an error: when the reader side reads an empty pipe, instead of waiting for content in the pipe (if there's no error the simulation indeed runs faster), the CQE comes back with "Async task failed". To overcome the problem, I didn't set O_NONBLOCK in pipe2(); this makes the reader wait until there's content in the pipe before unblocking. But on the downside it ends up calling fcntl() and making syscalls, and the performance of the simulation is slower than the original simulation file. I wanted to know how to make the reader side of the pipe wait for data to be ready without syscalls.
This is a general question. Does replacing epoll with io_uring reduce the CPU usage for a given workload, say an nginx-like HTTP gateway? Reducing CPU usage in our production environment is the primary goal, rather than achieving higher throughput or RPS in a benchmark environment. This question comes to me because, IMO, io_uring makes socket IO purely asynchronous, but the data copying between user and kernel is preserved, assuming zero copy (IORING_OP_SEND_ZC, IORING_OP_SENDMSG_ZC) is not used.
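For context on the zero-copy send the question mentions, a minimal sketch using liburing's io_uring_prep_send_zc (requires a recent kernel and liburing; fd, buffer, and length are placeholders). Note that a zero-copy send posts two CQEs: the send result (flagged IORING_CQE_F_MORE) and a later notification (flagged IORING_CQE_F_NOTIF), and the buffer must stay untouched until the notification arrives.

```c
#include <liburing.h>

/* Sketch: queue a zero-copy send and wait for both the result CQE and the
 * buffer-release notification before reusing buf. */
static int queue_send_zc(struct io_uring *ring, int sockfd,
                         const void *buf, size_t len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_send_zc(sqe, sockfd, buf, len, 0 /* msg flags */, 0 /* zc flags */);
    io_uring_sqe_set_data(sqe, (void *)buf);   /* find the buffer on completion */
    return io_uring_submit(ring);
}
```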
EDIT: I have made available a detailed benchmark with epoll that shows this in a reliable way: https://github.com/alexhultman/io_uring_epoll_benchmark
Don't get me wrong, since I heard about io_uring I've been all about trying to scientifically reproduce any claim that it improves performance (theoretically it should, since you can have fewer syscalls, and I like that idea). I would love to add support for io_uring in my projects, but I won't touch it until someone can scientifically show me proof of it outperforming epoll (significantly).
I've tested three different examples claiming to be faster than epoll, tested on Linux 5.7 without Spectre mitigations and on Linux 5.8 with Spectre mitigations. Tested Clear Linux, tested Fedora. All tests point towards epoll being faster for my kind of test.
What I have is a simple TCP echo server echoing small chunks of data for 40-400 clients. In all tests epoll performs better by a measurable amount. In no single test have I seen io_uring beat epoll in this kind of test.
Some io_uring examples are horribly slow, while others are quite close to epoll performance. The closest I have seen it get is this one: https://github.com/frevib/io_uring-echo-server compared to this one: https://github.com/frevib/epoll-echo-server
I have tested on two separate and different machines, both with the same outcome: epoll wins.
Can someone enlighten me on how I can get io_uring to outperform epoll in this case?