The eager read/write optimization leaves curio programs highly susceptible to event loop starvation #106
Comments
I think this needs more study. If all of this is running on a single machine, there are too many weird quirks of scheduling that could be in play regarding threads, processes, and everything else. I'm also not sure that a client hitting the server at maximum speed through the loopback interface represents a scenario that would happen for real, but then again, I honestly don't know. I just know that performing all sorts of extra system calls and scheduling around I/O isn't free. So, all things equal, I don't want to unnecessarily cripple the performance of curio if I don't have to. Just as a note: I've seen similar "starvation" effects simply programming with threads (no async). A lot of this may be tied up in the scheduler of the OS. I'll have more to say about this later.... have to step away to pick up kids though.
The theoretical analysis is pretty simple: if the peer is fast enough that the I/O calls never block, then the task never yields to the scheduler.
When testing things like this, it does help to have a datacenter machine with a 10gigE card. You'd be surprised how fast those send buffers empty, especially without PyPy to speed your hot read/write loops along :-).
I understand the theory of it and agree that it might be a problem, but I'm not yet convinced that this would actually be a common situation--it seems that it would require the task in question to be pegged at 100% CPU load continuously and to never block ever. Given all the variances on a typical network, it seems like this would be pretty hard to achieve except in carefully controlled tests. And even if that did happen, it seems like you'd have a much more serious problem on your hands related to the scaling of whatever thing you were running. I could be wrong though. I just don't know. How are other frameworks dealing with this? How do programs that use threads deal with this? All things equal, I don't want to completely discount the performance overhead of constantly doing everything on the event loop by doing away with eager read/write. I could possibly envision some kind of hybrid approach that allows a certain small number of eager reads to take place first before blocking on the event loop. That might allow more normal workloads to take advantage of eager read/write while forcing the extreme case to more properly schedule itself.
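For concreteness, here is a rough sketch of what such a hybrid could look like, reusing the internal names (_socket_recv, WantRead, _read_wait, _sleep) that show up in the recv() code quoted later in this thread; the _eager_count attribute and the cap of 32 are made up for illustration, and this is not curio's actual code:

    async def recv(self, maxsize, flags=0):
        # Assumes self._eager_count is initialized to 0 elsewhere.
        while True:
            try:
                result = self._socket_recv(maxsize, flags)
            except WantRead:
                # We actually blocked, so fairness is restored; reset the counter.
                self._eager_count = 0
                await _read_wait(self._fileno)
                continue
            self._eager_count += 1
            if self._eager_count >= 32:
                # Too many consecutive eager reads: force one trip through the scheduler.
                self._eager_count = 0
                await _sleep(0, False)
            return result

The same counting trick would apply to send().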
As a followup, I'm looking at the code for gevent and unless I'm reading it wrong, it looks to me like it's doing eager recv/send in basically the exact same manner as curio. For example (in
Maybe it suffers the exact same problem. However, gevent has certainly seen much more real-world use than curio. So, if there are known starvation problems there, that would be an interesting thing to know about. Edit: I did find this: http://stackoverflow.com/questions/13904114/why-is-gevent-sleep0-1-necessary-in-this-example-to-prevent-the-app-from-block
To make the echo server example a little more realistic, I added a 1 ms busy loop to the send/receive loop, to simulate it doing a bit of on-CPU processing on each message (code). With this change, the output becomes:
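The linked code and the output sample aren't reproduced here. A sketch of the kind of modification being described -- a roughly 1 ms busy-wait per message, with the exact instrumentation details assumed -- might look like:

    import time

    async def echo_handler(client, addr):
        while True:
            data = await client.recv(100000)
            if not data:
                break
            # Simulate ~1 ms of on-CPU work per message before echoing it back.
            deadline = time.perf_counter() + 0.001
            while time.perf_counter() < deadline:
                pass
            await client.sendall(data)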
With threads this can't happen because the scheduler interrupt will fire, the scheduler will notice that the eager thread has outlived its allocated time-slice, and it'll get pre-empted and bumped to the back of the line. Then the next time the eager thread is running, if some I/O suddenly arrives for another thread that's blocked on I/O, the eager thread will get pre-empted to let the I/O bound thread run, because the I/O bound thread hasn't been using up its timeslices so it gets preferred access to the CPU. I guess you already know all this, but maybe it's useful for those reading the thread later :-).
I don't actually find this very reassuring, because even if it's rare/probabilistic... if you give me a choice between a server that usually handles requests in 1 ms but sometimes has a 100 ms latency spike, versus one that always handles requests in 5 ms, then I think almost everyone will prefer the 5 ms server. To me it doesn't make sense to talk about average case latencies until after at least putting a bound on tail latencies. All that's required to hit this is a program that is occasionally CPU-bound and occasionally has fast peers. That seems pretty realistic to me :-/
I don't either, but I guess my inclination would be to focus on getting the semantics right first. Premature optimization is the root of all evil... or specifically, spending energy on optimizing now seems like a poor investment compared to working on correctness, because correctness is what lets us start writing apps on top of curio, and apps on top of curio are what let us start nailing down the semantics, and are also what give us real benchmarks to replace this echo server stuff. And it's much easier to try to optimize a fixed set of semantics with real benchmarks than it is to try to optimize an unstable set of semantics without real benchmarks. From a quick check it looks like fixing this in the most crude way possible (adding a ...
Maybe a different kind of question, but if you see a call like ... I'm not suggesting that curio has to follow a similar approach. Maybe it doesn't. I don't know.
I still want to know what people are doing about this in other frameworks. Curio is not the only implementation that uses eager read/write. Truth be told, I wasn't even really thinking about optimization when that code was written--the implementation uses a pretty "standard" approach to non-blocking I/O that I've seen documented elsewhere. Namely, you try the operation first and if it causes an exception due to blocking, you suspend. What happens if you run this echo benchmark on gevent?
I want to separate this into two pieces, conceptually: scheduling policy and pre-emption points. Definition: a pre-emption point is a place where the scheduler is invoked; the scheduling policy then decides whether the current task gets pre-empted, and if so then which task runs next. For common operating systems, every instruction is effectively a pre-emption point, so ... For curio, we have a cooperative scheduler, so pre-emption points only occur at explicitly marked places. Users need to be aware of these, because it's their responsibility to ensure that a pre-emption point is executed on a fairly regular basis to avoid starving the event loop, and because they need to be aware of when they can be pre-empted for the usual concurrency management reasons. So in curio, my expectation is that ... [Edit to add: Right now in curio the distinction between pre-emption points and the scheduling policy is somewhat obscured, because the current scheduling policy is "every pre-emption point triggers a pre-emption and puts the task back at the end of the FIFO scheduling queue". But this could change, and the choice of scheduling policy is conceptually orthogonal to the eager read/write optimization. The problem with eager read/write is that it means that the scheduler never runs at all, so even a fancier scheduling policy would still be helpless.] (It's funny that in the other thread right now I'm arguing that ...
It occurs to me that curio could potentially get much fancier about its scheduler: e.g. it could spawn a worker thread to do the actual watching for I/O and calculate which task ought to be running at each time (taking into account an up-to-date view of I/O readiness, statistics on different tasks, etc.). And set a flag whenever a task has overstayed its welcome (i.e., when it would have been pre-empted in a non-cooperative system). And then when we hit a pre-emption point like ...
Here's the gevent version of the instrumented echo server, with the 1 ms busy loop: https://gist.github.com/njsmith/6b0db29a4f5046b795396e78c763a651 Consistent with that stackoverflow link you found, it shows the same starvation behavior as curio, but even worse -- I've left it running for a while now with two clients, and the second client still hasn't received a single byte of data back. Currently it's at ~500,000 loops over ~9 minutes, and ~55 GiB of data transferred to the first client, without once yielding to the scheduler. I guess what people do about this in the gevent case is that they post frustrated questions on stackoverflow... I feel like if your networking library can't be used without getting help from @minrk then maybe your networking library is too hard to use :-).
On further investigation, it looks like the reason gevent seemed to perform worse than curio is that the gevent echo server was doing ... tl;dr: the curio results pasted above are unrealistically optimistic.
First, I would agree that, given the low probability of this phenomenon occurring, the hybrid approach should work. Second, I think this issue could be resolved on the application level. Let the user decide whether to use the eager r/w optimization, or to guard against such a DoS. To do this, the user can explicitly tell the socket to receive only at most some number of bytes, and then voluntarily call switch():

    async def send(sock, data, chunk_size=1000):
        totalsent = 0
        while totalsent < len(data):
            sent = await sock.send(data[totalsent:totalsent + chunk_size])
            totalsent = totalsent + sent
            await switch()

The only thing we should definitely do is to stress the eager r/w behavior in the documentation, maybe even with an example solution to this corner case.
On 11/14/16, David Beazley [email protected] wrote:
I doubt this is something Curio should solve. If any length is supplied (i.e. read(10000)) ... So I'm on the side of non-eagerness. If performance/throughput is top priority ... So please keep it stupid. This way we can be sure ...
After thinking about this, I'm going to keep the curio recv()/send() behavior as it is. A couple of reasons:
I want to make one other comment. The code that curio is using here is NOT some weird aggressive optimization. This is basic non-blocking I/O code that you see in just about every tutorial on non-blocking I/O. You try the I/O operation first. If it fails with a blocking error exception, you then fall back. That's how you do it. I'm also not convinced that this is a real problem. I tried searching around for "gevent starvation" and didn't find much. Yes, there are occasional bizarre corner cases where people have seen that. However, a lot of eyeballs have looked at that project and it's doing things exactly the same way as curio. I'm just not seeing much of an issue surrounding this. Note: I'd probably be in favor of a special ...
The problem with sleep(0)/switch() is that it is like ... But if nobody is greedy then there is no need for an ...
If you're putting sleep(0) calls all over your code, you're probably doing something wrong. The test case and problem being discussed here is an extreme corner case--an extremely aggressive client that's hammering a server on the loopback interface and doing little else. That is simply not a common scenario. I'm not discounting a high-traffic client completely as a potential problem, but I'd like to see it discussed with a more real-world application (e.g., HTTP) before doing much more with it. Yes, there are certain situations where inserting a ... There are also other ways that ... In any case, I want to see curio used in more real-world settings before considering any possible change to I/O handling at the moment.
On 11/15/16, David Beazley [email protected] wrote:
Sure, it is wrong. There are two cases I can name:
In any case using sleep(0) can be forgotten or can not be ...
I'm very content with it in my custom HTTP caching/blocking ...
I can respect waiting to gather more information before making a decision, but I do strongly disagree with the characterization of this as some weird irrelevant corner case. Rather, I'd say that it's a nasty latent bug that will lurk in a large proportion of curio applications, and then when the moons align it'll cause surprising and hard to debug failures (including partial failures / weird performance issues). Not all curio applications, but even figuring out whether a particular piece of code is potentially susceptible to this issue is really difficult -- probably impossible in any moderately complex system. The whole reason I'm interested in curio is that (I think) it makes it much easier to write correct applications than asyncio does; the idea that we should accept some weird flakiness in order to get a speed increase that might not even be measurable in real life is pretty worrisome to me. If this is really the behavior that you intend to standardize on, then I'm debating with myself whether I need to go back to my blog post about asyncio and curio -- specifically the part where I say "in asyncio the code looks right, but in fact you need to add this weird
Curio doesn't have to context switch on recv/send either, even if we do trap to the scheduler -- just write a better scheduler ;-). Anyway, the difference from an operating system is: when writing curio code, users have to keep track of where the scheduler is being invoked, because otherwise their code will have concurrency bugs, and cancellation bugs, and fairness bugs. This is a problem that just doesn't exist when working with pre-emptive threads. So an important part of curio's public API is deciding which operations invoke the scheduler, and under what circumstances, and communicating that to users. As you say, ...
I feel like there is a certain amount of catastrophizing going on in this bug report. The particular test here (hammering a server as fast as possible on loopback) without even doing anything with the data is simply not representative of reality. I think it's reasonable to wait for more evidence before taking action. For example: running this same benchmark using a network connection between two machines. Or modifying the client to do some kind of work on receive. For example, what happens if the client does a Unicode decode() on all of its received data? As an aside, when I was studying the GIL 7 years ago, I saw a lot of weird scheduling behavior simply due to the fact that multiple cores were being used. I contend that there could be all sorts of crazy things happening in this specific test because a) client/server are on the same machine b) multiple cores c) no real work other than I/O is being performed.
On 11/15/16, Nathaniel J. Smith [email protected] wrote:
Never encountered such a case. In fact, I'm very concerned that ... the 'solution', drain() on asyncio, is nasty. I would not favor ...
As a note: I'd never want to see direct usage of sleep(0) as the "solution" to this. If I were solving this for real, it would disappear into the underlying socket class. Or maybe a subclass of socket.
I don't have a strong opinion on curio's behaviour here, but just noting a GIL-related example that I think is analogous to @njsmith's concern: CPython consistently releases the GIL around potentially blocking IO operations, so Pythonistas are accustomed to sock.send()/recv() being reasonably good guarantees that a given thread isn't going to starve other threads of access to the GIL. This is done unconditionally, before making the potentially blocking OS level call. The naive interpreter level scheduler can still cause latency spikes when CPU bound threads are mixed with IO bound ones, but IO bound threads will always play nice with each other. However, I think your main real-world defence against the specific problem being raised here isn't anything to do with the typical relative speeds of clients and servers, but everything to do with typical round trip latencies for TCP packet acknowledgements, and that's where having both the hostile client and the server under attack being on the same system is misleading: instead of there being a ping delay between "data sent by server" and "server receives notification that the data was received by the client", those two events are nigh simultaneous. If I'm right, then putting Vaurien between the client and server with a stable 10+ ms delay in each direction may be enough to eliminate the starvation: http://vaurien.readthedocs.io/en/1.8/#using-vaurien-from-the-command-line A UDP echo server wouldn't have that defence, but UDP servers aren't connection oriented either, so "client starvation" isn't really a thing that can happen the way it can with TCP.
Infinitely greedy send/recv has been a problem in some async uses of pyzmq, though the cases I'm aware of are generally described as "always publish x as fast as I can." This is because it's close to impossible for send to block in most cases. Starvation on read has been harder to accomplish in reality, and I don't believe I've ever had a report of it. What I've been interested in for async zmq is a starvation limit - i.e. greedily send/recv up to a certain amount on a socket in a loop, but cap starvation. This could be a tuning parameter - high limit for maximum throughput on a given channel, but possible starvation of other channels during load. Low limit for maximum 'fairness', but tradeoff in maximum burst throughput. The measure could be either time or data.
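A rough sketch of what a time-capped greedy loop could look like for a plain non-blocking socket (the 10 ms figure mirrors the default mentioned below; the helper name and structure are made up, and real pyzmq/curio code would bury this inside the library rather than the application):

    import time
    import curio

    MAX_STARVATION = 0.010   # seconds of uninterrupted greediness allowed

    async def drain_greedily(sock, handle_chunk):
        # Recv eagerly while data is available, but never hog the loop
        # for longer than MAX_STARVATION before yielding.
        deadline = time.monotonic() + MAX_STARVATION
        while True:
            try:
                data = sock.recv(65536)        # raw non-blocking recv
            except BlockingIOError:
                return                         # nothing more to read right now
            if not data:
                return                         # peer closed the connection
            handle_chunk(data)
            if time.monotonic() >= deadline:
                await curio.sleep(0)           # let other tasks run
                deadline = time.monotonic() + MAX_STARVATION

A data-based cap would simply count bytes (or loop iterations) instead of checking the clock.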
I took a stab at bounded greedy handling in pyzmq here. I set the default of 10ms maximum starvation. In my case on pyzmq, I got roughly these performance penalties in a pyzmq async throughput benchmark:
For comparison, the introduction of greedy handling here provided a ~5x improvement in the same benchmark. (Note that these are benchmarks intended to stress the overhead of the Python implementation, not actual network behavior.)
Aren't time-based deadlines quite expensive? Syscall and all? How about a simple counter? Be greedy for at most N times in a row, for example.
Yup, I think so! I plan to also take a stab at size-based limits, to compare in reality. Size-based limits are also probably simpler for plain sockets than they are for pyzmq.
The benchmark where you get complete starvation for minutes on end isn't really the case I'm worried about; it's just that I tried the most obvious simple benchmark and that's what happened :-). What I'm really worried about is things like a proxy server where a high bandwidth client happens by chance to send a large burst of data at the right time to starve out all other connections for 10s or 100s of milliseconds, creating a "bursty" latency profile that creates a low perceived quality-of-service. Or the HTTP/1 server I described above, where a high-performance pipelining client could starve out other connections. (Usually we... like deploying high-performance programs?) Or a gRPC server: gRPC uses HTTP/2 to multiplex multiple RPC calls over a single connection, which can create conditions similar to the pipelining case. I'm imagining after-action reports like:

We've now completed our analysis of the outage this weekend. In our system, front-end servers receive requests, and then issue a series of RPC calls to a layer of back-end servers, which in turn issue queries against the database layer. Two weeks ago, we deployed new caching logic in the back-end servers; when the cache hits (~90% of the time) this allows the back-end to respond to RPCs without accessing the database. This improved average response latency, but there was an unexpected interaction with the "eager read/write optimization" in the back-end's networking library: previously, between each RPC request and RPC response, the synchronous database access forced the RPC handler to be rescheduled, which in turn enforced fairness across different RPC connections. Now, in cases where we hit the cache, the response can be computed entirely on-CPU without rescheduling; if multiple such requests come in multiplexed on the same connection, then we may find ourselves ignoring requests on other connections until we've completed sending all of the cache hit responses. Normally, the only effect this has is to create a somewhat spikier latency distribution. But early Sunday morning, the following conditions lined up: a front-end server issued 64 parallel RPC requests; by chance, our sharding algorithm assigned 23 of these to the same back-end server; by chance, 21 in a row of those 23 requests were cache hits. The back-end server processed all 21 of those requests without yielding to the event loop, making it unresponsive to other requests for long enough to trip our watchdog timer and remove that back-end from the service pool. This increased the load on the other back-end servers, which had the effect of somewhat increasing the average latency across the board. This wouldn't have been a problem -- we maintain some excess capacity exactly to handle situations like this -- except that the higher baseline latency combined with our newly-spiky latency distribution increased the probability that one of the remaining servers would have an "unlucky enough" series of requests to trigger their watchdog timer. At T+42 minutes, this occurred on a second server, thus further increasing the load on the remaining servers, which in turn created a further increase in baseline latency, and we spiraled down in a cascading failure; complete service outage occurred at T+63 minutes.

Does that seem like unrealistic catastrophizing? It sounds pretty typical to me...
I don't disagree with this. I'm just a little baffled that so far the evidence in favor of the eager read/write optimization is a 15% speedup on one probably-irrelevant microbenchmark, and we could reduce this gap even more by optimizing the scheduling, yet everyone else seems to think that this is probably worth it. I guess that when it comes to distributed systems prone to emergent complexity, I'm just a "better safe than sorry" kind of guy :-). All I'm asking is that eventually, the public API should guarantee that a simple send/recv loop has some kind of bound on its worst-case unfairness.
(1) There are lots of network apps running in data centers where the typical RTT is sub-millisecond. (2) The RTT is actually almost irrelevant to all of the examples I gave above. If you have a send/recv loop that's implementing a classic request/reply protocol where I send something, and then there's nothing for me to recv until after you've received my message and sent a response, then almost any RTT is enough to avoid these issues. That's why I'm talking about situations like proxies and pipelining and multiplexing, where send and recv can happen independently :-). You suggest that even in this case the slowness of TCP packet acknowledgements might save us -- but the only way this will help is if the TCP packet acknowledgements are so slow in coming that they make the send buffer become completely full, or the receive buffer completely empty. But if this happened it would be a wasteful stall! The whole goal of TCP's flow control mechanisms is to try to prevent this from ever happening -- and in particular, all modern TCP stacks will estimate RTT and try to increase the sizes of these buffers to make sure that ACKs have enough time to arrive before this happens. Note in particular that in the echo server example, RTT is basically irrelevant. What is relevant is that a client is able to push enough data to saturate the server's CPU. If you have a well-functioning network, then a client should be able to push that from another machine, with or without a 10 ms RTT, etc. Of course normally you wouldn't want to run a server at 100% CPU usage in its steady state -- but intermittent/temporary CPU saturation seems normal and even desirable. (If you have work to do, you want to get it done as fast as possible!)
But a realistic UDP server probably does have other background tasks running on the same event loop that could get starved: e.g. DNS servers serve UDP and TCP simultaneously, or you might have a background task that does some housekeeping on a timer tick.
Please don't jump to conclusions about relative performance like this... it tends to really confuse matters. FWIW on my laptop, it looks like incrementing a Python-level global integer costs 60 ns, and calling
So apparently time-based deadlines are cheaper than counters! Maybe. To really know you'd have to implement and optimize both. And then you'd need to weigh the absolute costs against other considerations like maintainability, reliability, etc. Personally I find it weird to be making any kind of decision at all about public API semantics based on comparing little 10-20% speedups/slowdowns.
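The measurement referred to above is cut off, so the exact calls compared aren't visible; for anyone who wants to reproduce the comparison, a rough micro-benchmark along these lines (with time.monotonic() as a stand-in for the clock read, and loop overhead included in both numbers) gives the flavor:

    import time

    N = 1_000_000

    # Cost of bumping a Python-level counter.
    counter = 0
    start = time.perf_counter()
    for _ in range(N):
        counter += 1
    per_increment = (time.perf_counter() - start) / N

    # Cost of reading a monotonic clock, which is what a time-based deadline needs.
    start = time.perf_counter()
    for _ in range(N):
        time.monotonic()
    per_clock_read = (time.perf_counter() - start) / N

    print(f"increment: {per_increment * 1e9:.0f} ns/op, monotonic(): {per_clock_read * 1e9:.0f} ns/op")

Absolute numbers will vary by machine and interpreter; the point is only that both operations cost on the order of tens of nanoseconds, well below the cost of an actual recv().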
@njsmith Thanks for the clarification - I was thinking in terms of the "frontend server talking to client devices" case rather than the "backend server talking to frontend servers" case, and I agree that in the latter case loopback testing is more likely to be representative of real network conditions.
I think I'll go back to the very basics again.
await recv/read() should not do anything more than blocking ... And the kernel should only do just enough to ...
The current version of Curio already provides these semantics. How does a recv() call work with threads? If data is available, it is returned immediately. If not, the calling thread is blocked by the OS kernel and a new thread is scheduled. How does a recv() call work with Curio? If data is available, it is returned immediately. If not, the calling task is blocked by the Curio kernel and a new task is scheduled. The only difference between the two concerns who is responsible for blocking the caller when no data is available. Otherwise, the semantics are the same--as currently implemented in Curio. I don't get this "ONLY AFTER REALLY COMPLETING" business. If curio tries a recv() call and it runs without an exception, then the operation has, in fact, completed. It's not an "optimization." It's not a "hack." There is nothing to see here. The issue at hand in this report is whether or not Curio should additionally invoke the scheduler in all I/O operations. I don't think it should. And neither should anyone else who wants the semantics to be exactly the same as blocking I/O in synchronous code.
As the Curio BDFL, I've given a lot of thought to this issue. I do not intend to change the current behavior of curio with regard to this. There are a couple of reasons:
On 11/16/16, David Beazley [email protected] wrote:
It's for write() as described for asyncio.
Agreed.
This comparison doesn't really hold up. First, that's not how network I/O scheduling works on real kernels. Once your thread has context switched into the kernel, it might get scheduled out if too many other threads are ready to run, whether a result is available from your particular syscall or not. This can happen as a result of a call to ... By contrast, in a cooperatively-scheduled framework like Curio's, user-space scheduling is all that we have, so being more careful about scheduling decisions and taking the opportunity of a ... I should point out that interactive performance problems will happen anyway - see for example this Stack Overflow question about interactive performance on Twisted's SSH server becoming untenable when one user is hogging all the bandwidth - http://stackoverflow.com/questions/40601119/python-ssh-server-twisted-conch-takes-up-high-cpu-usage-when-a-large-number-of/40627446 - the ability to do scheduling at all is only the first step to doing some kind of fair scheduling.
On 11/16/16, Glyph [email protected] wrote:
I guess you mean that opportunity should be ... Then are we talking about a hardened scheduler?
Or a client from a hogging server, of course :).
A "hardened" app can't control what the libraries that it's integrating does. This kind of tuning needs to be done in the event loop core, not sprinkled haphazardly through application code. That said, it just needs to be tunable in the core somehow; although I believe it should be the default, it doesn't need to be the default, as long as you can swap in another eagerness-optimization provider. |
On 11/16/16, Glyph [email protected] wrote:
Then we are talking about a pluggable scheduler scheme. Others may choose a custom one that fits their use case. Any further closing remarks?
I was looking at some of the code and got to thinking. Consider the current implementation of recv():
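The code block that followed here didn't survive extraction; judging from the modified version quoted a few comments below, the implementation under discussion is presumably along these lines:

    async def recv(self, maxsize, flags=0):
        while True:
            try:
                # Try the receive eagerly; only fall back to the scheduler if it would block.
                return self._socket_recv(maxsize, flags)
            except WantRead:
                await _read_wait(self._fileno)
            except WantWrite:
                await _write_wait(self._fileno)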
The code is "eager", but if you modified it to wait for reading first (i.e., calling |
I think this might be at the core of why we're at such cross-purposes here... from conversations so far, I get the impression that the experts you have in mind are folks (like yourself?) whose main interest is in messing about with the guts of event loops and experimenting with all the little details? OTOH, when I think of expert developers, the image in my mind -- the heroes I try to emulate -- are people with brains the size of planets who spend that energy on building complex systems out of simple, straightforward code that's obviously correct, and who appreciate APIs that make that easier to do. Right now, the only way to write an obviously correct I/O loop in curio is to put an ... Also, it should hopefully go without saying, but just in case: these are both totally valid approaches! Just... it's not clear how compatible they are. I guess we will probably continue to push-and-pull on this going forward.
I actually went down the rabbit hole reading about fair scheduling last night, and it looks like it'd actually be super easy to implement in curio :-). I'm trying to resist though, since there's really no use until we have more complex apps to play with... but yeah, even a smart scheduler can't do anything unless the tasks have schedule points in them.
Heh, good point! As a technical matter, though, it's easily solvable -- my suggestion is something like this:

    async def recv(self, maxsize, flags=0):
        while True:
            try:
                result = self._socket_recv(maxsize, flags)
                await _sleep(0, False)
                return result
            except WantRead:
                await _read_wait(self._fileno)
            except WantWrite:
                await _write_wait(self._fileno)

One could (should) make that prettier and smarter, but it accomplishes the basic requirement.
On 11/17/16, Nathaniel J. Smith [email protected] wrote:
What's the performance cost of this on your hogging setup?
On the echo server microbenchmark in ...
Just a quick observation, as I've started to read (and then scanned)
As such, I read this as if seeing cooperative- and preemptive- conflated.
Caveat - I spent no more than a lunchtime scanning this (but my impression ...
On 11/17/16, yarko [email protected] wrote:
As you point out, properly documenting the tradeoff and providing ... Depending on the use case, possible levels of solutions are:
I'm actually slightly on @njsmith's side, as the downsides here are asymmetric (a constant small performance cost which may be tuned as low as you like via a "yield to the event loop at least once every N IO operations based on a thread-local counter" mechanism vs "occasionally monopolise the CPU"). I see the status quo in curio as somewhat similar to the old bytecode-counting based switching model for CPython's GIL - it was mostly OK, but would sometimes go horribly wrong if you were calling a lot of computationally intensive C code that held on to the GIL without calling back into Python code. People's solution to that was sticking ...
On 11/18/16, Nick Coghlan [email protected] wrote:
It's a solution which is in the same school as "Garbage Collection". However, you may guess what my stance will be on a Python-specific ... On the other hand, I'm impressed that this thread is turning into a ...
The good news for Curio is that a problem touching ...
Anyway, info in this thread should definitely crawl ...
It's probably worth pointing out that this is part of a larger discussion about what sorts of guarantees different curio operations should make or not make around scheduling and cancellation and similar. See #101 for more of that broader discussion.
I'm closing this for now. Currently Curio runs all tasks that are ready after an I/O poll. After that, it goes back to polling regardless of whether or not any of those tasks have gone back on the ready queue. It largely seems to address this.
There was some discussion in #83 about removing the "eager write" optimization that curio does, where await sock.send will attempt to synchronously write to the socket directly, and only yield to the event loop if that fails. It turned out that the original reason we were considering removing it didn't actually apply, plus it significantly slows down the echo server microbenchmark, so for now the optimization survived. But... I've just realized that actually the eager read/write optimizations create a serious correctness issue, in that they can cause ordinary-looking code to starve out the event loop in realistic situations.
Here's a standard curio echo server, along with a particularly aggressive echo client. The test case is that we run two copies of the client, and watch to see how well the server switches between servicing both of them. The server is instrumented to report how often it switches.
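The linked server and client code isn't reproduced here; a minimal curio echo server of the kind being discussed (instrumentation and the aggressive client omitted) looks roughly like this:

    from curio import run, spawn
    from curio.socket import *

    async def echo_server(address):
        sock = socket(AF_INET, SOCK_STREAM)
        sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        sock.bind(address)
        sock.listen(5)
        while True:
            client, addr = await sock.accept()
            await spawn(echo_handler(client, addr))

    async def echo_handler(client, addr):
        # With eager recv/send, this loop may never yield to the scheduler
        # as long as the client keeps the socket buffers busy.
        while True:
            data = await client.recv(100000)
            if not data:
                break
            await client.sendall(data)

    run(echo_server(('', 25000)))

The aggressive client simply connects and pushes/reads data as fast as it can, so that recv() always has data waiting and send() never fills the socket buffer.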
The server output pretty regularly contains lines like:
What this is saying is that one of the echo_handlers managed to complete 220 passes through its read/write loop before yielding to the kernel, so all other tasks were starved out for 100.7 ms. I even saw a ... without trying particularly hard. And in theory the only limit on how high those numbers can go is that at some point some scheduling quirk or something on my laptop causes one of the clients to be slow to read/write and makes the server block; under the right conditions it could go arbitrarily high.
@glyph has some interesting comments on this optimization in Twisted, and how they removed it.
There's also an important difference between curio and twisted: in twisted, reads are only triggered by callbacks, so they always go through the event loop; it's only writes where you can even contemplate doing the eager-write optimization. So for them, this starvation problem is less likely to happen, because you can only hit it if you're doing an endless series of writes without any reads in between (and as glyph says, "/dev/urandom as a service has pretty limited utility"). In curio, we currently have both eager-read and eager-write optimizations, which makes the problem much easier to hit -- all kinds of realistic protocols are susceptible, starting with echo servers and proxies and moving on to things like an HTTP/1.1 server talking to a pipelining client.
I'm particularly worried about this because as you know I've been poking at HTTP server support for curio, and it's completely unacceptable for a fast client to totally lock up the server for unbounded amounts of time, even if it's a low-probability event :-(. (I.e., I don't think this is something where tweaking some heuristics will be enough; we need to be able to actually put a hard bound on how long a task can monopolize the kernel.) It can be worked around by manually inserting calls to curio.sleep(0) (now that that's fixed), but that's not at all a nice solution in general. The obvious solution is to make sure to always unschedule-then-reschedule on reads and writes. I can't immediately think of any cleverer (= faster) solution that is guaranteed to actually work.