RFC: IO simplification #219
Conversation
While losing the ability for libraries to use either native or green IO is somewhat unfortunate, I think this is overall a positive change: There have been a few times where I have opted to just call the syscalls directly because adding support to libgreen was going to be too painful, so sidestepping the entire Rust IO stack was easier. The embedding use cases are a lot nicer here, especially if we get better lowering operations so interacting with C code via file descriptor passing becomes cleaner. Overall: Thumbs up and agreement. I haven't thought a lot about the scheduling API changes yet, but will soon and then post another comment.
> for more details.
>
> - *Task-local storage*. The current implementation of task-local storage is designed to work seamlessly across native and green threads, and its performs
performs -> performance
I certainly get the justification for having two layers of IO -- high-level platform agnostic vs low-level platform specific, but why 3? I'd love it if some of the "abstract io" machinery could live outside of libstd. Moreover, with this the only libstd facade crates are libcore, liballoc, libnative, libunicode, and libsync (iirc); do we even need libstd at all? Each of these has its own well-defined niche.
@Ericson2314 The idea with the three layers is pretty straightforward. The lowest level is essentially direct C bindings. The highest level provides safe Rust abstractions that may do a fair amount of work over the underlying system calls. The middle level bridges between the two, allowing the high level to be implemented in a cross-platform way. This design stays closest to how things are set up today, but in terms of API stabilization we will likely only mark the highest level as stable, to begin with. (That's what most Rust code is and will be built on, for now.) It may turn out that the lower levels can be stabilized later as well. I agree that moving out some of the key io abstractions in libstd is worth considering.
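As a rough sketch of how those three levels might stack (a minimal illustration only: the module names, signatures, and the Unix-only `read` binding below are my assumptions, not the layout the RFC proposes):

```rust
use std::io;

// Layer 1: direct C bindings -- essentially raw libc (Unix-only here).
mod c {
    use std::os::raw::{c_int, c_void};
    extern "C" {
        pub fn read(fd: c_int, buf: *mut c_void, count: usize) -> isize;
    }
}

// Layer 2: a thin platform-specific bridge that wraps the raw call in a
// safe, uniform signature each platform implements.
mod sys {
    use super::c;
    use std::io;

    pub fn read(fd: i32, buf: &mut [u8]) -> io::Result<usize> {
        let n = unsafe { c::read(fd, buf.as_mut_ptr() as *mut _, buf.len()) };
        if n < 0 {
            Err(io::Error::last_os_error())
        } else {
            Ok(n as usize)
        }
    }
}

// Layer 3: the high-level, cross-platform abstraction user code sees,
// written entirely against the `sys` interface.
pub struct FileDesc(pub i32);

impl io::Read for FileDesc {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        sys::read(self.0, buf)
    }
}
```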
There's been a fair amount of discussion on this RFC already, but I wanted to reiterate some key points regarding the long-term vision.
So, to summarize, the intent of the RFC is to increase flexibility and to simplify our I/O story. It is meant to open doors (to nonblocking/async I/O, and to platform-specific APIs), not close them.
FWIW, I am working on (yet another) async IO library which features 3 layers.
There are obviously downsides to this approach, but they are known (from wide usage in both .NET and also Clojure's async implementation, which is similar). This can all exist as libraries in cargo. Currently the only dependency on rust IO is `native::io::file::fd_t`. My hope is, from that point, to submit RFCs to canonize the libraries from layers 1 and 2, which will allow us to provide async and await keywords in the rust syntax and do the code generation in a preprocessing step. If it makes it into 1.0, great; if not, oh well. I have completed most of step one (epoll and the traits). I hope to have step 2 completed (as a POC) by the end of the week. Also note that this is not just a hobby project; I will be committing time to this at my day job as well. We've deemed async IO in Rust as important to our success, so our effort will be to ensure that this code meets stability and documentation standards suitable for libraries.
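As a rough illustration of what an epoll-level layer built around traits might expose (the names below are hypothetical, not taken from the library being described):

```rust
use std::io;
use std::os::unix::io::RawFd;

// Readiness-style callbacks: the poller invokes these once the fd can
// be serviced without blocking (Unix-only sketch).
pub trait Pollable {
    fn fd(&self) -> RawFd;
    fn on_readable(&mut self) -> io::Result<()>;
    fn on_writable(&mut self) -> io::Result<()>;
}

// The event loop itself: register sources, then repeatedly wait and
// dispatch to whichever sources the kernel reports as ready.
pub trait Poller {
    fn register(&mut self, source: Box<dyn Pollable>) -> io::Result<()>;
    fn poll_once(&mut self) -> io::Result<()>;
}
```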
@rrichardson Exciting! I'm stoked to see where your work goes.
@rrichardson I'm curious to see what your poll trait looks like, and how you unify epoll/kqueue, which are readiness based, with IOCP, which is completion based. I think the only way it can be done is to make the API itself completion based, like libuv has done.
@geertj @carllerche and I are working on (yet another 😉) IO library (mio). Our plan is to unify the readiness model by providing a slab buffer on Windows. "Ready to read" would mean that there is data in the slab to read, and "ready to write" would mean that there is available space in the slab to write to. In general, our goals for mio are zero allocations, an efficient implementation of multiplexing across multiple "blocking" operations, and a portable readiness model. We chose to unify around the readiness model because certain kinds of applications (high-performance proxies) simply require it to get optimal performance and zero allocations. It is also possible to build higher-level abstractions on top of the readiness model, including callback-based APIs, futures, streams, and even APIs based on shallow or stackful coroutines. The short version is that for optimal performance and minimal buffering in high-performance situations, you need the readiness model, and it doesn't preclude higher-level abstractions built on top. So why not 😄
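mio's public API has been rewritten several times since this comment, but purely to illustrate the readiness model described above, registration and dispatch in a much later mio release (0.8-era, with the "os-poll" and "net" features enabled) look roughly like this:

```rust
use mio::net::TcpListener;
use mio::{Events, Interest, Poll, Token};

fn main() -> std::io::Result<()> {
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(128);

    let addr = "127.0.0.1:0".parse().unwrap();
    let mut listener = TcpListener::bind(addr)?;
    // Ask the poller to report read readiness for this socket.
    poll.registry()
        .register(&mut listener, Token(0), Interest::READABLE)?;

    loop {
        // Block until something is ready; no per-operation allocation.
        poll.poll(&mut events, None)?;
        for event in events.iter() {
            if event.token() == Token(0) {
                // The listener is now readable: accept until it would block.
                while let Ok((_conn, peer)) = listener.accept() {
                    println!("connection from {}", peer);
                }
            }
        }
    }
}
```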
@wycats isn't a slab buffer incompatible with zero allocation? I assume that you'd use IOCP to write into the slab buffer on Windows? And so a subsequent read() would copy from the buffer? Also, I'd be curious to see how efficient flow control will be (I have no idea myself, just wondering if you frequently need to start/continue IO when the buffer gets full and what the impact of this is). Finally, were you aware of https://github.com/piscisaureus/epoll_windows? It's been moved into libuv and has been sitting there for the last 2 years as a backend for uv_poll. I am not sure how suitable it is, as it's not always available (but usually it is). Anyway, it promises an efficient, readiness-based model on Windows too, by hooking into some super low-level APIs (the same APIs I assume WSAPoll and select() use as a backend).
@geertj TBQH I am not even attempting to address completion vs readiness. This is definitely a readiness API. When I mentioned someone contributing IOCP I was being very optimistic. In cases where the number of FDs to be polled is less than approximately 500, poll is actually more efficient than epoll. My plan is to provide a poll interface which could easily match WSAPoll. But really, I haven't cared about Windows since I moved from Redmond to Manhattan 8 years ago ;) I actually think the completion model is superior to the readiness model, but who am I to judge? I just need to make this work on Linux. Interesting thought about epoll_windows, though; perhaps someone could distill that notion into a pure epoll abstraction directly over IOCP. **edit**: looking at the code, I now see that it *is* a pure epoll abstraction directly over IOCP. This seems like the best possible route to get async IO running on Windows... if someone were so inclined.
I was being a bit flippant about Windows support above, but I just want to point out that deciding not to support IOCP results in a much, much smaller and less complex codebase. About 10x less, I would guess. A poll/epoll/kqueue abstraction is dead simple; you're basically just standardizing function and enum/event names. I don't consider Windows a target for massively wide scaling IO, and therefore don't see a reason to go through the hassle. WSAPoll could handily support a few hundred sockets. If someone has a use case for something larger, I am all ears.
Zero amortized allocation 😄 You allocate the slab once for the lifetime of your application and then you're done.
It's definitely an open question. We'll see!
I was not! I'll read through the code and see how it works.
@geertj I am targeting zero allocations at runtime, and optimizing for posix over windows (though windows will be quite performant). MIO is going to be a low-level abstraction that allows building higher-level abstractions on top of it. On Windows, the buffer slab will be preallocated, and filling the slab will be handled gracefully. The user of MIO will not have to worry about these details though.
The completion model has fundamental limitations that make it very difficult to implement highest-performance streaming (for example, proxies). In particular, since writing is async and you may have multiple pending write operations, you need at least one buffer per concurrent write operation. In practice, people often end up with one buffer per pending read and one buffer per pending write. The readiness model is well-optimized for reusing stack buffers, because the reads and writes are synchronous (and non-blocking) once readiness is established. In Rust parlance, this means that the readiness model doesn't require transferring buffer ownership, while the completion model does.
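A minimal sketch of that ownership difference, with hypothetical trait signatures:

```rust
use std::io;

// Readiness model: the caller keeps ownership of the buffer. Once the
// socket is known to be ready, the read is synchronous and non-blocking,
// so one stack buffer can be reused across every operation.
trait ReadinessRead {
    fn read_ready(&mut self, buf: &mut [u8]) -> io::Result<usize>;
}

// Completion model: the operation stays outstanding after the call
// returns, so the buffer must be owned by the in-flight operation --
// at least one allocated buffer per pending read or write.
trait CompletionRead {
    fn start_read(&mut self, buf: Vec<u8>) -> io::Result<()>;
    fn on_complete(&mut self, buf: Vec<u8>, result: io::Result<usize>);
}
```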
I'm having a bit of a difficult time following the contention over the lightweight-tasks model as a concurrency primitive. The favorable view seems to be that it presents a unified concurrency API to the developer, regardless of whether they want LWTs or dedicated 1:1 Task:Thread behavior. This seems like a pretty useful unified abstraction to me, and it's a position I'm sympathetic to, as it's one of the things I really like about Erlang and Go.

The negative view of this model seems to be that it's trivially easy to block the green threads' (LWT) real underlying thread pool, potentially exhausting the pool in the absence of a pre-empting scheduler or some auto-inlined yields, and ultimately defeating the supposed purpose of the LWT model. This is also a position I'm sympathetic to, because it's trivially easy to induce this same behavior in Erlang when running code in a NIF, since the Erlang scheduler can't preempt code executing in a NIF context.

What I don't understand is why there's a conflict pushing the exclusion of one for the other. They're both essentially fine, precisely because Rust allows an escape hatch by letting end users decide which scheduler to run any particular task on. Erlang got a similar mechanism in R17 with the dirty-schedulers work: you can intentionally schedule your native code to run on a thread pool that's outside the general VM scheduler pool, so that long-running/high-throughput native execution or legacy blocking IO doesn't impede the execution and preemption of regular Erlang LWPs. Before, the only way to do this was to maintain your own long-running thread pool on the C side of the NIF boundary, with a lot of monkeying around with resource maintenance as you go back and forth across that boundary, or to abandon the NIF interface entirely and go with a C Port Driver instead (which is much more callback-y).

The fact that Rust already essentially offers this flexibility, by allowing you to choose which scheduler to run a given task on (libnative or libgreen), means that I as the end user can decide when to make a bunch of lightweight requests that I know are going to be low-latency (libgreen), or when to favor high-throughput or blocking IO calls on a dedicated, quarantined thread pool (libnative), all while consuming the exact same concurrency API: message passing between tasks. There's a whole different discussion about how concerns of libnative and libgreen leak into IO library implementation. Having not written an IO library in Rust, I'm ill-suited to discuss that in depth, but it seems like a sorta weird concern to warrant throwing out the lightweight-task baby with the implementation-complexity bathwater. Especially since you're basically just sacrificing one sort of end user (people consuming a unified concurrency API) at the altar of another sort of user (IO library makers). Am I missing something?
None of this has to do with lightweight threads. Green threads are just as heavy as native ones, since the memory slab used as the stack is by far the most dominant resource. In fact, the current implementation of green threads is heavier than native threads because of the overhead from libuv. It uses a completion-based API, which implies a lot of memory allocation behind the scenes. It also needs to spin up a thread pool for anything not covered by OS AIO APIs, like `getaddrinfo`. On Linux, it doesn't even know how to use the kernel's AIO implementation and needs the thread pool even for normal file IO, but that could be fixed. It does mean that it's pretty hard to take libuv seriously as a backend for a performance oriented language, among other issues like the total lack of support for a modern multi-threaded event loop.
Another way of phrasing this is that Rust has a crippled concurrency and I/O API. It compares very unfavourably to other languages without the limitations brought on by green threads. Every feature implemented for native threads also has to be implemented via libuv for green threads, and it's very far from being up to the task. It makes maintenance much harder and both libuv and the Rust side of the code will be an endless stream of security vulnerabilities. It's a large library written in a memory unsafe language.
Rust's green threads are slower than native threads in every vaguely real world benchmark. There is simply no demonstrated use case for the feature. Unlike Erlang, it has no pre-emption for green threads, because pre-emption is infeasible to implement for native code and avoiding pre-emption overhead is the only advantage green threads have over native ones. The compiler also doesn't take the route of inserting yield points because it would have far too much overhead. Rust is a systems language, and it's not going to make that kind of performance sacrifice.
Green tasks are anything but low-latency. The scheduler can't do pre-emption and doesn't make any attempt to implement fairness. If you need low latency, libgreen is the last thing you want to use. Resource usage also has nothing to do with it; they're not lighter.
Every I/O and concurrency method goes through a virtual function table in order to support mixed green and native threads at runtime. This adds significant overhead at runtime and also results in massive code bloat: instead of a small, statically dispatched program, every binary has to carry the machinery for both runtimes.
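In miniature, the indirection being criticized looks something like this (the trait is illustrative, not the actual runtime interface):

```rust
// With a runtime-selected backend, every IO call is dynamically
// dispatched through a trait object.
trait RtioWrite {
    fn write(&mut self, buf: &[u8]) -> Result<usize, ()>;
}

// The caller can't be monomorphized or inlined against a concrete
// backend, because native vs. green is only decided at runtime.
fn write_all(sink: &mut dyn RtioWrite, mut buf: &[u8]) -> Result<(), ()> {
    while !buf.is_empty() {
        match sink.write(buf)? { // virtual call through the vtable
            0 => return Err(()),
            n => buf = &buf[n..],
        }
    }
    Ok(())
}
```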
Yes. Many of the things I've stated in this comment are part of the RFC text already.
@pcwalton: Go's approach to the problem is pretending it doesn't exist. That's not suitable for programming large systems, where a non-deterministic issue heavily dependent on the workload would be very hard to debug and eliminate from the software. It's a bigger problem for libgreen because it's unable to migrate an ongoing I/O request to another scheduler. Blocking a scheduler in Rust implies blocking any ongoing I/O requests indefinitely, so it cannot be used for writing correct software without judiciously inserting yield checks. That could be fixed by abandoning libuv, but there seems to be no plan to do that. Go has a tendency to sweep the edge cases under the rug, since they only care about "good enough". It's designed for Google's use cases, where unpredictable latency and less-than-perfect reliability / safety are acceptable because they're able to throw lots and lots of hardware at every problem, and distributed systems make individual failures more likely and less important. It has latency issues both due to the global garbage collector and a naive scheduler without fairness / pre-emption.
@thestinger I find your lengthy comment to be well-reasoned, reasonably complete and accurate. Thanks 😄
FWIW, the silly little syntax extension from above now supports "yielding" inside loop bodies. Not sure how useful something like this would be, esp. since it only really works if you control the source of everything that might loop or call a function, but it's at least a PoC that something potentially useful could be done in a third-party syntax extension.
Most of the issues seem like a scathing indictment of the implementation and the choice of tethering that implementation to libuv (which is admittedly an odd choice FWIW). Though I'm not sure I follow the logic on green threads being slower. Of course they're slower. They ought to be. They're just supposed to have a lower resource, creation, and destruction footprint. Erlang LWPs are slower than fully occupying a scheduler too, because you can get descheduled, get thrown to the back of a work queue, have the overhead of reduction counting, and bear the work involved in sleeping/waking schedulers. I mean, there's a pretty obvious tradeoff choice to be made, one which you decide when choosing which scheduler to spawn a task on. That they're slower isn't a problem per se, but that they're not actually lightweight does beg the question of... WTF? That makes it the case that the only benefit they could even plausibly provide is lower context switching overhead, though that seems unlikely here. Having tasks, both lightweight and not-so-lightweight, be the unifying concurrency API still seems like a worthy goal. At this point, is there enough time to create a runtime for Rust that can support both properly, given that libgreen/librustuv isn't it?
How are you going to get lightweight tasks? Haskell doesn't have a contiguous call stack, so it doesn't have this problem. Erlang is probably in a similar boat, and as a managed language it can do pre-emption based on counting the approximate number of instructions executed. The performance is in a completely different ballpark than Rust's, so the same concerns don't apply. Go uses relocatable stacks, which are not possible in Rust because it has unrestricted raw pointers and safe lightweight references. Segmented stacks turned out to be a massive performance problem and were dropped. Even the stack-space checks in function preludes have a fairly high overhead on their own, and are being replaced with stack probes. I think the only acceptable solution would be static stack space analysis. It would be very difficult to implement, and it would place a lot of restrictions on coding style. This would require quite a bit of work on LLVM upstream.
I/O operations from a green task are significantly slower, because completion-based AIO is slow. It implies thread pool synchronization, memory allocation and far more system calls. It will never compete with direct usage of blocking or non-blocking IO in terms of raw performance, and it won't scale nearly as well as direct usage of non-blocking IO / AIO. There are other high-level abstractions able to preserve more of the performance / scalability. I'm not talking about anything to do with scheduling. I/O in the M:N threading model is significantly slower even if it ends up having 1 scheduler thread for every green thread.
Automatically inserted yields, segmented stacks, dynamic dispatch on blocking calls and slow TLS are never going to be sane sacrifices for a systems language to make. Green threads can still be implemented in a third party library, and without any of those sacrifices there's actually no benefit to it being integrated into the standard library. If static stack space analysis is ever implemented, it can be exposed via an intrinsic giving the worst-case stack usage of a function pointer.
@thestinger I'm curious, do you think there is a future for green threads in Rust ever? By green threads I mean a task-like primitive which is lighter-weight than a full OS thread. If so, what does that primitive look like?
I'm only interested in low-level / efficient support for non-blocking IO and AIO, along with libraries and possibly language features building abstractions over those. I don't think green threads in Rust are ever going to be comparable to a language like Erlang, and I don't see the point in offering a barely useful feature that's never going to be competitive with other languages. If it was rewritten on top of lower level primitives rather than libuv and there was stack space analysis, then it could actually offer the lightweight feature, but code written without them would still be significantly more efficient. Dealing with the strict limitations of stack space analysis would be quite painful too. Without automatically inserted yields and the dynamic dispatch system, it's just as easy to screw up with green threads as it is with a normal event loop. There's the same issue of libraries performing work without yields, either via loops / recursion, a blocking system call or page faults.
This would seem like a misconception in use only by people who've never used any lightweight process system to any depth. At some point they all have an edge case around long-blocking processes that you have to design around (recall R17's dirty schedulers). The point of such systems, to me, is the API. You can implement similar-looking things in userland with execution/work queues containing closures, scheduling the queues across a set of consumer threads that invoke the closures, which is what I gather @thestinger would prefer to see done, given the complexity of making a lower-level solution in Rust that isn't fraught with peril.
Rust's green threads are an I/O feature; they're not designed for CPU-bound work like work queues / task trees. The alternative to green threads is using non-blocking I/O directly or using a high-level abstraction like async/await or some sort of reactor pattern (Boost ASIO). CPU-bound work doesn't block on system calls, so there is no need for anything more than work queues of closures, and it's a first class citizen without any additional compiler support.
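For instance, plain native threads and a channel already cover the work-queue case with no runtime support at all (a self-contained sketch):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// CPU-bound work distributed over native threads: just a queue of
// boxed closures.
type Job = Box<dyn FnOnce() + Send>;

fn main() {
    let (tx, rx) = mpsc::channel::<Job>();
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..4)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Hold the lock only while popping a job, so workers
                // don't serialize on each other while running it.
                let job = rx.lock().unwrap().recv();
                match job {
                    Ok(job) => job(),
                    Err(_) => break, // channel closed: no more work
                }
            })
        })
        .collect();

    for i in 0..8 {
        tx.send(Box::new(move || println!("job {} done", i))).unwrap();
    }
    drop(tx); // close the queue so the workers exit
    for w in workers {
        w.join().unwrap();
    }
}
```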
Having preëmption-by-default with green threads forces you to worry about atomicity and add locks, so this isn't clearly a win: if your workload is inherently sequential, you may have to add a big lock around everything and still have infinite loops hanging your program. The typical solution to this problem is a watchdog.
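A minimal sketch of that watchdog pattern, assuming a heartbeat-based design:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

fn main() {
    // The worker bumps a heartbeat after each unit of work; a monitor
    // flags the task as hung when the heartbeat goes stale.
    let heartbeat = Arc::new(AtomicU64::new(now_secs()));

    let hb = Arc::clone(&heartbeat);
    thread::spawn(move || loop {
        // ... one bounded slice of (possibly looping) work here ...
        thread::sleep(Duration::from_millis(100));
        hb.store(now_secs(), Ordering::Relaxed);
    });

    loop {
        thread::sleep(Duration::from_secs(1));
        if now_secs().saturating_sub(heartbeat.load(Ordering::Relaxed)) > 5 {
            eprintln!("worker appears hung; escalate (restart, alert, ...)");
            break;
        }
    }
}
```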
@nathanaschbacher So just as native IO is bad for Haskell, green threads probably aren't the way to go for Rust. If you think of the choices I've mentioned as a 2^3 discrete design space, Haskell and Rust inhabit two local maxima, and other combinations are probably worse than both. Eventually, I'd love for Rust to be flexible enough to try both ends of that design space -- e.g. we will be able to write a tracing GC in a library that will plug in to compiler-generated stackmaps, and enable or disable segmented stacks. But Rust is not there yet. This RFC is no doubt a temporary setback for green threads, but working with less machinery on top of the OS IO primitives will give us more flexibility long term. It also will make writing a Rust exokernel easier, which, aside from being a personal goal of mine, will allow experiments in the IO design space that are not possible on top of Unix or Windows.
Contiguous stacks are only a good thing for throughput, which is why Rust dropped segmented stacks. The design of jemalloc isn't really a sacrifice of throughput compared to a garbage collector either. In fact, it amortizes the cost of allocation/deallocation to O(1) on average by performing incremental garbage collection on thread caches. Full static knowledge of when data can be freed isn't a bad thing, and it doesn't imply that the allocator actually has to do something beyond writing 2 words. Paying a cost for tracking ownership via reference counting is the exception in Rust, not the norm.
I guess that makes some sense to me, save that the whole point of LWPs and reduction-based preemption in Erlang is to facilitate its original design goal of making soft-realtime systems; systems which favor latency over throughput. So I think you may have your performance axes reversed. If LWTs can't actually be lightweight, then I'm begrudgingly in favor of punting libgreen to the hinterlands. I don't care as much that they can be blocked and not pre-empted, since I know I can put tasks at risk of blocking onto their own native threads. The former seems like a defensible reason to accept the RFC, the latter not as much. I'm loath to have to implement my own userland version of this functionality, but inevitably I will if I have to. :-/
Green threads shouldn't be preemptive-ish; otherwise the overhead is far too great (despite my earlier thoughts). Golang is the major example, but there are others: D's vibe.d, Python's gevent, etc.
@Ericson2314: There's no paper on the current jemalloc design that I'm aware of. There's some good documentation of the internals in the man page. There's no connection between the number of allocations and the time complexity of free. The main costs in the average case, where it's not allocating more memory from the OS or doing the lazy dirty page purging, are just constant-time ones, such as the branch for supporting valgrind, a branch for supporting a user-specified arena, etc. It can all be optimized better in the future.
Thanks for the link; the man page was an interesting read. This has little to do with the RFC, but for posterity: notice I said "free is called on every dead object". While free is called once per dead object, one trace can clean a huge number of dead objects. Assuming free is not only O(1) in the common case but also O(1) amortized, the time spent to free all dead objects at a given point is O(dead-objects), while the time spent to do the same thing with (one) tracing gc call is O(live-objects).
An update on this RFC: after the discussion on this thread and further discussion with the core team, we've decided to close this RFC and open a new one.
I will be overhauling the text and posting a new RFC in the near future, and will post a link to it once available.
Sorry, I realize I'm a bit late to the party. The flippant response to your remark is "no, it can't." :-) I've looked hard and long at Linux's native AIO interface. "When the kernel is known" doesn't mean that version sniffing is sufficient: vendor kernels are patchworks of forward-ported and back-ported changesets where the version number is essentially meaningless. You would have to perform feature detection, but that is a hard problem. The presence of AIO support is easily established, but the presence of *reliable* AIO is not nearly as simple. Summary: Linux native AIO has been carefully considered but found lacking.
A word of admonishment: you were making unwarranted assumptions. I'm not mad but it doesn't do you credit either.
That was a deliberate design decision (with added alliterative appeal). If it's an issue for you, you should bring it up on the libuv mailing list (ditto for AIO), but please check the mailing list archives first; both subjects have been discussed before.
Version detection works fine as a conservative estimate.
It has been good enough for lighttpd and nginx for years, although they still have the blocking IO backend because it's often (usually?) faster than non-blocking AIO operations. It's always going to be way faster and lighter than farming out the work to a thread pool though.
I don't think they're unwarranted assumptions. It also doesn't make use of the signalfd / timerfd family of functions. It's not the weakest link when it's being used by node.js, which has no threads and isn't a lightweight systems language, but it is the weakest link for Rust.
It's not a bad design decision, but it means it's quite broken when used in the context of M:N threading. Rust has to pin an ongoing I/O request to the scheduler it was started on, so work stealing doesn't work as expected and a request can time out if a specific scheduler is too busy / blocked.
The new RFC has been posted.
This RFC proposes a significant simplification to the I/O stack distributed with Rust. It proposes to move green threading into an external Cargo package, and instead weld `std::io` directly to the native threading model. The `std::io` module will remain completely cross-platform.

Rendered