runtime: make the page allocator scale #35112
Change https://golang.org/cl/200439 mentions this issue: |
Change https://golang.org/cl/195698 mentions this issue: |
Change https://golang.org/cl/190621 mentions this issue: |
Change https://golang.org/cl/196640 mentions this issue: |
Change https://golang.org/cl/195697 mentions this issue: |
Change https://golang.org/cl/196643 mentions this issue: |
Change https://golang.org/cl/190622 mentions this issue: |
Change https://golang.org/cl/190619 mentions this issue: |
Change https://golang.org/cl/196642 mentions this issue: |
Change https://golang.org/cl/201763 mentions this issue: |
Change https://golang.org/cl/190620 mentions this issue: |
Change https://golang.org/cl/196639 mentions this issue: |
Change https://golang.org/cl/195701 mentions this issue: |
Change https://golang.org/cl/201764 mentions this issue: |
Change https://golang.org/cl/195700 mentions this issue: |
Change https://golang.org/cl/195699 mentions this issue: |
Change https://golang.org/cl/201765 mentions this issue: |
Change https://golang.org/cl/196638 mentions this issue: |
Change https://golang.org/cl/196641 mentions this issue: |
Gopherbot isn't updating the issue with this, but the design doc is at https://go-review.googlesource.com/c/proposal/+/202857. I'll update the first post in the issue once it lands.
Change https://golang.org/cl/203318 mentions this issue: |
Change https://golang.org/cl/203858 mentions this issue: |
Change https://golang.org/cl/203859 mentions this issue: |
#31222 |
@un000 It's possible but unlikely this will help. Let's assume that the problem in that issue is fully that a long allocation blocks STW for 43 ms. Firstly, 100 MB allocations don't really get any faster with this proposal. In that issue, asking for 100 MB is definitely going to go to the OS and map pages in. No matter what we're bound by how quickly we can get 100 MB worth of committed address space from the OS. |
@un000 Well, it helps maxWait at least and brings it down to 1.0s vs. 1.3s on my machine, so perhaps I spoke too soon. But I'm hitting stalls trying to copy the traces to verify whether there's an improvement in the STW delay. I'll try again later and let you know what I found. |
This change defines a maximum supported physical and huge page size in the runtime based on the new page allocator's implementation, and uses them where appropriate. Furthermore, if the system exceeds the maximum supported huge page size, we simply ignore it silently. It also fixes a huge-page-related test which is only triggered by a condition which is definitely wrong. Finally, it adds a few TODOs related to code clean-up and supporting larger huge page sizes. Updates #35112. Fixes #35431. Change-Id: Ie4348afb6bf047cce2c1433576d1514720d8230f Reviewed-on: https://go-review.googlesource.com/c/go/+/205937 Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Keith Randall <[email protected]> Reviewed-by: Cherry Zhang <[email protected]>
mheap_.alloc currently accepts both a spanClass and a "large" parameter indicating whether the allocation is large. These are redundant, since spanClass.sizeclass() == 0 is an equivalent way to determine this and is already used in mheap_.alloc. There are no places in the runtime where the size class could be non-zero and large == true. Updates #35112. Change-Id: Ie66facf8f0faca6f4cd3d20a8ac4bc259e11823d Reviewed-on: https://go-review.googlesource.com/c/go/+/196639 Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Austin Clements <[email protected]>
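To illustrate why the "large" flag is redundant, here is a minimal sketch of a span class that packs a size class and a noscan bit into one byte. The packing layout and the `isLarge` helper are assumptions made for illustration, not a copy of the runtime's code; only the `sizeclass() == 0` check comes from the commit message above.

```go
package main

import "fmt"

// spanClass is an illustrative encoding (an assumption for this sketch):
// the size class lives in the upper 7 bits, a noscan flag in the low bit.
type spanClass uint8

func makeSpanClass(sizeclass uint8, noscan bool) spanClass {
	sc := spanClass(sizeclass << 1)
	if noscan {
		sc |= 1
	}
	return sc
}

func (sc spanClass) sizeclass() uint8 { return uint8(sc >> 1) }
func (sc spanClass) noscan() bool     { return sc&1 != 0 }

// isLarge is a hypothetical helper: size class 0 is reserved for large
// allocations, so no separate "large" parameter is needed.
func isLarge(sc spanClass) bool { return sc.sizeclass() == 0 }

func main() {
	fmt.Println(isLarge(makeSpanClass(0, true)), isLarge(makeSpanClass(5, false)))
	// prints "true false"
}
```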
This change makes it so that allocation and free related page sweeper metadata operations (e.g. pageInUse and pagesInUse) are atomic rather than protected by the heap lock. This will help in reducing the length of the critical path with the heap lock held in future changes. Updates #35112. Change-Id: Ie82bff024204dd17c4c671af63350a7a41add354 Reviewed-on: https://go-review.googlesource.com/c/go/+/196640 Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Austin Clements <[email protected]>
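As a rough sketch of what "atomic rather than protected by the heap lock" can look like, the example below keeps a one-bit-per-page bitmap and a counter, both updated with `sync/atomic` only. The `pageBitmap` type and its methods are hypothetical names for this illustration; the runtime's real metadata layout differs.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// pageBitmap is a hypothetical stand-in for per-page sweeper metadata:
// one bit per page plus a count of set bits, with no lock anywhere.
type pageBitmap struct {
	bits  []uint32
	inUse int64
}

func newPageBitmap(pages int) *pageBitmap {
	return &pageBitmap{bits: make([]uint32, (pages+31)/32)}
}

// set marks page i as in use using a compare-and-swap loop.
func (b *pageBitmap) set(i int) {
	word, mask := &b.bits[i/32], uint32(1)<<(uint(i)%32)
	for {
		old := atomic.LoadUint32(word)
		if old&mask != 0 {
			return // already set; do not double count
		}
		if atomic.CompareAndSwapUint32(word, old, old|mask) {
			atomic.AddInt64(&b.inUse, 1)
			return
		}
	}
}

// clear marks page i as free, again without taking a lock.
func (b *pageBitmap) clear(i int) {
	word, mask := &b.bits[i/32], uint32(1)<<(uint(i)%32)
	for {
		old := atomic.LoadUint32(word)
		if old&mask == 0 {
			return
		}
		if atomic.CompareAndSwapUint32(word, old, old&^mask) {
			atomic.AddInt64(&b.inUse, -1)
			return
		}
	}
}

func (b *pageBitmap) count() int64 { return atomic.LoadInt64(&b.inUse) }

func main() {
	b := newPageBitmap(256)
	var wg sync.WaitGroup
	for i := 0; i < 256; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			b.set(i) // concurrent updates to shared words
		}(i)
	}
	wg.Wait()
	fmt.Println(b.count()) // prints 256
}
```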
Currently gcSweepBuf guarantees that push operations may be performed concurrently with each other and that block operations may be performed concurrently with push operations as well. Unfortunately, this isn't quite true. The existing code allows push operations to happen concurrently with each other, but block operations may return blocks with nil entries. The way this can happen is if two concurrent pushers grab a slot to push to, and the first one (the one with the earlier slot in the buffer) doesn't quite write a span value when the block is called. The existing code in block only checks if the very last value in the block is nil, when really an arbitrary number of the last few values in the block may or may not be nil. Today, this case can't actually happen because when push operations happen concurrently during a GC (which is the only time block is called), they only ever happen during an allocation with the heap lock held, effectively serializing them. A block operation may happen concurrently with one of these pushes, but its callers will never see a nil mspan. Outside of a GC, this isn't a problem because although push operations from allocations can run concurrently with push operations from sweeping, block operations will never run. In essence, the real concurrency guarantees provided by gcSweepBuf are that block operations may happen concurrently with push operations, but that push operations may not be concurrent with each other if there are any block operations. To fix this, and to prepare for push operations happening without the heap lock held in a future CL, we update the documentation for block to correctly state that there may be nil entries in the returned slice. While we're here, make the mspan writes into the buffer atomic to avoid a block user racing on a nil check, and document that the user should load mspan values from the returned slice atomically. Finally, we make all callers of block adhere to the new rules. 
We choose to allow nil values rather than filter them out because the only caller of block is markrootSpans, and if it catches a nil entry, then there wasn't anything to mark in there anyway since the span is just being created. Updates #35112. Change-Id: I6450aab15f51690d7a000ba5b3d529cf2ca5da1e Reviewed-on: https://go-review.googlesource.com/c/go/+/203318 Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Austin Clements <[email protected]>
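The race described above can be sketched with a small buffer in which a pusher first reserves a slot and only later publishes its span. The `sweepBuf`, `reserve`, `publish`, and `markAll` names are hypothetical and the buffer is fixed-size for brevity; the point is only that a reader of `block` must load entries atomically and tolerate nil. It assumes Go 1.19 or later for `atomic.Pointer`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type span struct{ id int }

// sweepBuf is a toy fixed-capacity buffer (an assumption for this sketch).
type sweepBuf struct {
	next  uint32
	slots [8]atomic.Pointer[span]
}

// reserve claims the next slot index. Until publish runs, the slot is nil.
func (b *sweepBuf) reserve() uint32 { return atomic.AddUint32(&b.next, 1) - 1 }

// publish atomically stores the span into a previously reserved slot.
func (b *sweepBuf) publish(i uint32, s *span) { b.slots[i].Store(s) }

// push is reserve followed by publish; another pusher may run in between.
func (b *sweepBuf) push(s *span) { b.publish(b.reserve(), s) }

// block returns every reserved slot. Entries may still be nil.
func (b *sweepBuf) block() []atomic.Pointer[span] {
	return b.slots[:atomic.LoadUint32(&b.next)]
}

// markAll plays the role of the single caller: it loads each entry
// atomically and skips nil, since an unpublished span has nothing to mark.
func markAll(b *sweepBuf) (marked int) {
	blk := b.block()
	for i := range blk {
		if s := blk[i].Load(); s != nil {
			marked++
		}
	}
	return marked
}

func main() {
	b := &sweepBuf{}
	first := b.reserve()       // pusher A grabs slot 0 but has not written yet
	b.push(&span{id: 2})       // pusher B grabs slot 1 and publishes
	fmt.Println(markAll(b))    // prints 1: slot 0 is still nil
	b.publish(first, &span{1}) // pusher A finishes
	fmt.Println(markAll(b))    // prints 2
}
```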
This change combines the functionality of allocSpanLocked, allocManual, and alloc_m into a new method called allocSpan. While these methods' abstraction boundaries are OK when the heap lock is held throughout, they start to break down when we want finer-grained locking in the page allocator. allocSpan does just that, and only locks the heap when it absolutely has to. Piggy-backing off of work in previous CLs to make more of span initialization lockless, this change makes span initialization entirely lockless as part of the reorganization. Ultimately this change will enable us to add a lockless fast path to allocSpan. Updates #35112. Change-Id: I99875939d75fb4e958a67ac99e4a7cda44f06864 Reviewed-on: https://go-review.googlesource.com/c/go/+/196641 Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Austin Clements <[email protected]>
This change adds a per-p mspan object cache similar to the sudog cache. Unfortunately this cache can't quite operate like the sudog cache, since it is used in contexts where write barriers are disallowed (i.e. allocation codepaths), so rather than managing an array and a slice, it's just an array and a length. A little bit more unsafe, but avoids any write barriers. The purpose of this change is to reduce the number of operations which require the heap lock in allocation, paving the way for a lockless fast path. Updates #35112. Change-Id: I32cfdcd8528fb7be985640e4f3a13cb98ffb7865 Reviewed-on: https://go-review.googlesource.com/c/go/+/196642 Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Austin Clements <[email protected]>
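A minimal sketch of the "array and a length" layout follows. The capacity, the refill-half policy, and the `central` type are assumptions chosen for illustration. Ordinary Go code like this still gets write barriers from the compiler, so it shows only the data layout and how batching reduces lock acquisitions, not the barrier-free property.

```go
package main

import (
	"fmt"
	"sync"
)

type mspan struct{ npages int }

// central stands in for the lock-protected allocator of span objects.
type central struct {
	mu    sync.Mutex
	locks int // number of times the lock was taken, for demonstration
}

// allocBatch fills dst with fresh spans under a single lock acquisition.
func (c *central) allocBatch(dst []*mspan) {
	c.mu.Lock()
	c.locks++
	for i := range dst {
		dst[i] = new(mspan)
	}
	c.mu.Unlock()
}

// spanCache is a per-P style cache: a fixed array plus a length,
// with no slice header to maintain. Capacity 8 is an arbitrary choice.
type spanCache struct {
	buf [8]*mspan
	n   int
}

// get pops a cached span, refilling half the array when empty.
func (sc *spanCache) get(c *central) *mspan {
	if sc.n == 0 {
		half := len(sc.buf) / 2
		c.allocBatch(sc.buf[:half])
		sc.n = half
	}
	sc.n--
	s := sc.buf[sc.n]
	sc.buf[sc.n] = nil
	return s
}

// put returns a span to the cache, reporting false if the cache is full.
func (sc *spanCache) put(s *mspan) bool {
	if sc.n == len(sc.buf) {
		return false
	}
	sc.buf[sc.n] = s
	sc.n++
	return true
}

func main() {
	var c central
	var sc spanCache
	for i := 0; i < 16; i++ {
		_ = sc.get(&c)
	}
	fmt.Println(c.locks) // prints 4: one lock per batch of 4, not per span
}
```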
This change adds a page cache structure which owns a chunk of free pages at a given base address. It also adds code to allocate to this cache from the page allocator. Finally, it adds tests for both. Notably this change does not yet integrate the code into the runtime, just into runtime tests. Updates #35112. Change-Id: Ibe121498d5c3be40390fab58a3816295601670df Reviewed-on: https://go-review.googlesource.com/c/go/+/196643 Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Austin Clements <[email protected]>
This change adds a per-p free page cache which the page allocator may allocate out of without a lock. The change also introduces a completely lockless page allocator fast path. Although the cache contains at most 64 pages (and usually less), the vast majority (85%+) of page allocations are exactly 1 page in size. Updates #35112. Change-Id: I170bf0a9375873e7e3230845eb1df7e5cf741b78 Reviewed-on: https://go-review.googlesource.com/c/go/+/195701 Run-TryBot: Michael Knyszek <[email protected]> Reviewed-by: Austin Clements <[email protected]>
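The idea of a cache that owns up to 64 free pages at a base address can be sketched with a single 64-bit bitmap. The field names, the `alloc1`/`allocN` methods, and the 8 KiB page size are assumptions for this example rather than the runtime's definitions. Because the cache is owned by one P, these operations need no lock or atomics.

```go
package main

import (
	"fmt"
	"math/bits"
)

const pageSize = 8192 // assumed page size for this sketch

// pageCache owns up to 64 pages starting at base.
// Bit i of free is set when page i is available.
type pageCache struct {
	base uintptr
	free uint64
}

// alloc1 hands out the lowest free page, the common single-page case.
func (c *pageCache) alloc1() (uintptr, bool) {
	if c.free == 0 {
		return 0, false
	}
	i := uint(bits.TrailingZeros64(c.free))
	c.free &^= uint64(1) << i
	return c.base + uintptr(i)*pageSize, true
}

// allocN finds the lowest run of n contiguous free pages by linear scan.
func (c *pageCache) allocN(n uint) (uintptr, bool) {
	if n == 0 || n > 64 {
		return 0, false
	}
	mask := uint64(1)<<n - 1 // n low bits set (all ones when n == 64)
	for i := uint(0); i+n <= 64; i++ {
		if (c.free>>i)&mask == mask {
			c.free &^= mask << i
			return c.base + uintptr(i)*pageSize, true
		}
	}
	return 0, false
}

func main() {
	// 0xE6 is binary 1110_0110: pages 1, 2, 5, 6 and 7 are free.
	c := pageCache{base: 0x100000, free: 0xE6}
	addr, ok := c.alloc1()
	fmt.Printf("%#x %v\n", addr, ok) // prints "0x102000 true"
	addr, ok = c.allocN(3)
	fmt.Printf("%#x %v\n", addr, ok) // prints "0x10a000 true"
	_, ok = c.allocN(2)
	fmt.Println(ok) // prints false: only page 2 remains
}
```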
Change https://golang.org/cl/206199 mentions this issue: |
CL 201765 activated calls from the runtime to functions in math/bits. When coverage and race detection were simultaneously enabled, this caused a crash when the covered+race-checked code in math/bits was called from the runtime before there was even a P. PS Win for gdlv in helping sort this out. TODO - next CL intrinsifies the new functions in runtime/internal/sys TODO/Would-be-nice - Ctz64 and TrailingZeros64 are the same function; 386.s is intrinsified; clean all that up. Fixes #35461. Updates #35112. Change-Id: I750a54dba493130ad3e68a06530ede7687d41e1d Reviewed-on: https://go-review.googlesource.com/c/go/+/206199 Reviewed-by: Michael Knyszek <[email protected]> Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gobot Gobot <[email protected]>
Change https://golang.org/cl/206277 mentions this issue: |
Change https://golang.org/cl/206200 mentions this issue: |
This restores intrinsic status to functions copied from math/bits into runtime/internal/sys, as an aid to runtime performance. Updates #35112. Change-Id: I41a7d87cf00f1e64d82aa95c5b1000bc128de820 Reviewed-on: https://go-review.googlesource.com/c/go/+/206200 Run-TryBot: David Chase <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Cherry Zhang <[email protected]>
This change makes the test addresses start at 1 GiB instead of 2 GiB to support mips and mipsle, which only have 31-bit address spaces. It also changes some tests to use smaller offsets for the chunk index to avoid jumping too far ahead in the address space to support 31-bit address spaces. The tests don't require such large jumps for what they're testing anyway. Updates #35112. Fixes #35440. Change-Id: Ic68ff2b0a1f10ef37ac00d4bb5b910ddcdc76f2e Reviewed-on: https://go-review.googlesource.com/c/go/+/205938 Run-TryBot: Brad Fitzpatrick <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Keith Randall <[email protected]> Reviewed-by: Cherry Zhang <[email protected]>
Before this CL, if max > min and max was unaligned to min, then the function could return an unaligned (unaligned to min) region to scavenge. On most platforms, this leads to some kind of crash. Fix this by explicitly aligning max to the next multiple of min. Fixes #35445. Updates #35112. Change-Id: I0af42d4a307b48a97e47ed152c619d77b0298291 Reviewed-on: https://go-review.googlesource.com/c/go/+/206277 Reviewed-by: Ian Lance Taylor <[email protected]>
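The fix amounts to rounding max up to a multiple of min. A small sketch, assuming min is a power of two (as page sizes are); `alignUp` and `clampMax` here are hypothetical helpers for illustration:

```go
package main

import "fmt"

// alignUp rounds n up to the next multiple of a.
// a must be a power of two for the mask trick to be valid.
func alignUp(n, a uintptr) uintptr {
	return (n + a - 1) &^ (a - 1)
}

// clampMax mirrors the described fix: when max exceeds min but is not
// a multiple of it, round max up so any region sized from it stays aligned.
func clampMax(min, max uintptr) uintptr {
	if max > min {
		return alignUp(max, min)
	}
	return max
}

func main() {
	fmt.Println(clampMax(4096, 10000)) // prints 12288
	fmt.Println(clampMax(4096, 8192))  // prints 8192 (already aligned)
}
```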
Change https://golang.org/cl/206978 mentions this issue: |
In scavengeOne's fast path, we currently don't check the summary for the chunk that scavAddr points to, which means that we might accidentally scavenge unused address space if the previous scavenge moves the scavAddr into that space. The result of this today is a crash. This change makes it so that scavengeOne's fast path only happens after the check, following the comment in mpagealloc.go. It also adds a test for this case. Fixes #35465. Updates #35112. Change-Id: I861d44ee75e42a0e1f5aaec243bc449228273903 Reviewed-on: https://go-review.googlesource.com/c/go/+/206978 Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Cherry Zhang <[email protected]>
@mknyszek Anything else to do for this issue? |
Nope! Closing. |
Change https://golang.org/cl/219121 mentions this issue: |
Over the course of the last year, we've seen several cases where making relatively minor changes to the allocator slow path, which allocates pages from the heap, caused serious performance issues (#32828, #31678). The problem stemmed largely from contention on the heap lock: a central m-based spinlock (so we can't schedule another g while it's waiting!) which guards nearly all operations on the page heap. Since Go 1.11 (and likely earlier), we've observed barging behavior on this lock in applications which frequently allocate larger objects (~1K), which indicates a collapsing lock. Furthermore, these applications tend to stay the same or worsen in performance as the number of available cores on the machine increases. This represents a fundamental scalability bottleneck in the runtime.
Currently the page allocator is based on a treap (and formerly had a small linked-list based cache for smaller free spans), but I propose we rethink it and rewrite to something that:
The former just makes the page allocator faster and less likely to fall into bad paths in the microarchitecture, while the latter directly reduces contention on the lock.
While increased performance across the board is what we want, what we're most interested in solving here is the scalability bottleneck: when we increase the number of cores available to a Go application, we want to see a performance improvement.
Edit: Design doc
I've already built both an out-of-tree and an in-tree prototype of a solution which is based on a bitmap representing the whole heap and a radix tree over that bitmap. The key idea with bitmaps here is that we may "land grab" several pages at once and cache them in a P. Two RPC benchmarks, a Google-internal one and one based on Tile38, show up to a 30% increase in throughput and up to a 40% drop in tail latencies (90th, 99th percentile) on a 48-core machine, with the benefit increasing with the number of cores available to the benchmark.