[WIP] Bump allocator for rustc #38725
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
const CHUNK_ALIGN: usize = 4096;

static mut HEAP: *mut u8 = ptr::null_mut();
static mut HEAP_LEFT: usize = 0;
These variables are global and there's no implicit lock around allocation calls, so the implementation of `__rust_allocate` below is full of data races. Maybe these should be atomics or thread locals?
Oh, yep. I had implemented this using atomics the first time I wanted to do this (but got stuck getting the allocator to inject). I'll go dig that code up.
This seems like it's mostly related to the compiler, so r? @eddyb
}

let heap = HEAP.load(Ordering::SeqCst);
let heap_left = HEAP_LEFT.load(Ordering::SeqCst);
Well, this eliminates the instant UB due to data races, but it's still a logical race. Suppose a context switch happens after these two loads have been executed but before the following lines are executed, and another thread allocates. When the first thread continues, it will return the same address as the other thread did.
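For concreteness, here is a minimal sketch, not the PR's code, of how the remaining-space check and the bump could be folded into a single compare-exchange loop so that no thread can slip in between the read and the write. The names `CURSOR` and `CHUNK_SIZE` are illustrative, not identifiers from this diff:

```rust
use core::sync::atomic::{AtomicUsize, Ordering};

// A single cursor: the next free offset within the current chunk. Folding the
// capacity check and the bump into one compare-exchange closes the window
// between the two separate loads described above.
static CURSOR: AtomicUsize = AtomicUsize::new(0);
const CHUNK_SIZE: usize = 4 * 1024 * 1024; // illustrative chunk size

/// Try to bump-allocate `size` bytes with a power-of-two `align`.
/// Returns the offset of the allocation within the chunk, or `None` when the
/// chunk is exhausted and the caller must grab a fresh one.
fn bump(size: usize, align: usize) -> Option<usize> {
    let mut old = CURSOR.load(Ordering::SeqCst);
    loop {
        // Round the current offset up to the requested alignment.
        let start = (old + align - 1) & !(align - 1);
        let new = start.checked_add(size)?;
        if new > CHUNK_SIZE {
            return None;
        }
        // Publish the new offset only if nobody bumped the cursor meanwhile.
        match CURSOR.compare_exchange_weak(old, new, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(_) => return Some(start),
            Err(actual) => old = actual, // lost the race; retry with the fresh value
        }
    }
}
```

With this shape, two racing threads can never be handed the same offset; the loser of the compare-exchange simply retries.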
Makes sense. I have little experience with atomics, as you may have guessed. I would rather use a `Mutex` or `thread_local!`, but neither of those is available in core. I'll make it use a spinlock, which should hopefully be correct.
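For reference, a sketch of the kind of `core`-only spinlock being discussed, illustrative rather than the PR's code (it uses today's `core::hint::spin_loop`; at the time of this PR a plain busy loop would have served):

```rust
use core::hint::spin_loop;
use core::sync::atomic::{AtomicBool, Ordering};

// One global flag guards the bump state (HEAP and HEAP_LEFT): acquire it
// before touching the heap pointer, release it afterwards.
static LOCKED: AtomicBool = AtomicBool::new(false);

fn with_heap_lock<R>(f: impl FnOnce() -> R) -> R {
    // Spin until we flip the flag from false to true.
    while LOCKED
        .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        spin_loop();
    }
    let result = f();
    LOCKED.store(false, Ordering::Release); // unlock (not panic-safe; fine for a sketch)
    result
}
```

With the lock held for the whole allocation path, `HEAP` and `HEAP_LEFT` no longer need to be read and updated atomically on their own.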
@@ -137,6 +137,12 @@
# Whether or not jemalloc is built with its debug option set
#debug-jemalloc = false

# Whether or not the frame allocator is built and enabled in std
#use-alloc-frame = false
I don't see where this option is used. Is the doc out of date?
This is still TODO
@@ -13,6 +13,7 @@ crate-type = ["dylib", "rlib"]
alloc = { path = "../liballoc" }
alloc_jemalloc = { path = "../liballoc_jemalloc", optional = true }
alloc_system = { path = "../liballoc_system" }
alloc_frame = { path = "../liballoc_frame" }
This makes all platforms require an `alloc_frame` implementation, doesn't it? I don't have much opinion about that; it doesn't look too hard to implement. A better setup would be for this feature to be optional and off by default, so porters don't have to think about it. I don't know whether that's necessary for this PR, but it's something to consider.
Actually, hm. Platforms without libc would have a hard time implementing frame_alloc.
cc @jackpot51
Oh yes, this is meant to be optional.
@brson: Redox is using the `libc` allocator right now, so this would not be an issue.
This is a neat exercise. Thanks for submitting it.
One more data point for now; I'll have to do better benchmarks in a few days:
cc @rust-lang/compiler
pub unsafe fn allocate(size: usize, align: usize) -> *mut u8 {
    if align <= MIN_ALIGN {
        libc::malloc(size as libc::size_t) as *mut u8
Is this supposed to interoperate with the regular system allocator? If that's not the case, then why not use a plain `sbrk`?
At the very least, LLVM will be using the system allocator in the same process, won't it? Calling `sbrk` may interact badly with that.
Yes, this is probably a good idea. I'll look into using `sbrk` and `VirtualAlloc` or whatever. The current implementation is as simple as possible, just to test out the idea.
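As a hedged sketch of that direction, chunks could be requested straight from the OS with `mmap` on Unix (the Windows analogue would be `VirtualAlloc`); this assumes the `libc` crate, and `grab_chunk`/`CHUNK_SIZE` are made-up names rather than anything in this PR:

```rust
use std::ptr;

const CHUNK_SIZE: usize = 4 * 1024 * 1024; // illustrative 4 MiB chunks

/// Ask the OS for a fresh chunk; returns null on failure.
/// mmap hands back page-aligned memory, so a CHUNK_ALIGN of 4096 comes for free.
unsafe fn grab_chunk() -> *mut u8 {
    let chunk = libc::mmap(
        ptr::null_mut(),                    // let the kernel pick the address
        CHUNK_SIZE,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANON, // anonymous, not backed by a file
        -1,
        0,
    );
    if chunk == libc::MAP_FAILED {
        ptr::null_mut()
    } else {
        chunk as *mut u8
    }
}
```

Unlike `sbrk`, `mmap` doesn't fight over the program break with whatever `malloc` the rest of the process (LLVM included) is using.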
@mattico Regarding thread safety in the implementation, I think a spin lock is okay. Most parts of the compilation process aren't parallelized, and those that are mostly happen within LLVM, which won't be using this allocator IIUC. Besides, I really wouldn't be comfortable with a complicated atomics-based implementation even if nobody could poke holes in it: these things are notoriously tricky, and the failure mode (non-deterministic allocator bugs) is really terrible. Thread locals would have been ideal: since this allocator never frees memory, there are none of the usual complications, and they would neatly solve any concurrency issues. But whatever.
@rkruppe the current spinlock implementation does seem to have significant overhead. I'll post benchmarks when I get back to a computer. I agree about a more complicated atomic implementation; if one were even desired, I definitely shouldn't be the one to write it. I wish we had thread locals in core :/ Perhaps someday core and std will be modularized enough to use them without needing an allocator. I'll look into what hackery would be required to use them here, but I don't expect it'll be worthwhile.
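To illustrate why thread locals would be the tidy answer here, a sketch of a per-thread bump cursor as it might look with std's `thread_local!` (which liballoc_frame can't use today; that is exactly the limitation being lamented, and the names below are made up):

```rust
use std::cell::Cell;
use std::ptr;

const CHUNK_SIZE: usize = 4 * 1024 * 1024; // illustrative

thread_local! {
    // Each thread owns its own chunk and cursor, so the hot path needs no
    // locks and no atomics at all.
    static CHUNK: Cell<*mut u8> = Cell::new(ptr::null_mut());
    static OFFSET: Cell<usize> = Cell::new(CHUNK_SIZE); // "full" until a chunk exists
}

/// Bump-allocate `size` bytes with a power-of-two `align` from this thread's
/// chunk, returning null when a fresh chunk is needed.
fn bump(size: usize, align: usize) -> *mut u8 {
    CHUNK.with(|chunk| {
        OFFSET.with(|offset| {
            let start = (offset.get() + align - 1) & !(align - 1);
            if chunk.get().is_null() || start + size > CHUNK_SIZE {
                // A real implementation would grab a new chunk from the OS here.
                return ptr::null_mut();
            }
            offset.set(start + size);
            unsafe { chunk.get().add(start) }
        })
    })
}
```

Because the allocator never frees, there is no cross-thread deallocation to worry about, which is what usually makes per-thread arenas awkward.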
☔ The latest upstream changes (presumably #38482) made this pull request unmergeable. Please resolve the merge conflicts.
Interesting thought experiment. I'd definitely want to run a wider variety of compiles to compare the performance characteristics (also -- can we measure peak memory usage?)
Profiling is going to have to wait until next week, since performance counters don't work in WSL/VirtualBox, and I can't get MSVC to build. My current computer doesn't have an actual Linux install.
I don't have time to work on this at the moment, so I'm closing this for now just to keep the PR queue clean. I'll revisit this sometime soon.
In case anybody is wondering where this went: I tested this on a 4-core machine and the performance degraded to the point that it was a bit slower than jemalloc. I imagine that 4 threads stuck behind a single lock erased any gains we may have had. This is worth revisiting if we ever get thread locals in core.
Inspired by Dlang's allocator, I stuck a bump allocator into rustc to see what would happen. No good benchmarks yet because I need to think about how to do them. Lots of work to be done, but posting now because there should probably be discussion about whether or not this is a Good Idea. For now this will tease:
ripgrep compile:
BEFORE: 53.52s
AFTER: 44.38s
TODO: