Use mimalloc in attempt to reduce mem alloc perf. oddities #1250
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #1250 +/- ##
=======================================
Coverage 82.7% 82.7%
=======================================
Files 872 872
Lines 370361 370361
=======================================
+ Hits 306528 306547 +19
+ Misses 63833 63814 -19 |
@Lichtso It seems you tried mimalloc before, right? Did you find any potential blockers? As far as I can tell, it's better than jemalloc. As quick context, I tried rpmalloc before, but the crate is stale and it leaked after quick testing. So, now I'm betting on this newest kid. |
@alessandrod Firstly, note that I'm not opposed to reducing allocs. Obviously, it's better for perf. I'm just getting tired of jemalloc. While it's much faster than the glibc allocator with less fragmentation, it has its own peculiar perf profile. So, I'm considering replacing it as a separate effort, with backporting this to v1.17 & v1.18 in mind, in parallel with the ongoing alloc reduction efforts. I think you're opinionated about mem allocators in general. Is there any concern about using mimalloc? If so, I'd like to address it with additional investigation. As for reliability (read: no leak & no unbounded fragmentation), I'm now running a devbox against mb, which needs another day or two before I can conclude. As a starter, I confirmed this fixes solana-labs#27275: before(f180b08):
after(438018a):
blockstore-processor is ~27% faster and unified scheduler is ~36% faster according to solana-labs#35286 (comment). I think this alone is quite a good perf gain on its own. :) Also, I haven't confirmed it yet, but I'm moderately certain that this fixes the replay stall issue under system overload, like rpmalloc fixed the stall (note that rpmalloc is out of the question due to #1250 (comment)) |
Yes I tried mimalloc a few months ago and no, nothing problematic stood out to me at the time. |
Uh? Having spent a significant chunk of my time looking at profiles and reading the jemalloc source code my opinion is that it sucks 😂 I'm more than open to replacing jemalloc, I just don't think that we should do it lightly. Changing allocator impacts every subsystem in the validator, personally I think it's a pretty scary change.
Hum imo you can't draw conclusions after running a devbox for a couple of days. Jemalloc sucks, and it's really struggling with the amount and pattern of allocations we do, especially since we do them from so many threads. But at least it isn't crashing 😅 Imagine we switch allocator, and then we find that in some configurations after a while it goes OOM - I don't think it's something we can establish in a couple of days based on a single machine configuration, with a single workload, etc. IMO it requires a lot of testing, especially on nodes with a lot of stake, which are the ones struggling the most with memory.
This is amazing!
There are at least two stalls that I'm aware of, here I'm assuming you're talking about the one caused by jemalloc's time based decay. Mimalloc almost certainly gets rid of that one, but how? How does it compact/release memory? When? And from what threads? Because getting rid of that stall is also possible by configuring jemalloc. Not saying that we should do it mind, and based on the replay speedups it looks like mimalloc might have better defaults for our workload, I just think that we should understand how mimalloc works and why it's better for us before we switch to it. |
I'm just reading the mimalloc docs, and just saw this
This alone is so much better than what jemalloc does for our workflow |
It looks like mimalloc does time based purge like jemalloc |
I think I'm pretty mimalloc-pilled. I think that the key is that even if it does time based purge like jemalloc, since it has sharded freelists and works at a "mimalloc page" level, it never does gigantic (and contended) deallocations like jemalloc does, which means that even when purging after spikes, deallocation costs are amortized over a bunch of call sites and bounded in time, instead of having seconds-long spikes like we do with jemalloc. I personally would ship this to master and let it ride the trains, and see what happens. While it's in master it'll get tested on the master canaries, and we could ask Joe to test it on at least one big-stake node. |
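To make the "sharded freelists at page level" argument concrete, here is a toy model in Rust. It is only a sketch of the idea as described in this comment, not mimalloc's actual code: `Page`, `ShardedHeap`, and `purge` are hypothetical names, and real allocators deal with sizes, threads, and OS calls that are ignored here. The point it illustrates is that a free only touches the owning page's list, and purging can be capped to a bounded amount of work per call.

```rust
/// Toy "page": a fixed number of equal-size blocks plus its own free list.
/// (Hypothetical model for illustration; not mimalloc's data structure.)
struct Page {
    free: Vec<usize>, // indices of free blocks in this page only
    capacity: usize,
}

impl Page {
    fn new(capacity: usize) -> Self {
        // Reverse order so that `pop` hands out block 0 first.
        Page { free: (0..capacity).rev().collect(), capacity }
    }

    fn is_unused(&self) -> bool {
        self.free.len() == self.capacity
    }
}

/// Toy heap: one free list per page instead of one big shared free list.
struct ShardedHeap {
    pages: Vec<Option<Page>>, // `None` marks a page already released
    blocks_per_page: usize,
}

impl ShardedHeap {
    fn new(blocks_per_page: usize) -> Self {
        assert!(blocks_per_page > 0);
        ShardedHeap { pages: Vec::new(), blocks_per_page }
    }

    /// Allocates one block and returns (page index, block index).
    fn alloc(&mut self) -> (usize, usize) {
        for (i, slot) in self.pages.iter_mut().enumerate() {
            if let Some(page) = slot {
                if let Some(block) = page.free.pop() {
                    return (i, block);
                }
            }
        }
        // No live page has room: start a fresh page.
        let mut page = Page::new(self.blocks_per_page);
        let block = page.free.pop().expect("a new page has free blocks");
        self.pages.push(Some(page));
        (self.pages.len() - 1, block)
    }

    /// Freeing only pushes onto the owning page's list: O(1) work.
    fn free(&mut self, page: usize, block: usize) {
        let p = self.pages[page].as_mut().expect("page is live");
        p.free.push(block);
    }

    /// Releases at most `max_pages` fully unused pages and returns how many
    /// were released, so the cost of a single purge call stays bounded.
    fn purge(&mut self, max_pages: usize) -> usize {
        let mut released = 0;
        for slot in self.pages.iter_mut() {
            if released == max_pages {
                break;
            }
            if slot.as_ref().map_or(false, |p| p.is_unused()) {
                *slot = None;
                released += 1;
            }
        }
        released
    }
}
```

In this model, a spike that fills many pages is unwound by repeated small `purge` calls rather than one large release, which is the amortization the comment above is pointing at.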
It looks like mimalloc avoids the IPIs and TLB shootdowns jemalloc causes in calloc by... not implementing calloc. calloc in mimalloc is just malloc() + memset. I'm not sure how this scales with a high number of txs, that's a lot of pagefaults. |
thanks for the various thoughts and the quick look into mimalloc so far! seems you have a good first impression of it. lol I'll spend some time digging in further and write up my own thoughts later. |
Finally, it's time to share my thoughts after all the investigations. As of now, I'm a bit neutral on adopting mimalloc. Admittedly, I originally got hyped by the perf increase of the replay stage and mimalloc's promising design descriptions.
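For readers unfamiliar with the distinction, the two ways of getting zeroed memory can be sketched with Rust's `std::alloc` API. This is an illustration of the "malloc + memset" shape described above, not mimalloc's or jemalloc's implementation; the helper names `zeroed_via_memset`, `zeroed_via_allocator`, and `all_zero` are made up for this example.

```rust
use std::alloc::{alloc, alloc_zeroed, dealloc, Layout};
use std::ptr;

/// Zeroed allocation done the "malloc + memset" way: allocate, then write
/// zeros over every byte, which touches every page of the block up front.
///
/// Safety: `layout` must have a non-zero size.
unsafe fn zeroed_via_memset(layout: Layout) -> *mut u8 {
    unsafe {
        let p = alloc(layout);
        if !p.is_null() {
            ptr::write_bytes(p, 0, layout.size());
        }
        p
    }
}

/// Zeroed allocation delegated to the allocator (the calloc-like path).
/// The allocator is free to skip the explicit write when it knows the
/// memory is already zero.
///
/// Safety: `layout` must have a non-zero size.
unsafe fn zeroed_via_allocator(layout: Layout) -> *mut u8 {
    unsafe { alloc_zeroed(layout) }
}

/// Allocates `size` bytes with `alloc_fn`, checks that they are all zero,
/// and frees the block again.
fn all_zero(alloc_fn: unsafe fn(Layout) -> *mut u8, size: usize) -> bool {
    assert!(size > 0, "zero-sized layouts are not allowed here");
    let layout = Layout::from_size_align(size, 8).expect("valid layout");
    unsafe {
        let p = alloc_fn(layout);
        assert!(!p.is_null(), "allocation failed");
        let ok = std::slice::from_raw_parts(p, size).iter().all(|&b| b == 0);
        dealloc(p, layout);
        ok
    }
}
```

Both paths return zeroed memory; the difference that matters for the discussion is when the pages get touched, which is where the pagefault concern comes from.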
I agree. I just misused the word I wanted to see whether mimalloc would leak miserably, like rpmalloc did when used for
i confirmed this is fixed.
here's a sobering fact: the replay perf gain can be realized with jemalloc by working around this particular alloc source in the vm: #1364 Moreover, notice that jemalloc is slightly faster than mimalloc there. So, I don't think mimalloc is a clear win over jemalloc as a general-purpose mem allocator, assuming the replay stage code path exercises allocators with a variety of alloc patterns. Also, I've seen no general perf improvement over jemalloc as far as I've run an unstaked mainnet-beta node.
As for this concern, I think you self-answered like below (and this aligns with my understanding):
To add some color from me, I stumbled on this paper, which is relevant to the time-decay stall: https://arxiv.org/abs/2401.11347:
on the other hand, as for mimalloc from the same paper:
So, I think there's still some hope to mitigate the pathological perf degradation we're experiencing altogether with mimalloc.
to be clear, I don't have anything particular to say on this, as my knowledge is still limited... At least, system-wide minor pagefaults didn't particularly increase compared to jemalloc (rather, they were about half of those on the node with jemalloc).
lastly, here's the remaining topic to cover: market share and product quality. My impression is that it's not ideal but acceptable for adoption.
market share: it seems to be used at Microsoft internally, but basically that's all. Various high-profile projects (ClickHouse, rustc, Polkadot, Apache Arrow) tried it; however, it seems none officially chose it. Admittedly, the market share growth is quite slow, considering it's been 5 years since release.
product quality: as far as I've read mimalloc's code, it's maintained, yet not to the highest coding standards. Also, the technical report ( https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action/ ) is rather sparse in coverage and has typos. As for the Rust bindings, they're straightforwardly written and updated to follow upstream releases: https://crates.io/crates/mimalloc |
as a sample among our bench programs, I picked
at a quick glance, what stands out is how slow glibc is. rpmalloc, jemalloc and mimalloc are all similarly better than glibc, by ~120%. One subtle takeaway is that mimalloc's perf variance is noticeably small compared to rpmalloc and jemalloc, while maintaining the ~120% perf boost over the baseline (glibc). I think this illustrates mimalloc's 2 selling points, which make this result understandable:
and
I hope this property eases our overall system perf investigations a bit in the future... Lastly, glibc's variance is very low as well, which is kind of expected: its impl is way more naive for the multi-threaded age than the newer kids (and that's why it's slow...). Anyway, these observations are just from running a single bench. Take them with a grain of salt. |
status update: almost stale, but I'm moving this forward very slowly. Recently, I've managed to repro the jemalloc-induced pathologically slow Also, I just found another encouraging paper about mimalloc: https://retis.sssup.it/~a.biondi/papers/RTAS24.pdf |
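As a side note on how "variance" across allocator runs could be compared, here is a minimal Rust sketch that computes the mean, sample standard deviation, and coefficient of variation of repeated bench timings. The function names are hypothetical and no numbers from the bench above are reproduced; this is just one reasonable way to put a figure on run-to-run spread.

```rust
/// Returns (mean, sample standard deviation) of the given timings,
/// or `None` when there are fewer than two samples.
fn mean_and_stddev(samples: &[f64]) -> Option<(f64, f64)> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Sample variance: divide by n - 1 (Bessel's correction).
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some((mean, var.sqrt()))
}

/// Coefficient of variation (stddev / mean): a unit-free spread measure,
/// handy when allocators differ in absolute speed.
fn coefficient_of_variation(samples: &[f64]) -> Option<f64> {
    let (mean, sd) = mean_and_stddev(samples)?;
    if mean == 0.0 {
        None
    } else {
        Some(sd / mean)
    }
}
```

Feeding each allocator's per-iteration timings into `coefficient_of_variation` gives a single number per allocator, which makes a "similar mean, smaller spread" observation easier to state precisely.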
Problem
It's becoming known that jemalloc experiences perf degradation under certain workloads.
Summary of Changes
Use mimalloc, in the hope of fixing it.
I intend to land this pr rather prematurely and see whether our canaries like it or not. Will revert without hesitation if it turns out that things aren't working well.
At least, this pr fixes solana-labs#27275 without this hack: #1364. However, I'm not sure mimalloc can fix all the other known mem alloc issues around banking.
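For context on what swapping the allocator looks like mechanically in Rust, the sketch below registers a process-wide allocator with `#[global_allocator]`. To keep it runnable with only the standard library, it uses a hypothetical `CountingAlloc` wrapper around `std::alloc::System` rather than mimalloc itself; this is not the PR's actual diff. With the `mimalloc` crate, the static would be of type `mimalloc::MiMalloc` instead, as noted in the comment.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator that forwards to the system allocator and counts
/// allocation calls. It stands in for a real replacement allocator here.
struct CountingAlloc;

/// Number of `alloc` calls served so far (illustrative bookkeeping only).
static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Every heap allocation in the binary now goes through `GLOBAL`.
// With the `mimalloc` crate, the equivalent registration would be:
//   #[global_allocator]
//   static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Reads the current allocation counter.
fn alloc_calls() -> usize {
    ALLOC_CALLS.load(Ordering::Relaxed)
}
```

Because the registration is a single static, reverting an allocator change like this one is cheap at the source level, even though its runtime effects reach every subsystem.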
todo
Fixes solana-labs#27275