
Use mimalloc in attempt to reduce mem alloc perf. oddities #1250

Open · wants to merge 4 commits into master

Conversation


@ryoqun ryoqun commented May 8, 2024

Problem

It's becoming known that jemalloc experiences perf degradation under certain workloads.

Summary of Changes

Use mimalloc, in the hope of fixing it.

I'm intending to land this PR rather prematurely and see whether our canaries like it or not. I'll revert without hesitation if it turns out things aren't working well.

At least, this PR fixes solana-labs#27275 without this hack: #1364. However, I'm not sure mimalloc can fix all the other known mem alloc issues around banking.
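
For reference, switching the global allocator in a Rust binary is a small change along these lines (a sketch based on the mimalloc crate's documented usage; the actual diff in this PR may differ):

use mimalloc::MiMalloc;

// Route all heap allocations in this binary through mimalloc instead of
// the default (or jemalloc) allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;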

todo

  • Remove jemalloc deps

Fixes solana-labs#27275


codecov-commenter commented May 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.7%. Comparing base (da02962) to head (2f78e18).
Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1250   +/-   ##
=======================================
  Coverage    82.7%    82.7%           
=======================================
  Files         872      872           
  Lines      370361   370361           
=======================================
+ Hits       306528   306547   +19     
+ Misses      63833    63814   -19     


ryoqun commented May 10, 2024

@Lichtso It seems you tried mimalloc before, right? Did you find any potential blockers? As far as I can tell, it's better than jemalloc.

As quick context, I tried rpmalloc before, but the crate is stale and it leaked in quick testing. So now I'm betting on this newer kid.


ryoqun commented May 10, 2024

@alessandrod Firstly, note that I'm not opposed to reducing allocs. Obviously, it's better for perf.

I'm just getting tired of jemalloc. While it's well faster than the glibc allocator, with less fragmentation, it has its own peculiar perf profile. So I'm considering replacing it as a separate effort, with backporting to v1.17 & v1.18 in mind, in parallel to the ongoing alloc-reduction efforts. I think you're opinionated about mem allocators in general. Are there any concerns with using mimalloc? If so, I'd like to address them with additional investigation.

As for reliability (read: no leak & no unbounded fragmentation), I'm now running a devbox against mb; it needs another day or two before a conclusion.

As a starter, I confirmed this fixes solana-labs#27275:

before(f180b08):

# --block-verification-method=blockstore-processor
ledger processed in 31 seconds, 473 ms
ledger processed in 31 seconds, 164 ms
ledger processed in 31 seconds, 335 ms

# --block-verification-method=unified-scheduler
ledger processed in 16 seconds, 363 ms
ledger processed in 16 seconds, 601 ms
ledger processed in 16 seconds, 512 ms

after(438018a):

# --block-verification-method=blockstore-processor
ledger processed in 24 seconds, 665 ms
ledger processed in 24 seconds, 848 ms
ledger processed in 24 seconds, 623 ms

# --block-verification-method=unified-scheduler
ledger processed in 12 seconds, 165 ms
ledger processed in 12 seconds, 78 ms
ledger processed in 12 seconds, 117 ms

blockstore-processor is ~27% faster and unified scheduler is ~36% faster according to solana-labs#35286 (comment). I think this alone is quite a good perf gain on its own. :)

Also, I haven't confirmed it yet, but I'm moderately certain that this fixes the replay stall issue under system overload, like rpmalloc fixed the stall (note that rpmalloc is out of the question due to #1250 (comment)).


Lichtso commented May 10, 2024

Yes, I tried mimalloc a few months ago, and no, nothing problematic stood out to me at the time.


alessandrod commented May 10, 2024

@alessandrod Firstly, note that I'm not opposed to reducing allocs. Obviously, it's better for perf.

I'm just getting tired of jemalloc. While it's well faster than the glibc allocator, with less fragmentation, it has its own peculiar perf profile. So I'm considering replacing it as a separate effort, with backporting to v1.17 & v1.18 in mind, in parallel to the ongoing alloc-reduction efforts. I think you're opinionated about mem allocators in general. Are there any concerns with using mimalloc? If so, I'd like to address them with additional investigation.

Uh? Having spent a significant chunk of my time looking at profiles and reading the jemalloc source code, my opinion is that it sucks 😂 I'm more than open to replacing jemalloc, I just don't think that we should do it lightly. Changing allocators impacts every subsystem in the validator; personally, I think it's a pretty scary change.

As for reliability (read: no leak & no unbounded fragmentation), I'm now running a devbox against mb; it needs another day or two before a conclusion.

Hum imo you can't draw conclusions after running a devbox for a couple of days. Jemalloc sucks, and it's really struggling with the amount and pattern of allocations we do, especially since we do them from so many threads. But at least it isn't crashing 😅 Imagine we switch allocator, and then we find that in some configurations after a while it goes OOM - I don't think it's something we can establish in a couple of days based on a single machine configuration, with a single workload, etc. IMO it requires a lot of testing, especially on nodes with a lot of stake, which are the ones struggling the most with memory.

As a starter, I confirmed this fixes solana-labs#27275:

before:
# --block-verification-method=blockstore-processor
ledger processed in 31 seconds, 473 ms
ledger processed in 31 seconds, 164 ms
ledger processed in 31 seconds, 335 ms

# --block-verification-method=unified-scheduler
ledger processed in 16 seconds, 363 ms
ledger processed in 16 seconds, 601 ms
ledger processed in 16 seconds, 512 ms

after:
# --block-verification-method=blockstore-processor
ledger processed in 24 seconds, 665 ms
ledger processed in 24 seconds, 848 ms
ledger processed in 24 seconds, 623 ms

# --block-verification-method=unified-scheduler
ledger processed in 12 seconds, 165 ms
ledger processed in 12 seconds, 78 ms
ledger processed in 12 seconds, 117 ms

blockstore-processor is ~27% faster and unified scheduler is ~36% faster according to solana-labs#35286 (comment). I think this alone is quite a good perf gain on its own. :)

This is amazing!

Also, I haven't confirmed it yet, but I'm moderately certain that this fixes the replay stall issue under system overload, like rpmalloc fixed the stall (note that rpmalloc is out of the question due to #1250 (comment)).

There are at least two stalls that I'm aware of; here I'm assuming you're talking about the one caused by jemalloc's time-based decay. Mimalloc almost certainly gets rid of that one, but how? How does it compact/release memory? When? And from what threads? Because getting rid of that stall is also possible by configuring jemalloc. Not saying that we should do that, mind you, and based on the replay speedups it looks like mimalloc might have better defaults for our workload, I just think that we should understand how mimalloc works and why it's better for us before we switch to it.


alessandrod commented May 10, 2024

I'm just reading the mimalloc docs and just saw this:

free list multi-sharding: the big idea! Not only do we shard the free list per mimalloc page, but for each page we have multiple free lists. In particular, there is one list for thread-local free operations, and another one for concurrent free operations. Free-ing from another thread can now be a single CAS without needing sophisticated coordination between threads. Since there will be thousands of separate free lists, contention is naturally distributed over the heap, and the chance of contending on a single location will be low – this is quite similar to randomized algorithms like skip lists where adding a random oracle removes the need for a more complex algorithm.

This alone is so much better than what jemalloc does for our workload.
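
To illustrate the idea in code (a conceptual sketch only, not mimalloc's actual implementation): each page keeps a plain free list for its owning thread plus an atomic list that remote threads push onto with a single CAS, so cross-thread frees never take a lock.

use std::sync::atomic::{AtomicPtr, Ordering};

struct Block {
    next: *mut Block,
}

struct Page {
    local_free: *mut Block,        // touched only by the owning thread
    thread_free: AtomicPtr<Block>, // lock-free stack for frees from other threads
}

impl Page {
    // Owning thread: no synchronization needed at all.
    fn free_local(&mut self, block: *mut Block) {
        unsafe { (*block).next = self.local_free };
        self.local_free = block;
    }

    // Any other thread: a single CAS (retried on contention) pushes the block
    // onto this page's remote free list.
    fn free_remote(&self, block: *mut Block) {
        let mut head = self.thread_free.load(Ordering::Acquire);
        loop {
            unsafe { (*block).next = head };
            match self.thread_free.compare_exchange_weak(
                head, block, Ordering::AcqRel, Ordering::Acquire,
            ) {
                Ok(_) => return,
                Err(actual) => head = actual,
            }
        }
    }
}

Since there are thousands of pages (and thus thousands of these lists), contention on any single thread_free list stays low.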

@alessandrod

Mimalloc almost certainly gets rid of that one, but how?

It looks like mimalloc does time based purge like jemalloc

@alessandrod

It looks like mimalloc does time based purge like jemalloc

I think I'm pretty mimalloc-pilled. I think that the key is that even if it does time-based purge like jemalloc, since it has sharded freelists and works at a "mimalloc page" level, it never does gigantic (and contended) deallocations like jemalloc does, which means that even when purging after spikes, deallocation costs are amortized over a bunch of call sites and bounded in time, instead of having seconds-long spikes like we do with jemalloc.

I personally would ship this to master and let it ride the trains, and see what happens. While it's in master it'll get tested on the master canaries, and we could ask Joe to test it on at least one big-stake node.

@alessandrod

It looks like mimalloc avoids the IPIs and TLB shootdowns jemalloc causes in calloc by... not implementing calloc. calloc in mimalloc is just malloc() + memset. I'm not sure how this scales with a high number of txs; that's a lot of page faults.
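
In other words (a hedged sketch mirroring the default GlobalAlloc::alloc_zeroed fallback in Rust's std, not mimalloc's actual source), "calloc = malloc + memset" means zeroing touches every page of the allocation up front instead of relying on already-zeroed pages from the OS:

use std::alloc::{GlobalAlloc, Layout, System};

unsafe fn alloc_zeroed_via_memset(layout: Layout) -> *mut u8 {
    let ptr = System.alloc(layout);
    if !ptr.is_null() {
        // Writing the zeroes faults in every page of the allocation immediately,
        // rather than deferring the fault to first use of an OS-zeroed page.
        std::ptr::write_bytes(ptr, 0, layout.size());
    }
    ptr
}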


ryoqun commented May 11, 2024

Thanks for the various thoughts and the quick look into mimalloc so far! Seems you have a good first impression of it. lol

I'll spend some time digging in further and write up my own thoughts later.


ryoqun commented May 17, 2024

Finally, it's time to share my thoughts after all the investigations.

As of now, I'm a bit neutral on adopting mimalloc. Admittedly, I originally got hyped by the perf increase in the replay stage and mimalloc's promising design descriptions.

As for reliability (read: no leak & no unbounded fragmentation), I'm now running a devbox against mb; it needs another day or two before a conclusion.

Hum imo you can't draw conclusions after running a devbox for a couple of days. ... But at least it isn't crashing 😅 Imagine we switch allocator, and then we find that in some configurations after a while it goes OOM - I don't think it's something we can establish in a couple of days based on a single machine configuration ...

I agree. I just misused the word conclusion... Changing allocators is quite dangerous.

I wanted to see whether mimalloc would leak miserably, like rpmalloc did when used for agave-validator. After almost 10 days, I think mimalloc doesn't leak/crash at the very least (note that the slight mem increase after 2-3 days is a known issue, which occurs even with jemalloc):

[image: memory usage graph from the ~10-day devbox run]

Also, I haven't confirmed yet but I'm moderately certain that this fixes the replay stall issue under system overload

I confirmed this is fixed.

blockstore-processor is ~27% faster and unified scheduler is ~36% faster according to solana-labs#35286 (comment). I think this alone is quite a good perf gain on its own. :)

This is amazing!

Here's a toned-down fact: the replay perf gain can be realized with jemalloc by working around this particular alloc source in the vm: #1364. Moreover, notice that jemalloc is slightly faster than mimalloc there. So I don't think mimalloc is a clear win over jemalloc as a general-purpose mem allocator, assuming the replay-stage code path has exercised the allocators with a decent variety of alloc patterns.

Also, I've seen no general perf improvement over jemalloc for as long as I've run an unstaked mainnet-beta node.

Mimalloc almost certainly gets rid of that one, but how? How does it compact/release memory? When? And from what threads? Because getting rid of that stall is also possible by configuring jemalloc. Not saying that we should do that, mind you, and based on the replay speedups it looks like mimalloc might have better defaults for our workload, I just think that we should understand how mimalloc works and why it's better for us before we switch to it.

As for this concern, I think you answered it yourself below (and this aligns with my understanding):

It looks like mimalloc does time based purge like jemalloc

... I think that the key is that even if it does time-based purge like jemalloc, since it has sharded freelists and works at a "mimalloc page" level, it never does gigantic (and contended) deallocations like jemalloc does, which means that even when purging after spikes, deallocation costs are amortized over a bunch of call sites and bounded in time, instead of having seconds-long spikes like we do with jemalloc.

To add some color from my side, I stumbled on this paper, which is relevant to the time-decay stall: https://arxiv.org/abs/2401.11347:

algorithms ... that free objects in large batches circumvent a key optimization in JEmalloc. This optimization is intended to avoid the overhead of returning an object to a remote thread that allocated it (i.e., its owner), by instead placing the object in a local buffer that the local thread can subsequently allocate from. Every object allocated locally from this buffer is an object that does not need to be freed remotely, back to its owner. Freeing a large batch of objects can overflow this buffer, triggering an extremely high latency free call (on the order of tens of milliseconds or more) in which many objects are removed from this buffer and freed remotely to their respective owners, incurring extremely high lock contention in the process.
...
Further investigation using Linux Perftools led to the realization that poor performance in JEmalloc, such as when running on four sockets, is usually accompanied by a large fraction of the total cycle count being spent in function called je_tcache_bin_flush_small. Table 1 summarizes perf results to support the following discussion, and also quantifies how the total number of epochs changes as the thread count increases. These results confirm that the cost of freeing objects becomes prohibitive at high thread counts, preventing the data structure from scaling. According to the source code for this version (5.0.1-25) of JEmalloc, when a thread invokes free, it places the freed object in a thread local buffer, and then checks whether the buffer is filled beyond a given threshold. If so, it takes a large number of objects from that buffer (approximately 3/4 of the buffer), and for each object, does the following. First, it identifies which bin the object belongs to. If the object was originally allocated by a different thread, this bin might reside on a remote core, or even a remote socket. The thread locks the bin, then iterates over all objects in its buffer (while holding the lock), and for each object that belongs to this bin, it performs the necessary bookkeeping to free the object to that bin.
...
In summary, it is extremely expensive in JEmalloc for free to return objects to the remote threads that allocated them. To avoid this overhead, a thread frees to a local buffer, and subsequently allocates from that buffer. Every object allocated from that buffer is an object that does not need to be freed to a remote thread in future. Freeing a large batch overflows the buffer, forcing all objects to be freed remotely, causing extremely high lock contention.
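
(To paraphrase the mechanism the paper describes, here is a conceptual sketch, not jemalloc's actual code: frees go into a thread-local buffer first, and once the buffer crosses a threshold, roughly 3/4 of it is flushed, with each flushed object returned to its owning bin under that bin's lock.)

use std::sync::Mutex;

struct Bin {
    free_objects: Vec<usize>, // objects returned to the (possibly remote) owner
}

struct ThreadCache<'a> {
    buffer: Vec<(usize, &'a Mutex<Bin>)>, // (object, its owning bin)
    threshold: usize,
}

impl<'a> ThreadCache<'a> {
    fn free(&mut self, obj: usize, owner: &'a Mutex<Bin>) {
        self.buffer.push((obj, owner));
        if self.buffer.len() > self.threshold {
            // Flushing takes the owner bin's lock for each flushed object;
            // large batch frees overflow the buffer and trigger exactly this
            // contended, high-latency path.
            let flush_count = self.buffer.len() * 3 / 4;
            for (obj, owner) in self.buffer.drain(..flush_count) {
                owner.lock().unwrap().free_objects.push(obj);
            }
        }
    }
}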

On the other hand, as for mimalloc, from the same paper:

MImalloc Sidesteps the Problem Altogether. MImalloc, on the other hand, is essentially immune to the problem we describe above by design. In MImalloc, a remote free synchronizes on a particular page’s free list. ... This makes it relatively inexpensive to immediately free an individual object to a remote thread in MImalloc, as doing so will cause contention only if another thread is simultaneously freeing another object that was allocated from the same page.
...
MImalloc is quite unique in its approach. To our knowledge, no other allocator maintains per-page free lists.

So I think there's still some hope that mimalloc can mitigate the pathological perf degradation we're experiencing altogether.

It looks like mimalloc avoids the IPIs and TLB shootdowns jemalloc causes in calloc by... not implementing calloc. calloc in mimalloc is just malloc() + memset. I'm not sure how this scales with high number of txs, that's a lot of pagefaults.

To be clear, I don't have much to say on this, as my knowledge is still limited... At least, system-wide minor page faults aren't particularly increased compared to jemalloc (rather, they were about half those of a node with jemalloc).

Lastly, here are the remaining topics to cover: market share and product quality. My impression is that it's not ideal but acceptable for adoption.

market share:

It seems it's used internally at Microsoft, but that's basically all. Various high-profile projects (ClickHouse, rustc, Polkadot, Apache Arrow) have tried it; however, none officially chose it, it seems. Admittedly, the market-share growth is quite slow, considering it's been 5 years since its release.

product quality:

As far as I've read mimalloc's code, it's maintained, yet not to the highest coding standards. Also, the coverage of the technical report (https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action/) is rather sparse, with typos. As for the Rust bindings, they're straightforwardly written and updated to follow upstream releases: https://crates.io/crates/mimalloc

@ryoqun ryoqun marked this pull request as ready for review May 17, 2024 07:23
@ryoqun ryoqun requested a review from alessandrod May 17, 2024 07:30

ryoqun commented May 17, 2024

As a sample among our bench programs, I picked solana-banking-bench, because it's the bench closest to the perf issue I'd like to address (banking stage flooded with mb txs):

glibc:

sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1195456, txs_landed: 1140160, total_us: 22874161, tx_total_us: 22138973]
{'name': 'banking_bench_total', 'median': '50362.50'}
{'name': 'banking_bench_tx_total', 'median': '52034.93'}
{'name': 'banking_bench_success_tx_total', 'median': '49844.89'}
sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1200576, txs_landed: 1145280, total_us: 22701875, tx_total_us: 21960779]
{'name': 'banking_bench_total', 'median': '50744.71'}
{'name': 'banking_bench_tx_total', 'median': '52457.16'}
{'name': 'banking_bench_success_tx_total', 'median': '50448.70'}
sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1202944, txs_landed: 1147648, total_us: 22738385, tx_total_us: 22000919]
{'name': 'banking_bench_total', 'median': '50663.23'}
{'name': 'banking_bench_tx_total', 'median': '52361.45'}
{'name': 'banking_bench_success_tx_total', 'median': '50471.83'}

rpmalloc:

sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1191360, txs_landed: 1136064, total_us: 19135997, tx_total_us: 18700941]
{'name': 'banking_bench_total', 'median': '60200.68'}
{'name': 'banking_bench_tx_total', 'median': '61601.18'}
{'name': 'banking_bench_success_tx_total', 'median': '59367.90'}
sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1193472, txs_landed: 1138176, total_us: 18655284, tx_total_us: 18223262]
{'name': 'banking_bench_total', 'median': '61751.94'}
{'name': 'banking_bench_tx_total', 'median': '63215.91'}
{'name': 'banking_bench_success_tx_total', 'median': '61010.92'}
sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1196736, txs_landed: 1141440, total_us: 19450234, tx_total_us: 19005383]
{'name': 'banking_bench_total', 'median': '59228.08'}
{'name': 'banking_bench_tx_total', 'median': '60614.41'}
{'name': 'banking_bench_success_tx_total', 'median': '58685.16'}

mimalloc:

sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1194609, txs_landed: 1139313, total_us: 18883418, tx_total_us: 18343669]
{'name': 'banking_bench_total', 'median': '61005.90'}
{'name': 'banking_bench_tx_total', 'median': '62800.96'}
{'name': 'banking_bench_success_tx_total', 'median': '60334.05'}
sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1197824, txs_landed: 1142528, total_us: 18926181, tx_total_us: 18364487]
{'name': 'banking_bench_total', 'median': '60868.06'}
{'name': 'banking_bench_tx_total', 'median': '62729.77'}
{'name': 'banking_bench_success_tx_total', 'median': '60367.59'}
sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1194560, txs_landed: 1139264, total_us: 18904023, tx_total_us: 18361670]
{'name': 'banking_bench_total', 'median': '60939.41'}
{'name': 'banking_bench_tx_total', 'median': '62739.39'}
{'name': 'banking_bench_success_tx_total', 'median': '60265.69'}

jemalloc:

sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1197440, txs_landed: 1142144, total_us: 18922589, tx_total_us: 18402904]
{'name': 'banking_bench_total', 'median': '60879.62'}
{'name': 'banking_bench_tx_total', 'median': '62598.82'}
{'name': 'banking_bench_success_tx_total', 'median': '60358.76'}
sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1194432, txs_landed: 1139136, total_us: 18952931, tx_total_us: 18392036]
{'name': 'banking_bench_total', 'median': '60782.16'}
{'name': 'banking_bench_tx_total', 'median': '62635.81'}
{'name': 'banking_bench_success_tx_total', 'median': '60103.42'}
sol@dev-equinix-tokyo-5:~/work/solana-labs/final-unified-scheduler-branch$ target/x86_64-unknown-linux-gnu/release/solana-banking-bench |& tail -n 4
[total_sent: 1152000, base_tx_count: 55296, txs_processed: 1199104, txs_landed: 1143808, total_us: 18344841, tx_total_us: 17826278]
{'name': 'banking_bench_total', 'median': '62796.95'}
{'name': 'banking_bench_tx_total', 'median': '64623.70'}
{'name': 'banking_bench_success_tx_total', 'median': '62350.39'}

At a quick glance, what stands out is that glibc is slow. rpmalloc, jemalloc, and mimalloc are all similarly ~20% faster than glibc (roughly 1.2x the baseline throughput).

One subtle takeaway is that mimalloc's perf variance is noticeably small compared to rpmalloc and jemalloc, while maintaining the ~1.2x boost over the baseline (glibc). I think this illustrates two of mimalloc's selling points, which make this result understandable:

bounded: it does not suffer from blowup [1], has bounded worst-case allocation times (wcat) (upto OS primitives)

and

... has no internal points of contention using only atomic operations

I hope this property can ease our overall system perf investigations a bit going forward...

Lastly, glibc's variance is very low as well, and that is kind of expected: its impl is far more naive in light of the multi-threaded age than the other, newer kids (and that's why it's slow...).

Anyway, these observations are just from running a single bench; take them with a grain of salt.

@ryoqun ryoqun changed the title Use mimalloc Replace jemalloc with mimalloc for reduced perf. peculiarities May 17, 2024
@ryoqun ryoqun changed the title Replace jemalloc with mimalloc for reduced perf. peculiarities Use mimalloc in attempt to reduce mem alloc perf. peculiarities May 17, 2024
@ryoqun ryoqun changed the title Use mimalloc in attempt to reduce mem alloc perf. peculiarities Use mimalloc in attempt to reduce mem alloc perf. oddities May 17, 2024

ryoqun commented Oct 3, 2024

Status update: almost stale, but I'm moving this forward very slowly. Recently, I've managed to repro the jemalloc-induced pathologically slow free()-ing via agave-ledger-tool simulate-block-production (#2733).

Also, I just found another encouraging paper about mimalloc: https://retis.sssup.it/~a.biondi/papers/RTAS24.pdf


Successfully merging this pull request may close these issues.

Poor jemalloc performance with zeroed allocations leading to TLB shootdown