Compare jemalloc performance #6897

Closed · brson opened this issue Jun 2, 2013 · 10 comments
Labels: A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows
@brson (Contributor) commented Jun 2, 2013

#6895 converts Rust to jemalloc. This is a big change, and we should capture some performance numbers to get an idea of its effect.

I'm thinking of just a few workloads:

  • build rustc (with --no-trans)
  • A multithreaded test using treemap (like in core-map)

For each, collect the running time and maximum memory used, both before and after the jemalloc change, on Linux, Mac, and Windows.
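
One quick way to collect both numbers for a single process is GNU time's verbose mode, which reports wall-clock time and peak resident set size. A sketch (the rustc invocation is just illustrative; exact field names vary slightly between time versions):

$ /usr/bin/time -v x86_64-unknown-linux-gnu/stage2/bin/rustc --no-trans src/librustc/rustc.rc
        Elapsed (wall clock) time (h:mm:ss or m:ss): ...
        Maximum resident set size (kbytes): ...

For a whole make-driven build, something cgroup-based (see below) is better, since it accounts for subprocesses as well.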

@brson (Contributor, Author) commented Jun 2, 2013

19:56 <@brson> what are the most important metrics for measuring allocator performance?
19:57 < strcat> brson: small allocation speed/overhead, fragmentation (hard to measure, I assume) and lock contention
19:57 < strcat> I guess checking peak memory usage of a rustc build would be a good metric
19:58 < strcat> brson: the core-map test has a good test for the small allocs (treemap)
19:58 < strcat> could also try it with a bunch of threads
20:02 < strcat> brson: no, just binaries linked against librustrt.so
20:02 < strcat> brson: I don't think it would matter much for LLVM because they use their own allocator things

@brson mentioned this issue Jun 2, 2013
@ghost assigned brson Jun 2, 2013
@thestinger (Contributor)

By the way, to check peak memory usage on Linux:

(assuming /sys/fs/cgroup is mounted, and the kernel has support for the memory cgroup)

# mkdir /sys/fs/cgroup/memory/rust
# echo 0 > /sys/fs/cgroup/memory/rust/tasks  # to add the current process to the cgroup
# su -l non_root_user
$ exec process_to_test 

and then when it finishes:

# cat /sys/fs/cgroup/memory/rust/memory.max_usage_in_bytes
# rmdir /sys/fs/cgroup/memory/rust

You can do this for a whole make invocation as well, since the cgroup accounts for all of the subprocesses.
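
For repeated runs it may be handy to wrap that in a tiny script. A minimal sketch, run as root, assuming the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory (the peak-mem.sh and rust-bench names are arbitrary):

#!/bin/sh
# peak-mem.sh: report peak physical memory of a command and all of its subprocesses
set -e
CG=/sys/fs/cgroup/memory/rust-bench
mkdir "$CG"
echo 0 > "$CG/tasks"                   # move this shell (and its future children) into the cgroup
"$@"                                   # run the command under test
cat "$CG/memory.max_usage_in_bytes"    # peak usage, in bytes
echo 0 > /sys/fs/cgroup/memory/tasks   # move the shell back to the root group so rmdir succeeds
rmdir "$CG"

e.g. ./peak-mem.sh make -j4 (any target works, since all subprocesses land in the cgroup).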

@emberian (Member) commented Jun 2, 2013

Numbers for the core-map benchmark on Windows 8: https://gist.github.com/cmr/fdec972a739b893671d0
Numbers for the core-map benchmark on Linux (glibc 2.17), including tcmalloc: https://gist.github.com/cmr/8f675944ba45f3277cfd

Additionally, on Linux, max memory usage:

allocator    max usage (bytes)    (KiB)
glibc        170033152            166048K
tcmalloc     247701504            241896K
jemalloc     189943808            185492K

However, I'm not sure these numbers mean much. Watching htop while they run shows long stretches where RES stays below 100M, punctuated by periodic spikes up to the max. jemalloc definitely has the lowest RES over time, roughly half that of glibc; tcmalloc hits its maximum and never appears to free anything back to the OS.

@pcwalton Recalling a discussion from IRC: @thestinger did some benchmarks of jemalloc vs. tcmalloc and found their performance nearly identical; jemalloc was chosen for its better runtime reporting and the like.

@emberian (Member) commented Jun 2, 2013

Using the benchmark from https://www.citi.umich.edu/u/cel/linux-scalability/reports/malloc.html, I get the following:

glibc

Thread 333063936 adjusted timing: 0.675780 seconds for 16777216 requests of 1024 bytes.
Thread 349849344 adjusted timing: 0.911601 seconds for 16777216 requests of 1024 bytes.
Thread 324671232 adjusted timing: 0.914302 seconds for 16777216 requests of 1024 bytes.
Thread 341456640 adjusted timing: 1.035064 seconds for 16777216 requests of 1024 bytes.

jemalloc

Thread -79698176 adjusted timing: 0.263953 seconds for 16777216 requests of 1024 bytes.
Thread -71305472 adjusted timing: 0.363943 seconds for 16777216 requests of 1024 bytes.
Thread -88090880 adjusted timing: 0.356622 seconds for 16777216 requests of 1024 bytes.
Thread -54528256 adjusted timing: 0.357841 seconds for 16777216 requests of 1024 bytes.

tcmalloc

Thread -951392512 adjusted timing: 0.217104 seconds for 16777216 requests of 1024 bytes.
Thread -926214400 adjusted timing: 0.217926 seconds for 16777216 requests of 1024 bytes.
Thread -934607104 adjusted timing: 0.222338 seconds for 16777216 requests of 1024 bytes.
Thread -942999808 adjusted timing: 0.341169 seconds for 16777216 requests of 1024 bytes.

When bumping it up to 8 threads, jemalloc performs a bit worse than tcmalloc:

jemalloc

Thread -2067810560 adjusted timing: 0.495102 seconds for 16777216 requests of 1024 bytes.
Thread -2084595968 adjusted timing: 0.501311 seconds for 16777216 requests of 1024 bytes.
Thread -2017462528 adjusted timing: 0.502598 seconds for 16777216 requests of 1024 bytes.
Thread -2059417856 adjusted timing: 0.499790 seconds for 16777216 requests of 1024 bytes.
Thread -2034239744 adjusted timing: 0.502179 seconds for 16777216 requests of 1024 bytes.
Thread -2051025152 adjusted timing: 0.500838 seconds for 16777216 requests of 1024 bytes.
Thread -2076203264 adjusted timing: 0.510907 seconds for 16777216 requests of 1024 bytes.
Thread -2042632448 adjusted timing: 0.529690 seconds for 16777216 requests of 1024 bytes.

tcmalloc

Thread 1207236352 adjusted timing: 0.415216 seconds for 16777216 requests of 1024 bytes.
Thread 1224021760 adjusted timing: 0.437269 seconds for 16777216 requests of 1024 bytes.
Thread 1165272832 adjusted timing: 0.418753 seconds for 16777216 requests of 1024 bytes.
Thread 1215629056 adjusted timing: 0.436506 seconds for 16777216 requests of 1024 bytes.
Thread 1198843648 adjusted timing: 0.437209 seconds for 16777216 requests of 1024 bytes.
Thread 1173665536 adjusted timing: 0.433804 seconds for 16777216 requests of 1024 bytes.
Thread 1182058240 adjusted timing: 0.398201 seconds for 16777216 requests of 1024 bytes.
Thread 1190450944 adjusted timing: 0.391288 seconds for 16777216 requests of 1024 bytes.

But for much smaller allocations, jemalloc far outperforms tcmalloc:

jemalloc

Thread -1669343488 adjusted timing: 0.275617 seconds for 16777216 requests of 8 bytes.
Thread -1652558080 adjusted timing: 0.295337 seconds for 16777216 requests of 8 bytes.
Thread -1635780864 adjusted timing: 0.315195 seconds for 16777216 requests of 8 bytes.
Thread -1660950784 adjusted timing: 0.318332 seconds for 16777216 requests of 8 bytes.

tcmalloc

Thread 180864768 adjusted timing: 0.775013 seconds for 16777216 requests of 8 bytes.
Thread 197650176 adjusted timing: 0.934877 seconds for 16777216 requests of 8 bytes.
Thread 172472064 adjusted timing: 1.055541 seconds for 16777216 requests of 8 bytes.
Thread 189257472 adjusted timing: 1.031184 seconds for 16777216 requests of 8 bytes.

This is only a cursory overview of runtime performance, but it seems like jemalloc is the better choice overall.
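
For anyone reproducing these numbers: the usual way to compare allocators with one benchmark binary is to interpose them via LD_PRELOAD. A rough sketch (library paths vary by distribution, and malloc-test stands in for however the benchmark binary was built):

# glibc malloc: run the binary as-is
$ ./malloc-test

# jemalloc / tcmalloc: preload the allocator so its malloc/free override glibc's
$ LD_PRELOAD=/usr/lib/libjemalloc.so ./malloc-test
$ LD_PRELOAD=/usr/lib/libtcmalloc.so ./malloc-test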

@emberian (Member) commented Jun 2, 2013

Building rustc with

x86_64-unknown-linux-gnu/stage2/bin/rustc --no-trans --cfg stage2 -O -Z no-debug-borrows --target=x86_64-unknown-linux-gnu -o x86_64-unknown-linux-gnu/stage2/lib/rustc/x86_64-unknown-linux-gnu/lib/librustc.so /home/cmr/hacking/rust-incoming/src/librustc/rustc.rc

gives me:

          glibc      jemalloc
time      0:14.01    0:12.17
memory    536544K    589868K

jemalloc is tuned via opt.lg_dirty_mult, which controls the 'per-arena minimum ratio (log base 2) of active to dirty pages' before dirty pages get purged. That threshold may simply never be reached here, which could explain the higher memory use, but I'm not sure.
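
For anyone who wants to experiment with that knob: jemalloc reads options from the MALLOC_CONF environment variable at startup (in builds where that is enabled). A sketch, where 5 is just an example value and stats_print dumps allocator statistics at exit:

# require a higher active:dirty ratio (2^5 = 32:1), i.e. purge dirty pages more aggressively,
# and print jemalloc's statistics when the process exits
$ MALLOC_CONF="lg_dirty_mult:5,stats_print:true" ./program-under-test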

@thestinger (Contributor)

@cmr: is that VIRT or RES?

@emberian (Member) commented Jun 2, 2013

@thestinger as measured with cgroups, memory.max_usage_in_bytes

@thestinger (Contributor)

Oh, so that's the true physical memory usage then.

@brson (Contributor, Author) commented Jun 6, 2013

On Linux, bench/msgsend-pipes with the RUST_BENCH difficulty bumped up to ~[~"", ~"10000000", ~"32"]:

before:

brian@brian-ThinkPad-W520:~/dev/rust3/build$ time RUST_BENCH=1 x86_64-unknown-linux-gnu/test/bench/msgsend-pipes.stage1-x86_64-unknown-linux-gnu
Count is 1000000000
Test took 12.40329822 seconds
Throughput=806237.165326 per sec

real    0m12.497s
user    0m18.017s
sys     0m6.020s
brian@brian-ThinkPad-W520:~/dev/rust3/build$ time RUST_BENCH=1 x86_64-unknown-linux-gnu/test/bench/msgsend-pipes.stage1-x86_64-unknown-linux-gnu
Count is 1000000000
Test took 14.62787277 seconds
Throughput=683626.400137 per sec

real    0m14.718s
user    0m21.233s
sys     0m9.113s
brian@brian-ThinkPad-W520:~/dev/rust3/build$ time RUST_BENCH=1 x86_64-unknown-linux-gnu/test/bench/msgsend-pipes.stage1-x86_64-unknown-linux-gnu
Count is 1000000000
Test took 12.58228702 seconds
Throughput=794768.07211 per sec

real    0m12.682s
user    0m18.565s
sys     0m6.040s

after:

brian@brian-ThinkPad-W520:~/dev/rust3/build$ time RUST_BENCH=1 x86_64-unknown-linux-gnu/test/bench/msgsend-pipes.stage1-x86_64-unknown-linux-gnu
Count is 1000000000
Test took 7.96421961 seconds
Throughput=1255615.802732 per sec

real    0m7.980s
user    0m11.169s
sys     0m0.652s

brian@brian-ThinkPad-W520:~/dev/rust3/build$ time RUST_BENCH=1 x86_64-unknown-linux-gnu/test/bench/msgsend-pipes.stage1-x86_64-unknown-linux-gnu
Count is 1000000000
Test took 7.7545624 seconds
Throughput=1289563.419319 per sec

real    0m7.768s
user    0m10.901s
sys     0m0.796s

brian@brian-ThinkPad-W520:~/dev/rust3/build$ time RUST_BENCH=1 x86_64-unknown-linux-gnu/test/bench/msgsend-pipes.stage1-x86_64-unknown-linux-gnu
Count is 1000000000
Test took 7.62918568 seconds
Throughput=1310755.880838 per sec

real    0m7.645s
user    0m10.845s
sys     0m0.700s

The mean time dropped from 13.1s to 7.7s.

@thestinger (Contributor)

This landed as 5d2cadb.
