NUMA support #22

lilyanatia · 2019-02-12T06:26:01Z

on a quad-socket server (4x Xeon E5-4640), a single process with 64 threads maxes out at about 3700 H/s.

running 4 processes at once (one per physical CPU) with 8 threads each, I get about 2525 H/s per process for a total of 10100 H/s.

since real mining software will almost certainly have NUMA support, it would probably be good to implement it here so people get a more accurate idea of actual mining hashrates.

tevador · 2019-02-12T08:01:44Z

Interesting. Thanks for the test.

Actually, NUMA is only part of the story here. DDR3 is limited to about 1500 H/s per channel, so even if this were a 64-core machine with uniform access to 4 channels, it would still be limited to ~6000 H/s. Machines like this definitely need multiple copies of the dataset for maximum performance. I'll keep it in mind.

BTW, DDR4 is noticeably less limiting due to its multiple internal banks (> 3000 H/s per channel).
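The memory-bound ceiling described above is just channels times per-channel hashrate. A minimal sketch of that arithmetic, assuming the approximate per-channel figures quoted in this thread (they are estimates, not measurements), with `bandwidth_ceiling` as a hypothetical helper name:

```shell
# Rough memory-bound ceiling: number of channels * hashrate per channel.
# The per-channel values are the approximate figures from the discussion above.
bandwidth_ceiling() {
    channels=$1
    per_channel=$2
    echo $(( channels * per_channel ))
}

bandwidth_ceiling 4 1500   # DDR3, 4 channels shared by one dataset copy
bandwidth_ceiling 4 3000   # DDR4 lower bound (> 3000 H/s per channel)
```

The first call prints 6000, matching the ~6000 H/s limit mentioned for four DDR3 channels.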

tevador · 2019-02-12T17:30:53Z

I noticed the CPU has 20 MB of L3 cache, so for the best performance, you should be using 10 threads per CPU or 40 threads total.

bin/randomx --mine --largePages --threads 10 --nonces 100000

lilyanatia · 2019-02-12T20:37:04Z

it looks like the best performance should be with 10 threads per CPU, but it isn't.

with a single process, I tested all numbers from 32 to 80 and got the highest hashrate at 64 threads. 64 threads did about 3700 H/s, 32 threads did about 3100 H/s, and 40 threads did about 2900 H/s

with 4 processes, I tested from 8 to 16 and got the highest hashrate at 8 threads per process. on a single CPU, 8 threads did about 2525 H/s, 16 threads did about 2450 H/s, and 10 threads did about 1950 H/s.

with cryptonight, this machine does get the best performance at 10 threads per CPU.

tevador · 2019-02-12T21:29:12Z

So it seems the L2 cache (256 KiB per core) is the limiting factor. RandomX needs 16 KiB of L1D, 256 KiB of L2 and 2 MiB of L3 per thread.
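To make the cache reasoning concrete, here is a minimal sketch that picks the thread count per CPU as the smaller of "how many 256 KiB L2 slices exist" and "how many 2 MiB L3 slices exist". The function name `threads_per_cpu` and its arguments are hypothetical; the 8 cores per CPU are inferred from the numbers in this thread.

```shell
# Threads per CPU limited by cache, using the per-thread requirements
# quoted above: 256 KiB of L2 and 2 MiB (2048 KiB) of L3 per thread.
# All sizes are given in KiB.
threads_per_cpu() {
    cores=$1
    l2_per_core_kib=$2
    l3_total_kib=$3
    by_l2=$(( cores * l2_per_core_kib / 256 ))
    by_l3=$(( l3_total_kib / 2048 ))
    # The tighter of the two limits wins.
    if [ "$by_l2" -lt "$by_l3" ]; then
        echo "$by_l2"
    else
        echo "$by_l3"
    fi
}

# Assumed: 8 cores, 256 KiB L2 per core, 20 MiB shared L3.
threads_per_cpu 8 256 20480
```

For these inputs the L3 limit is 10 threads but the L2 limit is 8, so the sketch prints 8, which agrees with the measured optimum of 8 threads per process.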

MoneroChan · 2019-04-11T12:33:40Z

interesting... if hotaru2k3 isn't using a DDR4 server board, that means non-uniform memory access can overcome the DDR3 ~6000 H/s limit by ~40% in some configurations, but sacrifices ~40% of the maximum possible H/s...

(This reminds me of data interleaving / XMR-Stak's interleave function)

lilyanatia · 2019-04-12T01:02:15Z

I'm using a DDR3 server board with 4 sockets and 4 memory channels per socket. the DDR3 ~6000H/s max is for 4 channels, not 16.

MoneroChan · 2019-04-12T03:40:07Z

ahh. it's per socket. so it's similar to 4 motherboards joined together with 4 channels each.
So 16 channels in total. That makes sense now.

So you've got 50% spare RAM bandwidth before your RAM becomes a bottleneck.
and theoretical Max on your system is 20000 to 24000H/s if you upgrade the CPUs?

lilyanatia · 2019-04-14T04:17:35Z

the best CPUs for this board have 12 cores (a 50% increase over what I have), so it'd probably max out around 15000. L2 cache would still be the bottleneck.
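The estimate above is a simple linear scaling with core count. A minimal sketch, assuming the hashrate stays L2-bound and scales with cores (`scaled_hashrate` is a hypothetical helper, and 8 cores per current CPU is inferred from the "50% increase" remark):

```shell
# Scale a measured hashrate linearly with the number of cores per CPU.
scaled_hashrate() {
    measured=$1
    cores_now=$2
    cores_new=$3
    echo $(( measured * cores_new / cores_now ))
}

# 10100 H/s measured with 8-core CPUs, projected to 12-core CPUs.
scaled_hashrate 10100 8 12
```

This prints 15150, consistent with "around 15000".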

lilyanatia · 2019-06-22T18:33:01Z

How exactly is this done?

numactl -s | grep \^nodebind:\ | cut -c 11- | sed s/\ /\\n/g | xargs -P 0 -I node numactl -N node ./randomx-benchmark --mine --jit --largePages --init $(numactl -H | grep 'node 0 cpus: ' | cut -c 14- | wc -w) --threads $((echo $(lstopo --restrict 0xff --only L2 | grep -o '[0-9]\+[KM]B') | sed 's/ /+/g;s/KB//g;s/MB/*1024/g;s/^/(/;s/$/)\/256/' | bc; echo $(lstopo --restrict 0xff --only L3 | grep -o '[0-9]\+[KM]B') | sed 's/ /+/g;s/KB//g;s/MB/*1024/g;s/^/(/;s/$/)\/2048/' | bc) | sort -n | head -1)

kio3i0j9024vkoenio · 2019-06-24T00:02:42Z

NUMA support really needs to be implemented. Having to access memory through another processor causes a performance drop of about 52%.

From the many forums I have read, lots of users who will be mining with RandomX are buying servers now, or already own servers, with two or more processors.

My test system is an HP DL580 with four eight-core Xeon E7-8837 processors and 8 GB of memory on each processor, or 32 GB of memory in total.

The benchmark shows that when RandomX allocates the whole dataset in the memory of only one processor, it runs 52% slower than when the dataset is spread across the other processors.

sudo sysctl -w vm.nr_hugepages=1200
./benchmark --mine --largePages --jit --threads 28 --nonces 100000
RandomX benchmark
Performance: 7193.61 hashes per second

Now allocate Dataset to only one processor:

sudo sysctl -w vm.nr_hugepages=4800
./benchmark --mine --largePages --jit --threads 28 --nonces 100000
RandomX benchmark
Performance: 3471.44 hashes per second

That is a slowdown of 51.7%.

Screenshots show that the 28 threads are spread over the four processors in both tests, so when the dataset is in the local memory of only one processor, the other three have to go through another processor to access it. That caused the massive slowdown.
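For reference, the slowdown figure can be reproduced from the two benchmark results above. A minimal sketch (`slowdown_pct` is a hypothetical helper name):

```shell
# Relative slowdown in percent between a fast and a slow run.
slowdown_pct() {
    awk -v fast="$1" -v slow="$2" \
        'BEGIN { printf "%.1f\n", (fast - slow) / fast * 100 }'
}

# Dataset spread across nodes vs. dataset on a single node.
slowdown_pct 7193.61 3471.44
```

This prints 51.7.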

kio3i0j9024vkoenio · 2019-06-27T03:36:01Z

sudo sysctl -w vm.nr_hugepages=4800

Long story short: since the RandomX benchmark still has no NUMA support, you need to benchmark using this command:

seq 0 3 | xargs -P 0 -I node numactl -N node ./benchmark --mine --largePages --jit --nonces 100000 --init 8 --threads 8

That command runs four benchmarks, each bound to a single processor, so that each processor uses only its local memory.

These are the results I obtained:
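The same idea can be written out more explicitly. The sketch below only prints one command per NUMA node instead of running anything, so it can be inspected first. The function name `numa_benchmark_cmds` is hypothetical, and the extra `-m` (memory bind) option is an assumption added here for explicitness; the command above relies on `-N` alone and the default local allocation policy.

```shell
# Print (do not run) one benchmark invocation per NUMA node.
# -N binds the process to the node's CPUs, -m binds its memory to the node.
numa_benchmark_cmds() {
    nodes=$1
    threads=$2
    node=0
    while [ "$node" -lt "$nodes" ]; do
        echo "numactl -N $node -m $node ./benchmark --mine --largePages --jit --nonces 100000 --init $threads --threads $threads"
        node=$(( node + 1 ))
    done
}

# Four nodes, eight threads each, as in the test system described here.
numa_benchmark_cmds 4 8
```

Each printed line can then be started as its own process, so that every node initializes and reads its own copy of the dataset.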

Running benchmark (100000 nonces) ...
Calculated result: 9b22794882187000d62c6d2b228fab5e585767aaaa5eb74905b0c7c00fcbdad8
Performance: 2319.77 hashes per second
Calculated result: 9b22794882187000d62c6d2b228fab5e585767aaaa5eb74905b0c7c00fcbdad8
Performance: 2296.53 hashes per second
Calculated result: 9b22794882187000d62c6d2b228fab5e585767aaaa5eb74905b0c7c00fcbdad8
Performance: 2289.64 hashes per second
Calculated result: 9b22794882187000d62c6d2b228fab5e585767aaaa5eb74905b0c7c00fcbdad8
Performance: 2280.22 hashes per second

That is a total of 9186 H/s for the four processors or an average of 2296 H/s for each Xeon E7-8837.
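The total and average can be computed directly from the benchmark output. A minimal sketch that sums every `Performance:` line read from standard input (`sum_hashrates` is a hypothetical helper name):

```shell
# Sum and average the "Performance: N hashes per second" lines on stdin.
sum_hashrates() {
    awk '/^Performance:/ { total += $2; n++ }
         END { if (n) printf "total %.2f H/s, average %.2f H/s over %d runs\n", total, total / n, n }'
}

# The four per-processor results reported above.
printf '%s\n' \
    'Performance: 2319.77 hashes per second' \
    'Performance: 2296.53 hashes per second' \
    'Performance: 2289.64 hashes per second' \
    'Performance: 2280.22 hashes per second' | sum_hashrates
```

For the four results above this reports a total of 9186.16 H/s and an average of 2296.54 H/s.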

kio3i0j9024vkoenio · 2019-08-16T22:50:41Z

I just wanted to point out that XMRig v3.1.0 has implemented NUMA support for RandomX, along with a testnet, and it works flawlessly.

https://github.com/xmrig/xmrig/releases/tag/v3.1.0

This is how I test on Ubuntu:

RandomX testing

wget https://github.com/xmrig/xmrig/releases/download/v3.1.0/xmrig-3.1.0-xenial-x64.tar.gz
tar xvzf xmrig-3.1.0-xenial-x64.tar.gz

cd xmrig-3.1.0

edit config.json:
   "asm": "bulldozer", - change to this only for Opteron 6200 or 6300 CPUs, otherwise leave it alone
   "donate-level": 1,
   "algo": "rx/test",
   "pools": "randomx-benchmark.xmrig.com:7777",

./xmrig

jtgrassie · 2019-08-30T22:02:59Z

I created a NUMA patch for the benchmark some time ago (just rebased now as well). Quite honestly though, numactl works fine as well.

tevador mentioned this issue Feb 20, 2019

Performance and portability testing #25

Closed

tevador mentioned this issue Jun 28, 2019

Low performance for the amd epic 7351p #91

Closed

tevador added the enhancement New feature or request label Jun 30, 2019
