NUMA support #22
Comments
Interesting. Thanks for the test. Actually, NUMA is only part of the story here. DDR3 is limited to about 1500 H/s per channel, so even a 64-core machine with uniform access to 4 channels would still be limited to ~6000 H/s. Machines like this definitely need multiple copies of the dataset for maximum performance. I'll keep it in mind. BTW, DDR4 is noticeably less limiting due to its multiple internal banks (> 3000 H/s per channel). |
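To make the arithmetic above concrete, here is a minimal sketch of the memory-bound ceiling. The per-channel figures are the rough ones quoted in this comment (the DDR4 value is only a lower bound), and the function name `memory_ceiling` is made up for illustration.

```python
# Rough memory-bound hashrate ceiling: channels * per-channel rate.
# Per-channel figures are the approximate ones quoted in this thread;
# the DDR4 entry is a lower bound ("> 3000 H/s per channel").
H_PER_CHANNEL = {"DDR3": 1500, "DDR4": 3000}  # H/s


def memory_ceiling(channels: int, memory_type: str = "DDR3") -> int:
    """Approximate H/s limit imposed by the memory channels alone."""
    return channels * H_PER_CHANNEL[memory_type]


if __name__ == "__main__":
    # One dataset copy behind 4 DDR3 channels.
    print(memory_ceiling(4, "DDR3"))
    # 4 sockets, each with its own dataset copy and 4 DDR3 channels.
    print(memory_ceiling(16, "DDR3"))
```

This only models the memory side; caches or core count can impose a lower limit.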
I noticed the CPU has 20 MB of L3 cache, so for the best performance, you should be using 10 threads per CPU or 40 threads total.
|
It looks like the best performance should be at 10 threads per CPU, but it isn't. With a single process, I tested every thread count from 32 to 80 and got the highest hashrate at 64 threads: 64 threads did about 3700 H/s, 32 threads about 3100 H/s, and 40 threads about 2900 H/s. With 4 processes, I tested from 8 to 16 threads per process and got the highest hashrate at 8 threads per process. On a single CPU, 8 threads did about 2525 H/s, 16 threads about 2450 H/s, and 10 threads about 1950 H/s. With CryptoNight, this machine does get its best performance at 10 threads per CPU. |
So it seems the L2 cache (256 KiB per core) is the limiting factor. RandomX needs 16 KiB of L1D, 256 KiB of L2 and 2 MiB of L3 per thread. |
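As a rough illustration of why L2 ends up as the limit, the per-thread requirements above can be turned into a thread count. This is only a sketch: the function name is hypothetical, and the 8-core figure for the Xeon E5-4640 is inferred from the thread (12 cores being "a 50% increase").

```python
KIB = 1024
MIB = 1024 * KIB

# Per-thread cache requirements quoted above.
NEED_L2 = 256 * KIB
NEED_L3 = 2 * MIB


def max_threads_per_cpu(cores: int, l2_per_core: int, l3_total: int) -> int:
    """Largest thread count that fits in both L2 and L3 (hypothetical helper)."""
    by_l2 = cores * (l2_per_core // NEED_L2)  # L2 is private to each core
    by_l3 = l3_total // NEED_L3               # L3 is shared by the whole CPU
    return min(by_l2, by_l3)


if __name__ == "__main__":
    # Assumed Xeon E5-4640 figures: 8 cores, 256 KiB L2 per core, 20 MiB L3.
    print(max_threads_per_cpu(cores=8, l2_per_core=256 * KIB, l3_total=20 * MIB))
```

With those inputs L3 alone would allow 10 threads, but L2 allows only 8, which matches the 8 threads per process that tested fastest.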
interesting... if hotaru2k3 isn't using a DDR4 server board, (This reminds me of data interleaving / XMR-Stak's interleave function) |
I'm using a DDR3 server board with 4 sockets and 4 memory channels per socket. the DDR3 ~6000H/s max is for 4 channels, not 16. |
ahh. it's per socket. so it's similar to 4 motherboards joined together with 4 channels each. So you've got 50% spare RAM bandwidth before your RAM becomes a bottleneck. |
the best CPUs for this board have 12 cores (a 50% increase over what I have), so it'd probably max out around 15000. L2 cache would still be the bottleneck. |
|
NUMA support really needs to be implemented. Having to access memory through another processor causes a performance drop of about 52%. From the many forums I have read, lots of users who will be mining with RandomX are buying, or already have, servers with two or more processors. My test system is an HP DL580 with four eight-core Xeon E7-8837 processors and 8 GB of memory on each processor, 32 GB in total. The benchmark shows that when RandomX allocates the whole Dataset in the memory of only one processor, it runs 52% slower than when the Dataset is spread across the processors.
sudo sysctl -w vm.nr_hugepages=1200
Now allocate the Dataset to only one processor:
sudo sysctl -w vm.nr_hugepages=4800
The screenshots show that the 28 threads are spread over the four processors in both tests, so when the Dataset is only in one processor's local memory, the other three need to go through another processor to access it. That caused the massive slowdown. |
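One plausible reading of why the two `vm.nr_hugepages` values behave so differently is sketched below. It assumes, and these are assumptions rather than facts from this thread, that the kernel splits the huge page pool evenly across the four NUMA nodes, that huge pages are 2 MiB, and that the RandomX Dataset is roughly 2080 MiB.

```python
HUGEPAGE_MIB = 2     # assumption: default 2 MiB huge pages on x86-64
DATASET_MIB = 2080   # assumption: approximate RandomX Dataset size


def pages_per_node(nr_hugepages: int, nodes: int) -> int:
    """Huge pages per NUMA node, assuming an even split across nodes."""
    return nr_hugepages // nodes


def dataset_fits_on_one_node(nr_hugepages: int, nodes: int) -> bool:
    """True if a single node's share of the pool can hold the whole Dataset."""
    return pages_per_node(nr_hugepages, nodes) * HUGEPAGE_MIB >= DATASET_MIB


if __name__ == "__main__":
    for pool in (1200, 4800):
        print(pool, pages_per_node(pool, 4), dataset_fits_on_one_node(pool, 4))
```

Under those assumptions, a pool of 1200 pages leaves each node too little to hold the Dataset, so the allocation has to span nodes, while 4800 pages lets it land entirely on one node.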
sudo sysctl -w vm.nr_hugepages=4800
Long story short: since NUMA support is still not in the RandomX benchmark, you need to benchmark using this command:
seq 0 3 | xargs -P 0 -I node numactl -N node ./benchmark --mine --largePages --jit --nonces 100000 --init 8 --threads 8
That command runs four benchmarks, each assigned to one processor, and each processor uses only its local memory. These are the results I obtained:
Running benchmark (100000 nonces) ...
That is a total of 9186 H/s for the four processors, or an average of 2296 H/s for each Xeon E7-8837. |
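For reference, the `seq | xargs` pipeline above expands to one `numactl` invocation per node. The sketch below only builds those command strings and the per-node average; it does not run anything, and the helper names are made up for illustration. The flags are copied from the command in this comment.

```python
import shlex

# Benchmark arguments copied verbatim from the command above.
BENCH_ARGS = ["./benchmark", "--mine", "--largePages", "--jit",
              "--nonces", "100000", "--init", "8", "--threads", "8"]


def per_node_commands(nodes: int) -> list[str]:
    """One command line per NUMA node, pinned with `numactl -N <node>`."""
    return [shlex.join(["numactl", "-N", str(node)] + BENCH_ARGS)
            for node in range(nodes)]


def average_per_node(total_hs: float, nodes: int) -> float:
    """Average hashrate per node given the summed result."""
    return total_hs / nodes


if __name__ == "__main__":
    for command in per_node_commands(4):
        print(command)
    print(average_per_node(9186, 4))
```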
I just wanted to point out that XMRig v3.1.0 has implemented NUMA support for RandomX on testnet, and it works flawlessly. https://github.com/xmrig/xmrig/releases/tag/v3.1.0 This is how I test on Ubuntu:
|
I created a NUMA patch for the benchmark some time ago (just rebased now as well). Quite honestly though, numactl works fine as well. |
on a quad-socket server (4x Xeon E5-4640), a single process with 64 threads maxes out at about 3700 H/s.
running 4 processes at once (one per physical CPU) with 8 threads each, I get about 2525 H/s per process for a total of 10100 H/s.
since real mining software will almost certainly have NUMA support, it would probably be good to implement it here so people get a more accurate idea of actual mining hashrates.
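To put the gap between the two setups in numbers, here is a small sketch using only the figures reported above; the function names are hypothetical.

```python
def total_hashrate(per_process_hs: float, processes: int) -> float:
    """Combined hashrate of identical per-socket processes."""
    return per_process_hs * processes


def speedup(per_socket_total: float, single_process_total: float) -> float:
    """How many times faster the per-socket setup is than one big process."""
    return per_socket_total / single_process_total


if __name__ == "__main__":
    per_socket = total_hashrate(2525, 4)        # 4 processes, 8 threads each
    print(per_socket)
    print(round(speedup(per_socket, 3700), 2))  # vs. 1 process, 64 threads
```

On these numbers, one process per physical CPU is a bit over 2.7 times faster than a single 64-thread process.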