Suboptimal speed on multi-socket / numa systems #5253

Closed
vondele opened this issue May 16, 2024 · 9 comments
@vondele
Member

vondele commented May 16, 2024

Describe the issue

We observe slower-than-expected nodes per second (NPS) on multi-socket and/or NUMA systems (e.g. local, TCEC, potentially CCC). The likely reason is increased memory bandwidth contention due to the larger networks and accumulator caches.

Expected behavior

NPS/performance should follow more closely the NPS on single-CPU / single-NUMA-domain chips. Potential speedups could be 2x and larger.

Steps to reproduce

Reproducing/testing needs access to a multi-socket system, and either looking at historical data or doing a comparison with a synthetic benchmark, in which the speed of multiple instances of SF with fewer threads (each bound to a NUMA domain) is compared to one instance of SF with correspondingly more threads. An example is the following set of historical data:

                                                             sha                             date       pinned      default
                        dcb02337844d71e56df57b9a8ba17646f953711c        2024-05-15T16:27:03+02:00   8273351050   3266542305
                        49ef4c935a5cb0e4d94096e6354caa06b36b3e3c        2024-04-24T18:38:20+02:00   8548621496   3491153172
                        0716b845fdef8a20102b07eaec074b8da8162523        2024-04-02T08:49:48+02:00   8059206362   4248207804
                        bd579ab5d1a931a09a62f2ed33b5149ada7bc65f        2024-03-07T19:53:48+01:00   9424434014   5654516665
                        e67cc979fd2c0e66dfc2b2f2daa0117458cfc462        2024-02-24T18:15:04+01:00   9470247485   5864264415
                        8e75548f2a10969c1c9211056999efbcebe63f9a        2024-02-17T17:11:46+01:00   9411936135   6000562457
                        6deb88728fb141e853243c2873ad0cda4dd19320        2024-01-08T18:34:36+01:00   9346121150   5796929238
                        f12035c88c58a5fd568d26cde9868f73a8d7b839        2023-12-30T11:08:03+01:00   9454744857   6531509540
                        afe7f4d9b0c5e1a1aa224484d2cd9e04c7f099b9        2023-09-29T22:30:27+02:00   9235658137   7375676116
                        70ba9de85cddc5460b1ec53e0a99bee271e26ece        2023-09-22T19:26:16+02:00   9284309149   7583913771
                        3d1b067d853d6e8cc22cf18c1abb4cd9833dd38f        2023-09-11T22:37:39+02:00  10860286083  10026471309
                        e699fee513ce26b3794ac43d08826c89106e10ea        2023-07-06T23:03:58+02:00  10399561315   9485800488
                        915532181f11812c80ef0b57bc018de4ea2155ec        2023-07-01T13:34:30+02:00  10172761439   8869020884

Here, default runs a single instance of SF with 256 threads without pinning, while the pinned setup uses 8 instances of SF, each pinned to a suitable NUMA domain. It can be seen that a year ago the performance difference was just about 10%, whereas now the difference is about 200%.

Data generated with:

for version in  dcb02337844d71e56df57b9a8ba17646f953711c 49ef4c935a5cb0e4d94096e6354caa06b36b3e3c 0716b845fdef8a20102b07eaec074b8da8162523 bd579ab5d1a931a09a62f2ed33b5149ada7bc65f e67cc979fd2c0e66dfc2b2f2daa0117458cfc462 8e75548f2a10969c1c9211056999efbcebe63f9a 6deb88728fb141e853243c2873ad0cda4dd19320 f12035c88c58a5fd568d26cde9868f73a8d7b839 afe7f4d9b0c5e1a1aa224484d2cd9e04c7f099b9 70ba9de85cddc5460b1ec53e0a99bee271e26ece 3d1b067d853d6e8cc22cf18c1abb4cd9833dd38f e699fee513ce26b3794ac43d08826c89106e10ea 915532181f11812c80ef0b57bc018de4ea2155ec ef94f77f8c827a2395f1c40f53311a3b1f20bc5b a49b3ba7ed5d9be9151c8ceb5eed40efe3387c75 932f5a2d657c846c282adcf2051faef7ca17ae15 373359b44d0947cce2628a9a8c9b432a458615a8 c1fff71650e2f8bf5a2d63bdc043161cdfe8e460 41f50b2c83a0ba36a2b9c507c1783e57c9b13485 68e1e9b3811e16cad014b590d7443b9063b3eb52 758f9c9350abee36a5865ec701560db8ea62004d e6e324eb28fd49c1fc44b3b65784f85a773ec61c 7262fd5d14810b7b495b5038e348a448fda1bcc3 773dff020968f7a6f590cfd53e8fd89f12e15e36 3597f1942ec6f2cfbd50b905683739b0900ff5dd c306d838697011da0a960758dde3f7ede6849060 c3483fa9a7d7c0ffa9fcc32b467ca844cfb63790
do

git checkout $version >& checkout.log.$version
make -j ARCH=x86-64-avx2 profile-build  >& build.log.$version
mv stockfish stockfish.$version

for split in 1 2 4 8 16 32
do

threads=$((256/split))
hash=$((128000/split))

cat << EOF > inp_$split
setoption name Threads value $threads
setoption name Hash value $hash
go movetime 100000
ucinewgame
quit
EOF

done

# no binding
split=1
for instance in `seq 1 $split`
do
cat inp_$split | ./stockfish.$version > out_nobind_${split}_${instance} &
done
wait

total_nodes_nobind=0
for instance in `seq 1 $split`
do
nodes=`grep -B1 bestmove out_nobind_${split}_${instance} | grep -o "nodes [0-9]*" | awk '{print $2}'`
total_nodes_nobind=$((total_nodes_nobind + nodes))
done


# binding
split=8
threads=$((256/split))
for instance in `seq 1 $split`
do
# this cpu list must match the numa domains ... depends on the system
tasksetlow=$(((instance-1)*threads/2))
tasksethigh=$(((instance-1)*threads/2 + threads/2 - 1))
cat inp_$split | taskset --cpu-list $tasksetlow-$tasksethigh,$((128+tasksetlow))-$((128+tasksethigh)) ./stockfish.$version > out_bind_${split}_${instance} &
done
wait

total_nodes_bind=0
for instance in `seq 1 $split`
do
nodes=`grep -B1 bestmove out_bind_${split}_${instance} | grep -o "nodes [0-9]*" | awk '{print $2}'`
total_nodes_bind=$((total_nodes_bind + nodes))
done

# commit date for this version
epoch=`git show --pretty=fuller --date=iso-strict $version | grep 'CommitDate' | awk '{print $NF}'`

printf "%64s %32s %12d %12d\n" $version $epoch $total_nodes_bind $total_nodes_nobind

done

On the same system, the following performance is observed according to the splitting/pinning strategy:

 split         bind      no bind
     1   3387763787   3412555326
     2   6249669886   4568189260
     4   8345110549   3516812166
     8   8283342858   3523839484
    16   8013377138   2302587649
    32   7888404092   3094224313

On a different system with 4 sockets the observation is:

 split         bind      no bind
     1   9204147529   9082536561
     2  15654018181  10057157190
     4  20958931771   8864636565
     8  20290433821   4744173824
    16  19448457275   3913568825

Anything else?

Tentatively, a solution could be to introduce thread affinity and replicate the network weights across NUMA domains. A potential interface could be along the lines of:

# specify cpu masks for two numa domains
setoption name affinityMasks value 0xFF00,0x00FF
# round-robin allocate threads to these domains, with a net allocated for each numa domain
setoption name Threads value 256

Some more discussion here: https://discord.com/channels/435943710472011776/813919248455827515/1240709279049191525
as well as a rebased version of code that goes in that direction (but needs further work):
https://github.com/official-stockfish/Stockfish/compare/master...Disservin:Stockfish:numareplicatedweights?expand=1

Operating system

All

Stockfish version

master

@Sopel97
Member

Sopel97 commented May 16, 2024

I'd be in favor of adding a simple boolean NUMA yes/no UCI option. The affinity mask may get unwieldy. On Linux we can use either libnuma (which is a bit of a problematic dependency) or parse lscpu output. On Windows we can do it easily with the WinAPI.

That last attempt was not beneficial, presumably due to the cost of finding out the current NUMA node. So indeed the best approach would be to bind the threads to specific CPUs (or, I think better, to bind N/M threads to each NUMA node, where N is the number of threads and M is the number of nodes). The Thread class would be aware of its NUMA node.

@vondele
Member Author

vondele commented May 16, 2024

While a simple boolean has some appeal, in my experience it tends to give too little control. E.g. one might want to run engine testing matches on a 128c/256T machine using 64 threads per engine; in that case the user would need some control. Actually, good idea to mention lscpu; e.g. for the 2x EPYC it reports:

NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-255

So one could adopt as a format:

setoption name affinityMasks value 0-15,128-143:16-31,144-159:32-47,160-175:48-63,176-191:64-79,192-207:80-95,208-223:96-111,224-239:112-127,240-255

which would allow full user control, and would also be fairly easy to document/hint.
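For illustration, a minimal parsing sketch for such a format (a hypothetical helper, not actual Stockfish code), assuming ':' separates NUMA domains, ',' separates entries within a domain, and "first-last" denotes an inclusive CPU range:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parse e.g. "0-15,128-143:16-31,144-159" into one CPU-index list per
// NUMA domain. Hypothetical helper, just to illustrate the format.
std::vector<std::vector<std::size_t>> parse_affinity(const std::string& s) {
    std::vector<std::vector<std::size_t>> domains;
    std::istringstream domainStream(s);
    std::string domain;
    while (std::getline(domainStream, domain, ':')) {  // ':' separates domains
        std::vector<std::size_t> cpus;
        std::istringstream rangeStream(domain);
        std::string range;
        while (std::getline(rangeStream, range, ',')) {  // ',' separates ranges
            const std::size_t dash  = range.find('-');
            const std::size_t first = std::stoull(range.substr(0, dash));
            const std::size_t last  = dash == std::string::npos
                                        ? first
                                        : std::stoull(range.substr(dash + 1));
            for (std::size_t cpu = first; cpu <= last; ++cpu)
                cpus.push_back(cpu);  // expand "first-last" inclusively
        }
        domains.push_back(std::move(cpus));
    }
    return domains;
}
```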

One could indeed consider a value of 'auto' that actually calls lscpu and parses the output.

@vondele
Member Author

vondele commented May 16, 2024

Actually, it turns out the relevant part of the lscpu output can be confusing as well, so the 'auto' mode would probably fail to parse this:

NUMA:                   
  NUMA node(s):         36
  NUMA node0 CPU(s):    0-71
  NUMA node1 CPU(s):    72-143
  NUMA node2 CPU(s):    144-215
  NUMA node3 CPU(s):    216-287
  NUMA node4 CPU(s):    
  NUMA node5 CPU(s):    
  NUMA node6 CPU(s):    
  NUMA node7 CPU(s):    
  NUMA node8 CPU(s):    
  ... (up to node35)

@Sopel97
Member

Sopel97 commented May 16, 2024

one might want to run engine testing matches on a 128c/256T machine using 64 threads per engine, in that case the user would need some control

Right, we could add a UCI option for an overall processor affinity mask, but that's kind of taking over what taskset should be doing a level above.

for the 2xepyc it reports:

Now, I'm not sure how desirable it would be to use all of these nodes (2 sockets but 4 nodes per CPU); I guess it shouldn't hurt, other than memory usage, which should be irrelevant for these use cases at this scale?

so you could adopt as a format

Yes, I think if we mimicked the lscpu format it would be good.

So with this I presume it should assign threads to NUMA domains in a greedy manner, choosing the node with the lowest utilization (by percentage, calculated with the thread to be added). So, for example, for

0-3:4-19 and, say, 16 threads, it would assign 3 threads to the first node and 13 threads to the second node.

Edit: on second thought, this might be suboptimal if the number of threads for Stockfish is much smaller than the number of vCPUs. It will need some better strategy, or we rely on the user to provide tight NUMA node bounds. There is also a potential issue with conflicting affinity being set from outside, but I'm not sure how to resolve that.
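For illustration, a sketch of the greedy strategy described above (hypothetical, assuming the size of each domain is known): each new thread goes to the domain whose fill ratio, with the thread included, is lowest:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Greedy thread placement: each new thread is assigned to the NUMA domain
// whose fill ratio, computed with this thread included, is lowest.
// Returns the number of threads assigned to each domain.
std::vector<std::size_t> assign_threads(const std::vector<std::size_t>& domainSizes,
                                        std::size_t threadCount) {
    std::vector<std::size_t> counts(domainSizes.size(), 0);
    for (std::size_t t = 0; t < threadCount; ++t) {
        std::size_t best = 0;
        double bestFill  = std::numeric_limits<double>::max();
        for (std::size_t d = 0; d < domainSizes.size(); ++d) {
            const double fill = double(counts[d] + 1) / double(domainSizes[d]);
            if (fill < bestFill) {
                bestFill = fill;
                best     = d;
            }
        }
        ++counts[best];
    }
    return counts;  // e.g. domains {4, 16} and 16 threads -> {3, 13}, as above
}
```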

One could indeed consider a value of 'auto' that actually calls lscpu and parses the output.

Sounds good. We should have a way to disable the feature (also by default, maybe to change later), maybe a disabled value.

Actually, it turns out the relevant part of lscpu can be broken as well, so here the 'auto' mode would probably fail to parse this:

That's a quality-of-implementation issue; I think we can handle this well.
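For illustration, a sketch of such handling (a hypothetical helper; real code would need to be more careful about formats and locales) that simply skips NUMA node lines with an empty CPU list, like the node4-node35 lines above:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Extract the CPU list of each non-empty NUMA node from lscpu output.
// Sketch only: lines like "NUMA node4 CPU(s):" with no CPUs are skipped,
// and the summary line "NUMA node(s): 36" is ignored.
std::vector<std::string> numa_cpu_lists(std::istream& lscpuOutput) {
    std::vector<std::string> lists;
    std::string line;
    while (std::getline(lscpuOutput, line)) {
        const std::size_t node  = line.find("NUMA node");
        const std::size_t label = line.find("CPU(s):");
        if (node == std::string::npos || label == std::string::npos)
            continue;                              // not a per-node CPU line
        std::string cpus = line.substr(label + 7); // text after "CPU(s):"
        const std::size_t first = cpus.find_first_not_of(" \t");
        if (first == std::string::npos)
            continue;                              // empty node, skip it
        const std::size_t last = cpus.find_last_not_of(" \t");
        lists.push_back(cpus.substr(first, last - first + 1));
    }
    return lists;
}
```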

@vondele
Member Author

vondele commented May 16, 2024

one might want to run engine testing matches on a 128c/256T machine using 64 threads per engine, in that case the user would need some control

right, we could add an uci option for overall processor affinity mask, but that's kinda taking over what taskset should be doing a level above

taskset can't fully cater to everything, e.g. the other node with 4 sockets and, say, engine matches where each engine uses 2 sockets. Admittedly, these are corner cases.

for the 2xepyc it reports:

now, I'm not sure how desirable it would be to do all of these nodes (2 sockets but 4 nodes per cpu), I guess it shouldn't hurt other than memory usage which should be irrelevant for these usecases at this scale?

Yes, net memory usage is irrelevant in these cases.

so with this I presume it should assign threads to numa domains in a greedy manner, choosing the node with lowest utilization (by %, calculated with the thread to be added). So for example for

0-3:4-19 and say 16 threads, it would assign 3 threads to the first node and 13 threads to the second node

Probably anything that doesn't fully populate the CPUs of the specified domain is suspect and likely a user error. Greedy sounds like a good option; one could also just fill domains in order, or round-robin.

One could indeed consider a value of 'auto' that actually calls lscpu and parses the output.

sounds good. We should have a way to disable the feature (also by default, maybe change later), maybe disabled

Yes, it must be possible to disable it; disabled or none should possibly even be the default. We can see.

@Sopel97
Member

Sopel97 commented May 16, 2024

Okay, I think I have a good enough idea about the overall behaviour of this to make a good prototype.

Sopel97 added a commit to Sopel97/Stockfish that referenced this issue May 28, 2024
…nsure execution on a specific NUMA node.

vondele pushed a commit to vondele/Stockfish that referenced this issue May 28, 2024
…nsure execution on a specific NUMA node.
Disservin pushed a commit to Disservin/Stockfish that referenced this issue May 28, 2024
Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node.

This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all the machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. The old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr, etc.

Windows 7 and Windows 10 are partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.

-----------------

A new UCI option `NumaPolicy` is introduced. It can take the following values:
```
system - gathers NUMA node information from the system (lscpu or the Windows API), binds each thread to a single NUMA node
none - assumes there is 1 NUMA node, never binds threads
auto - this is the default value; depending on the number of set threads and NUMA nodes, it will only enable binding on multi-node systems and when the number of threads reaches a threshold (dependent on node size and count)
[[custom]] -
  // ':'-separated numa nodes
  // ','-separated cpu indices
  // supports "first-last" range syntax for cpu indices,
  for example '0-15,32-47:16-31,48-63'
```

Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.
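Illustrative UCI usage (all values are examples only, in the style of the option names introduced above):

```
setoption name NumaPolicy value system
setoption name Threads value 256
setoption name Hash value 16000
go movetime 100000
```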

The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.

Special care is taken that maximum memory usage on systems that do not require memory replication stays as before; that is, unnecessary copies are avoided.

On Linux the process's processor affinity is respected. This means that if you, for example, use taskset to restrict Stockfish to a single NUMA node, then the `system` and `auto` settings will see only a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.
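For illustration, the affinity mask visible to the process can be queried on Linux roughly as follows (a sketch, not the patch's actual code):

```cpp
#include <sched.h>
#include <vector>

// CPUs the current process may run on, as constrained e.g. by taskset.
// Sketch only; error handling and systems with >CPU_SETSIZE CPUs omitted.
std::vector<int> allowed_cpus() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
        for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
            if (CPU_ISSET(cpu, &mask))
                cpus.push_back(cpu);
    return cpus;
}
```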

-----------------

We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on Linux, or using appropriate custom allocators on Windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on the first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on Linux, but results may vary.
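For illustration, a minimal sketch of the first-touch idea (simplified; `bind_current_thread_to_node` is a hypothetical placeholder for the binding machinery, and in practice the touching thread would be an already-bound search thread):

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <thread>

// First-touch replication sketch: a thread bound to the target node performs
// the first writes, so the kernel allocates the pages on that node. Relies
// on the allocator returning fresh, untouched memory from the system.
std::unique_ptr<char[]> replicate_weights(const char* weights, std::size_t bytes) {
    // default-initialized, so the allocating thread does not touch the pages
    std::unique_ptr<char[]> copy(new char[bytes]);
    std::thread toucher([&] {
        // bind_current_thread_to_node(node);   // hypothetical; binding omitted
        std::memcpy(copy.get(), weights, bytes);  // first touch happens here
    });
    toucher.join();
    return copy;
}
```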

macOS is not supported because, AFAIK, it's not affected, and an implementation would be problematic anyway.

Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Before Windows 11/Server 2022, NUMA nodes are split such that they cannot span processor groups, because on those versions it is not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinities spanning processor groups, so this splitting is not done and the behaviour is pretty much like on Linux.

Linux is supported **without** a libnuma requirement; `lscpu` is expected to be available.

-----------------

Passed 60+1 @ 256t 16000MB hash: https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8
```
LLR: 2.95 (-2.94,2.94) <0.00,10.00>
Total: 278 W: 110 L: 29 D: 139
Ptnml(0-2): 0, 1, 56, 82, 0
```

Passed SMP STC: https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd
```
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 67152 W: 17354 L: 17177 D: 32621
Ptnml(0-2): 64, 7428, 18408, 7619, 57
```

Passed STC: https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c
```
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 131648 W: 34155 L: 34045 D: 63448
Ptnml(0-2): 426, 13878, 37096, 14008, 416
```

fixes official-stockfish#5253
closes official-stockfish#5285

No functional change
@mstembera
Contributor

Do we know if it still makes sense to use hyper-threading on these systems, where we are memory-bandwidth limited?

@vondele
Member Author

vondele commented Aug 12, 2024

It is definitely no longer clear-cut in my experience, but I think it is still advantageous. There might be a dependency on the hardware, though.
