Suboptimal speed on multi-socket / numa systems #5253

Closed
vondele opened this issue May 16, 2024 · 9 comments
@vondele
Member

vondele commented May 16, 2024

Describe the issue

We observe slower-than-expected nodes per second (NPS) on multi-socket and/or NUMA systems (e.g. local, TCEC, potentially CCC). The likely reason is increased memory bandwidth contention due to the larger networks and accumulator caches.

Expected behavior

NPS/performance should follow more closely the NPS on single-CPU / single-NUMA-domain chips. Potential speedups could be 2x and larger.

Steps to reproduce

Reproducing/testing needs access to a multi-socket system, and either looking at historical data or doing a comparison with a synthetic benchmark, in which the speed of multiple instances of SF with fewer threads (each bound to a NUMA domain) is compared to one instance of SF with correspondingly more threads. An example is the following set of historical data:

                                                             sha                             date       pinned      default
                        dcb02337844d71e56df57b9a8ba17646f953711c        2024-05-15T16:27:03+02:00   8273351050   3266542305
                        49ef4c935a5cb0e4d94096e6354caa06b36b3e3c        2024-04-24T18:38:20+02:00   8548621496   3491153172
                        0716b845fdef8a20102b07eaec074b8da8162523        2024-04-02T08:49:48+02:00   8059206362   4248207804
                        bd579ab5d1a931a09a62f2ed33b5149ada7bc65f        2024-03-07T19:53:48+01:00   9424434014   5654516665
                        e67cc979fd2c0e66dfc2b2f2daa0117458cfc462        2024-02-24T18:15:04+01:00   9470247485   5864264415
                        8e75548f2a10969c1c9211056999efbcebe63f9a        2024-02-17T17:11:46+01:00   9411936135   6000562457
                        6deb88728fb141e853243c2873ad0cda4dd19320        2024-01-08T18:34:36+01:00   9346121150   5796929238
                        f12035c88c58a5fd568d26cde9868f73a8d7b839        2023-12-30T11:08:03+01:00   9454744857   6531509540
                        afe7f4d9b0c5e1a1aa224484d2cd9e04c7f099b9        2023-09-29T22:30:27+02:00   9235658137   7375676116
                        70ba9de85cddc5460b1ec53e0a99bee271e26ece        2023-09-22T19:26:16+02:00   9284309149   7583913771
                        3d1b067d853d6e8cc22cf18c1abb4cd9833dd38f        2023-09-11T22:37:39+02:00  10860286083  10026471309
                        e699fee513ce26b3794ac43d08826c89106e10ea        2023-07-06T23:03:58+02:00  10399561315   9485800488
                        915532181f11812c80ef0b57bc018de4ea2155ec        2023-07-01T13:34:30+02:00  10172761439   8869020884

Here, default runs a single instance of SF with 256 threads without pinning, while the pinned setup uses 8 instances of SF, each pinned to a suitable NUMA domain. It can be seen that a year ago the performance difference was just about 10%, whereas now the difference is about 200%.

Data generated with:

for version in  dcb02337844d71e56df57b9a8ba17646f953711c 49ef4c935a5cb0e4d94096e6354caa06b36b3e3c 0716b845fdef8a20102b07eaec074b8da8162523 bd579ab5d1a931a09a62f2ed33b5149ada7bc65f e67cc979fd2c0e66dfc2b2f2daa0117458cfc462 8e75548f2a10969c1c9211056999efbcebe63f9a 6deb88728fb141e853243c2873ad0cda4dd19320 f12035c88c58a5fd568d26cde9868f73a8d7b839 afe7f4d9b0c5e1a1aa224484d2cd9e04c7f099b9 70ba9de85cddc5460b1ec53e0a99bee271e26ece 3d1b067d853d6e8cc22cf18c1abb4cd9833dd38f e699fee513ce26b3794ac43d08826c89106e10ea 915532181f11812c80ef0b57bc018de4ea2155ec ef94f77f8c827a2395f1c40f53311a3b1f20bc5b a49b3ba7ed5d9be9151c8ceb5eed40efe3387c75 932f5a2d657c846c282adcf2051faef7ca17ae15 373359b44d0947cce2628a9a8c9b432a458615a8 c1fff71650e2f8bf5a2d63bdc043161cdfe8e460 41f50b2c83a0ba36a2b9c507c1783e57c9b13485 68e1e9b3811e16cad014b590d7443b9063b3eb52 758f9c9350abee36a5865ec701560db8ea62004d e6e324eb28fd49c1fc44b3b65784f85a773ec61c 7262fd5d14810b7b495b5038e348a448fda1bcc3 773dff020968f7a6f590cfd53e8fd89f12e15e36 3597f1942ec6f2cfbd50b905683739b0900ff5dd c306d838697011da0a960758dde3f7ede6849060 c3483fa9a7d7c0ffa9fcc32b467ca844cfb63790
do

git checkout $version >& checkout.log.$version
make -j ARCH=x86-64-avx2 profile-build  >& build.log.$version
mv stockfish stockfish.$version

for split in 1 2 4 8 16 32
do

threads=$((256/split))
hash=$((128000/split))

cat << EOF > inp_$split
setoption name Threads value $threads
setoption name Hash value $hash
go movetime 100000
ucinewgame
quit
EOF

done

# no binding
split=1
for instance in `seq 1 $split`
do
cat inp_$split | ./stockfish.$version > out_nobind_${split}_${instance} &
done
wait

total_nodes_nobind=0
for instance in `seq 1 $split`
do
nodes=`grep -B1 bestmove out_nobind_${split}_${instance} | grep -o "nodes [0-9]*" | awk '{print $2}'`
total_nodes_nobind=$((total_nodes_nobind + nodes))
done


# binding
split=8
threads=$((256/split))
for instance in `seq 1 $split`
do
# this cpu list must match the numa domains ... depends on the system
tasksetlow=$(((instance-1)*threads/2))
tasksethigh=$(((instance-1)*threads/2 + threads/2 - 1))
cat inp_$split | taskset --cpu-list $tasksetlow-$tasksethigh,$((128+tasksetlow))-$((128+tasksethigh)) ./stockfish.$version > out_bind_${split}_${instance} &
done
wait

total_nodes_bind=0
for instance in `seq 1 $split`
do
nodes=`grep -B1 bestmove out_bind_${split}_${instance} | grep -o "nodes [0-9]*" | awk '{print $2}'`
total_nodes_bind=$((total_nodes_bind + nodes))
done

# commit date for this version
epoch=`git show --pretty=fuller --date=iso-strict $version | grep 'CommitDate' | awk '{print $NF}'`

printf "%64s %32s %12d %12d\n" $version $epoch $total_nodes_bind $total_nodes_nobind

done

On the same system, the following performance is observed according to the splitting/pinning strategy:

 split         bind      no bind
     1   3387763787   3412555326
     2   6249669886   4568189260
     4   8345110549   3516812166
     8   8283342858   3523839484
    16   8013377138   2302587649
    32   7888404092   3094224313

On a different system with 4 sockets the observation is:

 split         bind      no bind
     1   9204147529   9082536561
     2  15654018181  10057157190
     4  20958931771   8864636565
     8  20290433821   4744173824
    16  19448457275   3913568825

Anything else?

Tentatively, a solution could be to introduce thread affinity and replicate the network weights across NUMA domains. A potential interface could be along the lines of:

# specify cpu masks for two numa domains
setoption name affinityMasks value 0xFF00,0x00FF
# round-robin allocate threads to these domains, with a net allocated for each numa domain
setoption name Threads value 256

Some more discussion here: https://discord.com/channels/435943710472011776/813919248455827515/1240709279049191525
as well as a rebased version of code that goes in that direction (but needs further work):
https://github.com/official-stockfish/Stockfish/compare/master...Disservin:Stockfish:numareplicatedweights?expand=1

Operating system

All

Stockfish version

master

@Sopel97
Member

Sopel97 commented May 16, 2024

I'd be in favor of adding a simple boolean NUMA yes/no UCI option. The affinity mask may get unwieldy. On Linux we can use either libnuma (which is a bit of a problematic dependency) or parse lscpu output. On Windows we can do it easily with the WinAPI.

That last attempt was not beneficial, presumably due to the cost of finding out the current NUMA node. So indeed the best approach would be to bind the threads to specific CPUs (or, I think better, to bind N/M threads to each NUMA node, where N is the number of threads and M is the number of nodes). The Thread class would be aware of its NUMA node.

@vondele
Member Author

vondele commented May 16, 2024

While a simple boolean has some appeal, in my experience it tends to give too little control. E.g. one might want to run engine testing matches on a 128c/256T machine using 64 threads per engine; in that case the user would need some control. Actually, good idea to mention lscpu; e.g. for the 2x EPYC it reports:

NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-255

So one could adopt as a format:

setoption name affinityMasks value 0-15,128-143:16-31,144-159:32-47,160-175:48-63,176-191:64-79,192-207:80-95,208-223:96-111,224-239:112-127,240-255

which would allow full user control, and would also be fairly easy to document/hint.
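For illustration, a minimal parsing sketch for such a format (a hypothetical helper, not actual Stockfish code), assuming ':' separates NUMA domains, ',' separates entries within a domain, and "first-last" denotes an inclusive CPU range:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parse e.g. "0-15,128-143:16-31,144-159" into one CPU-index list per
// NUMA domain. Hypothetical helper, just to illustrate the format.
std::vector<std::vector<std::size_t>> parse_affinity(const std::string& s) {
    std::vector<std::vector<std::size_t>> domains;
    std::istringstream domainStream(s);
    std::string domain;
    while (std::getline(domainStream, domain, ':')) {  // ':' separates domains
        std::vector<std::size_t> cpus;
        std::istringstream rangeStream(domain);
        std::string range;
        while (std::getline(rangeStream, range, ',')) {  // ',' separates ranges
            const std::size_t dash  = range.find('-');
            const std::size_t first = std::stoull(range.substr(0, dash));
            const std::size_t last  = dash == std::string::npos
                                        ? first
                                        : std::stoull(range.substr(dash + 1));
            for (std::size_t cpu = first; cpu <= last; ++cpu)
                cpus.push_back(cpu);  // expand "first-last" inclusively
        }
        domains.push_back(std::move(cpus));
    }
    return domains;
}
```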

One could indeed consider a value of 'auto' that actually calls lscpu and parses the output.

@vondele
Member Author

vondele commented May 16, 2024

Actually, it turns out the relevant part of the lscpu output can be confusing as well, so the 'auto' mode would probably fail to parse this:

NUMA:                   
  NUMA node(s):         36
  NUMA node0 CPU(s):    0-71
  NUMA node1 CPU(s):    72-143
  NUMA node2 CPU(s):    144-215
  NUMA node3 CPU(s):    216-287
  NUMA node4 CPU(s):    
  NUMA node5 CPU(s):    
  NUMA node6 CPU(s):    
  NUMA node7 CPU(s):    
  NUMA node8 CPU(s):    
  ... (up to node35)

@Sopel97
Member

Sopel97 commented May 16, 2024

one might want to run engine testing matches on a 128c/256T machine using 64 threads per engine, in that case the user would need some control

Right, we could add a UCI option for an overall processor affinity mask, but that's kind of taking over what taskset should be doing a level above.

for the 2xepyc it reports:

Now, I'm not sure how desirable it would be to use all of these nodes (2 sockets but 4 nodes per CPU); I guess it shouldn't hurt, other than memory usage, which should be irrelevant for these use cases at this scale?

so you could adopt as a format

Yes, I think if we mimicked the lscpu format it would be good.

So with this I presume it should assign threads to NUMA domains in a greedy manner, choosing the node with the lowest utilization (by percentage, calculated with the thread to be added). So, for example, for

0-3:4-19 and, say, 16 threads, it would assign 3 threads to the first node and 13 threads to the second node.

Edit: on second thought, this might be suboptimal if the number of threads for Stockfish is much smaller than the number of vCPUs. It will need some better strategy, or we rely on the user to provide tight NUMA node bounds. There is also a potential issue with conflicting affinity being set from outside, but I'm not sure how to resolve that.
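For illustration, a sketch of the greedy strategy described above (hypothetical, assuming the size of each domain is known): each new thread goes to the domain whose fill ratio, with the thread included, is lowest:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Greedy thread placement: each new thread is assigned to the NUMA domain
// whose fill ratio, computed with this thread included, is lowest.
// Returns the number of threads assigned to each domain.
std::vector<std::size_t> assign_threads(const std::vector<std::size_t>& domainSizes,
                                        std::size_t threadCount) {
    std::vector<std::size_t> counts(domainSizes.size(), 0);
    for (std::size_t t = 0; t < threadCount; ++t) {
        std::size_t best = 0;
        double bestFill  = std::numeric_limits<double>::max();
        for (std::size_t d = 0; d < domainSizes.size(); ++d) {
            const double fill = double(counts[d] + 1) / double(domainSizes[d]);
            if (fill < bestFill) {
                bestFill = fill;
                best     = d;
            }
        }
        ++counts[best];
    }
    return counts;  // e.g. domains {4, 16} and 16 threads -> {3, 13}, as above
}
```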

One could indeed consider a value of 'auto' that actually calls lscpu and parses the output.

Sounds good. We should have a way to disable the feature (also by default, maybe to change later), maybe a disabled value.

Actually, it turns out the relevant part of lscpu can be broken as well, so here the 'auto' mode would probably fail to parse this:

That's a quality-of-implementation issue; I think we can handle this well.
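For illustration, a sketch of such handling (a hypothetical helper; real code would need to be more careful about formats and locales) that simply skips NUMA node lines with an empty CPU list, like the node4-node35 lines above:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Extract the CPU list of each non-empty NUMA node from lscpu output.
// Sketch only: lines like "NUMA node4 CPU(s):" with no CPUs are skipped,
// and the summary line "NUMA node(s): 36" is ignored.
std::vector<std::string> numa_cpu_lists(std::istream& lscpuOutput) {
    std::vector<std::string> lists;
    std::string line;
    while (std::getline(lscpuOutput, line)) {
        const std::size_t node  = line.find("NUMA node");
        const std::size_t label = line.find("CPU(s):");
        if (node == std::string::npos || label == std::string::npos)
            continue;                              // not a per-node CPU line
        std::string cpus = line.substr(label + 7); // text after "CPU(s):"
        const std::size_t first = cpus.find_first_not_of(" \t");
        if (first == std::string::npos)
            continue;                              // empty node, skip it
        const std::size_t last = cpus.find_last_not_of(" \t");
        lists.push_back(cpus.substr(first, last - first + 1));
    }
    return lists;
}
```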

@vondele
Member Author

vondele commented May 16, 2024

one might want to run engine testing matches on a 128c/256T machine using 64 threads per engine, in that case the user would need some control

right, we could add an uci option for overall processor affinity mask, but that's kinda taking over what taskset should be doing a level above

taskset can't fully cater to everything, e.g. the other node with 4 sockets and, say, engine matches where each engine uses 2 sockets. Admittedly, these are corner cases.

for the 2xepyc it reports:

now, I'm not sure how desirable it would be to do all of these nodes (2 sockets but 4 nodes per cpu), I guess it shouldn't hurt other than memory usage which should be irrelevant for these usecases at this scale?

Yes, net memory usage is irrelevant in these cases.

so with this I presume it should assign threads to numa domains in a greedy manner, choosing the node with lowest utilization (by %, calculated with the thread to be added). So for example for

0-3:4-19 and say 16 threads, it would assign 3 threads to the first node and 13 threads to the second node

Probably anything that doesn't fully populate the CPUs of the specified domain is suspect and likely a user error. Greedy sounds like a good option; one could also just fill domains in order, or round-robin.

One could indeed consider a value of 'auto' that actually calls lscpu and parses the output.

sounds good. We should have a way to disable the feature (also by default, maybe change later), maybe disabled

Yes, it must be possible to disable it; disabled or none should possibly even be the default. We can see.

@Sopel97
Member

Sopel97 commented May 16, 2024

Okay, I think I have a good enough idea about the overall behaviour of this to make a good prototype.

Sopel97 added a commit to Sopel97/Stockfish that referenced this issue May 28, 2024
…nsure execution on a specific NUMA node.

vondele pushed a commit to vondele/Stockfish that referenced this issue May 28, 2024
…nsure execution on a specific NUMA node.
Disservin pushed a commit to Disservin/Stockfish that referenced this issue May 28, 2024
Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node.

This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all the machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. The old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr, etc.

Windows 7 and Windows 10 are partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.

-----------------

A new UCI option `NumaPolicy` is introduced. It can take the following values:
```
system - gathers NUMA node information from the system (lscpu or the Windows API), binds each thread to a single NUMA node
none - assumes there is 1 NUMA node, never binds threads
auto - this is the default value; depending on the number of set threads and NUMA nodes, it will only enable binding on multi-node systems and when the number of threads reaches a threshold (dependent on node size and count)
[[custom]] -
  // ':'-separated numa nodes
  // ','-separated cpu indices
  // supports "first-last" range syntax for cpu indices,
  for example '0-15,32-47:16-31,48-63'
```

Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.
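Illustrative UCI usage (all values are examples only, in the style of the option names introduced above):

```
setoption name NumaPolicy value system
setoption name Threads value 256
setoption name Hash value 16000
go movetime 100000
```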

The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.

Special care is taken that maximum memory usage on systems that do not require memory replication stays as before; that is, unnecessary copies are avoided.

On Linux the process's processor affinity is respected. This means that if you, for example, use taskset to restrict Stockfish to a single NUMA node, then the `system` and `auto` settings will see only a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.
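For illustration, the affinity mask visible to the process can be queried on Linux roughly as follows (a sketch, not the patch's actual code):

```cpp
#include <sched.h>
#include <vector>

// CPUs the current process may run on, as constrained e.g. by taskset.
// Sketch only; error handling and systems with >CPU_SETSIZE CPUs omitted.
std::vector<int> allowed_cpus() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
        for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
            if (CPU_ISSET(cpu, &mask))
                cpus.push_back(cpu);
    return cpus;
}
```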

-----------------

We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on Linux, or using appropriate custom allocators on Windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on the first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on Linux, but results may vary.
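For illustration, a minimal sketch of the first-touch idea (simplified; `bind_current_thread_to_node` is a hypothetical placeholder for the binding machinery, and in practice the touching thread would be an already-bound search thread):

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <thread>

// First-touch replication sketch: a thread bound to the target node performs
// the first writes, so the kernel allocates the pages on that node. Relies
// on the allocator returning fresh, untouched memory from the system.
std::unique_ptr<char[]> replicate_weights(const char* weights, std::size_t bytes) {
    // default-initialized, so the allocating thread does not touch the pages
    std::unique_ptr<char[]> copy(new char[bytes]);
    std::thread toucher([&] {
        // bind_current_thread_to_node(node);   // hypothetical; binding omitted
        std::memcpy(copy.get(), weights, bytes);  // first touch happens here
    });
    toucher.join();
    return copy;
}
```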

macOS is not supported because, AFAIK, it's not affected, and an implementation would be problematic anyway.

Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Before Windows 11/Server 2022, NUMA nodes are split such that they cannot span processor groups, because on those versions it is not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinities spanning processor groups, so this splitting is not done and the behaviour is pretty much like on Linux.

Linux is supported **without** a libnuma requirement; `lscpu` is expected to be available.

-----------------

Passed 60+1 @ 256t 16000MB hash: https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8
```
LLR: 2.95 (-2.94,2.94) <0.00,10.00>
Total: 278 W: 110 L: 29 D: 139
Ptnml(0-2): 0, 1, 56, 82, 0
```

Passed SMP STC: https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd
```
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 67152 W: 17354 L: 17177 D: 32621
Ptnml(0-2): 64, 7428, 18408, 7619, 57
```

Passed STC: https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c
```
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 131648 W: 34155 L: 34045 D: 63448
Ptnml(0-2): 426, 13878, 37096, 14008, 416
```

fixes official-stockfish#5253
closes official-stockfish#5285

No functional change
@mstembera
Contributor

Do we know if it still makes sense to use hyper-threading on these systems, where we are memory-bandwidth limited?

@vondele
Member Author

vondele commented Aug 12, 2024

It is definitely no longer clear-cut in my experience, but I think it is still advantageous. There might be a dependency on the hardware, though.
