bind replication threads to specific cores #1305

porcuquine · 2020-10-06T02:26:06Z

In order to maximize replication (precommit phase 1) performance, bind all replication threads to a single core (distinct from other replication threads). Consumer and producer threads for each replication task should be on cores which share cache. If multiple such cores exist, keep threads for a single replication task together (on cores sharing a cache). Otherwise, group cores based on the number of threads required. We create groups even when they do not correspond to a shared cache in order to make use of the core binding facility used in the preferred case.

If a core cannot be bound, a message will be logged in a debug mode, but operation is not interrupted. The thread is simply scheduled without constraint as though no attempt to bind a core had been made.

NOTE: This implementation does not support currently support core binding on macos.

Document in README.
Identify and remove minimal cause of perf regression (vs best seen in this work, not vs previous).

PERF UPDATES

I benchmarked replication of a 32GiB sector with the final code at 2:03:05 (on 3970x) — so regression is conquered.
I benchmarked the currently-latest version without mlocked window pages (and not run as root) and saw a noticeably faster result: 2:02:23.
2 sectors, with window pages locked: 2:06:27
2 sectors, without window pages locked 2:06:17

.circleci/config.yml

storage-proofs/porep/Cargo.toml

storage-proofs/porep/src/stacked/vanilla/cores.rs

vmx · 2020-10-06T14:17:12Z

@porcuquine The CI is setup properly, hwloc is installed on Linux as well as macOS. The current CI failure seem to be actual failures.

storage-proofs/porep/src/stacked/vanilla/memory_handling.rs

storage-proofs/porep/src/stacked/vanilla/create_label/multi.rs

storage-proofs/porep/src/stacked/vanilla/memory_handling.rs

Dizans · 2020-10-07T07:41:23Z

Just curious, how much can this optimize accelerate the PreCommitPhase1 on AMD 3970X.

porcuquine · 2020-10-07T15:12:11Z

@Dizans I put the latest benchmark in the description. It is almost down to two hours on 3970x.

storage-proofs/porep/src/stacked/vanilla/cores.rs

storage-proofs/porep/src/stacked/vanilla/create_label/multi.rs

storage-proofs/porep/src/stacked/vanilla/cores.rs

xueyangl · 2021-04-06T08:04:43Z

@porcuquine i am trying to run lotus-bench 1.5.2 on Intel server platform with 2 Icelake cpu , i run 14 task in parallel, at the very begin of replication, your changes can work well, there are 7 tasks bind into cpu0 and 7 tasks bind into cpu1, however after running some time but still in replication, i saw the binding rule is changed, some tasks(maybe in layer 6 or layer 7) bind to cpu0 originally are re-bind to cpu1, do you have any idea about this symptom ?

porcuquine · 2021-04-07T00:57:12Z

I don't have any insight into why that would be, but I'll be curious to hear as you or others learn more. I think @dignifiedquire may soon have hardware on which he could experiment.

xueyangl · 2021-04-08T08:29:34Z

@porcuquine so glad to hear your feedback, just want to know why you didn't consider HWLOC_CPUBIND_STRICT when trying to bind CPU instead of using CPUBIND_THREAD . actually , we are trying CPUBIND_STRICT on our symptom to see if there is any improvement.

porcuquine · 2021-04-14T05:26:35Z

I don't remember thinking about it either way. The current implementation worked well on my test hardware, which is most of what I was considering at the time.

xueyangl · 2021-04-14T13:59:54Z

So glad to hear your feedback, actually I am working for intel , and we have Icelake products right now which can support SHA256. We test both intel platform and AMD platform, both have this kind of issues, I am little confused what’s different between us, my platform configuration : Platform ：Intel Whitley Server Platform CPU : Intel Icelake CPU x 2 Memory :2T, each CPU has 1T Regarding to CPU binding code, I still have some questions, could you please help clarify ? Let’s have an example of one task with multi_core_sdr enabled. Consumer thread binding to CPU core at the early beginning -- my understanding is consumer thread should be fixed on one cpu core during whole progress, should not migrate to other core . Producer thread : on each layer, producer thread will do cpu core binding again , so it means for different layer, producer thread may be bind onto different CPU core. Is my understanding correct ? If my understanding is correct, so how can we guarantee consumer thread and producer threads are in same socket ? is it possible consumer thread bind onto CPU0 core and related producer threads bind onto CPU1 cores? Thanks Derek From: porcuquine ***@***.***> Sent: Wednesday, April 14, 2021 1:27 PM To: filecoin-project/rust-fil-proofs ***@***.***> Cc: Lun, Derek ***@***.***>; Comment ***@***.***> Subject: Re: [filecoin-project/rust-fil-proofs] bind replication threads to specific cores (#1305) I don't remember thinking about it either way. The current implementation worked well on my test hardware, which is most of what I was considering at the time. — You are receiving this because you commented. Reply to this email directly, view it on GitHub<#1305 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AIGV5DNFSAYW44KMLNOWSCDTIURRPANCNFSM4SFNQUFQ>.

porcuquine · 2021-04-15T04:29:05Z

Out of curiosity, how many cores share an L2 L3 cache on this Ice Lake CPU? You will need to make sure you don't try to use more processes than this number (set appropriate environment variables) for sealing, since shared cache is important to this optimization.

Consumer thread binding to CPU core at the early beginning -- my understanding is consumer thread should be fixed on one cpu core during whole progress, should not migrate to other core .

That should be true.

Producer thread : on each layer, producer thread will do cpu core binding again , so it means for different layer, producer thread may be bind onto different CPU core.

Although the binding is released between layers, the same one should be reacquired for the next layer. Maybe some other thread is getting scheduled there in between, and this reacquisition is failing. The code notes:

                // This could fail, but we will ignore the error if so.
                // It will be logged as a warning by `bind_core`.

Check your logs for evidence (or lack thereof) that this may be happening. Make sure to set log level appropriately.

If my understanding is correct, so how can we guarantee consumer thread and producer threads are in same socket ? is it possible consumer thread bind onto CPU0 core and related producer threads bind onto CPU1 cores?

You might want to also log at DEBUG level and check the output from cores.rs. This might help you understand what is happening. If you compare that output against what you know about the topology, you might be able to confirm or deny that things are being detected and chosen as expected.

xueyangl · 2021-04-15T04:56:20Z

Here is the hwloc information I got from my dual sockets system, on Intel ICX CPU(should be all Intel CPU), L2 is specific to core , not shared, each core has specific 1.28M L2, L3 is shared between all cores. I am getting the log with debug level= Trace and will do further analysis later. Appreciate for your help, please share me any insights if you have . *** Objects at level 0 0: Machine () *** Objects at level 1 0: Package () 1: Package () *** Objects at level 2 0: L3 (48MB) 1: L3 (48MB) *** Objects at level 3 0: L2 (1280KB) 1: L2 (1280KB) 2: L2 (1280KB) 3: L2 (1280KB) 4: L2 (1280KB) 5: L2 (1280KB) 6: L2 (1280KB) 7: L2 (1280KB) 8: L2 (1280KB) 9: L2 (1280KB) 10: L2 (1280KB) 11: L2 (1280KB) 12: L2 (1280KB) 13: L2 (1280KB) 14: L2 (1280KB) 15: L2 (1280KB) 16: L2 (1280KB) 17: L2 (1280KB) 18: L2 (1280KB) 19: L2 (1280KB) 20: L2 (1280KB) 21: L2 (1280KB) 22: L2 (1280KB) 23: L2 (1280KB) 24: L2 (1280KB) 25: L2 (1280KB) 26: L2 (1280KB) 27: L2 (1280KB) 28: L2 (1280KB) 29: L2 (1280KB) 30: L2 (1280KB) 31: L2 (1280KB) 32: L2 (1280KB) 33: L2 (1280KB) 34: L2 (1280KB) 35: L2 (1280KB) 36: L2 (1280KB) 37: L2 (1280KB) 38: L2 (1280KB) 39: L2 (1280KB) 40: L2 (1280KB) 41: L2 (1280KB) 42: L2 (1280KB) 43: L2 (1280KB) 44: L2 (1280KB) 45: L2 (1280KB) 46: L2 (1280KB) 47: L2 (1280KB) 48: L2 (1280KB) 49: L2 (1280KB) 50: L2 (1280KB) 51: L2 (1280KB) 52: L2 (1280KB) 53: L2 (1280KB) 54: L2 (1280KB) 55: L2 (1280KB) 56: L2 (1280KB) 57: L2 (1280KB) 58: L2 (1280KB) 59: L2 (1280KB) 60: L2 (1280KB) 61: L2 (1280KB) 62: L2 (1280KB) 63: L2 (1280KB) From: porcuquine ***@***.***> Sent: Thursday, April 15, 2021 12:29 PM To: filecoin-project/rust-fil-proofs ***@***.***> Cc: Lun, Derek ***@***.***>; Comment ***@***.***> Subject: Re: [filecoin-project/rust-fil-proofs] bind replication threads to specific cores (#1305) Out of curiosity, how many cores share an L2 cache on this Ice Lake CPU? You will need to make sure you don't try to use more processes than this number (set appropriate environment variables) for sealing, since shared cache is important to this optimization. Consumer thread binding to CPU core at the early beginning -- my understanding is consumer thread should be fixed on one cpu core during whole progress, should not migrate to other core . That should be true. Producer thread : on each layer, producer thread will do cpu core binding again , so it means for different layer, producer thread may be bind onto different CPU core. Although the binding is released between layers, the same one should be reacquired for the next layer. Maybe some other thread is getting scheduled there in between, and this reacquisition is failing. The code notes: // This could fail, but we will ignore the error if so. // It will be logged as a warning by `bind_core`. Check your logs for evidence (or lack thereof) that this may be happening. Make sure to set log level appropriately. If my understanding is correct, so how can we guarantee consumer thread and producer threads are in same socket ? is it possible consumer thread bind onto CPU0 core and related producer threads bind onto CPU1 cores? You might want to also log at DEBUG level and check the output from cores.rs. This might help you understand what is happening. If you compare that output against what you know about the topology, you might be able to confirm or deny that things are being detected and chosen as expected. — You are receiving this because you commented. Reply to this email directly, view it on GitHub<#1305 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AIGV5DKLC5NQJHLW7LPRNLLTIZTR3ANCNFSM4SFNQUFQ>.

porcuquine · 2021-04-15T05:12:06Z

Right, sorry, I should have said L3 (fixed in my original message now).

If the L3 cache is shared by 32 cores, then it seems quite likely this optimization will not help much — or not nearly as much as is ideal. It might help some, so it's still worth trying to make sure you can get the producer threads stably pinned to a single core. But I don't think you will be able to get the very fast inter-core communication (via cache) which makes this particular consumer-producer pattern fast enough to let the producers keep up with the hashing of the consumer.

The idea is that we want the hashing to be the bottleneck, which means we need to be able to feed the cache fast enough. For this to work, the cache needs to be fast — and it needs to stay undisturbed by other work loads. The buffer shared between the cooperating cores (running the producers and consumer threads) needs to stay in that fast cache. This works out well when we can ensure no other work happens on the cores sharing an L3 cache. I am not sure it will work out well with your topology.

xueyangl · 2021-04-15T12:52:20Z

We properly have fixed this NUMA issue on intel cpu, and I am wondering if it can be workaround commited into community to make sure end user can benefit it. I am trying to find some way to identify CPU vendor info , and add some code to check if it is intel cpu this workaround will be executed, make sure no impact to your previous code . can hwloc retrieve cpu information ? vendor or CPUID . From: porcuquine ***@***.***> Sent: Thursday, April 15, 2021 1:12 PM To: filecoin-project/rust-fil-proofs ***@***.***> Cc: Lun, Derek ***@***.***>; Comment ***@***.***> Subject: Re: [filecoin-project/rust-fil-proofs] bind replication threads to specific cores (#1305) Right, sorry, I should have said L3 (fixed in my original message now). If the L3 cache is shared by 32 cores, then it seems quite likely this optimization will not help much — or not nearly as much as is ideal. It might help some, so it's still trying to make sure you can get the producer threads stably pinned to a single core. But I don't think you will be able to get the very fast inter-core communication (via cache) which makes this particular consumer-producer pattern fast enough to let the producers keep up with the hashing of the consumer. The idea is that we want the hashing to be the bottleneck, which means we need to be able to feed the cache fast enough. For this to work, the cache needs to be fast — and it needs to stay undisturbed by other work loads. The buffer shared between the cooperating cores (running the producers and consumer threads) needs to stay in that fast cache. This works out well when we can ensure no other work happens on the cores sharing an L3 cache. I am not sure it will work out well with your topology. — You are receiving this because you commented. Reply to this email directly, view it on GitHub<#1305 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AIGV5DKSL7SCHRQE6UERUBDTIZYTFANCNFSM4SFNQUFQ>.

porcuquine · 2021-04-16T07:55:01Z

I'm not sure hwloc can, but you might try cpuid.

xueyangl · 2021-04-16T07:57:03Z

Yes, appreciate for your great help , we are doing performance measurement now , will sync up with you later . From: porcuquine ***@***.***> Sent: Friday, April 16, 2021 3:55 PM To: filecoin-project/rust-fil-proofs ***@***.***> Cc: Lun, Derek ***@***.***>; Comment ***@***.***> Subject: Re: [filecoin-project/rust-fil-proofs] bind replication threads to specific cores (#1305) I'm not sure hwloc can, but you might try cpuid<https://docs.rs/cpuid/0.1.1/cpuid/>. — You are receiving this because you commented. Reply to this email directly, view it on GitHub<#1305 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AIGV5DPTTRONFNTK62G43F3TI7UOHANCNFSM4SFNQUFQ>.

xueyangl · 2021-04-18T03:07:38Z

Sorry for trouble you again, I have another problem that we frequently saw some PRODUCER NOT READY message when I enable Trace log, Just like below , do you have any idea how to fix it ? or any insight to guide us to improve it ? 2021-04-17T13:20:50.443 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > PRODUCER NOT READY! 978743553 2021-04-17T13:20:50.444 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > PRODUCER NOT READY! 978744449 2021-04-17T13:20:50.445 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > PRODUCER NOT READY! 978745217 2021-04-17T13:20:50.446 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > PRODUCER NOT READY! 978746369 2021-04-17T13:20:50.447 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > PRODUCER NOT READY! 978747265 2021-04-17T13:20:50.448 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > PRODUCER NOT READY! 978748161 2021-04-17T13:20:50.449 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > PRODUCER NOT READY! 978748929 From: Lun, Derek Sent: Friday, April 16, 2021 3:57 PM To: filecoin-project/rust-fil-proofs ***@***.***>; filecoin-project/rust-fil-proofs ***@***.***> Cc: Comment ***@***.***> Subject: RE: [filecoin-project/rust-fil-proofs] bind replication threads to specific cores (#1305) Yes, appreciate for your great help , we are doing performance measurement now , will sync up with you later . From: porcuquine ***@***.******@***.***>> Sent: Friday, April 16, 2021 3:55 PM To: filecoin-project/rust-fil-proofs ***@***.******@***.***>> Cc: Lun, Derek ***@***.******@***.***>>; Comment ***@***.******@***.***>> Subject: Re: [filecoin-project/rust-fil-proofs] bind replication threads to specific cores (#1305) I'm not sure hwloc can, but you might try cpuid<https://docs.rs/cpuid/0.1.1/cpuid/>. — You are receiving this because you commented. Reply to this email directly, view it on GitHub<#1305 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AIGV5DPTTRONFNTK62G43F3TI7UOHANCNFSM4SFNQUFQ>.

porcuquine · 2021-05-28T15:45:10Z

This means that the main thread is waiting for data from a producer. If it doesn't happen 'too much', it's probably fine and is (I think) pretty normal. But if you are always waiting then that is a bottleneck. The obvious thing to try is to increase the number of producers, which you can set with FIL_PROOFS_MULTICORE_SDR_PRODUCERS. For this to be useful, though, the producers and the consumer need to share a cache (and no other threads should be disrupting that cache during the sealing process).

If you can't increase number of producers, you need to make sure the producers are producing fast enough. They read from disk, so I guess making sure you are using SSD would help. There are many possible reasons, some of which might not be fixable, and it might not matter… Hope that helps.

porcuquine force-pushed the feat/sdr-core-handling branch 5 times, most recently from f493237 to 4a2ac47 Compare October 6, 2020 07:43