
Multithreaded POW mining worker #9629

Conversation

Wizdave97
Contributor

@Wizdave97 Wizdave97 commented Aug 25, 2021

What this PR does?

This PR modifies the sc_consensus_pow::start_mining worker to return a cloneable channel that receives a broadcast of the latest mining metadata, so nodes can try to compute the seal using a multithreaded approach.
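
For illustration, here is a rough sketch of how a downstream node might consume such a channel from multiple threads. The receiver type, the MiningMetadata fields and compute_seal below are placeholders (assuming a tokio::sync::watch-style receiver), not the exact API added in this PR:

use std::thread;

// Placeholder types for illustration only.
#[derive(Clone)]
struct MiningMetadata {
    pre_hash: [u8; 32],
    difficulty: u128,
}

fn compute_seal(_meta: &MiningMetadata) -> Option<Vec<u8>> {
    // Try a batch of nonces against the metadata; return Some(seal) on success.
    None
}

fn spawn_miners(rx: tokio::sync::watch::Receiver<MiningMetadata>, num_threads: usize) {
    for _ in 0..num_threads {
        // Every thread gets its own clone of the receiver, so no thread blocks
        // another while reading the latest metadata.
        let rx = rx.clone();
        thread::spawn(move || loop {
            // Snapshot the latest metadata; no lock is held while hashing.
            let meta = (*rx.borrow()).clone();
            if let Some(_seal) = compute_seal(&meta) {
                // Submit the seal back to the mining worker here.
            }
        });
    }
}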

@Wizdave97 Wizdave97 requested a review from sorpaas as a code owner August 25, 2021 21:19
@cla-bot-2021

cla-bot-2021 bot commented Aug 25, 2021

User @Wizdave97, please sign the CLA here.

Member

@sorpaas sorpaas left a comment


Like I said before, multithreaded PoW is already working.

On the other hand, this PR forces the mining thread to be designed with a certain lifecycle style (which must deliberately be built around streams/futures), but that is not always possible. Some mining algorithms have certain "tricks" that coordinate multithreading themselves; for those, this PR is unusable. The pow module should strive to be as generic as possible, to support as many Substrate chains as possible.

@Wizdave97
Contributor Author

Like I said before, multithreaded PoW is already working.

On the other hand, this PR forces the mining thread to be designed with a certain lifecycle style (which must deliberately be built around streams/futures), but that is not always possible. Some mining algorithms have certain "tricks" that coordinate multithreading themselves; for those, this PR is unusable. The pow module should strive to be as generic as possible, to support as many Substrate chains as possible.

@sorpaas In the current code, the mining worker is wrapped in an Arc and a Mutex, so multithreaded access to the underlying worker, and hence to the mining metadata, is not possible: each thread races to acquire a lock on the worker, and once one thread holds it, all other threads block waiting for the lock, so in essence only a single thread is used to compute the seal. With the stream-based approach, the metadata is broadcast across a channel and every listening thread can access it, so true multithreading can be achieved: each thread uses the data it receives to try to compute a seal, and no thread is blocked or starved of mining metadata.

@Wizdave97
Contributor Author

Wizdave97 commented Aug 25, 2021

Like I said before, multithreaded PoW is already working.

On the other hand, this PR forces the mining thread to be designed with a certain lifecycle style (which must deliberately be built around streams/futures), but that is not always possible. Some mining algorithms have certain "tricks" that coordinate multithreading themselves; for those, this PR is unusable. The pow module should strive to be as generic as possible, to support as many Substrate chains as possible.

@sorpaas I don't think this obstructs any third party from implementing multithreading; they just have to listen to the stream instead of acquiring a lock on a mutex.

With this approach we eliminate the possibility of deadlocks caused by faulty strategies for acquiring the mutex lock.

@Wizdave97 Wizdave97 force-pushed the wizdave/pow-multithreaded-worker branch 3 times, most recently from 653f078 to afe91b0 on August 25, 2021 23:48
@sorpaas
Member

sorpaas commented Aug 26, 2021

That's a misunderstanding of Rust futures. The performance gain of Rust futures comes when you have fewer threads than concurrent tasks. If you already need that many sync threads, then you're already paying all the costs of threads.

If you look into the tokio::sync::watch code you're using in this PR, you will see that it uses RwLock anyway. RwLock is pretty fast unless you misuse it.

@Wizdave97
Contributor Author

Wizdave97 commented Aug 26, 2021

That's a misunderstanding of Rust futures. The performance gain of Rust futures comes when you have fewer threads than concurrent tasks. If you already need that many sync threads, then you're already paying all the costs of threads.

If you look into the tokio::sync::watch code you're using in this PR, you will see that it uses RwLock anyway. RwLock is pretty fast unless you misuse it.

We're not using futures for performance gain; we're using futures for synchronization, which in turn gives us a performance gain.

Sealing threads before this PR:

for _ in 0..num_cpus {
    let worker = worker.clone();
    thread::spawn(move || loop {
        // All threads need to call `MiningWorker::metadata` in a loop to be updated
        // with new metadata; the contention on the lock causes starvation.
        // Ideally, we should lock it once, then run compute until new metadata arrives.
        compute(worker.lock().metadata());
    });
}

Sealing threads after this PR:

for _ in 0..num_cpus {
    let mut stream = stream.clone();
    thread::spawn(move || {
        // Yes, `tokio::sync::watch::channel` uses an `RwLock` behind the scenes,
        // but now that it's behind a future, we can race our compute task against
        // new metadata instead of locking in a hot loop.
        // NOTE: all threads will receive the metadata here, as it is an `spmc` channel.
        let mut metadata = block_on(stream.next());

        loop {
            match block_on(select(compute(metadata.clone()), stream.next())) {
                Either::Right((Some(next_metadata), _)) => metadata = next_metadata,
                _ => {},
            }
        }
    });
}

@sorpaas
Member

sorpaas commented Aug 26, 2021

I'd appreciate it if you could show some performance metrics if you want to convince me that it's actually useful to use streams/futures here.

It's all the same thread lock primitives under the hood, so I'm not sure why using a plain Mutex / RwLock would suddenly make things slower.

@Wizdave97
Contributor Author

I'd appreciate it if you could show some performance metrics if you want to convince me that it's actually useful to use streams/futures here.

It's all the same thread lock primitives under the hood, so I'm not sure why using a plain Mutex / RwLock would suddenly make things slower.

Will get back to you with that ASAP.

@sorpaas
Member

sorpaas commented Aug 26, 2021

If your loop is really "hot" (in that the compute is really fast), then you can create an atomic value in MiningWorker to indicate the current version of the metadata, and only fetch the metadata again when it has changed.

In general, futures use the same primitives as threads, so unless you're doing the things futures are designed for, nothing will magically be faster.
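
As a minimal sketch of that suggestion (the Worker struct and field names below are illustrative, not the actual MiningWorker API): keep an atomic version counter next to the metadata, bump it whenever a new build arrives, and let each mining thread re-acquire the lock only when the version has changed:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

// Illustrative stand-in for the real mining worker.
struct Worker<M> {
    version: AtomicU64,
    metadata: Mutex<M>,
}

impl<M: Clone> Worker<M> {
    // Called by the worker task whenever a new block proposal is ready.
    fn update(&self, new: M) {
        *self.metadata.lock().unwrap() = new;
        self.version.fetch_add(1, Ordering::SeqCst);
    }
}

fn mine<M: Clone>(worker: Arc<Worker<M>>, compute: impl Fn(&M)) {
    let mut seen = worker.version.load(Ordering::SeqCst);
    let mut meta = worker.metadata.lock().unwrap().clone();
    loop {
        // Hot path: hash against the cached metadata; no lock is taken.
        compute(&meta);
        // Cheap atomic check; re-lock only when the metadata actually changed.
        let current = worker.version.load(Ordering::SeqCst);
        if current != seen {
            seen = current;
            meta = worker.metadata.lock().unwrap().clone();
        }
    }
}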

@Wizdave97 Wizdave97 force-pushed the wizdave/pow-multithreaded-worker branch 4 times, most recently from 56390a3 to 79cbe8b on August 29, 2021 06:59
@Wizdave97
Contributor Author

@sorpaas Here are the benchmark results below. You can clone this repo, https://github.com/polytope-labs/sybil, and run the benchmarks yourself using the bench.ts script in the blocktime folder.

For the default mutex worker, use the update-master-to-work-with-latest-substrate branch.
For the multithreaded worker, use the wiz-mining-threaded branch.

Default Mutex worker

Platform: linux

Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
4 cores, 8 threads

Total memory: 25.047011328 GB

Difficulty: 1_000_000_000

Block times (seconds) from the genesis block:
58.04699993133545
364.9449999332428
286.91600012779236
681.7999999523163
182.71200013160706
1674.4679999351501
2556.4079999923706
78.77099990844727
126.03200006484985
437.81699991226196
283.3200001716614
560.6549999713898
497.5529999732971
1444.0049998760223
51.52300000190735
168.64200019836426
2186.25
5240.287999868393
74.10000014305115

Average block time: 892.329052636498

Multithreaded mining worker

Platform: linux
Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
4 cores, 8 threads

Total memory: 25.047011328 GB

Benchmarked with 8 threads
Difficulty: 1_000_000_000

Block times (seconds) from the genesis block:
691.5700001716614
9.455999851226807
1174.965000152588
1441.9349999427795
184.58599996566772
158.2829999923706
163.4630000591278
238.4099998474121
417.4450001716614
149.61299991607666
236.46399998664856
82.84700012207031
116.86999988555908
5.727999925613403
463.8310000896454
512.045000076294
389.3139998912811
45.00600004196167
105.22699999809265

Average block time: 346.687263162512

@Wizdave97 Wizdave97 requested a review from sorpaas August 29, 2021 07:12
@Wizdave97 Wizdave97 force-pushed the wizdave/pow-multithreaded-worker branch 2 times, most recently from 67f4b9e to 4d5f55c on August 29, 2021 07:17
@sorpaas
Member

sorpaas commented Aug 29, 2021

You're using the metadata call in your hot loop. Please fix that and the other issues first! I know you're trying to show that this PR is merge-able, but using the worst-optimized code as the comparison against this PR is not the way to go.

I'd also appreciate it if you could port https://github.com/kulupu/kulupu/blob/master/pow/src/lib.rs over to work with this PR. It has an optimization where each new loop iteration depends on the previous one. This PR actually looks like a revert of #7060. We did handle the whole thread management within Substrate for mining workers, but later figured out that just won't work for any mining algorithm that's slightly complex.
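
For context, a sketch of the kind of loop being described here (illustrative only, not Kulupu's actual algorithm): each attempt is seeded by the previous one, so the loop body cannot simply be torn down and restarted whenever a stream yields new metadata; the algorithm itself decides when to pick up changes:

// Illustrative only -- not Kulupu's algorithm.
fn mine_with_carried_state(mut state: u64, metadata_version: impl Fn() -> u64) {
    let mut seen = metadata_version();
    loop {
        // Each attempt is seeded by the previous one (a simple LCG step here).
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);

        // ... check whether `state` satisfies the difficulty here ...

        // The algorithm decides for itself when it is safe to pick up new metadata.
        let current = metadata_version();
        if current != seen {
            seen = current;
            // Reset or adapt the carried state as the algorithm requires.
        }
    }
}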

@Wizdave97 Wizdave97 force-pushed the wizdave/pow-multithreaded-worker branch 6 times, most recently from 67e2cc7 to 8060a5c on September 2, 2021 19:43
@Wizdave97
Contributor Author

@sorpaas I was not able to get the benchmarks for a multithreaded mining worker based on the mutex worker; it keeps deadlocking once multiple threads are enabled. If you can take a look at the code and see what's wrong, that would go a long way toward speeding up those benchmarks: https://github.com/polytope-labs/sybil -> branch -> update-master-to-work-with-latest-substrate.

@Wizdave97 Wizdave97 force-pushed the wizdave/pow-multithreaded-worker branch from 8060a5c to cdb68cd on September 5, 2021 15:32
@sorpaas
Member

sorpaas commented Sep 5, 2021

Let me get this straight -- unless there can be a drastic performance improvement with this change, I don't think we'd want to revert back to the model in this PR, and I see the chance of this ever getting merged as slim.

Handling threads internally in the pow module is incompatible with many mining algorithms. Using additional abstractions like futures or channels does not add much benefit when things can be described clearly with low-level primitives. You also do not want to wait on futures in the mining threads, because that adds extra latency.

@Wizdave97
Contributor Author

Let me get this straight -- unless there can be a drastic performance improvement with this change, I don't think we'd want to revert back to the model in this PR, and I see the chance of this ever getting merged as slim.

Handling threads internally in the pow module is incompatible with many mining algorithms. Using additional abstractions like futures or channels does not add much benefit when things can be described clearly with low-level primitives. You also do not want to wait on futures in the mining threads, because that adds extra latency.

@sorpaas Any luck getting the multithreading to work here? https://github.com/polytope-labs/sybil -> branch -> update-master-to-work-with-latest-substrate

@sorpaas
Member

sorpaas commented Sep 8, 2021

@Wizdave97 Try #9698 and see if it works for you. It handles locking internally now, so it should prevent most possible misuses.

@Wizdave97
Contributor Author

Mutex-based worker

Platform: linux
Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
8 cores
Total memory: 25.047007232 GB
Difficulty: 1_000_000_000

Block times (seconds)
60.77199983596802
180.78200006484985
93.36800003051758
51.019999980926514
38.34500002861023
177.07999992370605
393.3180000782013
16.391000032424927
531.3409998416901
208.5970001220703
59.675999879837036
692.7750000953674
177.0420000553131
77.00099992752075
748.1489999294281
348.3100001811981
214.72299981117249
487.49500012397766
252.47300004959106
592.5089998245239

Average block time: 270.0583499908447

Stream-based worker

Platform: linux
Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
8 threads
Total memory: 25.047007232 GB
Difficulty: 1_000_000_000

Block times (seconds)
16.628000020980835
317.0720000267029
499.17199993133545
116.1819999217987
2496.996999979019
20.974000215530396
181.79499983787537
1229.4260001182556
129.30999994277954
457.6329998970032
67.22200012207031
583.5969998836517
92.56900000572205
483.90000009536743
755.2969999313354
223.375
191.24100017547607
172.80299997329712
69.33599996566772
73.18400001525879

Average block time: 408.88565000295637

@Wizdave97 Wizdave97 closed this Sep 10, 2021