
Fix broken GPU locking #1380

Closed
cryptonemo opened this issue Dec 1, 2020 · 5 comments
Labels: gpu, P0 Highest priority, tests

@cryptonemo (Collaborator)

Description

GPU lock contention is flaky: it sometimes works and sometimes doesn't. It seems most problematic when trying to prove SNARKs on machines without a GPU. The locking/serialization for the GPU column and tree builders is also unreliable; it can cause hangs when running all ignored tests (which run in parallel by default) with the GPU column and tree builders enabled. Running those tests in parallel with the GPU column/tree builders enabled doesn't generally make sense, since the trees are so small that GPU building is likely slower than parallel CPU building, but the locking shouldn't fail by hanging either.

Acceptance criteria

All tests using GPU pass without hanging

Where to begin

Proper multi-GPU support is slated to improve or resolve this situation.

@cryptonemo (Collaborator, Author)

Note that this issue was originally created because running tests would hang; however, the issue is now showing up in production for miners.

@cryptonemo changed the title from "Improve testing (GPU locking)" to "Fix broken GPU locking" on Feb 8, 2021
@cryptonemo added the P0 Highest priority label on Feb 8, 2021
@cryptonemo (Collaborator, Author)

Very likely related to filecoin-project/bellperson#146

vmx added a commit to filecoin-project/bellperson that referenced this issue Feb 10, 2021
The proof computation is done asynchronously: tasks are created which are then
waited for. If such a "waiter" is in the same thread pool as the task, it can
happen that the task is waited for before it gets executed. That leads to a
deadlock.

The solution is to make sure that the parts that wait for the computation
run on the main thread, while the computation happens in a thread pool.

Closes #146, filecoin-project/rust-fil-proofs#1380
vmx added a commit to filecoin-project/bellperson that referenced this issue Feb 10, 2021
The proof computation is done asynchronously: tasks are created which are then
waited for. If such a "waiter" is in the same thread pool as the task, it can
happen that the task is waited for before it gets executed. That leads to a
deadlock.

The solution is to make sure that the parts that wait for the computation
run on the main thread, while the computation happens in a thread pool.

Here are all the details for the interested reader (thanks @cryptonemo for this writeup):

The root of the problem lies both in the design of the Worker and in how it is currently used in the code:

https://github.com/filecoin-project/bellperson/blob/d2e88544efb9876bc96c3d7e26a527943c0fdb68/src/multicore.rs#L35

In particular, when a Worker is constrained to a single thread pool, it's possible that compute is called (https://github.com/filecoin-project/bellperson/blob/d2e88544efb9876bc96c3d7e26a527943c0fdb68/src/multicore.rs#L44), which in turn requests a spawn on the THREAD_POOL and then returns a Waiter, on which the caller waits for a message receipt (sent at some point in the future from the body of the spawned closure).
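
For orientation, here is a heavily simplified sketch of that shape; the names and signatures are illustrative stand-ins, not the actual bellperson API:

```rust
use std::sync::mpsc::{channel, Receiver};

// Illustrative stand-in for bellperson's Waiter: wait() blocks on a channel
// until the spawned closure sends its result.
struct Waiter<T> {
    receiver: Receiver<T>,
}

impl<T> Waiter<T> {
    fn wait(self) -> T {
        self.receiver.recv().expect("sender dropped without sending a result")
    }
}

// Illustrative stand-in for Worker::compute: push the work onto a rayon pool
// and immediately hand back a Waiter for the eventual result.
fn compute<T, F>(pool: &rayon::ThreadPool, work: F) -> Waiter<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (sender, receiver) = channel();
    pool.spawn(move || {
        let _ = sender.send(work());
    });
    Waiter { receiver }
}
```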

There is a race when the Waiter is constrained to the same thread pool as the Worker: a task can be submitted (via compute) to the rayon thread pool but not yet started, while the Waiter is returned (possibly on the same thread) and its receive is called before the spawned task gets a chance to execute. This is often easy to observe with a single-threaded pool (BELLMAN_NUM_CPUS=1), even when running the repo's tests normally.

Try this to attempt to observe the hang on the `master` branch of https://github.com/filecoin-project/bellperson (currently at commit `d2e88544efb9876bc96c3d7e26a527943c0fdb68`):

```
BELLMAN_NUM_CPUS=1 RUSTFLAGS="-C target-cpu=native" RUST_LOG=debug cargo test --release --all --workspace -- --nocapture
```

I was able to isolate this case in the bellman code itself with added logging, confirming this flow as problematic, and @vmx made a simplified repo that also clearly demonstrates the issue:

https://github.com/vmx/bellman_deadlock

For this to happen on hardware with more threads/cores, every thread would have to exhibit exactly this behaviour in sequence (i.e. all threads call spawn, and all of them start waiting before any spawned task starts), so it should be a much rarer occurrence there. It is nevertheless a race condition that is unavoidable as long as the Waiters are constrained to the same thread pool as the Workers.
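
As a concrete illustration, here is a minimal, self-contained sketch (not bellperson code, and simpler than the bellman_deadlock repro) of the same spawn-then-wait pattern deadlocking on a one-thread rayon pool:

```rust
use std::sync::mpsc::channel;

fn main() {
    // One worker thread, mirroring BELLMAN_NUM_CPUS=1.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build()
        .unwrap();

    // `install` runs the closure on the pool's single thread, analogous to the
    // THREAD_POOL.install call in the prover.
    let result = pool.install(|| {
        let (sender, receiver) = channel();
        // "compute": spawn the task into the same pool...
        pool.spawn(move || {
            let _ = sender.send(42u64);
        });
        // ..."wait": block on the receiver from inside that pool. With a single
        // thread, the spawned task can never run, so recv() blocks forever.
        receiver.recv().unwrap()
    });
    println!("never reached: {}", result);
}
```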

Practically speaking, this condition is created here:

https://github.com/filecoin-project/bellperson/blob/d2e88544efb9876bc96c3d7e26a527943c0fdb68/src/groth16/prover.rs#L276

Calling `THREAD_POOL.install` causes the `create_proof_batch_priority` function to block while the inner method executes entirely within the context of the THREAD_POOL. Inside this method, `multiexp` is called a number of times, and each call invokes `compute` and later `wait`, the combination required for this lockup. Again, because the `compute` calls and the `wait`s run in the same pool, the race condition described above applies in this context.

Possible solutions:

Given the nature of this problem, the conceptual fix is that Waiters should wait in a separate thread pool from the compute Workers.

We observed that, instead of refactoring in a new thread pool, the `THREAD_POOL.install` call mentioned above could execute all of the compute operations in a blocking manner (`.install` is blocking in this usage), while the code inside the inner method that previously waited could be moved out into the top-level caller and executed after the computation has completed.
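
A minimal sketch of that direction, under the same illustrative assumptions as above (not the actual patch): keep the computation on the pool, but do the waiting on the calling thread.

```rust
use std::sync::mpsc::channel;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build()
        .unwrap();

    let (sender, receiver) = channel();

    // The computation still runs on the pool's worker thread...
    pool.spawn(move || {
        let _ = sender.send(42u64);
    });

    // ...but the wait happens on the calling (main) thread, outside the pool,
    // so the single worker thread stays free to execute the spawned task.
    let result = receiver.recv().unwrap();
    println!("got: {}", result);
}
```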

A previously considered solution was removing the `THREAD_POOL.install` line at https://github.com/filecoin-project/bellperson/blob/d2e88544efb9876bc96c3d7e26a527943c0fdb68/src/groth16/prover.rs#L276 and replacing it with a direct call to `create_proof_batch_priority_inner`. With that change, all of the multiexp calls would still call `compute`, which internally adds all long-running tasks to the THREAD_POOL, so the waits would no longer be constrained to that same thread pool. This was tested and works properly; however, it has the unfortunate side effect of *unconstraining* parallelism from the pool in other rayon-parallel code paths, such as the one used here: https://github.com/filecoin-project/bellperson/blob/d2e88544efb9876bc96c3d7e26a527943c0fdb68/src/groth16/prover.rs#L291

Closes #146, filecoin-project/rust-fil-proofs#1380
@cryptonemo (Collaborator, Author)

This will be resolved by filecoin-project/bellperson#150

cryptonemo pushed a commit to filecoin-project/bellperson that referenced this issue Feb 16, 2021
cryptonemo pushed a commit to filecoin-project/bellperson that referenced this issue Feb 16, 2021
@dignifiedquire (Contributor)

can this be closed?

@cryptonemo (Collaborator, Author)

> can this be closed?

Yes, thanks. Resolved with filecoin-project/bellperson#150
