
Fix lots of deadlocks by managing threads and encapsulating locks #1852

Merged: 43 commits, Nov 14, 2019

Conversation

@lucksus (Collaborator) commented Nov 8, 2019

PR summary

  1. Use a thread pool to throttle the number of threads that get spawned dynamically
  2. Throttle the number of simultaneously running zome calls by implementing a queue within NucleusState
  3. Queue all holding workflows in DhtStore and consume/run only one at a time
  4. Simplify Instance::process_action and replace a panic with an error, which fixes shutdown deadlocks and makes the Rust tests pass again
  5. Don't pass around lock guards from Context::state(); return copies instead
  6. Don't pass around lock guards of the CAS/EAV storages through state getters; handle thread synchronization locally inside those getters

Limiting threads and workflows (points 1-3)

This started with the idea of using a thread pool so that the conductor doesn't spawn way too many threads under heavy load. We quickly realized that threads are the wrong level of granularity to put a limit on, since we have high-level workflows (such as calling zome functions and holding entries) that consist of one root thread but may also spawn a number of sub-threads whose completion the root thread then waits on. Limiting the spawned threads without also limiting those workflows creates another source of deadlocks, hence points 2 and 3; see the sketch below.
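A minimal sketch of the queueing idea behind points 2 and 3, using illustrative names (`HoldingWorkflow`, `queued_holding_workflows` and friends are stand-ins, not the exact types from this PR): workflows wait in a queue, and the next one is only popped once the previous one has finished.

```rust
use std::collections::VecDeque;

// Illustrative stand-in for a queued holding workflow.
struct HoldingWorkflow {
    entry_address: String,
}

struct DhtStore {
    // Pending workflows wait here instead of each one spawning threads.
    queued_holding_workflows: VecDeque<HoldingWorkflow>,
    // At most one holding workflow runs at a time.
    running_holding_workflow: Option<HoldingWorkflow>,
}

impl DhtStore {
    fn queue_holding_workflow(&mut self, workflow: HoldingWorkflow) {
        self.queued_holding_workflows.push_back(workflow);
    }

    // Only hand out the next workflow when nothing is currently running.
    fn pop_next_holding_workflow(&mut self) -> Option<&HoldingWorkflow> {
        if self.running_holding_workflow.is_none() {
            self.running_holding_workflow = self.queued_holding_workflows.pop_front();
        }
        self.running_holding_workflow.as_ref()
    }

    // Called when the running workflow completes, unblocking the queue.
    fn holding_workflow_finished(&mut self) {
        self.running_holding_workflow = None;
    }
}
```

The same shape works for zome calls in NucleusState: the queue bounds how many expensive workflows are in flight, regardless of how many sub-threads each one spawns.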

try-lock in action loop

After adding those limits on workflows, I could get much further into our stress tests, but the Rust tests started failing. This turned out to be a problem that only occurs during shutdown: the redux action loop would panic after failing to get a lock on the state while the instance was already shutting down. I've changed the signature of Instance::process_action() so that it no longer deals with observers (that is now handled in the loop that calls it) and instead returns a result, turning the panic on a failed lock acquisition into an error. This way the redux action loop can simply run again when it can't get a lock, at which point it might notice that we already got a kill signal and stop trying to lock something that isn't there any more. This fixed the failing Rust test.
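A minimal sketch of the shape of this change, with illustrative types (the real Instance and its error type are richer): process_action() returns an error on a failed lock attempt instead of panicking, and the calling loop checks for the kill signal between attempts.

```rust
use std::sync::{
    mpsc::{Receiver, TryRecvError},
    Arc, RwLock,
};

// Illustrative stand-in for the redux state.
type State = Vec<String>;

struct Instance {
    state: Arc<RwLock<State>>,
}

impl Instance {
    // Returns an error instead of panicking when the lock is unavailable,
    // leaving the decision of what to do next to the action loop.
    fn process_action(&self, action: &str) -> Result<(), String> {
        let mut state = self
            .state
            .try_write()
            .map_err(|_| "could not acquire write lock on state".to_string())?;
        state.push(action.to_string());
        Ok(())
    }
}

fn action_loop(instance: &Instance, actions: Receiver<String>, kill: Receiver<()>) {
    loop {
        // Check the kill signal first; during shutdown we stop trying to
        // lock state that may no longer be there.
        match kill.try_recv() {
            Ok(()) | Err(TryRecvError::Disconnected) => return,
            Err(TryRecvError::Empty) => {}
        }
        if let Ok(action) = actions.try_recv() {
            if let Err(err) = instance.process_action(&action) {
                // Previously a panic; now we log and loop again.
                eprintln!("action loop: {}", err);
            }
        }
    }
}
```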

Not passing around lock-guards

With this change, Context::state() no longer hands out lock guards (which is asking for deadlocks) and instead returns a copy. This is not expensive at all, since the State consists almost entirely of Arcs at the top level anyway.
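A minimal sketch, with illustrative substore names: the guard lives only inside the getter, and callers get a cheap clone of a State that is mostly Arcs.

```rust
use std::sync::{Arc, RwLock};

struct DhtStore;
struct NucleusState;

// Cloning State only bumps reference counts on the Arcs.
#[derive(Clone)]
struct State {
    dht: Arc<DhtStore>,
    nucleus: Arc<NucleusState>,
}

struct Context {
    state: RwLock<Option<State>>,
}

impl Context {
    // Before: returned a read guard that callers could hold across other
    // lock acquisitions (asking for deadlocks).
    // After: lock briefly, clone, release.
    fn state(&self) -> Option<State> {
        self.state
            .read()
            .expect("state lock poisoned")
            .clone()
    }
}
```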

This removed several deadlocks we kept seeing in the stress tests.

We were left with deadlocks involving the CAS/EAV storages. The remaining commits apply the same treatment to the CAS/EAV locks too: hide the locks locally instead of passing around lock guards.
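The same idea as an illustrative sketch (not the actual CAS/EAV traits from holochain-rust): the lock becomes a private detail of the storage wrapper, and getters return owned copies, so no lock guard ever crosses an API boundary.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical content-addressable storage wrapper.
#[derive(Clone, Default)]
struct CasStorage {
    inner: Arc<RwLock<HashMap<String, String>>>,
}

impl CasStorage {
    // Lock, copy the value out, and drop the guard before returning.
    fn fetch(&self, address: &str) -> Option<String> {
        self.inner
            .read()
            .expect("CAS lock poisoned")
            .get(address)
            .cloned()
    }

    fn add(&self, address: &str, content: &str) {
        self.inner
            .write()
            .expect("CAS lock poisoned")
            .insert(address.to_string(), content.to_string());
    }
}
```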

testing/benchmarking notes

( if any manual testing or benchmarking was/should be done, add notes and/or screenshots here )

followups

( any new tickets/concerns that were discovered or created during this work but aren't in scope for review here )

changelog

  • if this is a code change that affects some consumer (e.g. zome developers) of holochain core, then it has been added to our between-release changelog with the format
- summary of change [PR#1234](https://github.com/holochain/holochain-rust/pull/1234)

documentation

@lucksus changed the title from "Use a thread pool to throttle number of threads that get spawned dynamically" to "Make HC resilient against high loads by managing threads" on Nov 12, 2019
@lucksus changed the title to "Make HC resilient against high loads by managing threads and locks" on Nov 13, 2019
@lucksus marked this pull request as ready for review on Nov 14, 2019 13:13
@lucksus changed the title to "Fix lots of deadlocks by managing threads and encapsulating locks" on Nov 14, 2019
@zippy (Member) left a comment:

Changes requested:

  1. renaming and commenting the 10 second and 1 second scheduled jobs for state dump and pending validations
  2. concurrency issues with call to reduce_pop_next_holding_workflow

Review thread on crates/core/src/network/mod.rs (outdated, resolved)
@maackle (Collaborator) left a comment:

Looks good, one small question

Review thread on crates/core/src/dht/dht_reducers.rs (resolved):
```diff
 // Mutate state
 {
     let new_state: StateWrapper;

     // Get write lock
     let mut state = self
         .state
-        .write()
-        .expect("owners of the state RwLock shouldn't panic");
+        .try_write_until(Instant::now().checked_add(Duration::from_secs(10)).unwrap())
```
A collaborator commented:
can we make this a const?
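One way to do that, as an illustrative sketch (the const name is not from the PR):

```rust
use std::time::{Duration, Instant};

// Named const instead of a magic number in the try_write_until call.
const STATE_WRITE_TIMEOUT: Duration = Duration::from_secs(10);

fn state_write_deadline() -> Instant {
    Instant::now()
        .checked_add(STATE_WRITE_TIMEOUT)
        .expect("deadline overflowed")
}
```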
