
Fix lots of deadlocks by managing threads and encapsulating locks #1852

Merged: 43 commits, Nov 14, 2019

Conversation

@lucksus (Collaborator) commented Nov 8, 2019

PR summary

  1. Use a thread pool to throttle the number of threads that get spawned dynamically
  2. Throttle the number of simultaneously running zome calls by implementing a queue within NucleusState
  3. Queue all holding workflows in DhtStore and consume/run only one at a time
  4. Simplify Instance::process_action and replace a panic with an error, which fixes shutdown deadlocks and makes the Rust tests pass again
  5. Don't pass around lock guards from Context::state(); return copies instead
  6. Don't pass around lock guards of the CAS/EAV storages through state getters; handle thread synchronization locally inside those getters

Limiting threads and workflows (points 1-3)

This started with the idea of using a thread pool so that the conductor doesn't spawn way too many threads under heavy load. We quickly realized that threads are the wrong level of granularity to put a limit on, since we have high-level workflows (such as calling zome functions and holding entries) that consist of one root thread but may also spawn a number of sub-threads whose completion the root thread then waits on. Limiting the spawned threads without also limiting those workflows creates another source of deadlocks, hence points 2 and 3; see the sketch below.
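A minimal sketch of the queueing idea behind points 2 and 3, using illustrative names (`HoldingWorkflow`, `queued_holding_workflows` and friends are stand-ins, not the exact types from this PR): workflows wait in a queue, and the next one is only popped once the previous one has finished.

```rust
use std::collections::VecDeque;

// Illustrative stand-in for a queued holding workflow.
struct HoldingWorkflow {
    entry_address: String,
}

struct DhtStore {
    // Pending workflows wait here instead of each one spawning threads.
    queued_holding_workflows: VecDeque<HoldingWorkflow>,
    // At most one holding workflow runs at a time.
    running_holding_workflow: Option<HoldingWorkflow>,
}

impl DhtStore {
    fn queue_holding_workflow(&mut self, workflow: HoldingWorkflow) {
        self.queued_holding_workflows.push_back(workflow);
    }

    // Only hand out the next workflow when nothing is currently running.
    fn pop_next_holding_workflow(&mut self) -> Option<&HoldingWorkflow> {
        if self.running_holding_workflow.is_none() {
            self.running_holding_workflow = self.queued_holding_workflows.pop_front();
        }
        self.running_holding_workflow.as_ref()
    }

    // Called when the running workflow completes, unblocking the queue.
    fn holding_workflow_finished(&mut self) {
        self.running_holding_workflow = None;
    }
}
```

The same shape works for zome calls in NucleusState: the queue bounds how many expensive workflows are in flight, regardless of how many sub-threads each one spawns.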

try-lock in action loop

After adding those limits on workflows, I could get much further into our stress tests, but the Rust tests started failing. This turned out to be a problem that only occurs during shutdown: the redux action loop would panic after failing to get a lock on the state while the instance was already shutting down. I've changed the signature of Instance::process_action() so that it no longer deals with observers (that is now handled in the loop that calls it) and instead returns a result, turning the panic on a failed lock acquisition into an error. This way the redux action loop can simply run again when it can't get a lock, at which point it might notice that we already got a kill signal and stop trying to lock something that isn't there any more. This fixed the failing Rust test.
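A minimal sketch of the shape of this change, with illustrative types (the real Instance and its error type are richer): process_action() returns an error on a failed lock attempt instead of panicking, and the calling loop checks for the kill signal between attempts.

```rust
use std::sync::{
    mpsc::{Receiver, TryRecvError},
    Arc, RwLock,
};

// Illustrative stand-in for the redux state.
type State = Vec<String>;

struct Instance {
    state: Arc<RwLock<State>>,
}

impl Instance {
    // Returns an error instead of panicking when the lock is unavailable,
    // leaving the decision of what to do next to the action loop.
    fn process_action(&self, action: &str) -> Result<(), String> {
        let mut state = self
            .state
            .try_write()
            .map_err(|_| "could not acquire write lock on state".to_string())?;
        state.push(action.to_string());
        Ok(())
    }
}

fn action_loop(instance: &Instance, actions: Receiver<String>, kill: Receiver<()>) {
    loop {
        // Check the kill signal first; during shutdown we stop trying to
        // lock state that may no longer be there.
        match kill.try_recv() {
            Ok(()) | Err(TryRecvError::Disconnected) => return,
            Err(TryRecvError::Empty) => {}
        }
        if let Ok(action) = actions.try_recv() {
            if let Err(err) = instance.process_action(&action) {
                // Previously a panic; now we log and loop again.
                eprintln!("action loop: {}", err);
            }
        }
    }
}
```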

Not passing around lock-guards

With this change, Context::state() no longer hands out lock guards (which is asking for deadlocks) and instead returns a copy. This is not expensive at all, since the State consists almost entirely of Arcs at the top level anyway.
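A minimal sketch, with illustrative substore names: the guard lives only inside the getter, and callers get a cheap clone of a State that is mostly Arcs.

```rust
use std::sync::{Arc, RwLock};

struct DhtStore;
struct NucleusState;

// Cloning State only bumps reference counts on the Arcs.
#[derive(Clone)]
struct State {
    dht: Arc<DhtStore>,
    nucleus: Arc<NucleusState>,
}

struct Context {
    state: RwLock<Option<State>>,
}

impl Context {
    // Before: returned a read guard that callers could hold across other
    // lock acquisitions (asking for deadlocks).
    // After: lock briefly, clone, release.
    fn state(&self) -> Option<State> {
        self.state
            .read()
            .expect("state lock poisoned")
            .clone()
    }
}
```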

This removed several deadlocks we kept seeing in the stress tests.

We were left with deadlocks involving the CAS/EAV storages. The remaining commits apply the same treatment to the CAS/EAV locks too: hide the locks locally instead of passing around lock guards.
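The same idea as an illustrative sketch (not the actual CAS/EAV traits from holochain-rust): the lock becomes a private detail of the storage wrapper, and getters return owned copies, so no lock guard ever crosses an API boundary.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical content-addressable storage wrapper.
#[derive(Clone, Default)]
struct CasStorage {
    inner: Arc<RwLock<HashMap<String, String>>>,
}

impl CasStorage {
    // Lock, copy the value out, and drop the guard before returning.
    fn fetch(&self, address: &str) -> Option<String> {
        self.inner
            .read()
            .expect("CAS lock poisoned")
            .get(address)
            .cloned()
    }

    fn add(&self, address: &str, content: &str) {
        self.inner
            .write()
            .expect("CAS lock poisoned")
            .insert(address.to_string(), content.to_string());
    }
}
```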

testing/benchmarking notes

( if any manual testing or benchmarking was/should be done, add notes and/or screenshots here )

followups

( any new tickets/concerns that were discovered or created during this work but aren't in scope for review here )

changelog

  • if this is a code change that affects some consumer (e.g. zome developers) of holochain core, then it has been added to our between-release changelog with the format
- summary of change [PR#1234](https://github.com/holochain/holochain-rust/pull/1234)

documentation

@lucksus changed the title from "Use a thread pool to throttle number of threads that get spawned dynamically" to "Make HC resilient against high loads by managing threads" on Nov 12, 2019
@lucksus changed the title to "Make HC resilient against high loads by managing threads and locks" on Nov 13, 2019
@lucksus marked this pull request as ready for review on Nov 14, 2019 13:13
@lucksus changed the title to "Fix lots of deadlocks by managing threads and encapsulating locks" on Nov 14, 2019
@zippy (Member) left a comment:

Changes requested:

  1. renaming and commenting the 10 second and 1 second scheduled jobs for state dump and pending validations
  2. concurrency issues with call to reduce_pop_next_holding_workflow

Review thread on crates/core/src/network/mod.rs (outdated, resolved)
@maackle (Collaborator) left a comment:

Looks good, one small question

Review thread on crates/core/src/dht/dht_reducers.rs (resolved):
```diff
 // Mutate state
 {
     let new_state: StateWrapper;

     // Get write lock
     let mut state = self
         .state
-        .write()
-        .expect("owners of the state RwLock shouldn't panic");
+        .try_write_until(Instant::now().checked_add(Duration::from_secs(10)).unwrap())
```
A collaborator commented:
can we make this a const?
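One way to do that, as an illustrative sketch (the const name is not from the PR):

```rust
use std::time::{Duration, Instant};

// Named const instead of a magic number in the try_write_until call.
const STATE_WRITE_TIMEOUT: Duration = Duration::from_secs(10);

fn state_write_deadline() -> Instant {
    Instant::now()
        .checked_add(STATE_WRITE_TIMEOUT)
        .expect("deadline overflowed")
}
```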
