Fix lots of deadlocks by managing threads and encapsulating locks #1852
Merged
Conversation
lucksus changed the title from "Use a thread pool to throttle number of threads that get spawned dynamically" to "Make HC resilient against high loads by managing threads" on Nov 12, 2019
…ill all the deadlocks
lucksus changed the title from "Make HC resilient against high loads by managing threads" to "Make HC resilient against high loads by managing threads and locks" on Nov 13, 2019
lucksus changed the title from "Make HC resilient against high loads by managing threads and locks" to "Fix lots of deadlocks by managing threads and encapsulating locks" on Nov 14, 2019
zippy suggested changes on Nov 14, 2019
Changes requested:
- renaming and commenting the 10 second and 1 second scheduled jobs for state dump and pending validations
- concurrency issues with the call to reduce_pop_next_holding_workflow … popping the wrong queued holding item because of a race condition (see the sketch below)
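A hedged sketch of the kind of race this points at; the names PendingValidation, pop_front_blindly and pop_exact are illustrative, not the crate's actual API. If the reducer blindly pops the front of the queue, a concurrent change to the queue can make it remove a different item than the one that was actually processed.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Illustrative stand-in for a queued holding workflow / pending validation.
#[derive(Clone, Debug, PartialEq)]
struct PendingValidation(String);

// Racy variant: pops whatever happens to be at the front right now,
// which may no longer be the item the caller just finished processing.
fn pop_front_blindly(queue: &Arc<Mutex<VecDeque<PendingValidation>>>) -> Option<PendingValidation> {
    queue.lock().unwrap().pop_front()
}

// Safer variant: remove exactly the item that was processed, under a single
// lock acquisition, so concurrent pushes/pops can't shift what gets removed.
fn pop_exact(
    queue: &Arc<Mutex<VecDeque<PendingValidation>>>,
    expected: &PendingValidation,
) -> Option<PendingValidation> {
    let mut q = queue.lock().unwrap();
    let pos = q.iter().position(|item| item == expected)?;
    q.remove(pos)
}
```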
zippy approved these changes on Nov 14, 2019
StaticallyTypedAnxiety approved these changes on Nov 14, 2019
dymayday reviewed on Nov 14, 2019
maackle approved these changes on Nov 14, 2019
Looks good, one small question
```diff
 // Mutate state
 {
     let new_state: StateWrapper;

     // Get write lock
     let mut state = self
         .state
-        .write()
-        .expect("owners of the state RwLock shouldn't panic");
+        .try_write_until(Instant::now().checked_add(Duration::from_secs(10)).unwrap())
```
can we make this a const?
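A minimal sketch of what making the timeout a const could look like, assuming a parking_lot-style RwLock (whose try_write_until(Instant) -> Option<guard> matches the call in the diff; the project wraps its locks, so the types and the constant name STATE_WRITE_TIMEOUT here are illustrative only):

```rust
use std::time::{Duration, Instant};
use parking_lot::RwLock;

// Hypothetical constant replacing the hard-coded 10 seconds in the diff above.
const STATE_WRITE_TIMEOUT: Duration = Duration::from_secs(10);

struct StateWrapper; // stand-in for the real state type

struct Instance {
    state: RwLock<StateWrapper>,
}

impl Instance {
    fn mutate_state(&self) {
        let deadline = Instant::now()
            .checked_add(STATE_WRITE_TIMEOUT)
            .expect("deadline overflowed Instant");
        // Wait at most STATE_WRITE_TIMEOUT for the write lock instead of blocking forever.
        let mut state = self
            .state
            .try_write_until(deadline)
            .expect("timed out waiting for the state write lock");
        // ...mutate the state through `state` here...
        *state = StateWrapper;
    }
}
```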
PR summary
1. Limit the number of threads that get spawned dynamically under heavy load
2. Queue holding workflows in the NucleusState / DhtStore
3. and consume/run only one at a time
4. Change Instance::process_action to return a result and replace the panic with an error, which fixes shutdown deadlocks and makes the Rust tests pass again
5. Stop handing out lock guards from Context::state() and instead return copies
Limiting threads and workflows (points 1-3)
This started with the idea to use a thread pool in order to not have the conductor spawn way too many threads under heavy load. I quickly realized that threads are the wrong level of granularity to put a limit on, since we have high-level workflows (such as calling zome functions and holding entries) that consist of one root thread, but that might also spawn a number of sub-threads which the root thread then waits on for completion. Not limiting those workflows but limiting the threads they spawn creates another source of deadlocks. Hence points 2 and 3: queue the holding workflows and run only one at a time (sketched below).
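A minimal sketch of that queue-and-run-one-at-a-time idea, with illustrative types (the real queue lives in the conductor state, not in a bare VecDeque): producers only enqueue holding workflows, and a single consumer pops and runs at most one of them at a time instead of spawning a thread per incoming entry.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative stand-in for a queued holding workflow.
struct HoldingWorkflow {
    entry_address: String,
}

fn run_holding_workflow(wf: &HoldingWorkflow) {
    // ...validate and hold the entry; this may itself spawn and join sub-threads...
    println!("holding {}", wf.entry_address);
}

fn main() {
    let queue: Arc<Mutex<VecDeque<HoldingWorkflow>>> = Arc::new(Mutex::new(VecDeque::new()));

    // Producers (e.g. network handlers) only enqueue; they never spawn a thread per entry.
    queue.lock().unwrap().push_back(HoldingWorkflow {
        entry_address: "entry-1".into(),
    });

    // A single consumer thread pops and runs at most one holding workflow at a time.
    let consumer_queue = Arc::clone(&queue);
    let consumer = thread::spawn(move || loop {
        let next = consumer_queue.lock().unwrap().pop_front();
        match next {
            Some(wf) => run_holding_workflow(&wf),
            // The real loop would wait for new work here; the sketch just stops.
            None => break,
        }
    });

    consumer.join().unwrap();
}
```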
try-lock in action loop
After adding those limits on workflows, I could get much further into our stress tests, but Rust tests started failing. This seemed to be a problem that only happens during shutdown: the redux action loop would panic after not being able to get a lock on the state while the instance was already shutting down. I've changed the signature of Instance::process_action() to not deal with observers any more (they are handled in the loop that calls it) and to return a result instead, turning the panic on a failed lock acquisition into an error. This way the redux action loop can simply run again if it can't get a lock; it might then notice that we already got a kill signal and stop trying to get a lock on something that isn't there any more. This fixed the failing Rust test.
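A hedged sketch of that change with simplified types and names (the real Instance::process_action signature differs): the processing step acquires the state lock with a try-variant and returns an error instead of panicking, and the surrounding loop checks its kill channel before carrying on.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::sync::{Arc, RwLock};

struct State;  // stand-in for the redux state
struct Action; // stand-in for a redux action

#[derive(Debug)]
enum ProcessError {
    StateLockUnavailable,
}

// Returns an error instead of panicking when the state lock cannot be acquired.
fn process_action(state: &Arc<RwLock<State>>, _action: Action) -> Result<(), ProcessError> {
    let _guard = state
        .try_write()
        .map_err(|_| ProcessError::StateLockUnavailable)?;
    // ...reduce the action into the state...
    Ok(())
}

fn action_loop(state: Arc<RwLock<State>>, actions: Receiver<Action>, kill: Receiver<()>) {
    for action in actions {
        if let Err(err) = process_action(&state, action) {
            // We couldn't get the lock; before looping again, check whether a
            // kill signal already arrived and the state is simply going away.
            match kill.try_recv() {
                Ok(()) | Err(TryRecvError::Disconnected) => break,
                Err(TryRecvError::Empty) => eprintln!("process_action failed: {:?}", err),
            }
        }
    }
}
```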
Not passing around lock-guards
With this change, Context::state() no longer hands out lock guards (which is asking for deadlocks) and instead returns a copy. This is actually not expensive at all, since the State consists almost only of Arcs at the top level anyway. This removed several deadlocks we kept seeing in the stress tests.
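A minimal sketch of that pattern with simplified types (the real State and Context are much richer): the lock stays inside the context, and callers get a cheap copy because the state's fields are Arcs.

```rust
use std::sync::{Arc, RwLock};

// Cloning this is cheap: only Arc reference counts are bumped.
#[derive(Clone)]
struct State {
    nucleus: Arc<Vec<String>>, // stand-ins for the real sub-states
    dht: Arc<Vec<String>>,
}

struct Context {
    state: RwLock<Option<State>>,
}

impl Context {
    // Hands out a copy of the state instead of a lock guard, so callers can
    // never hold the state lock while taking other locks.
    fn state(&self) -> Option<State> {
        let guard = self.state.read().expect("state lock poisoned");
        // The guard is dropped when this function returns; only the copy escapes.
        (*guard).clone()
    }
}
```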
We were left with deadlocks involving the CAS/EAV storages.
The remaining commits do the same (hiding the locks locally and not passing around lock guards) for the CAS locks too; the sketch below illustrates the pattern.
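The same encapsulation, sketched for a content-addressable store (illustrative types, not the actual CAS trait): the lock is a private field and each method takes and releases it internally, so no guard ever crosses the API boundary.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Illustrative CAS wrapper mapping address -> content; the lock never leaks out.
struct CasStorage {
    inner: RwLock<HashMap<String, String>>,
}

impl CasStorage {
    fn add(&self, address: String, content: String) {
        // Lock is acquired and released inside the method; no guard escapes.
        self.inner
            .write()
            .expect("CAS lock poisoned")
            .insert(address, content);
    }

    fn fetch(&self, address: &str) -> Option<String> {
        self.inner
            .read()
            .expect("CAS lock poisoned")
            .get(address)
            .cloned()
    }
}
```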
testing/benchmarking notes
( if any manual testing or benchmarking was/should be done, add notes and/or screenshots here )
followups
( any new tickets/concerns that were discovered or created during this work but aren't in scope for review here )
changelog
documentation