table: fix deadlocks caused by lock fairness #2156
Merged
Alright, story time :D
Background
Some context first: we have a simple application that runs on two threads:

- Thread 0 calls `ctx.request_repaint()` once in a while, depending on various external factors that are irrelevant to this issue.
- Thread 1 renders a `Table` every frame.

Problem is: more often than not, that application will deadlock.
In particular, it seems the more thread 0 tries to request repaints, the more likely we are to deadlock.
The situation just described can be dumbed down to this:
and, indeed, that piece of code will always deadlock without fail.
The fix
The solution to all our problems is the following..
..which really just raises more questions.
First off, here's how the compiler desugars that condition:
i.e. we end up with two read guards that live concurrently until the end of the scope, the second of which effectively behaves as a reentrant read.
That still doesn't explain why or how we end up with a deadlock.
Let's recap the situation: we have one thread fighting for the write side of the lock, and two readers on another thread fighting for the read side. Sure, we expect some contention, but besides that this shouldn't be an issue... unless `parking_lot`'s `RwLock`s aren't reentrant? But they are though, right?

Well, our fix is saying otherwise... gotta dig deeper.
Is `parking_lot::RwLock` reentrant??
At this point we can just take `egui` out of the equation and focus on `parking_lot`:
:This will reproduce the deadlock all the same.
Let's take a step back: intuitively, it seems that an `RwLock` should be read-reentrant, right? Why wouldn't it be? After all, it can have all the readers it wants!

In fact, the example shown in `parking_lot`'s documentation seems to agree with that first intuition. Even better: that example is doing exactly the same thing as our condition from earlier!
So, `parking_lot`'s `RwLock`s: read-reentrant or not? Weeeell... it depends.
In particular, it depends on whether there is another thread concurrently trying to grab the lock exclusively.
This has to do with how `parking_lot` handles fairness internally: the implementation is biased towards writers.
When a thread tries to grab an `RwLock` exclusively, a flag is immediately set on the lock to block future readers, thereby preventing writer starvation.
At this point, the writer-to-be is guaranteed to get exclusivity of the lock at some point, and just waits there for existing readers to leave.
Knowing this, it's pretty straightforward to infer what's going on in our case, based on the desugared condition:

1. Thread 1 grabs `guard1`.
2. Thread 0 tries to grab the lock exclusively: the writer flag is set, and thread 0 now waits for `guard1` to leave.
3. Thread 1 tries to grab `guard2`, except it can't (the writer flag blocks new readers), so it waits.

Thread 0 is waiting for `guard1` to be released, but thread 1 won't release `guard1` until it manages to grab `guard2`: deadlock.
Make sure those two read-guards don't live concurrently, and the issue goes away.
Follow up
This issue is known and has already been documented here.
Though I feel the current example in the documentation is quite misleading; maybe I'll communicate that upstream.
(In fairness to `parking_lot` though, the documentation of the method itself does clearly state that it is not reentrant!)

While debugging this, I stumbled upon `parking_lot`'s experimental deadlock detection feature, which works quite well and is well aware of `parking_lot`'s implementation details, contrary to ours.
I'll publish a follow-up PR to replace our custom detection with theirs when running on native.
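For reference, using that feature looks roughly like this: `deadlock_detection` is an off-by-default cargo feature, and checks run on demand via `parking_lot::deadlock::check_deadlock()` (a sketch adapted from `parking_lot`'s docs; the polling interval is arbitrary):

```rust
// Cargo.toml:
//     parking_lot = { version = "0.12", features = ["deadlock_detection"] }
use std::{thread, time::Duration};

fn main() {
    // A background thread that periodically checks for deadlocks.
    thread::spawn(|| loop {
        thread::sleep(Duration::from_secs(10));
        let deadlocks = parking_lot::deadlock::check_deadlock();
        if deadlocks.is_empty() {
            continue;
        }
        eprintln!("{} deadlock(s) detected", deadlocks.len());
        for (i, threads) in deadlocks.iter().enumerate() {
            eprintln!("Deadlock #{i}");
            for t in threads {
                eprintln!("Thread Id {:#?}", t.thread_id());
                eprintln!("{:#?}", t.backtrace());
            }
        }
    });

    // ... the rest of the application ...
}
```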