Fix race when a rapids buffer is aliased while it is spilled #9084
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Closes #9082
Likely root cause for #8939
This fixes a regression introduced #8936 where
RapidsBuffer.free
is being invoked outside of the catalog lock. Because we allow aliasing of buffers (re-adding the same buffer doesn't create a newRapidsBuffer
) we were aliasing a buffer that had been spilled and removed from the catalog, leading to aSpillableColumnarBatch
pointing to aRapidsBuffer
that wasn't valid and would lead to task exceptions while trying to acquire this buffer.This makes it so we hold the catalog lock at a higher level than before, which includes the call to free.
There is a slight change in behavior from before as well. Before this change we would go into a loop where several threads would satisfy a
target < spillable
test (so we need to reduce the store size to approachtarget
). One thread could begin spilling to target, releasing the lock each time during this loop and rechecking the catalog size each time before taking the lock. If two threads are racing it was not deterministic who would spill (both could spill some then one would take over or one could win from the start)In this PR instead we lock higher level and 1 thread is allowed to spill, others are told to retry. The thread that is allowed to spill does the same check and drives the store size to match
target
.