Fix race when a rapids buffer is aliased while it is spilled #9084

Merged
merged 3 commits into from
Aug 23, 2023

@abellina abellina commented Aug 21, 2023

Closes #9082
Likely root cause for #8939

This fixes a regression introduced in #8936, where RapidsBuffer.free is invoked outside of the catalog lock. Because we allow aliasing of buffers (re-adding the same buffer doesn't create a new RapidsBuffer), we could alias a buffer that had already been spilled and removed from the catalog. That left a SpillableColumnarBatch pointing to a RapidsBuffer that was no longer valid, which led to task exceptions when trying to acquire the buffer.

This makes it so we hold the catalog lock at a higher level than before, which includes the call to free.
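To illustrate the locking shape described above, here is a minimal Python sketch. It is not the plugin's code: `ToyBuffer`, `ToyCatalog`, `add_or_alias` and `remove_and_free` are hypothetical names, and the toy collapses the catalog and alias tracking into a single dictionary. It does not reproduce the original race; it only shows removal and free sharing one critical section with the aliasing path.

```python
import threading


class ToyBuffer:
    """Hypothetical stand-in for a RapidsBuffer."""

    def __init__(self, key):
        self.key = key
        self.valid = True

    def free(self):
        self.valid = False


class ToyCatalog:
    """Toy catalog in which removal and free share one critical section."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_key = {}

    def add_or_alias(self, key):
        # Re-adding the same key returns the existing buffer (an alias)
        # instead of creating a new one.
        with self._lock:
            buf = self._by_key.get(key)
            if buf is None:
                buf = ToyBuffer(key)
                self._by_key[key] = buf
            return buf

    def remove_and_free(self, key):
        # Both steps run under the lock, so add_or_alias cannot observe
        # a buffer that has been removed but not yet freed.
        with self._lock:
            buf = self._by_key.pop(key, None)
            if buf is not None:
                buf.free()
            return buf


catalog = ToyCatalog()
first = catalog.add_or_alias("batch-0")
alias = catalog.add_or_alias("batch-0")     # same object as first
freed = catalog.remove_and_free("batch-0")  # removed and freed atomically
fresh = catalog.add_or_alias("batch-0")     # a new, valid buffer
```

If `free` ran after the lock was released, another thread calling the aliasing path in that window could be handed a buffer that is about to become invalid, which is the kind of state this PR rules out.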

There is also a slight change in behavior. Before this change, several threads could satisfy the target < spillable test (meaning the store size needs to be reduced toward target) and enter the spill loop. A thread would spill toward target, releasing the lock on each iteration and rechecking the catalog size before taking the lock again. If two threads were racing, it was not deterministic which one would spill: both could spill some and then one would take over, or one could win from the start.

In this PR we instead take the lock at a higher level and one thread is allowed to spill, while the others are told to retry. The thread that is allowed to spill does the same check and drives the store size to match target.
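The "one spiller, others retry" behavior can be sketched roughly as follows. This is an assumption-laden toy, not the PR's implementation: `ToyStore` and `spill_to_target` are hypothetical names, and the non-blocking lock acquire with a `None` return is just one way to signal "retry"; the actual change may communicate this differently.

```python
import threading


class ToyStore:
    """Toy store holding a list of spillable buffer sizes (hypothetical)."""

    def __init__(self, sizes):
        self._lock = threading.Lock()
        self._sizes = list(sizes)

    @property
    def spillable(self):
        return sum(self._sizes)

    def spill_to_target(self, target):
        """Return the bytes spilled, or None if the caller should retry."""
        # Only one thread may spill at a time; any other caller is told
        # to retry instead of interleaving with the current spiller.
        if not self._lock.acquire(blocking=False):
            return None
        try:
            spilled = 0
            # The lock is held for the whole loop, so a single thread
            # drives the store size down to the target.
            while self._sizes and self.spillable > target:
                spilled += self._sizes.pop()
            return spilled
        finally:
            self._lock.release()


store = ToyStore([40, 30, 20, 10])
spilled = store.spill_to_target(50)
remaining = store.spillable
```

Holding the lock across the whole loop trades some concurrency for determinism: exactly one caller performs the spill, and the rest re-evaluate the store size on their next attempt.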

@abellina
Collaborator Author

build

@abellina
Collaborator Author

build

@revans2 revans2 merged commit 6729888 into NVIDIA:branch-23.10 Aug 23, 2023
26 of 27 checks passed
@sameerz sameerz added the bug Something isn't working label Aug 24, 2023
mythrocks pushed a commit to mythrocks/spark-rapids that referenced this pull request Aug 24, 2023
Development

Successfully merging this pull request may close these issues.

[BUG] Race condition while spilling and aliasing a RapidsBuffer (regression)
4 participants