-
Notifications
You must be signed in to change notification settings - Fork 198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NO-OOM] Exploration of cudf-spilling + UVM #1504
base: branch-24.06
Are you sure you want to change the base?
Conversation
@madsbk thank you for this! Would you be able to extract and summarize the important performance numbers? Perhaps in a graph or table to make clear the performance landscape? |
Some more results. I am now calling I am measuring the total time of all the joins (5 repeats each) . The timings of join repeats vary quite a bit when using UVM. This is expected since we do not reset the memory page locations between repeats.
Raw data
The results show that, at least in this case, combining cudf-spilling with UVM clearly outperforms UVM-only without any real downside. I think the next step is to implement this in a more unintrusive way and test more workflows and hardware setups. E.g., how is the performance of cudf-spilling+UVM when running on multiple GPUs using UCX? |
@@ -203,7 +211,7 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_ | |||
|
|||
if (size <= 0) { return nullptr; } | |||
|
|||
lock_guard lock(mtx_); | |||
// lock_guard lock(mtx_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why comment out the lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is because of a deadlock that would otherwise trigger when the allocation results in cudf-spilling. In this case, cudf-spilling will find another buffer to spill and de-allocate its memory, which also
requires the lock.
I haven't given this too much thought, but I think this could be handled with a reentrant lock.
In order to avoid out-of-memory errors, we try to combine cudf-spilling and UVM in this PR.
Goal
The approach
Preliminary results
Running an extended version of @wence-'s join benchmark https://gist.github.com/madsbk/9589dedc45dbcfc828f7274ce3bdabc6 on a DGX-1 (32GiB).
Running with three configs:
--use-pool --base-memory-resource cuda --use-spilling
--use-pool --base-memory-resource managed --use-spilling
--use-pool --base-memory-resource managed --no-use-spilling
TL;DR
When spilling isn’t needed, we see some overhead spikes when going from cudf-spilling to cudf-spilling+UVM and UVM-only but overall, the performance is very much on par. It might be possible to avoid these spikes using
cudaMemAdvise()
.When spilling is needed, the performance of cudf-spilling and cudf-spilling+UVM is very similar but again, we see some overhead spikes.
Finally, when the peak memory usage exceeds device memory, cudf-spilling fails with an OOM error. In this case, we need UVM. The performance of cudf-spilling+UVM and UVM-only are similar but cudf-spilling+UVM is more variable.
Raw Numbers
Everything fits in device memory, no spilling
cudf-spilling
cudf-spilling+UVM
UVM-only
Spilling is required
cudf-spilling
cudf-spilling+UVM
UVM-only
UVM is required
cudf-spilling
cudf-spilling+UVM
UVM-only
NB
The current code is hacky. If this approach is viable, we should implement the functionality in a standalone rmm resource.