
[FEA] Allow spilling from device memory to host memory by default #334

Closed
beckernick opened this issue Jul 1, 2020 · 12 comments · Fixed by #439

@beckernick
Member

beckernick commented Jul 1, 2020

In some workflows, it's common to set a device_memory_limit to allow spilling from device to host memory once a memory threshold is reached. By default, the device_memory_limit is configured to be the total memory of the GPU, which functionally means that spilling is off. For workflows that are not memory constrained, this is nice, as it lets us use the GPU's full memory without trying to spill at all.

Many workflows are at least partially memory constrained, though, and with the default settings they run a significant risk of going out of memory. It would be nice to default device_memory_limit to a lower percentage of GPU memory so that spilling is triggered rather than going out of memory.

Anecdotally, it feels like 75 or 80% of total memory capacity might be a good default threshold. There may also be a better way to achieve spilling by default. I'm interested to hear others' thoughts.
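For reference, a minimal sketch of how spilling is opted into explicitly today, assuming dask-cuda's LocalCUDACluster; the "24GB" value is only an illustrative ~75% threshold for a 32 GB GPU, not a default:

```python
# Minimal sketch: opting into device->host spilling explicitly today.
# "24GB" is an illustrative ~75% threshold for a 32 GB GPU, not a default.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(device_memory_limit="24GB")
client = Client(cluster)

# The corresponding worker CLI flag is --device-memory-limit on dask-cuda-worker.
```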

@pentschev
Member

Personally, I don't share the same thinking. While I see your point that going OOM is terrible, users will at least be notified of it. If we change the default so that spilling is always enabled, workflows will complete most of the time, but performance will definitely suffer, which may discourage newcomers from using dask-cuda in general.

To add to the problem, it is known that spilling is still very inefficient -- see #106 .

@pentschev
Member

It would be nice to see other opinions though.

cc @quasiben @jakirkham

@beckernick
Member Author

beckernick commented Jul 1, 2020

If we change the default so that spilling is always enabled, workflows will complete most of the time, but performance will definitely suffer, which may discourage newcomers from using dask-cuda in general.

This is a good point, and something I thought about too. The performance / user-experience tradeoff is important to consider. It might be valuable to learn from mainline Dask's experience setting defaults. I believe the spill-to-disk behavior is by default set to occur at 60% of CPU memory capacity. @quasiben may be able to comment more on how that threshold was selected.
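For comparison, a hedged look at where those mainline Dask thresholds live, assuming the standard distributed.worker.memory.* config keys (importing distributed registers its defaults):

```python
# Inspect Dask's host-memory spill thresholds (fractions of the worker memory limit).
import dask
import dask.distributed  # noqa: F401  # registers distributed's default config

print(dask.config.get("distributed.worker.memory.target"))  # ~0.60: start spilling LRU data to disk
print(dask.config.get("distributed.worker.memory.spill"))   # ~0.70: spill based on process memory

# The same mechanism can override them, analogous to tuning device_memory_limit:
dask.config.set({"distributed.worker.memory.target": 0.60,
                 "distributed.worker.memory.spill": 0.70})
```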

@quasiben
Member

quasiben commented Jul 1, 2020

I am not sure how Dask selected the 60% default threshold (happy to ask around). If I had to guess, it may be because Dask's defaults were chosen so that it runs out of the box without too many errors, with fine-tuning happening once a workflow is established.

Perhaps this point, about whether to crash due to OOM versus complete slowly and fine-tune later, should be considered. I can honestly see arguments for both. I would suggest leaving this issue open for some time to gather more feedback from other dask-cuda users. If users are always setting some device memory limit, it may make sense to change to a reasonable default.

@jakirkham
Member

I think some of us have discussed this before, but I will resurface it here as well. We could consider handling OOM in Dask. Namely, when a task raises a MemoryError, we use that as an opportunity to spill and retry (dask/distributed#3612). There also seems to be some appetite for this behavior amongst the larger community. This could be slow, as others have noted, but it would provide a path forward. If we want this behind a toggleable flag so users can make the choice themselves, that seems like a reasonable compromise to me at least. Thoughts? 🙂
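Purely as an illustration of the spill-and-retry idea (not the dask/distributed#3612 implementation; spill_to_host is a hypothetical callback):

```python
# Hypothetical sketch of spill-and-retry: names like `spill_to_host` are
# placeholders, not dask/distributed API.
def run_with_spill_retry(task, *args, spill_to_host, max_retries=1, **kwargs):
    for attempt in range(max_retries + 1):
        try:
            return task(*args, **kwargs)
        except MemoryError:
            if attempt == max_retries:
                raise  # give up and surface the OOM to the user
            spill_to_host()  # evict device buffers to host memory, then retry
```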

@pentschev
Member

I think the spill and retry on MemoryError is a nice feature to have. Regardless of that feature, after many years of writing CUDA code my bias is always towards making things implicitly fast and allowing users to explicitly configure things that make them slower when that provides some additional benefit -- in this case, the benefit of not going OOM. Explicit errors are much easier to understand and adapt to with explicit configuration, whereas an implicit slowdown is much harder to track down.

@beckernick
Member Author

Anecdotally, here is a report of a user setting a device memory limit of 16 GB on a 32 GB V100 Azure setup.

https://stackoverflow.com/questions/62695360/how-can-i-use-xgboost-dask-with-gpu-to-model-a-very-large-dataset-in-both-a-dist

@jakirkham
Member

Though xgboost doesn’t use RMM today, which makes it a bit hard to tell what the issue is. Are we just running into memory contention between xgboost and RMM?

@beckernick
Member Author

I only included that example as a data point on the device memory limit threshold (though it appears they may be using 16 GB V100s in that example). That user doesn't appear to be using RMM, from what I can tell.

@jakirkham
Member

Sure, I guess what I'm saying is that some use cases involve a few different libraries with different memory management techniques; when used together, this can make the overall workflow more sensitive to memory usage.

@beckernick
Member Author

Sure, I guess what I'm saying is that some use cases involve a few different libraries with different memory management techniques; when used together, this can make the overall workflow more sensitive to memory usage.

Very true. Super important to keep this in mind across the board.

@jakirkham
Member

For the XGBoost case at least, there's some work going on in PR dmlc/xgboost#5873 to support RMM.
