
[FEA] Allow spilling from device memory to host memory by default #334

Closed
beckernick opened this issue Jul 1, 2020 · 12 comments · Fixed by #439

@beckernick
Member

beckernick commented Jul 1, 2020

In some workflows, it's common to set a device_memory_limit to allow spilling from device to host memory once a memory threshold is reached. By default, the device_memory_limit is configured to be the total memory of the GPU, which functionally means that spilling is off. For workflows that are not memory constrained, this is nice, as it lets us use the GPU's full memory without trying to spill at all.

Many workflows are at least partially memory constrained, though, and with the default settings they run a significant risk of going out of memory. It would be nice to default device_memory_limit to a lower percentage of GPU memory so that spilling is triggered rather than going out of memory.

Anecdotally, it feels like 75 or 80% of total memory capacity might be a good default threshold. There may also be a better way to achieve spilling by default. I'm interested to hear others' thoughts.
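For reference, a minimal sketch of how spilling is opted into explicitly today, assuming dask-cuda's LocalCUDACluster; the "24GB" value is only an illustrative ~75% threshold for a 32 GB GPU, not a default:

```python
# Minimal sketch: opting into device->host spilling explicitly today.
# "24GB" is an illustrative ~75% threshold for a 32 GB GPU, not a default.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(device_memory_limit="24GB")
client = Client(cluster)

# The corresponding worker CLI flag is --device-memory-limit on dask-cuda-worker.
```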

@pentschev
Member

Personally, I don't share the same thinking. While I see your point that going OOM is terrible, users will at least be notified of it. If we change the default so that spilling is always enabled, workflows will complete most of the time, but performance will definitely suffer, which may discourage newcomers from using dask-cuda in general.

To add to the problem, it is known that spilling is still very inefficient -- see #106 .

@pentschev
Member

It would be nice to see other opinions though.

cc @quasiben @jakirkham

@beckernick
Member Author

beckernick commented Jul 1, 2020

If we change the default so that spilling is always enabled, workflows will complete most of the time, but performance will definitely suffer, which may discourage newcomers from using dask-cuda in general.

This is a good point, and something I thought about too. The performance / user-experience tradeoff is important to consider. It might be valuable to learn from mainline Dask's experience setting defaults. I believe the spill-to-disk behavior is by default set to occur at 60% of CPU memory capacity. @quasiben may be able to comment more on how that threshold was selected.
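For comparison, a hedged look at where those mainline Dask thresholds live, assuming the standard distributed.worker.memory.* config keys (importing distributed registers its defaults):

```python
# Inspect Dask's host-memory spill thresholds (fractions of the worker memory limit).
import dask
import dask.distributed  # noqa: F401  # registers distributed's default config

print(dask.config.get("distributed.worker.memory.target"))  # ~0.60: start spilling LRU data to disk
print(dask.config.get("distributed.worker.memory.spill"))   # ~0.70: spill based on process memory

# The same mechanism can override them, analogous to tuning device_memory_limit:
dask.config.set({"distributed.worker.memory.target": 0.60,
                 "distributed.worker.memory.spill": 0.70})
```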

@quasiben
Member

quasiben commented Jul 1, 2020

I am not sure how Dask selected the 60% default threshold (happy to ask around). If I had to guess, it may be because Dask's defaults were chosen so that it runs out of the box without too many errors, with fine-tuning happening once a workflow is established.

Perhaps this point, about whether to crash due to OOM versus complete slowly and fine-tune later, should be considered. I can honestly see arguments for both. I would suggest leaving this issue open for some time to gather more feedback from other dask-cuda users. If users are always setting some device memory limit, it may make sense to change to a reasonable default.

@jakirkham
Member

I think some of us have discussed this before, but I will resurface it here as well. We could consider handling OOM in Dask. Namely, when a task raises a MemoryError, we use that as an opportunity to spill and retry (dask/distributed#3612). There also seems to be some appetite for this behavior amongst the larger community. This could be slow, as others have noted, but it would provide a path forward. If we want this behind a toggleable flag so users can make the choice themselves, that seems like a reasonable compromise to me at least. Thoughts? 🙂
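Purely as an illustration of the spill-and-retry idea (not the dask/distributed#3612 implementation; spill_to_host is a hypothetical callback):

```python
# Hypothetical sketch of spill-and-retry: names like `spill_to_host` are
# placeholders, not dask/distributed API.
def run_with_spill_retry(task, *args, spill_to_host, max_retries=1, **kwargs):
    for attempt in range(max_retries + 1):
        try:
            return task(*args, **kwargs)
        except MemoryError:
            if attempt == max_retries:
                raise  # give up and surface the OOM to the user
            spill_to_host()  # evict device buffers to host memory, then retry
```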

@pentschev
Member

I think the spill and retry on MemoryError is a nice feature to have. Regardless of that feature, after many years of writing CUDA code my bias is always towards making things implicitly fast and allowing users to explicitly configure things that make them slower when that provides some additional benefit -- in this case, the benefit of not going OOM. Explicit errors are much easier to understand and adapt to with explicit configuration, whereas an implicit slowdown is much harder to track down.

@beckernick
Member Author

Anecdotally, here is a report of a user setting a device memory limit of 16 GB on a 32 GB V100 Azure setup.

https://stackoverflow.com/questions/62695360/how-can-i-use-xgboost-dask-with-gpu-to-model-a-very-large-dataset-in-both-a-dist

@jakirkham
Member

Though xgboost doesn’t use RMM today, which makes it a bit hard to tell what the issue is. Are we just running into memory contention between xgboost and RMM?

@beckernick
Member Author

I only included that example as a data point on the device memory limit threshold (though it appears they may be using 16 GB V100s in that example). That user doesn't appear to be using RMM, from what I can tell.

@jakirkham
Member

Sure, I guess what I'm saying is that some use cases involve a few different libraries with different memory management techniques; when used together, this can make the overall workflow more sensitive to memory usage.

@beckernick
Member Author

Sure, I guess what I'm saying is that some use cases involve a few different libraries with different memory management techniques; when used together, this can make the overall workflow more sensitive to memory usage.

Very true. Super important to keep this in mind across the board.

@jakirkham
Member

For the XGBoost case at least, there's some work going on in PR dmlc/xgboost#5873 to support RMM.
