[FEA] Consider using different default values for cluster configurations #348

randerzander · 2020-07-24T15:54:42Z

In our gpu-bdb benchmarking, we found we needed to configure quite a few bash environment variables when setting up a dask-cuda cluster for optimal performance.

For example, if you're not using UCX, it's likely that GPU worker to GPU worker communication over TCP is at least higher latency, if not slower. To avoid TCP connection failures we needed to reconfigure Dask's "COMM" settings.

export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT="100s"
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP="600s"
export DASK_DISTRIBUTED__COMM__RETRY__DELAY__MIN="1s"
export DASK_DISTRIBUTED__COMM__RETRY__DELAY__MAX="60s"

I suspect a great many dask-cuda users will be using TCP and not UCX, so would benefit from dask-cuda automatically re-configuring Dask's TCP defaults.

We also need to set --memory-limit, --device-memory-limit, and --rmm-pool-size values:

MAX_SYSTEM_MEMORY=$(free -m | awk '/^Mem:/{print $2}')M
DEVICE_MEMORY_LIMIT="25GB"
POOL_SIZE="30GB"

dask-cuda-worker --device-memory-limit $DEVICE_MEMORY_LIMIT --rmm-pool-size=$POOL_SIZE --memory-limit=$MAX_SYSTEM_MEMORY ...

It would be nice if dask-cuda included logic for detecting GPU memory per card and setting default values for device memory limit and RMM pool size. As a starting point, for a 32GB card, 25GB memory limit with 30GB pool size work well for the 30 tpcx-bb queries. I believe that's a decent representation of typical workflows where those values could be used proportionally for other card memory sizes as well.

The text was updated successfully, but these errors were encountered:

quasiben · 2020-07-24T17:02:38Z

I think having some of these defaults makes sense . In dask core --memory-limit is set to auto by default .

In [1]: import pynvml

In [2]: pynvml.nvmlInit()

In [3]: handle = pynvml.nvmlDeviceGetHandleByIndex(0)

In [4]: info = pynvml.nvmlDeviceGetMemoryInfo(handle)

In [5]: info.total
Out[6]: 34089730048

maybe rmm-pool-size = Total*90%. @pentschev @jakirkham do you have thoughts here ?

randerzander · 2020-07-24T18:15:35Z

@mt-jones may remember why we had to set --memory-limit specifically. There might have been a bug where it wasn't being set properly, or the heuristic Dask uses is inappropriate for, say a DGX1.

pentschev · 2020-07-24T18:32:42Z

While I understand how changing such defaults make sense for TPCx-BB, I'm not so sure this makes sense for everyone, I don't think it will be difficult to find people trying to run dask-cuda together with other applications that also require memory (either device or host), and that could be a bit annoying for users. Think for example of someone running this on a workstation for which one of the GPUs is also used as display. That's not to say I'm totally against the idea, but I'm just a bit concerned about the broader usability of dask-cuda.

The --memory-limit issue was really a bug that got fixed in #269 .

As for the communication variables, I think they were not for the workers (or not only, perhaps), but for the scheduler, am I mistaken? Changing the defaults for the scheduler would have to go to distributed upstream. As for workers, if we change those values we should probably check what would be really appropriate numbers, I have a feeling those were just defined arbitrarily high to make things work, was that not the case? And when we scale, would we have to scale the numbers too?

pentschev · 2020-07-24T18:34:42Z

I was just thinking after writing the comment above, maybe we should have some sort of "default recipes" for different use cases? E.g., the TPCx-BB case could use a "performance recipe", while today's defaults could be something like a "conservative recipe" and so on. I think this would alleviate complexity for multiple uses while still providing defaults that should work mostly anywhere and have an easy switch.

pentschev · 2020-07-24T20:19:23Z

I forgot to mention another related (potentially duplicate) issue: #334 .

github-actions · 2021-02-16T19:08:56Z

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions · 2021-05-17T19:10:33Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

pentschev added the proposal A code change suggestion to be considered or discussed label Jan 8, 2021

github-actions bot added the inactive-30d label Feb 16, 2021

github-actions bot added the inactive-90d label May 17, 2021

quasiben mentioned this issue Oct 18, 2022

Configuration Profiles #1020

Open

caryr35 added this to dask-cuda Dec 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Consider using different default values for cluster configurations #348

[FEA] Consider using different default values for cluster configurations #348

randerzander commented Jul 24, 2020 •

edited

Loading

quasiben commented Jul 24, 2020

randerzander commented Jul 24, 2020

pentschev commented Jul 24, 2020

pentschev commented Jul 24, 2020

pentschev commented Jul 24, 2020

github-actions bot commented Feb 16, 2021

github-actions bot commented May 17, 2021

[FEA] Consider using different default values for cluster configurations #348

[FEA] Consider using different default values for cluster configurations #348

Comments

randerzander commented Jul 24, 2020 • edited Loading

quasiben commented Jul 24, 2020

randerzander commented Jul 24, 2020

pentschev commented Jul 24, 2020

pentschev commented Jul 24, 2020

pentschev commented Jul 24, 2020

github-actions bot commented Feb 16, 2021

github-actions bot commented May 17, 2021

randerzander commented Jul 24, 2020 •

edited

Loading