
cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable #1342

Open
trivialfis opened this issue May 30, 2024 · 2 comments


@trivialfis (Member)

Hi, I'm running a CuPy test with Dask, but I ran into a CUDA error that's only reproducible with dask-cuda, not with CuPy alone, and I'm hoping I can get some help here. The error:

2024-05-30 07:53:26,923 - distributed.worker - WARNING - Compute Failed
Key:       ('uniform-fb64484a444179abec146c7c4ac41c23', 103, 0)
Function:  _apply_random_func
args:      (<class 'cupy.random._bit_generator.XORWOW'>, 'uniform', SeedSequence(
    entropy=1,
    spawn_key=(103,),
), (65536, 256), [0.0, 1.0], {})
kwargs:    {}
Exception: "CUDARuntimeError('cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable')"

The script I'm running, along with the dependency versions:

import dask
from dask import array as da
import dask_cuda
from dask_cuda import LocalCUDACluster
from distributed import Client, wait
import cupy

print(dask.__version__)
print(dask_cuda.__version__)
print(cupy.__version__)

# 2024.1.1
# 24.04.00
# 13.1.0

def make_regression(n_samples: int, n_features: int) -> tuple[da.Array, da.Array]:
    rng = da.random.default_rng(1)
    X = rng.uniform(size=(n_samples, n_features), chunks=(256**2, -1))
    y = X.sum(axis=1)
    return X, y


def main(client: Client) -> None:
    X, y = make_regression(n_samples=2**25, n_features=256)
    X, y = client.persist([X, y])
    wait([X, y])

if __name__ == "__main__":
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            with dask.config.set({"array.backend": "cupy"}):
                main(client)

If I run CuPy alone, it generates random numbers just fine:

>>> import cupy
>>> rng = cupy.random.default_rng(1)
>>> rng.uniform(size=256**2)
array([0.85849577, 0.01410829, 0.28965574, ..., 0.6419934 , 0.35287664,
       0.16616132])

Lastly, I'm running the test on an EGX cluster with 4 T4s:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
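As a side note (not part of the original report): one commonly reported trigger for cudaErrorDevicesUnavailable is a GPU set to an exclusive compute mode while another process already holds a context on it. A minimal diagnostic sketch, assuming CuPy is installed and a GPU is visible; it degrades gracefully otherwise:

```python
# Query each GPU's compute mode through CuPy. Mode 3
# (cudaComputeModeExclusiveProcess) limits the device to a single CUDA
# context, which can make a second worker's calls fail with
# cudaErrorDevicesUnavailable.
try:
    import cupy

    modes = {0: "Default", 1: "Exclusive", 2: "Prohibited", 3: "ExclusiveProcess"}
    for i in range(cupy.cuda.runtime.getDeviceCount()):
        mode = cupy.cuda.Device(i).attributes["ComputeMode"]
        print(f"GPU {i}: compute mode = {modes.get(mode, mode)}")
except Exception as exc:  # no CuPy / no GPU in this environment
    print(f"could not query devices: {exc}")
```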
@quasiben (Member)

I haven't seen that error before, but I suspect it's due to the older driver/runtime. I see that 495 should be valid under CUDA Enhanced Compatibility and Minor Version Compatibility; still, I'm suspicious that this is the issue. Would it be possible to try with toolkit 11.8 / 12+?
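To quickly test the driver/runtime mismatch theory, CuPy can report both the CUDA version the installed driver supports and the runtime version it was built against. A minimal sketch, assuming CuPy is installed; it skips gracefully otherwise:

```python
# Compare the CUDA version supported by the driver against the CUDA
# runtime version. A driver older than the runtime relies on minor-version
# compatibility, which is the suspected mismatch here.
try:
    import cupy

    drv = cupy.cuda.runtime.driverGetVersion()   # e.g. 11050 -> CUDA 11.5
    rt = cupy.cuda.runtime.runtimeGetVersion()
    print(f"driver supports CUDA {drv // 1000}.{(drv % 1000) // 10}")
    print(f"runtime is CUDA      {rt // 1000}.{(rt % 1000) // 10}")
except Exception as exc:  # no CuPy / no GPU in this environment
    print(f"could not query CUDA versions: {exc}")
```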

@pentschev (Member)

@quasiben might be right; I'm unable to reproduce the error with driver 535.161.08 / Dask-CUDA 24.08 / CUDA 11.8. This is the output I see for the sample in the original description:

2024.5.1
24.08.00a3
13.1.0
2024.5.1
24.08.00a3
13.1.0
2024.5.1
24.08.00a3
13.1.0
2024.5.1
2024.5.1
2024.5.1
24.08.00a3
24.08.00a3
24.08.00a3
13.1.0
13.1.0
13.1.0
2024.5.1
24.08.00a3
13.1.0
2024.5.1
2024.5.1
24.08.00a3
24.08.00a3
13.1.0
13.1.0

@trivialfis could you confirm whether this is reproducible on a newer RAPIDS release, or on the same RAPIDS version on a machine with different drivers?
