[QST] How to Run Multiple Concurrent Tasks on a GPU in Dask cuML #6112

Open
mnlcarv opened this issue Oct 15, 2024 · 2 comments
Labels
? - Needs Triage (Need team to review and classify) · question (Further information is requested)

Comments

mnlcarv commented Oct 15, 2024

I want to test the execution of multiple concurrent tasks on the GPU in Dask cuML. I'm using K-means (code below), and I'm changing the chunk size so that multiple tasks are created for the fit method and run on the GPU.

from cuml.dask.cluster import KMeans as cumlKMeans
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import dask.array as da
import time

def main():

    # one worker pinned to GPU 0, with 8 CPU threads to allow concurrent tasks
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0", n_workers=1, threads_per_worker=8)
    client = Client(cluster)

    n_samples = 12500
    n_features = 100
    start_random_state = 170
    # n_block_rows = 12500 # Case 1 (1 task)
    n_block_rows = 1563 # Case 2 (8 tasks)
    n_clusters = 100

    # random input chunked by rows; each chunk should map to one Dask task
    X = da.random.random((n_samples, n_features), chunks=(n_block_rows, n_features))
    dX = X.to_dask_dataframe().to_backend('cudf')

    kmeans_model = cumlKMeans(n_clusters=n_clusters, random_state=start_random_state, max_iter=5)

    wait(dX)  # materialize the input chunks before timing
    start_time = time.time()
    kmeans_model.fit(dX)
    wait(kmeans_model._func_fit)  # attempt to wait for any outstanding async fit work
    print(f"Fit took: {time.time() - start_time} sec")

    client.close()

if __name__ == '__main__':
    main()

Since I have only 1 GPU on my machine, I can only use 1 worker. To support multiple tasks in 1 worker, I increased the number of threads per worker to 8 (the number of CPU cores on my machine).

Basically, I'm testing the following cases:

Case 1: 1 task in a GPU
Case 2: 8 tasks in a GPU

I measured the execution time of the fit method directly in the code (also calling wait() to make sure any async task has finished), and case 1 is faster than case 2 for a toy dataset:

Case 1 (1 task): 1.54s
Case 2 (8 tasks): 2.49s

I also checked the time of the fit method displayed in the Dask dashboard (Task Stream tab):

Case 1 (1 task):  0.60s
Case 2 (8 tasks): 0.94s
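
One way I could sanity-check my own measurement (a sketch, not something I have run yet; the report filename is arbitrary) is to wrap the fit call in distributed's performance_report, which records the same task-level timings shown in the dashboard:

from dask.distributed import performance_report

# write the task-stream/scheduler timings for this block to an HTML report
with performance_report(filename="kmeans-fit.html"):
    kmeans_model.fit(dX)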

Finally, I took a look at the graph displayed in the Dask dashboard for each case (figures below). The fit method is represented by the red task and the cuDF operations by the blue tasks. In both cases, the fit method is represented by a single task. However, I was expecting to see 8 fit tasks in parallel for case 2 (i.e., 8 tasks executing concurrently on a single GPU).

Graph Case 1: [image]

Graph Case 2: [image]

Could anyone help me to understand these results? In particular, I have the following questions:

  1. Why is there a difference between the execution times measured directly in the code and the fit execution time displayed in the Dask dashboard? Am I measuring something wrong?

  2. Am I generating multiple fit tasks in this code? If so, how are multiple tasks processed on a GPU in case 2?
    I have 2 hypotheses: A) cuML uses concurrent CUDA streams to process each task's kernels in parallel; or B) the kernels of each task are appended to the default CUDA stream and processed sequentially on the GPU. Since case 2 is slower than case 1, hypothesis B) looks the most likely (a toy sketch of hypothesis A follows below).
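
As a reference for hypothesis A, here is a minimal sketch (plain CuPy, not cuML internals; the chunk shape mirrors my partitioning and the matmul is just a stand-in workload) of issuing per-task kernels on separate non-blocking streams:

import cupy as cp

n_tasks = 8
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(n_tasks)]
chunks = [cp.random.random((1563, 100)) for _ in range(n_tasks)]

results = []
for s, x in zip(streams, chunks):
    with s:  # kernels launched here go to stream s, not the default stream
        results.append(x @ x.T)  # stand-in for one task's per-chunk work

for s in streams:
    s.synchronize()  # block until every stream has drained

If case 2 behaved like this, the per-task kernels could overlap on the GPU; under hypothesis B they would all queue on the default stream instead.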

Note: I wanted to profile the code with nsys to see how the kernels are executed inside the GPU, but I'm using Ubuntu on WSL and apparently there is no support yet for collecting kernel data in such a setup, according to this discussion.


My setup:
A laptop with one MX450 GPU and 8 CPU cores
Ubuntu 22.04 (WSL)
CUDA 11.8
CUDA driver version 565.90
RAPIDS for CUDA 11 (cuml-cu11 24.10.0, dask-cudf-cu11 24.10.1)
dask-cuda 24.10.0
dask 2024.9.0

mnlcarv added the ? - Needs Triage and question labels Oct 15, 2024
divyegala (Member) commented

@mnlcarv to make sure I understand your code and what you're trying to achieve: are you expecting cuML to use CPU threads for parallelism? We generally do not optimize for CPU parallelism.

marcosnlc4 commented

@divyegala thanks for your reply. Yes, but my goal was to run multiple concurrent tasks on a single GPU. However, it seems that cuML uses NCCL to manage inter-GPU communication, and NCCL currently supports only one process per GPU. Hence, it's only possible to run one task per GPU, no matter how I partition the dataset into chunks (assuming each chunk would be assigned to its own task).
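
If that is right, then scaling fit across tasks really means scaling across GPUs. A sketch of what that would look like on a hypothetical 2-GPU machine (the device IDs "0,1" are assumed):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# one worker per visible GPU; cuML's Dask KMeans then runs one
# NCCL-backed fit task on each worker/GPU
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)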
