[QST] How to Run Multiple Concurrent Tasks on a GPU in Dask cuML #6112

Open
mnlcarv opened this issue Oct 15, 2024 · 2 comments
Labels
? - Needs Triage (Need team to review and classify) · question (Further information is requested)

Comments

mnlcarv commented Oct 15, 2024

I want to test the execution of multiple concurrent tasks on the GPU in Dask cuML. I'm using K-means (code below), and I'm changing the chunk size so that multiple tasks are created for the fit method and run on the GPU.

from cuml.dask.cluster import KMeans as cumlKMeans
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import dask.array as da
import time

def main():

    # one worker pinned to GPU 0, with 8 CPU threads to allow concurrent tasks
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0", n_workers=1, threads_per_worker=8)
    client = Client(cluster)

    n_samples = 12500
    n_features = 100
    start_random_state = 170
    # n_block_rows = 12500 # Case 1 (1 task)
    n_block_rows = 1563 # Case 2 (8 tasks)
    n_clusters = 100

    # random input chunked by rows; each chunk should map to one Dask task
    X = da.random.random((n_samples, n_features), chunks=(n_block_rows, n_features))
    dX = X.to_dask_dataframe().to_backend('cudf')

    kmeans_model = cumlKMeans(n_clusters=n_clusters, random_state=start_random_state, max_iter=5)

    wait(dX)  # materialize the input chunks before timing
    start_time = time.time()
    kmeans_model.fit(dX)
    wait(kmeans_model._func_fit)  # attempt to wait for any outstanding async fit work
    print(f"Fit took: {time.time() - start_time} sec")

    client.close()

if __name__ == '__main__':
    main()

Since I have only 1 GPU on my machine, I can only use 1 worker. To support multiple tasks in 1 worker, I increased the number of threads per worker to 8 (the number of CPU cores on my machine).

Basically, I'm testing the following cases:

Case 1: 1 task in a GPU
Case 2: 8 tasks in a GPU

I measured the execution time of the fit method directly in the code (also calling wait() to make sure any async task has finished), and case 1 is faster than case 2 for a toy dataset:

Case 1 (1 task): 1.54s
Case 2 (8 tasks): 2.49s

I also checked the time of the fit method displayed in the Dask dashboard (Task Stream tab):

Case 1 (1 task):  0.60s
Case 2 (8 tasks): 0.94s
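
One way I could sanity-check my own measurement (a sketch, not something I have run yet; the report filename is arbitrary) is to wrap the fit call in distributed's performance_report, which records the same task-level timings shown in the dashboard:

from dask.distributed import performance_report

# write the task-stream/scheduler timings for this block to an HTML report
with performance_report(filename="kmeans-fit.html"):
    kmeans_model.fit(dX)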

Finally, I took a look at the graph displayed in the Dask dashboard for each case (figures below). The fit method is represented by the red task and the cuDF operations by the blue tasks. In both cases, the fit method is represented by a single task. However, I was expecting to see 8 fit tasks in parallel for case 2 (i.e., 8 tasks executing concurrently on a single GPU).

Graph Case 1: [image]

Graph Case 2: [image]

Could anyone help me to understand these results? In particular, I have the following questions:

  1. Why is there a difference between the execution times measured directly in the code and the fit execution time displayed in the Dask dashboard? Am I measuring something wrong?

  2. Am I generating multiple fit tasks in this code? If so, how are multiple tasks processed on a GPU in case 2?
    I have 2 hypotheses: A) cuML uses concurrent CUDA streams to process each task's kernels in parallel; or B) the kernels of each task are appended to the default CUDA stream and processed sequentially on the GPU. Since case 2 is slower than case 1, hypothesis B) looks the most likely (a toy sketch of hypothesis A follows below).
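
As a reference for hypothesis A, here is a minimal sketch (plain CuPy, not cuML internals; the chunk shape mirrors my partitioning and the matmul is just a stand-in workload) of issuing per-task kernels on separate non-blocking streams:

import cupy as cp

n_tasks = 8
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(n_tasks)]
chunks = [cp.random.random((1563, 100)) for _ in range(n_tasks)]

results = []
for s, x in zip(streams, chunks):
    with s:  # kernels launched here go to stream s, not the default stream
        results.append(x @ x.T)  # stand-in for one task's per-chunk work

for s in streams:
    s.synchronize()  # block until every stream has drained

If case 2 behaved like this, the per-task kernels could overlap on the GPU; under hypothesis B they would all queue on the default stream instead.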

Note: I wanted to profile the code with nsys to see how the kernels are executed inside the GPU, but I'm using Ubuntu on WSL and apparently there is no support yet for collecting kernel data in such a setup, according to this discussion.


My setup:
A laptop with one MX450 GPU and 8 CPU cores
Ubuntu 22.04 (WSL)
CUDA 11.8
CUDA driver version 565.90
RAPIDS for CUDA 11 (cuml-cu11 24.10.0, dask-cudf-cu11 24.10.1)
dask-cuda 24.10.0
dask 2024.9.0

mnlcarv added the ? - Needs Triage and question labels Oct 15, 2024
divyegala (Member) commented

@mnlcarv to make sure I understand your code and what you're trying to achieve: are you expecting cuML to use CPU threads for parallelism? We generally do not optimize for CPU parallelism.

marcosnlc4 commented

@divyegala thanks for your reply. Yes, but my goal was to run multiple concurrent tasks on a single GPU. However, it seems that cuML uses NCCL to manage inter-GPU communication, and NCCL currently supports only one process per GPU. Hence, it's only possible to run one task per GPU, no matter how I partition the dataset into chunks (assuming each chunk would be assigned to its own task).
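
If that is right, then scaling fit across tasks really means scaling across GPUs. A sketch of what that would look like on a hypothetical 2-GPU machine (the device IDs "0,1" are assumed):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# one worker per visible GPU; cuML's Dask KMeans then runs one
# NCCL-backed fit task on each worker/GPU
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)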
