Improvements/ideas for GPU support #991
@splhack @IdrisMiles @larsbijl Any thoughts?
True. Cuebot has logic to avoid scheduling CPU-only jobs onto GPU hosts. This makes sense because, without that logic, a GPU host could theoretically be saturated by CPU-only jobs, wasting its GPUs. So the recommendation would be: create a dedicated facility for GPU hosts so GPU jobs are scheduled onto GPU hosts, and another dedicated facility for CPU hosts so CPU-only jobs are scheduled onto CPU hosts. A sketch of this split follows.
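A minimal sketch of that routing policy, assuming hypothetical facility names and a boolean GPU flag; OpenCue's real routing goes through facilities, allocations, and host tags, which are not modeled here:

```python
def facility_for_job(needs_gpu: bool) -> str:
    # Send GPU jobs to a GPU-only facility and everything else to a
    # CPU facility, so CPU-only work can never saturate GPU hosts.
    return "gpu_facility" if needs_gpu else "cpu_facility"

assert facility_for_job(True) == "gpu_facility"
assert facility_for_job(False) == "cpu_facility"
```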
Could you elaborate on the details? RQD already tells Cuebot the number of free GPU units.
Ah, I see the reasoning for GPU-only jobs on GPU hosts. If there were a need for CPU booking, it could be arranged manually. For GPU assignment: from what I understand, RQD keeps track of CPU cores and books jobs onto cores using taskset. Similar functionality could be achieved with the CUDA_VISIBLE_DEVICES environment variable for GPUs (see the sketch below). There is also the question of the ratio of CPUs to GPUs, and how, or whether, that should be controlled. If we just treat a GPU machine as GPU-only, CPU allocation can become an issue. Should CPUs be divided up per GPU? E.g. each GPU would always get 5 CPUs. Edit: Also, some software takes an option for which GPUs to run on. This could be passed with some token like %GPU_IDX%.
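As a rough illustration, here is a minimal sketch of pinning a frame to specific GPUs the way taskset pins it to cores. The function name and arguments are hypothetical, not RQD's actual interface:

```python
import os
import subprocess

def launch_frame(cmd, gpu_ids, cpu_ids=None):
    """Hypothetical launcher: restrict a frame to specific GPUs and,
    optionally, specific CPU cores. Not RQD's real API."""
    env = dict(os.environ)
    # CUDA exposes only the listed devices to the child process,
    # renumbered from 0, so the frame cannot touch other GPUs.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    if cpu_ids:
        # taskset does the analogous binding for CPU cores.
        cmd = ["taskset", "-c", ",".join(str(c) for c in cpu_ids)] + cmd
    return subprocess.Popen(cmd, env=env)

# e.g. give a frame GPU 2 and cores 10-14 on a shared host:
# launch_frame(["render", "scene.usd"], gpu_ids=[2], cpu_ids=[10, 11, 12, 13, 14])
```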
What is the reason the job takes 10 CPU cores in this example? (See OpenCue/cuebot/src/main/java/com/imageworks/spcue/VirtualProc.java, lines 142 to 143 at c22fe12.)
So, back to the example: a machine with 20 CPU cores and 4 GPU units can run 4 sets of GPU job frames concurrently (2 CPU cores and 1 GPU unit each, i.e. 8 CPU cores and 4 GPU units in total), as long as you didn't explicitly set the number of CPU cores or the memory reservation. To repeat: the current Cuebot implementation implicitly expects a GPU machine to have twice as many CPU cores as GPU units. The arithmetic is sketched below.
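A minimal sketch of that implicit reservation, using the 2-cores-per-GPU-unit constant described above; the function name is illustrative, not Cuebot's actual code:

```python
def implicit_cores(gpu_units: int) -> int:
    # Cuebot behavior as described above: a GPU job with no explicit
    # core count implicitly reserves 2 CPU cores per GPU unit.
    return 2 * gpu_units

# 20-core / 4-GPU host: four concurrent 1-GPU frames reserve
# 4 * implicit_cores(1) = 8 cores, leaving 12 cores that CPU-only
# jobs still cannot book under the current scheduling logic.
assert 4 * implicit_cores(1) == 8
```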
We can use these environment variables in the rendering job, for example:
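Here is a sketch of a job-side script reading its GPU assignment; that CUDA_VISIBLE_DEVICES is what gets set is the assumption under discussion in this thread, not confirmed OpenCue behavior:

```python
import os

# Read which GPUs the scheduler handed this frame.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
gpu_ids = [g for g in visible.split(",") if g]

if not gpu_ids:
    print("No GPUs assigned; falling back to CPU processing")
else:
    # Within this process the devices are renumbered 0..N-1.
    print(f"Rendering on {len(gpu_ids)} GPU(s), logical ids 0..{len(gpu_ids) - 1}")
```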
Just an example of using CPUs before GPUs. I think every GPU should be evenly assigned CPUs. There could be situations where this isn't ideal, but the GPU is the main resource that should be considered (see the sketch below).
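A minimal sketch of that even split, with a hypothetical helper name:

```python
def cpus_per_gpu(total_cpus: int, total_gpus: int) -> int:
    # Divide a host's CPUs evenly among its GPUs, treating the GPU
    # as the primary resource being scheduled.
    if total_gpus <= 0:
        raise ValueError("host has no GPUs")
    return total_cpus // total_gpus

# 20-core, 4-GPU host: each GPU frame gets 5 cores.
assert cpus_per_gpu(20, 4) == 5
```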
I think this has to do with hyper-threading, where each physical core appears to the system as 2 logical cores but is still just 1 physical core.
Why not use all the CPUs, as I mention above? The tasks I am running are helped by CPUs for initial data processing before the data is sent to the GPU. Additionally, some tasks can benefit from more than one GPU; how many CPUs are assigned then?
If something like
Hi @splhack, we have specific tasks requiring intensive CPU and GPU use, meaning we have nodes with many CPUs and GPUs. Currently, if there is no job requiring a GPU on the farm, then we waste those CPUs. Could this be a possible behaviour to implement? Edit: we would only assign CPU jobs to nodes with GPUs if there is no GPU task pending (not assigned) on the farm. This way we don't waste resources (see the sketch below). What do you think? Thanks
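A minimal sketch of that booking condition, with hypothetical names; real Cuebot booking involves far more state than this:

```python
from dataclasses import dataclass

@dataclass
class Host:
    idle_cores: int
    idle_gpus: int

def can_book_cpu_job_on_gpu_host(host: Host, pending_gpu_frames: list) -> bool:
    # A GPU host may take a CPU-only job only while no GPU frame is
    # pending anywhere on the farm, so GPU capacity is never starved.
    return host.idle_cores > 0 and not pending_gpu_frames

# Free host, empty GPU queue: CPU work is allowed.
assert can_book_cpu_job_on_gpu_host(Host(idle_cores=20, idle_gpus=4), [])
# A pending GPU frame blocks new CPU bookings on GPU hosts.
assert not can_book_cpu_job_on_gpu_host(Host(idle_cores=20, idle_gpus=4), ["frame-0001"])
```

One caveat with this policy: a GPU frame that arrives after a long CPU job has already been booked still has to wait for that job to finish unless some form of preemption exists.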
@thunders82
Thank you for pointing it out @splhack!
I'm not 100% sure how to word this, but here it goes.
Looking at #924 I have these ideas.
GPU hosts don't pick up non-GPU frames: if a host has CPUs and GPUs, it should be able to run CPU and GPU frames regardless of type, with a possible preference for GPU frames. Additionally, some frame types may be able to utilize both CPUs and GPUs.
GPU jobs are not like CPU jobs: the OS will migrate CPU jobs to idle CPUs as needed, but GPU jobs are usually assigned to 1 or more GPUs at startup and continue to run on only those GPUs until finished. I propose something similar to how a frame number is passed to the command, but for GPU assignment (see the sketch after these ideas). This would be managed by RQD; RQD only needs to tell Cuebot the number of free GPUs.
Are these bad ideas? Are there better ideas?
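A minimal sketch of that second idea, reusing the %GPU_IDX% token floated earlier in the thread; all names here are hypothetical, not RQD's actual implementation:

```python
import os
import subprocess

def book_gpus(free_gpus, count):
    """Hypothetical RQD-side bookkeeping: take `count` GPU indices from
    the host's free pool at frame startup, mirroring core booking."""
    return free_gpus[:count], free_gpus[count:]

def run_frame(cmd_template, assigned_gpus):
    # Substitute the %GPU_IDX% token into the command and also export
    # CUDA_VISIBLE_DEVICES, so either convention reaches the job.
    ids = ",".join(str(g) for g in assigned_gpus)
    cmd = cmd_template.replace("%GPU_IDX%", ids)
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=ids)
    return subprocess.Popen(cmd, shell=True, env=env)

free = [0, 1, 2, 3]
assigned, free = book_gpus(free, 2)   # frame is pinned to GPUs 0 and 1
# run_frame("render --gpus %GPU_IDX% scene.usd", assigned)
# Cuebot only ever needs len(free), the count of unbooked GPUs.
```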