Add multiple GPU support #760 #924

splhack · 2021-02-19T09:54:57Z

Plan

Keep the original code and ordering as much as possible. This pull request concentrates on adding multiple GPU support; cosmetic changes are out of scope.
- Keep the existing naming conventions. Examples:
  - cores -> gpus
  - coreUnits -> gpuUnits
  - memory -> gpuMemory
  - mem_min -> gpu_mem_min
- The order varies from code to code; keep each existing order.
  - memory, gpu, cores -> memory, gpuMemory, cores, gpus
  - cores, memory, gpu -> cores, memory, gpus, gpuMemory
- Place similar methods close to each other
  - cores-gpus, memory-gpuMemory
This is the first iteration of shipping multiple GPU support. It may not cover all use cases.
- The first target is a symmetric set of GPUs: all GPUs on one host are assumed to be identical.
  - A host can have a mix of GPU types (multiple vendors, different amounts of memory from GPU unit to GPU unit, and so on). This is not supported.
- The amount of GPU memory.
  - A GPU does not share memory with other GPU units, so 8 GPU units with 16GiB each do not give the host one contiguous 128GiB of GPU memory. Scheduling on a combination of minimum GPU units and minimum GPU memory may not work well.
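
To make the memory caveat concrete, here is a minimal sketch, not code from this PR. The `Host` class and both functions are hypothetical, and it assumes a request's minimum GPU memory is meant per GPU unit, which the plan above does not define.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical host model: identical GPUs, memory tracked per unit."""
    gpus: int                 # number of GPU units on the host
    gpu_memory_per_unit: int  # memory of a single GPU unit, in GiB


def can_book_naive(host: Host, min_gpus: int, min_gpu_memory: int) -> bool:
    """Naive check that wrongly treats GPU memory as one pooled total."""
    pooled = host.gpus * host.gpu_memory_per_unit
    return host.gpus >= min_gpus and pooled >= min_gpu_memory


def can_book(host: Host, min_gpus: int, min_gpu_memory: int) -> bool:
    """Check the request against a single unit's memory (assumed semantics)."""
    return host.gpus >= min_gpus and host.gpu_memory_per_unit >= min_gpu_memory


host = Host(gpus=8, gpu_memory_per_unit=16)
print(can_book_naive(host, min_gpus=1, min_gpu_memory=32))  # True
print(can_book(host, min_gpus=1, min_gpu_memory=32))        # False
```

The pooled check accepts a 32GiB request on a host whose largest single GPU has 16GiB, which is the mismatch described above.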

Commits

Add GetDefault and SetDefault to AllcationInterface #939 and Fix PyOutline test #942
- To fix CI and avoid VERSION.in conflict
Add DB migration for supporting multiple GPU
- Create V10 migration sql from feat: Add multiple GPU support #760
- Changes from feat: Add multiple GPU support #760
  - Add more GPU memory stats
  - Use BIGINT for memory
  - Clean up unrelated changes
  - Create new indexes
- Code review helper
  - Original feat: Add multiple GPU support #760 V1 diff https://gist.github.com/splhack/fab852153505c65742909917888c51bc
  - V9 vs V10 pg_dump https://github.com/splhack/OpenCue/pull/1/files
Sync with V10 migration
- Rename
  - int_gpu_free -> int_gpu_mem_free
  - int_gpu_min -> int_gpu_mem_min
  - int_gpu_reserved -> int_gpu_mem_reserved
  - int_gpu_total -> int_gpu_mem_total
  - int_gpu_idle -> int_gpu_mem_idle
  - int_gpu_max -> int_gpu_mem_max
- gradlew test passed
Bump minor version number
Add Job Spec DTD 1.12
- Diff from 1.11 https://gist.github.com/splhack/565df692e0320693d60b2b15e7c9b920
Update proto files for multiple GPU support
- Keep backward compatibility
  - Rename message fields (gpu -> gpu_memory) since it doesn't break binary compatibility
  - Keep rpc (SetMinGpu) but mark as deprecated
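
As a rough illustration of why renaming a field keeps binary compatibility: the protobuf wire format identifies a field by its number and wire type, not by its name. The hand-rolled encoder below is only a sketch of that rule for varint fields, and field number 7 is made up for the example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one varint field: tag = (field_number << 3) | wire type 0."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


# Hypothetical field definitions: same number, different names.
old_field = {"name": "gpu", "number": 7}
new_field = {"name": "gpu_memory", "number": 7}

old_bytes = encode_int_field(old_field["number"], 1024)
new_bytes = encode_int_field(new_field["number"], 1024)
print(old_bytes.hex(), new_bytes.hex())  # 388008 388008
```

The name never reaches the encoded bytes, so old and new binaries can still exchange messages. Generated accessor names and the JSON mapping do change with a rename, so source code using the old name still has to be updated.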
Sync with proto changes
- Add a warning for use of 'gpu' in the job spec
- gradlew test passed
Replace gpu with gpus and gpu_memory
- Add counterpart logic, gpus for cores, gpuMemory for memory
- gradlew test passed
[RQD] Support multiple GPUs with nvidia-smi
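
The RQD change itself is not shown in this thread. As a sketch of the general idea, per-GPU memory can be read from the output of `nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits`, which prints one row per GPU with values in MiB. The function name and the returned keys below are hypothetical.

```python
def parse_gpu_memory(csv_text: str) -> dict:
    """Summarise per-GPU 'memory.total, memory.used' rows (values in MiB)."""
    totals, used = [], []
    for line in csv_text.strip().splitlines():
        total_mib, used_mib = (int(part) for part in line.split(","))
        totals.append(total_mib)
        used.append(used_mib)
    return {
        "gpus": len(totals),                     # number of GPU units
        "total_mib": sum(totals),                # sum over all units
        "free_mib": sum(t - u for t, u in zip(totals, used)),
        "min_unit_total_mib": min(totals) if totals else 0,
    }


# Sample text shaped like the command output for a host with two GPUs.
sample = "16160, 1024\n16160, 0\n"
print(parse_gpu_memory(sample))
```

On a real host the text would come from running the command, for example with `subprocess.check_output`; the sample string keeps the sketch runnable without a GPU.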
[PyOutline] Support gpus and gpu_memory
- until 1.11
  - PyOutline did not support gpu
- from 1.12
  - gpu_memory = The amount of GPU memory
  - gpus = The number of GPU units
[PyCue] Support gpus and gpu_memory
[cuegui] Sync with proto changes
- Minimum changes to pass python setup.py test

splhack · 2021-02-19T19:14:17Z

I found an inconsistency in how gpu is used in #760. Users may find it difficult to track down issues caused by this.

until DTD 1.11
- gpu = The amount of GPU memory
from DTD 1.12
- gpu = The number of GPUs

Possible scenario:

A user has gpu in their Job PyOutline script.
They update PyOutline, which means they start using DTD 1.12.
The meaning of gpu changes from the amount of GPU memory to the number of GPUs without any warning.

So, from DTD 1.12,

gpu = The amount of GPU memory
- Kept for backward compatibility. Cuebot will migrate it to gpu_memory internally.
gpu_memory = The amount of GPU memory
gpus = The number of GPUs

A user has gpu in their Job PyOutline script.
They update PyOutline, which means they start using DTD 1.12.
PyOutline warns about the use of gpu and generates XML with gpu_memory.
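
The behaviour described in this scenario could look roughly like the sketch below. `normalize_gpu_args` is a hypothetical helper, not PyOutline's actual API; it only illustrates warning on the legacy key and carrying its value over to gpu_memory.

```python
import warnings


def normalize_gpu_args(layer_args: dict) -> dict:
    """Return a copy of layer_args with deprecated 'gpu' mapped to 'gpu_memory'."""
    args = dict(layer_args)
    if "gpu" in args:
        warnings.warn(
            "'gpu' is deprecated; use 'gpu_memory' (amount) or 'gpus' (count)",
            DeprecationWarning,
            stacklevel=2,
        )
        legacy_value = args.pop("gpu")
        # Assumption: an explicit gpu_memory wins over the legacy value.
        args.setdefault("gpu_memory", legacy_value)
    return args


print(normalize_gpu_args({"cores": 1, "gpu": 4096}))
# {'cores': 1, 'gpu_memory': 4096}
```

Scripts written against the old meaning keep requesting memory rather than silently requesting thousands of GPU units.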

I will also rename gpu to gpu_mem or gpus in the V10 migration and the Cuebot code. This will help catch unintentional errors earlier.

splhack · 2021-02-22T07:04:31Z

~~OpenCue uses 100 = 1 physical CPU core.~~
~~Probably we should use the same way for GPU.~~
~~It's not so common that multiple processes share one GPU core, but it's not impossible. Like NVIDIA MPS.~~
Apparently this is not a good idea: fractional allocation is not supported.

OpenCue/rqd/rqd/rqmachine.py

Lines 637 to 639 in 58c81fb

    
    if reservedCores % 100:
        log.debug('Taskset: Can not reserveHT with fractional cores')
        return None
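
For readers unfamiliar with the core-unit convention, a minimal sketch of the check quoted above, assuming 100 units equal one physical core. `CORE_UNITS` and `whole_cores` are illustrative names, not part of RQD.

```python
CORE_UNITS = 100  # convention mentioned in this thread: 100 units = 1 physical core


def whole_cores(reserved_units: int):
    """Return the number of whole cores, or None for a fractional reservation."""
    if reserved_units % CORE_UNITS:
        return None  # e.g. 150 units = 1.5 cores, rejected like the check above
    return reserved_units // CORE_UNITS


print(whole_cores(200))  # 2
print(whole_cores(150))  # None
```

Counting GPUs as plain whole units, as this PR does, avoids needing an equivalent fractional check for GPUs.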

splhack · 2021-03-03T00:55:13Z

@larsbijl @bcipriano I think we can start reviewing the technical design and code, since all tests now pass.

larsbijl · 2021-03-07T12:54:04Z

self.min_gpu.setValue(service.data.min_gpu // 1024)
AttributeError: min_gpu

in the services menu of cuegui

splhack · 2021-03-12T00:44:36Z

Will update (migration V10 -> V11, VERSION 0.9 -> 0.10) once #936 is merged.

Co-authored-by: Lars van der Bijl <[email protected]>

splhack · 2021-05-06T06:18:52Z

@bcipriano @larsbijl
Is there anything I can do to help merge this pull request?

bcipriano

Did one last review, changes LGTM.

Add new indexes to improve booking performance #1304

After the changes in the GPU PR #924, the performance of the booking query degraded by up to a factor of 4. Creating indexes for the columns that changed names seems to have fixed the problem.

- Update cuebot/src/main/resources/conf/ddl/postgres/migrations/V18_Add_New_Indexes

Signed-off-by: Diego Tavares <[email protected]>
Signed-off-by: Diego Tavares da Silva <[email protected]>

Fix "Monitor Cue" with incorrect column indexing for "Min" and "Max" columns #1431
- Fix the column indexing in "addColumn" of class CueJobMonitorTree.
- This bug was introduced by the merge of pull request "Add multiple GPU support #760 (#924)" on 4/18/22 at 11:45 AM, which added the following new columns to CueJobMonitorTree: "Gpus", "Min Gpus", "Max Gpus", "MaxGpuMem". The indexes of the columns were defined incorrectly.
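
The kind of bug described here can be sketched generically. The column orders below are hypothetical and do not reproduce the real CueJobMonitorTree layout; the point is only that a hard-coded position goes stale when new columns are inserted, while an index derived from the column list does not.

```python
def build_column_index(column_names):
    """Map each column name to its position, derived from list order."""
    return {name: position for position, name in enumerate(column_names)}


# Hypothetical layouts before and after adding GPU columns.
old_columns = ["Job", "Cores", "Min", "Max"]
new_columns = ["Job", "Cores", "Gpus", "Min", "Max",
               "Min Gpus", "Max Gpus", "MaxGpuMem"]

HARD_CODED_MIN = old_columns.index("Min")  # 2, correct only for the old layout

print(new_columns[HARD_CODED_MIN])             # Gpus (the wrong column)
print(build_column_index(new_columns)["Min"])  # 3
```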

splhack requested review from bcipriano, DiegoTavares, gregdenton, IdrisMiles, jrray, larsbijl and smith1511 as code owners February 19, 2021 09:54

splhack changed the title ~~Add multiple GPU support #760~~ [1] Add multiple GPU support #760 Feb 19, 2021

splhack changed the title ~~[1] Add multiple GPU support #760~~ WIP: Add multiple GPU support #760 Feb 19, 2021

splhack force-pushed the 459-mutliple-gpu-support-1 branch 6 times, most recently from bae92f4 to df2fc0c Compare February 22, 2021 06:26

splhack force-pushed the 459-mutliple-gpu-support-1 branch 3 times, most recently from 27b4a61 to e6ff394 Compare February 22, 2021 19:46

splhack mentioned this pull request Feb 22, 2021

feat: Add multiple GPU support #760

Closed

splhack force-pushed the 459-mutliple-gpu-support-1 branch 2 times, most recently from 1f793fa to 76d547b Compare February 23, 2021 09:14

splhack changed the title ~~WIP: Add multiple GPU support #760~~ Add multiple GPU support #760 Mar 3, 2021

splhack force-pushed the 459-mutliple-gpu-support-1 branch from 76d547b to a958c3a Compare March 8, 2021 06:08

splhack mentioned this pull request Mar 21, 2021

Fix #936 #938

Merged

splhack force-pushed the 459-mutliple-gpu-support-1 branch from a958c3a to d2a82dc Compare March 24, 2021 03:40

splhack and others added 10 commits April 28, 2021 09:03

[RQD] Support multiple GPUs with nvidia-smi

be89e17

Co-authored-by: Lars van der Bijl <[email protected]>

[PyOutline] Supoprt gpus and gpu_memory

75f6178

Co-authored-by: Lars van der Bijl <[email protected]>

[PyCue] Supoprt gpus and gpu_memory

d2fd9c0

Co-authored-by: Lars van der Bijl <[email protected]>

[cuegui] Sync with proto changes

5b0cc10

Co-authored-by: Lars van der Bijl <[email protected]>

[Cuebot] 1.12 spec test

ecd899c

[Cuebot] host report test for GPU

2fea9bd

[Cuebot] GPU dispatch test

d0c9505

[Cuebot] Add FrameCompleteHandlerTests

376f8f7

[RQD] Fix maxUsedGpuMemory

a47e2c4

Bump up version

dad1300

splhack force-pushed the 459-mutliple-gpu-support-1 branch from e942206 to dad1300 Compare April 28, 2021 16:04

bcipriano approved these changes Jun 9, 2021

jasonmads mentioned this pull request Jun 12, 2021

Improvements/ideas for GPU support #991

Open

bcipriano merged commit c22fe12 into AcademySoftwareFoundation:master Jun 20, 2021

bcipriano mentioned this pull request Jun 20, 2021

Fix JobDao unit test. #992

Merged

This was referenced Aug 23, 2021

Fix number of GPU units in RunningFrameInfo #1017

Merged

Clean up strand(ed) GPU units #1020

Merged

splhack mentioned this pull request Sep 11, 2021

Update GPU memory usage #1032

Merged

splhack mentioned this pull request Oct 15, 2021

[Cuebot] Fix FrameCompleteHandler #1053

Merged

splhack mentioned this pull request Nov 23, 2021

Fix Group GPU APIs #1064

Merged

DiegoTavares mentioned this pull request Jul 11, 2023

Add new indexes to improve booking performance #1304

Merged

ramonfigueiredo mentioned this pull request Jul 24, 2024

Fix "Monitor Cue" with incorrect column indexing #1431

Merged
