Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multiple GPU support #760 #924

Merged

Conversation

splhack
Copy link
Contributor

@splhack splhack commented Feb 19, 2021

Plan

  • Keeping original code and order as much as possible. This pull request concentrates on adding multiple GPU support. No intention of doing cosmetic changes.

    • Keeping naming convention. Examples
      • cores -> gpus
      • coreUnits -> gpuUnits
      • memory -> gpuMemory
      • mem_min -> gpu_mem_min
    • The order is vary from code to code, keeping it.
      • memory, gpu, cores -> memory, gpuMemory, cores, gpus
      • cores, memory, gpu -> cores, memory, gpus, gpuMemory
    • Place similar methods close to each other
      • cores-gpus, memory-gpuMemory
  • This is the first iteration/attempt of shipping multiple GPU support. It may not cover all the use-cases.

    • First target is symmetric set of GPUs. Assuming all GPUs are same on one host.
      • A host can have multiple variety of types of GPUs. Multiple vendors, different amount of memory from GPU unit to GPU unit, etc, etc. Not supported.
    • The amount of GPU memory.
      • GPU doesn't share memory with the other GPU unit. So, 8 units of GPU with 16GiB doesn't mean the host has linear 128GiB GPU memory. Scheduling with combined both minimum GPU units and minimum GPU memory may not work well.

Commits

  1. Add GetDefault and SetDefault to AllcationInterface #939 and Fix PyOutline test #942

    • To fix CI and avoid VERSION.in conflict
  2. Add DB migration for supporting multiple GPU

  3. Sync with V10 migration

    • Rename
      • int_gpu_free -> int_gpu_mem_free
      • int_gpu_min -> int_gpu_mem_min
      • int_gpu_reserved -> int_gpu_mem_reserved
      • int_gpu_total -> int_gpu_mem_total
      • int_gpu_idle -> int_gpu_mem_idle
      • int_gpu_max -> int_gpu_mem_max
    • gradlew test passed
  4. Bump minor version number

  5. Add Job Spec DTD 1.12

  6. Update proto files for multiple GPU support

    • Keep backward compatibility
      • Rename message fields (gpu -> gpu_memory) since it doesn't break binary compatibility
      • Keep rpc (SetMinGpu) but mark as deprecated
  7. Sync with proto changes

    • Add warning of Job spec 'gpu's
    • gradlew test passed
  8. Replace gpu with gpus and gpu_memory

    • Add counterpart logic, gpus for cores, gpuMemory for memory
    • gradlew test passed
  9. [RQD] Support multiple GPUs with nvidia-smi

  10. [PyOutline] Support gpus and gpu_memory

    • until 1.11
      • PyOutline has not been supporting gpu
    • from 1.12
      • gpu_memory = The amount of GPU memory
      • gpus = The number of GPU units
  11. [PyCue] Support gpus and gpu_memory

  12. [cuegui] Sync with proto changes

    • Minimum changes to pass python setup.py test

@splhack splhack changed the title Add multiple GPU support #760 [1] Add multiple GPU support #760 Feb 19, 2021
@splhack splhack changed the title [1] Add multiple GPU support #760 WIP: Add multiple GPU support #760 Feb 19, 2021
@splhack
Copy link
Contributor Author

splhack commented Feb 19, 2021

I found an issue of the inconsistency of gpu usage in #760. User may find difficult to track issue due to this.

  • until DTD 1.11
    • gpu = The amount of GPU memory
  • from DTD 1.12
    • gpu = The number of GPUs

Possible scenario:

  1. User has gpu in their Job PyOutline script.
  2. They updates PyOutline, which means user starts to use DTD 1.12.
  3. The meaning of gpu was changed from The amount of GPU memory to The number of GPUs without warning or anything.

So, from DTD 1.12,

  • gpu = The amount of GPU memory
    • For the backward compatibility. Cuebot will migrate it to gpu_memory internally.
  • gpu_memory = The amount of GPU memory
  • gpus = The number of GPUs
  1. User has gpu in their Job PyOutline script.
  2. They updates PyOutline, which means user starts to use DTD 1.12.
  3. PyOutline warns gpu usage and generate XML with gpu_memory

Also I will rename gpu to gpu_mem or gpus in V10 migration and Cuebot codes. It will help catching unintentional errors earlier.

@splhack splhack force-pushed the 459-mutliple-gpu-support-1 branch 6 times, most recently from bae92f4 to df2fc0c Compare February 22, 2021 06:26
@splhack
Copy link
Contributor Author

splhack commented Feb 22, 2021

OpenCue uses 100 = 1 physical CPU core.
Probably we should use the same way for GPU.
It's not so common that multiple processes share one GPU core, but it's not impossible. Like NVIDIA MPS.
Apparently not good idea, fractional allocation is not supported.

if reservedCores % 100:
log.debug('Taskset: Can not reserveHT with fractional cores')
return None

@splhack splhack force-pushed the 459-mutliple-gpu-support-1 branch 3 times, most recently from 27b4a61 to e6ff394 Compare February 22, 2021 19:46
@splhack splhack force-pushed the 459-mutliple-gpu-support-1 branch 2 times, most recently from 1f793fa to 76d547b Compare February 23, 2021 09:14
@splhack splhack changed the title WIP: Add multiple GPU support #760 Add multiple GPU support #760 Mar 3, 2021
@splhack
Copy link
Contributor Author

splhack commented Mar 3, 2021

@larsbijl @bcipriano I think we can start review technical design and code since, at least all test passed.

@larsbijl
Copy link
Contributor

larsbijl commented Mar 7, 2021

self.min_gpu.setValue(service.data.min_gpu // 1024)
AttributeError: min_gpu

in the services menu of cuegui

@splhack splhack force-pushed the 459-mutliple-gpu-support-1 branch from 76d547b to a958c3a Compare March 8, 2021 06:08
@splhack
Copy link
Contributor Author

splhack commented Mar 12, 2021

will update (migration V10 -> V11, VERSION 0.9 -> 0.10) when #936 merged

@splhack splhack mentioned this pull request Mar 21, 2021
@splhack splhack force-pushed the 459-mutliple-gpu-support-1 branch from a958c3a to d2a82dc Compare March 24, 2021 03:40
@splhack splhack force-pushed the 459-mutliple-gpu-support-1 branch from e942206 to dad1300 Compare April 28, 2021 16:04
@splhack
Copy link
Contributor Author

splhack commented May 6, 2021

@bcipriano @larsbijl
Is there anything I can do to help merging this pull request?

Copy link
Collaborator

@bcipriano bcipriano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did one last review, changes LGTM.

@bcipriano bcipriano merged commit c22fe12 into AcademySoftwareFoundation:master Jun 20, 2021
@bcipriano bcipriano mentioned this pull request Jun 20, 2021
@splhack splhack mentioned this pull request Sep 11, 2021
@splhack splhack mentioned this pull request Nov 23, 2021
DiegoTavares added a commit to DiegoTavares/OpenCue that referenced this pull request Jul 11, 2023
After the changes on the gpu PR AcademySoftwareFoundation#924
the performance of the booking query degraded up to 4 times the previous
throughput. Creating some indexes for columns that changed names seems to
have fixed the problem.

Signed-off-by: Diego Tavares <[email protected]>
DiegoTavares added a commit that referenced this pull request Jan 30, 2024
* Add new indexes to improve booking performance

After the changes on the gpu PR #924
the performance of the booking query degraded up to 4 times the previous
throughput. Creating some indexes for columns that changed names seems to
have fixed the problem.

Signed-off-by: Diego Tavares <[email protected]>

* Update cuebot/src/main/resources/conf/ddl/postgres/migrations/V18_Add_New_Indexes

Signed-off-by: Diego Tavares da Silva <[email protected]>

* Update cuebot/src/main/resources/conf/ddl/postgres/migrations/V18_Add_New_Indexes

Signed-off-by: Diego Tavares da Silva <[email protected]>

* Update cuebot/src/main/resources/conf/ddl/postgres/migrations/V18_Add_New_Indexes

Signed-off-by: Diego Tavares da Silva <[email protected]>

---------

Signed-off-by: Diego Tavares <[email protected]>
Signed-off-by: Diego Tavares da Silva <[email protected]>
carlosfelgarcia pushed a commit to carlosfelgarcia/OpenCue that referenced this pull request May 22, 2024
…tion#1304)

* Add new indexes to improve booking performance

After the changes on the gpu PR AcademySoftwareFoundation#924
the performance of the booking query degraded up to 4 times the previous
throughput. Creating some indexes for columns that changed names seems to
have fixed the problem.

Signed-off-by: Diego Tavares <[email protected]>

* Update cuebot/src/main/resources/conf/ddl/postgres/migrations/V18_Add_New_Indexes

Signed-off-by: Diego Tavares da Silva <[email protected]>

* Update cuebot/src/main/resources/conf/ddl/postgres/migrations/V18_Add_New_Indexes

Signed-off-by: Diego Tavares da Silva <[email protected]>

* Update cuebot/src/main/resources/conf/ddl/postgres/migrations/V18_Add_New_Indexes

Signed-off-by: Diego Tavares da Silva <[email protected]>

---------

Signed-off-by: Diego Tavares <[email protected]>
Signed-off-by: Diego Tavares da Silva <[email protected]>
carlosfelgarcia pushed a commit to carlosfelgarcia/OpenCue that referenced this pull request May 22, 2024
…tion#1304)

* Add new indexes to improve booking performance

After the changes on the gpu PR AcademySoftwareFoundation#924
the performance of the booking query degraded up to 4 times the previous
throughput. Creating some indexes for columns that changed names seems to
have fixed the problem.

Signed-off-by: Diego Tavares <[email protected]>

* Update cuebot/src/main/resources/conf/ddl/postgres/migrations/V18_Add_New_Indexes

Signed-off-by: Diego Tavares da Silva <[email protected]>

* Update cuebot/src/main/resources/conf/ddl/postgres/migrations/V18_Add_New_Indexes

Signed-off-by: Diego Tavares da Silva <[email protected]>

* Update cuebot/src/main/resources/conf/ddl/postgres/migrations/V18_Add_New_Indexes

Signed-off-by: Diego Tavares da Silva <[email protected]>

---------

Signed-off-by: Diego Tavares <[email protected]>
Signed-off-by: Diego Tavares da Silva <[email protected]>
ramonfigueiredo added a commit to ramonfigueiredo/OpenCue that referenced this pull request Jul 24, 2024
…columns

- Fix the column indexing on the "addColumn" of class CueJobMonitorTree.
- This bug was introduced after the merge from the pull request "Add multiple GPU support AcademySoftwareFoundation#760 (AcademySoftwareFoundation#924)" on 4/18/22 at 11:45 AM where the following new columns were introduced on the CueJobMonitorTree: "Gpus", "Min Gpus", "Max Gpus", "MaxGpuMem" and the indexing of the columns were wrongly defined.
ramonfigueiredo added a commit that referenced this pull request Jul 26, 2024
Fix "Monitor Cue" with incorrect column indexing for "Min" and "Max"
columns
- Fix the column indexing on the "addColumn" of class CueJobMonitorTree.
- This bug was introduced after the merge from the pull request "Add
multiple GPU support #760 (#924)" on 4/18/22 at 11:45 AM where the
following new columns were introduced on the CueJobMonitorTree: "Gpus",
"Min Gpus", "Max Gpus", "MaxGpuMem" and the indexing of the columns were
wrongly defined.
n-jay pushed a commit to n-jay/OpenCue that referenced this pull request Jul 26, 2024
…dation#1431)

Fix "Monitor Cue" with incorrect column indexing for "Min" and "Max"
columns
- Fix the column indexing on the "addColumn" of class CueJobMonitorTree.
- This bug was introduced after the merge from the pull request "Add
multiple GPU support AcademySoftwareFoundation#760 (AcademySoftwareFoundation#924)" on 4/18/22 at 11:45 AM where the
following new columns were introduced on the CueJobMonitorTree: "Gpus",
"Min Gpus", "Max Gpus", "MaxGpuMem" and the indexing of the columns were
wrongly defined.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants