NERC Maintenance (Taint nerc-ocp-prod GPUs and add acceleratorProfiles) - Jan 7 #849

Open
3 of 4 tasks
dystewart opened this issue Dec 5, 2024 · 1 comment


dystewart commented Dec 5, 2024

Motivation

To prevent general non-GPU workloads from scheduling on nodes with GPUs, we will taint the nerc-ocp-prod GPU nodes. This matters because users are billed on a per-host basis: today a user workload can be scheduled onto a GPU node unintentionally, even when it has no GPU resources allocated, and the user is then billed for GPU usage. Tainting also keeps the GPU nodes clear of all workloads that do not explicitly request GPU resources.
Because the GPU nodes will be tainted, we will also add acceleratorProfiles so that RHOAI users who have GPUs allocated can select which tainted GPU type they would like to land on (e.g. A100 or V100).
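
For anyone less familiar with how the taints and the acceleratorProfiles fit together, here is a rough Python sketch of the idea. It is not the actual nerc-ocp-config change. The taint key (`nvidia.com/gpu.product`), the taint values (`A100`, `V100`), the profile names, and the `redhat-ods-applications` namespace are all assumptions made for illustration, and `tolerates` / `can_schedule` are a simplified model of the scheduler's matching rule, not the real implementation.

```python
# ASSUMPTION: the real taint key and values used on nerc-ocp-prod are not
# stated in this issue; these are placeholders for illustration only.
GPU_TAINT_KEY = "nvidia.com/gpu.product"


def gpu_taint(gpu_model: str) -> dict:
    """Taint for a GPU node; NoSchedule repels pods without a matching toleration."""
    return {"key": GPU_TAINT_KEY, "value": gpu_model, "effect": "NoSchedule"}


def accelerator_profile(name: str, display_name: str, gpu_model: str) -> dict:
    """AcceleratorProfile manifest whose toleration matches gpu_taint(gpu_model)."""
    return {
        "apiVersion": "dashboard.opendatahub.io/v1",
        "kind": "AcceleratorProfile",
        # ASSUMPTION: namespace of the RHOAI dashboard installation.
        "metadata": {"name": name, "namespace": "redhat-ods-applications"},
        "spec": {
            "displayName": display_name,
            "enabled": True,
            "identifier": "nvidia.com/gpu",
            "tolerations": [
                {
                    "key": GPU_TAINT_KEY,
                    "operator": "Equal",
                    "value": gpu_model,
                    "effect": "NoSchedule",
                }
            ],
        },
    }


def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified toleration matching: effect, then key, then value."""
    # A toleration with no effect set matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; an empty key matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value") == taint.get("value")
    )


def can_schedule(pod_tolerations: list, node_taints: list) -> bool:
    """A pod may land on a node only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


a100_node_taints = [gpu_taint("A100")]
a100_profile = accelerator_profile("nvidia-a100", "NVIDIA A100", "A100")
a100_tolerations = a100_profile["spec"]["tolerations"]

plain_pod_allowed = can_schedule([], a100_node_taints)
a100_pod_allowed = can_schedule(a100_tolerations, a100_node_taints)
a100_pod_on_v100 = can_schedule(a100_tolerations, [gpu_taint("V100")])
```

In this model a workload with no tolerations is kept off the tainted A100 node, a workbench launched with the A100 profile is admitted, and the same workbench is still kept off a V100 node. That is the per-GPU-type selection described above.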

We will also fix the "None" acceleratorProfile behavior in: issue

Completion Criteria

During the Jan 7 maintenance window, nerc-ocp-prod GPU nodes are tainted and acceleratorProfiles are added to the nerc-ocp-prod cluster RHOAI installation.

Description

Completion dates

Required - 2025-01-07

@dystewart dystewart added documentation Improvements or additions to documentation gpu openshift This issue pertains to NERC OpenShift rhoai RHOAI labels Dec 5, 2024
@dystewart dystewart self-assigned this Dec 5, 2024
dystewart added a commit to dystewart/nerc-ocp-config that referenced this issue Dec 11, 2024
Adds unique acceleratorProfiles for each GPU type in production cluster.
These profiles will show up under accelerators in the RHOAI workbench wizard.

Part of: nerc-project/operations#849
Closes: nerc-project/operations#847
@Milstein

@here: The maintenance schedule is set up: https://nerc.instatus.com/cm4retg180022119ladrkl2t9
https://nerc.mghpcc.org/event/network-equipment-switch-maintenance-20250107/

The email announcement to all [NERC Users] has been sent!
