Adding GPU devices to the OCP prod cluster #219

Closed
Milstein opened this issue Sep 5, 2023 · 7 comments

@Milstein

Milstein commented Sep 5, 2023

  • We need to set up a few GPU devices on the NERC OCP prod cluster for courses and other workloads. We also need to understand whether MIG is possible in this setup.
@joachimweyl
Contributor

@Milstein is this now resolved?

@jtriley

jtriley commented Sep 20, 2023

The 2x V100 nodes have been added to the cluster; however, I currently have scheduling disabled. A MachineConfig is required to properly configure networking on these hosts. Applying it will cause all worker nodes to reboot, so this needs to be done during a maintenance event. I'm planning to do this before the upgrade during the 9/27 maintenance.
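
For readers unfamiliar with how scheduling is disabled on a node, here is a minimal sketch. It assumes the nodes were cordoned, which sets `spec.unschedulable` on the Node object; the comment above does not say which mechanism was actually used. The function names and the sample node names are hypothetical.

```python
import json


def scheduling_patch(unschedulable: bool) -> str:
    """Return a JSON merge-patch body that toggles scheduling on a Node.

    Cordoning a node sets spec.unschedulable to true; uncordoning clears it.
    """
    return json.dumps({"spec": {"unschedulable": unschedulable}})


def schedulable_nodes(nodes: list) -> list:
    """Return the names of nodes that are not marked unschedulable."""
    return [
        node["metadata"]["name"]
        for node in nodes
        if not node.get("spec", {}).get("unschedulable", False)
    ]


if __name__ == "__main__":
    # Hypothetical node objects, trimmed to the fields used here.
    sample = [
        {"metadata": {"name": "gpu-node-a"}, "spec": {"unschedulable": True}},
        {"metadata": {"name": "worker-b"}, "spec": {}},
    ]
    print(scheduling_patch(True))
    print(schedulable_nodes(sample))
```

Keeping the new hosts unschedulable until the networking change lands means no workloads are placed on them before they are fully configured.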

@jtriley

jtriley commented Sep 20, 2023

Re: MIG, there are issues with using MIG that I've mentioned previously related to setting quotas/limits, and we also don't currently have MIG-capable devices in the cluster (e.g. A100s). That said, I believe we will have some new A100 hosts landing in the next few weeks/month.

@jtriley

jtriley commented Sep 20, 2023

I'm also looking into an issue with argocd not applying resources currently due to long-standing sync issues. This is preventing the necessary operators required for GPU computing on the prod cluster from being installed (see OCP-on-NERC/nerc-ocp-config#277)

@jtriley

jtriley commented Sep 21, 2023

> I'm also looking into an issue with argocd not applying resources currently due to long-standing sync issues. This is preventing the necessary operators required for GPU computing on the prod cluster from being installed (see OCP-on-NERC/nerc-ocp-config#277)

Turns out this was a misconfiguration on my part that should be resolved by OCP-on-NERC/nerc-ocp-config#283.

@joachimweyl
Contributor

joachimweyl commented Oct 12, 2023

@jtriley Was this resolved by OCP-on-NERC/nerc-ocp-config#283?

@jtriley

jtriley commented Oct 13, 2023

I'm going to close this issue given that we added them and configured networking properly during our last maintenance. There is still some ongoing work to set the proper quotas for GPUs from coldfront/openshift-acct-mgt. I will have a PR out soon for that.
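
As an illustration of what a per-project GPU quota can look like, here is a minimal sketch that builds a Kubernetes `ResourceQuota` manifest. It assumes the GPUs are exposed under the `nvidia.com/gpu` resource name, and it is not the coldfront/openshift-acct-mgt implementation, which this thread does not show. The function name, the quota name `gpu-quota`, and the namespace `example-course` are hypothetical.

```python
import json


def gpu_quota_manifest(namespace: str, gpu_limit: int) -> dict:
    """Build a ResourceQuota capping the GPUs that pods in a namespace may request.

    Quotas on extended resources such as GPUs use the "requests." prefix.
    """
    if gpu_limit < 0:
        raise ValueError("gpu_limit must be non-negative")
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # Hypothetical quota name, chosen for illustration only.
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        "spec": {"hard": {"requests.nvidia.com/gpu": str(gpu_limit)}},
    }


if __name__ == "__main__":
    # Hypothetical namespace and limit.
    print(json.dumps(gpu_quota_manifest("example-course", 2), indent=2))
```

The printed JSON could be applied to a namespace by whatever tooling manages project quotas; a limit of 0 would block GPU requests in that namespace entirely.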

@jtriley closed this as completed Oct 13, 2023