Adding GPU devices to the OCP prod cluster #219
Comments
@Milstein is this now resolved?
The 2x V100 nodes have been added to the cluster; however, I currently have scheduling disabled. A machineconfig is required to properly configure networking on these hosts. Applying it will cause all worker nodes to reboot, so it needs to be done during a maintenance event. I'm planning to do this before the upgrade during the 9/27 maintenance.
Re: MIG, there are issues with using MIG that I've mentioned previously, related to setting quotas/limits. We also don't currently have MIG-capable devices (e.g. A100s) in the cluster. That said, I believe we will have some new A100 hosts landing in the next few weeks/month.
I'm also looking into an issue where argocd is currently not applying resources due to long-standing sync issues. This is preventing the operators required for GPU computing from being installed on the prod cluster (see OCP-on-NERC/nerc-ocp-config#277).
Turns out this was a misconfiguration on my part that should be resolved by OCP-on-NERC/nerc-ocp-config#283.
@jtriley Was this resolved by OCP-on-NERC/nerc-ocp-config#283?
I'm going to close this issue given that we added them and configured networking properly during our last maintenance. There is still some ongoing work to set the proper quotas for GPUs from coldfront/openshift-acct-mgt. I will have a PR out soon for that.
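For context on the GPU quota work mentioned above, here is a minimal sketch of what a per-project GPU quota could look like as a Kubernetes `ResourceQuota` manifest. This is not the coldfront/openshift-acct-mgt implementation. The function name, the quota name `gpu-quota`, and the namespace `example-project` are hypothetical, and it assumes the GPUs are exposed under the `nvidia.com/gpu` extended resource name.

```python
import json


def gpu_quota_manifest(namespace: str, gpu_limit: int, name: str = "gpu-quota") -> dict:
    """Build a ResourceQuota manifest capping GPU requests in one namespace.

    Assumption: GPUs are advertised as the extended resource "nvidia.com/gpu".
    Kubernetes quotas for extended resources use the "requests." prefix.
    """
    if gpu_limit < 0:
        raise ValueError("gpu_limit must be non-negative")
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"hard": {"requests.nvidia.com/gpu": str(gpu_limit)}},
    }


if __name__ == "__main__":
    # Hypothetical project namespace with a limit of 2 GPUs.
    print(json.dumps(gpu_quota_manifest("example-project", 2), indent=2))
```

The printed JSON could then be applied with `oc apply -f -`; setting the limit to 0 would block GPU requests in that namespace entirely.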