feat(civo-github): add gpu operator to allow use of GPU nodes #789
metadata {
  name = "gpu-operator"
  labels = {
    "pod-security.kubernetes.io/enforce" = "privileged"
Any reason for this limitation?
It's required in the NVIDIA docs if you're using pod security admissions. As this is a PoC, I think it should be in to avoid any issues where it's not applied.
I don't read it as "it's required". The statement starts with "If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods", but I'm not sure that having installed it makes this a requirement.
touch /run/nvidia/validations/toolkit-ready;
touch /run/nvidia/validations/.driver-ctr-ready;
touch /run/nvidia/validations/driver-ready
sleep infinity
I would recommend changing this (and maybe the container) to a kubernetes/pause container that can correctly trap signals.
With sleep infinity inside a bash command, a trapped signal is only handled once the running sleep command has finished (which in this case it never does). That's why Kubernetes relies on kubernetes/pause rather than any sleep mechanism, which is also a bit wasteful for this purpose.
More info: https://mywiki.wooledge.org/SignalTrap#When_is_the_signal_handled.3F
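If switching the image to pause turns out not to be practical, a signal-aware wait might be a middle ground. A minimal sketch, assuming the container keeps a shell and GNU sleep; hold_until_signalled is a hypothetical helper name, not something from this PR or the Civo guide:

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not code from the PR):
# block until TERM or INT arrives, then return so the shell can exit.
hold_until_signalled() {
  stop=0
  trap 'stop=1' TERM INT

  # Run the long sleep in the background so the shell itself sits in
  # the `wait` builtin, which is interrupted by a trapped signal.
  sleep infinity &
  sleep_pid=$!
  wait "$sleep_pid" || true

  # Clean up the background sleep once we have been signalled.
  kill "$sleep_pid" 2>/dev/null || true
  wait "$sleep_pid" 2>/dev/null || true
}
```

In the fake-operator command this would be called after the touch lines in place of the bare sleep infinity, so the pod can finalize as soon as the kubelet sends SIGTERM instead of waiting out the grace period.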
This is straight out of the Civo guide. I don't particularly like this, but the fake-operator pod is required to prompt the operator to apply the labels/annotations to the node.
I'll have a look at your suggestion, but this might need to stay as-is if I can't get it to work.
I see. I know it's required (several other things rely on similar tricks, such as enabling higher file descriptor limits, so this pattern is quite common).
While that might be in their guide, I would want to avoid having to troubleshoot struggling-to-finalize pods ourselves if we can leverage pause instead. Happy to help you make the conversion if need be!
Description
This is a piece of work that requires a bit of thought. For Civo-GitHub, I've added the ability to use GPU nodes. This requires some changes to how things work which could cause problems.
Uncontroversial changes
is_gpu flag as it assumes that all GPU nodes start g4g. or an. which may not always be true - if anyone has a better idea how to achieve this, I'm all ears (I had hoped that data.civo_size would have it, but it doesn't).
Controversial changes
helm_release Terraform resource, which has a problem with the Crossplane provider not being able to download the charts (see Terraform helm provider cannot retrieve chart crossplane-contrib/provider-terraform#54). In order to fix this, I need to use an emptyDir volume mount on the Crossplane provider's pod and the v1.12.2 version of the CRD doesn't have it in ControllerConfig.pkg.crossplane.io/v1alpha1
civo-github has a different Crossplane version to all the other providers which I feel should probably be consistent.
ControllerConfig as it's deprecated as of v1.11 and will be removed at some future date. Again, this is a big change to do across all providers.
Related Issue(s)
Fixes #
How to test
g4g.40.kube.small
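
As a quick sanity check of the prefix assumption behind is_gpu, the same rule can be sketched in shell. This is only an illustration: the PR implements the flag in Terraform, is_gpu_size is a hypothetical name, and g4s.kube.medium is used purely as an example of a non-GPU size.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a node size name looks like a GPU size,
# i.e. starts with "g4g." or "an." (the assumption described above).
is_gpu_size() {
  case "$1" in
    g4g.*|an.*) return 0 ;;  # treated as a GPU node size
    *)          return 1 ;;  # everything else is treated as non-GPU
  esac
}

# Example: report how each size would be classified.
for size in g4g.40.kube.small g4s.kube.medium; do
  if is_gpu_size "$size"; then
    echo "$size: gpu"
  else
    echo "$size: not gpu"
  fi
done
```

Any GPU size that does not follow either prefix would be missed, which is the weakness already noted for this approach.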