-
Notifications
You must be signed in to change notification settings - Fork 453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initialize securityContext in injected metrics container #964
Initialize securityContext in injected metrics container #964
Conversation
Hi @vpavlin. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @richardsliu @johnugeorge |
/ok-to-test |
/test kubeflow-katib-presubmit |
@@ -178,11 +178,13 @@ func (s *sidecarInjector) getMetricsCollectorContainer(trial *trialsv1alpha3.Tri | |||
} | |||
args := getMetricsCollectorArgs(trial.Name, metricName, mc) | |||
sidecarContainerName := getSidecarContainerName(trial.Spec.MetricsCollector.Collector.Kind) | |||
securityContext := originalPod.Spec.Containers[0].SecurityContext.DeepCopy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can it make sure that originalPod.Spec.Containers[0].SecurityContext here can fit the metricsCollector and the originalPod.Spec.Containers[0].SecurityContext may not be generated by other mutating Webhook but defined explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I fully understand the questions. Please look at my comment below which explains how the securityContext is generated and see if it helps.
For how I understand your question - the metricsCollector should have the same securityContext as the [0]
container as that is how it would be generated by PSP by default - based on the ServiceAccount privileges it would generate securityContext for Pod and from there for all the containers. So as the [0]
container already has the right securityContext it is safe to simply replicate it into the metricsCollector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The container securityContext can be generated by PSP, but also can be specified explicitly, right?
For example, a user adds a training job in a container with securityContext: runAsUser: 1000
, if user 1000 can work well for metricsCollector container, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what you mean now. That is a fair point. I would say it would work fine - IIUC both containers are working with the same files, so if the user container works with UID 1000
, the metrics container should work too, right?
I am trying to figure out if there is a common pattern about how to solve these cases with the OpenShift team.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned somewhere else the best solution would be to use reinvocationPolicy
which should take care of generating the SC in the new container after it is added by webhook. That has 2 issues at the moment -
- it is coming in next version of OpenShift, so I cannot even test it:)
- the code katib uses to register webhooks does not know about the field, so I'll have to figure out how to actually do it when it is available
It seems (after discussing with OpenShift team) that using the existing SC from the other container might be our best shot here. I will try to git into it a bit more, but as I mentioned - as long as the sidecar container does not do anything special, it should work fine with the SC from existing container. The only way I see this could potentially fail if if the user configures the container very restrictively and metrics container cannot get to what it needs to read.
The only other way I could think of is to configure the SC explicitly to the state that we know will work for most configurations. Question is what happens if the user then configures his container to run as root
and metrics container does not get the same SC and thus will probably not be able to access files produced by the main container.
That said, I still think copying the SC is the best solution at the moment. I will try to play around with it a bit more tomorrow to see if I can actually break this approach:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can try another solution as workaround: now k8s runs Webhooks in string order by webhook name, we can try to run katib mutating before other related webhooks by renaming it.
for example, now its name is katib-mutating-webhook-config
, we can rename it to admission-katib-mutating-webhook-config
, not sure what openshift related Webhooks name is.
I think other projects like Istio have similar issue, have no idea how the project handle this case (for istio, I think istioctl kube-inject
to bypass the istio mutatingwebhook as a workaround.).
BTW, @johnugeorge @richardsliu @gaocegege WDYT we provide a tool like istioctl kube-inject
to inject metricsCollector container in job level offline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked RH istio team and their issue seems to be different because they set the SC
explicitly as they need their container to run as privileged
.
I am not sure about reordering by name because AFAIK the component generating SC
is PSP (Pod Security Policies) which does not run as a webhook.
Adding the metrics container on Job
level might help though - I added comment here #928 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metrics collector container now reads metrics file and then reports metrics to DB server, I wonder if two operations fail owning to training job SC configuration.
I tested this before sending the PR and the operations surely did not fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why RH istio team think their issue is different.
the root cause is: when a Pod generated and before persisted, PSP adds security context to all containers; then mutatingwebhook injecting new container into a Pod which leads that the new container has no consistent SC which conflicts with PSP, right?
so I think it is a common use case, many projects would inject new container into a Pod by mutatingwebhook, such as istio, kfserving.
we had better handle this case in a best practice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a command line flag to control the behavior whether we use other containers's securityContext if we cannot come to an agreement now?
@vpavlin Is this |
@johnugeorge It is a problem on OpenShift in general - OpenShift is very security focused and enforces quite limited SCC by default. This is happening as part of Pod Security Policies controller and the securityContext is first generated on pod level and then from the pod into containers. (See https://github.com/openshift/apiserver-library-go/blob/master/pkg/securitycontextconstraints/sccmatching/provider.go#L152) This apparently runs before the mutating webhooks, so when Katib then injects new container without securityContext initialized, OpenShfit is not happy about it. So the solution I am proposing for now is to replicate the securityContext from a container that already passed through the PSP mutation into the injected container. Does it make sense? |
Got it. |
As we discussed, we decide to merge it and add an additional flag to control it. I will add it after the PR is merged. |
/lgtm /cc @hougangliu |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gaocegege The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
The issue is that the injected container does not have
securityContext
initialized (as that is done by some earlier mutating webhook). The simplest solution for now is to simply copy thesecurityContext
from the container that already exists (and is fully mutated) in the pod.Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #963
Special notes for your reviewer:
No image changes
Release note:
This change is