-
Notifications
You must be signed in to change notification settings - Fork 453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pod level inject webhook #716
Conversation
/cc @gaocegege |
"github.com/kubeflow/katib/pkg/webhook/v1alpha2/experiment/injector" | ||
) | ||
|
||
// experimentDefaulter that sets default fields in experiment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please update the comment
@@ -0,0 +1,72 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add this webhook under pkg/webhook/v1alpha2/pod
@@ -84,5 +86,16 @@ func register(manager manager.Manager, server *webhook.Server) error { | |||
if err != nil { | |||
return err | |||
} | |||
return server.Register(mutatingWebhook, validatingWebhook) | |||
injectWebhook, err := builder.NewWebhookBuilder(). | |||
Name("mutating.pod.kubeflow.org"). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe we can rename it to one which is katib related
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, do you have any suggestion about the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should follow xx.$group.$domain style (group should be katib
and domain is kubeflow.org
here), we'd better make all others consistent with this style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, it is a historical problem. Can we do the change in another PR? I think we could keep the name here and update it after the PR is merged, with other changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
@hougangliu Please take a look. We succeeded to use the modified metrics collector to get the metrics. |
cool! I will look at it later today soon |
logs, err := d.clientset.CoreV1().Pods(namespace).GetLogs(pl.Items[0].ObjectMeta.Name, &logopt).Do().Raw() | ||
if err != nil { | ||
return nil, err | ||
logopt := apiv1.PodLogOptions{Container: "tensorflow", Timestamps: true, Follow: true} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only works for "tfjob"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a PoC now.
@@ -41,8 +41,8 @@ func GetJobLabelMap(jobKind string, trialName string) map[string]string { | |||
labelMap := make(map[string]string) | |||
|
|||
if jobKind == "TFJob" { | |||
labelMap["tf-job-name"] = trialName | |||
labelMap["tf-job-role"] = "master" | |||
labelMap["job-name"] = trialName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kubeflow/pytorch-operator#204 had made pytorch also can use "job-name"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, updated the pytorch label name.
} | ||
|
||
func (s *sidecarInjector) MutationRequired(pod *v1.Pod) bool { | ||
value, err := s.GetLabel(pod, JobRoleLabel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it can handle tfjob and pytorchjob case, but not cover k8s Job case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it is just a PoC. In the future, we will support job
// Get the namespace from req since the namespace in the pod is empty. | ||
namespace := req.AdmissionRequest.Namespace | ||
// Do mutation | ||
mutatedPod, err := s.Mutate(pod, namespace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
namespace can be get from pod object, so I think we can just use one argument instead, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to our experiment, we fail to get the namespace from pod.
VolumeMounts: pod.Spec.Containers[0].VolumeMounts, | ||
} | ||
mutatedPod.Spec.Containers = append(mutatedPod.Spec.Containers, injectContainer) | ||
mutatedPod.Spec.ServiceAccountName = pod.Spec.ServiceAccountName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need line 149, line 120 cannot assign ServiceAccountName
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that we cannot get it. There is a hack.
Name: "metrics-collector", | ||
Image: "gcr.io/kubeflow-images-public/katib/v1alpha2/metrics-collector", | ||
Command: []string{"./metricscollector"}, | ||
Args: []string{"-e", "TODO_Experiment", "-t", trialName, "-k", "TFJob", "-n", namespace, "-m", katibmanagerv1alpha2.GetManagerAddr(), "-mn", metricName}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can use "controller-name" label of Pod to get the "-k" value
we can get "-e" value, that is the experiment name, from trial's metadata ownerReferences field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, got "-k" and "-e" values. It seems pytorch job does not have "controller-name" label yet?
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the ℹ️ Googlers: Go here for more info. |
02c885e
to
0a08f6d
Compare
0a08f6d
to
12608c5
Compare
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
@@ -46,18 +45,15 @@ func (d *MetricsCollector) CollectObservationLog(tId string, jobKind string, met | |||
if len(pl.Items) == 0 { | |||
return nil, fmt.Errorf("No Pods are found in Trial %v", tId) | |||
} | |||
logopt := apiv1.PodLogOptions{Container: "tensorflow", Timestamps: true, Follow: true} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep the file unmodified since we have pkg/util/v1alpha2/sidecarmetricscollector
Please remove [WIP] in the title |
OK, ready to merge. |
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hougangliu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
For issue #685
This change is