Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pod level inject webhook #716

Merged
merged 9 commits into from
Aug 28, 2019
Merged

Conversation

wuchunghsuan
Copy link
Contributor

@wuchunghsuan wuchunghsuan commented Aug 12, 2019

For issue #685


This change is Reviewable

@wuchunghsuan
Copy link
Contributor Author

/cc @gaocegege

@k8s-ci-robot k8s-ci-robot requested a review from gaocegege August 12, 2019 10:19
"github.com/kubeflow/katib/pkg/webhook/v1alpha2/experiment/injector"
)

// experimentDefaulter that sets default fields in experiment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please update the comment

@@ -0,0 +1,72 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this webhook under pkg/webhook/v1alpha2/pod

@@ -84,5 +86,16 @@ func register(manager manager.Manager, server *webhook.Server) error {
if err != nil {
return err
}
return server.Register(mutatingWebhook, validatingWebhook)
injectWebhook, err := builder.NewWebhookBuilder().
Name("mutating.pod.kubeflow.org").
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe we can rename it to one which is katib related

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, do you have any suggestion about the name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should follow xx.$group.$domain style (group should be katib and domain is kubeflow.org here), we'd better make all others consistent with this style.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, it is a historical problem. Can we do the change in another PR? I think we could keep the name here and update it after the PR is merged, with other changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@gaocegege
Copy link
Member

@hougangliu Please take a look. We succeeded to use the modified metrics collector to get the metrics.

@hougangliu
Copy link
Member

@hougangliu Please take a look. We succeeded to use the modified metrics collector to get the metrics.

cool! I will look at it later today soon

logs, err := d.clientset.CoreV1().Pods(namespace).GetLogs(pl.Items[0].ObjectMeta.Name, &logopt).Do().Raw()
if err != nil {
return nil, err
logopt := apiv1.PodLogOptions{Container: "tensorflow", Timestamps: true, Follow: true}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only works for "tfjob"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a PoC now.

@@ -41,8 +41,8 @@ func GetJobLabelMap(jobKind string, trialName string) map[string]string {
labelMap := make(map[string]string)

if jobKind == "TFJob" {
labelMap["tf-job-name"] = trialName
labelMap["tf-job-role"] = "master"
labelMap["job-name"] = trialName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubeflow/pytorch-operator#204 had made pytorch also can use "job-name"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, updated the pytorch label name.

}

func (s *sidecarInjector) MutationRequired(pod *v1.Pod) bool {
value, err := s.GetLabel(pod, JobRoleLabel)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it can handle tfjob and pytorchjob case, but not cover k8s Job case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is just a PoC. In the future, we will support job

// Get the namespace from req since the namespace in the pod is empty.
namespace := req.AdmissionRequest.Namespace
// Do mutation
mutatedPod, err := s.Mutate(pod, namespace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

namespace can be get from pod object, so I think we can just use one argument instead, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to our experiment, we fail to get the namespace from pod.

VolumeMounts: pod.Spec.Containers[0].VolumeMounts,
}
mutatedPod.Spec.Containers = append(mutatedPod.Spec.Containers, injectContainer)
mutatedPod.Spec.ServiceAccountName = pod.Spec.ServiceAccountName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need line 149, line 120 cannot assign ServiceAccountName?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we cannot get it. There is a hack.

cc @wuchunghsuan

Name: "metrics-collector",
Image: "gcr.io/kubeflow-images-public/katib/v1alpha2/metrics-collector",
Command: []string{"./metricscollector"},
Args: []string{"-e", "TODO_Experiment", "-t", trialName, "-k", "TFJob", "-n", namespace, "-m", katibmanagerv1alpha2.GetManagerAddr(), "-mn", metricName},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can use "controller-name" label of Pod to get the "-k" value
we can get "-e" value, that is the experiment name, from trial's metadata ownerReferences field

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, got "-k" and "-e" values. It seems pytorch job does not have "controller-name" label yet?

@googlebot
Copy link

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@@ -46,18 +45,15 @@ func (d *MetricsCollector) CollectObservationLog(tId string, jobKind string, met
if len(pl.Items) == 0 {
return nil, fmt.Errorf("No Pods are found in Trial %v", tId)
}
logopt := apiv1.PodLogOptions{Container: "tensorflow", Timestamps: true, Follow: true}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the file unmodified since we have pkg/util/v1alpha2/sidecarmetricscollector

@gaocegege
Copy link
Member

Please remove [WIP] in the title

@wuchunghsuan wuchunghsuan changed the title [WIP] Add pod level inject webhook Add pod level inject webhook Aug 28, 2019
@wuchunghsuan
Copy link
Contributor Author

OK, ready to merge.

@hougangliu
Copy link
Member

/lgtm
we cab merge it now to enhance based on it

@hougangliu
Copy link
Member

/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hougangliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants