pkg: Add recorder support #312

Merged
merged 1 commit into from
Jan 22, 2018

gaocegege
Member

@gaocegege gaocegege commented Jan 15, 2018

This PR adds recorder support to the operator. We can now publish events about the TFJob to Kubernetes:

Events:
  Type    Reason            Age   From      Message
  ----    ------            ----  ----      -------
  Normal  SuccessfulCreate  9s    kubeflow  Created service: example-job-master-xpnw-0
  Normal  SuccessfulCreate  9s    kubeflow  Created job: example-job-master-xpnw-0
  Normal  SuccessfulCreate  9s    kubeflow  Created service: example-job-worker-xpnw-0
  Normal  SuccessfulCreate  9s    kubeflow  Created job: example-job-worker-xpnw-0

Signed-off-by: Ce Gao [email protected]



@k8s-ci-robot

Hi @gaocegege. Thanks for your PR.

I'm waiting for a kubernetes or tensorflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@coveralls

coveralls commented Jan 15, 2018

Coverage Status

Coverage increased (+0.07%) to 31.68% when pulling 1968e71 on gaocegege:recorder into 98a34a1 on tensorflow:master.

Member

@ScorpioCPH ScorpioCPH left a comment

Generally LGTM with some nits.

return err
} else {
Member

nit: we already return in the preceding if branch, so the else can be removed here.

}
} else {
Member

ditto.

Member Author

Member

I mean the else at line 195, not line 191.

Member Author

Yeah, I know. But if we remove the else, the branch where the service already exists will publish events, too.
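To make the trade-off in this thread concrete, here is a minimal Go sketch of the two control flows. It does not use the operator's real types: `errAlreadyExists`, `createWithElse`, and `createWithoutElse` are hypothetical names, events are collected in a slice instead of being sent through a recorder, and the event strings are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errAlreadyExists is a hypothetical stand-in for the Kubernetes
// "already exists" API error that the diff checks for.
var errAlreadyExists = errors.New("already exists")

// createWithElse keeps the outer else: a success event is published only
// when the (simulated) create call returned no error at all.
func createWithElse(name string, createErr error) ([]string, error) {
	var events []string
	if createErr != nil {
		if errors.Is(createErr, errAlreadyExists) {
			// Nothing was created, so log and publish nothing.
			log.Printf("Service %v already exists.", name)
		} else {
			events = append(events, fmt.Sprintf("Warning FailedCreate Error creating service: %v", createErr))
			return events, createErr
		}
	} else {
		events = append(events, "Normal SuccessfulCreate Created service: "+name)
	}
	return events, nil
}

// createWithoutElse drops the outer else: the already-exists case now
// falls through and publishes a success event as well.
func createWithoutElse(name string, createErr error) ([]string, error) {
	var events []string
	if createErr != nil && !errors.Is(createErr, errAlreadyExists) {
		events = append(events, fmt.Sprintf("Warning FailedCreate Error creating service: %v", createErr))
		return events, createErr
	}
	// Reached for a fresh create and for the already-exists case.
	events = append(events, "Normal SuccessfulCreate Created service: "+name)
	return events, nil
}

func main() {
	a, _ := createWithElse("example-job-master-xpnw-0", errAlreadyExists)
	b, _ := createWithoutElse("example-job-master-xpnw-0", errAlreadyExists)
	fmt.Println(len(a), len(b)) // prints "0 1"
}
```

With the else in place, an already-existing service yields no event; without it, the same input yields a spurious SuccessfulCreate.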

}
} else {
Member

ditto.

if err != nil {
log.Errorf("Error creating PS ConfigMap: %v, %v", cm.ObjectMeta.Name, err)
log.Errorf("Error creating PS ConfigMap: %v, %v", createdCM.ObjectMeta.Name, err)
s.recorder.Eventf(s.Job.job, v1.EventTypeWarning, FailedCreateReason, "Error creating: %v", err)
Member

Maybe we can add more info here: Error creating configmaps: xxx

Member Author

SGTM


// If the job already exists do nothing.
if err != nil {
if k8s_errors.IsAlreadyExists(err) {
log.Infof("Service %v already exists.", s.jobName(index))
} else {
return k8sErrors.NewAggregate([]error{fmt.Errorf("Creating service %v returned error.", service.ObjectMeta.Name), err})
s.recorder.Eventf(s.Job.job, v1.EventTypeWarning, FailedCreateReason, "Error creating: %v", err)
Member

Error creating service xxx


// If the job already exists do nothing.
if err != nil {
if k8s_errors.IsAlreadyExists(err) {
log.Infof("%v already exists.", s.jobName(index))

} else {
return k8sErrors.NewAggregate([]error{fmt.Errorf("Creating Job %v returned error.", newJ.ObjectMeta.Name), err})
s.recorder.Eventf(s.Job.job, v1.EventTypeWarning, FailedCreateReason, "Error creating: %v", err)
Member

Error creating job: xxx

@coveralls

coveralls commented Jan 15, 2018

Coverage Status

Coverage increased (+0.07%) to 31.68% when pulling 8431e13 on gaocegege:recorder into 98a34a1 on tensorflow:master.

@jlewi
Contributor

jlewi commented Jan 15, 2018

Can you update the PR description to explain what a recorder does? I suspect it records events and publishes them?

@jlewi
Contributor

jlewi commented Jan 15, 2018

/ok-to-test

@gaocegege
Member Author

@jlewi

The recorder records events for the TFJob instance. For example, when we create a TFJob, we need to create some services and jobs for it. We then record the creation of each of these via the recorder.
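As a rough illustration of that description, the Go sketch below records one event per created service and job. It is a simplified stand-in, not the client-go recorder: the `recorder` interface, `memoryRecorder`, and `recordCreations` are hypothetical, the object is reduced to a plain string, and events are kept in memory rather than sent to the API server. The type and reason strings are taken from the event listing in the PR description.

```go
package main

import "fmt"

// recorder is a hypothetical, simplified interface. Its Eventf method
// loosely follows the shape of the Eventf calls visible in the diff,
// with the object reduced to a name.
type recorder interface {
	Eventf(object, eventType, reason, messageFmt string, args ...interface{})
}

// memoryRecorder stores formatted events in a slice.
type memoryRecorder struct {
	events []string
}

// Eventf formats the event as "<type> <reason> <message>" and stores it.
func (r *memoryRecorder) Eventf(object, eventType, reason, messageFmt string, args ...interface{}) {
	msg := fmt.Sprintf(messageFmt, args...)
	r.events = append(r.events, fmt.Sprintf("%s %s %s", eventType, reason, msg))
}

// recordCreations records a service event and a job event for every
// replica name, all attached to the same TFJob.
func recordCreations(r recorder, tfJob string, replicas []string) {
	for _, name := range replicas {
		r.Eventf(tfJob, "Normal", "SuccessfulCreate", "Created service: %v", name)
		r.Eventf(tfJob, "Normal", "SuccessfulCreate", "Created job: %v", name)
	}
}

func main() {
	r := &memoryRecorder{}
	recordCreations(r, "example-job", []string{
		"example-job-master-xpnw-0",
		"example-job-worker-xpnw-0",
	})
	for _, e := range r.events {
		fmt.Println(e)
	}
}
```

Keeping the recorder behind an interface is one way to let tests swap in an in-memory implementation like this one.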

@jlewi
Contributor

jlewi commented Jan 16, 2018

@gaocegege Can you open up an issue to add E2E tests to verify the events are published?

Please sync but otherwise LGTM.
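One possible shape for the check requested above is sketched below in Go. It assumes the test can observe published events as strings on a channel, similar in spirit to the fake recorder that client-go provides for unit tests. `collectEvents` and `missing` are hypothetical helpers, and a real E2E test would read events from the cluster instead.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// collectEvents reads up to want events from ch and stops early once
// timeout has elapsed. ch stands in for wherever the test observes events.
func collectEvents(ch <-chan string, want int, timeout time.Duration) []string {
	var got []string
	deadline := time.After(timeout)
	for len(got) < want {
		select {
		case e := <-ch:
			got = append(got, e)
		case <-deadline:
			return got
		}
	}
	return got
}

// missing returns every expected substring that no observed event contains.
func missing(observed, expected []string) []string {
	var out []string
	for _, exp := range expected {
		found := false
		for _, obs := range observed {
			if strings.Contains(obs, exp) {
				found = true
				break
			}
		}
		if !found {
			out = append(out, exp)
		}
	}
	return out
}

func main() {
	ch := make(chan string, 4)
	ch <- "Normal SuccessfulCreate Created service: example-job-master-xpnw-0"
	ch <- "Normal SuccessfulCreate Created job: example-job-master-xpnw-0"

	got := collectEvents(ch, 2, time.Second)
	// An empty result means every expected event was seen.
	fmt.Println(len(missing(got, []string{"Created service:", "Created job:"}))) // prints "0"
}
```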

@gaocegege
Member Author

@jlewi OK, and I can take the issue.

@ScorpioCPH
Member

@gaocegege This needs a rebase :)

@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

Signed-off-by: Ce Gao <[email protected]>
@googlebot

CLAs look good, thanks!

@coveralls

coveralls commented Jan 19, 2018

Coverage Status

Coverage increased (+0.2%) to 31.611% when pulling 2ec6bc7 on gaocegege:recorder into 04425d9 on tensorflow:master.

@gaocegege
Member Author

PTAL

@jlewi
Contributor

jlewi commented Jan 20, 2018

@ScorpioCPH Have your comments been addressed?

@ScorpioCPH
Member

@jlewi @gaocegege Mostly, only some nits, LGTM.

@jlewi jlewi merged commit ca638ed into kubeflow:master Jan 22, 2018
@gaocegege gaocegege deleted the recorder branch April 10, 2018 03:25
6 participants