-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor the TfJob to use K8s libraries #215
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here (e.g.
|
Hi @wackxu. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I signed it! |
/ok-to-test |
Thanks for this PR. This looks like a great PR. The PR is very big and I'm still trying to wrap my head around it. It would be very helpful if you could update the initial comment to provide more description about the motivation for the PR and the relevant changes. Specifically, it would be great if the initial comment in the PR could address the following questions.
Reviewed 6 of 35 files at r1. hack/update-codegen.sh, line 2 at r1 (raw file):
Please add a comment explaining what this is for? hack/update-codegen.sh, line 28 at r1 (raw file):
Can these scripts for code gen be moved into their own PR to reduce the PR size? hack/verify-codegen.sh, line 2 at r1 (raw file):
Please add a comment explaining what this is for. pkg/apis/tensorflow/helper/helpers.go, line 1 at r1 (raw file):
Is there a package level comment somewhere? pkg/apis/tensorflow/helper/helpers.go, line 12 at r1 (raw file):
Exported functions should have comments. pkg/apis/tensorflow/helper/helpers.go, line 14 at r1 (raw file):
I think we can delete this TODO since we require 1.8 for CRD. pkg/apis/tensorflow/helper/helpers.go, line 83 at r1 (raw file):
Is the convention to use "Cleanup" to denote this functionality? pkg/apis/tensorflow/helper/helpers.go, line 95 at r1 (raw file):
I think we can delete this since we don't have a scaling down condition. pkg/apis/tensorflow/helper/helpers.go, line 104 at r1 (raw file):
What is a recovering condition? Can you add a comment since its an exported method? pkg/apis/tensorflow/helper/helpers.go, line 112 at r1 (raw file):
Here and below, please add comments explaining what these conditions mean in terms of the TfJob semantics. pkg/apis/tensorflow/helper/helpers.go, line 123 at r1 (raw file):
What is this condition in the context of TfJob? pkg/apis/tensorflow/helper/helpers.go, line 159 at r1 (raw file):
Do we need this? TfJob currently doesn't support scaling. pkg/apis/tensorflow/v1alpha1/defaults.go, line 1 at r1 (raw file):
Package comment. pkg/apis/tensorflow/v1alpha1/defaults.go, line 11 at r1 (raw file):
Why did you make SetDefaults a pure function as opposed to making at a member function (I know that's not the correct terminalogy) of TfJobSpec? pkg/apis/tensorflow/v1alpha1/defaults.go, line 35 at r1 (raw file):
The code to sSetDefaults changed in http://github.com/tensorflow/k8s/pull/204 . That code started setting the TerminationPolicy. Can you update the code please? Comments from Reviewable |
Ignore my previous comment about splitting this PR up. I looked at it more thoroughly and it looks like the bulk of it is autogenerated code I think its fine to submit this as one PR. Overall the code looks good. I think my main comments are to delete some of the code that is a legacy of the etcd operator that we don't need anymore. Reviewed 26 of 35 files at r1. pkg/apis/tensorflow/v1alpha1/defaults.go, line 1 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
NM. There's a comment in doc.go. pkg/apis/tensorflow/v1alpha1/types.go, line 20 at r1 (raw file):
Are these annotations that the code gen client understands? pkg/apis/tensorflow/v1alpha1/types.go, line 33 at r1 (raw file):
I think we can delete this TODO now. pkg/apis/tensorflow/v1alpha1/types.go, line 112 at r1 (raw file):
Conditions were copied from the etcd cluster. I think we should delete them. pkg/apis/tensorflow/v1alpha1/types.go, line 136 at r1 (raw file):
I think we can delete ControlPaused. Currently there's no notion of pausing a job. If we want to add that later we can add it back. pkg/apis/tensorflow/v1alpha1/types.go, line 139 at r1 (raw file):
I think we should delete conditions. It was inherited from etcd operator and I don't think it makes sense for TfJob. pkg/apis/tensorflow/v1alpha1/types.go, line 171 at r1 (raw file):
etcd -> TfJobs pkg/apis/tensorflow/v1alpha1/types.go, line 177 at r1 (raw file):
third party objects -> TfJobs pkg/apis/tensorflow/validation/validation.go, line 31 at r1 (raw file):
Can you add the validation that was added in #204 to check termination policy? pkg/client/clientset/versioned/typed/tensorflow/v1alpha1/tfjob.go, line 17 at r1 (raw file):
This file is autogenerated right even though it doesn't have a comment to that effect? Comments from Reviewable |
Reviewed 35 of 35 files at r1. glide.yaml, line 19 at r1 (raw file):
pkg/apis/tensorflow/v1alpha1/register.go, line 33 at r1 (raw file):
It is better to add a new line. pkg/apis/tensorflow/v1alpha1/types.go, line 20 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Yeah, we have to add these annotations to make code-generator works. Comments from Reviewable |
Let me know when this is ready for another look. |
@jlewi Sorry for the delay, I am very busy now and today night I will fix all the comments. |
When I fix it, I will call you @jlewi |
@jlewi Just update the PR comment and sorry that I will addressed the comments tomorrow, It is about eleven twenty in the evening and I need to go to home. |
No rush on my end. Just want to make sure you aren't blocked waiting on me. Thank you very much for doing this. |
* update k8s dependency to stable version * now the dependency for k8s is not the release version and this causes problems for #215, because the code-generator is outdated but the latest version for kubernetes 1.8.X fixes problem.
Reviewed 31 of 527 files at r2. hack/update-codegen.sh, line 2 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Can you add a comment please? hack/update-codegen.sh, line 28 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
As I mentioned elsewhere I think it makes sense to keep them in this PR. pkg/apis/tensorflow/helper/helpers.go, line 83 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
I think we can take care of cleaning this up in a follow on PR since this PR didn't actually add the cleanup function. pkg/apis/tensorflow/helper/helpers.go, line 159 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
OK pkg/apis/tensorflow/v1alpha1/defaults.go, line 11 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
I think you missed this question? pkg/apis/tensorflow/v1alpha1/defaults.go, line 35 at r1 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
ok pkg/apis/tensorflow/v1alpha1/register.go, line 33 at r1 (raw file): Previously, gaocegege (Ce Gao) wrote…
Can you add a newline please? Comments from Reviewable |
There were a couple unresolved comments but nothing major. All the vendor changes makes reviewing this PR a pain. So I'm going to merge and we can follow up in folllow on PRs. Review status: 55 of 551 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed. Comments from Reviewable |
Actually can you fix the travis build, it looks like there is an issue with dependencies
I suspect since we check in the vendor directory we no longer need to run glide install and that might fix the issue. |
I think #236 (removing glide install from travis build) should also fix the issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your good refactor works!
This is a very big PR, so I only take a look at API definition, and leave some quick comments here:
- There are some default value but i think it is defined by user.
- Some
state
is a little complicated, maybe it is difficult to maintain so many states in the control logic.
type AcceleratorVolume struct { | ||
Name string | ||
HostPath string | ||
MountPath string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, there is no need to mount GPU manually if we use nvidia device-plugin and nvidia-docker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some useful information here: kubernetes/kubernetes#54011
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all Kubernetes distributions support device-plugin? As of a couple months ago I still think we needed to use the file path on ACS and some other platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think device-plugin will be beta (enabled by default) in 1.10, and NVIDIA GPU discovery feature (alpha.kubernetes.io/nvidia-gpu
) will be deprecated in the future.
Please read this doc for more details: kubernetes/website#6736.
// SetDefaults sets any unspecified values to defaults | ||
func SetDefaults_TfJobSpec(c *TfJobSpec) error { | ||
if c.TfImage == "" { | ||
c.TfImage = DefaultTFImage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this TfImage
required, why should we set a default value for this?
|
||
//Set the default configuration for a PS server if the user didn't specify a PodTemplateSpec | ||
if r.Template == nil && r.TfReplicaType == PS { | ||
setDefaultPSPodTemplateSpec(r, c.TfImage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we set a default value here?
} | ||
|
||
if string(r.TfReplicaType) == "" { | ||
r.TfReplicaType = MASTER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, should this be required? and MASTER
is only one in distributed TF training.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wackxu this should fix
|
||
type TfJobSpec struct { | ||
// TODO(jlewi): Can we we get rid of this and use some value from Kubernetes or a random ide. | ||
RuntimeId string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reuse metadata.UID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open up an issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI #249
State ReplicaState `json:"state"` | ||
|
||
// ReplicasStates provides the number of replicas in each status. | ||
ReplicasStates map[ReplicaState]int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: why there is so many states
, State/ReplicaState/TfReplicaStatus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open up an issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI #249
/test all |
@ScorpioCPH Can you open up an issue with your comments about the API issues? This PR is primarily changing the existing implementation but not the API. So it would be better to address API changes in another issue. There are already a some API related issues tagged with the API label. |
Any idea what's going on with the test failures? Maybe try pulling in the latest changes from master. It looks like the travis build is still running @wackxu I think we can merge this as soon as the test failures are fixed. Would be great to get this merged since it touches a lot of files and will unblock other refactoring. |
@jlewi This style is keep to be consistent with k8s. And I have add auto generated |
Sorry for the delay, Addressed comments. PTAL. I will do all the refactor work in follow two days since I am not busy now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leaves some comments,this PR generated lgtm. @wackxu thank you do this :)
} | ||
} | ||
|
||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not return nil
just return
return nil | ||
} | ||
|
||
func setDefault_PSPodTemplateSpec(r *TfReplicaSpec, tfImage string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems this func not join defaulter generator, we can remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zjj2wry The setDefault_PSPodTemplateSpec
is called by the SetDefaults_TfJob
func. I think we should keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, i miss this~
} | ||
|
||
|
||
// SetDefaults sets any unspecified values to defaults |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix this lints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you attach more info about lints errors :)
It maybe helpful to fix this.
@wackxu maybe loses some dependency~
|
@zjj2wry Seems reasonable to me. I think we can merge this as soon as the tests are fixed and fix any remaining issues in follow up PRs. |
@jlewi Addressed comments, PTLA |
Reviewed 1 of 527 files at r2, 1 of 6 files at r3, 1 of 1 files at r4. Comments from Reviewable |
Any idea why the travis build is still failing? Not sure why build would succeed in the presubmit but not the travis build.
|
@jlewi fix travis build fail, PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Thanks for the hard work. |
fix #206
1.refactor the
spec
package to the kubernetes standard API, all detention of our CRD is in pkgpkg/apis/tensorflow
and also the validation、 set Default for the CRD and some help func.2.for use K8s code gen, wo import the project https://github.com/kubernetes/code-generator, and some shell script in the folder
hack
. You can see how to use it in https://blog.openshift.com/kubernetes-deep-dive-code-generation-customresources/3.All the file in
pkg/client
are all auto generated and we should not modify it manually.4.I do not plan to change the logical to use the sharedInformer and clientset. This PR is big enough, I think I can do it in a follow PR.
5.SharedInformer is a helpful tool with cache that we can use it to list and watch our resources. More details you can see http://borismattijssen.github.io/articles/kubernetes-informers-controllers-reflectors-stores
This change is