-
Notifications
You must be signed in to change notification settings - Fork 716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[v1alpha2] Add PDB of TFReplicaSet for gang scheduling by kube-arbitrator #717
[v1alpha2] Add PDB of TFReplicaSet for gang scheduling by kube-arbitrator #717
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Travis tests have failedHey @codeflitting, 2nd Buildgometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
3rd Buildgometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
|
pkg/controller.v2/controller.go
Outdated
expectations: controller.NewControllerExpectations(), | ||
workQueue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), tfv1alpha2.Plural), | ||
recorder: recorder, | ||
enableGangScheduling: option.EnableGangScheduling, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should define the field in TFJobControllerConfiguration and set it in config here https://github.com/kubeflow/tf-operator/pull/717/files#diff-adc8ad6b53f8e2be56aa2ba02afbcc2bR85
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you means set it like this
// Create new TFJobController.
tc := &TFJobController{
config: TFJobControllerConfiguration{
ReconcilerSyncLoopPeriod: metav1.Duration{Duration: 15 * time.Second},
enableGangScheduling: option.EnableGangScheduling,
},
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it is better, WDYT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM while waiting for @jinzhejz 's comment
pkg/controller.v2/controller.go
Outdated
} | ||
|
||
// Check the pdb exist or not | ||
pdbName := "tf-job-pdb-" + tfjob.Name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could just use the job name as pdb name.
|
||
// Create pdb for gang scheduling by kube-arbitrator | ||
minAvailable := intstr.FromInt(int(tfjobReplicas)) | ||
createPdb := &v1beta1.PodDisruptionBudget{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the code work for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we set OwnerReference
with controller? In kube-arbitrator, we prefer to use OwnerReference
to avoid Selector
because of performance concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, when is PDB deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we are going to add owner reference now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to set the owner reference for the pdb?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to rely on Kubernetes garbage collector for the pdb currently. While we should implement the logic about deleting pdb when deleting the tfjob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to set the owner reference for the pdb?
Yes, set OwernReference for the PDB, and also set OwerReference.Controller
to true
We want to rely on Kubernetes garbage collector for the pdb currently.
That's OK to me :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We come to an agreement, please update the code, thanks for your contribution!
@gaocegege I have changed the name of PDB with tfjob’ name @k82cn OwnerReference added. Now the PDB is automatically deleted when the tfjob is deleted |
…ator Signed-off-by: Pengyu Chen <[email protected]>
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gaocegege The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
fix #575
Signed-off-by: Pengyu Chen [email protected]
This change is