fix: Can not sync job status correctly when upgrading from v1.5 #3652
Welcome @QingyaFan! |
Welcome! Please sign off your commit with |
if err := cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace).Delete(context.TODO(), pgName, metav1.DeleteOptions{}); err != nil {
pgName := jobhelpers.GetRelatedPodGroupName(job)
pgIface := cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace)
if err := pgIface.Delete(context.TODO(), pgName, metav1.DeleteOptions{}); err != nil {
Why named Iface here?
Because when Volcano is upgraded from 1.5 to a later version, if we cannot delete the podgroup with the new name, we should try to delete the podgroup with the legacy name (i.e. job.Name). That means the interface is used twice, so to avoid typing mistakes when repeating cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace), I introduced a new variable.
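The fallback described above can be sketched as follows. This is a minimal illustration rather than the PR's code: `podGroupDeleter`, `fakeClient`, `errNotFound`, and `deletePodGroup` are hypothetical names, and the in-memory client stands in for the real PodGroup client so the example runs on its own. The name format `job.Name + "-" + string(job.UID)` is the one quoted later in this review.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("podgroup not found")

// podGroupDeleter is a hypothetical, minimal stand-in for the namespaced
// PodGroup client; it only models Delete.
type podGroupDeleter interface {
	Delete(name string) error
}

// fakeClient is an in-memory implementation used only for illustration.
type fakeClient struct{ groups map[string]bool }

func (f *fakeClient) Delete(name string) error {
	if !f.groups[name] {
		return errNotFound
	}
	delete(f.groups, name)
	return nil
}

// deletePodGroup tries the UID-suffixed name first and falls back to the
// legacy name (the bare job name) only when the first name is not found.
func deletePodGroup(c podGroupDeleter, jobName, jobUID string) error {
	err := c.Delete(jobName + "-" + jobUID)
	if errors.Is(err, errNotFound) {
		return c.Delete(jobName)
	}
	return err
}

func main() {
	// Only a legacy-named pod group exists, as after an upgrade.
	c := &fakeClient{groups: map[string]bool{"job1": true}}
	fmt.Println(deletePodGroup(c, "job1", "abc"), len(c.groups))
}
```

Holding the client in one variable, as the author does, means both Delete calls go through the same namespaced client.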
I mean the name pgIface itself is a bit ambiguous :)
I noticed that cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace) returns a v1beta1.PodGroupInterface type, so I named the variable pgIface (for v1beta1.PodGroupInterface). Can you give me a suggestion? 🙇🏼
I think pgClient is ok.
/assign @hwdef
There is a lot of duplicated logic for handling the legacy pg, and the code is not very readable. I think we can wrap this in a func that gets the real pg, so that less logic is exposed wherever the controller calls it.
Yeah, that makes sense. You mean wrapping a func that checks which pg (normal or legacy) exists in the k8s cluster and returns it. However, this introduces a race: after we get the pg name and before we operate on it, another routine may already have operated on it, for example the pg may already have been deleted. The get and the subsequent operation are not in a transaction.
That's a good point, but it seems the current code is also executed in order with no lock added, right?
And if we get a pg and then call the Delete method on it, a NotFound error will be returned if the pg has already been deleted, so we will be aware of it.
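A small sketch of the point made in this exchange: the get and the delete are not atomic, but a NotFound error from Delete reveals that another routine won the race. Everything here is hypothetical (`store`, `getThenDelete`, `errNotFound`, and the `interleave` hook that simulates the other routine); in the real controller the check would presumably be `apierrors.IsNotFound`, which appears in the suggestion further down.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound error.
var errNotFound = errors.New("podgroup not found")

// store is a hypothetical in-memory replacement for the API server.
type store struct{ groups map[string]bool }

func (s *store) Get(name string) error {
	if !s.groups[name] {
		return errNotFound
	}
	return nil
}

func (s *store) Delete(name string) error {
	if !s.groups[name] {
		return errNotFound
	}
	delete(s.groups, name)
	return nil
}

// getThenDelete performs the non-transactional get followed by delete.
// interleave, if non-nil, runs between the two calls to simulate another
// routine. The bool reports whether this caller removed the pod group.
func getThenDelete(s *store, name string, interleave func()) (bool, error) {
	if err := s.Get(name); err != nil {
		return false, err
	}
	if interleave != nil {
		interleave()
	}
	err := s.Delete(name)
	if errors.Is(err, errNotFound) {
		// Another routine deleted it first; the desired end state is reached.
		return false, nil
	}
	return err == nil, err
}

func main() {
	s := &store{groups: map[string]bool{"job1": true}}
	// A concurrent deletion happens between Get and Delete.
	removed, err := getThenDelete(s, "job1", func() { _ = s.Delete("job1") })
	fmt.Println(removed, err)
}
```

Whether a lost race should be treated as success or surfaced to the caller is a design choice; this sketch treats it as success because the pod group is gone either way.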
OK, I updated the code, please review again.
// getRelatedPodGroup returns the podgroup related to the vcjob.
// it will return normal pg if it exist in cluster,
// else it return legacy pg before version 1.5.
func (cc *jobcontroller) getRelatedPodGroup(job *batch.Job) (*scheduling.PodGroup, error) {
getRelatedPodGroup -> getPodGroupByJob
// it will return normal pg if it exist in cluster,
// else it return legacy pg before version 1.5.
func (cc *jobcontroller) getRelatedPodGroup(job *batch.Job) (*scheduling.PodGroup, error) {
	pgName := cc.generateRelatedPodGroupName(job)
pgName := cc.generateRelatedPodGroupName(job)
pg, err := cc.pgLister.PodGroups(job.Namespace).Get(pgName)
if err == nil {
	return pg, nil
}
if apierrors.IsNotFound(err) {
	pg, err = cc.pgLister.PodGroups(job.Namespace).Get(job.Name)
	if err != nil {
		return nil, err
	}
	return pg, nil
}
return nil, err
Changing it to this would be better :)
klog.Errorf("Failed to delete PodGroup of Job %v/%v: %v",
	job.Namespace, job.Name, err)
return err
pg, _ := cc.getRelatedPodGroup(job)
We should check the returned err first.
@@ -281,8 +306,7 @@ func (cc *jobcontroller) syncJob(jobInfo *apis.JobInfo, updateStatus state.Updat
	}

	var syncTask bool
	pgName := job.Name + "-" + string(job.UID)
	if pg, _ := cc.pgLister.PodGroups(job.Namespace).Get(pgName); pg != nil {
	if pg, _ := cc.getRelatedPodGroup(job); pg != nil {
The err here should also be checked.
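To illustrate why discarding the error with `_` is risky, here is a hedged sketch in which the caller separates "no pod group under either name" from "the lookup itself failed". The `lister` type, `errNotFound`, `errUnavailable`, and `podGroupExists` are hypothetical; only the lookup order (UID-suffixed name first, then `job.Name`) follows the suggestion above.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical errors standing in for a NotFound and a transient failure.
var (
	errNotFound    = errors.New("podgroup not found")
	errUnavailable = errors.New("lister unavailable")
)

// lister is a hypothetical in-memory stand-in for the PodGroup lister.
type lister struct {
	groups  map[string]bool
	failure error // when non-nil, every Get fails with this error
}

func (l *lister) Get(name string) (string, error) {
	if l.failure != nil {
		return "", l.failure
	}
	if !l.groups[name] {
		return "", errNotFound
	}
	return name, nil
}

// getPodGroupByJob looks up the UID-suffixed name first and falls back to
// the legacy name (the bare job name) only on NotFound.
func getPodGroupByJob(l *lister, jobName, jobUID string) (string, error) {
	pg, err := l.Get(jobName + "-" + jobUID)
	if errors.Is(err, errNotFound) {
		return l.Get(jobName)
	}
	return pg, err
}

// podGroupExists checks the error instead of discarding it: NotFound means
// "does not exist", and any other error is returned to the caller.
func podGroupExists(l *lister, jobName, jobUID string) (bool, error) {
	_, err := getPodGroupByJob(l, jobName, jobUID)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errNotFound):
		return false, nil
	default:
		return false, err
	}
}

func main() {
	l := &lister{groups: map[string]bool{"job1": true}}
	fmt.Println(podGroupExists(l, "job1", "abc"))
	l.failure = errUnavailable
	fmt.Println(podGroupExists(l, "job1", "abc"))
}
```

With `pg, _ := ...; pg != nil`, the second case would be indistinguishable from a missing pod group.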
b7f6e4a to 0b45c2e
line 205 in your pr of file |
However, I did not change line 205...
/ok-to-test |
Hi, please rebase your PR :)
@Monokaix I fixed the conflict. Can you add the lgtm label? And the |
Please squash your commits into one; the CI will restart automatically.
85cd9aa to 800038b
f7803ce to fb98665
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
fb98665 to 8b835ab
v1.5 changed the naming logics of pod group by adding UID into the name: #2140, syncJob function should change some logic. Signed-off-by: cheerfun <[email protected]>
The commits are split across different branches and are not contiguous, so I don't know how to squash them into one. What if I redo the change on a different branch and create a new pull request?
If you are not familiar with git,you can build a new branch such as |
You can do that :)
I created a new pull request, please have a look: #3786 @Monokaix
tracked by #3786 |
/close |
@Monokaix: Closed this PR. In response to this:
v1.5 changed the naming logic of pod groups by adding the UID to the name (#2140), and there is another fix for handling already-created pod groups without the UID on create or update (#2400). But a similar fix does not exist in the syncJob function.
Fixes #3640