
fix: Can not sync job status correctly when upgrading from v1.5 #3652

Closed
wants to merge 11 commits

Conversation

QingyaFan
Contributor

v1.5 changed the naming logic of pod groups by adding the UID into the name (#2140), and there is also another fix for handling already-created pod groups without the UID in create or update (#2400). But a similar fix does not exist in the syncJob function.

Fixes #3640
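To make the naming difference concrete, here is a small self-contained sketch; the Job type and both helper names are stand-ins for illustration, not volcano's real batch.Job API:

```go
package main

import "fmt"

// Job is a minimal stand-in for volcano's batch.Job; only the fields
// relevant to pod group naming are modeled.
type Job struct {
	Name string
	UID  string
}

// podGroupName follows the post-v1.5 scheme, where the job UID is
// appended to the job name (job.Name + "-" + string(job.UID)).
func podGroupName(j Job) string {
	return j.Name + "-" + j.UID
}

// legacyPodGroupName is the pre-v1.5 scheme: just the job name. Pod
// groups created before an upgrade still carry this name.
func legacyPodGroupName(j Job) string {
	return j.Name
}

func main() {
	j := Job{Name: "mnist-train", UID: "3c1f"}
	fmt.Println(podGroupName(j))       // mnist-train-3c1f
	fmt.Println(legacyPodGroupName(j)) // mnist-train
}
```

A controller that only looks up the new name will therefore miss pod groups created by v1.5, which is the gap this PR addresses in syncJob.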

@volcano-sh-bot
Contributor

Welcome @QingyaFan!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot volcano-sh-bot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Aug 5, 2024
@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 5, 2024
@Monokaix
Member

Monokaix commented Aug 5, 2024

Welcome! Please sign off your commit with git commit -s.

-if err := cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace).Delete(context.TODO(), pgName, metav1.DeleteOptions{}); err != nil {
+pgName := jobhelpers.GetRelatedPodGroupName(job)
+pgIface := cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace)
+if err := pgIface.Delete(context.TODO(), pgName, metav1.DeleteOptions{}); err != nil {
Member

Why is it named Iface here?

Contributor Author

Because in the scenario of upgrading Volcano from v1.5 to a later version, if we cannot delete the pod group with the new name, we should try to delete the pod group with the legacy name (i.e. job.Name). Since we use the interface twice, I introduced a new variable to avoid typos when writing cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace) by hand.

Member

I mean pgIface itself is a bit ambiguous :)

Contributor Author

I noticed cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace) returns a v1beta1.PodGroupInterface, so I set the variable name to pgIface (for v1beta1.PodGroupInterface). Can you give me a suggestion? 🙇🏼

Member

I think pgClient is ok.

@QingyaFan
Contributor Author

/assign @hwdef

@Monokaix
Member

Monokaix commented Aug 6, 2024

There is a lot of duplicated logic when dealing with the legacy pg, and the code is not very readable. I think we can wrap a func to get the real pg and not expose so much logic where the controller calls it.

@QingyaFan
Contributor Author

> There is a lot of duplicated logic when dealing with the legacy pg, and the code is not very readable. I think we can wrap a func to get the real pg and not expose so much logic where the controller calls it.

Yeah, that makes sense. What you mean is to wrap a func that checks which pg (normal or legacy) exists in the cluster and returns it. However, this causes a problem: under a race condition, between getting the pg name and operating on it, another routine may have already operated on it; for example, the pg may have already been deleted. The get and the subsequent operation are not in a transaction.

@Monokaix
Member

Monokaix commented Aug 6, 2024

> However, this causes a problem: under a race condition, between getting the pg name and operating on it, another routine may have already operated on it; for example, the pg may have already been deleted. The get and the subsequent operation are not in a transaction.

That's a good point, but it seems the current code is also executed in order and no lock is added, right?

@Monokaix
Member

Monokaix commented Aug 6, 2024

> However, this causes a problem: under a race condition, between getting the pg name and operating on it, another routine may have already operated on it; for example, the pg may have already been deleted. The get and the subsequent operation are not in a transaction.

And if we get a pg and then call the Delete method to delete it, if the pg is already deleted then a NotFoundErr will be returned and we can be aware of it.
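A self-contained sketch of that delete-with-fallback behavior, using an in-memory map in place of the PodGroups client and a sentinel error in place of client-go's apierrors.IsNotFound; all names here are illustrative, not the real controller code:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the NotFound error that client-go's
// apierrors.IsNotFound detects.
var errNotFound = errors.New("podgroup not found")

// pgStore is an in-memory stand-in for the PodGroups client.
type pgStore map[string]struct{}

func (s pgStore) Delete(name string) error {
	if _, ok := s[name]; !ok {
		return errNotFound
	}
	delete(s, name)
	return nil
}

// deletePodGroup tries the post-v1.5 (UID-suffixed) name first and falls
// back to the legacy (pre-v1.5) name. NotFound on both names means there
// is nothing left to clean up, so it is not treated as a failure.
func deletePodGroup(s pgStore, pgName, legacyName string) error {
	err := s.Delete(pgName)
	if errors.Is(err, errNotFound) {
		err = s.Delete(legacyName)
		if errors.Is(err, errNotFound) {
			return nil
		}
	}
	return err
}

func main() {
	// Cluster upgraded from v1.5: only the legacy-named pod group exists.
	s := pgStore{"job-a": {}}
	fmt.Println(deletePodGroup(s, "job-a-uid1", "job-a")) // <nil>
}
```

The key point from the discussion survives here: even without a lock, a concurrent deletion just surfaces as NotFound, which the caller can recognize and ignore.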

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 6, 2024
@QingyaFan
Contributor Author

> And if we get a pg and then call the Delete method to delete it, if the pg is already deleted then a NotFoundErr will be returned and we can be aware of it.

OK, I updated the code, please review again.

// getRelatedPodGroup returns the podgroup related to the vcjob.
// It returns the normal pg if it exists in the cluster,
// otherwise it returns the legacy pg from before version 1.5.
func (cc *jobcontroller) getRelatedPodGroup(job *batch.Job) (*scheduling.PodGroup, error) {
Member

getRelatedPodGroup -> getPodGroupByJob

// It returns the normal pg if it exists in the cluster,
// otherwise it returns the legacy pg from before version 1.5.
func (cc *jobcontroller) getRelatedPodGroup(job *batch.Job) (*scheduling.PodGroup, error) {
pgName := cc.generateRelatedPodGroupName(job)
Member

	pgName := cc.generateRelatedPodGroupName(job)
	pg, err := cc.pgLister.PodGroups(job.Namespace).Get(pgName)
	if err == nil {
		return pg, nil
	}
	if apierrors.IsNotFound(err) {
		pg, err = cc.pgLister.PodGroups(job.Namespace).Get(job.Name)
		if err != nil {
			return nil, err
		}
		return pg, nil
	}
	return nil, err
Changing to this is better :)

klog.Errorf("Failed to delete PodGroup of Job %v/%v: %v",
job.Namespace, job.Name, err)
return err
pg, _ := cc.getRelatedPodGroup(job)
Member

We should check the returned err first.

@@ -281,8 +306,7 @@ func (cc *jobcontroller) syncJob(jobInfo *apis.JobInfo, updateStatus state.Updat
 }

 var syncTask bool
-pgName := job.Name + "-" + string(job.UID)
-if pg, _ := cc.pgLister.PodGroups(job.Namespace).Get(pgName); pg != nil {
+if pg, _ := cc.getRelatedPodGroup(job); pg != nil {
Member

The err here should also be checked.

@QingyaFan QingyaFan force-pushed the master branch 3 times, most recently from b7f6e4a to 0b45c2e Compare August 8, 2024 08:48
@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 8, 2024
@Monokaix
Member

Monokaix commented Sep 2, 2024

Line 205 of job_controller_actions_test.go in your PR doesn't check the expected err against the actual err; we should modify it :)

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 1, 2024
@QingyaFan
Contributor Author

> Line 205 of job_controller_actions_test.go in your PR doesn't check the expected err against the actual err; we should modify it :)

However, I did not change line 205...

@Monokaix
Member

Monokaix commented Oct 8, 2024

/ok-to-test

@Monokaix
Member

Monokaix commented Oct 8, 2024

Hi, please rebase your PR :)

@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 8, 2024
@QingyaFan
Contributor Author

@Monokaix I fixed the conflict. Can you add the lgtm label? Also, the Vcctl Test / E2E about Volcano CLI (pull_request) test failed; can you retrigger it to see if the failure is occasional (the PR did not change the failing test file)? Thanks!

@hwdef
Member

hwdef commented Oct 9, 2024

> OK, I updated the code, please review again.

Please squash your commits into one; the CI will restart automatically.

@volcano-sh-bot volcano-sh-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 9, 2024
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hwdef
You can assign the PR to them by writing /assign @hwdef in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 9, 2024
v1.5 changed the naming logic of pod groups by adding the UID into the
name: #2140; the syncJob function should change some logic accordingly.

Signed-off-by: cheerfun <[email protected]>
@QingyaFan
Contributor Author

> Please squash your commits into one; the CI will restart automatically.

The commits are split across different branches and are not contiguous; I don't know how to squash them into one. What about redoing the change in a different branch and creating a new pull request?

@Monokaix
Member

> The commits are split across different branches and are not contiguous; I don't know how to squash them into one. What about redoing the change in a different branch and creating a new pull request?

If you are not familiar with git, you can first create a new local branch such as tmp based on the current branch, reset your current branch to the latest master, and then cherry-pick your own commits from the tmp branch.

@Monokaix
Member

> What about redoing the change in a different branch and creating a new pull request?

You can do that :)

@QingyaFan
Contributor Author

> You can do that :)

I created a new pull request, please have a look: #3786 @Monokaix

@Monokaix
Member

tracked by #3786

@Monokaix
Member

/close

@volcano-sh-bot
Contributor

@Monokaix: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Can not sync job status correctly when upgrading from v1.5
5 participants