
fix: Can not sync job status correctly when upgrading from v1.5 #3652

Closed
wants to merge 11 commits

Conversation

QingyaFan
Contributor

v1.5 changed the naming logic of pod groups by adding the UID into the name (#2140), and there is also another fix for handling already-created pod groups without the UID in create or update (#2400). But a similar fix does not exist in the syncJob function.

Fixes #3640
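To make the naming difference concrete, here is a small self-contained sketch; the Job type and both helper names are stand-ins for illustration, not volcano's real batch.Job API:

```go
package main

import "fmt"

// Job is a minimal stand-in for volcano's batch.Job; only the fields
// relevant to pod group naming are modeled.
type Job struct {
	Name string
	UID  string
}

// podGroupName follows the post-v1.5 scheme, where the job UID is
// appended to the job name (job.Name + "-" + string(job.UID)).
func podGroupName(j Job) string {
	return j.Name + "-" + j.UID
}

// legacyPodGroupName is the pre-v1.5 scheme: just the job name. Pod
// groups created before an upgrade still carry this name.
func legacyPodGroupName(j Job) string {
	return j.Name
}

func main() {
	j := Job{Name: "mnist-train", UID: "3c1f"}
	fmt.Println(podGroupName(j))       // mnist-train-3c1f
	fmt.Println(legacyPodGroupName(j)) // mnist-train
}
```

A controller that only looks up the new name will therefore miss pod groups created by v1.5, which is the gap this PR addresses in syncJob.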

@volcano-sh-bot
Contributor

Welcome @QingyaFan!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot volcano-sh-bot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Aug 5, 2024
@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 5, 2024
@Monokaix
Member

Monokaix commented Aug 5, 2024

Welcome! Please sign off your commit with git commit -s.

-if err := cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace).Delete(context.TODO(), pgName, metav1.DeleteOptions{}); err != nil {
+pgName := jobhelpers.GetRelatedPodGroupName(job)
+pgIface := cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace)
+if err := pgIface.Delete(context.TODO(), pgName, metav1.DeleteOptions{}); err != nil {
Member

Why is it named Iface here?

Contributor Author

Because in the scenario of upgrading Volcano from v1.5 to a later version, if we cannot delete the pod group with the new name, we should try to delete the pod group with the legacy name (i.e. job.Name). Since we use the interface twice, I introduced a new variable to avoid typos when writing cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace) by hand.

Member

I mean pgIface itself is a bit ambiguous :)

Contributor Author

I noticed cc.vcClient.SchedulingV1beta1().PodGroups(job.Namespace) returns a v1beta1.PodGroupInterface, so I set the variable name to pgIface (for v1beta1.PodGroupInterface). Can you give me a suggestion? 🙇🏼

Member

I think pgClient is ok.

@QingyaFan
Contributor Author

/assign @hwdef

@Monokaix
Member

Monokaix commented Aug 6, 2024

There is a lot of duplicated logic when dealing with the legacy pg, and the code is not very readable. I think we can wrap a func to get the real pg and not expose so much logic where the controller calls it.

@QingyaFan
Contributor Author

> There is a lot of duplicated logic when dealing with the legacy pg, and the code is not very readable. I think we can wrap a func to get the real pg and not expose so much logic where the controller calls it.

Yeah, that makes sense. What you mean is to wrap a func that checks which pg (normal or legacy) exists in the cluster and returns it. However, this causes a problem: under a race condition, between getting the pg name and operating on it, another routine may have already operated on it; for example, the pg may have already been deleted. The get and the subsequent operation are not in a transaction.

@Monokaix
Member

Monokaix commented Aug 6, 2024

> However, this causes a problem: under a race condition, between getting the pg name and operating on it, another routine may have already operated on it; for example, the pg may have already been deleted. The get and the subsequent operation are not in a transaction.

That's a good point, but it seems the current code is also executed in order and no lock is added, right?

@Monokaix
Member

Monokaix commented Aug 6, 2024

> However, this causes a problem: under a race condition, between getting the pg name and operating on it, another routine may have already operated on it; for example, the pg may have already been deleted. The get and the subsequent operation are not in a transaction.

And if we get a pg and then call the Delete method to delete it, if the pg is already deleted then a NotFoundErr will be returned and we can be aware of it.
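A self-contained sketch of that delete-with-fallback behavior, using an in-memory map in place of the PodGroups client and a sentinel error in place of client-go's apierrors.IsNotFound; all names here are illustrative, not the real controller code:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the NotFound error that client-go's
// apierrors.IsNotFound detects.
var errNotFound = errors.New("podgroup not found")

// pgStore is an in-memory stand-in for the PodGroups client.
type pgStore map[string]struct{}

func (s pgStore) Delete(name string) error {
	if _, ok := s[name]; !ok {
		return errNotFound
	}
	delete(s, name)
	return nil
}

// deletePodGroup tries the post-v1.5 (UID-suffixed) name first and falls
// back to the legacy (pre-v1.5) name. NotFound on both names means there
// is nothing left to clean up, so it is not treated as a failure.
func deletePodGroup(s pgStore, pgName, legacyName string) error {
	err := s.Delete(pgName)
	if errors.Is(err, errNotFound) {
		err = s.Delete(legacyName)
		if errors.Is(err, errNotFound) {
			return nil
		}
	}
	return err
}

func main() {
	// Cluster upgraded from v1.5: only the legacy-named pod group exists.
	s := pgStore{"job-a": {}}
	fmt.Println(deletePodGroup(s, "job-a-uid1", "job-a")) // <nil>
}
```

The key point from the discussion survives here: even without a lock, a concurrent deletion just surfaces as NotFound, which the caller can recognize and ignore.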

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 6, 2024
@QingyaFan
Contributor Author

> And if we get a pg and then call the Delete method to delete it, if the pg is already deleted then a NotFoundErr will be returned and we can be aware of it.

OK, I updated the code, please review again.

// getRelatedPodGroup returns the podgroup related to the vcjob.
// It returns the normal pg if it exists in the cluster,
// otherwise it returns the legacy pg from before version 1.5.
func (cc *jobcontroller) getRelatedPodGroup(job *batch.Job) (*scheduling.PodGroup, error) {
Member

getRelatedPodGroup -> getPodGroupByJob

// It returns the normal pg if it exists in the cluster,
// otherwise it returns the legacy pg from before version 1.5.
func (cc *jobcontroller) getRelatedPodGroup(job *batch.Job) (*scheduling.PodGroup, error) {
pgName := cc.generateRelatedPodGroupName(job)
Member

	pgName := cc.generateRelatedPodGroupName(job)
	pg, err := cc.pgLister.PodGroups(job.Namespace).Get(pgName)
	if err == nil {
		return pg, nil
	}
	if apierrors.IsNotFound(err) {
		pg, err = cc.pgLister.PodGroups(job.Namespace).Get(job.Name)
		if err != nil {
			return nil, err
		}
		return pg, nil
	}
	return nil, err
Changing to this is better :)

klog.Errorf("Failed to delete PodGroup of Job %v/%v: %v",
job.Namespace, job.Name, err)
return err
pg, _ := cc.getRelatedPodGroup(job)
Member

We should check the returned err first.

@@ -281,8 +306,7 @@ func (cc *jobcontroller) syncJob(jobInfo *apis.JobInfo, updateStatus state.Updat
 }

 var syncTask bool
-pgName := job.Name + "-" + string(job.UID)
-if pg, _ := cc.pgLister.PodGroups(job.Namespace).Get(pgName); pg != nil {
+if pg, _ := cc.getRelatedPodGroup(job); pg != nil {
Member

The err here should also be checked.

@QingyaFan QingyaFan force-pushed the master branch 3 times, most recently from b7f6e4a to 0b45c2e Compare August 8, 2024 08:48
@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 8, 2024
@Monokaix
Member

Monokaix commented Sep 2, 2024

Line 205 of job_controller_actions_test.go in your PR doesn't check the expected err against the actual err; we should modify it :)

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 1, 2024
@QingyaFan
Contributor Author

> Line 205 of job_controller_actions_test.go in your PR doesn't check the expected err against the actual err; we should modify it :)

However, I did not change line 205...

@Monokaix
Member

Monokaix commented Oct 8, 2024

/ok-to-test

@Monokaix
Member

Monokaix commented Oct 8, 2024

Hi, please rebase your PR :)

@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 8, 2024
@QingyaFan
Contributor Author

@Monokaix I fixed the conflict. Can you add the lgtm label? Also, the Vcctl Test / E2E about Volcano CLI (pull_request) test failed; can you retrigger it to see if the failure is occasional (the PR did not change the failing test file)? Thanks!

@hwdef
Member

hwdef commented Oct 9, 2024

> OK, I updated the code, please review again.

Please squash your commits into one; the CI will restart automatically.

@volcano-sh-bot volcano-sh-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 9, 2024
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hwdef
You can assign the PR to them by writing /assign @hwdef in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 9, 2024
v1.5 changed the naming logic of pod groups by adding the UID into the
name: #2140; the syncJob function should change some logic accordingly.

Signed-off-by: cheerfun <[email protected]>
@QingyaFan
Contributor Author

> Please squash your commits into one; the CI will restart automatically.

The commits are split across different branches and are not contiguous; I don't know how to squash them into one. What about redoing the change in a different branch and creating a new pull request?

@Monokaix
Member

> The commits are split across different branches and are not contiguous; I don't know how to squash them into one. What about redoing the change in a different branch and creating a new pull request?

If you are not familiar with git, you can first create a new local branch such as tmp based on the current branch, reset your current branch to the latest master, and then cherry-pick your own commits from the tmp branch.

@Monokaix
Member

> What about redoing the change in a different branch and creating a new pull request?

You can do that :)

@QingyaFan
Contributor Author

> You can do that :)

I created a new pull request, please have a look: #3786 @Monokaix

@Monokaix
Member

tracked by #3786

@Monokaix
Member

/close

@volcano-sh-bot
Contributor

@Monokaix: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Can not sync job status correctly when upgrading from v1.5
5 participants