Fix resuming Jobset after restoring PodTemplate (by deleting Jobs) #625
Wouldn't it be helpful to validate that the Jobs don't have `startTime`? If any Job has `startTime`, this operation should fail due to batch/Job validation.
That said, I'm not sure we should add validations that depend so deeply on the batch/Job specifics.
@danielvegamyhre @mimowo WDYT?
I think the flow is that we first suspend the parent JobSet, which then suspends the Jobs. So, if we added this condition, the `startTime` on running child Jobs would prevent the update. However, this is complex, and I think I should have added an e2e test in JobSet to verify that suspending works.
Yeah, I think integration and e2e tests solidifying the requirements JobSet needs to work with Kueue would be a great idea.
We obviously don't want to bring in Kueue as a dependency, so just verifying that the updates/patches work in integration/e2e tests would be welcome.
Hm, that makes sense.
But I suspect users could fall down a rabbit hole if they accidentally modify the scheduling directives of a ReplicatedJob whose Jobs already have `startTime`, because the jobset-controller only outputs errors to the controller logs; this is not direct user feedback.
Maybe we can surface these implied specifications to JobSet users once we introduce kubernetes/kubernetes#113221.
So, I'm OK with the current implementation for now.
Actually, the e2e test would not work (I already added it in this PR), because there is a deeper issue: JobSet does not delete the Jobs once suspended (#535). The fact that we keep the old Jobs means we never create new Jobs on resume if the PodTemplate was updated in the JobSet.
I'm trying to implement it by deleting the Jobs on suspend, and it looks promising. There are some integration tests for startupPolicy that I still need to adjust and look into.
Let me know if you have more context on traps I may encounter with this approach (I will probably continue next week), but essentially this is the only approach I see, and it would mimic the Job behavior, where a suspended Job deletes its pods.
Do you think we should include terminating?
Indeed, currently we only count as "active" the Jobs which have at least one "active" pod. This means it is possible that some Jobs which have only terminating pods still exist.
One fix would be to count a Job as active if it has at least one active or terminating pod. Another is to introduce a new counter for "terminating" Jobs, i.e. Jobs whose pods are all terminating. WDYT?
I guess this is a corner case for now, so maybe it could be done in a follow-up?
I think it is not too bad in this PR, because a terminating Job (all pods terminating) is going to be deleted eventually. Yes, if we resume the JobSet while it exists, then it can block creating a replacement Job for a while, but IIUC the JobSet controller will create a replacement for such a Job as soon as it is fully gone, and the replacement Job will have a good PodTemplate.
In order to prevent resuming a JobSet while there is a terminating Job we could:
1. count a Job as active here if `job.Status.Active > 0 || ptr.Deref(job.Status.Terminating, 0) > 0`, or
2. introduce a new counter for terminating Jobs, where terminating would be a fallback counter (so `job.Status.Active == 0`) if `ptr.Deref(job.Status.Terminating, 0) > 0`.
I'm leaning towards option (2.). I could implement it either in this PR or a follow up.
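The two options above can be sketched as follows, using a simplified stand-in for `batchv1.JobStatus` and a local `derefInt32` in place of `ptr.Deref` from `k8s.io/utils/ptr` (the function names `isActiveOption1`/`isTerminatingOption2` are hypothetical, for illustration only):

```go
package main

import "fmt"

// JobStatus is a simplified stand-in for batchv1.JobStatus, keeping only
// the fields discussed here.
type JobStatus struct {
	Active      int32  // pods that are actively running
	Terminating *int32 // pods that are terminating; beta field, may be nil
}

// derefInt32 mirrors ptr.Deref from k8s.io/utils/ptr: a nil-safe dereference
// that falls back to a default when the pointer is nil.
func derefInt32(p *int32, def int32) int32 {
	if p == nil {
		return def
	}
	return *p
}

// Option (1.): treat a Job as active if it has any active or terminating pod.
func isActiveOption1(s JobStatus) bool {
	return s.Active > 0 || derefInt32(s.Terminating, 0) > 0
}

// Option (2.): keep "active" as-is, and count a Job as "terminating" only as
// a fallback, i.e. when it has no active pods but still has terminating ones.
func isTerminatingOption2(s JobStatus) bool {
	return s.Active == 0 && derefInt32(s.Terminating, 0) > 0
}

func main() {
	t := int32(2)
	s := JobStatus{Active: 0, Terminating: &t}
	fmt.Println(isActiveOption1(s), isTerminatingOption2(s)) // true true
}
```

With option (2.) a Job is never double-counted: it is either active or terminating, never both.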
I think a follow-up is fine. Terminating on Jobs is currently a beta feature, so the logic may be somewhat complex.
I think we don't need any code in JobSet depending on the feature gate in k8s. If the feature gate is disabled in k8s, the field is not set, so the code in JobSet would see `nil`. We could prevent panics with `ptr.Deref(job.Status.Terminating, 0)`, but this should be enough, IIUC.
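A small sketch of the nil-safety point, with a local `derefInt32` standing in for `ptr.Deref` from `k8s.io/utils/ptr`:

```go
package main

import "fmt"

// derefInt32 mirrors ptr.Deref from k8s.io/utils/ptr: it returns the
// pointed-to value, or the default when the pointer is nil.
func derefInt32(p *int32, def int32) int32 {
	if p == nil {
		return def
	}
	return *p
}

func main() {
	// When the feature gate is disabled in k8s, the Terminating field is
	// simply nil; a nil-safe read avoids any panic and behaves as if the
	// count were zero.
	var terminating *int32 // nil, as if the feature gate were off
	fmt.Println(derefInt32(terminating, 0) > 0) // false
}
```

This way the JobSet code needs no explicit feature-gate check of its own: a nil field and a zero count take the same path.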