Create resources (Services/Jobs) only once #418
/assign jlewi
pkg/trainer/training.go (outdated)
// now we always call Create.
if j.job.Status.Phase == tfv1alpha1.TFJobPhaseCreating || j.job.Status.Phase == tfv1alpha1.TFJobPhaseRunning {
// Use Status.Phase to determine whether we should create the resources or not.
if j.job.Status.Phase == tfv1alpha1.TFJobPhaseCreating {
What happens if a Job is running and one of its resources, such as one of the services or job controllers, is deleted?
This will be handled in this PR #344
It looks to me like #344 takes care of recreating pods if they go away but not services.
My expectation is that if a user accidentally deletes the service the job will either recover by recreating it (preferable) or fail the job.
You could follow the same pattern for services as you do for pods; i.e. do the service create in a function syncServices that gets called periodically.
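The suggestion above can be sketched as an idempotent sync function. This is a minimal illustration, not the operator's actual code: `fakeCluster`, `syncServices`, and `describe` are hypothetical names, and an in-memory map stands in for the Kubernetes API.

```go
package main

import "fmt"

// fakeCluster is a hypothetical in-memory stand-in for the Kubernetes API.
// It only tracks which service names currently exist.
type fakeCluster struct {
	services map[string]bool
	creates  int // number of create calls issued so far
}

// syncServices makes sure every desired service exists, creating only the
// ones that are missing. It returns the names it created and is safe to
// call on every reconcile pass.
func syncServices(c *fakeCluster, desired []string) []string {
	var created []string
	for _, name := range desired {
		if c.services[name] {
			continue // already present, nothing to do
		}
		c.services[name] = true
		c.creates++
		created = append(created, name)
	}
	return created
}

// describe renders the result of one sync pass as a short log line.
func describe(created []string) string {
	if len(created) == 0 {
		return "services in sync"
	}
	return fmt.Sprintf("created %d service(s): %v", len(created), created)
}
```

Because the function compares desired state with observed state on every call, the first call creates everything, later calls are no-ops, and a service deleted by accident is recreated on the next pass.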
35a55e8 to 4bb10fc
Will #344 (Create Pod instead of a job) make this PR unnecessary?
4bb10fc to df711fa
@jlewi I don't think so, they are different things :)
Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion.
pkg/trainer/training.go, line 366 at r1: Can we just delete this block of code and rely on SyncPods and SyncServices to create the resources?
pkg/trainer/training.go, line 366 at r1: Previously, jlewi (Jeremy Lewi) wrote… Do you mean delete all Create() logic?
pkg/trainer/training.go, line 366 at r1: Previously, ScorpioCPH (Penghao Cen) wrote… I was specifically referring to this if block. I think this might be the only place we call createResources, in which case that function, and possibly others, might be dead code that could be deleted. My expectation, though, is that syncServices and syncPods should be able to handle the case where no resources exist because this is the first time they are called, so we don't need special logic to handle the job-creating phase. It's possible I'm missing something, though.
a64062d to 41a7c46
/hold /lgtm This looks good except for the failing unit test. Please fix the test. You can then approve it yourself and it will merge automatically.
41a7c46 to f547d4b
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gaocegege, jlewi
/hold cancel
Hi, this PR makes a few small updates to the Reconcile() loop:
- Move the GetStatus() logic out of the Creation block.
- Call GetStatus() in each reconcile loop to make it more robust.
- Set TFJob.Status.Phase to TFJobPhaseRunning so that resources (services/jobs) are created only once.
@jlewi @gaocegege @wackxu PTAL.
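The loop shape this description points to might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in pkg/trainer/training.go: `Phase`, `job`, `createResources`, `getStatus`, and `reconcile` are hypothetical stand-ins, and counters replace real API calls so the behaviour is easy to check.

```go
package main

// Phase is a hypothetical stand-in for the TFJob phase type.
type Phase string

const (
	PhaseCreating Phase = "Creating"
	PhaseRunning  Phase = "Running"
)

// job is a hypothetical stand-in for the training job wrapper.
// The counters record how often each step ran.
type job struct {
	phase        Phase
	createCalls  int
	statusChecks int
}

// createResources stands in for creating the services and jobs.
func (j *job) createResources() { j.createCalls++ }

// getStatus stands in for refreshing the job status from its replicas.
func (j *job) getStatus() { j.statusChecks++ }

// reconcile runs one pass of the loop. Resources are created only while the
// job is in the creating phase. The phase then moves to running, so later
// passes skip creation. Status is refreshed on every pass, outside the
// creation block.
func (j *job) reconcile() {
	if j.phase == PhaseCreating {
		j.createResources()
		j.phase = PhaseRunning
	}
	j.getStatus()
}
```

Calling `reconcile` three times on a job that starts in `PhaseCreating` issues one create and three status refreshes. Note that this gate alone does not recreate a resource deleted while the job is running, which is the concern raised earlier in the review.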