Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create resources (Services/Jobs) only once #418

Merged

Conversation

ScorpioCPH
Copy link
Member

@ScorpioCPH ScorpioCPH commented Feb 28, 2018

Hi, this PR update a little in Reconcile() loop:

  • Move the GetStatus() logic out of Creation block.
    • Call GetStatus() in each reconcile loop to make it more robust.
  • set TFJob.Status.Phase to TFJobPhaseRunning to keep creating resources (services/jobs) only once.

@jlewi @gaocegege @wackxu PTAL.


This change is Reviewable

@coveralls
Copy link

coveralls commented Feb 28, 2018

Coverage Status

Coverage decreased (-0.07%) to 28.415% when pulling f547d4b on ScorpioCPH:create-resources-only-once into 83c1ab2 on kubeflow:master.

@jlewi
Copy link
Contributor

jlewi commented Feb 28, 2018

/assign jlewi

// now we always call Create.
if j.job.Status.Phase == tfv1alpha1.TFJobPhaseCreating || j.job.Status.Phase == tfv1alpha1.TFJobPhaseRunning {
// Use Status.Phase to determine whether we should create the resources or not.
if j.job.Status.Phase == tfv1alpha1.TFJobPhaseCreating {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a Job is running and one the resources like one of the services or job controllers is deleted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be handled in this PR #344

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like #344 takes care of recreating pods if they go away but not services.
My expectation is that if a user accidentally deletes the service the job will either recover by recreating it (preferable) or fail the job.

You could follow the same pattern for services as you do for pods; i.e. do the service create in a function syncServices that gets called periodically.

Copy link
Member Author

@ScorpioCPH ScorpioCPH Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlewi Hi, syncServices() has been added in #344.

@ScorpioCPH ScorpioCPH force-pushed the create-resources-only-once branch from 35a55e8 to 4bb10fc Compare March 1, 2018 08:48
@jlewi
Copy link
Contributor

jlewi commented Mar 3, 2018

Will #344 (Create Pod instead of a job) make this PR unnecessary?

@ScorpioCPH ScorpioCPH force-pushed the create-resources-only-once branch from 4bb10fc to df711fa Compare March 6, 2018 02:00
@ScorpioCPH
Copy link
Member Author

@jlewi I don't think so, they are different things :)

@jlewi
Copy link
Contributor

jlewi commented Mar 6, 2018

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion.


pkg/trainer/training.go, line 366 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

@jlewi Hi, syncServices() has been added in #344.

Can we just delete this block of code and rely on SyncPods, SyncServices to create the resources?


Comments from Reviewable

@ScorpioCPH
Copy link
Member Author

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion.


pkg/trainer/training.go, line 366 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Can we just delete this block of code and rely on SyncPods, SyncServices to create the resources?

Do you mean delete all Create() logic?


Comments from Reviewable

@jlewi
Copy link
Contributor

jlewi commented Mar 6, 2018

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion.


pkg/trainer/training.go, line 366 at r1 (raw file):

Previously, ScorpioCPH (Penghao Cen) wrote…

Do you mean delete all Create() logic?

I was specifically referring to this if block
if j.job.Status.Phase == TFJobPhaseCreating {
...
}

I think this might be the only place we call createResources in which case that function and possibly others might be dead code that could be deleted.

My expectation though is that syncServices and syncPods should be able to handle the case where no resources exist because this is the first time they are called. So we don't need to have special logic to handle the job creating phase.

Its possible I'm missing something though.


Comments from Reviewable

@ScorpioCPH ScorpioCPH force-pushed the create-resources-only-once branch 2 times, most recently from a64062d to 41a7c46 Compare March 8, 2018 12:55
@jlewi
Copy link
Contributor

jlewi commented Mar 8, 2018

/hold
Because travis is broken due to a failing unittest

/lgtm

This looks good except for the failing unittest. Please fix the test. You can then approve it yourself and it will merge automatically.

@ScorpioCPH ScorpioCPH force-pushed the create-resources-only-once branch from 41a7c46 to f547d4b Compare March 9, 2018 01:41
@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 9, 2018
@gaocegege
Copy link
Member

/lgtm

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot merged commit 423b0d0 into kubeflow:master Mar 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants