
Consider how we manage replicas (stateful sets, managing pods directly) #45

Closed
jlewi opened this issue Sep 29, 2017 · 15 comments

Comments

@jlewi
Contributor

jlewi commented Sep 29, 2017

In the current implementation, if a TfProcess (e.g. PS, MASTER, WORKER) has N replicas, we end up creating N Job controllers. This is largely a holdover from the initial implementation, which predated StatefulSets. Now that StatefulSets are more mature, we should consider switching to them; this should simplify the logic in the CRD.

The main challenge to using StatefulSets is figuring out how to set the TF_CONFIG environment variable, which depends on the replica's index within the stateful set.

Here's a snippet showing the struct that's stored in TF_CONFIG.

TfConfig{
	// Cluster spec shared by every replica in the job.
	Cluster: s.Job.ClusterSpec(),
	Task: map[string]interface{}{
		// Replica type, e.g. "ps", "master", or "worker".
		"type": strings.ToLower(string(s.Spec.TfReplicaType)),
		// Index of this replica within its replica type.
		"index": index,
	},
}

Currently we construct a unique TF_CONFIG value for each job controller. With stateful sets we'd need a new mechanism to configure it for each replica.
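
For illustration, here is a minimal sketch of that wiring, assuming hypothetical names (setTfConfig, a sketch package) rather than the actual controller code:

package sketch

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// setTfConfig serializes the per-replica TfConfig value and injects it as the
// TF_CONFIG environment variable on every container in the pod template.
func setTfConfig(podTemplate *corev1.PodTemplateSpec, tfConfig interface{}) error {
	raw, err := json.Marshal(tfConfig)
	if err != nil {
		return err
	}
	for i := range podTemplate.Spec.Containers {
		c := &podTemplate.Spec.Containers[i]
		c.Env = append(c.Env, corev1.EnvVar{Name: "TF_CONFIG", Value: string(raw)})
	}
	return nil
}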

It doesn't look like we can use a PostStart lifecycle hook, since there's no guarantee it runs before the container's ENTRYPOINT.

@jlewi
Contributor Author

jlewi commented Sep 29, 2017

@vishh FYI.

@jlewi
Contributor Author

jlewi commented Oct 2, 2017

Another issue is figuring out what to do about logs.

I'd like logs to be available via kubectl logs after a job finishes but before it is deleted. But a stateful set will just restart the container, so I'm not sure how we'd preserve the logs.

@foxish

foxish commented Oct 16, 2017

The main challenge to using StatefulSets is figuring out how to set the TF_CONFIG environment variable, which depends on the replica's index within the stateful set.

That should be possible through the downward API and some sort of init script that configures the env var differently on each replica. The logs issue is more difficult, because StatefulSets were not designed to support completion semantics.
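
As a rough sketch of the downward API part of that idea (the POD_NAME env var and the ordinal-parsing step are assumptions, not an agreed design):

package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// podNameEnvVar exposes the pod's own name to the container via the downward
// API. A StatefulSet pod is named <statefulset-name>-<ordinal>, so an init
// script or entrypoint wrapper could parse the trailing ordinal and write
// TF_CONFIG before exec'ing the training process.
func podNameEnvVar() corev1.EnvVar {
	return corev1.EnvVar{
		Name: "POD_NAME",
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.name"},
		},
	}
}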

Another alternative worth investigating is just scheduling individual pods ourselves, which gives complete control over all of those aspects.

@jlewi jlewi changed the title Consider using statefulsets Consider how we manage replicas (stateful sets, managing pods directly) Dec 7, 2017
@jlewi
Contributor Author

jlewi commented Dec 7, 2017

I think we should also consider creating the pods directly ourselves so that we can get the semantics we want in terms of:

  • Restart behavior based on exit code
  • Setting of environment variables per replica

I think GoogleCloudPlatform/kube-metacontroller could make it really simple to define a controller that just manages a single pod with the exact semantics we want.

Using a single controller per replica means we can set environment variables specific to each replica (e.g. its index). I think this is very convenient for end users, because they can then set replica-specific flags based on those environment variables, e.g.

--index=${replica_index}
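
A minimal sketch of what a per-replica controller could generate; the image, env var name, and flag are placeholders, and the kubelet expands $(REPLICA_INDEX) in args from the container's own env:

package sketch

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// replicaContainer builds a container whose args reference a replica-specific
// environment variable, so the training process sees its concrete index.
func replicaContainer(index int) corev1.Container {
	return corev1.Container{
		Name:  "tensorflow",
		Image: "example.com/train:latest", // placeholder image
		Env: []corev1.EnvVar{
			{Name: "REPLICA_INDEX", Value: strconv.Itoa(index)},
		},
		Args: []string{"--index=$(REPLICA_INDEX)"},
	}
}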

@foxish

foxish commented Dec 7, 2017

We should create the pods ourselves, I agree. But having a controller per pod is not a common pattern; I think it would add quite a lot of complexity. One place the metacontroller fits in is replacing the existing Go-based controller code. But IIRC, there are certain things we may not want to do (yet) with the metacontroller, like scaling (watch optimizations and informers) and authorization of individual controllers. For those things, it might be better to continue using the Go-based custom controller.

@DjangoPeng
Member

The same discussion is happening internally at caicloud. Ref: https://github.com/caicloud/kubeflow-controller/issues/71#issuecomment-355056365

PTAL @jlewi @foxish

@jlewi
Contributor Author

jlewi commented Jan 4, 2018

Here's some background.

  • The original motivation for using a Job or ReplicaSet controller (as opposed to managing pods directly) was to let K8s handle failures that cause a pod to be terminated.

    • For example, if a VM becomes unhealthy and its pod is terminated, we want the pod restarted; the assumption is that the training code is robust to these kinds of failures, i.e. it uses Supervisor or a similar mechanism to recover automatically.
  • I originally picked the Job controller over ReplicaSet because the semantics we wanted seemed to be "run until completion", which maps to the Job controller pretty well.

  • The Job controller's semantics turn out not to be quite what we want.

    • We eventually realized we needed a mechanism to distinguish between retryable and permanent errors.
    • So we settled on classifying certain exit codes as retryable and others as permanent (see the sketch after this list).
    • Once we did that, we could no longer rely on the built-in controller semantics; we have to look at container terminations and decide what to do based on the exit code and the type of process (worker, ps, master).
    • So at this point none of the built-in controllers really give us the semantics we want.
  • Auto-restarting containers can lead to confusing behavior for users.

    • Suppose a container exits with a code indicating a permanent error.
    • The Job controller would automatically restart it.
    • The TfJob controller would detect that the process exited with a permanent error code and terminate it.
    • So if you look at the container's logs, you see the container exiting due to some error, restarting, and then being killed by SIGTERM.
      • Telling users to look at the previous container termination rather than the most recent one is very confusing.
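
A minimal sketch of the exit-code classification mentioned above; the specific code ranges are placeholders, not an agreed convention:

package sketch

// isRetryable classifies a container exit code. The ranges below are
// illustrative only; the real convention would be defined by the TfJob
// controller.
func isRetryable(exitCode int32) bool {
	switch {
	case exitCode == 0:
		return false // clean exit, nothing to retry
	case exitCode >= 128:
		return true // killed by a signal, e.g. VM preemption or eviction
	default:
		return false // ordinary non-zero exits are treated as permanent
	}
}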

Thoughts for next iteration

  • From a requirements perspective, I think we want to be able to supply our own function that decides whether a container should be restarted (see the sketch after this list).
    • I think this will give us the flexibility to define semantics that make the most sense for TensorFlow.
  • Conceptually I think it's simpler if the semantics are "containers are never restarted unless the TfJob controller explicitly restarts them".
    • Then it's pretty clear that the controller just needs to enumerate the cases in which containers/pods need to be restarted.
    • We have to enumerate some cases that the default controllers (Job/ReplicaSet) handle automatically, but I don't think that adds undue complexity.
  • I think we also need to consider how this interacts with kube-arbitrator for gang scheduling (Prevent scheduling deadlocks #165).
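
A sketch of the restart-decision hook described in the first bullet, reusing the hypothetical exit-code classification from the earlier sketch as a boolean input:

package sketch

// TfReplicaType mirrors the replica types discussed in this thread.
type TfReplicaType string

const (
	Master TfReplicaType = "MASTER"
	PS     TfReplicaType = "PS"
	Worker TfReplicaType = "WORKER"
)

// shouldRestart is the controller-supplied decision function: a container is
// never restarted unless this returns true. The policy is illustrative only;
// the replica type is passed in because a real policy would likely differ per
// type (worker, ps, master).
func shouldRestart(replica TfReplicaType, exitCode int32, retryable bool) bool {
	switch {
	case exitCode == 0:
		return false // clean exit: the replica finished, don't restart it
	case retryable:
		return true // transient failure: restart
	default:
		return false // permanent failure: mark the replica (and possibly the job) failed
	}
}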

@ScorpioCPH
Member

@jlewi Hi, thanks for the background update.

Auto restarting containers can lead to confusing behavior for users.

We have a restartPolicy option in the TFJob API here, so it may be better to let the user define the restart policy.

@jlewi
Contributor Author

jlewi commented Jan 4, 2018

We have a restartPolicy option in the TFJob API here, so it may be better to let the user define the restart policy.

I agree with letting the user specify the restart behavior. In the current implementation, restart behavior is tied to the replica type: workers, parameter servers, and masters have different restart behaviors.

I think in a future iteration we should let users define the restart behavior for each replica by picking from a set of policies.

I don't know if we can reuse the existing restartPolicy field. That field is a side effect of using PodTemplateSpec, and the built-in K8s restartPolicy values aren't sufficient. For example, there's no policy that corresponds to the existing behavior of restarting on retryable errors but not on permanent errors.

It's not clear to me whether we could introduce TfJob-specific values for PodTemplateSpec's restartPolicy; that might interfere with automatic template validation.

I suspect a cleaner implementation would be to add an appropriate restart-behavior field to TfReplicaSpec. The restartPolicy in PodTemplateSpec should always be set to a suitable default chosen by the TfJobController (#55), so users should never set restartPolicy explicitly. We can enforce consistency by failing jobs with a validation error if the user does set it and the value doesn't match the one the TfJobController deems correct.
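
A rough sketch of what that could look like; the field name, policy values, and validation helper are illustrative, not a concrete API proposal:

package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// RestartBehavior is a TfJob-specific restart policy, separate from the
// built-in PodTemplateSpec restartPolicy values.
type RestartBehavior string

const (
	RestartAlways           RestartBehavior = "Always"
	RestartOnRetryableError RestartBehavior = "OnRetryableError"
	RestartNever            RestartBehavior = "Never"
)

// TfReplicaSpec shows only the fields relevant to this discussion.
type TfReplicaSpec struct {
	Replicas        int32                   `json:"replicas"`
	Template        *corev1.PodTemplateSpec `json:"template,omitempty"`
	RestartBehavior RestartBehavior         `json:"restartBehavior,omitempty"`
}

// validateRestartPolicy fails validation if the user set a pod-level
// restartPolicy that conflicts with the value the controller would choose.
func validateRestartPolicy(spec *TfReplicaSpec, want corev1.RestartPolicy) error {
	if spec.Template == nil || spec.Template.Spec.RestartPolicy == "" {
		return nil // unset: the controller fills in its default
	}
	if spec.Template.Spec.RestartPolicy != want {
		return fmt.Errorf("restartPolicy %q conflicts with the controller-chosen value %q",
			spec.Template.Spec.RestartPolicy, want)
	}
	return nil
}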

@ScorpioCPH
Member

ScorpioCPH commented Jan 10, 2018

@jlewi

retryable errors

Do you mean we could restart some workers from a checkpoint file?

We can enforce consistency by failing jobs with a validation error if the user does set it and the value doesn't match the one the TfJobController deems correct.

Good idea. Given these features, I now prefer Pods :) It will give us full control of the whole workflow.

@DjangoPeng
Member

@jlewi @ScorpioCPH Sorry for the late reply.

Good idea. Given these features, I now prefer Pods :) It will give us full control of the whole workflow.

SGTM. We'd better move this forward and take care of the implementation at the same time.

@gaocegege
Member

If we decide to use pods, we could reuse the code in caicloud/kubeflow-controller 😄

And I also vote for pods now.

@DjangoPeng
Member

@ScorpioCPH @gaocegege Maybe we should merge the CRDs first. There are some differences between the upstream and caicloud implementations.

Some discussion here: https://github.com/caicloud/kubeflow-controller/issues/80

@ScorpioCPH
Member

@jlewi I think we have reached agreement (Pods) after the discussion; should we close this?

@jlewi
Contributor Author

jlewi commented Jan 18, 2018

Good idea. Opened #325

@jlewi jlewi closed this as completed Jan 18, 2018