
Consider how we manage replicas (stateful sets, managing pods directly) #45

Closed
jlewi opened this issue Sep 29, 2017 · 15 comments

Comments

@jlewi
Contributor

jlewi commented Sep 29, 2017

In the current implementation, if a TfProcess (e.g. PS, MASTER, WORKER) has N replicas, we end up creating N Job controllers. This is largely a holdover from the initial implementation, which predated StatefulSets. Now that StatefulSets are more mature, we should consider switching to them; this should simplify the logic in the CRD.

The main challenge to using StatefulSets is figuring out how to set the TF_CONFIG environment variable, which depends on the replica's index within the stateful set.

Here's a snippet showing the struct that's stored in TF_CONFIG.

TfConfig{
	// Cluster spec shared by every replica in the job.
	Cluster: s.Job.ClusterSpec(),
	Task: map[string]interface{}{
		// Replica type, e.g. "ps", "master", or "worker".
		"type": strings.ToLower(string(s.Spec.TfReplicaType)),
		// Index of this replica within its replica type.
		"index": index,
	},
}

Currently we construct a unique TF_CONFIG value for each job controller. With stateful sets we'd need a new mechanism to configure it for each replica.
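
For illustration, here is a minimal sketch of that wiring, assuming hypothetical names (setTfConfig, a sketch package) rather than the actual controller code:

package sketch

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// setTfConfig serializes the per-replica TfConfig value and injects it as the
// TF_CONFIG environment variable on every container in the pod template.
func setTfConfig(podTemplate *corev1.PodTemplateSpec, tfConfig interface{}) error {
	raw, err := json.Marshal(tfConfig)
	if err != nil {
		return err
	}
	for i := range podTemplate.Spec.Containers {
		c := &podTemplate.Spec.Containers[i]
		c.Env = append(c.Env, corev1.EnvVar{Name: "TF_CONFIG", Value: string(raw)})
	}
	return nil
}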

It doesn't look like we can use a PostStart lifecycle hook, since there's no guarantee it runs before the container's ENTRYPOINT.

@jlewi
Contributor Author

jlewi commented Sep 29, 2017

@vishh FYI.

@jlewi
Contributor Author

jlewi commented Oct 2, 2017

Another issue is figuring out what to do about logs.

I'd like logs to be available via kubectl logs after a job finishes but before it is deleted. But a stateful set will just restart the container, so I'm not sure how we'd preserve the logs.

@foxish

foxish commented Oct 16, 2017

The main challenge to using StatefulSets is figuring out how to set the TF_CONFIG environment variable, which depends on the replica's index within the stateful set.

That should be possible through the downward API and some sort of init script that configures the env var differently on each replica. The logs issue is more difficult, because StatefulSets were not designed to support completion semantics.
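
As a rough sketch of the downward API part of that idea (the POD_NAME env var and the ordinal-parsing step are assumptions, not an agreed design):

package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// podNameEnvVar exposes the pod's own name to the container via the downward
// API. A StatefulSet pod is named <statefulset-name>-<ordinal>, so an init
// script or entrypoint wrapper could parse the trailing ordinal and write
// TF_CONFIG before exec'ing the training process.
func podNameEnvVar() corev1.EnvVar {
	return corev1.EnvVar{
		Name: "POD_NAME",
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.name"},
		},
	}
}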

Another alternative worth investigating is just scheduling individual pods ourselves, which gives complete control over all of those aspects.

@jlewi jlewi changed the title Consider using statefulsets Consider how we manage replicas (stateful sets, managing pods directly) Dec 7, 2017
@jlewi
Contributor Author

jlewi commented Dec 7, 2017

I think we should also consider creating the pods directly ourselves so that we can get the semantics we want in terms of:

  • Restart behavior based on exit code
  • Setting of environment variables per replica

I think GoogleCloudPlatform/kube-metacontroller could make it really simple to define a controller that just manages a single pod with the exact semantics we want.

Using a single controller per replica means we can set environment variables specific to each replica (e.g. its index). I think this is very convenient for end users, because they can then set replica-specific flags based on those environment variables, e.g.

--index=${replica_index}
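
A minimal sketch of what a per-replica controller could generate; the image, env var name, and flag are placeholders, and the kubelet expands $(REPLICA_INDEX) in args from the container's own env:

package sketch

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// replicaContainer builds a container whose args reference a replica-specific
// environment variable, so the training process sees its concrete index.
func replicaContainer(index int) corev1.Container {
	return corev1.Container{
		Name:  "tensorflow",
		Image: "example.com/train:latest", // placeholder image
		Env: []corev1.EnvVar{
			{Name: "REPLICA_INDEX", Value: strconv.Itoa(index)},
		},
		Args: []string{"--index=$(REPLICA_INDEX)"},
	}
}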

@foxish

foxish commented Dec 7, 2017

We should create the pods ourselves, I agree. But having a controller per pod is not a common pattern; I think it would add quite a lot of complexity. One place the metacontroller fits in is replacing the existing Go-based controller code. But IIRC, there are certain things we may not want to do (yet) with the metacontroller, like scaling (watch optimizations and informers) and authorization of individual controllers. For those things, it might be better to continue using the Go-based custom controller.

@DjangoPeng
Member

The same discussion is happening internally at caicloud. Ref: https://github.com/caicloud/kubeflow-controller/issues/71#issuecomment-355056365

PTAL @jlewi @foxish

@jlewi
Contributor Author

jlewi commented Jan 4, 2018

Here's some background.

  • The original motivation for using a Job or ReplicaSet controller (as opposed to managing pods directly) was to let K8s handle failures that cause a pod to be terminated.

    • For example, if a VM becomes unhealthy and its pod is terminated, we want the pod restarted; the assumption is that the training code is robust to these kinds of failures, i.e. it uses Supervisor or a similar mechanism to recover automatically.
  • I originally picked the Job controller over ReplicaSet because the semantics we wanted seemed to be "run until completion", which maps to the Job controller pretty well.

  • The Job controller's semantics turn out not to be quite what we want.

    • We eventually realized we needed a mechanism to distinguish between retryable and permanent errors.
    • So we settled on classifying certain exit codes as retryable and others as permanent (see the sketch after this list).
    • Once we did that, we could no longer rely on the built-in controller semantics; we have to look at container terminations and decide what to do based on the exit code and the type of process (worker, ps, master).
    • So at this point none of the built-in controllers really give us the semantics we want.
  • Auto-restarting containers can lead to confusing behavior for users.

    • Suppose a container exits with a code indicating a permanent error.
    • The Job controller would automatically restart it.
    • The TfJob controller would detect that the process exited with a permanent error code and terminate it.
    • So if you look at the container's logs, you see the container exiting due to some error, restarting, and then being killed by SIGTERM.
      • Telling users to look at the previous container termination rather than the most recent one is very confusing.
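
A minimal sketch of the exit-code classification mentioned above; the specific code ranges are placeholders, not an agreed convention:

package sketch

// isRetryable classifies a container exit code. The ranges below are
// illustrative only; the real convention would be defined by the TfJob
// controller.
func isRetryable(exitCode int32) bool {
	switch {
	case exitCode == 0:
		return false // clean exit, nothing to retry
	case exitCode >= 128:
		return true // killed by a signal, e.g. VM preemption or eviction
	default:
		return false // ordinary non-zero exits are treated as permanent
	}
}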

Thoughts for next iteration

  • From a requirements perspective, I think we want to be able to supply our own function that decides whether a container should be restarted (see the sketch after this list).
    • I think this will give us the flexibility to define semantics that make the most sense for TensorFlow.
  • Conceptually I think it's simpler if the semantics are "containers are never restarted unless the TfJob controller explicitly restarts them".
    • Then it's pretty clear that the controller just needs to enumerate the cases in which containers/pods need to be restarted.
    • We have to enumerate some cases that the default controllers (Job/ReplicaSet) handle automatically, but I don't think that adds undue complexity.
  • I think we also need to consider how this interacts with kube-arbitrator for gang scheduling (Prevent scheduling deadlocks #165).
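
A sketch of the restart-decision hook described in the first bullet, reusing the hypothetical exit-code classification from the earlier sketch as a boolean input:

package sketch

// TfReplicaType mirrors the replica types discussed in this thread.
type TfReplicaType string

const (
	Master TfReplicaType = "MASTER"
	PS     TfReplicaType = "PS"
	Worker TfReplicaType = "WORKER"
)

// shouldRestart is the controller-supplied decision function: a container is
// never restarted unless this returns true. The policy is illustrative only;
// the replica type is passed in because a real policy would likely differ per
// type (worker, ps, master).
func shouldRestart(replica TfReplicaType, exitCode int32, retryable bool) bool {
	switch {
	case exitCode == 0:
		return false // clean exit: the replica finished, don't restart it
	case retryable:
		return true // transient failure: restart
	default:
		return false // permanent failure: mark the replica (and possibly the job) failed
	}
}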

@ScorpioCPH
Member

@jlewi Hi, thanks for the background update.

Auto restarting containers can lead to confusing behavior for users.

We have a restartPolicy option in the TFJob API here, so it may be better to let the user define the restart policy.

@jlewi
Contributor Author

jlewi commented Jan 4, 2018

We have a restartPolicy option in the TFJob API here, so it may be better to let the user define the restart policy.

I agree with letting the user specify the restart behavior. In the current implementation, restart behavior is tied to the replica type: workers, parameter servers, and masters have different restart behaviors.

I think in a future iteration we should let users define the restart behavior for each replica by picking from a set of policies.

I don't know if we can reuse the existing restartPolicy field. That field is a side effect of using PodTemplateSpec, and the built-in K8s restartPolicy values aren't sufficient. For example, there's no policy that corresponds to the existing behavior of restarting on retryable errors but not on permanent errors.

It's not clear to me whether we could introduce TfJob-specific values for PodTemplateSpec's restartPolicy; that might interfere with automatic template validation.

I suspect a cleaner implementation would be to add an appropriate restart-behavior field to TfReplicaSpec. The restartPolicy in PodTemplateSpec should always be set to a suitable default chosen by the TfJobController (#55), so users should never set restartPolicy explicitly. We can enforce consistency by failing jobs with a validation error if the user does set it and the value doesn't match the one the TfJobController deems correct.
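
A rough sketch of what that could look like; the field name, policy values, and validation helper are illustrative, not a concrete API proposal:

package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// RestartBehavior is a TfJob-specific restart policy, separate from the
// built-in PodTemplateSpec restartPolicy values.
type RestartBehavior string

const (
	RestartAlways           RestartBehavior = "Always"
	RestartOnRetryableError RestartBehavior = "OnRetryableError"
	RestartNever            RestartBehavior = "Never"
)

// TfReplicaSpec shows only the fields relevant to this discussion.
type TfReplicaSpec struct {
	Replicas        int32                   `json:"replicas"`
	Template        *corev1.PodTemplateSpec `json:"template,omitempty"`
	RestartBehavior RestartBehavior         `json:"restartBehavior,omitempty"`
}

// validateRestartPolicy fails validation if the user set a pod-level
// restartPolicy that conflicts with the value the controller would choose.
func validateRestartPolicy(spec *TfReplicaSpec, want corev1.RestartPolicy) error {
	if spec.Template == nil || spec.Template.Spec.RestartPolicy == "" {
		return nil // unset: the controller fills in its default
	}
	if spec.Template.Spec.RestartPolicy != want {
		return fmt.Errorf("restartPolicy %q conflicts with the controller-chosen value %q",
			spec.Template.Spec.RestartPolicy, want)
	}
	return nil
}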

@ScorpioCPH
Member

ScorpioCPH commented Jan 10, 2018

@jlewi

retryable errors

Do you mean we could restart some workers from a checkpoint file?

We can enforce consistency by failing jobs with a validation error if the user does set it and the value doesn't match the one the TfJobController deems correct.

Good idea. Given these features, I now prefer Pods :) It will give us full control of the whole workflow.

@DjangoPeng
Member

@jlewi @ScorpioCPH Sorry for the late reply.

Good idea. Given these features, I now prefer Pods :) It will give us full control of the whole workflow.

SGTM. We'd better move this forward and take care of the implementation at the same time.

@gaocegege
Member

If we decide to use pods, we could reuse the code in caicloud/kubeflow-controller 😄

And I also vote for pods now.

@DjangoPeng
Member

@ScorpioCPH @gaocegege Maybe we should merge the CRDs first. There are some differences between the upstream and caicloud implementations.

Some discussion here: https://github.com/caicloud/kubeflow-controller/issues/80

@ScorpioCPH
Member

@jlewi I think we have reached agreement (Pods) after the discussion; should we close this?

@jlewi
Contributor Author

jlewi commented Jan 18, 2018

Good idea. Opened #325

@jlewi jlewi closed this as completed Jan 18, 2018