Consider how we manage replicas (stateful sets, managing pods directly) #45
Comments
@vishh FYI.
Another issue is figuring out what to do about logs. I'd like logs to be available via kubectl logs after a job finishes but before it is deleted. But a StatefulSet will just restart the container, so I'm not sure how we preserve logs.
That should be possible through the downward API and some sort of init script that configures the env var differently on each replica (a sketch of the idea follows below). The logs issue is more difficult because StatefulSets were not designed to support completion semantics. Another alternative worth investigating is just scheduling individual pods, which gives complete control over all of those aspects.
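For concreteness, here is a minimal sketch of that idea: a thin wrapper entrypoint that recovers the replica index from the StatefulSet pod hostname (pods are named `<statefulset-name>-<ordinal>`) before launching the real process. This is illustrative only and assumes the hostname convention; nothing like it ships in the operator today.

```go
// Sketch: derive the replica index from the StatefulSet hostname.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func replicaIndex() (int, error) {
	host, err := os.Hostname() // e.g. "tfjob-worker-2"
	if err != nil {
		return 0, err
	}
	parts := strings.Split(host, "-")
	// The StatefulSet ordinal is the trailing component of the pod name.
	return strconv.Atoi(parts[len(parts)-1])
}

func main() {
	idx, err := replicaIndex()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot determine replica index:", err)
		os.Exit(1)
	}
	// A wrapper would export this before exec'ing the TensorFlow process,
	// which could then assemble its own TF_CONFIG from it.
	os.Setenv("REPLICA_INDEX", strconv.Itoa(idx))
	fmt.Println("replica index:", idx)
}
```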
I think we should also consider creating the pods directly ourselves so that we can get exactly the semantics we want.
I think GoogleCloudPlatform/kube-metacontroller could make it really simple to define a controller that just manages a single pod with the exact semantics we want. Using a single controller per replica means we can set environment variables specific to each replica (e.g. the index), which I think is very convenient for end users because they can use the downward API to set replica-specific flags based on environment variables, e.g. as in the sketch below.
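A hedged sketch of what a per-replica pod could look like when a controller creates pods directly: the controller injects the replica index as a plain env var, and the downward API exposes pod metadata alongside it. The corev1/metav1 types are real client-go API types; the job names, labels, and function name are made up for illustration.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podForReplica builds the pod for one replica of a hypothetical TfJob.
func podForReplica(jobName string, index int) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   fmt.Sprintf("%s-worker-%d", jobName, index),
			Labels: map[string]string{"tf-replica-index": fmt.Sprintf("%d", index)},
		},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyOnFailure,
			Containers: []corev1.Container{{
				Name:  "tensorflow",
				Image: "tensorflow/tensorflow:latest",
				Env: []corev1.EnvVar{
					// Set directly by the controller, one value per replica.
					{Name: "REPLICA_INDEX", Value: fmt.Sprintf("%d", index)},
					// Exposed from pod metadata via the downward API.
					{Name: "POD_NAME", ValueFrom: &corev1.EnvVarSource{
						FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.name"},
					}},
				},
			}},
		},
	}
}

func main() { fmt.Println(podForReplica("mnist", 0).Name) }
```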
We should create the pods ourselves, I agree. But having a controller per pod is not a common pattern; I think it would add quite a lot of complexity. One place the meta-controller fits in is in replacing the existing Go-based controller code. But IIRC there are certain things we may not want to do (yet) with the meta-controller, like scaling (watch optimizations & informers), authorization of individual controllers, etc. For those things, it might be better to continue to use the Go-based custom controller.
There was a similar discussion internally at Caicloud. Ref: https://github.com/caicloud/kubeflow-controller/issues/71#issuecomment-355056365
Here's some background.
Thoughts for the next iteration:
I agree with letting the user specify the restart behavior. In the current implementation, restart behavior is tied to replica type: workers, parameter servers, and masters have different restart behaviors. In a future iteration we should let users define the restart behavior for each replica by picking from a set of policies. I don't know if we can reuse the existing RestartPolicy values; it's not clear to me whether we could introduce TfJob-specific values for PodTemplateSpec's RestartPolicy. I suspect a cleaner implementation would be to add an appropriate restart behavior field to TfReplicaSpec, as sketched below.
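For concreteness, here is a hedged sketch of what such a field on TfReplicaSpec could look like. The policy names and field layout are illustrative assumptions, not values the project has settled on; only the corev1 type is real.

```go
package tfjob

import corev1 "k8s.io/api/core/v1"

// RestartPolicy names a TfJob-specific restart behavior, decoupled from
// PodTemplateSpec's RestartPolicy. Values here are hypothetical examples
// of the kind of policy set discussed above.
type RestartPolicy string

const (
	RestartAlways    RestartPolicy = "Always"    // always restart the replica
	RestartOnFailure RestartPolicy = "OnFailure" // restart only on non-zero exit
	RestartExitCode  RestartPolicy = "ExitCode"  // decide based on the exit code
	RestartNever     RestartPolicy = "Never"     // never restart
)

// TfReplicaSpec sketch: restart behavior lives on the replica spec rather
// than being inferred from the replica type.
type TfReplicaSpec struct {
	Replicas      *int32                  `json:"replicas,omitempty"`
	TfReplicaType string                  `json:"tfReplicaType"`
	RestartPolicy RestartPolicy           `json:"restartPolicy,omitempty"`
	Template      *corev1.PodTemplateSpec `json:"template,omitempty"`
}
```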
Do you mean we could restart some workers from a checkpoint file?
Good idea, and these points are worth taking into consideration.
@jlewi @ScorpioCPH Sorry for the late reply.
SGTM. We'd better move it forward and take care of the implementation at the same time.
If we decide to use Pods, we could reuse the code in caicloud/kubeflow-controller 😄 I also vote for Pods now.
@ScorpioCPH @gaocegege Maybe we should merge the CRD first. There are some differences between the upstream and Caicloud implementations. Some discussion here: https://github.com/caicloud/kubeflow-controller/issues/80
@jlewi I think we have reached an agreement (Pods) after discussion; should we close this?
Good idea. Opened #325
In the current implementation, if a TfProcess (e.g. PS, MASTER, WORKER) has N replicas, we end up creating N job controllers. This is largely a holdover from the initial implementation, which predated StatefulSets. Now that StatefulSets are more mature, we should consider switching to them. This should simplify the logic in the CRD.
The main challenge with using StatefulSets is figuring out how to set the TF_CONFIG environment variable, which depends on the replica's index within the stateful set.
Here's a snippet showing the struct that's stored in TF_CONFIG.
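The snippet itself appears to have been lost from the thread; below is a hedged reconstruction based on TensorFlow's documented TF_CONFIG format. The Go type names are illustrative, not necessarily the ones used in the operator.

```go
package tfjob

// ClusterSpec maps a task type ("master", "ps", "worker") to the
// host:port addresses of its replicas.
type ClusterSpec map[string][]string

// TaskSpec identifies which member of the cluster this process is.
type TaskSpec struct {
	Type  string `json:"type"`  // e.g. "worker"
	Index int    `json:"index"` // replica index within that type
}

// TfConfig is marshaled to JSON and placed in the TF_CONFIG env var.
type TfConfig struct {
	Cluster     ClusterSpec `json:"cluster"`
	Task        TaskSpec    `json:"task"`
	Environment string      `json:"environment"` // e.g. "cloud"
}
```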
Currently we construct a unique value of the TF_CONFIG environment variable for each job controller. For stateful sets, we'd need a new mechanism to configure this for each replica.
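To make the current per-job-controller approach concrete, here is a hedged sketch of how a controller can render a distinct TF_CONFIG per replica, reusing the TfConfig types sketched above. The function name is hypothetical; each (taskType, index) pair yields a unique value, which is why today's design needs one job controller per replica.

```go
package tfjob

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// tfConfigEnvVar renders the TF_CONFIG env var for one replica.
func tfConfigEnvVar(cluster ClusterSpec, taskType string, index int) (corev1.EnvVar, error) {
	cfg := TfConfig{
		Cluster:     cluster,
		Task:        TaskSpec{Type: taskType, Index: index},
		Environment: "cloud",
	}
	b, err := json.Marshal(cfg)
	if err != nil {
		return corev1.EnvVar{}, fmt.Errorf("marshal TF_CONFIG: %v", err)
	}
	return corev1.EnvVar{Name: "TF_CONFIG", Value: string(b)}, nil
}
```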
It doesn't look like we can use a PostStart hook, since there's no guarantee it runs before the container's ENTRYPOINT.