Yes. I'm not sure how severe this issue is, but if you leave out the restartPolicy, K8s 1.8 seems to default it to Always. As a result, the master and worker Jobs are created but the ps Job is not (error below):
I Creating Service: master-xyzz-0
I Service master-xyzz-0 already exists.
I Creating Job: master-xyzz-0
I master-xyzz-0 already exists.
I Creating Service: worker-xyzz-0
I Service worker-xyzz-0 already exists.
I Creating Job: worker-xyzz-0
I worker-xyzz-0 already exists.
I Creating Service: ps-xyzz-0
I Service ps-xyzz-0 already exists.
I Creating Job: ps-xyzz-0
E trainingJobCreateReplicas() error; [Creating Job ps-xyzz-0 returned error., Job.batch "ps-xyzz-0" is invalid: spec.template.spec.restartPolicy: Unsupported value: "Always": supported values: OnFailure, Never]
A simple TfJob configuration with one master, one worker, and one ps server is enough to reproduce this, as long as the restartPolicy is omitted for the ps server (while it is present for master/worker).
We let users set this field according to their model code.
If RestartPolicy is set to OnFailure or Always, users must add checkpoint-reloading code themselves.
Otherwise, restarting has no effect.
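As a rough illustration of what "reloading checkpoint code" means, here is a minimal sketch of a training loop that resumes from its newest checkpoint after a restart. It uses JSON files as a stand-in for real TensorFlow checkpoints; the `ckpt-<step>.json` naming scheme, the `train` and `latest_checkpoint` helpers, and the `fail_at` parameter are all hypothetical and only there to simulate a crash.

```python
import json
import os
import tempfile


def latest_checkpoint(ckpt_dir):
    """Return (step, path) of the newest checkpoint in ckpt_dir, or None."""
    if not os.path.isdir(ckpt_dir):
        return None
    found = []
    for name in os.listdir(ckpt_dir):
        # Hypothetical naming scheme: ckpt-<step>.json
        if name.startswith("ckpt-") and name.endswith(".json"):
            step = int(name[len("ckpt-"):-len(".json")])
            found.append((step, os.path.join(ckpt_dir, name)))
    return max(found) if found else None


def train(ckpt_dir, total_steps, save_every=2, fail_at=None):
    """Toy training loop that resumes from the newest checkpoint, if any."""
    os.makedirs(ckpt_dir, exist_ok=True)
    state = {"step": 0, "weight": 0.0}
    latest = latest_checkpoint(ckpt_dir)
    if latest is not None:
        # This is the part a restarted container needs: reload saved state.
        with open(latest[1]) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["step"] += 1
        state["weight"] += 0.5  # stand-in for a real gradient update
        if state["step"] % save_every == 0:
            path = os.path.join(ckpt_dir, "ckpt-%d.json" % state["step"])
            with open(path, "w") as f:
                json.dump(state, f)
    return state


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()
    try:
        train(ckpt_dir, total_steps=6, fail_at=5)
    except RuntimeError:
        pass  # the container would be restarted here
    print(train(ckpt_dir, total_steps=6))
```

Without the reload step at the top of `train`, a restarted container would start again from step 0, which is why a restart policy alone does not help.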
The ExitCode policy means that users must set exit codes in their own code; tf-operator checks these exit codes to decide how to behave when an error occurs.
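The idea can be sketched roughly as follows. The specific codes that tf-operator recognizes are not listed in this thread, so the mapping below is a hypothetical one chosen for illustration: it treats the conventional "killed by signal" codes (128 + signal number) as retryable and every other non-zero code as a permanent failure. The names `RETRYABLE_EXIT_CODES` and `action_for_exit_code` are assumptions, not part of tf-operator.

```python
# Hypothetical mapping for illustration only; not tf-operator's actual table.
# 137 = 128 + SIGKILL(9), 143 = 128 + SIGTERM(15): the process was killed
# from outside, so a restart may succeed.
RETRYABLE_EXIT_CODES = {137, 143}


def action_for_exit_code(exit_code):
    """Decide what an operator might do with a finished container."""
    if exit_code == 0:
        return "succeeded"
    if exit_code in RETRYABLE_EXIT_CODES:
        return "restart"
    # Any other non-zero code is treated as a permanent error here.
    return "fail"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, action_for_exit_code(code))
```

Under a scheme like this, user code would exit with a retryable code only when rerunning the same container could plausibly succeed.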
Users shouldn't need to explicitly set restart policy. We should be able to pick sensible values.
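One way this defaulting could look, as a sketch only: fill in a restart policy when the user omits it, and reject values that a Job does not accept before the API server does. The supported values come from the error message above; choosing `OnFailure` as the default is an assumption, and `with_default_restart_policy` is a hypothetical helper operating on a plain dict rather than a real Kubernetes client object.

```python
# Values a batch Job accepts, per the API error quoted in this issue.
JOB_SUPPORTED_POLICIES = ("OnFailure", "Never")
# Assumed default; the right choice is what this issue is discussing.
DEFAULT_POLICY = "OnFailure"


def with_default_restart_policy(pod_spec):
    """Return a copy of pod_spec with a Job-compatible restartPolicy."""
    spec = dict(pod_spec)
    if not spec.get("restartPolicy"):
        # Omitted by the user: pick a value instead of letting it
        # fall through to Always.
        spec["restartPolicy"] = DEFAULT_POLICY
    if spec["restartPolicy"] not in JOB_SUPPORTED_POLICIES:
        raise ValueError(
            "unsupported restartPolicy %r, expected one of %s"
            % (spec["restartPolicy"], ", ".join(JOB_SUPPORTED_POLICIES))
        )
    return spec


if __name__ == "__main__":
    ps_spec = {"containers": [{"name": "tensorflow"}]}  # no restartPolicy
    print(with_default_restart_policy(ps_spec)["restartPolicy"])
```

With a step like this applied to every replica, the ps replica from the report above would get a valid policy even when the user leaves the field out.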