[feature] python sdk should report errors in created TFJobs #1180

yashjakhotiya · 2020-07-03T11:29:56Z

When the training code has errors, tfjob_client.wait_for_job shows Running and exits after some time. Instead of requiring users to look at Error Reporting in the Google Cloud dashboard, the Python SDK should report these errors.


issue-label-bot · 2020-07-03T11:30:04Z

Issue-Label Bot is automatically applying the labels:

Label	Probability
feature	0.87


issue-label-bot · 2020-07-03T11:30:10Z

Issue-Label Bot is automatically applying the labels:

Label	Probability
area/front-end	0.91


yashjakhotiya · 2020-07-11T14:12:20Z

/assign @jinchihe

jinchihe · 2020-07-13T01:48:33Z

@yashjakhotiya Thanks. Do you think the user should get the error from the logs when training fails? If so, the SDK already has a get_logs API.
See https://github.com/kubeflow/tf-operator/blob/master/sdk/python/docs/TFJobClient.md#get_logs

yashjakhotiya · 2020-07-13T08:43:07Z

It turned out there were a couple of bugs: (1) one with get_logs, and (2) the TFJob keeps running even when there is no error and the training code exits normally. If both of these get fixed, I don't think we need wait_for_job to explicitly report errors.
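For anyone who still wants an explicit failure check in their own code, here is a minimal sketch. It assumes the TFJob object is available as a plain dict whose status.conditions list holds entries with type, status, reason and message fields. The names TFJobFailedError, latest_true_condition and raise_if_failed are hypothetical helpers, not part of the SDK.

```python
from typing import Optional


class TFJobFailedError(RuntimeError):
    """Hypothetical exception type; not part of the kubeflow SDK."""


def latest_true_condition(tfjob: dict) -> Optional[dict]:
    """Return the last condition whose status is "True", or None if there is none."""
    # Assumption: conditions are appended in chronological order.
    conditions = (tfjob.get("status") or {}).get("conditions") or []
    active = [c for c in conditions if c.get("status") == "True"]
    return active[-1] if active else None


def raise_if_failed(tfjob: dict) -> str:
    """Return the current condition type, raising if the job reports Failed."""
    condition = latest_true_condition(tfjob)
    if condition is None:
        return "Unknown"
    if condition.get("type") == "Failed":
        raise TFJobFailedError(
            f"{condition.get('reason', '')}: {condition.get('message', '')}"
        )
    return condition.get("type", "Unknown")
```

A caller could pass in whatever job object it already fetched and let the exception surface the reason and message, instead of only seeing Running.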

yashjakhotiya · 2020-07-24T12:46:53Z

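Got get_logs to work by disabling Istio sidecar injection in the PodTemplateSpec:

from kubernetes.client import V1ObjectMeta, V1PodSpec, V1PodTemplateSpec

# `container` is the V1Container for the training code, defined elsewhere.
template = V1PodTemplateSpec(
    metadata=V1ObjectMeta(
        # Keep Istio from injecting its sidecar into the training pod.
        annotations={'sidecar.istio.io/inject': 'false'}
    ),
    spec=V1PodSpec(
        containers=[container]
    )
)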

The Istio sidecar was also the reason the TFJob wouldn't stop running. Now that both problems are solved, we don't need wait_for_job to explicitly report errors. Closing the issue now.
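For readers who build the TFJob as a plain manifest rather than with the kubernetes model classes, the same annotation can be sketched as a dict. This is only an illustration under stated assumptions: build_tfjob_manifest is a hypothetical helper, the single Worker replica spec and the "Never" restart policy are choices made for the example, and the image and command values are placeholders.

```python
def build_tfjob_manifest(name, namespace, image, command, replicas=1):
    """Build a TFJob manifest dict with Istio sidecar injection disabled."""
    # The training container in a TFJob is conventionally named "tensorflow".
    container = {"name": "tensorflow", "image": image, "command": list(command)}
    template = {
        # Same annotation as in the comment above: no Istio sidecar in this pod.
        "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
        "spec": {"containers": [container]},
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": replicas,
                    "restartPolicy": "Never",
                    "template": template,
                }
            }
        },
    }


# Placeholder values, for illustration only.
manifest = build_tfjob_manifest(
    "example-job", "default", "example/image:latest", ["python", "train.py"]
)
```

Note that the annotation value is the string "false", not a boolean, since Kubernetes annotation values must be strings.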

issue-label-bot bot added the feature label Jul 3, 2020

issue-label-bot bot added the area/front-end label Jul 3, 2020

jlewi added kind/feature and removed feature labels Jul 3, 2020

k8s-ci-robot assigned jinchihe Jul 11, 2020

yashjakhotiya closed this as completed Jul 24, 2020
