[feature] python sdk should report errors in created TFJobs #1180

Closed
yashjakhotiya opened this issue Jul 3, 2020 · 6 comments

Comments

yashjakhotiya commented Jul 3, 2020

When the training code hits an error, tfjob_client.wait_for_job keeps showing Running and exits after some time. Instead of making users look at Error Reporting in the Google Cloud Dashboard, the Python SDK should report these errors itself.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
feature 0.87

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/front-end 0.91

@yashjakhotiya
Author

/assign @jinchihe

@jinchihe
Member

@yashjakhotiya Thanks. Do you think users should get the error from the logs when training fails? If so, the SDK already has a get_logs API.
For details, see https://github.com/kubeflow/tf-operator/blob/master/sdk/python/docs/TFJobClient.md#get_logs
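
For readers who want the behaviour requested above without changing the SDK, here is a minimal sketch of a wrapper that waits for a job and then pulls its logs if it did not succeed. The helper name wait_and_report is hypothetical. It assumes the client exposes wait_for_job, get_job_status and get_logs as described in the TFJobClient docs, that wait_for_job raises RuntimeError on timeout, and that a finished job reports the status string "Succeeded". Please check these against your SDK version.

```python
import logging

logger = logging.getLogger(__name__)


def wait_and_report(client, name, namespace, timeout_seconds=600):
    """Wait for a TFJob, then surface its logs if it did not succeed.

    `client` is assumed to behave like kubeflow.tfjob.TFJobClient.
    Returns the final status string reported by the client.
    """
    try:
        client.wait_for_job(name, namespace=namespace,
                            timeout_seconds=timeout_seconds)
    except RuntimeError as err:
        # Assumption: the SDK signals a wait timeout with RuntimeError.
        logger.warning("Waiting for TFJob %s ended early: %s", name, err)

    status = client.get_job_status(name, namespace=namespace)
    if status != "Succeeded":
        # Assumption: "Succeeded" is the status string for a completed job.
        # Anything else (Failed, still Running after timeout) gets its logs shown.
        client.get_logs(name, namespace=namespace)
    return status
```

With the real SDK this would be called roughly as wait_and_report(TFJobClient(), 'mnist', 'kubeflow'), where the job name and namespace are placeholders.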

@yashjakhotiya
Author

It turns out there are two bugs: (1) get_logs does not work, and (2) the TFJob keeps running even when there is no error and the training code exits normally. If both get fixed, I don't think wait_for_job needs to report errors explicitly.

@yashjakhotiya
Author

Got get_logs to work by turning off Istio sidecar injection in the PodTemplateSpec:

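from kubernetes.client import V1ObjectMeta, V1PodSpec, V1PodTemplateSpec

# Disable Istio sidecar injection for the training pod.
# `container` is the V1Container that runs the training code (defined elsewhere).
template = V1PodTemplateSpec(
    metadata=V1ObjectMeta(
        annotations={'sidecar.istio.io/inject': 'false'}
    ),
    spec=V1PodSpec(
        containers=[container]
    )
)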

Istio injection was also the reason the TFJob would not stop running. Now that both problems are solved, wait_for_job does not need to report errors explicitly. Closing the issue.
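
For anyone who builds the TFJob manifest as plain dicts (for example, loaded from YAML) rather than with the Kubernetes model classes, the same annotation can be applied with a small helper. This is only a sketch: the function name disable_istio_injection and the example image are hypothetical, and the only thing taken from the fix above is the annotation key and value.

```python
import copy

ISTIO_INJECT_ANNOTATION = "sidecar.istio.io/inject"


def disable_istio_injection(pod_template):
    """Return a copy of a dict pod template with Istio sidecar injection off.

    The input is left untouched; a missing or empty metadata/annotations
    section is created as needed.
    """
    patched = copy.deepcopy(pod_template)
    metadata = patched.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    # Kubernetes annotation values must be strings, hence "false" and not False.
    annotations[ISTIO_INJECT_ANNOTATION] = "false"
    metadata["annotations"] = annotations
    patched["metadata"] = metadata
    return patched


if __name__ == "__main__":
    # Hypothetical template; the container name and image are placeholders.
    template = {"spec": {"containers": [{"name": "tensorflow",
                                         "image": "example/image"}]}}
    print(disable_istio_injection(template)["metadata"])
```

Returning a copy keeps the original template reusable for pods that should still get the sidecar.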
