Surface Pod and other Errors that Prevent TFJob from starting #1131

jlewi · 2020-02-05T17:42:35Z

We need a good way to surface errors from starting or running pods to users. Right now it looks like users have to look at the operator logs.

Users should be able to do

kubectl describe tfjobs ${MYJOB}

to see the relevant errors.


issue-label-bot · 2020-02-05T17:42:43Z

Issue-Label Bot is automatically applying the labels:

Label	Probability
feature	0.94

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

johnugeorge · 2020-02-06T04:04:58Z

Good point. We will take this up in the next release.

jlewi · 2020-04-06T13:53:16Z

You might want to check out kubeflow/kubeflow#3637 to see how this was solved in the case of notebooks. I think the approach we followed was to replay events from the pod.

jlewi · 2020-05-18T13:11:50Z

@johnugeorge How's this coming? Do you think this will land for 1.1?

gaocegege · 2020-05-19T01:46:54Z

/cc @ChanYiLin

ChanYiLin · 2020-05-21T13:12:24Z

@gaocegege @johnugeorge
I am wondering whether we should put this in the common library.
It seems this is a feature that every operator needs.
/cc @Jeffwan

Jeffwan · 2020-05-21T19:59:38Z

This is a reasonable request. I think the engineering story is: if pods enter a failed status, filter the events of those pods and create CR events that carry the error messages from the pod side. It should make the 1.1 timeline once we move all operators to common (which is also targeted for 1.1).
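
A minimal sketch of the flow described above, for illustration only. It uses plain dictionaries shaped like Kubernetes Pod and Event objects rather than a real client, and the function name `replay_pod_events` and the message format are assumptions, not the operator's actual code.

```python
from typing import Dict, List


def replay_pod_events(job_name: str, pods: List[Dict], events: List[Dict]) -> List[Dict]:
    """Build job-level events from Warning events of pods in the Failed phase.

    Hypothetical helper: inputs are plain dicts using the core/v1 field names
    (metadata.name, status.phase, involvedObject, type, reason, message).
    """
    # Names of the pods that have reached the Failed phase.
    failed = {
        p["metadata"]["name"]
        for p in pods
        if p.get("status", {}).get("phase") == "Failed"
    }

    job_events = []
    for ev in events:
        obj = ev.get("involvedObject", {})
        # Keep only Warning events that belong to one of the failed pods.
        if obj.get("kind") != "Pod" or obj.get("name") not in failed:
            continue
        if ev.get("type") != "Warning":
            continue
        # Re-target the event at the job so it would show up when describing the job.
        job_events.append({
            "involvedObject": {"kind": "TFJob", "name": job_name},
            "type": "Warning",
            "reason": ev.get("reason", ""),
            "message": "pod {}: {}".format(obj["name"], ev.get("message", "")),
        })
    return job_events
```

In a real operator the returned entries would be emitted through an event recorder against the custom resource; that part is left out here.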

jlewi · 2020-06-15T17:49:27Z

@gaocegege @johnugeorge @Jeffwan Is this on track for 1.1? What is the likelihood it lands this week? If not, should we downgrade it to P2 and remove it from KF 1.1?

gaocegege · 2020-06-16T01:39:40Z

Personally, I think it is not on track. @Jeffwan

gaocegege · 2020-08-19T09:07:37Z

/cc @whalecold

whalecold · 2020-08-20T01:47:56Z

/assign

whalecold · 2020-08-27T08:38:49Z

The pod error event is already recorded in the common repo, and I can find the pod error event in my Kubernetes cluster. It seems that the reporter didn't use kubectl describe to track the unexpected condition in this issue. What do you think? @gaocegege

gaocegege · 2020-08-27T09:19:33Z

Sometimes the pod is created successfully, but it fails to schedule.

whalecold · 2020-08-27T10:04:54Z

> Sometimes the pod is created successfully, but it fails to schedule.

OK, I have two ideas. One is to use the status of the active pod when it is False, like the Spark operator does. The other is to collect the events generated by the abnormal pods.
I think the first is better because the second solution needs to store all the events in memory, although the pod status may not be as detailed as the events.
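
A rough sketch of the first idea, assuming pods are available as plain dictionaries shaped like core/v1 Pod objects. The helper names `abnormal_condition` and `job_warnings` are hypothetical, and this is not the code from the eventual change; it only shows how a condition with status "False" (for example `PodScheduled` with reason `Unschedulable`) could be turned into a job-level message without keeping any events in memory.

```python
from typing import Dict, List, Optional


def abnormal_condition(pod: Dict) -> Optional[str]:
    """Return a message for the first pod condition whose status is "False", else None."""
    name = pod.get("metadata", {}).get("name", "")
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("status") == "False":
            # type, reason and message are standard PodCondition fields.
            return "pod {}: {}=False ({}): {}".format(
                name,
                cond.get("type", ""),
                cond.get("reason", ""),
                cond.get("message", ""),
            )
    return None


def job_warnings(pods: List[Dict]) -> List[str]:
    """Collect one warning message per pod that reports an abnormal condition."""
    messages = []
    for pod in pods:
        msg = abnormal_condition(pod)
        if msg is not None:
            messages.append(msg)
    return messages
```

Only the current pod status is read, which is the memory advantage mentioned above; the trade-off is that the condition message can be less detailed than the corresponding events.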

gaocegege · 2020-08-27T10:06:14Z

The first SGTM. I think it works for tf-operator and is easy to implement.

whalecold · 2020-08-27T12:21:10Z

> The first SGTM. I think it works for tf-operator and is easy to implement.

Done, PTAL

gaocegege · 2020-10-09T05:17:22Z

I think we can update the vendor to use the latest common.

whalecold · 2020-10-12T06:25:36Z

> I think we can update the vendor to use the latest common.

Since the v0.3.1 tag is not the latest, we should release a new tag first.

stale · 2021-01-10T13:58:54Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

jlewi added kind/feature priority/p1 area/tfjob labels Feb 5, 2020

issue-label-bot bot added the feature label Feb 5, 2020

jlewi removed the feature label Mar 20, 2020

k8s-ci-robot assigned whalecold Aug 20, 2020

whalecold mentioned this issue Aug 27, 2020

Feat(event): record the tfjob event from the abnormal pod status kubeflow/common#101

Merged

Jeffwan mentioned this issue Sep 1, 2020

[Release 1.2] Feature Planning / Roadmap kubeflow/kubeflow#5224

Closed

stale bot added the lifecycle/stale label Jan 10, 2021

stale bot closed this as completed Jan 17, 2021
