Surface Pod and other Errors that Prevent TFJob from starting #1131
Comments
Good point. Will take this up in the next release.
You might want to check out kubeflow/kubeflow#3637 to see how this was solved in the case of notebooks. I think the approach we followed was to replay events from the pod.
@johnugeorge How's this coming? Do you think this will land for 1.1?
/cc @ChanYiLin
@gaocegege @johnugeorge
This is a reasonable request. I think the engineering story is: if pods enter a failed status, filter the events of those pods and create CR events that carry the error messages from the pod side. It should make the 1.1 timeline once we move all operators to common (also targeted for 1.1).
@gaocegege @johnugeorge @Jeffwan Is this on track for 1.1? What is the likelihood it lands this week? If not, should we downgrade it to P2 and remove it from KF 1.1?
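The idea in the comment above (filter the events of failed pods and re-emit them on the custom resource) could look roughly like the sketch below. It is not the operator's actual code: `PodEvent`, `JobEvent`, and `replay_pod_warnings` are hypothetical, simplified stand-ins for the Kubernetes objects, and the only detail taken from Kubernetes itself is that events have a type of `Normal` or `Warning`.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical, simplified stand-ins for Kubernetes objects (assumptions,
# not the real API types used by the operator).
@dataclass
class PodEvent:
    pod_name: str
    type: str      # "Normal" or "Warning", as in Kubernetes events
    reason: str
    message: str


@dataclass
class JobEvent:
    job_name: str
    type: str
    reason: str
    message: str


def replay_pod_warnings(job_name: str, pod_events: List[PodEvent]) -> List[JobEvent]:
    """Copy Warning events from a job's pods onto the job, skipping duplicates."""
    seen = set()
    replayed = []
    for ev in pod_events:
        if ev.type != "Warning":
            continue  # only errors are worth surfacing on the job
        key = (ev.pod_name, ev.reason, ev.message)
        if key in seen:
            continue  # the same pod error is often reported repeatedly
        seen.add(key)
        replayed.append(
            JobEvent(job_name, "Warning", ev.reason, f"pod {ev.pod_name}: {ev.message}")
        )
    return replayed
```

In a real controller the resulting events would be recorded against the TFJob object so that they show up when the user describes the job; how that recording is wired up is outside this sketch.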
Personally, I think it is not on track. @Jeffwan
/cc @whalecold
/assign
The pod error event has been recorded in the common repo, and I can find the pod error event in my Kubernetes cluster. It seems that he didn't use
Sometimes the pod is created successfully, but it fails to schedule.
OK, I have two ideas. One is to use the active pod status, which is
The first SGTM. I think it works for tf-operator and is easy to implement.
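As an illustration of the scheduling case mentioned above: a pod that exists but cannot be placed stays in the `Pending` phase and carries a `PodScheduled` condition with status `False` (typically with reason `Unschedulable`). A minimal sketch of detecting that, using hypothetical simplified types (`Pod`, `PodCondition`, `scheduling_error`) rather than the real client objects:

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical, simplified stand-ins for the Kubernetes Pod status fields.
@dataclass
class PodCondition:
    type: str
    status: str          # "True", "False", or "Unknown"
    reason: str = ""
    message: str = ""


@dataclass
class Pod:
    name: str
    phase: str           # e.g. "Pending", "Running", "Failed"
    conditions: List[PodCondition] = field(default_factory=list)


def scheduling_error(pod: Pod) -> Optional[str]:
    """Return a readable error if the pod is pending because it was not scheduled."""
    if pod.phase != "Pending":
        return None
    for cond in pod.conditions:
        if cond.type == "PodScheduled" and cond.status == "False":
            return f"pod {pod.name} not scheduled ({cond.reason}): {cond.message}"
    return None
```

A message produced this way could be attached to the job's status or emitted as a job event, so the user does not need to inspect individual pods.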
Done, PTAL
I think we can update the vendor to use the latest common.
Since the tag v0.3.1 is not the latest, we should release a new tag first.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
See kubeflow/kubeflow#4711
We need a good way to surface errors in starting or running pods to users. Right now it looks like users would have to look at the operator logs.
Users should be able to do
to see relevant errors and problems.