SIGSEGV in training-operator container #1799
Comments
After deploying freshly, it seems to stay in |
I should mention that I installed the operator with the following:
And here's my k8s:
|
It looks like spec.Replicas is a nil pointer.
Could you provide the MPIJob YAML? |
Ah, that was it. I had no value for |
We can add a default in the next release |
agree. |
/assign |
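For readers following the thread, here is a minimal Go sketch of the kind of nil-pointer guard and defaulting discussed above. It is not the operator's actual code: `ReplicaSpec`, `setDefaultReplicas`, and `replicasOrDefault` are hypothetical names, and the default value of 1 is an assumption, since the thread does not say which default will be used.

```go
package main

import "fmt"

// ReplicaSpec is a hypothetical, trimmed-down stand-in for the operator's
// replica spec type; only the optional Replicas field is modelled here.
type ReplicaSpec struct {
	Replicas *int32 // nil when the field is omitted from the YAML
}

// setDefaultReplicas fills in a default when Replicas was omitted.
// The default of 1 is an assumption made for this sketch.
func setDefaultReplicas(spec *ReplicaSpec) {
	if spec == nil {
		return
	}
	if spec.Replicas == nil {
		one := int32(1)
		spec.Replicas = &one
	}
}

// replicasOrDefault reads Replicas without dereferencing a nil pointer,
// which is the kind of access that would otherwise panic with SIGSEGV.
func replicasOrDefault(spec *ReplicaSpec, def int32) int32 {
	if spec == nil || spec.Replicas == nil {
		return def
	}
	return *spec.Replicas
}

func main() {
	spec := &ReplicaSpec{} // replicas omitted, so the pointer is nil

	// Safe read before defaulting.
	fmt.Println(replicasOrDefault(spec, 1))

	// After defaulting, a direct dereference is safe.
	setDefaultReplicas(spec)
	fmt.Println(*spec.Replicas)
}
```

Either approach avoids the crash on its own: defaulting at admission or reconcile time means later code can dereference freely, while the guarded read protects individual call sites even if defaulting is skipped.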
The training-operator pod seems to continually crash:

    ~$ k get pod -n kubeflow training-operator-64886bdddc-q6nr5 -w
    NAME                                 READY   STATUS             RESTARTS         AGE
    training-operator-64886bdddc-q6nr5   0/1     CrashLoopBackOff   20 (24s ago)     79m
    training-operator-64886bdddc-q6nr5   0/1     Running            21 (90s ago)     80m
    training-operator-64886bdddc-q6nr5   0/1     Error              21 (109s ago)    80m
    training-operator-64886bdddc-q6nr5   0/1     CrashLoopBackOff   21 (5s ago)      81m
    training-operator-64886bdddc-q6nr5   0/1     Running            22 (2m47s ago)   83m
    training-operator-64886bdddc-q6nr5   1/1     Running            22 (3m6s ago)    84m
    training-operator-64886bdddc-q6nr5   0/1     Error              22 (3m7s ago)    84m
    training-operator-64886bdddc-q6nr5   0/1     CrashLoopBackOff   22 (13s ago)     84m
    training-operator-64886bdddc-q6nr5   0/1     Running            23 (5m11s ago)   89m
    training-operator-64886bdddc-q6nr5   1/1     Running            23 (5m30s ago)   89m
    training-operator-64886bdddc-q6nr5   0/1     Error              23 (5m30s ago)   89m
    training-operator-64886bdddc-q6nr5   0/1     CrashLoopBackOff   23 (9s ago)      89m
Here are the logs:
I've seen it mentioned on previous issues that requested memory or certain probe timeouts might need to be increased, e.g., #1717. Does this seem like one of those cases?