SIGSEGV in training-operator container #1799

Closed · benash opened this issue May 3, 2023 · 7 comments · Fixed by #1800

benash commented May 3, 2023

The training-operator pod seems to continually crash:

~$ k get pod -n kubeflow training-operator-64886bdddc-q6nr5 -w
NAME                                 READY   STATUS             RESTARTS       AGE
training-operator-64886bdddc-q6nr5   0/1     CrashLoopBackOff   20 (24s ago)   79m
training-operator-64886bdddc-q6nr5   0/1     Running            21 (90s ago)   80m
training-operator-64886bdddc-q6nr5   0/1     Error              21 (109s ago)   80m
training-operator-64886bdddc-q6nr5   0/1     CrashLoopBackOff   21 (5s ago)     81m
training-operator-64886bdddc-q6nr5   0/1     Running            22 (2m47s ago)   83m
training-operator-64886bdddc-q6nr5   1/1     Running            22 (3m6s ago)    84m
training-operator-64886bdddc-q6nr5   0/1     Error              22 (3m7s ago)    84m
training-operator-64886bdddc-q6nr5   0/1     CrashLoopBackOff   22 (13s ago)     84m
training-operator-64886bdddc-q6nr5   0/1     Running            23 (5m11s ago)   89m
training-operator-64886bdddc-q6nr5   1/1     Running            23 (5m30s ago)   89m
training-operator-64886bdddc-q6nr5   0/1     Error              23 (5m30s ago)   89m
training-operator-64886bdddc-q6nr5   0/1     CrashLoopBackOff   23 (9s ago)      89m

Here are the logs:

$ k logs -n kubeflow training-operator-64886bdddc-q6nr5
1.6831492767944047e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
I0503 21:27:57.936384       1 request.go:682] Waited for 1.047013258s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/apps/v1?timeout=32s
I0503 21:28:07.936448       1 request.go:682] Waited for 1.447652758s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/scheduling.k8s.io/v1?timeout=32s
1.6831492959898794e+09	INFO	setup	starting manager
1.6831492959905822e+09	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.683149295990631e+09	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
1.6831492959908807e+09	INFO	Starting EventSource	{"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
1.6831492959908864e+09	INFO	Starting EventSource	{"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
1.6831492959910178e+09	INFO	Starting EventSource	{"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
1.6831492959909546e+09	INFO	Starting EventSource	{"controller": "paddlejob-controller", "source": "kind source: *v1.PaddleJob"}
1.6831492959910433e+09	INFO	Starting EventSource	{"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
1.6831492959910588e+09	INFO	Starting Controller	{"controller": "pytorchjob-controller"}
1.6831492959910316e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
1.6831492959909766e+09	INFO	Starting EventSource	{"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
1.683149295991097e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
1.68314929599107e+09	INFO	Starting EventSource	{"controller": "paddlejob-controller", "source": "kind source: *v1.Pod"}
1.6831492959911213e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
1.6831492959911323e+09	INFO	Starting EventSource	{"controller": "paddlejob-controller", "source": "kind source: *v1.Service"}
1.6831492959911401e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
1.6831492959911938e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
1.68314929599115e+09	INFO	Starting Controller	{"controller": "paddlejob-controller"}
1.6831492959911199e+09	INFO	Starting EventSource	{"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
1.6831492959912286e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
1.683149295991192e+09	INFO	Starting EventSource	{"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
1.6831492959912865e+09	INFO	Starting EventSource	{"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
1.6831492959913003e+09	INFO	Starting Controller	{"controller": "tfjob-controller"}
1.683149295991317e+09	INFO	Starting EventSource	{"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
1.6831492959913347e+09	INFO	Starting Controller	{"controller": "mxjob-controller"}
1.6831492959912405e+09	INFO	Starting EventSource	{"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
1.6831492959913042e+09	INFO	Starting Controller	{"controller": "mpijob-controller"}
1.683149295991399e+09	INFO	Starting EventSource	{"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
1.6831492959914432e+09	INFO	Starting EventSource	{"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
1.6831492959914646e+09	INFO	Starting Controller	{"controller": "xgboostjob-controller"}
time="2023-05-03T21:28:16Z" level=info msg="MPIJob default/distributed-pytorch is created."
1.6831492960953653e+09	INFO	Starting workers	{"controller": "pytorchjob-controller", "worker count": 1}
1.6831492960979536e+09	INFO	Starting workers	{"controller": "mxjob-controller", "worker count": 1}
1.683149296098295e+09	INFO	Starting workers	{"controller": "mpijob-controller", "worker count": 1}
1.6831492960982726e+09	INFO	Starting workers	{"controller": "xgboostjob-controller", "worker count": 1}
1.6831492960983038e+09	INFO	Starting workers	{"controller": "paddlejob-controller", "worker count": 1}
1.683149296098327e+09	INFO	Starting workers	{"controller": "tfjob-controller", "worker count": 1}
time="2023-05-03T21:28:16Z" level=info msg="Reconciling for job distributed-pytorch"
1.6831492960995376e+09	DEBUG	events	ServiceAccount: distributed-pytorch-launcher	{"type": "Normal", "object": {"kind":"MPIJob","namespace":"default","name":"distributed-pytorch","uid":"3f79ab3c-01ef-40b1-9933-28d4850aab97","apiVersion":"kubeflow.org/v1","resourceVersion":"14930637"}, "reason": "ServiceAccount is exist"}
1.6831492960998297e+09	DEBUG	events	LauncherRole: distributed-pytorch-launcher	{"type": "Normal", "object": {"kind":"MPIJob","namespace":"default","name":"distributed-pytorch","uid":"3f79ab3c-01ef-40b1-9933-28d4850aab97","apiVersion":"kubeflow.org/v1","resourceVersion":"14930637"}, "reason": "LauncherRole is exist"}
1.6831492960998945e+09	DEBUG	events	RoleBinding: distributed-pytorch-launcher	{"type": "Normal", "object": {"kind":"MPIJob","namespace":"default","name":"distributed-pytorch","uid":"3f79ab3c-01ef-40b1-9933-28d4850aab97","apiVersion":"kubeflow.org/v1","resourceVersion":"14930637"}, "reason": "RoleBinding is exist"}
1.683149296100283e+09	DEBUG	events	ServiceAccount: distributed-pytorch-launcher	{"type": "Normal", "object": {"kind":"MPIJob","namespace":"default","name":"distributed-pytorch","uid":"3f79ab3c-01ef-40b1-9933-28d4850aab97","apiVersion":"kubeflow.org/v1","resourceVersion":"14930637"}, "reason": "ServiceAccount is exist"}
1.6831492961005068e+09	DEBUG	events	LauncherRole: distributed-pytorch-launcher	{"type": "Normal", "object": {"kind":"MPIJob","namespace":"default","name":"distributed-pytorch","uid":"3f79ab3c-01ef-40b1-9933-28d4850aab97","apiVersion":"kubeflow.org/v1","resourceVersion":"14930637"}, "reason": "LauncherRole is exist"}
1.6831492961005259e+09	DEBUG	events	RoleBinding: distributed-pytorch-launcher	{"type": "Normal", "object": {"kind":"MPIJob","namespace":"default","name":"distributed-pytorch","uid":"3f79ab3c-01ef-40b1-9933-28d4850aab97","apiVersion":"kubeflow.org/v1","resourceVersion":"14930637"}, "reason": "RoleBinding is exist"}
1.683149296100843e+09	INFO	Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference	{"controller": "mpijob-controller", "object": {"name":"distributed-pytorch","namespace":"default"}, "namespace": "default", "name": "distributed-pytorch", "reconcileID": "3684a27a-dd10-4f63-be0b-f732937ff5d5"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x165dd6e]

goroutine 1919 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118 +0x1f4
panic({0x18a95e0, 0x2a95b90})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/kubeflow/training-operator/pkg/controller.v1/mpi.(*MPIJobReconciler).UpdateJobStatus(0xc0003f0280, {0x1ac0a80?, 0xc000e51860?}, 0x1ac0a80?, 0xc00089c780)
	/workspace/pkg/controller.v1/mpi/mpijob_controller.go:597 +0x16e
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc0003f0280, {0x1ac0a80, 0xc000e51860}, 0x1dbaf40?, {{0xc0007148c0, 0x1, 0x1}, 0x0, 0x0, 0x0, ...}, ...)
	/go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:333 +0x1e35
github.com/kubeflow/training-operator/pkg/controller.v1/mpi.(*MPIJobReconciler).Reconcile(0xc0003f0280, {0x1da5158, 0xc0006c7110}, {{{0xc000724378, 0x7}, {0xc0012980d8, 0x13}}})
	/workspace/pkg/controller.v1/mpi/mpijob_controller.go:161 +0x472
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1da50b0?, {0x1da5158?, 0xc0006c7110?}, {{{0xc000724378?, 0x1a2ef20?}, {0xc0012980d8?, 0x4045d4?}}})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000b172c0, {0x1da50b0, 0xc000a9f700}, {0x1938ce0?, 0xc0009bc0a0?})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320 +0x33c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000b172c0, {0x1da50b0, 0xc000a9f700})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:230 +0x333

I've seen it mentioned in previous issues that the requested memory or certain probe timeouts might need to be increased (e.g., #1717). Does this seem like one of those cases?

benash commented May 3, 2023

After a fresh deployment, it stays in Running without problems until the instant I launch an MPIJob, at which point it crashes.

benash commented May 3, 2023

I should mention that I installed the operator with the following:

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0"

And here's my Kubernetes version info:

Client Version: v1.26.1
Kustomize Version: v4.5.7
Server Version: v1.26.1

Syulin7 commented May 4, 2023

From mpijob_controller.go:597 (see the stack trace above):

expected := *(spec.Replicas) - succeeded

It looks like spec.Replicas is a nil pointer.
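
For context, here is a minimal sketch of the failure mode; the types below are simplified stand-ins, not the actual training-operator structs. In the API types the replica count is a pointer (*int32), so omitting replicas from the manifest leaves it nil, and the dereference above segfaults.

package main

import "fmt"

// ReplicaSpec is a simplified stand-in for the kubeflow common ReplicaSpec type.
type ReplicaSpec struct {
	Replicas *int32 // nil when the field is omitted from the manifest
}

func main() {
	launcher := ReplicaSpec{} // mpiReplicaSpecs.Launcher with no replicas set
	var succeeded int32
	// Mirrors `expected := *(spec.Replicas) - succeeded` in UpdateJobStatus:
	// dereferencing the nil pointer panics with SIGSEGV.
	expected := *launcher.Replicas - succeeded
	fmt.Println(expected)
}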

> After a fresh deployment, it stays in Running without problems until the instant I launch an MPIJob, at which point it crashes.

Could you provide the MPIJob YAML?

benash commented May 4, 2023

Ah, that was it. I had no value for mpiReplicaSpecs.Launcher.replicas, thinking that it would default to 1. Adding that back into the spec prevents the bad behavior I observed earlier.

johnugeorge commented

We can add a default in the next release

tenzen-y commented May 5, 2023

> We can add a default in the next release

Agree.

Syulin7 commented May 6, 2023

> We can add a default in the next release

/assign
I will add a default for custom jobs.
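
A rough sketch of what such a default might look like; the names and types below are illustrative stand-ins, not the actual training-operator defaulting code:

package main

import "fmt"

// ReplicaSpec is again a simplified stand-in for the kubeflow common type.
type ReplicaSpec struct {
	Replicas *int32
}

// setDefaultReplicas fills in replicas = 1 when the user omits the field,
// so the status update never dereferences a nil pointer.
func setDefaultReplicas(spec *ReplicaSpec) {
	if spec != nil && spec.Replicas == nil {
		one := int32(1)
		spec.Replicas = &one
	}
}

func main() {
	launcher := &ReplicaSpec{} // replicas omitted, as in the crashing MPIJob
	setDefaultReplicas(launcher)
	fmt.Println(*launcher.Replicas) // prints 1
}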
