Training Operator in CrashLoopBackOff #1717

Closed

ReggieCarey opened this issue Jan 6, 2023 · 7 comments

Comments

@ReggieCarey
WHAT DID YOU DO:

Deployed Kubeflow 1.6.0 using manifests (single command) into a v1.25.4 Kubernetes cluster.

EXPECTED:

TrainingOperator runs without failure

ACTUAL:

TrainingOperator constantly restarts with CrashLoopBackOff

DETAILS: Status Block of Training Operator

status:
  phase: Running
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2023-01-06T17:35:06Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2023-01-06T17:35:06Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [training-operator]'
    - type: ContainersReady
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2023-01-06T17:35:06Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [training-operator]'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2023-01-06T17:35:06Z'
  hostIP: 192.168.20.35
  podIP: 172.25.98.242
  podIPs:
    - ip: 172.25.98.242
  startTime: '2023-01-06T17:35:06Z'
  containerStatuses:
    - name: training-operator
      state:
        waiting:
          reason: CrashLoopBackOff
          message: >-
            back-off 5m0s restarting failed container=training-operator
            pod=training-operator-546966c58f-8jjph_kubeflow(ce66c232-815c-422c-86f7-0b6ed44e9c3e)
      lastState:
        terminated:
          exitCode: 137
          reason: OOMKilled
          startedAt: '2023-01-06T17:47:09Z'
          finishedAt: '2023-01-06T17:47:17Z'
          containerID: >-
            containerd://ada3d8a4408ae3c99aac017fd0dc23264fc2ece2b18464f354a0f7c424b0e0fe
      ready: false
      restartCount: 7
      image: docker.io/kubeflow/training-operator:v1-e1434f6
      imageID: >-
        docker.io/kubeflow/training-operator@sha256:ff847e2b6af07389a4a4ce73b7444f2f0741ad41e6066345ae858254cf6a562f
      containerID: >-
        containerd://ada3d8a4408ae3c99aac017fd0dc23264fc2ece2b18464f354a0f7c424b0e0fe
      started: false
  qosClass: Burstable
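
(Note for readers triaging the same symptom: exitCode 137 with reason OOMKilled means the kernel terminated the container for exceeding its memory limit. A quick way to confirm the restart reason, using the pod name from this report, is:

kubectl -n kubeflow get pod training-operator-546966c58f-8jjph -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')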

LOGS FROM Training Operator

2023-01-06T12:52:31-05:00 I0106 17:52:31.691141       1 request.go:601] Waited for 1.000357678s due to client-side throttling, not priority and fairness, request: GET:https://172.25.0.1:443/apis/networking.istio.io/v1alpha3?timeout=32s
2023-01-06T12:52:35-05:00 1.6730275553966036e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
2023-01-06T12:52:35-05:00 1.6730275553994968e+09	INFO	setup	starting manager
2023-01-06T12:52:35-05:00 1.673027555490869e+09	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
2023-01-06T12:52:35-05:00 1.6730275554909165e+09	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
2023-01-06T12:52:35-05:00 1.6730275554911332e+09	INFO	Starting EventSource	{"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
2023-01-06T12:52:35-05:00 1.673027555491248e+09	INFO	Starting EventSource	{"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
2023-01-06T12:52:35-05:00 1.6730275554911945e+09	INFO	Starting EventSource	{"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
2023-01-06T12:52:35-05:00 1.673027555491272e+09	INFO	Starting EventSource	{"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
2023-01-06T12:52:35-05:00 1.6730275554913e+09	INFO	Starting Controller	{"controller": "tfjob-controller"}
2023-01-06T12:52:35-05:00 1.6730275554913092e+09	INFO	Starting EventSource	{"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
2023-01-06T12:52:35-05:00 1.6730275554913273e+09	INFO	Starting EventSource	{"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
2023-01-06T12:52:35-05:00 1.6730275554913373e+09	INFO	Starting Controller	{"controller": "mxjob-controller"}
2023-01-06T12:52:35-05:00 1.6730275554913096e+09	INFO	Starting EventSource	{"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
2023-01-06T12:52:35-05:00 1.6730275554914267e+09	INFO	Starting EventSource	{"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
2023-01-06T12:52:35-05:00 1.6730275554914446e+09	INFO	Starting EventSource	{"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
2023-01-06T12:52:35-05:00 1.6730275554914546e+09	INFO	Starting Controller	{"controller": "xgboostjob-controller"}
2023-01-06T12:52:35-05:00 1.6730275554914246e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
2023-01-06T12:52:35-05:00 1.6730275554914916e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
2023-01-06T12:52:35-05:00 1.6730275554914904e+09	INFO	Starting EventSource	{"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
2023-01-06T12:52:35-05:00 1.6730275554915273e+09	INFO	Starting EventSource	{"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
2023-01-06T12:52:35-05:00 1.6730275554915442e+09	INFO	Starting EventSource	{"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
2023-01-06T12:52:35-05:00 1.6730275554915557e+09	INFO	Starting Controller	{"controller": "pytorchjob-controller"}
2023-01-06T12:52:35-05:00 1.6730275554915254e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
2023-01-06T12:52:35-05:00 1.6730275554915824e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
2023-01-06T12:52:35-05:00 1.6730275554915977e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
2023-01-06T12:52:35-05:00 1.6730275554916086e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
2023-01-06T12:52:35-05:00 1.673027555491626e+09	INFO	Starting Controller	{"controller": "mpijob-controller"}
2023-01-06T12:52:37-05:00 I0106 17:52:37.091771       1 trace.go:205] Trace[1095383938]: "DeltaFIFO Pop Process" ID:olm/ababafea8a7d7684f918b1fcd8325da39a88f276761bfe895ca2046042a4c85,Depth:32,Reason:slow event handlers blocking the queue (06-Jan-2023 17:52:36.793) (total time: 298ms):
2023-01-06T12:52:37-05:00 Trace[1095383938]: [298.103765ms] [298.103765ms] END
2023-01-06T12:52:37-05:00 I0106 17:52:37.490491       1 trace.go:205] Trace[1967622247]: "DeltaFIFO Pop Process" ID:kube-system/token-cleaner,Depth:185,Reason:slow event handlers blocking the queue (06-Jan-2023 17:52:37.093) (total time: 397ms):
2023-01-06T12:52:37-05:00 Trace[1967622247]: [397.302525ms] [397.302525ms] END
@johnugeorge
Member

Can you increase the memory resources for the training-operator deployment?
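
One way to do that is to patch the deployment in place. A minimal sketch (the kubeflow namespace and container index match this report; the 512Mi request and 1Gi limit are illustrative values, not an official recommendation):

kubectl -n kubeflow patch deployment training-operator --type json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/resources", "value": {"requests": {"memory": "512Mi"}, "limits": {"memory": "1Gi"}}}]'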

@johnugeorge
Member

Related: #1693

There are multiple issues with the default deployment manifests, in which memory resource requests are not set.

/cc @terrytangyuan @gaocegege @tenzen-y @zw0610

@tenzen-y
Member

tenzen-y commented Jan 7, 2023

@johnugeorge As far as I remember, we removed the resources field from the manifests.

#1668

Since computing resource requirements depend on cluster size, it is difficult to provide optimal resource requirements for all users...
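
For users who need to set their own values, here is a sketch of a kustomize overlay that layers a resources patch on top of the upstream manifests (deployment name and namespace as used above; the memory figures are illustrative):

# kustomization.yaml
resources:
  - github.com/kubeflow/training-operator/manifests/overlays/standalone
patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: training-operator
        namespace: kubeflow
      spec:
        template:
          spec:
            containers:
              - name: training-operator
                resources:
                  requests:
                    memory: 512Mi
                  limits:
                    memory: 1Gi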

@johnugeorge
Member

Yes. It is a difficult choice.

@yangoos57
Copy link

Try this command; it solved the problem for me.

It deploys the master-branch version of the manifests:

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
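
(If you prefer a pinned release rather than master, kustomize remote targets accept a ref query. The tag below is illustrative; check the repository's releases for one matching your Kubeflow version:

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0")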

@hongbo-miao

hongbo-miao commented Jul 14, 2023

I tried @yangoos57's solution at #1717 (comment); unfortunately, it did not work for me.

Here is another ticket with the same issue; I summarized the working version at #1841 (comment).
Hopefully it saves some time for others who hit the same issue ☺️

@johnugeorge
Member

Closing this, as it is deployment-environment specific.
