Training Operator in CrashLoopBackOff #1717

Closed

ReggieCarey opened this issue Jan 6, 2023 · 7 comments

Comments

@ReggieCarey
WHAT DID YOU DO:

Deployed Kubeflow 1.6.0 using manifests (single command) into a v1.25.4 Kubernetes cluster.

EXPECTED:

TrainingOperator runs without failure

ACTUAL:

TrainingOperator constantly restarts with CrashLoopBackOff

DETAILS: Status Block of Training Operator

status:
  phase: Running
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2023-01-06T17:35:06Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2023-01-06T17:35:06Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [training-operator]'
    - type: ContainersReady
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2023-01-06T17:35:06Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [training-operator]'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2023-01-06T17:35:06Z'
  hostIP: 192.168.20.35
  podIP: 172.25.98.242
  podIPs:
    - ip: 172.25.98.242
  startTime: '2023-01-06T17:35:06Z'
  containerStatuses:
    - name: training-operator
      state:
        waiting:
          reason: CrashLoopBackOff
          message: >-
            back-off 5m0s restarting failed container=training-operator
            pod=training-operator-546966c58f-8jjph_kubeflow(ce66c232-815c-422c-86f7-0b6ed44e9c3e)
      lastState:
        terminated:
          exitCode: 137
          reason: OOMKilled
          startedAt: '2023-01-06T17:47:09Z'
          finishedAt: '2023-01-06T17:47:17Z'
          containerID: >-
            containerd://ada3d8a4408ae3c99aac017fd0dc23264fc2ece2b18464f354a0f7c424b0e0fe
      ready: false
      restartCount: 7
      image: docker.io/kubeflow/training-operator:v1-e1434f6
      imageID: >-
        docker.io/kubeflow/training-operator@sha256:ff847e2b6af07389a4a4ce73b7444f2f0741ad41e6066345ae858254cf6a562f
      containerID: >-
        containerd://ada3d8a4408ae3c99aac017fd0dc23264fc2ece2b18464f354a0f7c424b0e0fe
      started: false
  qosClass: Burstable
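
(Note for readers triaging the same symptom: exitCode 137 with reason OOMKilled means the kernel terminated the container for exceeding its memory limit. A quick way to confirm the restart reason, using the pod name from this report, is:

kubectl -n kubeflow get pod training-operator-546966c58f-8jjph -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')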

LOGS FROM Training Operator

2023-01-06T12:52:31-05:00 I0106 17:52:31.691141       1 request.go:601] Waited for 1.000357678s due to client-side throttling, not priority and fairness, request: GET:https://172.25.0.1:443/apis/networking.istio.io/v1alpha3?timeout=32s
2023-01-06T12:52:35-05:00 1.6730275553966036e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
2023-01-06T12:52:35-05:00 1.6730275553994968e+09	INFO	setup	starting manager
2023-01-06T12:52:35-05:00 1.673027555490869e+09	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
2023-01-06T12:52:35-05:00 1.6730275554909165e+09	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
2023-01-06T12:52:35-05:00 1.6730275554911332e+09	INFO	Starting EventSource	{"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
2023-01-06T12:52:35-05:00 1.673027555491248e+09	INFO	Starting EventSource	{"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
2023-01-06T12:52:35-05:00 1.6730275554911945e+09	INFO	Starting EventSource	{"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
2023-01-06T12:52:35-05:00 1.673027555491272e+09	INFO	Starting EventSource	{"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
2023-01-06T12:52:35-05:00 1.6730275554913e+09	INFO	Starting Controller	{"controller": "tfjob-controller"}
2023-01-06T12:52:35-05:00 1.6730275554913092e+09	INFO	Starting EventSource	{"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
2023-01-06T12:52:35-05:00 1.6730275554913273e+09	INFO	Starting EventSource	{"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
2023-01-06T12:52:35-05:00 1.6730275554913373e+09	INFO	Starting Controller	{"controller": "mxjob-controller"}
2023-01-06T12:52:35-05:00 1.6730275554913096e+09	INFO	Starting EventSource	{"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
2023-01-06T12:52:35-05:00 1.6730275554914267e+09	INFO	Starting EventSource	{"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
2023-01-06T12:52:35-05:00 1.6730275554914446e+09	INFO	Starting EventSource	{"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
2023-01-06T12:52:35-05:00 1.6730275554914546e+09	INFO	Starting Controller	{"controller": "xgboostjob-controller"}
2023-01-06T12:52:35-05:00 1.6730275554914246e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
2023-01-06T12:52:35-05:00 1.6730275554914916e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
2023-01-06T12:52:35-05:00 1.6730275554914904e+09	INFO	Starting EventSource	{"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
2023-01-06T12:52:35-05:00 1.6730275554915273e+09	INFO	Starting EventSource	{"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
2023-01-06T12:52:35-05:00 1.6730275554915442e+09	INFO	Starting EventSource	{"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
2023-01-06T12:52:35-05:00 1.6730275554915557e+09	INFO	Starting Controller	{"controller": "pytorchjob-controller"}
2023-01-06T12:52:35-05:00 1.6730275554915254e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
2023-01-06T12:52:35-05:00 1.6730275554915824e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
2023-01-06T12:52:35-05:00 1.6730275554915977e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
2023-01-06T12:52:35-05:00 1.6730275554916086e+09	INFO	Starting EventSource	{"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
2023-01-06T12:52:35-05:00 1.673027555491626e+09	INFO	Starting Controller	{"controller": "mpijob-controller"}
2023-01-06T12:52:37-05:00 I0106 17:52:37.091771       1 trace.go:205] Trace[1095383938]: "DeltaFIFO Pop Process" ID:olm/ababafea8a7d7684f918b1fcd8325da39a88f276761bfe895ca2046042a4c85,Depth:32,Reason:slow event handlers blocking the queue (06-Jan-2023 17:52:36.793) (total time: 298ms):
2023-01-06T12:52:37-05:00 Trace[1095383938]: [298.103765ms] [298.103765ms] END
2023-01-06T12:52:37-05:00 I0106 17:52:37.490491       1 trace.go:205] Trace[1967622247]: "DeltaFIFO Pop Process" ID:kube-system/token-cleaner,Depth:185,Reason:slow event handlers blocking the queue (06-Jan-2023 17:52:37.093) (total time: 397ms):
2023-01-06T12:52:37-05:00 Trace[1967622247]: [397.302525ms] [397.302525ms] END
@johnugeorge
Member

Can you increase the memory resources for the training-operator deployment?
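
One way to do that is to patch the deployment in place. A minimal sketch (the kubeflow namespace and container index match this report; the 512Mi request and 1Gi limit are illustrative values, not an official recommendation):

kubectl -n kubeflow patch deployment training-operator --type json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/resources", "value": {"requests": {"memory": "512Mi"}, "limits": {"memory": "1Gi"}}}]'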

@johnugeorge
Member

Related: #1693

There are multiple issues with the default deployment manifests, in which memory resource requests are not set.

/cc @terrytangyuan @gaocegege @tenzen-y @zw0610

@tenzen-y
Member

tenzen-y commented Jan 7, 2023

@johnugeorge As far as I remember, we removed the resources field from the manifests.

#1668

Since computing resource requirements depend on cluster size, it is difficult to provide optimal resource requirements for all users...
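
For users who need to set their own values, here is a sketch of a kustomize overlay that layers a resources patch on top of the upstream manifests (deployment name and namespace as used above; the memory figures are illustrative):

# kustomization.yaml
resources:
  - github.com/kubeflow/training-operator/manifests/overlays/standalone
patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: training-operator
        namespace: kubeflow
      spec:
        template:
          spec:
            containers:
              - name: training-operator
                resources:
                  requests:
                    memory: 512Mi
                  limits:
                    memory: 1Gi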

@johnugeorge
Member

Yes. It is a difficult choice.

@yangoos57
Copy link

Try this command; it solved the problem for me.

It deploys the master-branch version of the manifests:

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
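
(If you prefer a pinned release rather than master, kustomize remote targets accept a ref query. The tag below is illustrative; check the repository's releases for one matching your Kubeflow version:

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0")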

@hongbo-miao

hongbo-miao commented Jul 14, 2023

I tried @yangoos57's solution at #1717 (comment); unfortunately, it did not work for me.

Here is another ticket with the same issue; I summarized the working version at #1841 (comment).
Hopefully it saves some time for others who hit the same issue ☺️

@johnugeorge
Member

Closing this, as it is deployment-environment specific.
