Training Operator in CrashLoopBackOff #1717
Comments
Can you increase the memory resources for the training-operator deployment?
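A minimal sketch of doing this in place with `kubectl patch`; the `kubeflow` namespace, the `training-operator` deployment name, and the sizes below are assumptions, so pick values that fit your cluster:

```sh
# Set CPU/memory requests and a memory limit on the operator container in place.
# Namespace, deployment name, container index, and sizes are assumptions; adjust as needed.
kubectl -n kubeflow patch deployment training-operator --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/resources",
   "value": {"requests": {"cpu": "100m", "memory": "512Mi"},
             "limits":   {"memory": "2Gi"}}}
]'
```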
Related: #1693. There are multiple issues with the default deployment manifests in which memory resource requests are not set.
@johnugeorge As far as I remember, we removed the resources field from the manifests. Since compute resource requirements depend on cluster size, it is difficult to provide an optimal resource requirement for all users...
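Since the upstream manifests leave `resources` unset, one option is to carry your own kustomize overlay patch. A sketch, assuming the deployment and container are both named `training-operator` in the `kubeflow` namespace; file names and sizes are placeholders:

```sh
# Write a strategic-merge patch that pins resources, then reference it from
# your kustomization.yaml (names and sizes below are assumptions).
cat <<'EOF' > training-operator-resources-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
        - name: training-operator
          resources:
            requests:
              cpu: 100m
              memory: 512Mi
            limits:
              memory: 2Gi
EOF

cat <<'EOF' >> kustomization.yaml
patchesStrategicMerge:
  - training-operator-resources-patch.yaml
EOF
```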
Yes, it is a difficult choice.
Try this command; it solved the problem for me. The command is for the master-branch version.
Increase the Resources suggested in kubeflow/training-operator#1717
I tried @yangoos57's solution at #1717 (comment); unfortunately, it does not work for me. Here is another ticket with the same issue; I summarized the working version at #1841 (comment).
Closing this, as it is deployment-environment specific.
WHAT DID YOU DO:
Deployed Kubeflow 1.6.0 using manifests (single command) into a v1.25.4 Kubernetes cluster.
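For context, the "single command" install from kubeflow/manifests is roughly the following; the exact form may differ between manifest releases:

```sh
# Apply the full example stack, retrying until all CRDs and webhooks are ready.
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources"
  sleep 10
done
```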
EXPECTED:
Training Operator runs without failure
ACTUAL:
Training Operator constantly restarts with CrashLoopBackOff
DETAILS: Status Block of Training Operator
LOGS FROM Training Operator
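For anyone reproducing this, the status block and logs can be gathered with commands along these lines (the pod name is a placeholder):

```sh
# Find the operator pod, check why it restarts (e.g. OOMKilled in lastState),
# and pull the logs of the previous, crashed container.
kubectl -n kubeflow get pods | grep training-operator
kubectl -n kubeflow describe pod <training-operator-pod-name>
kubectl -n kubeflow logs <training-operator-pod-name> --previous
```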