This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
We plan to introduce a new job state, "initializing", to describe a pod whose resources have been reserved by the scheduler but whose container has not yet started running.
While a pod is in the initializing state, the platform is pulling the Docker image and preparing the environment the pod needs. A pod (and the job it belongs to) can stay stuck in the "initializing" state for a very long time.
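A minimal sketch of how such a state could be derived from a pod's status, assuming the status is available as a plain dict shaped like the Kubernetes `status` object (`phase`, `conditions`, `containerStatuses`). The function name `derive_task_state` and the labels other than "initializing" are hypothetical and only illustrate the idea; this is not the platform's actual implementation.

```python
def derive_task_state(pod_status: dict) -> str:
    """Map a Kubernetes-style pod status dict to a coarse task state.

    Hypothetical labels: "waiting", "running", "succeeded", "failed".
    "initializing" is the state proposed in this issue.
    """
    phase = pod_status.get("phase", "Pending")
    if phase in ("Succeeded", "Failed"):
        return phase.lower()

    # The scheduler marks a placed pod with PodScheduled=True.
    scheduled = any(
        cond.get("type") == "PodScheduled" and cond.get("status") == "True"
        for cond in pod_status.get("conditions", [])
    )
    if not scheduled:
        return "waiting"

    # A container counts as started once it reports a running or
    # terminated state instead of a waiting one.
    started = any(
        "running" in cs.get("state", {}) or "terminated" in cs.get("state", {})
        for cs in pod_status.get("containerStatuses", [])
    )
    # Scheduled but not started: resources are reserved, user code is not running.
    return "running" if started else "initializing"


if __name__ == "__main__":
    pulling = {
        "phase": "Pending",
        "conditions": [{"type": "PodScheduled", "status": "True"}],
        "containerStatuses": [{"state": {"waiting": {"reason": "ContainerCreating"}}}],
    }
    print(derive_task_state(pulling))
```

With this split, resource accounting can start as soon as the state leaves "waiting", while "running" is reserved for pods whose container has actually started.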
Therefore, for the general case (a long wait after scheduling, such as a slow image pull or a Docker container creation error that k8s keeps retrying):
We plan to add a new job and task state, such as allocated or initializing, to indicate that we have already counted its resource usage but the user's binary is not running yet. (It is our resource accounting boundary state.)
The state is readable by both machines (programs) and humans.
We plan to expose (and refine/enrich) the backend k8s events so that users can understand in detail why a job is in its current state.
The state is readable only by humans.
Example:
“Failed to pull image "pytorch:pytorch-stable-py37": rpc error: code = Unknown desc = Error response from daemon: pull access denied for pytorch, repository does not exist or may require 'docker login'”
“Pulling Image”
“Mounting Storage”
“Preparing SSH Server”
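As a rough illustration of how backend events might be turned into messages like the ones above, here is a small sketch. It assumes events arrive as dicts with the Kubernetes event fields `type`, `reason`, and `message`. The `FRIENDLY_LABELS` table, the function name `describe_event`, and the reasons `MountingStorage` and `PreparingSSH` are hypothetical; only `Pulling` is a reason the kubelet is known to emit.

```python
# Hypothetical mapping from event reason to a short, human-readable label.
FRIENDLY_LABELS = {
    "Pulling": "Pulling Image",             # emitted by the kubelet
    "MountingStorage": "Mounting Storage",  # assumed platform-specific reason
    "PreparingSSH": "Preparing SSH Server", # assumed platform-specific reason
}


def describe_event(event: dict) -> str:
    """Return a human-readable line for one event.

    Warning events keep their full message, since the details
    (for example a denied image pull) are what the user needs to see.
    """
    reason = event.get("reason", "")
    message = event.get("message", "")
    if event.get("type") == "Warning" and message:
        return message
    # Normal events get a short label, falling back to the raw text.
    return FRIENDLY_LABELS.get(reason, message or reason)


if __name__ == "__main__":
    print(describe_event({"type": "Normal", "reason": "Pulling",
                          "message": 'Pulling image "pytorch:pytorch-stable-py37"'}))
```

Keeping the label table separate from the state itself means the machine-readable state stays stable while the human-readable detail can be refined over time.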