
A job (or a pod of a job) may get stuck in a state that is neither running nor waiting #4141

Open
fanyangCS opened this issue Jan 15, 2020 · 1 comment

Comments

@fanyangCS
Contributor

We plan to introduce a new job state, "initializing", to describe a pod whose resources have been reserved by the scheduler but whose container has not yet started running.

While a pod is in the initializing state, the platform is pulling the Docker image and preparing the environment the pod needs. It is possible for a pod (and the job it belongs to) to get stuck in the "initializing" state for a very long time.
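For reference, this window can already be recognized from information Kubernetes exposes: a pod whose `PodScheduled` condition is `True` but which has no running container is exactly in this state. Below is a minimal sketch using the official `kubernetes` Python client, not the platform's actual implementation; the pod name and namespace are hypothetical.

```python
# Sketch: classify a pod as "initializing" when it is scheduled but no
# user container is running yet. Not the platform's implementation.
from kubernetes import client, config


def is_initializing(pod) -> bool:
    """True if the pod is bound to a node but none of its containers run yet."""
    conditions = pod.status.conditions or []
    scheduled = any(c.type == "PodScheduled" and c.status == "True"
                    for c in conditions)
    statuses = pod.status.container_statuses or []
    any_running = any(s.state and s.state.running for s in statuses)
    return scheduled and not any_running


if __name__ == "__main__":
    config.load_kube_config()                    # or load_incluster_config()
    v1 = client.CoreV1Api()
    # "example-job-pod-0" / "default" are placeholder names for illustration.
    pod = v1.read_namespaced_pod("example-job-pod-0", "default")
    print("initializing" if is_initializing(pod) else pod.status.phase)
```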

Therefore, for the general case (a long wait after being scheduled, such as a slow image pull or a Docker container creation error that Kubernetes keeps retrying):

  1. We plan to add a new job and task state, such as "allocated" or "initializing", to indicate that the job's resource usage is already being counted but the user's binary is not yet running. (It is our resource accounting boundary state.)
    This state is readable by both machines (programs) and humans.
  2. We plan to expose (and refine/enrich) the backend Kubernetes events, so that users can understand in detail why the job is in its current state (see the sketch after this list).
    These events are readable only by humans.
    Examples:
    “Failed to pull image "pytorch:pytorch-stable-py37": rpc error: code = Unknown desc = Error response from daemon: pull access denied for pytorch, repository does not exist or may require 'docker login'”
    “Pulling Image”
    “Mounting Storage”
    “Preparing SSH Server”
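
Below is a sketch of how such backend events could be read back from Kubernetes for display, again using the `kubernetes` Python client; the pod name and namespace are placeholders, and this is only an illustration of the idea, not the planned implementation.

```python
# Sketch: surface the Kubernetes events recorded for a pod, e.g. image-pull
# failures or retries, so the user can see why the pod is still initializing.
from kubernetes import client, config

config.load_kube_config()                        # or load_incluster_config()
v1 = client.CoreV1Api()

namespace, pod_name = "default", "example-job-pod-0"   # hypothetical names
selector = f"involvedObject.kind=Pod,involvedObject.name={pod_name}"
events = v1.list_namespaced_event(namespace, field_selector=selector)

for ev in events.items:
    # ev.reason is a short machine-readable code (e.g. "Failed", "Pulling");
    # ev.message is the human-readable detail to show to the user.
    print(f"[{ev.type}] {ev.reason}: {ev.message}")
```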
@yqwang-ms
Member

Related to #3572
