
A job (or a pod of a job) may get stuck in a state that is neither running nor waiting #4141

Open
fanyangCS opened this issue Jan 15, 2020 · 1 comment

Comments

@fanyangCS
Contributor

We plan to introduce a new job state, "initializing", to describe a pod whose resources have been reserved by the scheduler but whose container has not yet started running.

While a pod is in the initializing state, the platform is pulling the Docker image and preparing the environment the pod needs. It is possible for a pod (and the job it belongs to) to get stuck in the "initializing" state for a very long time.
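For reference, this window can already be recognized from information Kubernetes exposes: a pod whose `PodScheduled` condition is `True` but which has no running container is exactly in this state. Below is a minimal sketch using the official `kubernetes` Python client, not the platform's actual implementation; the pod name and namespace are hypothetical.

```python
# Sketch: classify a pod as "initializing" when it is scheduled but no
# user container is running yet. Not the platform's implementation.
from kubernetes import client, config


def is_initializing(pod) -> bool:
    """True if the pod is bound to a node but none of its containers run yet."""
    conditions = pod.status.conditions or []
    scheduled = any(c.type == "PodScheduled" and c.status == "True"
                    for c in conditions)
    statuses = pod.status.container_statuses or []
    any_running = any(s.state and s.state.running for s in statuses)
    return scheduled and not any_running


if __name__ == "__main__":
    config.load_kube_config()                    # or load_incluster_config()
    v1 = client.CoreV1Api()
    # "example-job-pod-0" / "default" are placeholder names for illustration.
    pod = v1.read_namespaced_pod("example-job-pod-0", "default")
    print("initializing" if is_initializing(pod) else pod.status.phase)
```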

Therefore, for the general case (a long wait after being scheduled, such as a slow image pull or a Docker container creation error that Kubernetes keeps retrying):

  1. We plan to add a new job and task state, such as "allocated" or "initializing", to indicate that the job's resource usage is already being counted but the user's binary is not yet running. (It is our resource accounting boundary state.)
    This state is readable by both machines (programs) and humans.
  2. We plan to expose (and refine/enrich) the backend Kubernetes events, so that users can understand in detail why the job is in its current state (see the sketch after this list).
    These events are readable only by humans.
    Examples:
    “Failed to pull image "pytorch:pytorch-stable-py37": rpc error: code = Unknown desc = Error response from daemon: pull access denied for pytorch, repository does not exist or may require 'docker login'”
    “Pulling Image”
    “Mounting Storage”
    “Preparing SSH Server”
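
Below is a sketch of how such backend events could be read back from Kubernetes for display, again using the `kubernetes` Python client; the pod name and namespace are placeholders, and this is only an illustration of the idea, not the planned implementation.

```python
# Sketch: surface the Kubernetes events recorded for a pod, e.g. image-pull
# failures or retries, so the user can see why the pod is still initializing.
from kubernetes import client, config

config.load_kube_config()                        # or load_incluster_config()
v1 = client.CoreV1Api()

namespace, pod_name = "default", "example-job-pod-0"   # hypothetical names
selector = f"involvedObject.kind=Pod,involvedObject.name={pod_name}"
events = v1.list_namespaced_event(namespace, field_selector=selector)

for ev in events.items:
    # ev.reason is a short machine-readable code (e.g. "Failed", "Pulling");
    # ev.message is the human-readable detail to show to the user.
    print(f"[{ev.type}] {ev.reason}: {ev.message}")
```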
@yqwang-ms
Member

Related to #3572
