This repository has been archived by the owner on May 25, 2023. It is now read-only.

v0.4 runtime panic #861

Closed
kinglion811 opened this issue May 28, 2019 · 10 comments · Fixed by volcano-retired/scheduler#26 or #863
Labels
kind/bug · priority/important-soon · sig/scheduling

Comments

@kinglion811

kinglion811 commented May 28, 2019

Kubernetes version: 1.11
kube-batch version: v0.5
When I start kube-batch and schedule a TF job, kube-batch panics after running for a while.
The panic information is:

Resource is not sufficient to do operation: <cpu 52000.00, memory 261334462464.00, GPU 0.00> sub <cpu 12000.00, memory 60000000000.00, GPU 2000.00> [recovered]
	panic: Resource is not sufficient to do operation: <cpu 52000.00, memory 261334462464.00, GPU 0.00> sub <cpu 12000.00, memory 60000000000.00, GPU 2000.00>

The code causing the panic:
[screenshot]
[screenshot]
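For context, judging from the panic message and the resource_info.go frame in the trace, the panic comes from the scheduler's resource arithmetic: the Resource type in pkg/scheduler/api/resource_info.go refuses to subtract a request that the remaining resources cannot cover and panics instead. A minimal sketch of that behavior, with a simplified field set and message format rather than the real kube-batch type:

```go
package main

import "fmt"

// Simplified stand-in for kube-batch's api.Resource; the real type tracks more
// resource kinds, but the Sub behavior is the relevant part here.
type Resource struct {
	MilliCPU float64
	Memory   float64
	GPU      float64
}

// LessEqual reports whether r fits within rr on every dimension.
func (r *Resource) LessEqual(rr *Resource) bool {
	return r.MilliCPU <= rr.MilliCPU && r.Memory <= rr.Memory && r.GPU <= rr.GPU
}

// Sub panics when the receiver cannot cover rr, which is the kind of message seen above.
func (r *Resource) Sub(rr *Resource) *Resource {
	if !rr.LessEqual(r) {
		panic(fmt.Sprintf("Resource is not sufficient to do operation: %v sub %v", r, rr))
	}
	r.MilliCPU -= rr.MilliCPU
	r.Memory -= rr.Memory
	r.GPU -= rr.GPU
	return r
}

func main() {
	idle := &Resource{MilliCPU: 52000, Memory: 261334462464, GPU: 0}
	req := &Resource{MilliCPU: 12000, Memory: 60000000000, GPU: 2000}
	idle.Sub(req) // panics: the node reports 0 GPUs while a cached task still requests 2
}
```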

@kinglion811
Author

 @k82cn

@k82cn
Contributor

k82cn commented May 28, 2019

We merged #860 a few days ago, which may be helpful :)

@kinglion811
Author

kinglion811 commented May 29, 2019

@k82cn That does not solve my problem. This issue is mainly about the node's Idle resources, and the main cause is inconsistent GPU resources. I merged the code and the problem reappeared.

@kinglion811 kinglion811 changed the title from "v0.5 runtime panic" to "v0.4 runtime panic" May 29, 2019
@k82cn
Contributor

k82cn commented May 30, 2019

Thanks for your confirmation :)
We also hit a similar issue this morning; in our case, the device plugin did not report GPU info in time when the kubelet restarted. We're working on the PR. Is that similar to your scenario, i.e. a panic when a kubelet with a device plugin restarts?
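To make the described sequence concrete, here is a hypothetical illustration (the nodeView type, field names, and numbers are made up for the example and are not kube-batch code): after a kubelet restart the node briefly reports zero allocatable GPUs until the device plugin re-registers, while the scheduler cache still accounts for a GPU pod already running on that node, so the idle-resource subtraction underflows.

```go
package main

import "fmt"

// Illustrative per-node bookkeeping, loosely modelled on the idea of node_info.go.
type nodeView struct {
	allocatableGPU float64
	idleGPU        float64
}

// addPod shrinks idle by the pod's request; like the real code, it panics rather
// than letting idle go negative.
func (n *nodeView) addPod(gpuReq float64) {
	if gpuReq > n.idleGPU {
		panic(fmt.Sprintf("Resource is not sufficient to do operation: <GPU %.2f> sub <GPU %.2f>",
			n.idleGPU, gpuReq))
	}
	n.idleGPU -= gpuReq
}

func main() {
	// kubelet restarted; the device plugin has not re-registered, so the node update
	// reports 0 allocatable GPUs even though a GPU pod is still running on it.
	n := &nodeView{allocatableGPU: 0, idleGPU: 0}
	n.addPod(2000) // the cached pod still requests 2 GPUs -> panic
}
```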

@k82cn
Contributor

k82cn commented May 30, 2019

/kind bug
/priority important-soon
/sig scheduling

@k8s-ci-robot k8s-ci-robot added kind/bug, priority/important-soon, and sig/scheduling labels May 30, 2019
@kinglion811
Author

@k82cn Maybe; I will confirm your information.

@kinglion811
Author

@k82cn
What is the progress on this issue?

@k82cn
Contributor

k82cn commented Jun 12, 2019

@asifdxtreme, would you help to cherry-pick volcano-retired#26 into kube-batch? :)

@kinglion811
Author

Observed a panic: &errors.errorString{s:"Resource is not sufficient to do operation: <cpu 56000.00, memory 270086234112.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, nvidia.com/gpu 5000.00> sub <cpu 54000.00, memory 268435456000.00, nvidia.com/gpu 8000.00>"} (Resource is not sufficient to do operation: <cpu 56000.00, memory 270086234112.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, nvidia.com/gpu 5000.00> sub <cpu 54000.00, memory 268435456000.00, nvidia.com/gpu 8000.00>)
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/api/resource_info.go:158
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/api/node_info.go:182
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/cache/event_handlers.go:82
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/cache/event_handlers.go:93
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/cache/event_handlers.go:192
/workspace/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/cache/cache.go:262
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/controller.go:195
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/controller.go:227
:0
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/shared_informer.go:554
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/client-go/tools/cache/shared_informer.go:390
/workspace/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71

This is mainly caused by GPU loss: if the GPU plugin reports GPUs as lost, the scheduler's resource view becomes inconsistent.
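A defensive guard along the following lines would keep the scheduler alive while the view is temporarily inconsistent. This is only a sketch of the idea; whether volcano-retired#26 takes this exact approach is not confirmed here, and the trySub name and simplified Resource fields are illustrative rather than the real kube-batch API.

```go
package main

import (
	"fmt"
	"log"
)

// Simplified resource type for illustration only.
type Resource struct {
	MilliCPU float64
	Memory   float64
	GPU      float64
}

func (r *Resource) LessEqual(rr *Resource) bool {
	return r.MilliCPU <= rr.MilliCPU && r.Memory <= rr.Memory && r.GPU <= rr.GPU
}

// trySub returns an error instead of panicking when r cannot cover rr.
func (r *Resource) trySub(rr *Resource) error {
	if !rr.LessEqual(r) {
		return fmt.Errorf("resource is not sufficient to do operation: %+v sub %+v", *r, *rr)
	}
	r.MilliCPU -= rr.MilliCPU
	r.Memory -= rr.Memory
	r.GPU -= rr.GPU
	return nil
}

func main() {
	// Node reports 5 GPUs while cached pods request 8 (inconsistent after GPU loss).
	idle := &Resource{MilliCPU: 56000, Memory: 270086234112, GPU: 5000}
	req := &Resource{MilliCPU: 54000, Memory: 268435456000, GPU: 8000}
	if err := idle.trySub(req); err != nil {
		log.Printf("skip updating node idle resources: %v", err) // keep scheduling instead of crashing
	}
}
```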

@kinglion811
Author

@k82cn
