FYI - Simple remedy system designed for use with NPD #199
Comments
@negz This is a really good use case for NPD. I will read through what you described in more detail later. Would you mind adding your use case of NPD to the usage section of the README?
This is exactly what NPD was first proposed to do. Because remedy systems are end-user dependent, a common remedy system is not easy to develop.
@andyxning Thanks! I'd be happy to mention this use case in the README. Would it be too self-promotional to link to our Draino tool there? :)
@negz No. Draino is actually a POC of a remedy system based on NPD. :) Could you please make a PR to add the use case?
@negz I have briefly read the Draino code. It looks quite good and is definitely worth listing as a use case of NPD. Please do not hesitate to add the Draino use case; I am happy to review it. :)
Hello, I am using Draino as a remedy for permanent problems detected by the Node Problem Detector -- it should simply cordon and drain any node exhibiting a drainable node condition. Here is my example: I simulated a hung task with `echo "task docker:7 blocked for more than 300 seconds." | systemd-cat -t kernel`, which triggers the kernel rule and sets KernelDeadlock to True, but Draino does not react and my node is never marked unschedulable. Is something wrong with my setup? This is my runtime environment.
KernelDeadlock=True has triggered the rule, but Draino does not seem to drain the node.
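A minimal sketch of this kind of test, assuming NPD's kernel monitor is watching the systemd journal and uses the stock hung-task pattern; the node name below is a placeholder:

```sh
# Inject a kernel-style "hung task" line into the journal that NPD's kernel
# monitor is tailing; the message must match the pattern in your monitor rules.
echo "task docker:7 blocked for more than 300 seconds." | systemd-cat -t kernel

# Check whether NPD flipped the KernelDeadlock condition on the node.
kubectl describe node <node-name> | grep -A2 KernelDeadlock

# If Draino is watching that condition, the node should shortly be cordoned,
# i.e. show SchedulingDisabled in its status.
kubectl get node <node-name>
```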
Hello,
I wanted to bring Draino to your attention, in case it's useful to others. Draino is a very simple 'remedy' system for permanent problems detected by the Node Problem Detector - it simply cordons and drains nodes exhibiting configurable Node Conditions.
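For context, the manual remediation that Draino automates is essentially the standard cordon-and-drain workflow (node name is a placeholder):

```sh
# Mark the node unschedulable so no new pods are placed on it.
kubectl cordon <node-name>

# Evict the pods already running on the node; DaemonSet-managed pods are skipped.
kubectl drain <node-name> --ignore-daemonsets
```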
At Planet we run a small handful of Kubernetes clusters on GCE (not GKE). We have a particular analytics workload that is really good at killing GCE persistent volumes. Without going into too much detail, we see persistent-volume-related processes (`mkfs.ext4`, `mount`, etc.) hanging forever in uninterruptible sleep, preventing the pods wanting to consume said volumes from running. We're working with GCP to resolve this issue, but in the meantime we got tired of manually cordoning and draining affected nodes, so we wrote Draino.

Our remedy system looks like this:

1. The Node Problem Detector detects the permanent problem and sets the `KernelDeadlock` condition, or a variant of `KernelDeadlock` we call `VolumeTaskHung`, on the affected node (a sketch of such a custom monitor rule follows this list).
2. Draino notices the condition and cordons and drains the node.
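`VolumeTaskHung` is not one of NPD's stock conditions, so a custom monitor config is needed to raise it. A rough sketch of what such a config could look like, following the shape of NPD's kernel-monitor.json; the file path, condition/reason names, and regex pattern are illustrative assumptions, not the exact rules Planet uses:

```sh
# Sketch of a custom NPD system-log monitor that reports a permanent
# "VolumeTaskHung" condition when volume-related tasks hang.
# NOTE: the path, condition/reason names, and pattern below are illustrative;
# the exact config fields depend on your NPD version.
cat > /config/volume-task-hung-monitor.json <<'EOF'
{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [
    {
      "type": "VolumeTaskHung",
      "reason": "NoVolumeTaskHung",
      "message": "volume-related tasks are not hung"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "VolumeTaskHung",
      "reason": "VolumeTaskHung",
      "pattern": "task (mkfs\\.ext4|mount)\\S* blocked for more than \\w+ seconds\\."
    }
  ]
}
EOF
```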
It's worth noting that once the Descheduler supports descheduling pods based on taints, Draino could be replaced by the Descheduler running in combination with the scheduler's `TaintNodesByCondition` functionality.