
Pods get stuck on down kubernetes node. #260

Open
SGC41 opened this issue Oct 24, 2024 · 0 comments
Labels
repo/provider Akash provider-services repo issues

Comments


SGC41 commented Oct 24, 2024


Describe the bug
Any pod, as far as I know, can get stuck on a node (Kubernetes installed per the Akash manual provider docs) that goes down for whatever reason, such as a regular sudo shutdown, pulling the server's power plug, or a similar crash-like event.

The pods then get stuck in whatever state kubectl last reported for them, and only complete their next step once their host node returns to the Kubernetes cluster.

After a while they usually end up in a Terminating state, presumably once the cluster has decided to kill them, but they seem equally stuck in earlier states; either way they only resume the stuck state or step when the node comes back into the cluster.
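
For anyone trying to diagnose this, roughly what I look at when it happens (the node name below is a placeholder):

```sh
# The downed node shows NotReady, and the node controller has tainted it
kubectl get nodes
kubectl describe node <downed-node> | grep -A3 Taints

# Pods scheduled on that node just sit in their last reported state (often Terminating)
kubectl get pods -A -o wide --field-selector spec.nodeName=<downed-node>
```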

It has been like this for a long time. I think it affects most if not all providers; it may be a Kubernetes setting or some Akash configuration that causes it. I can't imagine it happens to Kubernetes at real production scale, but it does happen for us.

It can be very disruptive, and it will not resolve itself over time, even over extended periods. Recently it took down my Ceph cluster, even though that supposedly shouldn't be possible: some critical pod got stuck on a node I was shutting down to conserve power due to low customer demand, and it took me about 48 hours to notice that Ceph was misbehaving.

I'm not sure whether Ceph was actually non-functional for that whole period, but the pod was definitely stuck in a Terminating state the entire time, and only when I brought the node back did it resolve itself. After that I could shut the node down again, since it wasn't needed; it just had that stuck pod.

It should be pretty easy to replicate: shut down a node with a pod on it and the pod will get stuck. Just last night a deployment pod died because I hadn't noticed it was on a node I was adding a GPU to; it also got stuck, could not move to another node in the cluster, and sat in a Terminating state until I brought the server back, at which point it was immediately terminated.
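
As far as I know, the only workaround while the node is away is to force-delete the stuck pod so its controller can recreate it elsewhere (pod and namespace names are placeholders; I understand this can be risky for pods with attached storage, since the cluster cannot confirm the old copy actually stopped):

```sh
# Force removal of a pod wedged in Terminating on an unreachable node
kubectl delete pod <stuck-pod> -n <namespace> --grace-period=0 --force
```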

What should happen
It seems like there should be some sort of timeout after which the pod is started on another node, if possible, but instead there seems to be some consensus or timeout that just waits forever, which leads to stuck pods.

A disrupted pod should wait a bit, and after some number of minutes jump to another node if its source node allows it. How long should that wait be? Currently it seems to be around 10 minutes, which feels a bit long to me; 5-10 minutes would probably be fine. It is also nice that pods don't jump too quickly when you just want a quick reboot of a node.
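
If I understand the Kubernetes defaults correctly, this wait comes from taint-based eviction: the node controller marks an unreachable or not-ready node with NoExecute taints, and a pod is evicted once its tolerationSeconds for those taints runs out (300 seconds by default, injected by the DefaultTolerationSeconds admission plugin). In principle that could be tuned per pod spec with something like the fragment below; the 120-second value is only an example and I haven't tested how this interacts with the manifests the Akash provider generates.

```yaml
tolerations:
  # Evict sooner than the default 300s when the host node is unreachable or not ready
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 120
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 120
```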

Some nodes might not be redundant, so their pods will be forced to wait until the node is back, because they have nowhere else to go; in that case termination of pods should take a while. Nobody wants pods to die, neither providers nor customers. Of course we can't have offline pods either, unless perhaps we aren't getting paid while they are down; I'm not sure how that works.

I digress: a pod should wait for a while, and if the node isn't back by then it should jump to another node if possible.

Otherwise it should keep waiting until the customer or the provider kills it. We can't have pods being terminated for next to no reason, but I guess that's a separate issue of sorts.

In short
What should happen: a pod on an offline node should wait, then try to jump to another node, or otherwise keep waiting.

My nodes are Ubuntu 22.04, with all the standard recommended versions for providers at present.

@chainzero chainzero added repo/provider Akash provider-services repo issues and removed awaiting-triage labels Nov 20, 2024