
kubeadm upgrade apply should stop kube-apiserver before updating etcd #2991

Closed
clementnuss opened this issue Jan 8, 2024 · 15 comments

Labels: area/upgrades, kind/documentation, kind/feature, priority/awaiting-more-evidence, sig/api-machinery
Milestone: v1.30

Comments

@clementnuss

What keywords did you search in kubeadm issues before filing this one?

upgrade, stop, sigterm, slo

Is this a BUG REPORT or FEATURE REQUEST?

FEATURE REQUEST

Description

kubeadm upgrade apply should stop kube-apiserver prior to upgrading etcd, as it otherwise seriously impacts in-flight requests.

Rationale

When upgrading our control-plane with kubeadm upgrade apply, following the procedure outlined in the documentation, we always take a serious hit on our SLO. Typically, during the normal kubeadm upgrade apply (with static pods), the following actions happen (details in the code):

  1. etcd is upgraded
  2. all control-plane components are upgraded

The problem with this approach is that when etcd is upgraded, kube-apiserver is still processing requests, and when kube-apiserver is configured to use etcd on localhost (as described in the stacked etcd topology section of the HA configuration documentation), this drastically increases the processing time of those in-flight requests, leading to a serious dip in the SLO.

Idea

Prior to upgrading etcd, it would be nice to send a SIGTERM to kube-apiserver, in order to properly drain in-flight requests. Then the etcd upgrade could be done, and finally the new kube-apiserver manifest could be enabled again.

Problem

I've tried manually stopping the kube-apiserver pod prior to running the upgrade, but kubeadm expects all static pods to be up and running before actually upgrading them.

This could be mitigated through another feature in kube-apiserver itself: a --maintenance flag that could be set prior to an upgrade.

This --maintenance flag would prevent kube-apiserver from receiving requests, by answering 5xx on /readyz and by not registering itself in the endpoints of the kubernetes.default.svc.cluster.local Service. The kube-apiserver pod would appear as NotReady, external load balancers would stop sending it traffic (as /readyz would answer 5xx), and internal traffic towards kubernetes.default.svc wouldn't reach that kube-apiserver, as it wouldn't have registered itself in the endpoints.

Then during an upgrade, we could set the --maintenance flag in the manifest for kube-apiserver prior to upgrading etcd (see the sketch after this list), which would achieve 2 things:

  • drain existing requests, as the pod would have to be re-created
  • make sure that during the etcd upgrade, there would be no requests on kube-apiserver
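Purely as an illustration (the --maintenance flag is hypothetical and does not exist in kube-apiserver today), here is a minimal sketch of how such a toggle could be wired into the static pod manifest on a node, assuming kubeadm's default manifest path:

```bash
# Illustrative sketch only: --maintenance is a *hypothetical* kube-apiserver
# flag proposed above; it does not exist today. Path is the kubeadm default.
MANIFEST=/etc/kubernetes/manifests/kube-apiserver.yaml

# add the hypothetical flag, reusing the indentation of the command entry;
# the kubelet notices the manifest change, recreates the static pod, and the
# old instance drains in-flight requests on the way down
sed -i 's|^\(\s*\)- kube-apiserver$|&\n\1- --maintenance=true|' "$MANIFEST"

# ... the etcd upgrade would happen here, with no client traffic on this apiserver ...

# writing the new (upgraded) manifest without the flag re-enables the apiserver;
# shown here as a simple removal of the flag line
sed -i '/- --maintenance=true/d' "$MANIFEST"
```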

Versions, environment

kubeadm version (use kubeadm version): 1.27.9
Affected HA configuration: stacked etcd topology, where kube-apiserver communicates with etcd on localhost.

Discussion - way forward

Let me know if I've missed something in the code or the documentation. Maybe there's a simpler way! As a workaround for the moment, I simply send a SIGTERM prior to running kubeadm upgrade apply, this already makes sure that there aren't too many requests on kube-apiserver anymore.
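For concreteness, a minimal sketch of that workaround, assuming a CRI runtime with crictl available on the node (grace period, wait time, and target version are just examples):

```bash
# Hedged sketch of the workaround above: gracefully restart kube-apiserver
# right before the upgrade so the fresh instance carries few long-lived
# in-flight requests when etcd goes down.
APISERVER_ID=$(crictl ps --name kube-apiserver -q)

# SIGTERM with a 30s grace period; the kubelet recreates the static pod afterwards
crictl stop --timeout 30 "$APISERVER_ID"

# give the fresh instance a moment to come back up and become ready
sleep 20

# target version is only an example
kubeadm upgrade apply v1.27.9
```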

Also, I'd be willing to implement the functionality in kube-apiserver and kubeadm if we all agree that this solution could prove useful.

@neolit123 neolit123 added area/upgrades kind/feature Categorizes issue or PR as related to a new feature. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 8, 2024
@neolit123 neolit123 added this to the v1.30 milestone Jan 8, 2024
@neolit123
Member

neolit123 commented Jan 8, 2024

The problem with this approach is that when etcd is upgraded, kube-apiserver is still processing requests, and when kube-apiserver is configured to use etcd on localhost (as described in the stacked etcd topology section of the HA configuration documentation), this drastically increases the processing time of those in-flight requests, leading to a serious dip in the SLO.

when etcd is restarted to be upgraded, doesn't the api-server enter a crashloop state until the storage backend is up again? that would leave some requests pending, but they should drop shortly after, no? is the SLO disturbance related to this "shortly after"?

during this time the LB leader election should trigger and make another server the leader.

related topic to kubernetes/enhancements#4356
which kubeadm should likely opt into if/when it's out, and it would mitigate your problem.
EDIT: actually no, that's apiserver.

for single CP scenarios there is obvious downtime, so this does not matter much.

This could be mitigated through another feature in kube-apiserver itself: a --maintenance flag that could be set prior to an upgrade

this is not something to be tracked in the kubernetes/kubeadm repository. FRs against kube-apiserver should go to kubernetes/kubernetes and be tagged with /sig api-machinery. also this sounds like something that needs a KEP - a design document proposal for the apiserver.
https://github.com/kubernetes/enhancements/blob/master/keps/README.md

EDIT-start

Then during an upgrade, we could set the --maintenance flag in the manifest for kube-apiserver prior to upgrading etcd, which would achieve 2 things:

worth noting that this will delay the upgrade process by N seconds, because the kubelet pod sync loop must catch the static pod manifest change and restart the pod to reflect the flag change. perhaps around 10 seconds on a decent machine.

EDIT-end

Let me know if I've missed something in the code or the documentation. Maybe there's a simpler way! As a workaround for the moment, I simply send a SIGTERM prior to running kubeadm upgrade apply, this already makes sure that there aren't too many requests on kube-apiserver anymore.

this is something we can do on the kubeadm side, i think. but i'm trying to understand, how are we improving the situation exactly. can you share some numbers of your SLO observations perhaps?

@neolit123
Member

cc @aojea
perhaps you can provide some feedback around the SLO and apiserver downtime due to storage upgrade topic.

also cc @pacoxu about the proposed kubeadm change.

@clementnuss
Author

to make it more precise, my problem is not only with a higher latency SLO (although it is also negatively impacted), but with the availability SLO. for info, this is how our availability SLO is defined (we then generate recording and alerting rules with sloth):

error_query: sum(rate(apiserver_request_total{code=~"(5..|429)"}[{{.window}}]))
total_query: sum(rate(apiserver_request_total[{{.window}}]))

and here is the result of running kubeadm upgrade apply v1.27.7 (from v1.27.6) on the first control-plane node (which was also the etcd leader during the upgrade, worst-case scenario) of a small 3+3 node lab cluster. control-plane nodes are 6 CPUs, 16 GB RAM each.
There was around 40 RPS on the cluster, 1/3 of which were in-flight on that control-plane.

[screenshot: availability SLO graph during the upgrade]

what precisely happens:

  1. etcd is stopped, which I think happens pretty fast (<5s). It then takes more than 20s to have the new etcd pod up and running (this was from 3.5.7-0 to 3.5.9-0). not sure whether etcd needs to migrate its local storage when it changes version, but that's what happened. further etcd restarts are snappy (<5s), but on the first boot of a new image I've observed that it takes quite some time.
  2. during that time period, requests on that kube-apiserver appear to fail, as storage cannot be contacted.

thanks for the clarification regarding the --maintenance flag for kube-apiserver. I think I will open a KEP 👍🏼

we could set the --maintenance flag in the manifest for kube-apiserver prior to upgrading etcd
worth noting that this will delay the upgrade process by N seconds, because the kubelet pod sync loop must catch the static pod manifest change and restart it to reflect the flag amend. perhaps around 10 seconds on a decent machine.

I think it would delay the update a little more: when you change the manifest, the "old" pod receives a SIGTERM, and depending on how kube-apiserver is configured (--shutdown-delay-duration etc.), this can last upwards of 30s (that's what I observed).

Something else, more generally: the current upgrade procedure recommends draining the control-plane nodes, but kubeadm upgrade apply currently upgrades etcd while kube-apiserver still processes requests. Even if the etcd upgrade is fast, this makes for degraded latency, and potentially errors if the etcd upgrade is slow.
I would have assumed/expected that kube-apiserver would be stopped before applying the etcd upgrade. Maybe we could create a kubeadm flag to permit doing so?

That way we would be gracefully stopping kube-apiserver, thereby draining in-flight requests, and then we would be disrupting/upgrading the storage.

@neolit123
Member

neolit123 commented Jan 9, 2024

thanks for the details.

and here is the result of running kubeadm upgrade apply v1.27.7 (from v1.27.6) on the first control-plane node (which was also the etcd leader during the upgrade, worst case scenario) of a small 3+3 nodes lab cluster. control-plane nodes are 6 CPU, 16GB RAM each.

to me it seems this is the problem, right here. it should be possible to delegate leadership to another server while a certain server node is being upgraded / enters maintenance. terminating a kube-apiserver should presumably trigger the LB to do such a swap. IMO, the question is - should kubeadm upgrade do that, or should the user orchestrate it manually?

I think it would delay the update a little more: when you change the manifest, the "old" pod receives a SIGTERM, and depending on how kube-apiserver is configured (--shutdown-delay-duration etc.), this can last upwards of 30s (that's what I observed).

yes, and kubeadm has supported upgrades for a long time but this is the first FR for terminating the api-server before etcd.
we have to be careful around certain setups here, such as those that do not want the additional delay (edge and bare-metal). thus, IMO, this cannot be the default mode.

thanks for the clarification regarding the --maintenance flag for kube-apiserver. I think I will open a KEP 👍🏼

actually, to not waste your time, don't write the KEP right away. it might be better to first join the api-machinery zoom call and discuss with the apiserver maintainers (the SIG owns the component):
https://github.com/kubernetes/community/tree/master/sig-api-machinery#api-machinery-special-interest-group

I would have assumed/expected the kube-apiserver would be stopped before applying the etcd upgrade. Maybe we could create a kubeadm flag to permit doing so ?

yes, we could but i need more feedback from other maintainers.
also i'm generally -1 on adding such a flag to the CLI, so i think it should go in the new UpgradeConfiguration that is WIP for v1beta4, with a tentative release ETA of 1.30.
#2890

@neolit123
Member

cc @jpbetz do you have feedback on the overall topic of apiserver entering maintenance mode if colocated etcd is upgraded on nodes?

@SataQiu
Member

SataQiu commented Jan 9, 2024

should kubeadm upgrade do that or the user should manually orchestrate.

In common practice, users need to manually remove the kube-apiserver instance IP from the front LB before doing the upgrade, and then add it back after the control-plane node upgrade is finished. kubeadm does not seem to have an easy way to automatically adjust the LB config for the user, nor to check whether the current control-plane instance has been safely removed from the LB.

@clementnuss
Author

clementnuss commented Jan 9, 2024

I agree that the user should be removing it from the external LB. however, this only partially solves the issue, as internal api-server requests towards kubernetes.default.svc.cluster.local will still land on that apiserver.
currently, as soon as kube-apiserver starts, it modifies the endpoints for kubernetes.default.svc, thereby receiving requests, independently of what you configured for your external LB.

@neolit123 currently in my upgrade scripts (and unlike in the example above), I always transfer the etcd leadership away prior to running kubeadm upgrade apply. this helps, but on heavily loaded servers, you still get a fair amount of requests that are stuck/fail because the colocated etcd is unavailable.
I also manually stop kube-apiserver 20s before running the upgrade, and with those workarounds the update is much smoother.
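For reference, a rough sketch of that leadership transfer with etcdctl, assuming the kubeadm default certificate paths (the target member ID is whatever you pick from the member list output):

```bash
# Hedged sketch: hand etcd leadership to another member before upgrading this
# node. Certificate paths are kubeadm defaults; adjust to your PKI layout.
ETCDCTL="etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key"

$ETCDCTL endpoint status -w table   # confirms whether the local member is the leader
$ETCDCTL member list -w table       # pick the hex ID of another member

# set TARGET_MEMBER_ID to the hex ID chosen above; move-leader must be run
# against the current leader's endpoint
$ETCDCTL move-leader "$TARGET_MEMBER_ID"
```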

Overall, I feel like we would need a way to stop kube-apiserver shortly before upgrading (which would automatically drain in-flight requests and remove the current kube-apiserver from the internal and external LB (provided you have health checks 😅)).
the issue I face while trying to do that is that the kubelet pod reconciliation loop is too fast, and there's no way (to my knowledge) to stop a static pod without it restarting right away.
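For reference, a hedged sketch of the manifest-move trick that is sometimes used to keep a static pod down (kubeadm default paths assumed); as noted earlier in the thread, kubeadm upgrade expects the static pods to be running, so this cannot simply be combined with kubeadm upgrade apply:

```bash
# The kubelet stops a static pod when its manifest disappears from the
# staticPodPath, and does not recreate it until the manifest comes back.
mkdir -p /etc/kubernetes/manifests.disabled
mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests.disabled/

# ... maintenance window: the apiserver stays down while the manifest is parked ...

mv /etc/kubernetes/manifests.disabled/kube-apiserver.yaml /etc/kubernetes/manifests/
```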

@SataQiu
Member

SataQiu commented Jan 9, 2024

If we stop kube-apiserver first, it may cascade to make more components unavailable, such as controller-manager and scheduler. If etcd starts slowly, the kube-apiserver will crash and restart due to the health check failures, and the corresponding endpoint will be removed from kubernetes.default.svc.cluster.local automatically.

And would you consider also setting up an LB for the etcd cluster, so that all the kube-apiserver instances connect to etcd through an LB? If you have an LB set up for etcd, the upgrade process might look like this:

  1. manually remove etcd instance ip from etcd LB
  2. manually remove kube-apiserver instance ip from kube-apiserver LB
  3. do the control-plane node upgrade with kubeadm, and check everything is OK (see the sketch after this list)
  4. manually add kube-apiserver instance ip into kube-apiserver LB
  5. manually add etcd instance ip into etcd LB
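A hedged sketch of the checks in step 3 above (LB address, ports, and certificate paths are examples to adjust):

```bash
# Verify the upgraded node before re-adding it to the load balancers.

# etcd cluster health, as seen through the (example) etcd LB the apiservers use
etcdctl --endpoints=https://etcd-lb.example.com:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health

# the upgraded kube-apiserver answers locally before it is re-added to its LB
curl -k "https://127.0.0.1:6443/readyz?verbose"

# overall sanity check
kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes
```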

@clementnuss
Author

if etcd is behind an external LB, and removed manually before the upgrade, then the SLO would be minimally impacted, and this would solve the issue.

before our current stacked etcd topology, we had the "thick client" topology, where every kube-apiserver was configured to use all 3 etcd servers as endpoints. The problem back then was that upgrading etcd was dragging the SLO down for all 3 kube-apiservers.

maybe we could consider using an external L4 LB for etcd, but that's (yet) another component (and point of failure), and I must say we are otherwise quite happy with the stacked etcd topology (where kube-apiserver communicates with etcd on localhost).

I however understand your point that stopping kube-apiserver is quite dangerous (for single control-plane environments it could be critical). but if that's an opt-in UpgradeConfiguration option, that would be quite convenient for us (and maybe others?).

Alternatively, I could discuss with sig-api-machinery folks and check whether a --maintenance flag is something they could consider

@neolit123
Member

If we stop kube-apiserver first, it may cascade to make more components unavailable, such as controller-manager and scheduler

mhm, if we terminate the api-server and wait for it to come up for a longer time, all components on the node will enter a crash loop. leader election will trigger relatively fast for the KCM and scheduler and we could orchestrate this at some point with this:
kubernetes/enhancements#4356
but that needs a careful plan, if so.

Overall, I feel like we would need a way to stop kube-apiserver shortly before upgrading (which would automatically drain in-flight requests and remove the current kube-apiserver from the internal and external LB (provided you have health checks 😅 )).
the issue I face while trying to do that is that kubelet pod reconciliation loop is too fast, and there's no way (to my knowledge) to stop a static pod without it restarting right away.

FWIW, an apiserver --maintenance flag would not work well for that, IMO. a flag requires a component restart, which does not help the SLO. there is a need for a dynamic, API-call-like reconfiguration in flight and a callback when the drain of requests is completed. somewhere in that loop the API server leader must also change, likely manually out of band.

these are topics to check with API machinery as well.

@neolit123
Member

these are topics to check with API machinery as well.

@clementnuss any updates?

currently in my upgrade scripts (and unlike in the example above), I always transfer the etcd leadership away prior to running kubeadm upgrade apply. this helps, but on heavily loaded servers, you still get a fair amount of requests that are stuck/fail because the colocated etcd is unavailable.
I also manually stop kube-apiserver 20s before running the upgrade, and with those workarounds the update is much smoother.

i don't think we should make code changes due to the discussed risks.

your current approach sounds like something that can be documented in the kubeadm upgrade docs as a "consideration before etcd upgrade":
https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/

would you like to help us with a documentation PR at the kubernetes/website repo?

@clementnuss
Author

hi @neolit123

that sounds like a good idea. in any case, this shouldn't be implemented on the kubeadm side I think. at least not at this stage.

I haven't had an occasion to reach out to SIG API Machinery yet, as I cannot attend the bi-weekly meetings. I'll try Slack.
I still believe that a maintenance mode, triggered by a special signal or by monitoring a file, would be helpful.

In any case I will document that in the kubeadm-upgrade doc. I will link to this issue in the PR.

IMO you can close this issue. Thanks for your support!

@neolit123
Member

In any case I will document that in the kubeadm-upgrade doc. I will link to this issue in the PR.

thanks,
let's keep this open until we can update the docs.

/kind documentation

@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Feb 7, 2024
clementnuss added a commit to clementnuss/website that referenced this issue Feb 14, 2024
relates to kubernetes/kubeadm#2991 (comment)

docs(kubeadm-upgrade): change wording

Signed-off-by: Clément Nussbaumer <[email protected]>

Apply suggestions from code review

Co-authored-by: Lubomir I. Ivanov <[email protected]>

add commands example

Signed-off-by: Clément Nussbaumer <[email protected]>
clementnuss added a commit to clementnuss/website that referenced this issue Feb 14, 2024
@clementnuss
Author

closing this issue as the documentation PR was successfully merged:
[screenshot: merged documentation PR]

I will write to SIG api machinery mailing list now, and try to see if we can create a mechanism with e.g. a SIGUSR1 signal that would put kube-apiserver in maintenance mode prior to an upgrade.

I might reopen this issue later on, if we implement the above functionality. it might then make sense to include that logic into kubeadm.

@neolit123
Member

closing this issue as the documentation PR was successfully merged:

I will write to SIG api machinery mailing list now, and try to see if we can create a mechanism with e.g. a SIGUSR1 signal that would put kube-apiserver in maintenance mode prior to an upgrade.

I might reopen this issue later on, if we implement the above functionality. it might then make sense to include that logic into kubeadm.

thanks

Andygol pushed a commit to Andygol/k8s-website that referenced this issue Mar 12, 2024