kubeadm upgrade apply should stop kube-apiserver before updating etcd #2991
when etcd is restarted to be upgraded, doesn't the api-server enter a crashloop state until the storage backend is up again? that would leave some requests pending, but they should drop shortly after, no? is the SLO disturbance related to this "shortly after"? during this time the LB leader election should trigger and make another server the leader. related topic: kubernetes/enhancements#4356. for single CP scenarios there are obvious downtimes, so this does not matter much.
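A quick way to observe that window from a control-plane node is to poll readiness during the etcd restart; a minimal sketch, assuming a default kubeadm layout with the apiserver serving on port 6443 (anonymous access to `/readyz` is allowed by the default `system:public-info-viewer` RBAC):

```bash
# Poll the apiserver's readiness while etcd restarts; /readyz includes an
# etcd check, so it flips to a non-200 code while the backend is down.
while true; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' https://127.0.0.1:6443/readyz)
  echo "$(date +%T) /readyz -> ${code}"
  sleep 1
done
```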
this is not something to be tracked in the kubernetes/kubeadm repository. FRs against kube-apiserver should go at kubernetes/kubernetes and be tagged with /sig api-machinery.
EDIT: worth noting that this will delay the upgrade process by N seconds, because the kubelet pod sync loop must catch the static pod manifest change and restart the pod to reflect the flag change - perhaps around 10 seconds on a decent machine.
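For reference, one way to watch that delay on a node; a sketch assuming `crictl` is installed (the CREATED column shows when the kubelet recreated the container after the manifest change):

```bash
# After the new manifest lands under /etc/kubernetes/manifests, watch for
# the kube-apiserver container to be recreated by the kubelet.
watch -n1 'crictl ps --name kube-apiserver'
```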
this is something we can do on the kubeadm side, i think. but i'm trying to understand how exactly we would be improving the situation. can you share some numbers from your SLO observations, perhaps?
to make it more precise, my problem is not only with a higher latency SLO (although it is also negatively impacted), but with the availability SLO. for info, this is how our availability SLO is defined (we then generate recording rules and alerting rules with sloth):

```yaml
error_query: sum(rate(apiserver_request_total{code=~"(5..|429)"}[{{.window}}]))
total_query: sum(rate(apiserver_request_total[{{.window}}]))
```

during an upgrade, the result is a visible dip in this availability ratio.
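The raw error ratio behind that SLO can also be checked directly during an upgrade window; a sketch, assuming a Prometheus instance reachable at `http://prometheus:9090`:

```bash
# Instantaneous 5-minute error ratio for the apiserver availability SLO.
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(apiserver_request_total{code=~"(5..|429)"}[5m])) / sum(rate(apiserver_request_total[5m]))'
```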
thanks for the precision regarding the delay.
I think it would delay the update a little more: when you change the manifest, the "old" pod receives a SIGTERM, and depending on how kube-apiserver is configured, draining the in-flight requests can take additional time.

Something else, more generally: the current upgrade scenario recommends draining the control-plane nodes, but stopping `kube-apiserver` could be made part of that step. That way we would be gracefully stopping it before `etcd` is touched.
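For illustration, gracefully stopping the static pod boils down to moving its manifest out of the kubelet's static-pod directory; a sketch assuming the default kubeadm paths (the wait time is a placeholder, not a measured drain duration):

```bash
# Moving the manifest out makes the kubelet send SIGTERM to kube-apiserver,
# which drains in-flight requests before exiting.
mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/
sleep 30   # placeholder: give the kubelet time to stop the pod and drain

# ... upgrade etcd here ...

# Moving the manifest back lets the kubelet recreate the pod.
mv /etc/kubernetes/kube-apiserver.yaml /etc/kubernetes/manifests/
```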
thanks for the details.
to me it seems this is the problem, right here. it should be possible to delegate leadership to another server while a given node is being upgraded / enters maintenance. terminating a kube-apiserver should presumably trigger the LB to do such a swap. IMO, the question is: should kubeadm upgrade do that, or should the user orchestrate it manually?
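On the etcd side, leadership can already be handed off explicitly before a member goes into maintenance; a sketch assuming a stacked-etcd kubeadm setup (certificate paths are the kubeadm defaults, and `TARGET_ID` is the hex member ID of another, healthy member):

```bash
# Transfer etcd leadership away from the node that is about to be upgraded.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  move-leader "${TARGET_ID}"

# Pick TARGET_ID from the member list, e.g.:
#   etcdctl ... endpoint status --cluster -w table
```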
yes, and kubeadm has supported upgrades for a long time, but this is the first FR for terminating the api-server before etcd.
actually, so as to not waste your time writing a KEP right away, it might be better to first join the api-machinery zoom call and discuss with the apiserver maintainers (that SIG owns the component):
yes, we could, but i need more feedback from other maintainers.
cc @jpbetz do you have feedback on the overall topic of the apiserver entering maintenance mode when a colocated etcd is upgraded on the node?
In common practice, users need to manually remove the kube-apiserver instance IP from the front LB before doing the upgrade, and then add it back after the control-plane node upgrade is finished. kubeadm does not seem to have an easy way to automatically adjust the LB config for the user, nor to check whether the current control-plane instance has been safely removed from the LB.
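As an illustration of that manual step, with an HAProxy front LB it can be scripted through the runtime socket; a hypothetical sketch (the backend name `apiserver`, server name `cp1`, and socket path are assumptions about the LB setup):

```bash
# Stop sending new connections to this control-plane node before the upgrade.
echo "set server apiserver/cp1 state drain" | socat stdio /var/run/haproxy.sock

# ... upgrade the node ...

# Put it back into rotation once kube-apiserver is healthy again.
echo "set server apiserver/cp1 state ready" | socat stdio /var/run/haproxy.sock
```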
I agree that the user should be removing it from the external LB. however, this only partially solves the issue, as internal api-server requests towards `kubernetes.default.svc` would still reach that instance.

@neolit123 currently in my upgrade scripts (and unlike in the example above), I always transfer the etcd leadership away from the node being upgraded before touching it.

Overall, I feel like we would need a way to stop `kube-apiserver` gracefully before the `etcd` upgrade takes place.
If we stop `kube-apiserver` for the whole `etcd` upgrade, everything on that node that depends on the local apiserver is affected for that window as well.

And would you consider also setting up an LB for the etcd cluster, so that all the kube-apiserver instances connect to etcd through an LB? If you have an LB set up for etcd, the upgrade process could take etcd members out of rotation one at a time behind it.
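Concretely, that would mean pointing every apiserver at the LB instead of localhost; a sketch of the relevant flag, with `etcd-lb.internal` as a made-up LB address:

```bash
# Excerpt from the kube-apiserver static pod command: point at the etcd LB
# instead of the stacked-etcd default https://127.0.0.1:2379.
kube-apiserver \
  --etcd-servers=https://etcd-lb.internal:2379
  # ... remaining flags unchanged
```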
before our current stacked etcd topology, we had the "thick client" topology, where every kube-apiserver was configured to use all 3 etcd servers as endpoints. The problem back then was that upgrading etcd was dragging the SLO down for all 3 kube-apiservers.

maybe we could consider using an external L4 LB for etcd, but that's (yet) another component (and point of failure), and I must say we are otherwise quite happy with the stacked etcd topology (where kube-apiserver communicates with etcd on localhost).

I however understand your point that stopping `kube-apiserver` entirely has its own drawbacks.

Alternatively, I could discuss with sig-api-machinery folks and check whether a `--maintenance` flag on `kube-apiserver` could prove useful.
mhm, if we terminate the api-server and it takes a longer time to come back up, all components on the node will enter a crash loop. leader election will trigger relatively fast for the KCM and scheduler, and we could orchestrate this at some point with this:
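The KCM and scheduler leader elections mentioned here can be watched through their coordination leases; a minimal sketch:

```bash
# Leader election state for the components that fail over when the
# local apiserver goes away; HOLDER shows the current leader.
kubectl -n kube-system get lease kube-controller-manager kube-scheduler
```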
FWIW, an apiserver --maintenance flag would not work well for that, IMO. flags require a component restart, which does not help the SLO. there is a need for a dynamic, API-call-like reconfiguration in flight, and a callback when the draining of requests is completed. somewhere in that loop the API server leader must also change, likely manually, out of band. these are topics to check with API machinery as well.
@clementnuss any updates?
i don't think we should make code changes due to the discussed risks. your current approach sounds like something that can be documented in the kubeadm upgrade docs as a "consideration before etcd upgrade". would you like to help us with a documentation PR at the kubernetes/website repo?
hi @neolit123, that sounds like a good idea. in any case this shouldn't be implemented on the kubeadm side.

I didn't have an occasion to reach out to SIG API Machinery yet, as I cannot attend the bi-weekly meetings. I'll try slack. In any case I will document this in the kubeadm upgrade doc and will link to this issue in the PR.

IMO you can close this issue. Thanks for your support!
thanks, /kind documentation
What keywords did you search in kubeadm issues before filing this one?
upgrade, stop, sigterm, slo
Is this a BUG REPORT or FEATURE REQUEST?
FEATURE REQUEST
Description
`kubeadm upgrade apply` should stop `kube-apiserver` prior to upgrading `etcd`, as it otherwise seriously impacts in-flight requests.

Rationale
When upgrading our control plane with `kubeadm upgrade apply`, following the procedure outlined in the documentation, we always take a serious hit on our SLO. Typically, during a normal `kubeadm upgrade apply` (with static pods), the following actions happen (details in the code): `etcd` is upgraded first, then the remaining control-plane components.

The problem with this approach is that when `etcd` is upgraded, `kube-apiserver` is still processing requests, and when `kube-apiserver` is configured to use etcd on localhost (as described in the stacked etcd topology section of the HA configuration documentation), this drastically increases the processing time of those in-flight requests, leading to a serious dip in the SLO.
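For context, in the stacked topology the kubeadm-generated manifest pins the apiserver to the local etcd member, which is why the etcd restart is felt immediately; a quick way to confirm this on a node:

```bash
# In a kubeadm stacked-etcd setup, the apiserver reaches etcd on localhost.
grep etcd-servers /etc/kubernetes/manifests/kube-apiserver.yaml
# typically prints: --etcd-servers=https://127.0.0.1:2379
```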
Idea

Prior to upgrading `etcd`, it would be nice to send a `SIGTERM` to `kube-apiserver`, in order to properly drain in-flight requests. Then the `etcd` upgrade could be done, and finally the new `kube-apiserver` manifest could be enabled again.

Problem
I've tried manually stopping the `kube-apiserver` pod prior to running the upgrade, but `kubeadm` expects all static pods to be up and running before actually upgrading them.

This could be mitigated through another feature in `kube-apiserver` itself: a `--maintenance` flag that could be set prior to an upgrade.
This `--maintenance` flag would prevent `kube-apiserver` from receiving requests, through answering a 5xx on `/readyz` and through not modifying the endpoints for the `kubernetes.default.service.cluster.local` service. The kube-apiserver pod would appear as `NotReady`, external load balancers wouldn't send it traffic (as `/readyz` would answer a 5xx), and internal traffic towards `kubernetes.default.service` wouldn't reach that `kube-apiserver`, as it wouldn't have modified the endpoints.
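To make the proposed behavior concrete (the `--maintenance` flag does not exist today; this sketch only illustrates how an external LB health check would see it, with `CONTROL_PLANE_IP` as a placeholder):

```bash
# An external LB health check probing an instance in maintenance mode.
curl -sk -o /dev/null -w '%{http_code}\n' https://CONTROL_PLANE_IP:6443/readyz
# expected: a 5xx while --maintenance is set, 200 otherwise
```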
Then during an upgrade, we could set the `--maintenance` flag in the manifest for `kube-apiserver` prior to upgrading `etcd`, which would achieve 2 things: during the `etcd` upgrade, there would be no requests on `kube-apiserver`.
Versions, environment
kubeadm version (use `kubeadm version`): 1.27.9

Affected HA configuration: stacked etcd topology, where `kube-apiserver` communicates with `etcd` on localhost.

Discussion - way forward
Let me know if I've missed something in the code or the documentation. Maybe there's a simpler way! As a workaround for the moment, I simply send a `SIGTERM` prior to running `kubeadm upgrade apply`; this already makes sure that there aren't too many requests on `kube-apiserver` anymore.

Also, I'd be willing to implement the functionality in `kube-apiserver` and `kubeadm` if we all agree that this solution could prove useful.
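A minimal sketch of that workaround (the kubelet restarts the static pod after the signal, so this only drains the current in-flight requests; the wait and the target version are placeholders):

```bash
# Drain in-flight requests just before the upgrade, then run it.
pkill -TERM kube-apiserver
sleep 30   # placeholder: let graceful shutdown finish
kubeadm upgrade apply v1.27.9   # version shown is illustrative
```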