evaluate the retry logic for API calls #1606

neolit123 · 2019-06-12T15:53:13Z

with the merge of PR:
kubernetes/kubernetes#78915

we added retry logic to kubeadm when fetching ConfigMaps.

The PR also added a TODO to evaluate if this can be done better.
This work should happen in the 1.16 cycle.

update:

we are seeing users having random API server downtime which trips our API calls without retry, such as the ones in uplaod-certs phase of init:

kubeadm on version 1.29.1 with upload-certs times out during the configMap POST #3022
kubeadm init fails to create ConfigMap when --upload certs is set kubernetes#112411

this is tracked for a fix in 1.30, but no backports are planned.
as a first step we should add retries for all calls in idempotency.go

fabriziopandini · 2019-12-06T00:39:46Z

I think that we need to arrange for a code-walkthrough and add retries to everything is accessing api-server/etcd during join and fix it consistently

ereslibre · 2020-01-22T16:10:13Z

In the meantime, I'll give this one a go.

/assign

xlgao-zju · 2020-06-09T12:46:33Z

I'd like to help this. cloud I assign?

neolit123 · 2020-06-09T12:48:31Z

i think @rosti was working on an idea on how to create a generic retry client.

xlgao-zju · 2020-06-09T12:51:11Z

create a generic retry client.

that sounds cool, ping me if you @rosti need help.

neolit123 · 2020-09-03T15:02:20Z

/kind design feature
/remove-kind bug

fejta-bot · 2020-12-02T15:36:20Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

pacoxu · 2022-02-17T10:05:41Z

i think @rosti was working on an idea on how to create a generic retry client.

I cannot find new PR from @rosti about this. Any update for this?

https://github.com/kubernetes/kubernetes/blame/e777f721638cf585b4e9e5d933d27e753a35fabe/cmd/kubeadm/app/util/apiclient/idempotency.go#L342-L362

neolit123 · 2022-02-17T12:44:00Z

I don't think we had a POC. Are you seeing any particular problems?

…

On Feb 17, 2022 12:05, "Paco Xu" ***@***.***> wrote: i think @rosti <https://github.com/rosti> was working on an idea on how to create a generic retry client. I cannot find new PR from @rosti <https://github.com/rosti> about this. Any update for this? https://github.com/kubernetes/kubernetes/blame/ e777f721638cf585b4e9e5d933d27e753a35fabe/cmd/kubeadm/app/ util/apiclient/idempotency.go#L342-L362 — Reply to this email directly, view it on GitHub <#1606 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACRATHIWXJDJBZOV5E6VF3U3TCABANCNFSM4HXKNECQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you were assigned.Message ID: ***@***.***>

pacoxu · 2022-03-04T03:56:20Z

After walking through the thread in #78915, it works well. The discussion focus on whether there are any hidden problems in the apiserver side or proxy/LB side.

To summarize it:

we want to know if there are any random failures.
retry helps in some potential unknown/unsure cases

Why not log the unknown error as a warning and keep the retry logic. Or the current retry logic is acceptable.

neolit123 · 2022-03-04T21:00:36Z

we want to know if there are any random failures.

retry helps in some potential unknown/unsure cases

retry does help and IIRC the ideas was to have all API calls behind a client that retries for a per-use-case time.
not sure how doable that is...maybe doable but it's also a lot of work.

Why not log the unknown error as a warning and keep the retry logic. Or the current retry logic is acceptable.

some of these warnings in polls can spam the logs a lot.
high verbosity logs are better with V(x), but this is also inconsistent.
one of the discussed topics was to make the API client logic consistent.

i wouldn't mind closing this ticket until further notice.

pacoxu · 2022-03-05T06:26:02Z

If so, I think the priority would be priority/backlog 😄.

neolit123 · 2024-02-13T11:05:56Z

moving to 1.30 with priority soon.
there was another report of failed API call without retry, so i think we should retry everywhere.

xref

kubeadm on version 1.29.1 with upload-certs times out during the configMap POST #3022
kubeadm init fails to create ConfigMap when --upload certs is set kubernetes#112411

neolit123 · 2024-02-13T16:46:46Z

WIP PR
kubernetes/kubernetes#123271

meezaan · 2024-02-16T05:57:51Z

@neolit123 regarding kubernetes/kubernetes#112411, I was able to figure out what the problem was, at least in my case.

I had installed containerd using apt install containerd and not using the instructions in the docker repo which installs containerd.io. Once I installed the latter, the API was up and all the API calls were successful. I caught this when following kubeadm init phase step by step and eventually saw something in the containerd logs related to telemetry failures.

neolit123 · 2024-02-16T06:06:37Z

@neolit123 regarding kubernetes/kubernetes#112411, I was able to figure out what the problem was, at least in my case.

I had installed containerd using apt install containerd and not using the instructions in the docker repo which installs containerd.io. Once I installed the latter, the API was up and all the API calls were successful. I caught this when following kubeadm init phase step by step and eventually saw something in the containerd logs related to telemetry failures.

thanks for the update. so it's bound to a specific containerd version or config; i'd assume the alternative version is much newer?

neolit123 · 2024-02-16T06:09:26Z

we recommend to users to install containerd using their guide in the containerd repo
https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd

meezaan · 2024-02-16T07:04:42Z

@neolit123 regarding kubernetes/kubernetes#112411, I was able to figure out what the problem was, at least in my case.
I had installed containerd using apt install containerd and not using the instructions in the docker repo which installs containerd.io. Once I installed the latter, the API was up and all the API calls were successful. I caught this when following kubeadm init phase step by step and eventually saw something in the containerd logs related to telemetry failures.

thanks for the update. so it's bound to a specific containerd version or config; i'd assume the alternative version is much newer?

Unfortunately I did not check the version that came as the default Debian package, but I subsequently used https://docs.docker.com/engine/install/debian/ to install containerd.io.

neolit123 added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jun 12, 2019

neolit123 added this to the v1.16 milestone Jun 12, 2019

k8s-ci-robot assigned ereslibre and neolit123 Jun 12, 2019

This was referenced Jun 12, 2019

kubeadm: Add ability to retry ConfigMap get if certain errors happen kubernetes/kubernetes#78915

Merged

create HA cluster is flaky kubernetes-sigs/kind#588

Closed

neolit123 modified the milestones: v1.16, v1.17 Sep 2, 2019

RainbowMango mentioned this issue Sep 28, 2019

kubeadm reset hangs #1805

Closed

neolit123 modified the milestones: v1.17, v1.18 Nov 13, 2019

neolit123 mentioned this issue Dec 6, 2019

kubeadm: Improve resiliency in CreateOrMutateConfigMap kubernetes/kubernetes#85763

Merged

neolit123 modified the milestones: v1.18, v1.19 Mar 8, 2020

neolit123 mentioned this issue Jun 9, 2020

Add retries for kubeadm join / UpdateStatus kubernetes/kubernetes#91815

Closed

neolit123 modified the milestones: v1.19, v1.20 Jul 27, 2020

neolit123 unassigned ereslibre and neolit123 Sep 3, 2020

k8s-ci-robot added kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Sep 3, 2020

neolit123 added this to the v1.24 milestone Nov 23, 2021

neolit123 added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 5, 2022

neolit123 modified the milestones: v1.24, v1.25 Mar 29, 2022

neolit123 modified the milestones: v1.25, Next May 11, 2022

neolit123 mentioned this issue Sep 12, 2022

kubeadm: remove todo about retry function change kubernetes/kubernetes#112383

Merged

neolit123 changed the title ~~evaluate API object fetch retry logic~~ evaluate the retry logic for API calls Nov 30, 2023

neolit123 removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 13, 2024

neolit123 self-assigned this Feb 13, 2024

neolit123 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Feb 13, 2024

neolit123 modified the milestones: Next, v1.30 Feb 13, 2024

k8s-ci-robot closed this as completed in kubernetes/kubernetes#123271 Feb 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

evaluate the retry logic for API calls #1606

evaluate the retry logic for API calls #1606

neolit123 commented Jun 12, 2019 •

edited

Loading

fabriziopandini commented Dec 6, 2019

ereslibre commented Jan 22, 2020

xlgao-zju commented Jun 9, 2020

neolit123 commented Jun 9, 2020

xlgao-zju commented Jun 9, 2020

neolit123 commented Sep 3, 2020

fejta-bot commented Dec 2, 2020

pacoxu commented Feb 17, 2022

neolit123 commented Feb 17, 2022 via email

pacoxu commented Mar 4, 2022 •

edited

Loading

neolit123 commented Mar 4, 2022

pacoxu commented Mar 5, 2022

neolit123 commented Feb 13, 2024 •

edited

Loading

neolit123 commented Feb 13, 2024

meezaan commented Feb 16, 2024 •

edited

Loading

neolit123 commented Feb 16, 2024 •

edited

Loading

neolit123 commented Feb 16, 2024

meezaan commented Feb 16, 2024

evaluate the retry logic for API calls #1606

evaluate the retry logic for API calls #1606

Comments

neolit123 commented Jun 12, 2019 • edited Loading

fabriziopandini commented Dec 6, 2019

ereslibre commented Jan 22, 2020

xlgao-zju commented Jun 9, 2020

neolit123 commented Jun 9, 2020

xlgao-zju commented Jun 9, 2020

neolit123 commented Sep 3, 2020

fejta-bot commented Dec 2, 2020

pacoxu commented Feb 17, 2022

neolit123 commented Feb 17, 2022 via email

pacoxu commented Mar 4, 2022 • edited Loading

neolit123 commented Mar 4, 2022

pacoxu commented Mar 5, 2022

neolit123 commented Feb 13, 2024 • edited Loading

neolit123 commented Feb 13, 2024

meezaan commented Feb 16, 2024 • edited Loading

neolit123 commented Feb 16, 2024 • edited Loading

neolit123 commented Feb 16, 2024

meezaan commented Feb 16, 2024

neolit123 commented Jun 12, 2019 •

edited

Loading

pacoxu commented Mar 4, 2022 •

edited

Loading

neolit123 commented Feb 13, 2024 •

edited

Loading

meezaan commented Feb 16, 2024 •

edited

Loading

neolit123 commented Feb 16, 2024 •

edited

Loading