
evaluate the retry logic for API calls #1606

Closed
neolit123 opened this issue Jun 12, 2019 · 23 comments · Fixed by kubernetes/kubernetes#123271
Assignees
Labels
kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@neolit123
Member

neolit123 commented Jun 12, 2019

with the merge of PR:
kubernetes/kubernetes#78915

we added retry logic to kubeadm when fetching ConfigMaps.

The PR also added a TODO to evaluate if this can be done better.
This work should happen in the 1.16 cycle.


update:

we are seeing users hit random API server downtime that trips our API calls that lack retries, such as the ones in the upload-certs phase of init.

this is tracked for a fix in 1.30, but no backports are planned.
as a first step we should add retries for all calls in idempotency.go
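
for illustration, here is a minimal sketch (my own assumptions, not the actual kubeadm code) of what a retrying wrapper around an idempotency.go-style call could look like, using wait.PollUntilContextTimeout from apimachinery; the function name and the interval/timeout values are placeholders:

```go
// Hypothetical sketch: wrap a ConfigMap create in a retry loop so transient
// API server downtime does not immediately fail the kubeadm phase.
// The interval/timeout values below are placeholders, not real kubeadm defaults.
package apiclient

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	clientset "k8s.io/client-go/kubernetes"
)

// CreateConfigMapWithRetry retries the Create call on any error until the
// timeout expires, treating AlreadyExists as success (idempotent behavior).
func CreateConfigMapWithRetry(client clientset.Interface, cm *v1.ConfigMap) error {
	return wait.PollUntilContextTimeout(context.Background(), 2*time.Second, 30*time.Second, true,
		func(ctx context.Context) (bool, error) {
			_, err := client.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
			if err == nil || apierrors.IsAlreadyExists(err) {
				return true, nil
			}
			// Keep polling on transient failures (e.g. the API server briefly down).
			return false, nil
		})
}
```

the same pattern could wrap the other Create/Update/Get helpers in idempotency.go, each with a timeout appropriate to its phase.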

@neolit123 neolit123 added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jun 12, 2019
@neolit123 neolit123 added this to the v1.16 milestone Jun 12, 2019
@neolit123 neolit123 modified the milestones: v1.16, v1.17 Sep 2, 2019
@neolit123 neolit123 modified the milestones: v1.17, v1.18 Nov 13, 2019
@fabriziopandini
Member

I think we need to arrange a code walkthrough, add retries to everything that accesses the api-server/etcd during join, and fix this consistently.

@ereslibre
Contributor

In the meantime, I'll give this one a go.

/assign

@xlgao-zju

I'd like to help with this. Could I be assigned?

@neolit123
Member Author

i think @rosti was working on an idea on how to create a generic retry client.
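
(as a rough and entirely speculative sketch of what such a generic retry client could look like; the type and method names below are illustrative, not taken from any actual PR:)

```go
// Speculative sketch only: one possible shape for a "generic retry client",
// where each call site passes its own backoff and the wrapper retries any
// error returned by the operation. Names here are made up for illustration.
package apiclient

import (
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// retryClient wraps arbitrary API-server operations with a per-use-case backoff.
type retryClient struct {
	backoff wait.Backoff
}

// Do retries op on every error until the backoff is exhausted.
func (r *retryClient) Do(op func() error) error {
	return retry.OnError(r.backoff, func(error) bool { return true }, op)
}
```

a call site would then construct the client with a backoff tuned to its phase and wrap each API call in Do.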

@xlgao-zju

create a generic retry client.

that sounds cool, ping me if you need help, @rosti.

@neolit123 neolit123 modified the milestones: v1.19, v1.20 Jul 27, 2020
@neolit123
Member Author

/kind design feature
/remove-kind bug

@k8s-ci-robot k8s-ci-robot added kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Sep 3, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@neolit123 neolit123 added this to the v1.24 milestone Nov 23, 2021
@pacoxu
Member

pacoxu commented Feb 17, 2022

i think @rosti was working on an idea on how to create a generic retry client.

I cannot find a new PR from @rosti about this. Any updates on it?

https://github.com/kubernetes/kubernetes/blame/e777f721638cf585b4e9e5d933d27e753a35fabe/cmd/kubeadm/app/util/apiclient/idempotency.go#L342-L362

@neolit123
Member Author

neolit123 commented Feb 17, 2022 via email

@pacoxu
Member

pacoxu commented Mar 4, 2022

After walking through the thread in #78915: the retry logic itself works well. The discussion focuses on whether there are any hidden problems on the apiserver side or the proxy/LB side.

To summarize it:

  1. we want to know if there are any random failures.
  2. retries help in some potentially unknown/unclear cases

Why not log the unknown error as a warning and keep the retry logic? Or is the current retry logic acceptable as-is?

@neolit123
Member Author

  • we want to know if there are any random failures.
  • retries help in some potentially unknown/unclear cases

retries do help, and IIRC the idea was to put all API calls behind a client that retries for a per-use-case amount of time.
not sure how doable that is... probably doable, but it's also a lot of work.

Why not log the unknown error as a warning and keep the retry logic? Or is the current retry logic acceptable as-is?

some of these warnings in polls can spam the logs a lot.
logging at higher verbosity with V(x) is better for that, but our usage there is also inconsistent.
one of the discussed topics was to make the API client logic consistent.
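
purely as an illustration of both points (the names and values below are made up for this sketch), per-use-case backoffs plus verbosity-gated retry logging could look roughly like this:

```go
// Illustrative only: per-use-case backoffs and verbosity-gated retry logging.
// The use-case names and backoff values are invented for this sketch.
package apiclient

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
	"k8s.io/klog/v2"
)

// Different call sites get different retry budgets instead of one global value.
var backoffs = map[string]wait.Backoff{
	"upload-certs":    {Steps: 10, Duration: 500 * time.Millisecond, Factor: 1.5},
	"fetch-configmap": {Steps: 5, Duration: 1 * time.Second, Factor: 2.0},
}

// withRetry retries op with the backoff registered for useCase and logs each
// transient failure at a higher verbosity so polling does not spam the logs.
func withRetry(useCase string, op func() error) error {
	return retry.OnError(backoffs[useCase], func(error) bool { return true }, func() error {
		err := op()
		if err != nil {
			klog.V(4).Infof("%s: transient API error, will retry: %v", useCase, err)
		}
		return err
	})
}
```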

i wouldn't mind closing this ticket until further notice.

@pacoxu
Member

pacoxu commented Mar 5, 2022

If so, I think the priority would be priority/backlog 😄.

@neolit123 neolit123 added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 5, 2022
@neolit123 neolit123 modified the milestones: v1.24, v1.25 Mar 29, 2022
@neolit123 neolit123 modified the milestones: v1.25, Next May 11, 2022
@neolit123 neolit123 changed the title evaluate API object fetch retry logic evaluate the retry logic for API calls Nov 30, 2023
@neolit123
Member Author

neolit123 commented Feb 13, 2024

moving to 1.30 with priority important-soon.
there was another report of a failed API call without retry, so i think we should retry everywhere.

xref

@neolit123 neolit123 removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 13, 2024
@neolit123 neolit123 self-assigned this Feb 13, 2024
@neolit123 neolit123 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Feb 13, 2024
@neolit123 neolit123 modified the milestones: Next, v1.30 Feb 13, 2024
@neolit123
Member Author

WIP PR
kubernetes/kubernetes#123271

@meezaan

meezaan commented Feb 16, 2024

@neolit123 regarding kubernetes/kubernetes#112411, I was able to figure out what the problem was, at least in my case.

I had installed containerd using apt install containerd and not using the instructions in the docker repo which installs containerd.io. Once I installed the latter, the API was up and all the API calls were successful. I caught this when following kubeadm init phase step by step and eventually saw something in the containerd logs related to telemetry failures.

@neolit123
Member Author

neolit123 commented Feb 16, 2024

@neolit123 regarding kubernetes/kubernetes#112411, I was able to figure out what the problem was, at least in my case.

I had installed containerd using apt install containerd and not using the instructions in the docker repo which installs containerd.io. Once I installed the latter, the API was up and all the API calls were successful. I caught this when following kubeadm init phase step by step and eventually saw something in the containerd logs related to telemetry failures.

thanks for the update. so it's bound to a specific containerd version or config; i'd assume the alternative version is much newer?

@neolit123
Member Author

we recommend that users install containerd using the guide in the containerd repo:
https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd

@meezaan

meezaan commented Feb 16, 2024

@neolit123 regarding kubernetes/kubernetes#112411, I was able to figure out what the problem was, at least in my case.
I had installed containerd using apt install containerd and not using the instructions in the docker repo which installs containerd.io. Once I installed the latter, the API was up and all the API calls were successful. I caught this when following kubeadm init phase step by step and eventually saw something in the containerd logs related to telemetry failures.

thanks for the update. so it's bound to a specific containerd version or config; i'd assume the alternative version is much newer?

Unfortunately I did not check the version that came as the default Debian package, but I subsequently used https://docs.docker.com/engine/install/debian/ to install containerd.io.
