This repository has been archived by the owner on Apr 25, 2023. It is now read-only.

feat: introduce informer cache sync timeout #1460

Merged 4 commits on Oct 21, 2021

@zqzten zqzten commented Oct 15, 2021

What this PR does / why we need it:
According to this issue of controller-runtime, an informer cache timeout is needed to prevent controllers from blocking indefinitely. In KubeFed, we have far more informers than a simple controller, and a cache sync problem in any one of them blocks the whole reconcile loop (we have encountered this several times in our production environment).

This PR introduces a configurable informer cache sync timeout for the core controllers of KubeFed. A KubeFed controller now errors out if its informers are unable to sync their caches within this timeout. With this behavior, a KubeFed controller never keeps running without doing any work, which helps users discover watch/list problems in time and gives other working replicas a chance to run.

This PR also adds logs to the ClustersSynced check to help find out which member cluster's informer cache sync is blocking.
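To illustrate the idea of logging which member cluster is blocking, here is a small sketch under stated assumptions. The names `unsyncedClusters` and `clustersSynced`, the map of per-cluster checks, and the injected `logf` function are all hypothetical and do not reflect the actual ClustersSynced implementation in KubeFed.

```go
package main

import "sort"

// unsyncedClusters returns the sorted names of member clusters whose
// informer cache has not synced yet. Hypothetical helper for this sketch.
func unsyncedClusters(synced map[string]func() bool) []string {
	var pending []string
	for name, hasSynced := range synced {
		if !hasSynced() {
			pending = append(pending, name)
		}
	}
	sort.Strings(pending)
	return pending
}

// clustersSynced reports whether every member cluster has synced and logs
// one line per cluster that is still blocking. logf is injected so the
// caller decides where the message goes (for example log.Printf).
func clustersSynced(synced map[string]func() bool, logf func(format string, args ...interface{})) bool {
	pending := unsyncedClusters(synced)
	for _, name := range pending {
		logf("informer cache for member cluster %q has not synced yet", name)
	}
	return len(pending) == 0
}
```

With a log line per unsynced cluster, an operator can see which member cluster is holding up the sync rather than only that the check as a whole failed.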

Which issue(s) this PR fixes:
Fixes #1459

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 15, 2021

@mars1024 mars1024 left a comment

/lgtm

friendly ping @hectorj2f

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2021

zqzten commented Oct 18, 2021

/assign @xunpan


zqzten commented Oct 21, 2021

friendly ping @xunpan for approval or further opinions

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hectorj2f, mars1024, zqzten

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 21, 2021
@k8s-ci-robot k8s-ci-robot merged commit 224fe95 into kubernetes-retired:master Oct 21, 2021
@zqzten zqzten deleted the cache_sync_timeout branch October 22, 2021 02:35
Development

Successfully merging this pull request may close these issues.

informer cache sync needs a timeout
5 participants