Address scalability issue when Node Watcher is enabled #76
Comments
@NickrenREN I wonder if you've seen a similar issue in production.
Node Watcher is a single-instance controller, so what is the scalability issue?
@NickrenREN It affects the e2e tests. Details are in kubernetes/kubernetes#102452. Disabling the external-health-monitor made the failure go away.
IIUC, the root cause of the scalability issue you mention is that Node Watcher watches PVCs, Nodes, and Pods?
A watch is a persistent connection, and Node Watcher is a single-instance controller. Is this really the root cause?
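As background for the watch question above, here is a minimal sketch of how a single-instance controller typically sets up PVC, Node, and Pod watches with client-go shared informers. The in-cluster config, resync period, and overall wiring are assumptions for illustration, not this repo's actual code; the point is that each resource type uses one persistent LIST+WATCH connection, so the watches themselves add little load compared to the other requests a controller makes.

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumes an in-cluster deployment; out-of-cluster setups would load a kubeconfig instead.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// One shared informer factory: each watched resource type gets a single
	// persistent LIST+WATCH connection, no matter how many handlers use it.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	nodeInformer := factory.Core().V1().Nodes().Informer()
	pvcInformer := factory.Core().V1().PersistentVolumeClaims().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Block until the initial LIST for each resource has been cached locally.
	cache.WaitForCacheSync(stopCh, podInformer.HasSynced, nodeInformer.HasSynced, pvcInformer.HasSynced)
}
```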
I saw a lot of API throttling, so maybe we can decrease the API call frequency?
This needs more investigation. The observation is that the failure went away when the external-health-monitor was disabled, came back when it was enabled, and went away again when it was disabled.
We could try that.
This indicates the controller causes the failure (API throttling?), but I still don't think the watch is the root cause.
The external-health-monitor controller added more load to the API server, which might have triggered those failures.
I agree, so we can try to decrease the API call frequency first.
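If lowering the call rate is the first thing to try, one readily available knob is client-go's client-side rate limiter. A minimal sketch, assuming the clientset is built from an in-cluster config; the function name and values are illustrative, not this repo's actual wiring:

```go
package healthmonitor // hypothetical package name, for illustration only

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newThrottledClient builds a clientset with a lower client-side rate limit.
// client-go defaults to 5 QPS with a burst of 10; the values below are purely
// illustrative and would need tuning against the throttling seen in the e2e runs.
func newThrottledClient() (*kubernetes.Clientset, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	config.QPS = 2   // sustained requests per second allowed by the client-side limiter
	config.Burst = 5 // short bursts above QPS are capped at this many requests
	return kubernetes.NewForConfig(config)
}
```

Note that lowering QPS only spreads the same requests over a longer window; reducing the load on the API server also means issuing fewer requests in the first place, for example by reconciling or resyncing less often.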
I would like to work on this issue. I'll start looking into it to understand it.
/assign
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue in response to the /close above.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@pohly: Reopened this issue in response to the /reopen above.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lifecycle frozen
/assign
We have issue #75 to change the code so that Pods and Nodes are only watched when the Node Watcher component is enabled. We still need to address the scalability issue that shows up when Node Watcher is enabled:
kubernetes/kubernetes#102452 (comment)
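For reference, a rough sketch of the direction issue #75 describes: only register the Pod and Node informers when the Node Watcher is enabled. The flag name, package name, and function below are assumptions for illustration, not the repo's actual code.

```go
package healthmonitor // hypothetical package and flag names, for illustration only

import (
	"flag"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

var enableNodeWatcher = flag.Bool("enable-node-watcher", false,
	"enable the Node Watcher component and its Pod/Node watches")

// buildInformers registers the Pod and Node informers only when the Node
// Watcher is enabled, so the extra LIST+WATCH load on the API server is
// only paid by deployments that opt in.
func buildInformers(clientset kubernetes.Interface) informers.SharedInformerFactory {
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)

	// The PVC informer is needed by the health monitor controller itself.
	factory.Core().V1().PersistentVolumeClaims().Informer()

	if *enableNodeWatcher {
		factory.Core().V1().Pods().Informer()
		factory.Core().V1().Nodes().Informer()
	}
	return factory
}
```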