Scalability issues with MCM #943

Open
rishabh-11 opened this issue Oct 4, 2024 · 4 comments
Labels
area/control-plane Control plane related area/scalability Scalability related kind/bug Bug kind/enhancement Enhancement, improvement, extension priority/1 Priority (lower number equals higher priority)

Comments

@rishabh-11
Contributor

How to categorize this issue?

/area control-plane
/area scalability
/kind enhancement
/priority 1

What happened:
In the recent live update, we saw that the worker pools for our seeds were replaced with new ones concurrently. This caused the deletion of old machine-level objects (machine deployments, machine sets, machine classes and machines) and subsequent creation of new machine objects. All the old machines across worker pools went to Terminating state simultaneously.

Since we have a single shared queue with around 50 workers picking up items from it, potentially long-running operations such as node drains caused massive throttling. Because the workers were blocked in drain operations, the create requests were stuck in the queue with no worker available to process them.

The drain timeout was 2 hrs, but draining took more than 4 hrs because the change in #785, which is part of MCM version 0.54.0, had not yet reached the live landscape with the corresponding mcm-provider release.

In the recent live update:

  1. 5-6 new worker pools were introduced to replace the existing 2 worker pools.
  2. We observed more than 100 machines in the Terminating state.
  3. Around 50 create machine requests were stuck in the queue.
  4. For around 4 hrs, the machines were stuck in drain due to long drain times and throttling.

What you expected to happen:
MCM should scale much beyond handling 100 concurrent deletion/creation requests.

@rishabh-11 rishabh-11 added the kind/bug Bug label Oct 4, 2024
@gardener-robot gardener-robot added area/control-plane Control plane related area/scalability Scalability related kind/enhancement Enhancement, improvement, extension priority/1 Priority (lower number equals higher priority) labels Oct 4, 2024
@hoeltcl

hoeltcl commented Oct 14, 2024

Referenced in PTASK0034014 as a preventive measure. Do you already have an ETA?

@gardener-robot

@hoeltcl You have mentioned internal references in the public. Please check.

@elankath elankath self-assigned this Nov 21, 2024
@elankath
Contributor

elankath commented Dec 9, 2024

This is in progress. ETA will be updated later.

@hoeltcl

hoeltcl commented Dec 9, 2024

I've extended the due date in the linked problem task to 2025-01-31. Do you think you will be able to make it by that date?
