Scalability issues with MCM #943

rishabh-11 · 2024-10-04T06:12:25Z

How to categorize this issue?

/area control-plane
/area scalability
/kind enhancement
/priority 1

What happened:
In the recent live update, we saw that the worker pools for our seeds were replaced with new ones concurrently. This caused the deletion of the old machine-level objects (machine deployments, machine sets, machine classes and machines) and the subsequent creation of new machine objects. All the old machines across worker pools entered the Terminating state simultaneously.

Since we have a shared queue with around 50 workers picking up items from it, this caused massive throttling: potentially long-running operations, such as the draining of nodes, kept the workers occupied. Because the workers were blocked in drain operations, the create requests were stuck in the queue with no worker available to process them.

The drain timeout was 2 hrs, but draining took more than 4 hrs because of #785, which is part of MCM version 0.54.0; that version had not yet reached the live landscape with the corresponding mcm-provider release.

In the recent live update:

5-6 new worker pools were introduced to replace the existing 2 worker pools.
We observed more than 100 machines in the Terminating state.
Around 50 create machine requests were stuck in the queue.
For around 4 hrs, the machines were stuck in drain due to long drain times and throttling.

What you expected to happen:
MCM should scale well beyond 100 concurrent deletion/creation requests.


hoeltcl · 2024-10-14T09:00:42Z

Referenced in PTASK0034014 as a preventive measure. Do you already have an ETA?

gardener-robot · 2024-10-14T09:00:49Z

@hoeltcl You have mentioned internal references in the public. Please check.

elankath · 2024-12-09T09:53:11Z

This is in progress. ETA will be updated later.

hoeltcl · 2024-12-09T10:04:43Z

I've extended the due date in the linked problem task to 2025-01-31. Do you think you will be able to make it by that date?

rishabh-11 added the kind/bug label Oct 4, 2024

gardener-robot added the area/control-plane, area/scalability, kind/enhancement and priority/1 labels Oct 4, 2024

elankath self-assigned this Nov 21, 2024
