Scalability issues with MCM #943
Labels
area/control-plane: Control plane related
area/scalability: Scalability related
kind/bug: Bug
kind/enhancement: Enhancement, improvement, extension
priority/1: Priority (lower number equals higher priority)
How to categorize this issue?
/area control-plane
/area scalability
/kind enhancement
/priority 1
What happened:
In the recent live update, the worker pools for our seeds were replaced with new ones concurrently. This caused the deletion of the old machine-level objects (machine deployments, machine sets, machine classes and machines) and the subsequent creation of new machine objects. All the old machines across worker pools went into the Terminating state simultaneously. Since we have a shared queue with around 50 workers picking up items from it, this caused massive throttling due to potentially long-running operations such as node draining. Because the workers were blocked in drain operations, the create requests were stuck in the queue with no worker available to process them.
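To illustrate the starvation described above, here is a minimal, deterministic sketch of a single FIFO queue served by a fixed worker pool. It is not MCM code: the `task` type, `buildQueue`, and `maxCreateWait` are hypothetical names, the 5-minute creation time is an assumption, and the model assumes every deletion holds a worker for the full 2 hr drain timeout. The 50 workers and 100 concurrent requests are taken from this issue.

```go
package main

import (
	"fmt"
	"sort"
)

// task is a hypothetical queue item; it is not an MCM type.
type task struct {
	isCreate bool // true for a machine creation, false for a deletion
	minutes  int  // worker time the item occupies
}

// buildQueue enqueues all deletions ahead of all creations, all at time 0.
func buildQueue(deletes, drainMin, creates, createMin int) []task {
	queue := make([]task, 0, deletes+creates)
	for i := 0; i < deletes; i++ {
		queue = append(queue, task{isCreate: false, minutes: drainMin})
	}
	for i := 0; i < creates; i++ {
		queue = append(queue, task{isCreate: true, minutes: createMin})
	}
	return queue
}

// maxCreateWait simulates a shared FIFO queue served by `workers` workers
// (workers must be > 0) and returns the longest time, in minutes, that any
// creation waited before a worker picked it up.
func maxCreateWait(queue []task, workers int) int {
	freeAt := make([]int, workers) // time at which each worker becomes idle
	maxWait := 0
	for _, t := range queue {
		sort.Ints(freeAt) // freeAt[0] is now the earliest available worker
		start := freeAt[0]
		if t.isCreate && start > maxWait {
			maxWait = start
		}
		freeAt[0] = start + t.minutes
	}
	return maxWait
}

func main() {
	// Assumed load: 100 deletions draining 120 min each, then 100 creations of 5 min each.
	queue := buildQueue(100, 120, 100, 5)
	fmt.Println("shared queue, 50 workers:", maxCreateWait(queue, 50), "minutes")
}
```

Under these assumptions the deletions occupy all 50 workers for two full rounds of 120 minutes, so the last creations are not picked up until minute 245. The sketch only models queueing delay; it ignores retries, rate limits, and anything else the real controller does.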
The drain timeout was 2 hrs, but the drain took more than 4 hrs because of #785, which is part of MCM version 0.54.0; that version had not yet reached the live landscape with the corresponding mcm-provider release.
In the recent live update:-
Terminating state.
What you expected to happen:
MCM should scale well beyond 100 concurrent deletion/creation requests.