linstor-satellite restart leads to linstor-controller overutilization #391

ddpolyakov · 2024-03-05T14:15:12Z

Hi! I'm using LINSTOR 1.25.1 with etcd on separate nodes as the database: around 100 diskless nodes and 10 storage nodes, with around 1.5K resources in total.
Every time I restart a satellite (any of them), the LINSTOR controller goes mad, eating all available CPU across its threads. Stracing the controller shows tons of futex calls across all the spawned threads:
[pid 1910062] futex(0x7f82495fd77c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910061] futex(0x7f82495fa0c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910060] futex(0x7f82495f8678, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910059] futex(0x7f82495f6a68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910058] futex(0x7f82495f4c98, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910057] futex(0x7f82495f2ed8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910056] futex(0x7f82495f12c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910055] futex(0x7f82495ef518, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910054] futex(0x7f82495ed908, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910053] futex(0x7f82495ebcf8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910052] futex(0x7f82495ea0e8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
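
For anyone wanting to put a number on "tons of futexes", here is a minimal sketch that tallies syscalls and distinct thread IDs from strace output in the `[pid N] syscall(...)` shape shown above. The function name `tally_syscalls` and the regular expression are illustrative assumptions, not part of LINSTOR or strace, and the pattern only matches lines that carry a `[pid N]` prefix.

```python
import re
from collections import Counter

# Matches entries such as:
#   [pid 1910062] futex(0x7f82495fd77c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
# Group 1 is the thread ID, group 2 is the syscall name.
ENTRY_RE = re.compile(r"\[pid\s+(\d+)\]\s+(\w+)\(")


def tally_syscalls(strace_text):
    """Return (Counter of syscall names, number of distinct thread IDs)."""
    per_syscall = Counter()
    thread_ids = set()
    # findall scans the whole text, so entries are counted whether they
    # appear one per line or run together on a single line.
    for tid, syscall in ENTRY_RE.findall(strace_text):
        per_syscall[syscall] += 1
        thread_ids.add(tid)
    return per_syscall, len(thread_ids)
```

Feeding it the captured strace text returns how often each syscall appears and how many threads were involved, which makes runs before and after a satellite restart easier to compare.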

Attaching the htop output from the controller server during a linstor-satellite restart.

ghernadi · 2024-03-05T15:57:10Z

Please upgrade to at least 1.26.1 and see if this issue persists. In that version we tried to fix a bug that could lead to this kind of behavior.

ddpolyakov · 2024-03-06T00:13:25Z

Updated to 1.26.2: same behaviour.

ddpolyakov · 2024-03-06T00:43:42Z

The same thing happens with a mysqld Galera cluster as the database.

ddpolyakov · 2024-03-06T00:44:21Z

Switching back to H2 seems to resolve the problem.

ghernadi · 2024-03-08T09:31:17Z

If this is reproducible and you are willing to test this further, can you trigger the controller into this state, poke it a few times with kill -3 <pid_of_controller_java_process>, and get me an SOS report? kill -3 causes the JVM to print a thread dump to its stdout (which is usually captured by journalctl, which is then collected via LINSTOR's SOS report). If possible, run the kill -3 a few times, so we have a chance to see what the threads are doing.
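
The repeated kill -3 can be scripted. Below is a minimal sketch, assuming a Unix host and that the caller is allowed to signal the controller's Java process. The function name `request_thread_dumps`, the default count, and the default interval are hypothetical choices for illustration. SIGQUIT is the signal that `kill -3` sends.

```python
import os
import signal
import time


def request_thread_dumps(pid, count=3, interval_s=5.0,
                         kill=os.kill, sleep=time.sleep):
    """Send SIGQUIT (what `kill -3` sends) to `pid`, `count` times.

    The JVM reacts to SIGQUIT by printing a thread dump to its stdout.
    `kill` and `sleep` are parameters only so that the loop can be
    exercised without signalling a real process.
    Returns the number of signals sent.
    """
    sent = 0
    for attempt in range(count):
        kill(pid, signal.SIGQUIT)
        sent += 1
        # Pause between dumps so successive snapshots can differ;
        # no pause is needed after the last one.
        if attempt < count - 1:
            sleep(interval_s)
    return sent
```

Calling `request_thread_dumps(pid)` with the PID of the controller's Java process while the controller is in the high-CPU state should leave several thread dumps in whatever captures the JVM's stdout, ready to be picked up by the SOS report.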

Additionally, you could activate TRACE logging for the controller and then trigger this behavior. Feel free to send the resulting SOS report to the email address on my profile.

ddpolyakov · 2024-03-11T13:19:43Z

Here is my SOS report. I ran kill -3 a few times just after restarting all satellites. Same picture: the controller ate all the CPU.

amykhalskyi · 2024-07-02T11:29:02Z

After updating to 1.27.1 with a MariaDB backend, we still see this issue. Sometimes, after a restart of a LINSTOR satellite or a crash of a node running a satellite, the LINSTOR controller gets stuck with very high CPU consumption and doesn't respond to any command.
I attached a JVM thread dump and a screenshot of perf top:
linstor_stuck_02072024_dump.gz
