linstor-satellite restart leads to linstor-controller overutilization #391

Open
ddpolyakov opened this issue Mar 5, 2024 · 7 comments

@ddpolyakov

Hi! I'm running LINSTOR 1.25.1 with etcd on separate nodes as the database. There are around 100 diskless nodes and 10 storage nodes, with around 1.5K resources in total.
Every time I restart a satellite (any of them), the LINSTOR controller goes mad and eats all available CPU across its threads. Running strace on the controller shows tons of futex calls across the spawned threads:
[pid 1910062] futex(0x7f82495fd77c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910061] futex(0x7f82495fa0c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910060] futex(0x7f82495f8678, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910059] futex(0x7f82495f6a68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910058] futex(0x7f82495f4c98, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910057] futex(0x7f82495f2ed8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910056] futex(0x7f82495f12c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910055] futex(0x7f82495ef518, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910054] futex(0x7f82495ed908, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910053] futex(0x7f82495ebcf8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910052] futex(0x7f82495ea0e8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>

[image: htop output on the controller server during a linstor-satellite restart]

@ghernadi
Contributor

ghernadi commented Mar 5, 2024

Please upgrade to at least 1.26.1 and see if this issue persists. In that version we tried to fix a bug that could lead to this kind of behavior.

@ddpolyakov
Author

Updated to 1.26.2: same behaviour.

@ddpolyakov
Author

And the same thing happens when using a mysqld Galera cluster as the database.

@ddpolyakov
Author

Switching back to H2 seems to resolve the problem.

@ghernadi
Contributor

ghernadi commented Mar 8, 2024

If this is reproducible and you are willing to test this further, can you trigger the controller into this state and poke it a few times with kill -3 <pid_of_controller_java_process> and get me an SOS report? kill -3 causes the JVM to print a thread-dump to its stdout (which is usually captured by journalctl, which is then collected via LINSTOR's SOS report). If possible, run the kill -3 a few times, so we have a chance to see what the Threads are doing.
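
The repeated kill -3 step described above can be scripted. The following is only a minimal sketch: the function name, the default count and the interval are assumptions made for illustration and are not part of LINSTOR. It relies on the fact that kill -3 sends SIGQUIT, which a HotSpot JVM answers by printing a thread dump while it keeps running.

```python
import os
import signal
import time


def request_thread_dumps(pid, count=3, interval_s=5.0,
                         send=os.kill, sleep=time.sleep):
    """Send SIGQUIT (signal 3) to `pid` `count` times, pausing in between.

    `send` and `sleep` are injectable so the helper can be exercised
    without signalling a real process. Returns the number of signals sent.
    """
    sent = 0
    for i in range(count):
        send(pid, signal.SIGQUIT)  # same effect as `kill -3 <pid>`
        sent += 1
        if i < count - 1:
            sleep(interval_s)  # let the threads move on between dumps
    return sent
```

For example, `request_thread_dumps(12345)` (12345 being a placeholder for the controller's Java PID) would request three dumps about five seconds apart. The dumps themselves appear wherever the JVM's stdout is captured, not in the script's output.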

Additionally, you could activate TRACE logging for the controller and then trigger this behavior. Feel free to send the resulting SOS report to the email address on my profile.

@ddpolyakov
Author

ddpolyakov commented Mar 11, 2024

Here is my SOS report. I ran kill -3 a few times just after restarting all satellites. Same picture: the controller ate all the CPU.

@amykhalskyi

After updating to 1.27.1 with a MariaDB backend, we can still see this issue. Sometimes, after a LINSTOR satellite is restarted or a node running a satellite crashes, the LINSTOR controller gets stuck with very high CPU consumption and doesn't respond to any command.
I attached a JVM thread dump and a screenshot of perf top:
linstor_stuck_02072024_dump.gz
perf_top
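
For a first overview of a JVM thread dump like the one attached, it can help to tally the thread states before reading individual stacks. The sketch below assumes the usual HotSpot dump layout, in which each thread has a `java.lang.Thread.State: ...` line. The function name and the sample text are made up for illustration and are not taken from the attached dump.

```python
import re
from collections import Counter

# Matches the state keyword on lines such as
#   java.lang.Thread.State: WAITING (parking)
STATE_RE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")


def count_thread_states(dump_text):
    """Return a Counter mapping thread state to number of threads."""
    return Counter(STATE_RE.findall(dump_text))


# Hypothetical sample, not real controller output.
SAMPLE_DUMP = '''
"worker-1" #21 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
"worker-2" #22 prio=5 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"worker-3" #23 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    for state, n in count_thread_states(SAMPLE_DUMP).most_common():
        print(state, n)
```

Comparing the tallies across several dumps taken a few seconds apart gives a rough indication of whether the same threads stay RUNNABLE, which is what one would expect from a busy loop.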
