Occasionally freezes the whole system #5

Open
allergicapple opened this issue May 5, 2024 · 6 comments

@allergicapple

This service occasionally freezes the whole system for several seconds at a time, with only a few seconds in between before everything freezes again.

The gaps between freezes are just long enough to switch to a TTY console and reboot, in the hope that the freeze does not recur after the reboot.

Of course this is not acceptable, so I stopped and disabled the uksmd service for good.

The following entries repeat in the systemd journal:

$ sudo journalctl

[...]
May 05 19:25:58 cachyos systemd[1]: uksmd.service: Watchdog timeout (limit 30s)!
May 05 19:25:58 cachyos systemd[1]: uksmd.service: Killing process 829 (uksmd) with signal SIGABRT.
May 05 19:25:58 cachyos systemd[1]: uksmd.service: Main process exited, code=killed, status=6/ABRT
May 05 19:25:58 cachyos systemd[1]: uksmd.service: Failed with result 'watchdog'.
May 05 19:25:58 cachyos systemd[1]: uksmd.service: Scheduled restart job, restart counter is at 1.
May 05 19:25:58 cachyos systemd[1]: Starting Userspace KSM helper daemon...
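
To see how often the timeout repeats without reading the whole journal, the output can be limited to the unit and counted. This is a minimal sketch, assuming the unit is named uksmd.service as in the log above; `count_watchdog_timeouts` is a hypothetical helper name, not part of uksmd.

```shell
# Count watchdog timeouts in journal text read from stdin.
# The match string is copied from the log lines above.
# "|| true" keeps the exit status at 0 when nothing matches (grep still prints 0).
count_watchdog_timeouts() {
    grep -c 'uksmd.service: Watchdog timeout' || true
}

# Usage (current boot only; -u limits the output to one unit):
#   journalctl -b -u uksmd.service --no-pager | count_watchdog_timeouts
```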
@ptr1337
Member

ptr1337 commented May 5, 2024

This is really weird and puzzles me. Could you share more information on how this happened, and on which hardware?

@pfactum Do you have an idea how to debug this?

@allergicapple
Author

Yes, it's a bit strange. I am not sure, but I think this has been happening for a few weeks, maybe once or twice a week, and very unpredictably.

@pfactum
Contributor

pfactum commented May 5, 2024

A watchdog timeout means that uksmd was not able to inform systemd that it is alive (https://codeberg.org/pf-kernel/uksmd/src/commit/ec2bfd88585d7b900baaaede1f57566c95e8c506/uksmd.c#L420), so systemd forcibly kills it.

Normally uksmd should send pings every 15 seconds (https://codeberg.org/pf-kernel/uksmd/src/commit/ec2bfd88585d7b900baaaede1f57566c95e8c506/uksmd.c#L501). If it doesn't, it is either not getting scheduled or it is stuck somewhere, maybe while traversing /proc.
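
The 30 s limit in the journal and the 15 s ping interval fit the common convention of pinging at half the watchdog timeout, which systemd passes to the service in microseconds as `WATCHDOG_USEC`. The arithmetic can be sketched as follows; `ping_interval_sec` is a hypothetical helper for illustration, and this does not claim that uksmd derives its interval this way.

```shell
# Half of a watchdog timeout given in microseconds, in whole seconds.
# Assumption: the service follows the "ping at half the timeout" convention.
ping_interval_sec() {
    echo $(( $1 / 2 / 1000000 ))
}

# Usage:
#   ping_interval_sec 30000000
#   systemctl show -p WatchdogUSec --value uksmd.service   # timeout configured for the unit
```

If the interval and the configured timeout look right, the daemon really was silent for the full timeout, which points at scheduling or a blocked call rather than at a misconfigured unit.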

Should this recur, build uksmd with debug symbols and get a core dump before systemd kills it again. Or at least collect /proc/<uksmd_PID>/stack, or check strace.
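
Since there are only a few seconds between freezes, it may help to have the collection step ready as a function. A rough sketch, assuming it is run as root (reading /proc/<pid>/stack usually requires that); `collect_proc_state` and the output path are hypothetical names.

```shell
# Save the kernel stack and scheduling state of one process to a file.
# Files that cannot be read (no root, process already gone) are noted and skipped.
collect_proc_state() {
    pid=$1
    out=$2
    {
        echo "pid=$pid"
        for f in stack wchan status; do
            echo "== /proc/$pid/$f =="
            cat "/proc/$pid/$f" 2>/dev/null || echo "(not readable)"
            echo
        done
    } > "$out"
}

# Usage (as root):
#   collect_proc_state "$(systemctl show -p MainPID --value uksmd.service)" /tmp/uksmd-state.txt
#   strace -f -p "$(systemctl show -p MainPID --value uksmd.service)"
```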

The service itself should not cause system freezes. More likely something is going on on the kernel side. To investigate that, at least check for blocked tasks (echo w | sudo tee /proc/sysrq-trigger, and then dmesg/journalctl -kb), and maybe run perf top.
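
The blocked-task check can be narrowed to the relevant kernel log lines. This is only a sketch: the grep patterns are assumptions about typical kernel wording (hung-task warnings and tasks shown in D state) and may need adjusting for a given kernel version; `filter_blocked_tasks` is a hypothetical helper name.

```shell
# Keep only lines that look like blocked-task reports from kernel log text on stdin.
# Assumption: the kernel prints "blocked for more than N seconds" and/or "state:D".
filter_blocked_tasks() {
    grep -E 'blocked for more than [0-9]+ seconds|state:D' || true
}

# Usage (as root):
#   echo w > /proc/sysrq-trigger     # ask the kernel to dump blocked tasks
#   journalctl -kb --no-pager | filter_blocked_tasks
```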

@allergicapple
Author

Thanks for joining in.

As I said, the whole system locks up. When the watchdog barks, the service is killed and the system becomes responsive again until the service is restarted; then the cycle repeats.
That's how I interpret the situation.
While the system is locked up, no interaction is possible; not even Num Lock reacts.
Can anything be analyzed after the fact?

@pfactum
Contributor

pfactum commented May 6, 2024

I don't think so, but you can also try to collect a vmcore via kdump.

@allergicapple
Author

I'd advocate for closing this issue. I have no problem with not using uksmd, and it seems to affect only my setup.
If it shows up for someone else, we have this ticket for reference.
