Containerd not restarting properly after upgrade to systemd 252.11 on stable #1157
Comments
Thanks for the detailed report! This is a regression in systemd v252.11, fixed in v252.12. Here are some links:
A workaround is to execute |
Understood. Thanks for the quick reaction. We are using these systemd units distributed via ignition as a rather crude solution for now
Is there an estimate as to when systemd v252.12 or higher might hit stable?
@tormath1 is (likely) going to try cherry-picking the commit into stable before v252.12+ arrives. In that case we would aim for the next stable.
@heilerich I'm curious, who's sending the |
@tormath1 You are absolutely right, I know that NVIDIA's container toolkit is doing this, but we also had problems on a cluster that does not have NVIDIA devices. I can't say right now what was killing containerd there. Possibly kata or kubevirt? I would have to ask.
It fixes an issue with systemd service restart when the main process is being killed by a SIGHUP signal. See also: flatcar/Flatcar#1157 Commit-Ref: systemd/systemd-stable@34e834f Signed-off-by: Mathieu Tortuyaux <[email protected]>
@heilerich thanks for the information. Looks like with Nvidia you can specify to restart using |
I can confirm that this works. gpu-operator related problems can be fixed by adding the following to the NVIDIA ClusterPolicy:

    toolkit:
      env:
        - name: RUNTIME_ARGS
          value: --restart-mode=systemd
The fix has been backported to all channels and will be available in the next set of releases around the first week of September (https://github.com/orgs/flatcar/projects/7/views/8). A test case has been added too.
Just as a note for other people hitting this problem: we identified one more culprit causing downtimes related to this bug. The system upgrade component of Rancher / RKE2 can get stuck in a deadlock while upgrading.
Description
We have noticed strange downtimes all over our Kubernetes infrastructure since the recent upgrades to v3510. The root cause seems to be that the containerd service sometimes does not restart properly when it exits. The symptoms are as follows: the containerd systemd unit stays in the active (running) state even after the main process has exited, even though the systemd unit has ExitType set to main and not cgroup. Containers that are already running stay active, but everything else (i.e. Kubernetes, Docker) stops working.
The systemd unit looks like this
Impact
This effectively breaks any environment requiring containerd restarts/reloads such as environments using alternative runtimes that are loaded after the initial boot process e.g. nvidia runtime or kata-containers.
Environment and steps to reproduce
a. Start a container, e.g. docker run -d busybox sleep 9999999
b. Look at systemctl status containerd, make sure the main process and container are running, and note the main PID
c. kill -SIGHUP <containerd-main-pid>
d. Look at systemctl status containerd again and wait for a restart (which will not happen)
Expected behavior
Containerd is restarted as specified in the systemd unit. I have just verified this with a fresh 3374.2.5 VM and containerd restarts as expected.
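For anyone who wants to check a node quickly, the manual steps above could be scripted roughly like this. It is only a sketch: the function names (main_pid, pid_changed, check_restart) and the 30-second default timeout are my own assumptions, not something taken from Flatcar or containerd. It sends SIGHUP to the unit's main process and then polls systemd for a new MainPID.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the default timeout are assumptions.

# Print the MainPID systemd currently tracks for a unit (0 means none).
main_pid() {
  systemctl show --property=MainPID --value "$1"
}

# Succeed only if the new PID is non-empty, non-zero and differs from the old one.
pid_changed() {
  local old="$1" new="$2"
  [ -n "$new" ] && [ "$new" != "0" ] && [ "$new" != "$old" ]
}

# Send SIGHUP to the unit's main process, then poll once per second
# for a new MainPID. Returns 0 on restart, 1 on timeout, 2 if the
# signal could not be sent.
check_restart() {
  local unit="$1" timeout="${2:-30}" old new i=0
  old="$(main_pid "$unit")"
  kill -s HUP "$old" || return 2
  while [ "$i" -lt "$timeout" ]; do
    new="$(main_pid "$unit")"
    if pid_changed "$old" "$new"; then
      echo "restarted: $old -> $new"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "not restarted within ${timeout}s (old PID $old)"
  return 1
}
```

After sourcing the file as root, `check_restart containerd 60` would run the check against containerd. Going by the report above, the timeout branch is what one would expect to see on 3510.2.6, and the restart branch on 3374.2.5.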
Additional information
I am not entirely sure which release introduced this behaviour, since it took me a while to track this down, but it must have happened somewhere between 3374.2.5 and 3510.2.6. It was probably 3510.2.5 though, since it upgraded systemd to 252.11.
I would greatly appreciate it if someone had an idea for a temporary hotfix other than switching to LTS until this is fixed.