You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Issue seen in 202205 branch, but may be also impacting other branches too.
Good case : system-health.service takes ~2s to shutdown. IMO 2s is also too long in this time critical path. But, it is better than bad case which takes ~100s.
Apr 28 07:04:45.463212 str-msn2700-01 INFO systemd[1]: Stopping SONiC system health monitor...
Apr 28 07:04:45.463953 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:04:45.465299 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:04:45.471012 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:04:46.452616 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:04:47.553187 str-msn2700-01 INFO systemd[1]: system-health.service: Succeeded.
Apr 28 07:04:47.553998 str-msn2700-01 INFO systemd[1]: Stopped SONiC system health monitor.
Apr 28 07:04:50.873897 str-msn2700-01 NOTICE admin: Rebooting with /sbin/kexec -e to SONiC-OS-20220531.26 ...
Bad case: system-health.service takes 100s to go down and keeps the system in waiting state. This causes warmreboot to fail as LACP session window of 90s is exceeded and LAGs flap and traffic is dropped.
Apr 28 07:53:44.494314 str-msn2700-01 INFO systemd[1]: Stopping SONiC system health monitor...
Apr 28 07:53:44.511256 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:44.515118 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:44.517966 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:44.538264 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:45.540920 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:56.592671 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:57.594171 str-msn2700-01 NOTICE healthd: message repeated 2 times: [ Caught SIGTERM - exiting...]
Apr 28 07:55:26.634031 str-msn2700-01 WARNING systemd[1]: system-health.service: State 'final-sigterm' timed out. Killing.
Apr 28 07:55:26.634317 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4528 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634317 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4528 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634483 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4544 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634483 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4544 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634629 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4548 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634629 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4548 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634799 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4532 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634799 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4532 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634945 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4551 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634945 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4551 (healthd) with signal SIGKILL.
Apr 28 07:55:26.635088 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4852 (healthd) with signal SIGKILL.
Apr 28 07:55:26.635088 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4852 (healthd) with signal SIGKILL.
Apr 28 07:55:26.646842 str-msn2700-01 WARNING systemd[1]: system-health.service: Failed with result 'timeout'.
Apr 28 07:55:26.648759 str-msn2700-01 INFO systemd[1]: Stopped SONiC system health monitor.
Apr 28 07:55:28.882753 str-msn2700-01 NOTICE admin: Rebooting with /sbin/kexec -e to SONiC-OS-20220531.26 ...
Describe the results you expected:
system-health.service should go down as fast as possible (~0.1s).
Output of show version:
(paste your output here)
Output of show techsupport:
(paste your output here or download and attach the file here )
Additional information you deem important (e.g. issue happens only occasionally):
vaibhavhd
changed the title
system-health.service cannot be killed with SIGTERM : causes warmreboot to fail
system-health.service takes 100s to go down and delays warm shutdown causing LAGs flap
May 5, 2023
Description
Steps to reproduce the issue:
PR causing it: #4835
Describe the results you received:
Issue seen in 202205 branch, but may be also impacting other branches too.
Good case :
system-health.service
takes ~2s to shutdown. IMO 2s is also too long in this time critical path. But, it is better than bad case which takes ~100s.Bad case:
system-health.service
takes 100s to go down and keeps the system in waiting state. This causes warmreboot to fail as LACP session window of 90s is exceeded and LAGs flap and traffic is dropped.Describe the results you expected:
system-health.service
should go down as fast as possible (~0.1s).Output of
show version
:Output of
show techsupport
:Additional information you deem important (e.g. issue happens only occasionally):
syslog_healthd_fail.log
The text was updated successfully, but these errors were encountered: