Possibility of Teamd can start before swss restart #5999

abdosi · 2020-11-23T03:14:53Z

This issue was found on a multi-ASIC platform.
Teamd was started before swss restarted. On its init, swss cleaned up
the State DB entries populated by teamd, which left the system in a bad state
from which it could not recover on its own.

The root cause of the above condition is the Feature Table handling done in hostcfgd.
The Feature Table handling does not know the state of swss when it starts teamd (or syncd),
so it can start teamd (or syncd) while swss is about to be stopped (because of crash/error handling) but has not completely stopped and is still active.

Issue flow during init:

• Linux starts the swss/teamd/syncd/hostcfgd services.
• Syncd crashes.
• Syncd/swss/teamd are being stopped (the services are still active).
• Hostcfgd processes the Feature Table in parallel and starts teamd, which populates the State DB.
• Linux finally stops and restarts swss, which cleans up the State DB entries populated in the previous step.
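
The ordering problem in the flow above can be illustrated with a toy model. This is only a sketch: the function name, the step names, and the State DB key are made up for illustration and are not taken from the SONiC code, and the flush is simplified to clearing everything (the real swss init flushes only part of the State DB).

```python
# Toy model of the race. All names here are hypothetical, not SONiC APIs.

def run(steps):
    """Replay a sequence of service actions and return (teamd_running, state_db)."""
    state_db = {}
    teamd_running = False
    for step in steps:
        if step == "start_teamd":
            # teamd publishes its state once it starts (illustrative key).
            state_db["LAG_TABLE|PortChannel0001"] = "ok"
            teamd_running = True
        elif step == "restart_swss":
            # swss flushes state on init (simplified to a full clear).
            state_db.clear()
    return teamd_running, state_db

# Order seen in this issue: hostcfgd starts teamd, then swss restarts.
bad = run(["start_teamd", "restart_swss"])
# Intended order: swss restarts first, then teamd starts.
good = run(["restart_swss", "start_teamd"])

print("bad order: ", bad)
print("good order:", good)
```

In the bad order teamd is still running but the entries it wrote are gone, and nothing in this model writes them again, which matches the "could not recover on its own" symptom.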

Swss2 gets notified of orchagent being killed:

Nov 19 01:41:30.843894 STG02-0101-0102-01T1 INFO swss2#supervisor-proc-exit-listener: Process orchagent exited unxepectedly. Terminating supervisor...

Teamd2 is started by hostcfgd (swss2 is still active):

Nov 19 01:41:51.649500 STG02-0101-0102-01T1 INFO systemd[1]: Reloading.
Nov 19 01:41:51.733279 STG02-0101-0102-01T1 INFO hostcfgd: Running cmd: 'sudo systemctl start [email protected]'
Nov 19 01:41:51.744685 STG02-0101-0102-01T1 INFO systemd[1]: Starting TEAMD container...

The swss2 service is finally stopped and restarted:

Nov 19 01:42:14.064349 STG02-0101-0102-01T1 INFO systemd[1]: [email protected]: Service hold-off time over, scheduling restart.
Nov 19 01:42:14.064899 STG02-0101-0102-01T1 INFO systemd[1]: Stopped switch state service.
Nov 19 01:42:14.065865 STG02-0101-0102-01T1 INFO systemd[1]: Starting switch state service...
Nov 19 01:42:14.071078 STG02-0101-0102-01T1 NOTICE root: Starting swss2 service...
Nov 19 01:42:14.075242 STG02-0101-0102-01T1 NOTICE root: Locking /tmp/swss-syncd-lock2 from swss2 service
Nov 19 01:42:14.080174 STG02-0101-0102-01T1 NOTICE root: Locked /tmp/swss-syncd-lock2 (10) from swss2 service
Nov 19 01:42:14.443526 STG02-0101-0102-01T1 NOTICE root: Warm boot flag: swss2 false.
Nov 19 01:42:14.447624 STG02-0101-0102-01T1 NOTICE root: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases ...
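
To make the size of the race window concrete, the gaps between the three key log lines above can be computed directly. This is a small sketch that only uses the time-of-day fields copied from those lines (all on Nov 19); the variable names are illustrative.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"  # time-of-day portion of the syslog timestamps

# Timestamps copied from the log lines quoted in this issue.
orchagent_exit = datetime.strptime("01:41:30.843894", FMT)  # supervisor sees orchagent exit
teamd_start = datetime.strptime("01:41:51.733279", FMT)     # hostcfgd runs systemctl start
swss_flush = datetime.strptime("01:42:14.447624", FMT)      # swss2 flushes the databases

exit_to_teamd = teamd_start - orchagent_exit
teamd_to_flush = swss_flush - teamd_start

print("orchagent exit -> teamd start:", exit_to_teamd)
print("teamd start -> swss flush:    ", teamd_to_flush)
```

So teamd was started roughly 21 seconds after swss2 had already begun going down, and its State DB entries were flushed roughly 23 seconds after that.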

This issue is more prominent on multi-ASIC platforms, since there are more services to take action on,
but it can occur on single-ASIC platforms as well.

abdosi added the Request for 201911 Branch label Nov 23, 2020

rlhui added the Issue for 201911 label Nov 23, 2020

abdosi mentioned this issue Nov 23, 2020

Enhanced Feature table to support 'always_enabled' value for state and auto-restart fields. #6000

Merged

abdosi linked a pull request Nov 25, 2020 that will close this issue

Enhanced Feature table to support 'always_enabled' value for state and auto-restart fields. #6000

Merged

rlhui assigned abdosi Nov 25, 2020

abdosi closed this as completed in #6000 Nov 25, 2020
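
For readers following the fix, here is a rough sketch of how an 'always_enabled' state could keep hostcfgd from starting or stopping a service such as teamd. It is an assumption based only on the title of #6000, not the merged code: the function name, the entry layout, the other state values, and the command strings are hypothetical.

```python
# Hypothetical sketch, not the actual hostcfgd implementation from #6000.

def commands_for_feature(name, entry):
    """Return the systemctl commands a Feature Table handler might run."""
    state = entry.get("state")
    if state == "always_enabled":
        # Leave the service entirely to systemd and its dependency ordering,
        # so the handler can never start it behind a pending swss restart.
        return []
    if state == "enabled":
        return [f"sudo systemctl start {name}.service"]
    if state == "disabled":
        return [f"sudo systemctl stop {name}.service"]
    raise ValueError(f"unexpected state {state!r} for feature {name}")

# Illustrative entries; real Feature Table contents may differ.
print(commands_for_feature("teamd", {"state": "always_enabled"}))
print(commands_for_feature("example_feature", {"state": "enabled"}))
```

With teamd marked this way, the handler issues no start command for it, so the "Running cmd: 'sudo systemctl start ...'" step from the logs would not happen in this model.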
