warmboot cancelled - vxlanmgrd warm-start reconciliation failed in previous warmboot #6772

Closed
vaibhavhd opened this issue Feb 11, 2021 · 4 comments

Comments

@vaibhavhd
Contributor

Description

After warm-reboot:

  1. Processes start doing warmstart.
  2. The common warm-start states are initialized -> replayed -> reconciled
  3. vxlanmgrd is one such process, but it stays in the replayed state for more than 10 minutes, waiting for orchagent to reconcile.
  4. WARMBOOT_FINALIZER reports that orchagent is not reconciled, but goes ahead with Finalizing warmboot.
  5. After some time, a new warm reboot is issued, but the RESTARTCHECK times out on all 5 (maximum allowed) retries because orchagent was never reconciled after the previous warmboot.
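
The sequence above can be sketched as a small model. This is only an illustration of steps 2 to 4, not the SONiC implementation: the names `WarmStartState`, `finalize`, and `watched` are hypothetical, and the orchagent state used in the snapshot is a placeholder because the logs do not show which state it was stuck in.

```python
from enum import Enum


class WarmStartState(Enum):
    # The common warm-start states, in the order a process moves through them.
    INITIALIZED = "initialized"
    REPLAYED = "replayed"
    RECONCILED = "reconciled"


def finalize(states, watched):
    """Report watched components that are not reconciled, then finalize anyway.

    states  -- dict mapping component name to WarmStartState (hypothetical table)
    watched -- component names the finalizer checks (hypothetical list)
    """
    pending = [name for name in watched
               if states.get(name) is not WarmStartState.RECONCILED]
    messages = []
    if pending:
        messages.append("Some components didn't finish reconcile: " + " ".join(pending))
    # Step 4: finalization proceeds even when something is still pending.
    messages.append("Finalizing warmboot...")
    return messages


# Placeholder snapshot: vxlanmgrd sits in "replayed" while it waits for
# orchagent; the orchagent state here is assumed, not taken from the logs.
states = {
    "orchagent": WarmStartState.INITIALIZED,
    "vxlanmgrd": WarmStartState.REPLAYED,
}
for line in finalize(states, watched=["orchagent"]):
    print(line)
```

The point of the sketch is that finalization does not clear the unreconciled state, so it is still there when the next warm reboot is requested.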

Steps to reproduce the issue:

  1. Run test_cont_warm_reboot (the error was seen on KVM).
  2. The issue is seen after 40 successful iterations.
  3. Check syslog after the failure: the warmboot failed because the orchagent (OA) RESTARTCHECK failed.

Describe the results you received:

The error was caught by test_cont_warm_reboot in the KVM test. Artifacts are here: https://dev.azure.com/mssonic/build/_build/results?buildId=3698&view=artifacts&pathAsName=false&type=publishedArtifacts

The failure occurred on the 42nd iteration.


Feb 11 13:32:53.221127 vlab-01 NOTICE swss#orchagent: :- checkWarmStart: orchagent doing warm start, restore count 40

Feb 11 13:32:56.663604 vlab-01 INFO swss#supervisord 2021-02-11 13:32:56,647 INFO spawned: 'vxlanmgrd' with pid 147
Feb 11 13:32:56.786068 vlab-01 NOTICE swss#vxlanmgrd: :- main: --- Starting vxlanmgrd ---
Feb 11 13:32:56.786622 vlab-01 NOTICE swss#vxlanmgrd: :- checkWarmStart: vxlanmgrd doing warm start, restore count 40
Feb 11 13:32:56.793019 vlab-01 NOTICE swss#vxlanmgrd: :- setWarmStartState: vxlanmgrd warm start state changed to initialized
Feb 11 13:32:57.649618 vlab-01 INFO swss#supervisord 2021-02-11 13:32:57,648 INFO success: vxlanmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Feb 11 13:33:02.033737 vlab-01 NOTICE swss#vxlanmgrd: :- setWarmStartState: vxlanmgrd warm start state changed to replayed
Feb 11 13:33:02.034131 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 0 secs
Feb 11 13:33:03.037357 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 1 secs
Feb 11 13:33:04.042161 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 2 secs
..
Feb 11 13:38:06.515441 vlab-01 NOTICE root: WARMBOOT_FINALIZER : Some components didn't finish reconcile: orchagent ...
Feb 11 13:38:06.523556 vlab-01 NOTICE root: WARMBOOT_FINALIZER : Finalizing warmboot...
Feb 11 13:38:07.124812 vlab-01 INFO systemd[1]: warmboot-finalizer.service: Succeeded.
..
Feb 11 13:41:54.584331 vlab-01 NOTICE admin: Saving counters folder before warmboot...
Feb 11 13:41:58.865214 vlab-01 NOTICE swss#orchagent_restart_check: :- main: Wait time for response from orchagent set to 2000 milliseconds
Feb 11 13:41:58.865214 vlab-01 NOTICE swss#orchagent_restart_check: :- main: Number of retries for the request to orchagent is set to 5
Feb 11 13:41:58.868188 vlab-01 INFO swss#orchagent_restart_check: :- subscribe: subscribed to RESTARTCHECKREPLY
Feb 11 13:42:06.910388 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK for  timed out
Feb 11 13:42:06.919690 vlab-01 NOTICE swss#orchagent_restart_check: :- main: requested orchagent to do warm restart state check, retry count: 4
Feb 11 13:42:07.266429 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 543 secs
Feb 11 13:42:08.267817 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 544 secs
Feb 11 13:42:08.921712 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK for  timed out
Feb 11 13:42:08.925296 vlab-01 NOTICE swss#orchagent_restart_check: :- main: requested orchagent to do warm restart state check, retry count: 5
Feb 11 13:42:09.269244 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 545 secs
Feb 11 13:42:10.270319 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 546 secs
Feb 11 13:42:10.924137 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK for  timed out

Feb 11 13:42:11.137017 vlab-01 NOTICE admin: warm-reboot failure (0) cleanup ...
..
..
Feb 11 13:46:44.864286 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 819 secs
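
To measure how long vxlanmgrd has been blocked, the wait messages in a syslog excerpt like the one above can be parsed. A minimal sketch, assuming the message format shown in this log; the helper name `longest_wait` is made up for illustration.

```python
import re

# Matches the vxlanmgrd wait message as it appears in the syslog above.
WAIT_RE = re.compile(
    r"vxlanmgrd: :- main: Waiting Until Orchagent is reconciled\. "
    r"Current (\d+)\. Waited (\d+) secs"
)


def longest_wait(lines):
    """Return (restore_count, seconds) for the longest wait seen, or None."""
    best = None
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            count, secs = int(m.group(1)), int(m.group(2))
            if best is None or secs > best[1]:
                best = (count, secs)
    return best


sample = [
    "Feb 11 13:33:02.034131 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 0 secs",
    "Feb 11 13:42:10.270319 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 546 secs",
    "Feb 11 13:46:44.864286 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 819 secs",
]
print(longest_wait(sample))  # (40, 819)
```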

Describe the results you expected:

All services should reconcile after a warm start, and the orchagent RESTARTCHECK should not fail when the next warmboot is issued.

Output of show version:

SONiC-OS-HEAD.0-11937d37

Additional information you deem important (e.g. issue happens only occasionally):

@prsunny
Contributor

prsunny commented Feb 12, 2021

@srj102, can you please take a look at this issue? It was likely caused by sonic-net/sonic-swss#1266. You could also check the recent discussion on sonic-net/sonic-swss#1556 (comment) and remove this orchagent dependency.

@prsunny
Contributor

prsunny commented Feb 19, 2021

The fix is part of sonic-net/sonic-swss#1647.

@vaibhavhd
Contributor Author

Thanks @prsunny. But I think 1647 will not solve the warm reboot issue captured here. The change will let vxlanmgr stop waiting for orchagent reconciliation. But if orchagent is still not reconciled, won't the RESTARTCHECK still fail when the warm-reboot is issued? I think it will still fail with:

Feb 11 13:42:08.925296 vlab-01 NOTICE swss#orchagent_restart_check: :- main: requested orchagent to do warm restart state check, retry count: 5
Feb 11 13:42:09.269244 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 545 secs
Feb 11 13:42:10.270319 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 546 secs
Feb 11 13:42:10.924137 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK for  timed out
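
The concern can be illustrated with a rough model of the restart check. This is a sketch under assumptions, not the orchagent_restart_check code: `restart_check` and `wait_for_reply` are hypothetical names, the defaults (5 retries, 2000 ms) are the values printed in the log, and the "one initial request plus retries" shape is inferred from the roughly 12 seconds between the subscribe message and the last timeout.

```python
def restart_check(wait_for_reply, retries=5, wait_ms=2000):
    """Ask for a warm-restart state check, retrying on timeout.

    wait_for_reply(wait_ms) is a hypothetical callable that returns True if a
    reply arrived within wait_ms milliseconds and False on timeout.
    Returns (ok, ms_spent_in_timed_out_waits).
    """
    waited_ms = 0
    for _ in range(retries + 1):  # initial request + retries (assumed)
        if wait_for_reply(wait_ms):
            return True, waited_ms
        waited_ms += wait_ms
    return False, waited_ms


# If orchagent never answers, every attempt times out, no matter whether
# vxlanmgrd is still waiting on it or not.
never_replies = lambda wait_ms: False
print(restart_check(never_replies))  # (False, 12000)
```

In this model the outcome depends only on whether orchagent answers the request, which is why removing the wait in vxlanmgr alone may not change the result.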

Thoughts?

@prsunny
Contributor

prsunny commented Mar 8, 2021

Fix merged

@prsunny prsunny closed this as completed Mar 8, 2021