Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple LAG flaps are noticed after warm reboot #6773

Closed
vaibhavhd opened this issue Feb 11, 2021 · 5 comments
Closed

Multiple LAG flaps are noticed after warm reboot #6773

vaibhavhd opened this issue Feb 11, 2021 · 5 comments

Comments

@vaibhavhd
Copy link
Contributor

Description

Warmboot proceeds as expected, and LAG interfaces and members are also added as expected.
After WARMBOOT_FINALIZER finishes, there is a system reload! (need to confirm and RCA this)

Steps to reproduce the issue:

  1. Run warm reboot test.
  2. Test fails consistently reporting 2(or more) LAG flaps.

Describe the results you received:

Test failure log:

FAILED:10.64.247.192:LAG flapped 4 times on 10.64.247.192 after warm boot
FAILED:10.64.247.193:LAG flapped 2 times on 10.64.247.193 after warm boot
FAILED:10.64.247.194:LAG flapped 2 times on 10.64.247.194 after warm boot
FAILED:10.64.247.195:LAG flapped 2 times on 10.64.247.195 after warm boot


FAILED:dut:Total downtime period must be less then 0:00:30 seconds. It was 33.2363586426

Syslog:

Feb 11 19:57:13.918974 str-s6100-acs-2 NOTICE admin: Warm boot flag: teamd true.
Feb 11 19:57:41.550084 str-s6100-acs-2 NOTICE admin: Rebooting with /sbin/kexec -e to SONiC-OS-HEAD.375-d8fe592e ...
Feb 11 19:58:24.503949 str-s6100-acs-2 NOTICE admin: Warm boot flag: teamd false.

Feb 11 19:59:42.602570 str-s6100-acs-2 NOTICE teamd#teamsyncd: :- setWarmStartState: teamsyncd warm start state changed to reconciled
Feb 11 19:59:42.624869 str-s6100-acs-2 NOTICE teamd#tlm_teamd: :- try_add_lag: The LAG 'PortChannel0004' has been added.
Feb 11 19:59:42.625762 str-s6100-acs-2 NOTICE teamd#tlm_teamd: :- try_add_lag: The LAG 'PortChannel0001' has been added.
Feb 11 19:59:42.628059 str-s6100-acs-2 NOTICE teamd#tlm_teamd: :- try_add_lag: The LAG 'PortChannel0002' has been added.

Feb 11 20:00:45.704651 str-s6100-acs-2 NOTICE root: WARMBOOT_FINALIZER : Finalizing warmboot...
Feb 11 20:00:46.369254 str-s6100-acs-2 INFO systemd[1]: warmboot-finalizer.service: Succeeded.
Feb 11 20:00:46.370578 str-s6100-acs-2 INFO systemd[1]: Started Monitor warm recovery and disable warmboot when done.
Feb 11 20:00:46.371863 str-s6100-acs-2 INFO systemd[1]: Reached target Multi-User System.
Feb 11 20:00:46.372091 str-s6100-acs-2 INFO systemd[1]: Reached target Graphical Interface.
Feb 11 20:00:46.374877 str-s6100-acs-2 INFO systemd[1]: Starting Update UTMP about System Runlevel Changes...
Feb 11 20:00:46.396380 str-s6100-acs-2 INFO systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Feb 11 20:00:46.398098 str-s6100-acs-2 INFO systemd[1]: Started Update UTMP about System Runlevel Changes.
Feb 11 20:00:46.409865 str-s6100-acs-2 INFO systemd[1]: Startup finished in 3.258s (kernel) + 2min 59.267s (userspace) = 3min 2.525s.
Feb 11 20:00:46.412366 str-s6100-acs-2 INFO /hostcfgd: systemctl has finished initialization -- proceeding ...
Feb 11 20:00:46.424908 str-s6100-acs-2 INFO /hostcfgd: Running cmd: 'sudo systemctl unmask syncd.service'
Feb 11 20:00:46.468116 str-s6100-acs-2 INFO systemd[1]: Reloading.

Feb 11 20:00:56.088052 str-s6100-acs-2 INFO /hostcfgd: Running cmd: 'sudo systemctl unmask teamd.service'
Feb 11 20:00:56.609136 str-s6100-acs-2 INFO /hostcfgd: Running cmd: 'sudo systemctl enable teamd.service'
Feb 11 20:00:56.628581 str-s6100-acs-2 INFO hostcfgd[1586]: Created symlink /etc/systemd/system/multi-user.target.wants/teamd.service → /lib/systemd/system/teamd.service.
Feb 11 20:00:57.134328 str-s6100-acs-2 INFO /hostcfgd: Running cmd: 'sudo systemctl start teamd.service'
Feb 11 20:00:57.159854 str-s6100-acs-2 INFO /hostcfgd: Feature 'teamd.service' is enabled and started

Describe the results you expected:

LAGs should not flap after warm reboot.

Output of show version:

# show version 

SONiC Software Version: SONiC.HEAD.375-d8fe592e
Distribution: Debian 10.8
Kernel: 4.19.0-12-2-amd64
Build commit: d8fe592e
Build date: Wed Feb 10 20:38:40 UTC 2021
Built by: johnar@jenkins-worker-22

Platform: x86_64-dell_s6100_c2538-r0
HwSKU: Force10-S6100
[syslog.2.gz](https://github.com/Azure/sonic-buildimage/files/5968607/syslog.2.gz)
[teamd.log](https://github.com/Azure/sonic-buildimage/files/5968608/teamd.log)


Additional information you deem important (e.g. issue happens only occasionally):

@vaibhavhd
Copy link
Contributor Author

Just tested the latest of 202012 and the issue is not seen there. I think this has always been only master branch issue, and I was mistaken to tag 202012 here.

@vaibhavhd
Copy link
Contributor Author

This is now a priority issue as some of the other SAI issues seem to have been solved.
Unfortunately new tests have revealed that this issue is also seen on 202012 branch images. Also, this is seen on TD3 platform as well.

The first warmboot goes through and no lag-flaps are noticed. However, after/during second warm-reboot, LAG flaps are pretty consistently noticed.

I will post an update shortly.

@vaibhavhd
Copy link
Contributor Author

The LAGs flap due to swss restart as a SELECT TIMEOUT failure is seen in INIT_VIEW processing

It looks like Orchagent is not reconciled and syncd did not respond within 60s. This led to to wait: select operation result: timeout on notify. When swss restarts, LAGs flap, and the test failed.

Mar 15 19:17:48.509796 str-s6100-acs-2 NOTICE teamd#teammgrd: :- addLag: Start port channel PortChannel0001 with teamd
Mar 15 19:18:03.633781 str-s6100-acs-2 NOTICE teamd#teammgrd: :- addLagMember: Add Ethernet0 to port channel PortChannel0001
Mar 15 19:18:03.817849 str-s6100-acs-2 NOTICE teamd#teammgrd: :- addLagMember: Add Ethernet1 to port channel PortChannel0001
Mar 15 19:18:11.986819 str-s6100-acs-2 NOTICE swss#orchagent: :- addLag: Create an empty LAG PortChannel0001 lid:2000000000fe4
Mar 15 19:18:11.995548 str-s6100-acs-2 NOTICE swss#orchagent: :- addLagMember: Add member Ethernet0 to LAG PortChannel0001 lid:2000000000fe4 pid:1000000000034
Mar 15 19:18:12.001086 str-s6100-acs-2 NOTICE swss#orchagent: :- addLagMember: Add member Ethernet1 to LAG PortChannel0001 lid:2000000000fe4 pid:1000000000035

Mar 15 19:21:00.342307 str-s6100-acs-2 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 3. Waited 93 secs
Mar 15 19:21:01.342780 str-s6100-acs-2 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 3. Waited 94 secs
Mar 15 19:21:02.343151 str-s6100-acs-2 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 3. Waited 95 secs

Mar 15 19:21:03.002016 str-s6100-acs-2 ERR swss#orchagent: :- wait: select operation result: timeout on notify
Mar 15 19:21:03.002250 str-s6100-acs-2 ERR swss#orchagent: :- wait: failed to get response for notify
Mar 15 19:21:03.002551 str-s6100-acs-2 ERR swss#orchagent: :- initSaiRedis: Failed to notify syncd INIT_VIEW, rv:-1 gSwitchId 0
Mar 15 19:21:03.002740 str-s6100-acs-2 NOTICE swss#orchagent: :- uninitialize: begin
Mar 15 19:21:03.003017 str-s6100-acs-2 NOTICE swss#orchagent: :- uninitialize: begin
Mar 15 19:21:03.003079 str-s6100-acs-2 NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread begin

Mar 15 19:23:14.767360 str-s6100-acs-2 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel0001
Mar 15 19:24:34.242938 str-s6100-acs-2 NOTICE teamd#teammgrd: :- addLag: Start port channel PortChannel0001 with teamd
Mar 15 19:25:11.326954 str-s6100-acs-2 NOTICE teamd#teammgrd: :- addLagMember: Add Ethernet0 to port channel PortChannel0001
Mar 15 19:25:11.437774 str-s6100-acs-2 NOTICE teamd#teammgrd: :- addLagMember: Add Ethernet1 to port channel PortChannel0001
Mar 15 19:25:14.451037 str-s6100-acs-2 NOTICE swss#orchagent: :- addLag: Create an empty LAG PortChannel0001 lid:2000000000b5f
Mar 15 19:25:14.930928 str-s6100-acs-2 NOTICE swss#orchagent: :- addLagMember: Add member Ethernet0 to LAG PortChannel0001 lid:2000000000b5f pid:1000000000034
Mar 15 19:25:14.953906 str-s6100-acs-2 NOTICE swss#orchagent: :- addLagMember: Add member Ethernet1 to LAG PortChannel0001 lid:2000000000b5f pid:1000000000035

@vaibhavhd
Copy link
Contributor Author

From the logs, it also appears that vxlanmgrd is stuck for orchagent to reconcile. I tested using the fix provided in sonic-net/sonic-swss#1647 and the INIT_VIEW operation succeeded and the warm-reboot proceeded to APPLY_VIEW.
When a new image with sonic-swss changes is ready, I will test this again. Meanwhile, it is not clear why OA was not reconciled, and syncd did not responded in 60s (leading to TIMEOUT).

@vaibhavhd
Copy link
Contributor Author

This issue is not seen in the latest image in the manner it was seen before. However, LAGs still flap after warm-reboot. But, that is due to swss restart after warm-reboot due to SAI issue in syncd preshutdown. The new issue is tracked separately at

#7071

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant