dual tor: Initialization Loop Detected Among Gemini Components #6761

Closed
tahmed-dev opened this issue Feb 10, 2021 · 2 comments
Labels
Issue for 202012 Triaged this issue has been triaged

tahmed-dev commented Feb 10, 2021

Description

Dual ToR initialization requires the muxcable to be active on one ToR and standby on the peer ToR when the mux is healthy.
When orchagent goes bad, as seen in the lab due to a hardware vendor SAI bug, it fails to create or tear down tunnels and sets the overall mux state to unknown. During initialization, linkmgrd reads the overall mux state and, when it finds unknown (no tunnel created), probes xcvrd for the current mux state. Xcvrd reports either active or standby, so linkmgrd switches the mux state to match xcvrd. However, because of the hardware SAI bug, orchagent fails to switch the mux, the state remains unknown, and the loop continues.
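
The loop described above can be sketched as a toy model. This is not linkmgrd or orchagent code: `FakeOrchagent`, `init_mux`, and `max_rounds` are hypothetical names introduced only to illustrate why the state never settles while orchagent cannot switch the mux.

```python
from dataclasses import dataclass


@dataclass
class FakeOrchagent:
    """Hypothetical stand-in for orchagent; healthy=False models the SAI failure."""
    healthy: bool
    state_db_mux_state: str = "unknown"

    def set_mux_state(self, target: str) -> None:
        # A healthy orchagent switches the mux and records the new state.
        # A broken one leaves the overall mux state as "unknown".
        if self.healthy:
            self.state_db_mux_state = target


def init_mux(orch: FakeOrchagent, xcvrd_state: str, max_rounds: int = 5) -> int:
    """Return the number of probe/set rounds used, or -1 if the state never settles."""
    for round_no in range(1, max_rounds + 1):
        if orch.state_db_mux_state != "unknown":
            return round_no - 1
        # linkmgrd sees "unknown", probes xcvrd, then asks for a matching state.
        orch.set_mux_state(xcvrd_state)
    return -1 if orch.state_db_mux_state == "unknown" else max_rounds
```

With a healthy orchagent the model settles after one round; with a broken one it keeps probing and setting until the (assumed) round cap is reached, which mirrors the unbounded loop seen in the logs.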

A side problem was also noticed. Once the BRCM SAI was fixed and orchagent succeeded in switching the mux state, communication between xcvrd and orchagent continued for ~30 min. It is not clear whether backlogged requests were being serviced or what else caused this communication.

Other Observations:

  1. Restarting the swss service did not recover from the issue, since pmon was not restarted.
  2. Backlogged requests should not occur, since linkmgrd does not start a new sequence before the current one completes.

Steps to reproduce the issue:

  1. Load a Gemini image from after 1/26 on ToRs with a complete set of muxcables
  2. Reboot the ToR or run config reload
  3. Notice the initialization loop in the syslog and in swss.rec
  4. Fix the swss ipinip issue, and notice the communication between xcvrd and orchagent
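
To spot the loop in step 3 without reading the syslog by eye, a small script can count how often each port reports an unknown state DB mux state. This is a rough sketch: the regular expression is based on the message format in the sample logs in this issue, while the function names and the `threshold` value are assumptions made for illustration.

```python
import re
from collections import Counter

# Matches the linkmgrd message seen in the sample logs, e.g.
# "addOrUpdateMuxPortMuxState: Ethernet76: state db mux state: unknown"
UNKNOWN_RE = re.compile(
    r"addOrUpdateMuxPortMuxState: (Ethernet\d+): state db mux state: unknown"
)


def count_unknown_reports(lines):
    """Count 'state db mux state: unknown' messages per port."""
    counts = Counter()
    for line in lines:
        match = UNKNOWN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def looping_ports(lines, threshold=2):
    """Ports that reported 'unknown' at least `threshold` times (assumed cutoff)."""
    counts = count_unknown_reports(lines)
    return sorted(port for port, n in counts.items() if n >= threshold)
```

Feeding it the decompressed syslog lines (for example from `gzip.open(path, "rt")`) gives a per-port count; a port that keeps reappearing is likely stuck in the loop.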

Describe the results you received:
Init loop

Describe the results you expected:
There should be no init loop

Additional information you deem important (e.g. issue happens only occasionally):
Sample logs:

/var/log/syslog.54.gz:Feb  5 23:50:13.770607 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:66 setMuxState: Ethernet52: setting mux to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.770681 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:167 handleSetMuxState: Ethernet52: setting mux state to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.776129 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:134 addOrUpdateMuxPortMuxState: Ethernet76: state db mux state: unknown
/var/log/syslog.54.gz:Feb  5 23:50:13.776210 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:84 probeMuxState: Ethernet76
/var/log/syslog.54.gz:Feb  5 23:50:13.780099 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:180 processProbeMuxState: Ethernet60: app db mux state: standby
/var/log/syslog.54.gz:Feb  5 23:50:13.780193 BN9-0101-0301-01LT0 INFO linkmgrd: link_manager/LinkManagerStateMachine.cpp:408 handleProbeMuxStateNotification: Ethernet60: Initializing MUX state 'Standby' to match xcvrd state
/var/log/syslog.54.gz:Feb  5 23:50:13.780254 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:66 setMuxState: Ethernet60: setting mux to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.780335 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:167 handleSetMuxState: Ethernet60: setting mux state to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.785580 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:134 addOrUpdateMuxPortMuxState: Ethernet84: state db mux state: unknown
/var/log/syslog.54.gz:Feb  5 23:50:13.785639 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:84 probeMuxState: Ethernet84
/var/log/syslog.54.gz:Feb  5 23:50:13.789881 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:180 processProbeMuxState: Ethernet68: app db mux state: standby
/var/log/syslog.54.gz:Feb  5 23:50:13.789966 BN9-0101-0301-01LT0 INFO linkmgrd: link_manager/LinkManagerStateMachine.cpp:408 handleProbeMuxStateNotification: Ethernet68: Initializing MUX state 'Standby' to match xcvrd state
/var/log/syslog.54.gz:Feb  5 23:50:13.790022 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:66 setMuxState: Ethernet68: setting mux to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.790093 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:167 handleSetMuxState: Ethernet68: setting mux state to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.796877 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:134 addOrUpdateMuxPortMuxState: Ethernet0: state db mux state: unknown
/var/log/syslog.54.gz:Feb  5 23:50:13.797048 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:84 probeMuxState: Ethernet0
/var/log/syslog.54.gz:Feb  5 23:50:13.800972 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:180 processProbeMuxState: Ethernet76: app db mux state: standby
/var/log/syslog.54.gz:Feb  5 23:50:13.801041 BN9-0101-0301-01LT0 INFO linkmgrd: link_manager/LinkManagerStateMachine.cpp:408 handleProbeMuxStateNotification: Ethernet76: Initializing MUX state 'Standby' to match xcvrd state
/var/log/syslog.54.gz:Feb  5 23:50:13.801089 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:66 setMuxState: Ethernet76: setting mux to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.801163 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:167 handleSetMuxState: Ethernet76: setting mux state to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.806510 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:134 addOrUpdateMuxPortMuxState: Ethernet104: state db mux state: unknown
/var/log/syslog.54.gz:Feb  5 23:50:13.806594 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:84 probeMuxState: Ethernet104
/var/log/syslog.54.gz:Feb  5 23:50:13.810597 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:180 processProbeMuxState: Ethernet84: app db mux state: standby
/var/log/syslog.54.gz:Feb  5 23:50:13.810685 BN9-0101-0301-01LT0 INFO linkmgrd: link_manager/LinkManagerStateMachine.cpp:408 handleProbeMuxStateNotification: Ethernet84: Initializing MUX state 'Standby' to match xcvrd state
/var/log/syslog.54.gz:Feb  5 23:50:13.810741 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:66 setMuxState: Ethernet84: setting mux to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.810808 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:167 handleSetMuxState: Ethernet84: setting mux state to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.816795 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:134 addOrUpdateMuxPortMuxState: Ethernet4: state db mux state: unknown
/var/log/syslog.54.gz:Feb  5 23:50:13.816901 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:84 probeMuxState: Ethernet4
/var/log/syslog.54.gz:Feb  5 23:50:13.820122 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:180 processProbeMuxState: Ethernet0: app db mux state: standby
/var/log/syslog.54.gz:Feb  5 23:50:13.820210 BN9-0101-0301-01LT0 INFO linkmgrd: link_manager/LinkManagerStateMachine.cpp:408 handleProbeMuxStateNotification: Ethernet0: Initializing MUX state 'Standby' to match xcvrd state
/var/log/syslog.54.gz:Feb  5 23:50:13.820266 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:66 setMuxState: Ethernet0: setting mux to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.820331 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:167 handleSetMuxState: Ethernet0: setting mux state to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.825988 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:134 addOrUpdateMuxPortMuxState: Ethernet52: state db mux state: unknown
/var/log/syslog.54.gz:Feb  5 23:50:13.826083 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:84 probeMuxState: Ethernet52
/var/log/syslog.54.gz:Feb  5 23:50:13.830941 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:180 processProbeMuxState: Ethernet104: app db mux state: standby
/var/log/syslog.54.gz:Feb  5 23:50:13.831026 BN9-0101-0301-01LT0 INFO linkmgrd: link_manager/LinkManagerStateMachine.cpp:408 handleProbeMuxStateNotification: Ethernet104: Initializing MUX state 'Standby' to match xcvrd state
/var/log/syslog.54.gz:Feb  5 23:50:13.831082 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:66 setMuxState: Ethernet104: setting mux to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.831156 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:167 handleSetMuxState: Ethernet104: setting mux state to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.836662 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:134 addOrUpdateMuxPortMuxState: Ethernet60: state db mux state: unknown
/var/log/syslog.54.gz:Feb  5 23:50:13.836743 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:84 probeMuxState: Ethernet60
/var/log/syslog.54.gz:Feb  5 23:50:13.840634 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:180 processProbeMuxState: Ethernet4: app db mux state: standby
/var/log/syslog.54.gz:Feb  5 23:50:13.840719 BN9-0101-0301-01LT0 INFO linkmgrd: link_manager/LinkManagerStateMachine.cpp:408 handleProbeMuxStateNotification: Ethernet4: Initializing MUX state 'Standby' to match xcvrd state
/var/log/syslog.54.gz:Feb  5 23:50:13.840775 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:66 setMuxState: Ethernet4: setting mux to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.840840 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:167 handleSetMuxState: Ethernet4: setting mux state to standby
/var/log/syslog.54.gz:Feb  5 23:50:13.846341 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:134 addOrUpdateMuxPortMuxState: Ethernet68: state db mux state: unknown
/var/log/syslog.54.gz:Feb  5 23:50:13.846532 BN9-0101-0301-01LT0 INFO linkmgrd: DbInterface.cpp:84 probeMuxState: Ethernet68
/var/log/syslog.54.gz:Feb  5 23:50:13.850446 BN9-0101-0301-01LT0 INFO linkmgrd: MuxManager.cpp:180 processProbeMuxState: Ethernet52: app db mux state: standby
/var/log/syslog.54.gz:Feb  5 23:50:13.850525 BN9-0101-0301-01LT0 INFO linkmgrd: 
**Output of `show version`:**

```
(paste your output here)
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```

lguohan commented Feb 12, 2021

  1. orchagent writes unknown|error.
  2. orchagent writes to the state db every time linkmgr writes to the app db.

Note: linkmgr will ignore a db update that carries the same mux_state, and will log it as an unsolicited response.
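
One possible reading of this proposal, as a toy model only: `LinkMgrSketch` and `orchagent_result` are hypothetical names, and using `"error"` as the value written on failure is an assumption based on the `unknown|error` wording above, not the behavior of the actual fix.

```python
class LinkMgrSketch:
    """Hypothetical model of how linkmgr could treat state db updates."""

    def __init__(self) -> None:
        self.mux_state = "unknown"
        self.unsolicited = 0  # updates ignored because the state did not change

    def on_state_db_update(self, new_state: str) -> bool:
        """Return True if the update changed the tracked mux state."""
        if new_state == self.mux_state:
            # Same mux_state as before: ignore it and count it as unsolicited.
            self.unsolicited += 1
            return False
        self.mux_state = new_state
        return True


def orchagent_result(requested: str, switch_ok: bool) -> str:
    """Value orchagent would write to the state db for one app db request (assumed)."""
    return requested if switch_ok else "error"
```

Under this reading, every app db write gets a state db answer, so linkmgr can tell a failed switch from a missing response instead of re-probing on an unchanged `unknown`.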

prsunny commented Mar 5, 2021

Fixed by sonic-net/sonic-swss#1662

prsunny closed this as completed Mar 5, 2021