Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libteam][warm-reboot] fix issue in teamd warm-reboot that teamd starts #8227

Merged
merged 4 commits into from
Aug 16, 2021

Conversation

stepanblyschak
Copy link
Collaborator

@stepanblyschak stepanblyschak commented Jul 20, 2021

with state of tdport from previous warm-reboot.

In case LAG was down before reboot, lacp->wr is not cleared.
In lacp_event_watch_port_flush_data we incremented nr_of_tdports and add
tdport to lacp->wr.state. In case lacp->wr.state already had this tdport
we do not set new state for tdport but appened a new item in
lacp->wr.state. In case we preformed warm-reboot and PortChannel member
was down, after reboot PortChannel member became up next warm-reboot
will initialize teamd with PortChannel member in down state.

Example of PortChannel0002 dump with single member Ethernet24 file when
this issue is reproduced:

admin@sonic:~$ sudo cat /host/warmboot/teamd/PortChannel0002
1
4
Ethernet24
0
Ethernet24
1
Ethernet24
1
Ethernet24
1

Fix this issue by calling stop_wr_mode() when LAG was down. This was probably intended but missed.

Signed-off-by: Stepan Blyschak [email protected]

Why I did it

To fix an issue seen in warm-reboot-sad test cases.

How I did it

I fixed it in SONiC libteam patch that adds warm-reboot support. Details in commit description.

How to verify it

Run warm-reboot-sad test on t0-56 topology.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

with state of tdport from previous warm-reboot.

In lacp_event_watch_port_flush_data we incremented nr_of_tdports and add
tdport to lacp->wr.state. In case lacp->wr.state already had this tdport
we do not set new state for tdport but appened a new item in
lacp->wr.state. In case we preformed warm-reboot and PortChannel member
was down, after reboot PortChannel member became up next warm-reboot
will initialize teamd with PortChannel member in down state.

Example of PortChannel0002 dump with single member Ethernet24 file when
this issue is reproduced:

```
admin@sonic:~$ sudo cat /host/warmboot/teamd/PortChannel0002
0
4
Ethernet24
0
Ethernet24
1
Ethernet24
1
Ethernet24
1
```

Fix this issue by searching for existing tdport in lacp->wr.state and set
enabled flag in tdport or append in case tdport is not found.

Signed-off-by: Stepan Blyschak <[email protected]>
Signed-off-by: Stepan Blyschak <[email protected]>
Signed-off-by: Stepan Blyschak <[email protected]>
@stepanblyschak stepanblyschak requested a review from lguohan as a code owner July 20, 2021 15:15
@stepanblyschak
Copy link
Collaborator Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@lguohan
Copy link
Collaborator

lguohan commented Jul 24, 2021

is this upstreamed?

@lguohan
Copy link
Collaborator

lguohan commented Jul 24, 2021

@judyjoseph to check

@stepanblyschak
Copy link
Collaborator Author

is this upstreamed?

@lguohan warm-reboot feature for teamd is not upstreamed

@liat-grozovik
Copy link
Collaborator

@lguohan can we proceed with the approval flow?

@judyjoseph
Copy link
Contributor

Thanks for the fix, yes looks like it was missed out earlier - based on the comments Pavel had put in code.. We can take it.

But in my tests I could not find a functional impact though I find the file eg: /host/warmboot/teamd/PortChannel0002 is having wrong data ( member interface getting appended on multiple warm-reboots ).

The LAG interface state comes up correctly once the member interface comes back up. Can you share the exact test sequence where you see the issue.

@stepanblyschak
Copy link
Collaborator Author

  1. Have a T0 deployed topology
  2. On VM1 (PortChannel0001) shutdown member port
  3. On DUT observe that PortChannel0001 is down and the member is deselected
  4. On DUT perform "sudo warm-reboot"
  5. After warm-reboot completes (warmboot-finalizer is finished) bring the member port back up on VM1.
  6. On DUT observe that PortChannel becomes up and member is selected.
  7. On DUT perform "sudo warm-reboot"
  8. During DUT warm boot teamd starts on PortChannel0001 - it sees wrong data (from previous boot where member was down) and falls back into cold boot mode for the LAG.
  9. PortChannel0001 flaps on VM1 causing data plane disruption.

You can also run warm-reboot sad tests cases in the following order - first lag member sad test, then vlan member sad test. Vlan member sad test will fail with downtime of few sec.

Copy link
Contributor

@judyjoseph judyjoseph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@qiluo-msft qiluo-msft merged commit 8a2ba14 into sonic-net:master Aug 16, 2021
richardyu-ms pushed a commit to richardyu-ms/sonic-buildimage that referenced this pull request Aug 16, 2021
…ts (sonic-net#8227)

with state of tdport from previous warm-reboot.

In case LAG was down before reboot, lacp->wr is not cleared.
In lacp_event_watch_port_flush_data we incremented nr_of_tdports and add
tdport to lacp->wr.state. In case lacp->wr.state already had this tdport
we do not set new state for tdport but appened a new item in
lacp->wr.state. In case we preformed warm-reboot and PortChannel member
was down, after reboot PortChannel member became up next warm-reboot
will initialize teamd with PortChannel member in down state.

Fix this issue by calling stop_wr_mode() when LAG was down. This was probably intended but missed.

#### Why I did it

To fix an issue seen in warm-reboot-sad test cases.

#### How I did it

I fixed it in SONiC libteam patch that adds warm-reboot support. Details in commit description.

#### How to verify it

Run warm-reboot-sad test on t0-56 topology.
qiluo-msft pushed a commit that referenced this pull request Aug 18, 2021
…ts (#8227)

with state of tdport from previous warm-reboot.

In case LAG was down before reboot, lacp->wr is not cleared.
In lacp_event_watch_port_flush_data we incremented nr_of_tdports and add
tdport to lacp->wr.state. In case lacp->wr.state already had this tdport
we do not set new state for tdport but appened a new item in
lacp->wr.state. In case we preformed warm-reboot and PortChannel member
was down, after reboot PortChannel member became up next warm-reboot
will initialize teamd with PortChannel member in down state.

Fix this issue by calling stop_wr_mode() when LAG was down. This was probably intended but missed.

#### Why I did it

To fix an issue seen in warm-reboot-sad test cases.

#### How I did it

I fixed it in SONiC libteam patch that adds warm-reboot support. Details in commit description.

#### How to verify it

Run warm-reboot-sad test on t0-56 topology.
judyjoseph pushed a commit that referenced this pull request Aug 25, 2021
…ts (#8227)

with state of tdport from previous warm-reboot.

In case LAG was down before reboot, lacp->wr is not cleared.
In lacp_event_watch_port_flush_data we incremented nr_of_tdports and add
tdport to lacp->wr.state. In case lacp->wr.state already had this tdport
we do not set new state for tdport but appened a new item in
lacp->wr.state. In case we preformed warm-reboot and PortChannel member
was down, after reboot PortChannel member became up next warm-reboot
will initialize teamd with PortChannel member in down state.

Fix this issue by calling stop_wr_mode() when LAG was down. This was probably intended but missed.

#### Why I did it

To fix an issue seen in warm-reboot-sad test cases.

#### How I did it

I fixed it in SONiC libteam patch that adds warm-reboot support. Details in commit description.

#### How to verify it

Run warm-reboot-sad test on t0-56 topology.
@stepanblyschak stepanblyschak deleted the fix-teamd-warm-reboot-bug branch September 23, 2022 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants