Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test case 2 of PFC watchdog against warm-reboot: sad path #834

Merged
merged 25 commits into from
Oct 10, 2019

Conversation

wendani
Copy link
Contributor

@wendani wendani commented Mar 20, 2019

PFC storm started and detected before warm-reboot
On-going storm on warm-reboot emission, and lasts past the warm-reboot finish
PFC storm stopped and restored after warm-reboot

Tested on regular pfc watchdog without break.

Infrastructure change:

loganalyzer.py & loganalyzer_analyze.yml:
Allow log analyzer to take a specified start marker

loganalyzer_init.yml:
Use lookup('pipe', 'date +%H:%M:%S') in place of ansible_date_time.time, which uses cached time for a certain period of time ansible/ansible#22561

loganalyzer_end.yml:
Dump only the current result and summary files for debugging and troubleshooting purpose
Add the capability to check if the number of exact matches is equal to the target number

Incremental commits on top of #825

Description of PR

Summary:
Fixes # (issue)

Type of change

  • [] Bug fix
  • [] Testbed and Framework(new/improvement)
  • [] Test case(new/improvement)

Approach

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

wendani added 14 commits March 8, 2019 18:47
Signed-off-by: Wenda Ni <[email protected]>
which uses cached time for a certain period of time ansible/ansible#22561

Signed-off-by: Wenda Ni <[email protected]>
functional_test_storm_perq.yml and functional_test_restore_perq.yml,
respectively

Add the capability to storm multiple queues of a port

Signed-off-by: Wenda Ni <[email protected]>
PFC storm started and detected before warm-reboot
On-going storm on warm-reboot emission, and lasts past the warm-reboot finish
PFC storm stopped and restored after warm-reboot

Signed-off-by: Wenda Ni <[email protected]>
Mar 20 00:40:33.599212 str-a7050-acs-1 ERR syncd#syncd:
_brcm_sai_cosq_stat_get:1146 cosq stat get failed with error Invalid
parameter (0xfffffffc).
Mar 20 00:40:33.599212 str-a7050-acs-1 DEBUG syncd#syncd:
brcm_sai_get_queue_stats:724 cosq stat get failed with error -5 for port
1 qid 10
Mar 20 00:40:33.599212 str-a7050-acs-1 NOTICE syncd#syncd: :-
setQueueCounterList: Queue oid:0x102150000000b does not has supported
counters

Signed-off-by: Wenda Ni <[email protected]>
@wendani wendani changed the title Add test case 2 of PFC watchdog against warm-reboot Test case 2 of PFC watchdog against warm-reboot Mar 20, 2019
@wendani wendani changed the title Test case 2 of PFC watchdog against warm-reboot Test case 2 of PFC watchdog against warm-reboot: sad path Mar 22, 2019
@wendani
Copy link
Contributor Author

wendani commented Oct 4, 2019

@stepanblyschak Let me know if the test case passes at your end.

@stepanblyschak
Copy link
Contributor

stepanblyschak commented Oct 4, 2019

@wendani sonic-net/sonic-swss#876 is missing in 201811, which is important to pass this test

@stepanblyschak
Copy link
Contributor

@wendani Besides of missing commit in 201811, there is another issue with bufferorch and pfcwdorch dependency.
BufferOrch::doTask is executed on 1st restore iteration before PortsOrch and PfcWdOrch, since PortInitDone is not set yet it will retry on 2nd iteration after PfcWdOrch:;doTask is executed and will override zero buffer profiles set on stormed ports.
I assume simple swap the order of PortsOrch and BufferOrch in m_orchList will fix this issue

@wendani
Copy link
Contributor Author

wendani commented Oct 4, 2019

@stepanblyschak please use the 201811 image we shared with you. As said in sonic-net/sonic-swss#876, we have a patch that hardcoded the two lossless TCs, and it is in the image we shared with you #834 (comment)

@wendani
Copy link
Contributor Author

wendani commented Oct 4, 2019

PortsOrch will not proceed on a port if buffer profile is not configured on that port. https://github.com/Azure/sonic-swss/blob/master/orchagent/portsorch.cpp#L1694 This said, BufferOrch must be ahead of PortsOrch in initialization list. There is no question on this.

What I can envision is a transcient behavior using the zero buffer apporach that a storming port may first have a non-zero buffer profile set by BufferOrch after warm-reboot, and then have a zero buffer profile when PfcWdOrch kicks in to install the drop action.

#834 (comment)

@wendani
Copy link
Contributor Author

wendani commented Oct 4, 2019

PfcWdOrch will wait for PortsOrch to signal all ports ready https://github.com/Azure/sonic-swss/blob/master/orchagent/pfcwdorch.cpp#L57

I believe the dependency is well sorted at the orchagent level.

@wendani
Copy link
Contributor Author

wendani commented Oct 4, 2019

@stepanblyschak Other than the test failure, do you have objection to the test case itself?

@stepanblyschak
Copy link
Contributor

stepanblyschak commented Oct 7, 2019

@wendani
The order of Orch::doTask() execution is defined by m_orchList https://github.com/Azure/sonic-swss/blob/master/orchagent/orchdaemon.cpp#L189.
So the order is BufferOrch, PortsOrch, PfcWdSwOrch.
Consider the warm restoration procedure:

  • BufferOrch
    Since host interfaces have not been created yet, PortsOrch::isPortInitDone() returns false, so BufferOrch will defer the processing on 2nd iteration
  • PortsOrch
    m_initDone == false and m_pendingPortSet.empty() == true at this point;
    m_toSync contains the following [Ethernet0, ...., PortConfigDone, PortInitDone]
    PortsOrch processes m_toSync and when PortConfigDone is beeing processed it creates ports, creates host interfaces; right after that there is a condition if alias == "PortConfigDone" https://github.com/Azure/sonic-swss/blob/master/orchagent/portsorch.cpp#L1688 and continues the loop. Next is "PortInitDone" in m_toSync, so PortsOrch::doTask will simply return https://github.com/Azure/sonic-swss/blob/master/orchagent/portsorch.cpp#L1531.
    Note, at this stage m_initDone == true and m_pendingPortSet.empty() == true which makes PortsOrch::allPortsReady() return true.
  • PfcWdSwOrch configures zero buffer
  • BufferOrch overrides the buffer.

So seems that design of m_pendingPortSet has an issue that makes other orchs think that buffer configuration has been applied when it was not.
A simple chnage in m_orchList to swap BufferOrch and PortsOrch makes sense because PortsOrch will create ports, host interfaces at first iteration, BufferOrch will be able to process buffers since PortInitDone flag is set, then on second iteration PortsOrch will proceed with speed, fec, etc.
More preferable will be to change the m_pendingPortSet to be filled before PortInitDone is processed.

@stepanblyschak
Copy link
Contributor

@wendani Regarding transcient state during warm restoration on Mellanox when non-zero profile is set on hw. Do you think it is an issue in production?

@stepanblyschak
Copy link
Contributor

@stepanblyschak please use the 201811 image we shared with you. As said in Azure/sonic-swss#876, we have a patch that hardcoded the two lossless TCs, and it is in the image we shared with you #834 (comment)

Does it mean that sonic-net/sonic-swss#876 won't be cherry-picked to 201811?

Copy link
Contributor

@stepanblyschak stepanblyschak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite there is an issue with PFCWD zero buffer approach, the test looks good to me

@lguohan lguohan merged commit ef58fd0 into sonic-net:master Oct 10, 2019
yxieca pushed a commit that referenced this pull request Oct 15, 2019
* First test case of PFC watchdog against warm-reboot

Signed-off-by: Wenda Ni <[email protected]>

* Add more comments for code readability

Signed-off-by: Wenda Ni <[email protected]>

* First test case of PFC watchdog against warm-reboot

Signed-off-by: Wenda Ni <[email protected]>

* Add more comments for code readability

Signed-off-by: Wenda Ni <[email protected]>

* Modify output message

Signed-off-by: Wenda Ni <[email protected]>

* Allow log analyzer to take a specified start marker

Signed-off-by: Wenda Ni <[email protected]>

* Use lookup('pipe', 'date +%H:%M:%S') in place of ansible_date_time.time,
which uses cached time for a certain period of time ansible/ansible#22561

Signed-off-by: Wenda Ni <[email protected]>

* Add the flexiblity to not start storm at fanout link partener in running
functional_test_storm.yml

Signed-off-by: Wenda Ni <[email protected]>

* Dump only the current result and summary files for debugging and troubleshooting purpose

Signed-off-by: Wenda Ni <[email protected]>

* Add the capability to check if the number of exact matches is equal to
to the target number

Signed-off-by: Wenda Ni <[email protected]>

* Split the actual storm and restore tests into
functional_test_storm_perq.yml and functional_test_restore_perq.yml,
respectively

Add the capability to storm multiple queues of a port

Signed-off-by: Wenda Ni <[email protected]>

* Add test case 2 of PFC watchdog against warm-reboot:

PFC storm started and detected before warm-reboot
On-going storm on warm-reboot emission, and lasts past the warm-reboot finish
PFC storm stopped and restored after warm-reboot

Signed-off-by: Wenda Ni <[email protected]>

* Ignore trival syncd ERR during the warm-reboot, e.g.,

Mar 20 00:40:33.599212 str-a7050-acs-1 ERR syncd#syncd:
_brcm_sai_cosq_stat_get:1146 cosq stat get failed with error Invalid
parameter (0xfffffffc).
Mar 20 00:40:33.599212 str-a7050-acs-1 DEBUG syncd#syncd:
brcm_sai_get_queue_stats:724 cosq stat get failed with error -5 for port
1 qid 10
Mar 20 00:40:33.599212 str-a7050-acs-1 NOTICE syncd#syncd: :-
setQueueCounterList: Queue oid:0x102150000000b does not has supported
counters

Signed-off-by: Wenda Ni <[email protected]>

* Use boolean variable to determine the test run type: regular pfc wd test
or pfcwd warm-reboot test

Signed-off-by: Wenda Ni <[email protected]>

* Feed reboot type to reboot_sonic.yml in warm-reboot happy path test

Signed-off-by: Wenda Ni <[email protected]>

* Feed reboot type to reboot_sonic.yml in warm-reboot sad path test

Signed-off-by: Wenda Ni <[email protected]>

* Add expected errors on mlnx platform

Signed-off-by: Wenda Ni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants