-
Notifications
You must be signed in to change notification settings - Fork 741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test case 2 of PFC watchdog against warm-reboot: sad path #834
Conversation
Signed-off-by: Wenda Ni <[email protected]>
Signed-off-by: Wenda Ni <[email protected]>
Signed-off-by: Wenda Ni <[email protected]>
Signed-off-by: Wenda Ni <[email protected]>
Signed-off-by: Wenda Ni <[email protected]>
Signed-off-by: Wenda Ni <[email protected]>
which uses cached time for a certain period of time ansible/ansible#22561 Signed-off-by: Wenda Ni <[email protected]>
functional_test_storm.yml Signed-off-by: Wenda Ni <[email protected]>
…bleshooting purpose Signed-off-by: Wenda Ni <[email protected]>
to the target number Signed-off-by: Wenda Ni <[email protected]>
functional_test_storm_perq.yml and functional_test_restore_perq.yml, respectively Add the capability to storm multiple queues of a port Signed-off-by: Wenda Ni <[email protected]>
PFC storm started and detected before warm-reboot On-going storm on warm-reboot emission, and lasts past the warm-reboot finish PFC storm stopped and restored after warm-reboot Signed-off-by: Wenda Ni <[email protected]>
Mar 20 00:40:33.599212 str-a7050-acs-1 ERR syncd#syncd: _brcm_sai_cosq_stat_get:1146 cosq stat get failed with error Invalid parameter (0xfffffffc). Mar 20 00:40:33.599212 str-a7050-acs-1 DEBUG syncd#syncd: brcm_sai_get_queue_stats:724 cosq stat get failed with error -5 for port 1 qid 10 Mar 20 00:40:33.599212 str-a7050-acs-1 NOTICE syncd#syncd: :- setQueueCounterList: Queue oid:0x102150000000b does not has supported counters Signed-off-by: Wenda Ni <[email protected]>
or pfcwd warm-reboot test Signed-off-by: Wenda Ni <[email protected]>
Signed-off-by: Wenda Ni <[email protected]>
Signed-off-by: Wenda Ni <[email protected]>
aba5d1c
to
66bdf4f
Compare
@stepanblyschak Let me know if the test case passes at your end. |
@wendani sonic-net/sonic-swss#876 is missing in 201811, which is important to pass this test |
@wendani Besides of missing commit in 201811, there is another issue with bufferorch and pfcwdorch dependency. |
@stepanblyschak please use the 201811 image we shared with you. As said in sonic-net/sonic-swss#876, we have a patch that hardcoded the two lossless TCs, and it is in the image we shared with you #834 (comment) |
PortsOrch will not proceed on a port if buffer profile is not configured on that port. https://github.com/Azure/sonic-swss/blob/master/orchagent/portsorch.cpp#L1694 This said, BufferOrch must be ahead of PortsOrch in initialization list. There is no question on this. What I can envision is a transcient behavior using the zero buffer apporach that a storming port may first have a non-zero buffer profile set by BufferOrch after warm-reboot, and then have a zero buffer profile when PfcWdOrch kicks in to install the drop action. |
PfcWdOrch will wait for PortsOrch to signal all ports ready https://github.com/Azure/sonic-swss/blob/master/orchagent/pfcwdorch.cpp#L57 I believe the dependency is well sorted at the orchagent level. |
@stepanblyschak Other than the test failure, do you have objection to the test case itself? |
@wendani
So seems that design of m_pendingPortSet has an issue that makes other orchs think that buffer configuration has been applied when it was not. |
@wendani Regarding transcient state during warm restoration on Mellanox when non-zero profile is set on hw. Do you think it is an issue in production? |
Does it mean that sonic-net/sonic-swss#876 won't be cherry-picked to 201811? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite there is an issue with PFCWD zero buffer approach, the test looks good to me
* First test case of PFC watchdog against warm-reboot Signed-off-by: Wenda Ni <[email protected]> * Add more comments for code readability Signed-off-by: Wenda Ni <[email protected]> * First test case of PFC watchdog against warm-reboot Signed-off-by: Wenda Ni <[email protected]> * Add more comments for code readability Signed-off-by: Wenda Ni <[email protected]> * Modify output message Signed-off-by: Wenda Ni <[email protected]> * Allow log analyzer to take a specified start marker Signed-off-by: Wenda Ni <[email protected]> * Use lookup('pipe', 'date +%H:%M:%S') in place of ansible_date_time.time, which uses cached time for a certain period of time ansible/ansible#22561 Signed-off-by: Wenda Ni <[email protected]> * Add the flexiblity to not start storm at fanout link partener in running functional_test_storm.yml Signed-off-by: Wenda Ni <[email protected]> * Dump only the current result and summary files for debugging and troubleshooting purpose Signed-off-by: Wenda Ni <[email protected]> * Add the capability to check if the number of exact matches is equal to to the target number Signed-off-by: Wenda Ni <[email protected]> * Split the actual storm and restore tests into functional_test_storm_perq.yml and functional_test_restore_perq.yml, respectively Add the capability to storm multiple queues of a port Signed-off-by: Wenda Ni <[email protected]> * Add test case 2 of PFC watchdog against warm-reboot: PFC storm started and detected before warm-reboot On-going storm on warm-reboot emission, and lasts past the warm-reboot finish PFC storm stopped and restored after warm-reboot Signed-off-by: Wenda Ni <[email protected]> * Ignore trival syncd ERR during the warm-reboot, e.g., Mar 20 00:40:33.599212 str-a7050-acs-1 ERR syncd#syncd: _brcm_sai_cosq_stat_get:1146 cosq stat get failed with error Invalid parameter (0xfffffffc). Mar 20 00:40:33.599212 str-a7050-acs-1 DEBUG syncd#syncd: brcm_sai_get_queue_stats:724 cosq stat get failed with error -5 for port 1 qid 10 Mar 20 00:40:33.599212 str-a7050-acs-1 NOTICE syncd#syncd: :- setQueueCounterList: Queue oid:0x102150000000b does not has supported counters Signed-off-by: Wenda Ni <[email protected]> * Use boolean variable to determine the test run type: regular pfc wd test or pfcwd warm-reboot test Signed-off-by: Wenda Ni <[email protected]> * Feed reboot type to reboot_sonic.yml in warm-reboot happy path test Signed-off-by: Wenda Ni <[email protected]> * Feed reboot type to reboot_sonic.yml in warm-reboot sad path test Signed-off-by: Wenda Ni <[email protected]> * Add expected errors on mlnx platform Signed-off-by: Wenda Ni <[email protected]>
PFC storm started and detected before warm-reboot
On-going storm on warm-reboot emission, and lasts past the warm-reboot finish
PFC storm stopped and restored after warm-reboot
Tested on regular pfc watchdog without break.
Infrastructure change:
loganalyzer.py & loganalyzer_analyze.yml:
Allow log analyzer to take a specified start marker
loganalyzer_init.yml:
Use lookup('pipe', 'date +%H:%M:%S') in place of ansible_date_time.time, which uses cached time for a certain period of time ansible/ansible#22561
loganalyzer_end.yml:
Dump only the current result and summary files for debugging and troubleshooting purpose
Add the capability to check if the number of exact matches is equal to the target number
Incremental commits on top of #825
Description of PR
Summary:
Fixes # (issue)
Type of change
Approach
How did you do it?
How did you verify/test it?
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation