Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[orchagent]: Added support of PFC WD for BFN platform #823

Merged
merged 5 commits into from
Jun 7, 2019

Conversation

vsenchyshyn
Copy link
Contributor

@vsenchyshyn vsenchyshyn commented Mar 29, 2019

Signed-off-by: Vitaliy Senchyshyn [email protected]

What I did
Added instance of PfcWdSwOrch in OrchDaemon::init() in case of Barefoot platform. Fixed PFC WD detection logic in pfc_detect_barefoot.lua script.

Why I did it
PFC WD wasn't working for BFN platform.

How I verified it

  1. pfc_wd community test has passed

  2. Tested it manually with BFN platform by simulating PFC storm and verified that PWC storm was successfully detected wen occurred and restored when disappeared by the WD. Validated all the counters used by PFC WD functionality worked properly.

Details if related

@msftclas
Copy link

msftclas commented Mar 29, 2019

CLA assistant check
All CLA requirements met.

@vsenchyshyn
Copy link
Contributor Author

@stcheng @marian-pritsak @lguohan Please review

Signed-off-by: Vitaliy Senchyshyn <[email protected]>
@@ -21,70 +21,81 @@ for i = n, 1, -1 do
local is_deadlock = false
local pfc_wd_status = redis.call('HGET', counters_table_name .. ':' .. KEYS[i], 'PFC_WD_STATUS')
local pfc_wd_action = redis.call('HGET', counters_table_name .. ':' .. KEYS[i], 'PFC_WD_ACTION')
if pfc_wd_status == 'operational' or pfc_wd_action == 'alert' then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file identical to Mellanox too? If yes, will the symlink work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it was there before and it's related to a different platform It's better to leave it as is.

@wendani
Copy link
Contributor

wendani commented Apr 2, 2019

Has it passed the pfc watchdog test?

@wendani wendani self-requested a review April 2, 2019 01:03
@vsenchyshyn
Copy link
Contributor Author

Has it passed the pfc watchdog test?
This was tested manually so far. Please take a look at the PR description.

(occupancy_bytes == 0 and packets - packets_last == 0 and (pfc_duration - pfc_duration_last) > poll_time * 0.8) then
if time_left <= poll_time then
redis.call('HDEL', counters_table_name .. ':' .. port_id, pfc_rx_pkt_key .. '_last')
redis.call('HDEL', counters_table_name .. ':' .. port_id, pfc_duration_key .. '_last')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do not need to learn hdel _RX_PKTS_last and _RX_PAUSE_DURATION_last on lines 74 & 75 from the Mellanox script. This is to solve the double storm signaling that was observed on Mellanox platforms. I assume barefoot platforms do not have this issue. #697

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andriymoroz-mlnx for comments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not Mellanox issue. It is common for this approach. If Barefoot uses the same solution (approach, counters, etc) they can use the same script

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue should not be related to the counters set Mellanox chooses. Excluding the counter factors, this should also happen on brcm platforms. However, we do not observe the issue on brcm platforms so far.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Broadcom uses different way to detect storm and thus do not have this issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Difference is only in the counter set. The idea behind is the same---no packets going out of the queue and the queue is paused.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dual storm signaling should only happen if the drop action is not installed by the orchagent fast enough. pfcactionhandler initCounters() will flip the PFC_WD_STATUS from operational to stormed as part of its drop actions. If PFC_WD_STATUS is not operational, the detect script logic will exit early on line 26 without running the actual detect state machine. So there should be no occurrence of dual storm signaling at all. The possible cause I can think of is that the control-plane cpu is not fast enough to schedule running the orchagent drop actions within a polling interval of 200 ms https://github.com/Azure/sonic-swss/blob/f22fb80bdfa10ea38e718996235c99233e08c31a/orchagent/pfc_detect_barefoot.lua#L26

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's better keep it as is. In case race condition really happens it will be very hard to catch and fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we start from a clean implementation rather than patch it here and there. The double storming signaling can be captured by PFC watchdog test. This is how mlnx found the problem on their platforms. Function-wise, double signaling does not affect the proper detection and the proper restoration.

Last time, you said bfn still uses manual test for PFC watchdog validation. If you have proof that it does exist also on bfn, we can later add the patch back.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wendani the ct test passed for bfn with these changes. If we are talking about fast or not fast cpu I would say it hard to catch the race condition on all possible CPUs, as bfn sdk could run on different platform vendors and this value could differ. Could we leave this check to avoid further patching and would see in context this feature will evolve?

-- Save values for next run
redis.call('HSET', counters_table_name .. ':' .. KEYS[i], 'SAI_QUEUE_STAT_PACKETS_last', packets)
redis.call('HSET', counters_table_name .. ':' .. KEYS[i], 'PFC_WD_DETECTION_TIME_LEFT', time_left)
if is_deadlock == false then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the 'if is_deadlock == false' condition. #697

orchagent/pfc_detect_barefoot.lua Outdated Show resolved Hide resolved
@vsenchyshyn
Copy link
Contributor Author

@lguohan I've updated the PR according to baseline changes, but the test log is quite strange as well as the "No test results found" result. Not sure how the PFC WD changes could cause all these fails in mirror and other stuff: https://sonic-jenkins.westus2.cloudapp.azure.com/job/vs/job/sonic-swss-build-pr/40/consoleFull Is this something on your side?

@vsenchyshyn
Copy link
Contributor Author

retest this please

@wendani wendani self-requested a review June 5, 2019 16:28
@vsenchyshyn vsenchyshyn force-pushed the bfn-pfc-wd-support branch from 3d4145e to e52721c Compare June 5, 2019 18:01
@vsenchyshyn
Copy link
Contributor Author

retest this please

3 similar comments
@wendani
Copy link
Contributor

wendani commented Jun 6, 2019

retest this please

@wendani
Copy link
Contributor

wendani commented Jun 7, 2019

retest this please

@wendani
Copy link
Contributor

wendani commented Jun 7, 2019

retest this please

@wendani wendani merged commit cde242b into sonic-net:master Jun 7, 2019
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
After the change in master branch updating SAI from 3.5.3.1m-25 to 3.7.3.2, we always found kernel panic after running fast-reboot command in testing SONiC with traffic. In the up path of fast-reboot, we can find warning messages like "unhandled irq 16 error" before kernel panic, which implies that some components are not properly closed in the down path.

This fix will unload certain kernel modules by stopping opennsl before fast-reboot, which is suggested by BRCM.

Note that another part of the fix is to add 'ExecStop=-/etc/init.d/opennsl-modules stop' to sonic-buildimage:platform/broadcom/saibcm-modules/systemd/opennsl-modules.service, which will be included in another pull request.
oleksandrivantsiv pushed a commit to oleksandrivantsiv/sonic-swss that referenced this pull request Mar 1, 2023
This fix brings in support for cisco-8000 platform into sonic-sairedis/syncd. It checks for the SONIC_ASIC_TYPE keyword and picks up the PLATFORM type to see if "cisco-8000" word is available. Accordingly, it sources the required paths for SDK to carry on its operations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants