Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACL SAI API not called due to SONiC is busying on updating route entries #3054

Open
keboliu opened this issue Jun 20, 2019 · 1 comment
Open
Assignees

Comments

@keboliu
Copy link
Collaborator

keboliu commented Jun 20, 2019

Description

This issue was observed on ACL test, from the log can see that although the ACL table and rules has been created in SWSS level,
but related SAI API calls haven't been called till the PTF start to send verification traffic.
And system is busying calling "mlnx_create_next_hop_group_member" and "mlnx_remove_next_hop_group_member" to update route.

seems that there is no "fairness" scheduling when SWSS is busy, some task could be significantly delayed.

Steps to reproduce the issue:
trigger massive route update and then start ACL test, it could be reproduced.

Describe the results you received:
only SWSS ACL log observed, no syncd log observed.

Jun 16 18:19:07.228802 arc-mtbc-1001 INFO ansible-<stdin>: Invoked with warn=True executable=None chdir=None _raw_params=sonic-cfggen -j /home/admin/acl/acl_table_DATAINGRESS.json --write-to-db removes=None creates=None _uses_shell=False
Jun 16 18:19:07.800210 arc-mtbc-1001 NOTICE swss#orchagent: :- addAclTable: Created ACL table DATAINGRESS oid:7000000001a25
Jun 16 18:19:23.800192 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_16 in table DATAINGRESS
Jun 16 18:19:23.800374 arc-mtbc-1001 INFO caclmgrd: ACL configuration changed. Updating iptables rules for control plane ACLs...
Jun 16 18:19:23.803423 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_8 in table DATAINGRESS
Jun 16 18:19:23.815132 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule DEFAULT_RULE in table DATAINGRESS
Jun 16 18:19:23.831588 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_11 in table DATAINGRESS
Jun 16 18:19:23.831588 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_12 in table DATAINGRESS
Jun 16 18:19:23.837167 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_14 in table DATAINGRESS
Jun 16 18:19:23.847532 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_15 in table DATAINGRESS
Jun 16 18:19:23.847532 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_2 in table DATAINGRESS
Jun 16 18:19:23.847532 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_22 in table DATAINGRESS
Jun 16 18:19:23.847532 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_25 in table DATAINGRESS

Jun 16 18:19:23.881256 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_3 in table DATAINGRESS
Jun 16 18:19:23.881256 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_6 in table DATAINGRESS
Jun 16 18:19:23.881256 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_7 in table DATAINGRESS
Jun 16 18:19:23.912002 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_1 in table DATAINGRESS
Jun 16 18:19:23.923543 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_10 in table DATAINGRESS
Jun 16 18:19:23.923543 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_13 in table DATAINGRESS
Jun 16 18:19:23.923543 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_17 in table DATAINGRESS
Jun 16 18:19:23.930624 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_18 in table DATAINGRESS
Jun 16 18:19:23.930624 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_19 in table DATAINGRESS
Jun 16 18:19:23.938537 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_20 in table DATAINGRESS
Jun 16 18:19:23.938537 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_21 in table DATAINGRESS
Jun 16 18:19:23.938537 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_23 in table DATAINGRESS
Jun 16 18:19:23.946755 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_24 in table DATAINGRESS
Jun 16 18:19:23.946755 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_27 in table DATAINGRESS
Jun 16 18:19:23.954589 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_28 in table DATAINGRESS
Jun 16 18:19:23.954589 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_4 in table DATAINGRESS
Jun 16 18:19:23.954589 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_5 in table DATAINGRESS
Jun 16 18:19:23.963551 arc-mtbc-1001 NOTICE swss#orchagent: :- add: Successfully created ACL rule RULE_9 in table DATAINGRESS

Describe the results you expected:
SAI API calls shall follow SWSS ACL table and rules creation.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**

```
SONiC Software Version: SONiC.HEAD.70-cf9adfe
Distribution: Debian 9.9
Kernel: 4.9.0-8-amd64
Build commit: cf9adfe
Build date: Sun Jun 16 23:28:12 UTC 2019
Built by: johnar@jenkins-worker-4
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```

syslog.21.gz

@yxieca
Copy link
Contributor

yxieca commented Sep 19, 2019

Community member is working on bulk setting mode. That would help.

@stephenxs
Copy link
Collaborator

Hi @yxieca, @lguohan,
I still encounter a similar issue when I'm working on the dynamic buffer calculation issue.
The flow is like this:

  1. port's speed updated, causing port down and routing entries withdrawn. After that, the cable length of the port has been updated, causing buffer profile on the port updated.
  2. buffermgrd create a new buffer profile and replaces the port's old profile on lossless PG with the newly created one.
Jul 31 06:52:08.235333 mtbc-sonic-03-2700 INFO config: 'interface speed Ethernet112 10000' executing...
Jul 31 06:52:08.346616 mtbc-sonic-03-2700 INFO swss#buffermgrd: :- handlePortTable: Updating BUFFER_PG for port Ethernet112 due to speed or port updated
Jul 31 06:52:08.346616 mtbc-sonic-03-2700 INFO swss#buffermgrd: :- allocateProfile: Allocating new BUFFER_PROFILE pg_lossless_10000_40m_profile
Jul 31 06:52:08.346616 mtbc-sonic-03-2700 NOTICE swss#buffermgrd: :- allocateProfile: Creating new profile 'pg_lossless_10000_40m_profile'
Jul 31 06:52:08.351527 mtbc-sonic-03-2700 INFO swss#orchagent: message repeated 25 times: [ :- doTask: Handling buffer task]
Jul 31 06:52:08.351527 mtbc-sonic-03-2700 INFO swss#orchagent: :- setPortAdminStatus: Set admin status DOWN to port pid:10000000002d6
Jul 31 06:52:08.351527 mtbc-sonic-03-2700 NOTICE swss#buffermgrd: :- allocateProfile: BUFFER_PROFILE pg_lossless_10000_40m_profile has been created successfully
  1. buffermgrd removes the port's lossless PG, causing the newly created buffer profile no longer referenced.
Jul 31 06:52:22.603972 mtbc-sonic-03-2700 INFO ansible-command: Invoked with creates=None executable=None _uses_shell=True strip_empty_ends=True _raw_params=config interface lossless_pg remove Ethernet112 3-4 removes=None argv=None warn=True chdir=None stdin_add_newline=True stdin=None
Jul 31 06:52:23.131873 mtbc-sonic-03-2700 NOTICE swss#buffermgrd: :- doRemovePgTask: Remove BUFFER_PG Ethernet112:3-4 (profile pg_lossless_10000_40m_profile)
The newly created buffer profile is removed as well.
Jul 31 06:52:23.144138 mtbc-sonic-03-2700 NOTICE swss#buffermgrd: :- doRemovePgTask: No lossless PG configured on port Ethernet112 anymore, try removing the original profile pg_lossless_10000_40m_profile
Jul 31 06:52:23.144261 mtbc-sonic-03-2700 NOTICE swss#buffermgrd: :- releaseProfile: BUFFER_PROFILE pg_lossless_10000_40m_profile has been released successfully
  1. bufferorch handles the notification in a reverse order, receiving the notification of buffer profile removing first and failing to remove the profile. The failure is caused by the buffer profile being still referenced at the time
Jul 31 06:52:23.660780 mtbc-sonic-03-2700 INFO swss#orchagent: :- doTask: Handling buffer task
Jul 31 06:52:23.662209 mtbc-sonic-03-2700 ERR swss#orchagent: :- meta_generic_validation_remove: object 0x1900000000074f reference count is 2, can't remove
Jul 31 06:52:23.662372 mtbc-sonic-03-2700 ERR swss#orchagent: :- processBufferProfile: Failed to remove buffer profile pg_lossless_10000_40m_profile with type BUFFER_PROFILE, rv:-5
Jul 31 06:52:23.662503 mtbc-sonic-03-2700 ERR swss#orchagent: :- doTask: Failed to process buffer task, drop it
Jul 31 06:52:23.662724 mtbc-sonic-03-2700 INFO swss#orchagent: :- doTask: Handling buffer task
Jul 31 06:52:23.663667 mtbc-sonic-03-2700 NOTICE swss#orchagent: :- processPriorityGroup: Remove buffer PG Ethernet112:3-4
Jul 31 06:52:23.664102 mtbc-sonic-03-2700 INFO swss#orchagent: :- doTask: Handling buffer task

I suspect that the orchagent handles notifications in a different order in which buffermgrd sends them is caused by the low-level mechanism of the orchagent. It takes advantage of I/O multiply mechanism to receive notifications from redis-db:

  1. Each orch in orchagent creates an FD for each table it listens to and inserts all the FDs into something like a select vector orchdaemon.cpp#L453.
  2. The main loop is blocked at a select-like interface, orchdaemon.cpp#L453.
    When the system isn't busy everything works well. But when system is busy, notifications are backlogged in sockets, the orchagent will handle notifications in the order like FD-ascending or FD-descending rather than the order they come.

yxieca pushed a commit that referenced this issue Nov 30, 2023
* Update sonic-utilities to master branch version

sonic-utilities was (intentionally) pointing to a commit on a fork,
since merging sonic-utilities's changes for Bookworm first onto the
master branch would result in PR checker failures. Now that
sonic-buildimage is on master branch and the Bookworm changes in
sonic-utilities have been merged into master, sonic-utilties can now
point to master.

17e77fe2 Revert "Run yang validation in unit test (#3025)" (#3055)
96dd5559 [dhcp_relay] Fix dhcp_relay counter display issue (#3054)
6dfeee69 [sflow][db_migrator] Egress Sflow support (#3020)
02a588b7 Don't collect /proc/sched_debug
d7ec3251 Fix error about having a mutable default for field headers in dataclass
0ab3ab91 Fix test execution on Bookworm (#3041)
ef8f6f83 Specify test dependencies under extra_requires
61c44e80 Update python packages
1e813105 [wol] Implement wol command line utility (#3048)
8ebc56a0 [sonic_installer]: Improve exception handling: introduce notes. (#3029)
3610ce93 [sonic-package-manager] Fix YANG validation failure on upgrade when feature has constraints in YANG model on FEATURE table (#2933)
cfd2dd39 Add container rsyslog.conf to the sys dump (#3039)
c4b07828 Support new platform in generic configuration update (#3038)
a8d236c8 [fast-reboot-filter-routes.py] Remove click and improve error reporting (#3030)
75199c0f [sonic-package-manager] insert newline in /etc/sonic/generated_services.conf (#3040)
cd855698 [VOQ][saidump] Modify generate_dump: replace save_saidump with save_saidump_by_route_size (#2972)
f1e24ae5 GCU support for Cisco-8000 features (#3010)
67e1c3dc Update GCU rsyslog validator (#3012)
253b7975 [sonic-package-manager] do not modify config_db.json (#3032)
177dd8e8 [sonic-package-manager] add generated service to /etc/sonic/generated_services.conf (#3037)
62fcd77a Configure NTP according to extended configuration (#2835)
ced09404 [dualtor_neighbor_check] Adjust zero-mac check condition (#3034)
a4eeb698 [config] config reload should generate sysinfo if missing  (#3031)
e01fc891 Run yang validation in unit test (#3025)

Signed-off-by: Saikrishna Arcot <[email protected]>
yxieca pushed a commit that referenced this issue Dec 14, 2023
…atically (#17457)

src/sonic-utilities

* 1b1402f5 - (HEAD -> 202311, origin/202311) [hash]: Add ECMP/LAG hash algorithm CLI (#3036) (9 days ago) [Nazarii Hnydyn]
* 71514ea3 - Revert "Run yang validation in unit test (#3025)" (#3055) (9 days ago) [Ying Xie]
* b5daf5d4 - [dhcp_relay] Fix dhcp_relay counter display issue (#3054) (9 days ago) [Yaqiang Zhu]
* b3172505 - [sflow][db_migrator] Egress Sflow support (#3020) (9 days ago) [Rajkumar-Marvell]
* 1e813105 - [wol] Implement wol command line utility (#3048) (3 weeks ago) [Zhijian Li]
* 8ebc56a0 - [sonic_installer]: Improve exception handling: introduce notes. (#3029) (3 weeks ago) [Nazarii Hnydyn]
* 3610ce93 - [sonic-package-manager] Fix YANG validation failure on upgrade when feature has constraints in YANG model on FEATURE table (#2933) (3 weeks ago) [Stepan Blyshchak]
* cfd2dd39 - Add container rsyslog.conf to the sys dump (#3039) (4 weeks ago) [Vivek]
* c4b07828 - Support new platform in generic configuration update (#3038) (4 weeks ago) [Stephen Sun]
* a8d236c8 - [fast-reboot-filter-routes.py] Remove click and improve error reporting (#3030) (4 weeks ago) [Stepan Blyshchak]
* 75199c0f - [sonic-package-manager] insert newline in /etc/sonic/generated_services.conf (#3040) (4 weeks ago) [Stepan Blyshchak]
* cd855698 - [VOQ][saidump] Modify generate_dump: replace save_saidump with save_saidump_by_route_size (#2972) (4 weeks ago) [JunhongMao]
* f1e24ae5 - GCU support for Cisco-8000 features (#3010) (4 weeks ago) [rbpittman]
* 67e1c3dc - Update GCU rsyslog validator (#3012) (4 weeks ago) [jingwenxie]
* 253b7975 - [sonic-package-manager] do not modify config_db.json (#3032) (5 weeks ago) [Stepan Blyshchak]
* 177dd8e8 - [sonic-package-manager] add generated service to /etc/sonic/generated_services.conf (#3037) (5 weeks ago) [Stepan Blyshchak]
* 62fcd77a - Configure NTP according to extended configuration (#2835) (5 weeks ago) [Yevhen Fastiuk]
* ced09404 - [dualtor_neighbor_check] Adjust zero-mac check condition (#3034) (5 weeks ago) [Longxiang Lyu]
* a4eeb698 - [config] config reload should generate sysinfo if missing  (#3031) (6 weeks ago) [jingwenxie]
* e01fc891 - Run yang validation in unit test (#3025) (6 weeks ago) [ganglv]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants