
Unexpected error syslog due to negative refcnt of nexthop #18183

Closed
ysmanman opened this issue Feb 26, 2024 · 11 comments
Labels
Arista, Triaged


@ysmanman
Contributor

Description

We noticed the following unexpected error syslogs during 202205 sonic-mgmt testing on T2:

   Feb 19 16:48:40.122676 cmp227-6 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000005844 with ip: 10.0.0.5 and alias: Ethernet-IB0
   Feb 19 16:48:51.104598 cmp227-6 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000005845 with ip: fc00::a and alias: Ethernet-IB0

The syslogs were seen in VOQ tests, e.g., voq/test_voq_chassis_app_db_consistency.py.
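The error text reads like a guard that fires when a next hop's reference count would drop below zero, i.e. a decrement arrives with no matching increment. The Python sketch below illustrates that behavior under the assumption that the guard logs and skips the decrement. `NextHopRefCounter` and its methods are hypothetical and are not orchagent's actual (C++) implementation.

```python
import logging

logger = logging.getLogger("nexthop_refcount_sketch")


class NextHopRefCounter:
    """Hypothetical per-next-hop reference counter (illustration only,
    not orchagent's C++ code)."""

    def __init__(self):
        self._counts = {}  # next_hop_id (int OID) -> reference count

    def increase(self, nh_id):
        self._counts[nh_id] = self._counts.get(nh_id, 0) + 1

    def decrease(self, nh_id, ip, alias):
        """Decrement the count; refuse to go below zero and log instead.

        Returns True if the decrement was applied, False if it was rejected.
        """
        count = self._counts.get(nh_id, 0)
        if count == 0:
            # Assumed behavior: an unbalanced decrement is logged and skipped.
            logger.error(
                "Ref count cannot be negative for next_hop_id: %#x "
                "with ip: %s and alias: %s", nh_id, ip, alias)
            return False
        self._counts[nh_id] = count - 1
        return True

    def count(self, nh_id):
        return self._counts.get(nh_id, 0)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    counter = NextHopRefCounter()
    nh = 0x4000000005844
    counter.increase(nh)                              # one reference taken
    counter.decrease(nh, "10.0.0.5", "Ethernet-IB0")  # balanced: 1 -> 0
    counter.decrease(nh, "10.0.0.5", "Ethernet-IB0")  # unbalanced: logs, stays 0
```

The second `decrease` call corresponds to the case the syslog reports: the count stays at zero and an error is logged instead of the counter going negative.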


@ysmanman
Contributor Author

Adding @arlakshm @kenneth-arista for visibility.

@ysmanman
Contributor Author

In another instance of the failure, we also observed an orchagent crash along with the error syslogs:

   Feb 19 16:56:54.058695 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
   Feb 19 16:57:03.828994 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8b with ip: 10.0.0.5 and alias: Ethernet-IB0
   Feb 19 16:57:03.846303 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8b with ip: 10.0.0.5 and alias: Ethernet-IB0
   Feb 19 16:57:03.858634 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8b with ip: 10.0.0.5 and alias: Ethernet-IB0
   Feb 19 16:57:03.869720 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
   Feb 19 16:57:03.878436 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
   Feb 19 16:57:03.886809 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
   Feb 19 16:57:03.895243 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
   Feb 19 16:57:03.903355 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
   Feb 19 16:57:03.906794 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
   Feb 19 16:57:03.914303 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
   Feb 19 16:57:03.922499 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
   Feb 19 16:57:28.426127 cmp227-5 ERR syncd#syncd: [none] SAI_API_NEIGHBOR:brcm_sai_dnx_set_neighbor_entry_attribute:1265 L3 host find failed with error Entry not found (0xfffffff9).
   Feb 19 16:57:28.426127 cmp227-5 ERR syncd#syncd: [none] SAI_API_NEIGHBOR:brcm_sai_set_neighbor_entry_attribute:637 pd neighbor set failed with error -7.
   Feb 19 16:57:28.426127 cmp227-5 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_SET failed in syncd mode: SAI_STATUS_ITEM_NOT_FOUND
   Feb 19 16:57:28.426232 cmp227-5 ERR syncd#syncd: :- processQuadEvent: attr: SAI_NEIGHBOR_ENTRY_ATTR_ENCAP_INDEX: 1074790415
   Feb 19 16:57:28.426539 cmp227-5 ERR swss#orchagent: :- updateVoqNeighborEncapIndex: Failed to update voq encap index for neighbor 10.0.0.5 on cmp227-4|asic0|PortChannel999, rv:-7
   Feb 19 16:57:28.426566 cmp227-5 ERR swss#orchagent: :- meta_sai_validate_oid: oid is set to null object id on SAI_OBJECT_TYPE_NEXT_HOP
   Feb 19 16:57:28.426584 cmp227-5 ERR swss#orchagent: :- removeNeighbor: Failed to remove next hop 10.0.0.5 on cmp227-4|asic0|PortChannel999, rv:-5
   Feb 19 16:57:28.426584 cmp227-5 ERR swss#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP, status: SAI_STATUS_INVALID_PARAMETER

@arlakshm
Contributor

@ysmanman, can you attach the tech-support dump to the issue?

@arlakshm added the Triaged and NOKIA labels on Feb 28, 2024
@mlok-nokia
Contributor

@saksarav-nokia Please help to take a look at this issue.

@saksarav-nokia
Contributor

@arlakshm @ysmanman, we don't see this error on our Nokia chassis with a full OC run. We also ran voq/test_voq_chassis_app_db_consistency.py ~75 times with 202205 and saw neither this error nor the crash.
We need more details to reproduce the issue.

@arlakshm
Contributor

Thanks @saksarav-nokia for the update. Reassigning this issue to @ysmanman, as this crash is seen during their testing.

@arlakshm added the Arista label and removed the NOKIA label on Feb 29, 2024
@ysmanman
Contributor Author

We ran into the failure in a recent T2 sonic-mgmt test run. We observed that stale system neighbors were not deleted from chassis db after load-minigraph:

   Mar 11 20:24:55.932512 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_NEIGH entries deleted:
   Mar 11 20:24:55.946421 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_INTERFACE entries deleted: 12
   Mar 11 20:24:55.960215 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_LAG_MEMBER_TABLE entries deleted: 0
   Mar 11 20:24:55.974263 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_LAG_TABLE entries deleted: 9

Note that the first syslog doesn't show how many system neighbor entries were deleted. This implies that chassis-db failed to evaluate the Lua script used to delete system neighbors. The issue may be similar to #17945.
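The symptom described here, a SYSTEM_NEIGH clean-up line with no count after "deleted:", can be checked mechanically when triaging test runs. Below is a minimal Python sketch, assuming the clean-up messages always follow the exact format of the four lines quoted above. `find_missing_counts` is a hypothetical helper, not part of sonic-mgmt.

```python
import re

# Modeled on the clean-up lines quoted above, e.g.
#   "... NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_NEIGH entries deleted: 12"
# Assumption: the count, when present, is a plain decimal integer at end of line.
CLEANUP_RE = re.compile(
    r"Chassis db clean up for (?P<ns>\S+)\. "
    r"Number of (?P<table>\S+) entries deleted:\s*(?P<count>\d*)\s*$"
)


def find_missing_counts(lines):
    """Return (namespace, table) pairs whose clean-up line carries no count.

    Hypothetical triage helper: an empty count is treated as a sign that the
    clean-up for that table did not report a result.
    """
    missing = []
    for line in lines:
        m = CLEANUP_RE.search(line)
        if m and m.group("count") == "":
            missing.append((m.group("ns"), m.group("table")))
    return missing


if __name__ == "__main__":
    syslog = [
        "Mar 11 20:24:55.932512 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_NEIGH entries deleted:",
        "Mar 11 20:24:55.946421 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_INTERFACE entries deleted: 12",
        "Mar 11 20:24:55.960215 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_LAG_MEMBER_TABLE entries deleted: 0",
        "Mar 11 20:24:55.974263 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_LAG_TABLE entries deleted: 9",
    ]
    print(find_missing_counts(syslog))  # [('swss0', 'SYSTEM_NEIGH')]
```

A count of `0` (as for SYSTEM_LAG_MEMBER_TABLE) is a valid result and is not flagged; only a line with nothing after "deleted:" is.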

@saksarav-nokia
Contributor

@ysmanman, this is already fixed in #17962.

@judyjoseph
Contributor

@ysmanman, please let us know whether this issue is resolved with #17962.

@ysmanman
Contributor Author

Hi @judyjoseph, yes, we are in the process of verifying whether #17962 fixes the issue. Given that the issue happens intermittently, we need a couple of runs to confirm whether it is fixed.

@ysmanman
Contributor Author

ysmanman commented Sep 4, 2024

We didn't see the error syslog in recent testing.
