Dynamically learned neighbors are not deleted from ASIC_DB when eBGP interfaces are shutdown and the neighbors are flushed #12442

Closed
vganesan-nokia opened this issue Oct 18, 2022 · 10 comments
Labels: Chassis 🤖 (Modular chassis support), P0 (Priority of the issue), Triaged (this issue has been triaged)


@vganesan-nokia
Contributor

Description

After the eBGP neighbor interfaces (ports and port channels) are shut down, issuing the "sonic-clear arp" or "sonic-clear ndp" command does not clear the dynamically learned eBGP neighbors from ASIC_DB. The orchagent syslog shows "Failed to remove still referenced neighbor". This is unexpected and incorrect: bringing down the eBGP interfaces brings down the eBGP neighbors, so all routes attached to those neighbors are withdrawn and no route should still reference them. Flushing the neighbors clears them from the Linux kernel, but they still exist in ASIC_DB. (See "Additional information" below for the root cause of this issue.)

Steps to reproduce the issue:

The following is one scenario in which the problem reproduces more frequently:

  1. Establish eBGP sessions and advertise default route (IPv4 or IPv6).
  2. Make sure that the default routes exist in the kernel, APPL_DB and ASIC_DB as expected.
  3. Shut down interfaces such that all the eBGP neighbors that advertised the default route are down.
  4. Dump the routes in the kernel and make sure that the default route has only one next hop, on "eth0".
  5. Clear the dynamically learned neighbors using the "sonic-clear ndp" or "sonic-clear arp" command.
  6. Dump the neighbors in the linux kernel using "ip neigh show" command
  7. Dump neighbors from ASIC_DB
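
The comparison in steps 6 and 7 can be sketched as a small offline check. This is only an illustration: the helper names are hypothetical, the sample addresses and OIDs are made up, and it assumes that ASIC_DB neighbor keys carry a JSON payload with an "ip" field after the `ASIC_STATE:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:` prefix and that the state is the last field of each "ip neigh show" line.

```python
import json

# Assumed key prefix for neighbor entries in ASIC_DB.
ASIC_NEIGH_PREFIX = "ASIC_STATE:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:"


def kernel_neighbors(ip_neigh_output):
    """Return the set of neighbor IPs listed in `ip neigh show` output."""
    ips = set()
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # Skip blank lines and entries the kernel reports as FAILED.
        if fields and fields[-1] != "FAILED":
            ips.add(fields[0])
    return ips


def asic_neighbors(asic_db_keys):
    """Return the set of neighbor IPs found among ASIC_DB keys."""
    ips = set()
    for key in asic_db_keys:
        if key.startswith(ASIC_NEIGH_PREFIX):
            entry = json.loads(key[len(ASIC_NEIGH_PREFIX):])
            ips.add(entry["ip"])
    return ips


def stale_asic_neighbors(ip_neigh_output, asic_db_keys):
    """Neighbors still present in ASIC_DB but gone from the kernel."""
    return sorted(asic_neighbors(asic_db_keys) - kernel_neighbors(ip_neigh_output))


# Made-up sample data: only the eth0 neighbor remains in the kernel.
sample_kernel = "10.3.146.1 dev eth0 lladdr 00:aa:bb:cc:dd:01 REACHABLE\n"
sample_keys = [
    'ASIC_STATE:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"10.0.0.1",'
    '"rif":"oid:0x6000000000a01","switch_id":"oid:0x21000000000000"}',
    'ASIC_STATE:SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"0.0.0.0/0",'
    '"switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000022"}',
]
stale = stale_asic_neighbors(sample_kernel, sample_keys)
```

A non-empty `stale` list corresponds to the failure described in this issue.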

Describe the results you received:

  • Dynamically learned eBGP neighbors are removed from linux kernel as expected.
  • But some dynamically learned neighbors are still in ASIC_DB.

Describe the results you expected:

  • All the dynamically learned neighbors must be cleared from linux kernel, APPL_DB and ASIC_DB when the neighbors are flushed after shutting down the eBGP neighbor interfaces.


Additional information you deem important (e.g. issue happens only occasionally):

The root cause of this problem is that some routes learned from eBGP neighbors are not cleared from APPL_DB when the eBGP neighbors go down. The affected routes are those that include a next hop on interface "eth0". This consistently occurs for the default routes, for both IPv4 and IPv6, in any asic instance. Since the route is not deleted from APPL_DB (though it is deleted from the kernel), the neighbor ref_count is not 0 and hence the neighbor is not deleted from ASIC_DB.
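
The reference-count dependency can be illustrated with a toy model. This is not orchagent code; the class and method names are hypothetical and only mirror the rule that a neighbor cannot be removed while a route still references it.

```python
class NeighborTable:
    """Toy model: a neighbor is removable only when no route references it."""

    def __init__(self):
        self.ref_count = {}

    def add_neighbor(self, ip):
        self.ref_count.setdefault(ip, 0)

    def add_route(self, nexthops):
        # Each route holds one reference on every next-hop neighbor.
        for nh in nexthops:
            self.ref_count[nh] += 1

    def remove_route(self, nexthops):
        for nh in nexthops:
            self.ref_count[nh] -= 1

    def remove_neighbor(self, ip):
        # Mirrors the "still referenced neighbor" failure in the syslog.
        if self.ref_count.get(ip, 0) > 0:
            return False
        self.ref_count.pop(ip, None)
        return True


table = NeighborTable()
table.add_neighbor("10.0.0.1")
table.add_route(["10.0.0.1"])

# The stale route still holds a reference, so the flush fails.
flushed_with_stale_route = table.remove_neighbor("10.0.0.1")

# Once the route is deleted, the same flush succeeds.
table.remove_route(["10.0.0.1"])
flushed_after_route_delete = table.remove_neighbor("10.0.0.1")
```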

All asic instances include a default route with a next hop on eth0 (with a large metric, to avoid colliding with default routes learned via eBGP). APPL_DB has this route with only the eBGP next hops, because routesync filters out kernel route updates that have a next hop on eth0 or docker0. Consequently, the next hop on eth0 is not included in the route entry in APPL_DB and ASIC_DB. When all the eBGP neighbor interfaces are shut down, all eBGP neighbors go down and the default route learned via eBGP is withdrawn. The next hop on eth0 then becomes the only next hop for the default route. When the kernel sends this update, the routesync filter mentioned above prevents it from reaching APPL_DB. Hence APPL_DB is left with a stale route entry holding whatever next hops it had before the eth0 next hop became the only next hop. Since the stale route still references these next hops, the corresponding neighbors are not cleared when neighbors are flushed.
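
The sequence described above can be modelled in a few lines. This is a simplified sketch, not the routesync source: APPL_DB is represented as a plain dict, the function name is hypothetical, and the addresses and interface names are made up.

```python
# Interfaces whose next hops routesync is described as filtering out.
FILTERED_IFACES = {"eth0", "docker0"}


def on_kernel_route_update(appl_db, prefix, nexthops):
    """Model of the pre-fix behaviour: an update with a next hop on a
    filtered interface is dropped, leaving APPL_DB untouched."""
    if any(ifname in FILTERED_IFACES for _gw, ifname in nexthops):
        return
    appl_db[prefix] = list(nexthops)


appl_db = {}

# 1. eBGP sessions are up: the default route points at the eBGP next hops.
on_kernel_route_update(
    appl_db, "0.0.0.0/0",
    [("10.0.0.1", "Ethernet0"), ("10.0.0.5", "PortChannel1")],
)

# 2. eBGP goes down: the kernel now reports eth0 as the only next hop,
#    but the filter drops this update.
on_kernel_route_update(appl_db, "0.0.0.0/0", [("10.3.146.1", "eth0")])

# The entry from step 1 is still present, i.e. it is stale.
stale_entry = appl_db["0.0.0.0/0"]
```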

A possible solution to both of these issues, namely (1) the stale route entry in APPL_DB and ASIC_DB and (2) the stale neighbor entry in ASIC_DB, is to delete the route from APPL_DB when a kernel update is received with a next hop on "eth0" as the only next hop.
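
A minimal sketch of this proposal, under the same simplifications (APPL_DB as a dict, hypothetical function name, made-up addresses). Updates that mix filtered and data-plane next hops are left as before; this sketch does not address them.

```python
FILTERED_IFACES = {"eth0", "docker0"}


def on_kernel_route_update_fixed(appl_db, prefix, nexthops):
    """Delete the route when every next hop is on a filtered interface,
    instead of silently ignoring the update."""
    ifaces = {ifname for _gw, ifname in nexthops}
    if ifaces and ifaces <= FILTERED_IFACES:
        # Only eth0/docker0 next hops remain: remove the stale entry.
        appl_db.pop(prefix, None)
        return
    if ifaces & FILTERED_IFACES:
        # Mixed next hops: still dropped, as in the original filter.
        return
    appl_db[prefix] = list(nexthops)


appl_db = {}
on_kernel_route_update_fixed(appl_db, "::/0", [("fc00::1", "Ethernet0")])
had_route_while_bgp_up = "::/0" in appl_db

# eBGP down: eth0 is the only next hop, so the entry is deleted.
on_kernel_route_update_fixed(appl_db, "::/0", [("fe80::1", "eth0")])
has_route_after_bgp_down = "::/0" in appl_db
```

With the route gone, nothing references the eBGP neighbors any more, so a subsequent flush can remove them.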

@rlhui
Contributor

rlhui commented Oct 24, 2022

@vganesan-nokia - in which release was this found, and on what kind of platform (e.g. chassis or pizza box)?

rlhui added the Chassis 🤖 (Modular chassis support) label Nov 5, 2022
rlhui added the P0 (Priority of the issue) label Nov 12, 2022
@arlakshm
Contributor

@vganesan-nokia,
I tried to repro the issue with the following steps but did not see the issue:

  • Shut down all eBGP sessions.
  • Check if the default route is removed from APPL_DB and ASIC_DB.
  • Shut down the port to flush out the neighbor entries.

After these steps I see that the default route is removed from APPL_DB and ASIC_DB.
The neighbor entry is removed from APPL_DB and CHASSIS_APP_DB.

I am testing on the latest 202205 image. Can you confirm whether you still see the problem and, if so, confirm the steps to reproduce?

vganesan-nokia added a commit to vganesan-nokia/sonic-swss that referenced this issue Nov 29, 2022
Signed-off-by: vedganes <[email protected]>

The changes fix the stale neighbor in the ASIC_DB and data path
when eBGP neighbors are shut down and neighbors are flushed. The problem
is described in issue: sonic-net/sonic-buildimage#12442
The root cause of this issue is that the route is not deleted from the
ASIC_DB when the route's next hop is on the eth0 or docker0 interface. The solution is
to delete the route entry from ASIC_DB instead of just returning when the route's
next hop is on the interface eth0 or docker0.
@mannytaheri

@abdosi
Contributor

abdosi commented Dec 13, 2022

@prsunny for visibility. Hitting the same issue on another chassis platform for the IPv6 default route.

@abdosi
Contributor

abdosi commented Dec 13, 2022

@vganesan-nokia this looks like an issue with only the default route, because I am seeing it only for that. If that is the case, can we update the issue title and description to point that out (only default routes are impacted)?

@abdosi
Contributor

abdosi commented Dec 13, 2022

Hit while running test_default_route.py::test_default_route_with_bgp_flap

@vganesan-nokia
Contributor Author

@vganesan-nokia this looks like an issue with only the default route, because I am seeing it only for that. If that is the case, can we update the issue title and description to point that out (only default routes are impacted)?

@abdosi, this issue can happen for any route that is learned from eBGP neighbors and has an additional next hop on eth0/docker0, when those eBGP neighbors go down.

@abdosi
Contributor

abdosi commented Dec 13, 2022

Wondering: is there any route over eth0/docker0 other than the default route?

vganesan-nokia added a commit to vganesan-nokia/sonic-swss that referenced this issue Jan 6, 2023
rlhui added the Triaged (this issue has been triaged) label Jan 11, 2023
prsunny pushed a commit to sonic-net/sonic-swss that referenced this issue Jan 12, 2023
* [routesync] Fix for stale dynamic neighbor

The changes fix the stale neighbor in the ASIC_DB and data path
when eBGP neighbors are shut down and neighbors are flushed. The problem
is described in issue: sonic-net/sonic-buildimage#12442
The root cause of this issue is that the route is not deleted from the
ASIC_DB when the route's next hop is on the eth0 or docker0 interface. The solution is
to delete the route entry from ASIC_DB instead of just returning when the route's
next hop is on the interface eth0 or docker0.

This commit fixes the warm restart unit test failure. When a route whose
only next hop is on eth0 or docker0 is removed, and the route is the default
route, orchagent sends an "add" of a black hole route to syncd, so the ASIC
DB gets an hset message. When this happens during warm restart, the unit
test identifies this as an unwanted setting and fails. To fix
this issue, the route delete is sent only if warm restart is not in
progress. This follows the same warm restart handling approach
used for route delete in other places.

Signed-off-by: vedganes <[email protected]>
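
The warm restart handling described in the commit message above can be sketched as a simple guard. This is an illustrative model, not the code from the fix: the function name, the boolean flag and the `send_delete` callback are all hypothetical stand-ins.

```python
def handle_filtered_route(prefix, warm_restart_in_progress, send_delete):
    """Send a route delete for a route whose next hop is on eth0/docker0,
    unless a warm restart is in progress.

    Returns True if the delete was sent, False if it was suppressed."""
    if warm_restart_in_progress:
        # During warm restart the delete is skipped, so no unexpected
        # write reaches the ASIC DB.
        return False
    send_delete(prefix)
    return True


deleted = []

# Normal operation: the delete is sent.
sent_normally = handle_filtered_route("0.0.0.0/0", False, deleted.append)

# Warm restart in progress: the delete is suppressed.
sent_during_warm_restart = handle_filtered_route("::/0", True, deleted.append)
```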
yxieca pushed a commit to sonic-net/sonic-swss that referenced this issue Jan 12, 2023
@vganesan-nokia
Contributor Author

Fixed by PR sonic-net/sonic-swss#2553

StormLiangMS pushed a commit to sonic-net/sonic-swss that referenced this issue May 19, 2023
@yuxuehong

When a route has multiple next hops, none of which is on eth0, FRR has a feature where, if one next hop becomes invalid, it resolves via the default route. Zebra then updates the route with multiple next hops, one of which is via eth0, and this situation is not handled.
So, to handle a multi-next-hop route that includes eth0, should we update APPL_DB with the non-eth0 next hops?
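
What this question suggests could look roughly like the following. It is a hypothetical sketch of the idea only, not something the fix referenced in this thread is known to do; the names and addresses are made up.

```python
MGMT_IFACES = {"eth0", "docker0"}


def data_plane_nexthops(nexthops):
    """Keep only the next hops that are not on eth0/docker0."""
    return [(gw, ifname) for gw, ifname in nexthops if ifname not in MGMT_IFACES]


def publish_route(appl_db, prefix, nexthops):
    """Write the non-eth0 next hops to the (modelled) APPL_DB, or delete
    the route when none remain."""
    remaining = data_plane_nexthops(nexthops)
    if remaining:
        appl_db[prefix] = remaining
    else:
        appl_db.pop(prefix, None)


appl_db = {}
publish_route(
    appl_db, "192.168.0.0/24",
    [("10.0.0.1", "Ethernet0"), ("10.3.146.1", "eth0")],
)
mixed_result = appl_db["192.168.0.0/24"]
```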
