Deleting neighbor entries that are next-hops of routing entries from NEIGH_TABLE leaves DEL operations pending in neighorch::m_toSync #4400
Comments
Please take this to the warm reboot subgroup meeting and discuss it there.
This issue appeared recently.
Hi @lguohan
After the warm reboot test finished, the neighbors were somehow removed and then learned again approximately 1 second later.
The following messages indicated that the entries were learned:
Then the following messages were observed:
This means the neighbor removal notifications cannot be handled and will remain in
I suspect even in the 201911 branch the kernel will notify the removal of the neighbors which are the next-hop of the routing entries. But the issue isn't observed in 201911 branch because
However, this logic was changed by PR #1184, which causes the issue. IMO the logic implemented by PR #1184 is better because it provides the ability to remove and recreate an object, which is necessary when we want to modify a create-only attribute of an object. So I suggest improving the neighorch design by improving the reference count mechanism as mentioned above.
I experienced exactly the same issue in the master image. I found that it happens, and blocks the following warm reboot, as long as SONiC boots up (whether cold, fast, or warm) with the ping traffic from the warm/fast reboot test running. The logic in sonic-net/sonic-swss#1184 makes the DEL operation stay in m_toSync. So I think this might not strictly be a warm-reboot problem; it just happens to block warm reboot. A DEL operation would be stuck in m_toSync and keep being retried even if no warm-reboot attempt follows.
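To make the failure mode concrete, here is a minimal sketch of the retry behavior described in this comment. It is a simplified model under stated assumptions, not the actual sonic-swss code: `NeighborModel`, `Op`, `doTask`, and `releaseNextHop` are hypothetical names, and `toSync` only loosely stands in for m_toSync.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the real sonic-swss types.
enum class Op { Set, Del };

struct NeighborModel {
    std::map<std::string, int> refCount;  // neighbor IP -> number of routes using it as next-hop
    std::map<std::string, Op> toSync;     // pending tasks, loosely modelled on m_toSync

    // A route stops using the neighbor as its next-hop.
    void releaseNextHop(const std::string& ip) {
        auto ref = refCount.find(ip);
        assert(ref != refCount.end() && ref->second > 0);
        --ref->second;
    }

    // One pass over the pending tasks. Returns how many tasks are still pending.
    std::size_t doTask() {
        for (auto it = toSync.begin(); it != toSync.end();) {
            const std::string& ip = it->first;
            if (it->second == Op::Del) {
                auto ref = refCount.find(ip);
                if (ref != refCount.end() && ref->second > 0) {
                    ++it;  // still referenced: the DEL stays queued for a later retry
                    continue;
                }
                refCount.erase(ip);       // unreferenced: the neighbor can be removed
            } else {
                refCount.emplace(ip, 0);  // SET: learn the neighbor if it is new
            }
            it = toSync.erase(it);        // task handled, drop it from the queue
        }
        return toSync.size();
    }
};
```

In this model, a DEL for a neighbor that a route still references survives every `doTask()` pass, so the queue never drains until the route itself goes away.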
You’re correct. This issue can also be reproduced without warm reboot.
The title and description have been updated.
@stephenxs can you validate and close the issue?
Description
Deleting neighbor entries that are next-hops of routing entries from NEIGH_TABLE leaves DEL operations pending in neighorch::m_toSync
It can be reproduced in any scenario where the kernel ages out or removes a neighbor entry (ARP or ND) that is the next-hop of a routing entry. An example is the warm reboot test: after warm reboot, orchagent receives neighbor remove notifications from the kernel via netlink.
This issue can prevent orchagent from being frozen for the next warm reboot.
I suspect even in the 201911 branch the kernel will notify the removal of the neighbors which are the next-hop of the routing entries. But the issue isn't observed in 201911 branch because
However, this logic was changed by PR #1184. IMO the logic implemented by PR #1184 is better because it provides the ability to remove and recreate an object, which is necessary when we want to modify a create-only attribute of an object. So I suggest improving the neighorch implementation.
Currently, the notification of removing a neighbor with a non-zero reference count will remain in m_toSync. I think this is based on the assumption that the reference count will drop to zero in a short time. However, I'm afraid the assumption is not correct, because there is no guarantee that the kernel won't age out an ARP entry that is the next-hop of some routing entries. In this case, the notification can remain in m_toSync for a long time.
Is it possible to handle this scenario in the following way?
Add a PENDING_REMOVE flag to the neighbor entry for which a removal notification is received while the reference count isn't zero, and remove the notification from m_toSync.
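The proposal could be sketched roughly as follows. This is only an illustration under assumptions, not the neighorch implementation: `NeighborEntry`, `NeighborTable`, and all method names are hypothetical. Two behaviors are my own reading rather than something stated here: completing the deferred removal when the reference count reaches zero, and clearing the flag when the neighbor is learned again.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical types for illustration; not the actual neighorch data structures.
struct NeighborEntry {
    int refCount = 0;            // routes currently using this neighbor as next-hop
    bool pendingRemove = false;  // the proposed PENDING_REMOVE flag
};

class NeighborTable {
public:
    // Neighbor learned (or learned again). Assumption: re-learning cancels a deferred removal.
    void add(const std::string& ip) { entries_[ip].pendingRemove = false; }

    void increaseRef(const std::string& ip) { entries_.at(ip).refCount++; }

    // DEL notification. It is always consumed, so nothing stays queued for retry.
    void handleDel(const std::string& ip) {
        auto it = entries_.find(ip);
        if (it == entries_.end()) return;      // unknown neighbor: nothing to do
        if (it->second.refCount > 0) {
            it->second.pendingRemove = true;   // still referenced: defer the removal
        } else {
            entries_.erase(it);                // unreferenced: remove right away
        }
    }

    // Assumption: the deferred removal completes once the last reference is dropped.
    void decreaseRef(const std::string& ip) {
        auto it = entries_.find(ip);
        assert(it != entries_.end() && it->second.refCount > 0);
        if (--it->second.refCount == 0 && it->second.pendingRemove) {
            entries_.erase(it);
        }
    }

    bool contains(const std::string& ip) const { return entries_.count(ip) != 0; }

    bool isPendingRemove(const std::string& ip) const {
        auto it = entries_.find(ip);
        return it != entries_.end() && it->second.pendingRemove;
    }

private:
    std::map<std::string, NeighborEntry> entries_;
};
```

The point of the design is that the removal notification is handled exactly once: either the entry is removed immediately, or the intent to remove it is recorded on the entry itself instead of being retried from the pending queue.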
Steps to reproduce the issue:
Describe the results you received:
After the warm-reboot test, try warm-reboot a second time; an error message is logged.
Adjust orchagent's log level to INFO, and the following logs are found:
However, the neighbors are up and the portchannels aren't down after warm-reboot. I once built an image with additional debug info and found that neighbor removal messages are received from the netlink socket, which means the neighbors are removed by the kernel.
Describe the results you expected:
The orchagent shouldn't receive BGP neighbor removal messages.
Additional information you deem important (e.g. issue happens only occasionally):
Output of show version:
sonic_dump_r-tigris-04_20200408_103340.tar.gz
test-record.log
lacp-state-on-vm.log