Routeorch calls SAI API set attribute for a routing entry which was just removed from SAI in the same bulk operation. Should recreate it in such case. #9434
Comments
Can you check if the failure is due to a regression in code or test? Not sure if there was any recent change in route_orch for this flow.
@prsunny It looks like a production issue in route orch that was introduced by the bulk operation but was hard to trigger. Now that sonic-net/sonic-swss#1992 is merged, it is much easier to trigger.
This issue is reproducible by using gdb.
@shi-su, could you please take a look?
@shi-su could you please update?
I'll check it today.
@abdosi following this.
@stephenxs I drafted a PR sonic-net/sonic-swss#2071 that would potentially resolve the issue. Would you mind adding a test case for the issue? @abdosi Let me know if you have any thoughts or comments.
Not sure I understand your request to add a test.
…set operations (#2071)
What I did
Check if there are items pending removal in bulk before calling bulk set API. Fixes sonic-net/sonic-buildimage#9434
Why I did it
When there are items pending removal in bulk before calling set API, it means the item will be removed before the set and it should do create instead.
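
The check described in this commit message can be sketched in a small Python model. This is only an illustration of the idea, not the code from the PR: the class `BulkRouteOrch`, its methods, and the `pending_removal` / `pending_create` sets are hypothetical names, `synced` stands in for `m_syncdRoutes`, and a plain dict stands in for the SAI object store.

```python
class BulkRouteOrch:
    """Toy model of a route orch that queues operations and flushes them in bulk."""

    def __init__(self):
        self.sai = {}                 # simulated SAI object store (assumption)
        self.synced = {}              # plays the role of m_syncdRoutes
        self.pending = []             # queued (op, prefix, nexthop) tuples
        self.pending_removal = set()  # prefixes with a removal queued in this bulk
        self.pending_create = set()   # prefixes with a create queued in this bulk

    def _will_exist(self, prefix):
        # State of the prefix once all queued operations have been applied.
        if prefix in self.pending_removal:
            return False
        return prefix in self.synced or prefix in self.pending_create

    def add(self, prefix, nexthop):
        if self._will_exist(prefix):
            op = "set"
        else:
            # Either a new prefix or one whose removal is still pending:
            # the entry will not exist in SAI by then, so create it.
            op = "create"
            self.pending_create.add(prefix)
            self.pending_removal.discard(prefix)
        self.pending.append((op, prefix, nexthop))

    def remove(self, prefix):
        if not self._will_exist(prefix):
            return  # nothing to remove
        self.pending.append(("remove", prefix, None))
        self.pending_removal.add(prefix)
        self.pending_create.discard(prefix)

    def flush(self):
        ops, self.pending = self.pending, []
        self.pending_removal.clear()
        self.pending_create.clear()
        for op, prefix, nexthop in ops:
            if op == "remove":
                del self.sai[prefix]  # KeyError if the entry is missing
                self.synced.pop(prefix, None)
                continue
            if op == "create" and prefix in self.sai:
                raise KeyError(f"create: {prefix} already exists")
            if op == "set" and prefix not in self.sai:
                raise KeyError(f"set: {prefix} does not exist")
            self.sai[prefix] = nexthop
            self.synced[prefix] = nexthop
```

With this model, an add, a flush, and then a remove followed by a second add in the same bulk queue `remove` and `create` rather than `remove` and `set`, so the flush succeeds.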
Description
Orchagent exited because it tried to set a non-existent routing entry during the bgp speaker test with the latest swss.
The issue was observed during azure pipeline kvm t0 bgp speaker test of PR #9397.
Steps to reproduce the issue:
Describe the results you received:
The following error was observed during the bgp speaker test:
route orchagent operating flow on the routing entries:
So for the second set operation, route orchagent should call "create" instead of "set".
However, according to sairedis.rec, it called "set" for the second set operation when the routing entries didn't exist in SAI, which caused SAIredis failure.
ROUTE_TABLE in swss.rec:
Describe the results you expected:
There should be no error.
Output of `show version`:
Output of `show techsupport`:
Additional information you deem important (e.g. issue happens only occasionally):
The flow should be like this.
During the previous test, the routing entries were announced at the beginning and withdrawn after the test finished. Then the next test started and the same set of routing entries was announced again.
So for each routing entry P, `fpmsyncd` issued the next operations:
`routeorch` should handle the notifications as below:
- … `m_syncdRoutes` after bulk call returned
- … `m_syncdRoutes` after bulk call returned
- … `m_syncdRoutes` after bulk call returned
However, since a large number of routing entries were being notified, the notifications of P for `remove` and the second `add` were packed and handled in the same bulk call.
Then the flow was:
- … `gRouteBulker.create_entry` in bulk mode and then inserted P into `m_syncdRoutes` after bulk call returned
- … `gRouteBulker.remove_entry` in bulk mode, the call was pending.
- … `gRouteBulker.set_entry_attribute` since P had not been removed from `m_syncdRoutes`, which is problematic because by the time SAI handled this operation, P had already been removed.
When SAI handled the operations,
- … `m_saiObjectCollection` as well.
- … `m_saiObjectCollection` when `remove_entry` was called for P. Therefore, it notified orchagent to exit.
I think this bug has been in routeorch for a while, but it occurs only in very rare scenarios.
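
The failing sequence can be modelled with a short Python simulation. It is a sketch under stated assumptions, not the orchagent or sairedis code: `FakeSai`, `NaiveRouteOrch`, and `reproduce` are hypothetical names, `synced` stands in for `m_syncdRoutes`, and the simulated SAI store simply rejects a set on an object it does not hold.

```python
class SaiError(Exception):
    """Raised when the simulated SAI layer rejects an operation."""


class FakeSai:
    """Stand-in for the SAI object store (hypothetical, not the sairedis API)."""

    def __init__(self):
        self.objects = {}

    def create(self, key, attr):
        if key in self.objects:
            raise SaiError(f"create: {key} already exists")
        self.objects[key] = attr

    def remove(self, key):
        if key not in self.objects:
            raise SaiError(f"remove: {key} does not exist")
        del self.objects[key]

    def set(self, key, attr):
        if key not in self.objects:
            raise SaiError(f"set: {key} does not exist")
        self.objects[key] = attr


class NaiveRouteOrch:
    """Chooses create vs. set from `synced` only, ignoring queued removals."""

    def __init__(self, sai):
        self.sai = sai
        self.synced = {}   # plays the role of m_syncdRoutes
        self.pending = []  # operations queued for the next bulk flush

    def add(self, prefix, nexthop):
        # `synced` is only updated at flush time, so a queued removal of
        # `prefix` is invisible here and "set" is chosen by mistake.
        op = "set" if prefix in self.synced else "create"
        self.pending.append((op, prefix, nexthop))

    def remove(self, prefix):
        self.pending.append(("remove", prefix, None))

    def flush(self):
        ops, self.pending = self.pending, []
        for op, prefix, nexthop in ops:
            if op == "create":
                self.sai.create(prefix, nexthop)
                self.synced[prefix] = nexthop
            elif op == "set":
                self.sai.set(prefix, nexthop)
                self.synced[prefix] = nexthop
            else:
                self.sai.remove(prefix)
                self.synced.pop(prefix, None)


def reproduce():
    """First add in its own bulk, then remove and second add in one bulk."""
    orch = NaiveRouteOrch(FakeSai())
    orch.add("P", "nh1")
    orch.flush()
    orch.remove("P")
    orch.add("P", "nh2")
    try:
        orch.flush()
    except SaiError as err:
        return str(err)
    return None
```

Calling `reproduce()` returns the error raised by the simulated SAI layer for the set on P, which corresponds to the point where the real orchagent was told to exit.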
The bug has been observed since PR sonic-net/sonic-swss#1992 was merged, because that PR includes an optimization that pushes as many notifications as possible from the redis-db table to `m_toSync`:
The old logic:
The new logic:
So in the new logic, there is a higher probability for orchagent to merge all notifications together and for `remove` and the second `set` to be bundled in one bulk, and therefore, it is more likely to trigger the bug.
A solution can be:
- … `pending removal`.
- … `pending removal` flag is about to be added, don't handle it until the bulk has been flushed.
Workarounds can be: