[Flex Counter] observe race condition while removing a RIF object #14628
Comments
Does this PR fix the issue: sonic-net/sonic-swss#2488?
@dgsudharsan to check with Junchao and get back
Hi @neethajohn, the PR sonic-net/sonic-swss#2488 does not fix the issue. Here is my analysis; correct me if anything is wrong. When removing a RIF, intfsorch.cpp does the following:
So it removes the flex counter prior to removing the RIF. This is correct. However, it does not guarantee the processing order on the syncd side. In syncd, there are 4 selectables (sockets) that handle Redis DB changes:
SONiC's select is based on epoll. From my understanding, epoll cannot guarantee ordering among different sockets, and SONiC has no mechanism to order events across sockets. An example of the issue: orchagent sends:
syncd's select is processing m_selectableChannel at this time, so it processes:
syncd's select continues running and processes m_flexCounter:
In this case, the syncd processing order differs from the order in which orchagent sent the messages. Since the flex counter polls in a different thread, things go wrong if the following happens on the syncd side:
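The reordering described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not syncd code: the channel names `flex_counter` and `object`, the object ID `rif_1`, and the function `run_syncd` are hypothetical stand-ins for the real selectables, and the polling thread is modelled as a check that runs after each channel is serviced.

```python
from collections import deque


def run_syncd(service_order):
    """Simulate syncd draining two independent channels in the given order."""
    # Hypothetical state: one RIF exists in the ASIC and is being polled.
    asic_objects = {"rif_1"}
    polled_counters = {"rif_1"}
    channels = {"flex_counter": deque(), "object": deque()}

    # The sender uses the correct order: stop polling first, then remove.
    channels["flex_counter"].append(("del_counter", "rif_1"))
    channels["object"].append(("remove_object", "rif_1"))

    errors = []
    for name in service_order:
        # Drain whichever channel the select loop happened to pick.
        while channels[name]:
            op, oid = channels[name].popleft()
            if op == "del_counter":
                polled_counters.discard(oid)
            else:
                asic_objects.discard(oid)
        # The polling thread may run between two channel services.
        for oid in sorted(polled_counters):
            if oid not in asic_objects:
                errors.append(f"poll failed: {oid} no longer exists")
    return errors


in_order = run_syncd(["flex_counter", "object"])
reordered = run_syncd(["object", "flex_counter"])
print("counter channel first:", in_order)
print("object channel first:", reordered)
```

The sender behaves identically in both runs; only the order in which the receiver services its channels changes, and that alone decides whether the poller touches a removed object.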
This applies not only to RIF objects but to all kinds of objects. Recently we observed a similar symptom on buffer pool objects when removing the zero buffer pool. In that case, the buffer pool was removed first and then the counter, but in syncd the counter was still fetched just after the buffer pool was removed.
The root cause of the issue is that the removal notifications for an object and for its counters are received in the wrong order, because orchagent and sairedis use multiple channels for counter and object operations.
The formal solution should be to use a unified communication channel for both counters and objects.
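As a rough illustration of why a unified channel would help, the sketch below sends both operations through a single FIFO queue. It is a simplified model with hypothetical names (`run_single_channel`, `rif_1`), not the sairedis implementation; the poller is again modelled as a check after every processed message.

```python
from collections import deque


def run_single_channel(messages):
    """Process counter and object operations from one FIFO channel."""
    # Hypothetical state: one RIF exists in the ASIC and is being polled.
    asic_objects = {"rif_1"}
    polled_counters = {"rif_1"}
    channel = deque(messages)

    errors = []
    while channel:
        # A single queue preserves the order in which the sender enqueued.
        op, oid = channel.popleft()
        if op == "del_counter":
            polled_counters.discard(oid)
        else:
            asic_objects.discard(oid)
        # Let the poller run after every message, the worst case for races.
        for polled in sorted(polled_counters):
            if polled not in asic_objects:
                errors.append(f"poll failed: {polled} no longer exists")
    return errors


unified_errors = run_single_channel(
    [("del_counter", "rif_1"), ("remove_object", "rif_1")]
)
print("single channel:", unified_errors)
```

With one queue, the receiver cannot observe the object removal before the counter removal as long as the sender enqueues them in that order.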
There was an idea that all operations should go via zna channel. The flex counter channel was added outside my supervision and merged. This race condition has been known for a while now; maybe it just didn't surface. I mentioned several times that this can happen, and the solution is to merge all sairedis operations into 1 channel.
hi
Thanks for clarifying it.
Could you please share the PR when it is available?
Fixed by sonic-net/sonic-swss#3076 and sonic-net/sonic-sairedis#1362
Description
In intfsorch.cpp, the function removeRouterIntfs removes the flex counter before removing the RIF object. See code:
However, this cannot guarantee that the flex counter is removed before the RIF object, because of the asynchronous architecture of syncd. The flow may look like this:
Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected:
There is no race condition
Output of show version:
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):