-
Notifications
You must be signed in to change notification settings - Fork 547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stale rif counter db removal (fix for #2193)_ #2199
Conversation
@dgsudharsan , could you please review? |
/AzurePipelines run |
Commenter does not have sufficient privileges for PR 2199 in repo Azure/sonic-swss |
42ae93b
to
02d2ffa
Compare
02d2ffa
to
9cbb288
Compare
/azpw run Azure.[stale_rif_counterdb_clear] |
/AzurePipelines run Azure.[stale_rif_counterdb_clear] |
No pipelines are associated with this pull request. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
I am not sure that we should do this in swss. There could be race condition: removing a counter from swss does not mean the counter will be removed from syncd immediately. A race condition case could be like:
So I would suggest to do this in syncd. You may refer to this function: https://github.com/Azure/sonic-sairedis/blob/473c99067c8132f8e5b3f8f2abd672b474b42b5a/syncd/FlexCounter.cpp#L986 |
@@ -1219,11 +1220,40 @@ void IntfsOrch::removeRifFromFlexCounter(const string &id, const string &name) | |||
SWSS_LOG_DEBUG("Unregistered interface %s from Flex counter", name.c_str()); | |||
} | |||
|
|||
void IntfsOrch::cleanUpRifFromCounterDb(const string &id, const string &name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you consider moving this to syncd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Junchao-Mellanox for the review/comments. The current changes done is alongside/inline-with other rif counter cleanup code. The idea was to have all rif counter related cleanup should be triggered at the same place. Please refer to the call made for removeRifFromFlexCounter in removeRouterIntfs .
Will also request @dgsudharsan for comments if all the clean-up for rif counters (existing:removeRifFromFlexCounter and newly added cleanUpRifFromCounterDb) should move to syncd ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Junchao-Mellanox , This request (to move cleanup to syncd) needs to be evaluated and then considered . Please note that there are other cleanup for rif counters which also needs to be moved to syncd in such case.
For now , will move ahead with this PR merge. If after discussion/evaluation it is decided to move these code to syncd , will have a new PR for the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sumanbrcm I believe the reason for requesting to move is due to a race condition when orchagent deletes the counters, syncd can try to access it given the operations are asynchronous. I suggest to have the cleanup of counters in syncd but if this warrants a discussion, lets have one. @prsunny What is your opinion on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dgsudharsan . As mentioned in previous comments , since remaining of cleanup code for rif counters already present in orchagent , we are keeping this change in orchagent. For future , if needed we can move all cleanup code to syncd after discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this race condition won't affect any function, please add a TODO comment here to remind there is a race condition. I will approve the PR after that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sumanbrcm , @Junchao-Mellanox, since the tests are passing, I'm going ahead with the merge as the outstanding concern is to add comment. @sumanbrcm , please raise another PR to address the comment.
/azpw run Azure.[stale_rif_counterdb_clear] |
/AzurePipelines run Azure.[stale_rif_counterdb_clear] |
No pipelines are associated with this pull request. |
/azpw run Azure.sonic-swss |
/AzurePipelines run Azure.sonic-swss |
Azure Pipelines successfully started running 1 pipeline(s). |
/azpw run Azure.sonic-swss |
/AzurePipelines run Azure.sonic-swss |
Azure Pipelines successfully started running 1 pipeline(s). |
/azpw run Azure.sonic-swss |
/AzurePipelines run Azure.sonic-swss |
Azure Pipelines successfully started running 1 pipeline(s). |
/azpw run Azure.sonic-swss |
/AzurePipelines run Azure.sonic-swss |
Azure Pipelines successfully started running 1 pipeline(s). |
/azpw run Azure.sonic-swss |
/AzurePipelines run Azure.sonic-swss |
Azure Pipelines successfully started running 1 pipeline(s). |
/azpw run Azure.sonic-swss |
/AzurePipelines run Azure.sonic-swss |
Azure Pipelines successfully started running 1 pipeline(s). |
/azpw run Azure.sonic-swss |
/AzurePipelines run Azure.sonic-swss |
Azure Pipelines successfully started running 1 pipeline(s). |
@prsunny @sumanbrcm I agree with @Junchao-Mellanox . The cleanup of COUNTERS RIF tables should be done in syncd while cleanup of the mapping should still be done in orchagent because the orchagent is the one who wrote it to. We should follow the rule that the entity that writes to a table is the only one responsible for this table cleanup. In this case it is syncd. Besides of this fix beeing incomplete due to a race condition as pointed out by @Junchao-Mellanox it introduced errors in the logs that are caught by log analyzer during our regression:
@sumanbrcm Could you please handle this issue? |
To handle this , whenever RIF is deleted (removeRouterIntfs) , we ensure that these COUNTER_DB tables are deleted(handled in cleanUpRifFromCounterDb) .
@stepanblyschak If the cleanup of mapping is done in OA but cleanup of RIF counters is done in syncd then we can still land into similar issue which was originally reported. Lets say - |
The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue #2193 the cleanup for stale rif counters was added. But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them. So as a fix all such cleanup has been moved to syncd. - What I did Fix for sonic-net/sonic-buildimage#11621 As a past fix which aimed at removing stale rif counters (#2199) , there is a chance of race condition and it leads to lua script reporting error. To handle this , the rif counters cleanup code(handled in cleanUpRifFromCounterDb) is now called from syncd ( removeCounter ) to avoid such race condition. - Why I did it The operations in Orchagent and syncd is not synchronous, so while Orchagent deletes the rif counters from Counters Db, the syncd could still access it. In race conditions the lua script trying to fetch rif counters will have errors syslog for such access as it was already deleted by orchagent. The cleanup code is removed from orchagent is added in syncd - it will make sure no such race condition would get hit. - How I verified it Followed the steps in (sonic-net/sonic-buildimage#11621) : Create RIF in SONiC, wait till RIF rates are populated in COUNTERS DB Remove RIF Repeat the steps multiple times and check if any error syslog is seen (No error syslog is seen) Also checked cleanup for rif counters. After RIF creation derived info of oid for RIF from "COUNTERS_RIF_NAME_MAP" 127) "Vlan100" 128) "oid:0x6000000000aa5" Checked all the tabled in COUNTER_DB which has same OID in keys 127.0.0.1:6379[2]> keys 6000000000aa5 1) "RATES:oid:0x6000000000aa5:RIF" 2) "COUNTERS:oid:0x6000000000aa5" 3) "RATES:oid:0x6000000000aa5" 127.0.0.1:6379[2]> Deleted the RIF by removing the ip on the intf. Checked COUNTER_DB again with same OID if there are stale entries or not. No stale entries exist now. 127.0.0.1:6379[2]> keys 6000000000aa5 (empty array) 127.0.0.1:6379[2]> Signed-off-by: Suman Kumar <[email protected]>
The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue #2193 the cleanup for stale rif counters was added. But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them. So as a fix all such cleanup has been moved to syncd. - What I did Fix for sonic-net/sonic-buildimage#11621 As a past fix which aimed at removing stale rif counters (#2199) , there is a chance of race condition and it leads to lua script reporting error. To handle this , the rif counters cleanup code(handled in cleanUpRifFromCounterDb) is now called from syncd ( removeCounter ) to avoid such race condition. - Why I did it The operations in Orchagent and syncd is not synchronous, so while Orchagent deletes the rif counters from Counters Db, the syncd could still access it. In race conditions the lua script trying to fetch rif counters will have errors syslog for such access as it was already deleted by orchagent. The cleanup code is removed from orchagent is added in syncd - it will make sure no such race condition would get hit. - How I verified it Followed the steps in (sonic-net/sonic-buildimage#11621) : Create RIF in SONiC, wait till RIF rates are populated in COUNTERS DB Remove RIF Repeat the steps multiple times and check if any error syslog is seen (No error syslog is seen) Also checked cleanup for rif counters. After RIF creation derived info of oid for RIF from "COUNTERS_RIF_NAME_MAP" 127) "Vlan100" 128) "oid:0x6000000000aa5" Checked all the tabled in COUNTER_DB which has same OID in keys 127.0.0.1:6379[2]> keys 6000000000aa5 1) "RATES:oid:0x6000000000aa5:RIF" 2) "COUNTERS:oid:0x6000000000aa5" 3) "RATES:oid:0x6000000000aa5" 127.0.0.1:6379[2]> Deleted the RIF by removing the ip on the intf. Checked COUNTER_DB again with same OID if there are stale entries or not. No stale entries exist now. 127.0.0.1:6379[2]> keys 6000000000aa5 (empty array) 127.0.0.1:6379[2]> Signed-off-by: Suman Kumar <[email protected]>
What I did
Fix for https://github.com/Azure/sonic-swss/issues/2193
To handle this , whenever RIF is deleted (removeRouterIntfs) , we ensure that these COUNTER_DB tables are deleted(handled in cleanUpRifFromCounterDb) .
Why I did it
In this issue , stale COUNTER_DB table/entries were present even after RIF was deleted.
So cleaning up of such stale tables are needed.
In the absence of this , if same oid is allocated to a different RIF , we may show stale stats.
Also the COUNTER_DB will keep holding un-necessary tables and this number/used-memory keep growing if there RIFs are deleted .
How I verified it
Before this fix , here is the snapshot of the stale COUNTER_DB tables.
a. Before fix , here the steps where problem was seen
b. After this fix, here are the verification steps
Check syslog too to confirm the changes to handle this issue is invoked
Jul 14 19:37:37.233767 sonic NOTICE swss#orchagent: :- cleanUpRifFromCounterDb: CleanUp interface Vlan100 oid oid:0x6000000000ba6 from counter db
Details if related