Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

swss: Fixing race condition for rif counters #2488

Conversation

sumanbrcm
Copy link
Contributor

@sumanbrcm sumanbrcm commented Oct 13, 2022

The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue #2193 the cleanup for stale rif counters was added.
But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them.
So as a fix all such cleanup has been moved to syncd.

What I did
Fix for sonic-net/sonic-buildimage#11621

As a past fix which aimed at removing stale rif counters (#2199) , there is a chance of race condition and it leads to lua script reporting error.
To handle this , the rif counters cleanup code(handled in cleanUpRifFromCounterDb) is now called from syncd ( removeCounter ) to avoid such race condition.

Why I did it

The operations in Orchagent and syncd is not synchronous, so while Orchagent deletes the rif counters from Counters Db , the syncd could still access it. In race conditions the lua script trying to fetch rif counters will have errors syslog for such access as it was already deleted by orchagent. The cleanup code is removed from orchagent is added in syncd - it will make sure no such race condition would get hit.

How I verified it

Followed the steps in (sonic-net/sonic-buildimage#11621) :

  1. Create RIF in SONiC, wait till RIF rates are populated in COUNTERS DB
  2. Remove RIF
  3. Repeat the steps multiple times and check if any error syslog is seen (No error syslog is seen)

Also checked cleanup for rif counters.

  1. After RIF creation derived info of oid for RIF from "COUNTERS_RIF_NAME_MAP"
    127) "Vlan100"
    128) "oid:0x6000000000aa5"

  2. Checked all the tabled in COUNTER_DB which has same OID in keys
    127.0.0.1:6379[2]> keys 6000000000aa5
    1) "RATES:oid:0x6000000000aa5:RIF"
    2) "COUNTERS:oid:0x6000000000aa5"
    3) "RATES:oid:0x6000000000aa5"
    127.0.0.1:6379[2]>

  3. Deleted the RIF by removing the ip on the intf.

  4. Checked COUNTER_DB again with same OID if there are stale entries or not. No stale entries exist now.
    127.0.0.1:6379[2]> keys 6000000000aa5
    (empty array)
    127.0.0.1:6379[2]>

Details if related

* Fixing issue #11621
* The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue sonic-net#2193 the cleanup for stale rif counters was added.
* But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them.
* So as a fix all such cleanup has been moved to syncd.

Signed-off-by: Suman Kumar <[email protected]>
@sumanbrcm sumanbrcm requested a review from prsunny as a code owner October 13, 2022 07:39
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 13, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: sumanbrcm / name: Suman Kumar (c116d1e)

@sumanbrcm
Copy link
Contributor Author

@Junchao-Mellanox , @kcudnik
For reviewing.
This is related to ( sonic-net/sonic-sairedis#1136 )
This specific pull request is for removal of cleanup code from OA.

@sumanbrcm
Copy link
Contributor Author

/easycla

@sumanbrcm
Copy link
Contributor Author

Added @prsunny and @liat-grozovik for merge/related-approval.

@liat-grozovik liat-grozovik requested a review from kcudnik October 28, 2022 10:52
@sumanbrcm
Copy link
Contributor Author

@prsunny @liat-grozovik @kcudnik
This is request for merge/related-approval.

@liat-grozovik
Copy link
Collaborator

@Junchao-Mellanox @sumanbrcm should we also tagged it for 202205?

@liat-grozovik liat-grozovik merged commit ab0e474 into sonic-net:master Nov 8, 2022
@Junchao-Mellanox
Copy link
Collaborator

Junchao-Mellanox commented Nov 9, 2022

@Junchao-Mellanox @sumanbrcm should we also tagged it for 202205?

yes. I added tag

yxieca pushed a commit that referenced this pull request Nov 10, 2022
The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue #2193 the cleanup for stale rif counters was added.
But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them.
So as a fix all such cleanup has been moved to syncd.

- What I did
Fix for sonic-net/sonic-buildimage#11621

As a past fix which aimed at removing stale rif counters (#2199) , there is a chance of race condition and it leads to lua script reporting error.
To handle this , the rif counters cleanup code(handled in cleanUpRifFromCounterDb) is now called from syncd ( removeCounter ) to avoid such race condition.

- Why I did it

The operations in Orchagent and syncd is not synchronous, so while Orchagent deletes the rif counters from Counters Db, the syncd could still access it. In race conditions the lua script trying to fetch rif counters will have errors syslog for such access as it was already deleted by orchagent. The cleanup code is removed from orchagent is added in syncd - it will make sure no such race condition would get hit.

- How I verified it

Followed the steps in (sonic-net/sonic-buildimage#11621) :

Create RIF in SONiC, wait till RIF rates are populated in COUNTERS DB
Remove RIF
Repeat the steps multiple times and check if any error syslog is seen (No error syslog is seen)
Also checked cleanup for rif counters.

After RIF creation derived info of oid for RIF from "COUNTERS_RIF_NAME_MAP"
127) "Vlan100"
128) "oid:0x6000000000aa5"

Checked all the tabled in COUNTER_DB which has same OID in keys
127.0.0.1:6379[2]> keys 6000000000aa5
1) "RATES:oid:0x6000000000aa5:RIF"
2) "COUNTERS:oid:0x6000000000aa5"
3) "RATES:oid:0x6000000000aa5"
127.0.0.1:6379[2]>

Deleted the RIF by removing the ip on the intf.

Checked COUNTER_DB again with same OID if there are stale entries or not. No stale entries exist now.
127.0.0.1:6379[2]> keys 6000000000aa5
(empty array)
127.0.0.1:6379[2]>

Signed-off-by: Suman Kumar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants