chip-tool is still racy on Darwin #7557
Here is my analysis of the state of threading and dispatch queues on Darwin, especially in light of the changes that were made recently in #7405. On the whole, the Darwin implementation deviates from the API contracts stipulated in PlatformManager.h, exposing a number of race conditions in applications like chip-tool that run their application/main work on a separate dispatch queue from the queue that runs parts of CHIP. Namely:
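The contract being referenced is that callers external to the CHIP event loop must bracket stack access with LockChipStack()/UnlockChipStack(). A minimal, hypothetical sketch (the class names below are stand-ins for illustration, not the actual SDK types) of how an application can honour that contract with an RAII guard so the matching unlock can't be forgotten:

```cpp
#include <mutex>

// Hypothetical stand-in for the PlatformManager lock contract: every call
// into the CHIP stack from an application thread must be bracketed by
// LockChipStack()/UnlockChipStack().
class PlatformManagerModel
{
public:
    void LockChipStack() { mChipStackMutex.lock(); }
    void UnlockChipStack() { mChipStackMutex.unlock(); }

private:
    std::mutex mChipStackMutex;
};

// RAII helper so application code cannot forget the matching unlock,
// even on early return or exception.
class StackLock
{
public:
    explicit StackLock(PlatformManagerModel & mgr) : mMgr(mgr) { mMgr.LockChipStack(); }
    ~StackLock() { mMgr.UnlockChipStack(); }

    StackLock(const StackLock &)             = delete;
    StackLock & operator=(const StackLock &) = delete;

private:
    PlatformManagerModel & mMgr;
};
```

With this shape, any application-side call into the stack becomes `StackLock lock(mgr); /* call CHIP APIs */` and the lock is released when the scope exits.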
A lot of the API contracts are not well documented in PlatformMgr.h, which I think causes problems like these. I'll put up a PR to provide clarifications there to ensure we're all on the same page.
Most of the above were confirmed with tsan, which reported a number of race conditions in those areas. I've since made a bunch of fixes to the Darwin platform implementation that keep the spirit of what's being achieved there, while making it compliant with the updated API contracts I've defined in #7478. I was able to repro the segfault in #7574 quite reliably with the test runner, and with the above fixes, I was able to run up to 1000 iterations crash-free, as well as tsan-error-free. While this doesn't mean it's completely race free, it's still a positive leg up! Will put it up soon...
FYI @msandstedt |
If this is the API contract (which was pretty unclear ;)), this can be replicated by calling
At least it has let us find this issue ;) That said, we could call something like
One of the goals of dispatch queues is to avoid locking by serialising access to the stack instead of accessing it concurrently. If we do need locking with dispatch queues, something is wrong somewhere. One thing that is effectively missing in This is the part where I would expect a nicer API that does not force us to wrap the call to the CHIP stack into
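The serialisation guarantee being described can be modelled portably. The sketch below is not libdispatch (on Darwin this role is played by `dispatch_async` onto a serial queue); it is an illustrative analogue built on a single worker thread, showing why work items submitted to a serial queue never need locks against each other:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Portable model of a serial dispatch queue: all submitted work runs one
// item at a time on a single worker thread, so the work items themselves
// never race with each other. Illustration only, not the Darwin code.
class SerialQueue
{
public:
    SerialQueue() : mWorker([this] { Run(); }) {}

    ~SerialQueue()
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mStopped = true;
        }
        mCondition.notify_one();
        mWorker.join(); // drains any remaining work before returning
    }

    // Analogue of dispatch_async: enqueue work and return immediately.
    void Async(std::function<void()> work)
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mWork.push(std::move(work));
        }
        mCondition.notify_one();
    }

private:
    void Run()
    {
        for (;;)
        {
            std::function<void()> work;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mCondition.wait(lock, [this] { return mStopped || !mWork.empty(); });
                if (mWork.empty())
                    return; // stopped and fully drained
                work = std::move(mWork.front());
                mWork.pop();
            }
            work(); // runs serially: no other work item executes concurrently
        }
    }

    std::mutex mMutex;
    std::condition_variable mCondition;
    std::queue<std::function<void()>> mWork;
    bool mStopped = false;
    std::thread mWorker;
};
```

If all CHIP stack access is funnelled through one such queue, the state touched by that work needs no mutex, which is the locking-free design the comment above is arguing for.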
There are two issues that I can think of with the way
The root of the issue at the moment is that Following your API, I would expect a call to
Pretty sure we can honour the API contract without resorting to
External synchronization doesn't require locks. External synchronization can be achieved by serializing all stack API calls and callbacks to a single context, thereby ensuring no locks are needed. LockChipStack/UnlockChipStack are helpers which can be no-ops if you can guarantee that your code never calls CHIP stack code from multiple thread contexts.

However, the question is: is there any global state being mutated across dispatch contexts? If the answer is yes, then the dispatch queues are used in a racy way. The usual way for this method to work is to ensure that each dispatch queue only touches state that belongs to the work submitted to it, and a specific messaging design is needed if communication/state is to be exchanged/shared between dispatch queue contexts.

If global state is mutated by work within several dispatch queues, then the implementation detail of "where is this dispatch queue running" makes each dispatch queue effectively a thread. In those cases, dispatch queues don't magically make the code safe, and the shared state has to be protected.
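The "no-op lock" variant described above can still be made checkable: instead of taking a mutex, the helpers can assert that the caller really is on the one designated context. A hypothetical sketch (these names are illustrative, not the actual Darwin implementation):

```cpp
#include <cassert>
#include <thread>

// Hypothetical sketch: if the application guarantees that every CHIP stack
// call is serialized onto a single context, LockChipStack/UnlockChipStack
// can become no-ops that merely verify the guarantee instead of locking.
class SerializedStackGuard
{
public:
    // Call once from the context that will own all stack interactions.
    void BindToCurrentThread() { mOwner = std::this_thread::get_id(); }

    bool IsBoundToCurrentThread() const { return std::this_thread::get_id() == mOwner; }

    // No-op "lock": only checks that the caller honoured the contract.
    void LockChipStack() { assert(IsBoundToCurrentThread()); }
    void UnlockChipStack() { assert(IsBoundToCurrentThread()); }

private:
    std::thread::id mOwner;
};
```

This keeps the external-synchronization contract intact while paying nothing at runtime in release builds, and loudly catching contract violations in debug builds.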
The way things are used here for v1 is safe, moving this out.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
I believe the virtual device build is probably something separate, not related to chip-tool. It looks like it tries to access LevelControl::minLevel
Problem
#7478 attempts to fix data races detected by tsan in chip-tool; however, the bulk of the work there was validated on POSIX-based platforms.
On Darwin, which uses dispatch queues, the logic changes in chip-tool don't make any meaningful dent in the races that still exist there.
Proposed Solution
Not sure, need to figure something out!