-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ExchangeContext leak in chip_im_initiator (Fix: #6915) #6918
Conversation
(#6919): This doesn't compile for Android, because Android has no device layer.connectedhomeip/src/messaging/ExchangeMgr.cpp Lines 102 to 108 in 544887e
This comment was generated by todo based on a
|
src/messaging/ExchangeMgr.cpp
Outdated
@@ -43,6 +43,11 @@ | |||
#include <support/RandUtils.h> | |||
#include <support/logging/CHIPLogging.h> | |||
|
|||
#if CONFIG_DEVICE_LAYER | |||
#include <platform/CHIPDeviceLayer.h> | |||
#include <platform/PlatformManager.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It's already included by CHIPDeviceLayer.h
@@ -94,6 +99,10 @@ CHIP_ERROR ExchangeManager::Init(SecureSessionMgr * sessionMgr) | |||
|
|||
CHIP_ERROR ExchangeManager::Shutdown() | |||
{ | |||
// TODO(#6919): This doesn't compile for Android, because Android has no device layer. | |||
#if CONFIG_DEVICE_LAYER | |||
DeviceLayer::PlatformMgr().LockChipStack(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't the real solution be to have this called in the main event loop task of the stack? Seems like the stack shutting itself down should be done from within the stack task itself, rather than being called from outside with a lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stack lock was introduced to protect CHIP event processing. Putting a chip lock here can only prevent CHIP from operating, but the problem here is the exchange is also released from the client thread.
How this change can prevent leaking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For IM, maybe it is ok, we just moved the logic to close the exchange in OnMessageReceived, but if the exchange closing also occurs in client thread, how this locking helps?
(#6931): Lock guard is a temporary solution. Proper solution should be post chip shutdown function into chip threadconnectedhomeip/src/messaging/ExchangeMgr.cpp Lines 101 to 107 in e3d500b
This comment was generated by todo based on a
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We really need to stop this whack-a-mole, take a few days, decide what our threading model is, document it, add asserts, and make sure our sample apps follow it. It will be less time spent than what we are doing right now. As a strawman:
- Our threading model is that all access to CHIP API must be performed while holding the stack lock. The CHIP event loop takes this lock around each event it processes.
- Switch
stack_log_tracking
tofatal
for linux and darwin. That would presumably catch issues like the one this PR is trying to fix without needing races to fail. - Add more
assertChipStackLockedByCurrentThread
as desired. - For all the failing places, decide whether to take the lock (and whether that should happen outside the CHIP API surface or immediately inside) or whether they should get posted to the CHIP event loop.
- Set
stack_log_tracking
tofatal
for at least one embedded platform, so we can look for bugs on that side. We can start withesp32
; it's got plenty of space afaict.
Obviously we can do this over multiple PRs (e.g. a few PRs to fix issues on Linux, then change Linux to fatal, etc).
Or a second strawman:
- Our threading model is that all access to CHIP API must be performed via events posted to the CHIP event loop.
- Modify the lock tracking assets to assert that we are on the CHIP event loop thread.
- Make the asserts fatal on linux and darwin. That would presumably catch issues like the one this PR is trying to fix without needing races to fail.
- Add more asserts as desired.
- For all the failing places, post to the CHIP event loop.
- Turn the asserts on for esp32.
Size increase report for "nrfconnect-example-build" from 403b5ff
Full report output
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Holding based on feedback from @bzbarsky-apple
I strongly advocate for the first solution. The second solution means cross-thread callers must queue all interactions to/from the stack. Obviously, any delegate callbacks executed by chip need a queue back to the host context if it is executing another event loop. However, many calls don't need this. Imposing a no-cross-thread-api-call-rule across the board is onerous. Also, a truly literal reading of strawman 2 presents an unsurmountable difficulty. Currently, posix platform instantiates a pthread for the event loop. Object creation and platform manager init are in an originating thread, and stack events execute in the created pthread. A literal reading of strawman 2 precludes any type of orderly shutdown: if you're not allowed to call any APIs from thread 1 after the pthread starts, how can you ever shutdown the stack? I know that the answer is simply that the originating thread can make some calls; it must at least be able to Shutdown (pthread_join) platform manager. But my point is that it is impossible for us to impose an across-the-board rule that no cross-thread calls are ever allowed; some must be. So let's just make all of them available with a lock. |
I agree that "everything funelled to the one thread" is onerous. At the same time, blindly locking and not being very careful about lock usage can also cause its own set of problems. I think the solution is a mix of the main loop and locks. Overall, I would say that "orderly shutdown" of the stack, definitely is one of those cases where instead of taking a lock, the caller of shutdown should just join on the main event loop task, and the shutdown posts an event that causes the whole protocol stack to start its own internally serialized shutdown. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with doing this for now but would like us to make the real fix our highest priority instead of continuing to pile on other things on top of our current quicksand...
I'm OK with this change as a stopgap.
@@ -94,6 +98,10 @@ CHIP_ERROR ExchangeManager::Init(SecureSessionMgr * sessionMgr) | |||
|
|||
CHIP_ERROR ExchangeManager::Shutdown() | |||
{ | |||
// TODO(#6931): Lock guard is a temporary solution. Proper solution should be post chip shutdown function into chip thread | |||
#if CONFIG_DEVICE_LAYER | |||
DeviceLayer::PlatformMgr().LockChipStack(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest moving DeviceLayer::PlatformMgr().LockChipStack() to the place where shutdown is called from the client thread, instead of within ExchangeManager::Shutdown() itself.
You are trying to fix the problem in the case which Shutdown is called from a client thread, for the cases Shutdown() are called from CHIP main thread event loop(Which it should be), it will crash since chip lock is not recursive.
@@ -109,6 +117,9 @@ CHIP_ERROR ExchangeManager::Shutdown() | |||
} | |||
|
|||
mState = State::kState_NotInitialized; | |||
#if CONFIG_DEVICE_LAYER | |||
DeviceLayer::PlatformMgr().UnlockChipStack(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, move DeviceLayer::PlatformMgr().UnlockChipStack(); to the place Shutdown is called from the client thread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the updated template, can you update the PR text?
#### Problem
What is being fixed? Examples:
* Fix crash on startup
* Fixes #12345 12345 Frobnozzle is leaky (exactly like that, so GitHub will auto-close the issue).
#### Change overview
What's in this PR
#### Testing
How was this tested? (at least one bullet point required)
• If unit tests were added, how do they cover this issue?
• If unit tests existed, how were they fixed/modified to prevent this in future?
• If integration tests were added, how do they verify this change?
• If manually tested, what platforms controller and device platforms were manually tested, and how?
• If no testing is required, why not?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
This stale pull request has been automatically closed. Thank you for your contributions. |
Fixes: #6915