Fix ExchangeContext leak in chip_im_initiator (Fix: #6915) #6918

kghost · 2021-05-18T06:28:36Z

todo · 2021-05-18T06:45:54Z

(#6919): This doesn't compile for Android, because Android has no device layer.

connectedhomeip/src/messaging/ExchangeMgr.cpp

Lines 102 to 108 in 544887e

    
    // TODO(#6919): This doesn't compile for Android, because Android has no device layer.
    #if CONFIG_DEVICE_LAYER
        DeviceLayer::PlatformMgr().LockChipStack();
    #endif
        mReliableMessageMgr.Shutdown();
        mContextPool.ForEachActiveObject([](auto * ec) {

This comment was generated by todo based on a `TODO` comment in `544887e` in #6918. cc @kghost.

Damian-Nordic · 2021-05-18T09:12:46Z

src/messaging/ExchangeMgr.cpp

@@ -43,6 +43,11 @@
 #include <support/RandUtils.h>
 #include <support/logging/CHIPLogging.h>

+#if CONFIG_DEVICE_LAYER
+#include <platform/CHIPDeviceLayer.h>
+#include <platform/PlatformManager.h>


nit: It's already included by CHIPDeviceLayer.h

tcarmelveilleux · 2021-05-18T12:14:16Z

src/messaging/ExchangeMgr.cpp

@@ -94,6 +99,10 @@ CHIP_ERROR ExchangeManager::Init(SecureSessionMgr * sessionMgr)

 CHIP_ERROR ExchangeManager::Shutdown()
 {
+// TODO(#6919): This doesn't compile for Android, because Android has no device layer.
+#if CONFIG_DEVICE_LAYER
+    DeviceLayer::PlatformMgr().LockChipStack();


Wouldn't the real solution be to have this called in the main event loop task of the stack? Seems like the stack shutting itself down should be done from within the stack task itself, rather than being called from outside with a lock.

The stack lock was introduced to protect CHIP event processing. Taking the CHIP lock here only prevents CHIP from operating; the problem is that the exchange is also released from the client thread.

How does this change prevent the leak?

For IM this may be OK, since we just moved the logic that closes the exchange into OnMessageReceived. But if the exchange can also be closed from the client thread, how does this locking help?

todo · 2021-05-18T13:36:24Z

(#6931): Lock guard is a temporary solution. Proper solution should be post chip shutdown function into chip thread

connectedhomeip/src/messaging/ExchangeMgr.cpp

Lines 101 to 107 in e3d500b

    
    // TODO(#6931): Lock guard is a temporary solution. Proper solution should be post chip shutdown function into chip thread
    #if CONFIG_DEVICE_LAYER
        DeviceLayer::PlatformMgr().LockChipStack();
    #endif
        mReliableMessageMgr.Shutdown();
        mContextPool.ForEachActiveObject([](auto * ec) {

This comment was generated by todo based on a `TODO` comment in `e3d500b` in #6918. cc @kghost.

bzbarsky-apple

We really need to stop this whack-a-mole, take a few days, decide what our threading model is, document it, add asserts, and make sure our sample apps follow it. It will be less time spent than what we are doing right now. As a strawman:

Our threading model is that all access to CHIP API must be performed while holding the stack lock. The CHIP event loop takes this lock around each event it processes.
Switch stack_log_tracking to fatal for linux and darwin. That would presumably catch issues like the one this PR is trying to fix without needing races to fail.
Add more assertChipStackLockedByCurrentThread as desired.
For all the failing places, decide whether to take the lock (and whether that should happen outside the CHIP API surface or immediately inside) or whether they should get posted to the CHIP event loop.
Set stack_log_tracking to fatal for at least one embedded platform, so we can look for bugs on that side. We can start with esp32; it's got plenty of space afaict.

Obviously we can do this over multiple PRs (e.g. a few PRs to fix issues on Linux, then change Linux to fatal, etc).
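To make the first strawman concrete, here is a minimal sketch of a lock that remembers its owner, plus a check that API entry points could call. `TrackedStackLock`, `LockTracking`, and `CheckStackLocked` are hypothetical names for illustration only; this is not the actual CHIP lock-tracking code, and the three modes are an assumption modeled on the "fatal" setting mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the stack lock; not the real PlatformMgr() API.
class TrackedStackLock
{
public:
    void Lock()
    {
        mMutex.lock();
        mOwner.store(std::this_thread::get_id());
    }

    void Unlock()
    {
        assert(IsLockedByCurrentThread()); // only the owner may unlock
        mOwner.store(std::thread::id());
        mMutex.unlock();
    }

    // True only when the calling thread currently holds the lock.
    bool IsLockedByCurrentThread() const { return mOwner.load() == std::this_thread::get_id(); }

private:
    std::mutex mMutex;
    std::atomic<std::thread::id> mOwner{ std::thread::id() };
};

// Assumed tracking modes: ignore, log only, or abort on violation.
enum class LockTracking { kNone, kLog, kFatal };

// Returns true when the call is allowed to proceed without a violation.
bool CheckStackLocked(const TrackedStackLock & lock, LockTracking mode)
{
    if (mode == LockTracking::kNone || lock.IsLockedByCurrentThread())
    {
        return true;
    }
    std::fprintf(stderr, "stack API called without holding the stack lock\n");
    if (mode == LockTracking::kFatal)
    {
        std::abort();
    }
    return false;
}
```

With the mode set to fatal, a call made without the lock aborts on the first occurrence instead of waiting for a race to actually corrupt something.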

Or a second strawman:

Our threading model is that all access to CHIP API must be performed via events posted to the CHIP event loop.
Modify the lock tracking asserts to assert that we are on the CHIP event loop thread.
Make the asserts fatal on linux and darwin. That would presumably catch issues like the one this PR is trying to fix without needing races to fail.
Add more asserts as desired.
For all the failing places, post to the CHIP event loop.
Turn the asserts on for esp32.
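For the second strawman, the assert would compare the calling thread against the thread that runs the event loop. A minimal sketch, assuming the loop records its own thread id when it starts; `EventLoopThreadTracker` and `IncrementOnLoopThread` are hypothetical names, not CHIP APIs.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical tracker that remembers which thread runs the event loop.
class EventLoopThreadTracker
{
public:
    // Called by the event loop when it starts running on the current thread.
    void MarkCurrentThreadAsEventLoop() { mLoopThread.store(std::this_thread::get_id()); }

    // Called when the event loop exits.
    void Clear() { mLoopThread.store(std::thread::id()); }

    bool IsOnEventLoopThread() const { return mLoopThread.load() == std::this_thread::get_id(); }

private:
    std::atomic<std::thread::id> mLoopThread{ std::thread::id() };
};

// Hypothetical API method: it refuses to run anywhere but on the loop thread.
inline void IncrementOnLoopThread(const EventLoopThreadTracker & tracker, int & state)
{
    assert(tracker.IsOnEventLoopThread());
    ++state;
}
```

Callers on any other thread would have to post their work to the loop instead of calling in directly.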

@msandstedt @andy31415

github-actions · 2021-05-18T14:51:51Z

Size increase report for "nrfconnect-example-build" from 403b5ff

File	Section	File size	VM size
chip-shell.elf	text	24	24
chip-shell.elf	device_handles	-8	-8

Full report output

BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-lighting.elf and ./pull_artifact/chip-lighting.elf:

sections,vmsize,filesize
.debug_info,0,26555
.debug_line,0,1173
.debug_abbrev,0,256
.debug_ranges,0,48
.debug_loc,0,16

Comparing ./master_artifact/chip-shell.elf and ./pull_artifact/chip-shell.elf:

sections,vmsize,filesize
.debug_info,0,7921
.debug_line,0,813
.debug_abbrev,0,162
.strtab,0,114
.debug_ranges,0,48
.symtab,0,48
text,24,24
.debug_loc,0,16
.shstrtab,0,2
device_handles,-8,-8

Comparing ./master_artifact/chip-lock.elf and ./pull_artifact/chip-lock.elf:

sections,vmsize,filesize
.debug_info,0,7921
.debug_line,0,797
.debug_abbrev,0,162
.debug_ranges,0,48
.debug_loc,0,16

woody-apple

Holding based on feedback from @bzbarsky-apple

msandstedt · 2021-05-18T15:47:56Z

We really need to stop this whack-a-mole, take a few days, decide what our threading model is, document it, add asserts, and make sure our sample apps follow it. It will be less time spent than what we are doing right now. As a strawman:

Our threading model is that all access to CHIP API must be performed while holding the stack lock. The CHIP event loop takes this lock around each event it processes.

Or a second strawman:

Our threading model is that all access to CHIP API must be performed via events posted to the CHIP event loop.

I strongly advocate for the first solution. The second solution means cross-thread callers must queue all interactions to/from the stack. Obviously, any delegate callbacks executed by chip need a queue back to the host context if it is executing another event loop. However, many calls don't need this. Imposing a no-cross-thread-api-call-rule across the board is onerous.

Also, a truly literal reading of strawman 2 presents an insurmountable difficulty. Currently, the POSIX platform instantiates a pthread for the event loop. Object creation and platform manager init happen in an originating thread, and stack events execute in the created pthread. A literal reading of strawman 2 precludes any type of orderly shutdown: if you're not allowed to call any APIs from thread 1 after the pthread starts, how can you ever shut down the stack? I know that the answer is simply that the originating thread can make some calls; it must at least be able to Shutdown (pthread_join) platform manager.

But my point is that it is impossible for us to impose an across-the-board rule that no cross-thread calls are ever allowed; some must be. So let's just make all of them available with a lock.
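A minimal sketch of what "available with a lock" could look like: the event loop thread takes the lock around each event, and the originating thread makes direct calls under the same lock, then joins. `FakeStack` and `RunLockedCalls` are hypothetical names, and `std::thread` stands in for the pthread mentioned above.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the stack: one lock guarding one piece of state.
struct FakeStack
{
    std::mutex lock;
    int handled = 0;

    // "API" call: the caller must already hold `lock`.
    void HandleOne() { ++handled; }
};

// Runs an event loop thread that takes the lock around each event while the
// calling thread makes direct API calls under the same lock.
int RunLockedCalls(int loopEvents, int clientCalls)
{
    assert(loopEvents >= 0 && clientCalls >= 0);
    FakeStack stack;

    std::thread loop([&stack, loopEvents] {
        for (int i = 0; i < loopEvents; ++i)
        {
            std::lock_guard<std::mutex> guard(stack.lock);
            stack.HandleOne();
        }
    });

    for (int i = 0; i < clientCalls; ++i)
    {
        std::lock_guard<std::mutex> guard(stack.lock);
        stack.HandleOne();
    }

    loop.join(); // the originating thread still has to join the loop thread
    return stack.handled;
}
```

No update is lost because every access to the shared state happens under the same lock, whichever thread makes it.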

tcarmelveilleux · 2021-05-18T16:36:05Z

We really need to stop this whack-a-mole, take a few days, decide what our threading model is, document it, add asserts, and make sure our sample apps follow it. It will be less time spent than what we are doing right now. As a strawman:

Our threading model is that all access to CHIP API must be performed while holding the stack lock. The CHIP event loop takes this lock around each event it processes.

Or a second strawman:

Our threading model is that all access to CHIP API must be performed via events posted to the CHIP event loop.

I strongly advocate for the first solution. The second solution means cross-thread callers must queue all interactions to/from the stack. Obviously, any delegate callbacks executed by chip need a queue back to the host context if it is executing another event loop. However, many calls don't need this. Imposing a no-cross-thread-api-call-rule across the board is onerous.

Also, a truly literal reading of strawman 2 presents an insurmountable difficulty. Currently, the POSIX platform instantiates a pthread for the event loop. Object creation and platform manager init happen in an originating thread, and stack events execute in the created pthread. A literal reading of strawman 2 precludes any type of orderly shutdown: if you're not allowed to call any APIs from thread 1 after the pthread starts, how can you ever shut down the stack? I know that the answer is simply that the originating thread can make some calls; it must at least be able to Shutdown (pthread_join) platform manager.

I agree that "everything funneled to the one thread" is onerous. At the same time, blindly locking and not being very careful about lock usage can also cause its own set of problems. I think the solution is a mix of the main loop and locks.

Overall, I would say that "orderly shutdown" of the stack is definitely one of those cases where, instead of taking a lock, the caller of shutdown should just join on the main event loop task, while the shutdown posts an event that causes the whole protocol stack to start its own internally serialized shutdown.
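The post-then-join shape described above can be sketched roughly as follows. `MiniEventLoop` is a hypothetical single-threaded loop written for this illustration; it is not the CHIP platform manager, and the real shutdown work would be whatever the stack needs to tear down.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-threaded event loop.
// ShutdownAndJoin() must be called before the object is destroyed.
class MiniEventLoop
{
public:
    void Start() { mThread = std::thread([this] { Run(); }); }

    // Safe to call from any thread: only the queue is touched here.
    void Post(std::function<void()> work)
    {
        {
            std::lock_guard<std::mutex> guard(mQueueLock);
            mQueue.push_back(std::move(work));
        }
        mWake.notify_one();
    }

    // Posts the shutdown work, then waits for the loop thread to finish it.
    void ShutdownAndJoin(std::function<void()> shutdownWork)
    {
        assert(std::this_thread::get_id() != mThread.get_id()); // joining ourselves would deadlock
        Post([this, shutdownWork] {
            shutdownWork(); // runs on the loop thread, serialized with other events
            mStop = true;
        });
        mThread.join();
    }

private:
    void Run()
    {
        while (!mStop)
        {
            std::function<void()> work;
            {
                std::unique_lock<std::mutex> guard(mQueueLock);
                mWake.wait(guard, [this] { return !mQueue.empty(); });
                work = std::move(mQueue.front());
                mQueue.pop_front();
            }
            work(); // executed without holding the queue lock
        }
    }

    std::mutex mQueueLock;
    std::condition_variable mWake;
    std::deque<std::function<void()>> mQueue;
    std::thread mThread;
    bool mStop = false; // only read and written on the loop thread
};
```

Because the queue is first-in, first-out, events posted before the shutdown request run before it, and the caller needs no stack lock at all; it only blocks in `join()`.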

bzbarsky-apple

I'm OK with doing this for now but would like us to make the real fix our highest priority instead of continuing to pile on other things on top of our current quicksand...

I'm OK with this change as a stopgap.

yufengwangca · 2021-05-18T18:01:39Z

src/messaging/ExchangeMgr.cpp

@@ -94,6 +98,10 @@ CHIP_ERROR ExchangeManager::Init(SecureSessionMgr * sessionMgr)

 CHIP_ERROR ExchangeManager::Shutdown()
 {
+// TODO(#6931): Lock guard is a temporary solution. Proper solution should be post chip shutdown function into chip thread
+#if CONFIG_DEVICE_LAYER
+    DeviceLayer::PlatformMgr().LockChipStack();


I would suggest moving DeviceLayer::PlatformMgr().LockChipStack() to the place where shutdown is called from the client thread, instead of within ExchangeManager::Shutdown() itself.

You are trying to fix the case in which Shutdown() is called from a client thread. In the cases where Shutdown() is called from the CHIP main thread event loop (which it should be), it will crash, since the chip lock is not recursive.

yufengwangca · 2021-05-18T18:02:20Z

src/messaging/ExchangeMgr.cpp

@@ -109,6 +117,9 @@ CHIP_ERROR ExchangeManager::Shutdown()
    }

    mState = State::kState_NotInitialized;
+#if CONFIG_DEVICE_LAYER
+    DeviceLayer::PlatformMgr().UnlockChipStack();


Same as above: move DeviceLayer::PlatformMgr().UnlockChipStack(); to the place where Shutdown is called from the client thread.
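A rough sketch of moving both calls to the call site, using a scoped guard so the unlock cannot be skipped on an early return. `StackLock`, `StackLockGuard`, `FakeExchangeManager`, and `ShutdownFromClientThread` are hypothetical names; the lock here wraps `std::mutex`, for which relocking from the owning thread is undefined behavior (commonly a deadlock), which is why `Shutdown()` only checks the lock instead of taking it.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical non-recursive stack lock that remembers its owner.
class StackLock
{
public:
    void Lock()
    {
        mMutex.lock();
        mOwner.store(std::this_thread::get_id());
    }

    void Unlock()
    {
        mOwner.store(std::thread::id());
        mMutex.unlock();
    }

    bool HeldByCurrentThread() const { return mOwner.load() == std::this_thread::get_id(); }

private:
    std::mutex mMutex;
    std::atomic<std::thread::id> mOwner{ std::thread::id() };
};

// Scoped guard: locks in the constructor, unlocks in the destructor.
class StackLockGuard
{
public:
    explicit StackLockGuard(StackLock & lock) : mLock(lock) { mLock.Lock(); }
    ~StackLockGuard() { mLock.Unlock(); }
    StackLockGuard(const StackLockGuard &) = delete;
    StackLockGuard & operator=(const StackLockGuard &) = delete;

private:
    StackLock & mLock;
};

// Hypothetical manager: Shutdown() takes no lock itself, it only checks.
class FakeExchangeManager
{
public:
    explicit FakeExchangeManager(StackLock & lock) : mStackLock(lock) {}

    void Shutdown()
    {
        assert(mStackLock.HeldByCurrentThread());
        mInitialized = false;
    }

    bool IsInitialized() const { return mInitialized; }
    StackLock & GetStackLock() { return mStackLock; }

private:
    StackLock & mStackLock;
    bool mInitialized = true;
};

// What a client thread would do: take the lock at the call site.
void ShutdownFromClientThread(FakeExchangeManager & mgr)
{
    StackLockGuard guard(mgr.GetStackLock());
    mgr.Shutdown();
}
```

A caller already running inside the event loop, where the lock is held around each event, would call `Shutdown()` directly and never attempt a second lock.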

woody-apple

Per the updated template, can you update the PR text?

#### Problem
What is being fixed?  Examples:
* Fix crash on startup
* Fixes #12345 Frobnozzle is leaky (exactly like that, so GitHub will auto-close the issue).

#### Change overview
What's in this PR

#### Testing
How was this tested? (at least one bullet point required)
    • If unit tests were added, how do they cover this issue?
    • If unit tests existed, how were they fixed/modified to prevent this in future?
    • If integration tests were added, how do they verify this change?
    • If manually tested, what platforms controller and device platforms were manually tested, and how?
    • If no testing is required, why not?

stale · 2021-05-27T20:35:48Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale · 2021-06-03T22:53:55Z

This stale pull request has been automatically closed. Thank you for your contributions.

pullapprove bot requested review from andy31415, bzbarsky-apple, chrisdecenzo, Damian-Nordic, hawk248, jepenven-silabs and msandstedt May 18, 2021 06:31

pullapprove bot added the review - pending label May 18, 2021

kghost force-pushed the issue-6856 branch from 84d8b1e to 544887e Compare May 18, 2021 06:45

kghost mentioned this pull request May 18, 2021

Revert "Use RAII for ExchangeContext contruction and destruction instead of Alloc/Free." #6909

Closed

Damian-Nordic approved these changes May 18, 2021

tcarmelveilleux reviewed May 18, 2021

kghost mentioned this pull request May 18, 2021

Post chip shutdown function into chip thread #6931

Closed

Fix ExchangeContext leak in chip_im_initiator (Fix: project-chip#6915)

e3d500b

kghost force-pushed the issue-6856 branch from 544887e to e3d500b Compare May 18, 2021 13:36

tcarmelveilleux approved these changes May 18, 2021

bzbarsky-apple reviewed May 18, 2021

woody-apple previously requested changes May 18, 2021

bzbarsky-apple approved these changes May 18, 2021

yufengwangca requested changes May 18, 2021

woody-apple requested changes May 20, 2021

kghost mentioned this pull request May 26, 2021

Use ExchangeHandle to track ref count of ExchangeContext #7125

Closed

stale bot added the stale Stale issue or PR label May 27, 2021

stale bot closed this Jun 3, 2021
