Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ExchangeContext leak in chip_im_initiator (Fix: #6915) #6918

Closed
wants to merge 1 commit into from

Conversation

kghost
Copy link
Contributor

@kghost kghost commented May 18, 2021

Fixes: #6915

@todo
Copy link

todo bot commented May 18, 2021

(#6919): This doesn't compile for Android, because Android has no device layer.

// TODO(#6919): This doesn't compile for Android, because Android has no device layer.
#if CONFIG_DEVICE_LAYER
DeviceLayer::PlatformMgr().LockChipStack();
#endif
mReliableMessageMgr.Shutdown();
mContextPool.ForEachActiveObject([](auto * ec) {


This comment was generated by todo based on a TODO comment in 544887e in #6918. cc @kghost.

@@ -43,6 +43,11 @@
#include <support/RandUtils.h>
#include <support/logging/CHIPLogging.h>

#if CONFIG_DEVICE_LAYER
#include <platform/CHIPDeviceLayer.h>
#include <platform/PlatformManager.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It's already included by CHIPDeviceLayer.h

@@ -94,6 +99,10 @@ CHIP_ERROR ExchangeManager::Init(SecureSessionMgr * sessionMgr)

CHIP_ERROR ExchangeManager::Shutdown()
{
// TODO(#6919): This doesn't compile for Android, because Android has no device layer.
#if CONFIG_DEVICE_LAYER
DeviceLayer::PlatformMgr().LockChipStack();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the real solution be to have this called in the main event loop task of the stack? Seems like the stack shutting itself down should be done from within the stack task itself, rather than being called from outside with a lock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stack lock was introduced to protect CHIP event processing. Putting a chip lock here can only prevent CHIP from operating, but the problem here is the exchange is also released from the client thread.

How this change can prevent leaking?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For IM, maybe it is ok, we just moved the logic to close the exchange in OnMessageReceived, but if the exchange closing also occurs in client thread, how this locking helps?

@todo
Copy link

todo bot commented May 18, 2021

(#6931): Lock guard is a temporary solution. Proper solution should be post chip shutdown function into chip thread

// TODO(#6931): Lock guard is a temporary solution. Proper solution should be post chip shutdown function into chip thread
#if CONFIG_DEVICE_LAYER
DeviceLayer::PlatformMgr().LockChipStack();
#endif
mReliableMessageMgr.Shutdown();
mContextPool.ForEachActiveObject([](auto * ec) {


This comment was generated by todo based on a TODO comment in e3d500b in #6918. cc @kghost.

Copy link
Contributor

@bzbarsky-apple bzbarsky-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really need to stop this whack-a-mole, take a few days, decide what our threading model is, document it, add asserts, and make sure our sample apps follow it. It will be less time spent than what we are doing right now. As a strawman:

  1. Our threading model is that all access to CHIP API must be performed while holding the stack lock. The CHIP event loop takes this lock around each event it processes.
  2. Switch stack_log_tracking to fatal for linux and darwin. That would presumably catch issues like the one this PR is trying to fix without needing races to fail.
  3. Add moreassertChipStackLockedByCurrentThread as desired.
  4. For all the failing places, decide whether to take the lock (and whether that should happen outside the CHIP API surface or immediately inside) or whether they should get posted to the CHIP event loop.
  5. Set stack_log_tracking to fatal for at least one embedded platform, so we can look for bugs on that side. We can start with esp32; it's got plenty of space afaict.

Obviously we can do this over multiple PRs (e.g. a few PRs to fix issues on Linux, then change Linux to fatal, etc).

Or a second strawman:

  1. Our threading model is that all access to CHIP API must be performed via events posted to the CHIP event loop.
  2. Modify the lock tracking assets to assert that we are on the CHIP event loop thread.
  3. Make the asserts fatal on linux and darwin. That would presumably catch issues like the one this PR is trying to fix without needing races to fail.
  4. Add more asserts as desired.
  5. For all the failing places, post to the CHIP event loop.
  6. Turn the asserts on for esp32.

@msandstedt @andy31415

@github-actions
Copy link

Size increase report for "nrfconnect-example-build" from 403b5ff

File Section File VM
chip-shell.elf text 24 24
chip-shell.elf device_handles -8 -8
Full report output
BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-lighting.elf and ./pull_artifact/chip-lighting.elf:

sections,vmsize,filesize
.debug_info,0,26555
.debug_line,0,1173
.debug_abbrev,0,256
.debug_ranges,0,48
.debug_loc,0,16

Comparing ./master_artifact/chip-shell.elf and ./pull_artifact/chip-shell.elf:

sections,vmsize,filesize
.debug_info,0,7921
.debug_line,0,813
.debug_abbrev,0,162
.strtab,0,114
.debug_ranges,0,48
.symtab,0,48
text,24,24
.debug_loc,0,16
.shstrtab,0,2
device_handles,-8,-8

Comparing ./master_artifact/chip-lock.elf and ./pull_artifact/chip-lock.elf:

sections,vmsize,filesize
.debug_info,0,7921
.debug_line,0,797
.debug_abbrev,0,162
.debug_ranges,0,48
.debug_loc,0,16


Copy link
Contributor

@woody-apple woody-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holding based on feedback from @bzbarsky-apple

@msandstedt
Copy link
Contributor

We really need to stop this whack-a-mole, take a few days, decide what our threading model is, document it, add asserts, and make sure our sample apps follow it. It will be less time spent than what we are doing right now. As a strawman:

  1. Our threading model is that all access to CHIP API must be performed while holding the stack lock. The CHIP event loop takes this lock around each event it processes.

Or a second strawman:

  1. Our threading model is that all access to CHIP API must be performed via events posted to the CHIP event loop.

I strongly advocate for the first solution. The second solution means cross-thread callers must queue all interactions to/from the stack. Obviously, any delegate callbacks executed by chip need a queue back to the host context if it is executing another event loop. However, many calls don't need this. Imposing a no-cross-thread-api-call-rule across the board is onerous.

Also, a truly literal reading of strawman 2 presents an unsurmountable difficulty. Currently, posix platform instantiates a pthread for the event loop. Object creation and platform manager init are in an originating thread, and stack events execute in the created pthread. A literal reading of strawman 2 precludes any type of orderly shutdown: if you're not allowed to call any APIs from thread 1 after the pthread starts, how can you ever shutdown the stack? I know that the answer is simply that the originating thread can make some calls; it must at least be able to Shutdown (pthread_join) platform manager.

But my point is that it is impossible for us to impose an across-the-board rule that no cross-thread calls are ever allowed; some must be. So let's just make all of them available with a lock.

@tcarmelveilleux
Copy link
Contributor

We really need to stop this whack-a-mole, take a few days, decide what our threading model is, document it, add asserts, and make sure our sample apps follow it. It will be less time spent than what we are doing right now. As a strawman:

  1. Our threading model is that all access to CHIP API must be performed while holding the stack lock. The CHIP event loop takes this lock around each event it processes.

Or a second strawman:

  1. Our threading model is that all access to CHIP API must be performed via events posted to the CHIP event loop.

I strongly advocate for the first solution. The second solution means cross-thread callers must queue all interactions to/from the stack. Obviously, any delegate callbacks executed by chip need a queue back to the host context if it is executing another event loop. However, many calls don't need this. Imposing a no-cross-thread-api-call-rule across the board is onerous.

Also, a truly literal reading of strawman 2 presents an unsurmountable difficulty. Currently, posix platform instantiates a pthread for the event loop. Object creation and platform manager init are in an originating thread, and stack events execute in the created pthread. A literal reading of strawman 2 precludes any type of orderly shutdown: if you're not allowed to call any APIs from thread 1 after the pthread starts, how can you ever shutdown the stack? I know that the answer is simply that the originating thread can make some calls; it must at least be able to Shutdown (pthread_join) platform manager.

I agree that "everything funelled to the one thread" is onerous. At the same time, blindly locking and not being very careful about lock usage can also cause its own set of problems. I think the solution is a mix of the main loop and locks.

Overall, I would say that "orderly shutdown" of the stack, definitely is one of those cases where instead of taking a lock, the caller of shutdown should just join on the main event loop task, and the shutdown posts an event that causes the whole protocol stack to start its own internally serialized shutdown.

Copy link
Contributor

@bzbarsky-apple bzbarsky-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with doing this for now but would like us to make the real fix our highest priority instead of continuing to pile on other things on top of our current quicksand...

@bzbarsky-apple bzbarsky-apple dismissed woody-apple’s stale review May 18, 2021 17:04

I'm OK with this change as a stopgap.

@@ -94,6 +98,10 @@ CHIP_ERROR ExchangeManager::Init(SecureSessionMgr * sessionMgr)

CHIP_ERROR ExchangeManager::Shutdown()
{
// TODO(#6931): Lock guard is a temporary solution. Proper solution should be post chip shutdown function into chip thread
#if CONFIG_DEVICE_LAYER
DeviceLayer::PlatformMgr().LockChipStack();
Copy link
Contributor

@yufengwangca yufengwangca May 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest moving DeviceLayer::PlatformMgr().LockChipStack() to the place where shutdown is called from the client thread, instead of within ExchangeManager::Shutdown() itself.

You are trying to fix the problem in the case which Shutdown is called from a client thread, for the cases Shutdown() are called from CHIP main thread event loop(Which it should be), it will crash since chip lock is not recursive.

@@ -109,6 +117,9 @@ CHIP_ERROR ExchangeManager::Shutdown()
}

mState = State::kState_NotInitialized;
#if CONFIG_DEVICE_LAYER
DeviceLayer::PlatformMgr().UnlockChipStack();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, move DeviceLayer::PlatformMgr().UnlockChipStack(); to the place Shutdown is called from the client thread

Copy link
Contributor

@woody-apple woody-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the updated template, can you update the PR text?

#### Problem
What is being fixed?  Examples:
* Fix crash on startup
* Fixes #12345 12345 Frobnozzle is leaky (exactly like that, so GitHub will auto-close the issue).

#### Change overview
What's in this PR

#### Testing
How was this tested? (at least one bullet point required)
    • If unit tests were added, how do they cover this issue?
    • If unit tests existed, how were they fixed/modified to prevent this in future?
    • If integration tests were added, how do they verify this change?
    • If manually tested, what platforms controller and device platforms were manually tested, and how?
    • If no testing is required, why not?

@stale
Copy link

stale bot commented May 27, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale Stale issue or PR label May 27, 2021
@stale
Copy link

stale bot commented Jun 3, 2021

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale bot closed this Jun 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
review - pending stale Stale issue or PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

racing in chip_im_initiator, exchange context leak
7 participants