Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExchangeManager::OnConnectionExpired is incorrectly closing exchanges, can cause lost messages or crashes #7012

Closed
bzbarsky-apple opened this issue May 20, 2021 · 12 comments · Fixed by #7114
Assignees

Comments

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented May 20, 2021

STEPS TO REPRODUCE:

  1. Compile all-clusters-app for m5stack with pairing mode set to BLE
  2. Run it on the m5stack
  3. On desktop (Mac in my case) run ./gn_build.sh
  4. Run ./out/debug/standalone/chip-tool pairing ble-wifi x y 112233 12345678 3840

where "x" and "y" can be replaced with the WiFi SSID and password as desired.

EXPECTED RESULTS: M5Stack joins the wifi network

ACTUAL RESULTS: M5Stack gets the SSID and password, but never gets an EnableNetwork command.

@pan-apple and I dug through what happens on the chip-tool side, and it looks like the following, starting at the first place where something weird starts to happen:

  1. We send the AddRootCert command on exchange N. Exchange context N has a refcount of 1 at this point.
  2. The response to AddRootCert comes back.
  3. ExchangeContext::HandleResponse takes a ref, refcount is 2.
  4. The actual response handling calls Close because we got a response, refcount is 1
  5. Response handling calls into the app logic, which decides pairing is done and we need to reload some CHIPDevice info now.
  6. We end up in SecureSessionMgr::NewPairing which calls MarkConnectionsExpired.
  7. That lands in ExchangeManager::OnConnectionExpired which calls Close on all the exchange contexts for the session. Refcount goes to 0 and exchange context N is destroyed.
  8. We unwind back to more app logic, which decides to send the AddNetwork command.
  9. The command sender asks for a new exchange context, gets exchange context N (because it's available!), refcount is now 1
  10. The command gets sent.
  11. Stack unwinds all the way back to ExchangeContext::HandleResponse, which drops the ref it took. Refcount goes to 0, exchange context N is destroyed.
  12. Eventually the server responds to the AddNetwork command, but the exchange is gone, so we treat it as an unsolicited non-initiator message and drop it.
  13. We time out the commissioning process and never send EnableNetwork.

A note: If at step 9 we had gotten a different exchange context then instead of step 11 destroying an exchange context that should not be destroyed it would have called Release() on a 0-refcount context and hit the abort() there.

Proposed Solution

Stop calling Close on random exchange contexts that we don't understand the lifetime of. That will lead to use-after-free, as here. If the teardown had happened when someone had allocated an exchange but not yet called SendMessage, they would have ended up calling SendMessage on a deleted exchange.... Maybe that case would have been helped by a delegate being notified the exchange is closing (though I am doubtful), but note in the above steps we are closing an already-closed exchange, so even that does not help.

@kghost @andreilitvin

@bzbarsky-apple
Copy link
Contributor Author

bzbarsky-apple commented May 20, 2021

Note that this may be part of what's going on in #6978

@andy31415
Copy link
Contributor

Thank you for the solid investigation!

Long germ I belive we need some form of RAII/Handle support. We had a super hard time with message buffers before that as well. Manual sequencing addref/removeref seems very error prone.

@yufengwangca @mrjerryjohns as well

@yufengwangca
Copy link
Contributor

@bzbarsky-apple
"
11. Stack unwinds all the way back to ExchangeContext::HandleResponse, which drops the ref it took. Refcount goes to 9, exchange context N is destroyed"

I guess you means the Refcount goes to 0?

@yufengwangca
Copy link
Contributor

yufengwangca commented May 20, 2021

I think there is a problem on step 11, why the refcount of exchange drop to 0 before we expecting a response for Command 'AddNetwork', the protocol should not close the exchange before it receives the last message of the conversation.

Does this means we received a default response to 'AddNetwork' command before the official response from server to the AddNetwork command?

@yufengwangca
Copy link
Contributor

yufengwangca commented May 21, 2021

@bzbarsky-apple I am not sure stop calling Close can address this issue. To me, it looks like the issue is on step 11.

The CommandSender received another response other than the response to AddNetwork command from the server, this response cause the current exchange get closed unexpectedly.

@bzbarsky-apple
Copy link
Contributor Author

I guess you means the Refcount goes to 0?

Yes, that is correct. Typo fixed.

why the refcount of exchange drop to 0 before we expecting a response for Command 'AddNetwork'

The main problem is in step 7, where there is an extra Close that reduces the refcount to 0 when it shouldn't be 0.

The CommandSender received another response other than the response to AddNetwork command from the server

No, it did not. There are two CommandSenders involved here. They end up using the same exact ExchangeContext, because the extra Close in step 7 causes us to treat that ExchangeContext as no longer used while one of the CommandSenders is still sort of using it, and the second then starts using it before the first one is done.

@kghost
Copy link
Contributor

kghost commented May 21, 2021

I'm really tired of manually manipulate exchange context references. Let me write a shared_ptr like handle to manage the context.

@bzbarsky-apple
Copy link
Contributor Author

@kghost Not to say we shouldn't do it, but what would ExchangeManager::OnConnectionExpired then do? And where would the ambient "we sent a message, now we're just waiting for a response or timeout" ref be held?

@kghost
Copy link
Contributor

kghost commented May 21, 2021

what would ExchangeManager::OnConnectionExpired then do?

Do nothing, just pass the exchange ref to app, because app may be holding a ref, let app choose how to deal with his own ref. If there are no ref, then the exchange context shouldn't be exist.

where would the ambient "we sent a message, now we're just waiting for a response or timeout" ref be held

hold a ref in the timer context.

@bzbarsky-apple
Copy link
Contributor Author

Do nothing, just pass the exchange ref to app

OK, let's start with that. We can do that even without the shared_ptr bit, and it would fix this issue on its own... ;)

@kghost kghost self-assigned this May 21, 2021
@yufengwangca
Copy link
Contributor

yufengwangca commented May 21, 2021

@bzbarsky-apple

The main problem is in step 7, where there is an extra Close that reduces the refcount to 0 when it shouldn't be 0.

This is an issue, we don't close the exchange in Weave within HandleConnectionClosed, so this Close might be added by us from previous refactors.

The CommandSender received another response other than the response to AddNetwork command from the server

No, it did not. There are two CommandSenders involved here. They end up using the same exact ExchangeContext, because the extra Close in step 7 causes us to treat that ExchangeContext as no longer used while one of the CommandSenders is still sort of using it, and the second then starts using it before the first one is done.

My question is we have asked for a new exchange in step 9 for sending 'AddNetwork', how come this new created exchange get destroyed in step 11 before CommandSender receive the response from server to AddNetwork command ?

@bzbarsky-apple
Copy link
Contributor Author

how come this new created exchange get destroyed in step 11

Because there is one extra Release() involved from that bogus Close in step 7. Step 7 prematurely destroyed an exchange that was not supposed to get destroyed until step 11 due to the ref that HandleResponse was holding. Step 7 dropped that ref (incorrectly). Then the "new exchange" in step 9 ends up pointing to the same exact memory as the incorectly-destroyed exchange. We send the command, there is one ref to it waiting for the command to come back. Then step 11 drops the ref it was holding all along, and the refcount goes to 0.

The step 7 "yeah, we'll call Release on this refcounted object that we otherwise didn't take a ref to" bit is the broken part here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants