ExchangeManager::OnConnectionExpired is incorrectly closing exchanges, can cause lost messages or crashes #7012
Comments
Note that this may be part of what's going on in #6978
Thank you for the solid investigation! Long term I believe we need some form of RAII/Handle support. We had a super hard time with message buffers before that as well. Manually sequencing addref/removeref seems very error prone. @yufengwangca @mrjerryjohns as well
@bzbarsky-apple I guess you mean the refcount goes to 0?
I think there is a problem in step 11: why does the refcount of the exchange drop to 0 while we are still expecting a response to the 'AddNetwork' command? The protocol should not close the exchange before it receives the last message of the conversation. Does this mean we received a default response to the 'AddNetwork' command before the official response from the server?
@bzbarsky-apple I am not sure that no longer calling Close can address this issue. To me, it looks like the issue is in step 11. The CommandSender received a response other than the server's response to the AddNetwork command, and that response caused the current exchange to be closed unexpectedly.
Yes, that is correct. Typo fixed.
The main problem is in step 7, where there is an extra
No, it did not. There are two CommandSenders involved here. They end up using the same exact ExchangeContext, because the extra
I'm really tired of manually manipulating exchange context references. Let me write a shared_ptr-like handle to manage the context.
@kghost Not to say we shouldn't do it, but what would ExchangeManager::OnConnectionExpired then do? And where would the ambient "we sent a message, now we're just waiting for a response or timeout" ref be held?
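For illustration only, here is a minimal sketch of what a shared_ptr-like handle over an intrusive refcount might look like. `FakeExchange` and `RefHandle` are hypothetical names invented for this example, not the project's API, and the fake exchange only counts references instead of freeing itself at zero. The point is that every live handle accounts for exactly one Retain and every destroyed handle for exactly one Release, so nobody sequences them by hand.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for an exchange context with an intrusive refcount.
// It only counts references; a real object would free itself at zero.
class FakeExchange
{
public:
    void Retain() { ++mRefCount; }
    void Release()
    {
        assert(mRefCount > 0); // over-release is a bug
        --mRefCount;
    }
    int RefCount() const { return mRefCount; }

private:
    int mRefCount = 0;
};

// Hypothetical handle: one Retain per live handle, one Release per destroyed handle.
template <typename T>
class RefHandle
{
public:
    RefHandle() = default;
    explicit RefHandle(T * obj) : mObj(obj)
    {
        if (mObj != nullptr)
            mObj->Retain();
    }
    // Copying shares the object and takes an additional reference.
    RefHandle(const RefHandle & other) : RefHandle(other.mObj) {}
    // Moving transfers the existing reference; the source becomes empty.
    RefHandle(RefHandle && other) noexcept : mObj(std::exchange(other.mObj, nullptr)) {}
    // Copy-and-swap: the by-value parameter releases the old object on exit.
    RefHandle & operator=(RefHandle other) noexcept
    {
        std::swap(mObj, other.mObj);
        return *this;
    }
    ~RefHandle()
    {
        if (mObj != nullptr)
            mObj->Release();
    }

    T * operator->() const { return mObj; }
    explicit operator bool() const { return mObj != nullptr; }

private:
    T * mObj = nullptr;
};
```

With a handle like this, code that never obtained a handle has no way to drop a reference, which is the class of mistake discussed in this thread.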
Do nothing; just pass the exchange ref to the app. The app may be holding a ref, so let the app choose how to deal with its own ref. If there are no refs, then the exchange context shouldn't exist.
Hold a ref in the timer context.
OK, let's start with that. We can do that even without the shared_ptr bit, and it would fix this issue on its own... ;)
This is an issue: we don't close the exchange in Weave within HandleConnectionClosed, so this Close might have been added by us in previous refactors.
My question is: we asked for a new exchange in step 9 for sending 'AddNetwork', so how does this newly created exchange get destroyed in step 11 before the CommandSender receives the response from the server to the AddNetwork command?
Because there is one extra Release() involved from that bogus Close in step 7. Step 7 prematurely destroyed an exchange that was not supposed to get destroyed until step 11 due to the ref that HandleResponse was holding. Step 7 dropped that ref (incorrectly). Then the "new exchange" in step 9 ends up pointing to the same exact memory as the incorrectly-destroyed exchange. We send the command, there is one ref to it waiting for the command to come back. Then step 11 drops the ref it was holding all along, and the refcount goes to 0. The step 7 "yeah, we'll call Release on this refcounted object that we otherwise didn't take a ref to" bit is the broken part here.
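The sequence described in this comment can be traced with a small model. This is only a sketch under assumptions: `Slot`, `OneSlotPool`, and `RunTrace` are hypothetical names, the pool holds a single slot so that reuse of the same memory is guaranteed, and "destroyed" just means the refcount reached 0 and the slot became available again.

```cpp
#include <cassert>

// Hypothetical model of a pooled, refcounted exchange slot.
struct Slot
{
    int refCount   = 0;
    int generation = 0; // bumped on each allocation, to tell logical exchanges apart
};

class OneSlotPool
{
public:
    // Hands out the single slot with one reference, if it is free.
    Slot * Allocate()
    {
        if (mSlot.refCount != 0)
            return nullptr;
        mSlot.refCount = 1;
        ++mSlot.generation;
        return &mSlot;
    }

    static void Retain(Slot * s) { ++s->refCount; }

    // Returns true when this call dropped the last reference.
    static bool Release(Slot * s)
    {
        assert(s->refCount > 0); // models the abort() on releasing a 0-refcount context
        return --s->refCount == 0;
    }

private:
    Slot mSlot;
};

struct TraceResult
{
    bool sameMemory;              // did the "new" exchange reuse the old slot?
    int generationAtEnd;          // which logical exchange owns the slot at the end
    int refCountAtEnd;            // refcount of the AddNetwork exchange at the end
    bool destroyedByStaleRelease; // did the final, stale Release hit zero?
};

TraceResult RunTrace()
{
    OneSlotPool pool;
    Slot * first = pool.Allocate();  // exchange N waiting for its response: refcount 1
    OneSlotPool::Retain(first);      // HandleResponse takes a ref: 2
    OneSlotPool::Release(first);     // Close because we got a response: 1
    OneSlotPool::Release(first);     // bogus Close from OnConnectionExpired: 0, destroyed
    Slot * second = pool.Allocate(); // new exchange for AddNetwork lands in the same slot: 1
    // HandleResponse drops the ref it took, through what is now a stale pointer.
    bool destroyed = OneSlotPool::Release(first);
    return { first == second, second->generation, second->refCount, destroyed };
}
```

In this model the last Release is issued on behalf of the first exchange, but it destroys the second one, which is still waiting for its response.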
STEPS TO REPRODUCE:
./gn_build.sh
./out/debug/standalone/chip-tool pairing ble-wifi x y 112233 12345678 3840
where "x" and "y" can be replaced with the WiFi SSID and password as desired.
EXPECTED RESULTS: M5Stack joins the wifi network
ACTUAL RESULTS: M5Stack gets the SSID and password, but never gets an EnableNetwork command.
@pan-apple and I dug through what happens on the chip-tool side, and it looks like the following, starting at the first place where something weird starts to happen:
ExchangeContext::HandleResponse takes a ref, refcount is 2.
Close because we got a response, refcount is 1.
SecureSessionMgr::NewPairing which calls MarkConnectionsExpired.
ExchangeManager::OnConnectionExpired which calls Close on all the exchange contexts for the session. Refcount goes to 0 and exchange context N is destroyed.
ExchangeContext::HandleResponse, which drops the ref it took. Refcount goes to 0, exchange context N is destroyed.
A note: If at step 9 we had gotten a different exchange context then instead of step 11 destroying an exchange context that should not be destroyed it would have called Release() on a 0-refcount context and hit the abort() there.
Proposed Solution
Stop calling Close on random exchange contexts that we don't understand the lifetime of. That will lead to use-after-free, as here. If the teardown had happened when someone had allocated an exchange but not yet called SendMessage, they would have ended up calling SendMessage on a deleted exchange.... Maybe that case would have been helped by a delegate being notified the exchange is closing (though I am doubtful), but note in the above steps we are closing an already-closed exchange, so even that does not help.
@kghost @andreilitvin
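One possible shape for an expiry handler that leaves lifetimes alone is sketched below. It assumes a notification-only design; `Exchange`, `ExchangeDelegate`, `Manager`, and `OnSessionExpired` are hypothetical names for this example and do not describe the project's actual classes. The manager tells each live exchange's owner that the session is gone and never releases a reference it did not take.

```cpp
#include <cassert>
#include <vector>

struct Exchange;

// Hypothetical callback interface implemented by whoever owns a reference.
struct ExchangeDelegate
{
    virtual ~ExchangeDelegate() = default;
    virtual void OnSessionExpired(Exchange & ec) = 0;
};

// Hypothetical, simplified exchange: just the fields this sketch needs.
struct Exchange
{
    int sessionId               = 0;
    int refCount                = 0;
    ExchangeDelegate * delegate = nullptr;
};

class Manager
{
public:
    void Track(Exchange * ec) { mExchanges.push_back(ec); }

    // Notify owners of live exchanges on the expired session.
    // The manager took no reference, so it releases none.
    void OnConnectionExpired(int sessionId)
    {
        for (Exchange * ec : mExchanges)
        {
            if (ec->sessionId == sessionId && ec->refCount > 0 && ec->delegate != nullptr)
            {
                ec->delegate->OnSessionExpired(*ec);
            }
        }
    }

private:
    std::vector<Exchange *> mExchanges;
};

// Example delegate that only records how often it was notified.
struct CountingDelegate : ExchangeDelegate
{
    int notified = 0;
    void OnSessionExpired(Exchange &) override { ++notified; }
};
```

Whether the owner then drops its own reference is its decision. The sketch does not handle delegates that add or remove tracked exchanges during the callback.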