Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RemoveFabric command crashes server when removing accessing fabric #16748

Closed
bzbarsky-apple opened this issue Mar 29, 2022 · 6 comments
Closed

Comments

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Mar 29, 2022

Problem

After #16098 we crash if we RemoveFabric and pass it the fabric index of the fabric the command is running on.

This happens because of this code in SessionManager::PrepareMessage:

        if (session->GetSecureSessionType() == SecureSession::Type::kCASE)
        {
            FabricInfo * fabric = mFabricTable->FindFabricWithIndex(session->GetFabricIndex());
            VerifyOrDie(fabric != nullptr);
            CryptoContext::BuildNonce(nonce, packetHeader.GetSecurityFlags(), messageCounter, fabric->GetNodeId());

when sending the response message for the RemoveFabric command. That VerifyOrDie dies, because fabric is in fact null by this point.

Proposed Solution

We need to either fix this code to get the node id from somewhere else (e.g. store it in the session instead of getting it from the fabric table every time) or change how exactly RemoveFabric works so that it manages to send the message before doing the actual fabric removal.... but then we may have situations where we claim to have removed the fabric when we actually have not.

@kghost @tcarmelveilleux @turon

@bzbarsky-apple bzbarsky-apple changed the title RemoveFabric command crashes server when removing current fabric RemoveFabric command crashes server when removing accessing fabric Mar 29, 2022
@turon
Copy link
Contributor

turon commented Mar 29, 2022

I'd lean toward storing the node id redundantly in the session context. That way we can support this "lingering session" use cases elegantly and decouple them from configuration updates that are occurring over the session.

@Damian-Nordic
Copy link
Contributor

Note that the same assertion fails when the controller is subscribed to Leave/ShutDown events and the factory reset is triggered.

@kghost kghost self-assigned this Apr 13, 2022
@kghost
Copy link
Contributor

kghost commented Apr 13, 2022

Discussed with Boris, we will achieve RemoveFabric in following actions:

  1. Call OnFabricDeletedFromStorage
  2. Send RemoveFabric response
  3. Remove session associated with the fabric
  4. Remove exchanges associated with the fabric
  5. Call exchange->OnExchangeReleased to all exchange delegates
  6. Remove the fabric object

Step 3-6 will be deferred to the next schedule

@mlepage-google
Copy link
Contributor

I can verify I hit this too while testing something. :-)

@Damian-Nordic
Copy link
Contributor

Damian-Nordic commented Apr 21, 2022

@kghost I agree with your solution, but I think you will need to make sure that the LEAVE event is generated prior to removing ACLs for the removed fabric. Otherwise, generating the report containing the LEAVE event will fail the permission check, right?

tcarmelveilleux added a commit to tcarmelveilleux/connectedhomeip that referenced this issue Apr 27, 2022
Because of an access to prior fabric data that is now deleted,
in SessionManager::PrepareMessage, while trying to reply to RemoveFabric,
applications crash when RemoveFabric is done on the accessing fabric.

This crash was awaiting full fix of  project-chip#16748 to be fixed, but that
issue is much bigger scope. We can actually fix the crash with a
suggestion made by @turon
(project-chip#16748 (comment))
to keep the *local node ID* in the SecureSession so that
SessionManager does not try to look-back at the FabricTable
whenever preparing a CASE message where the fabric may be gone.

This is a root cause fix for that very crash, but does not address
the other aspects of project-chip#16748 which relate to completely cleanly
handling fabric removal edge cases.

Issue project-chip#16748
Fixes project-chip#17579
Fixes project-chip#17680
Fixes project-chip#16729

This PR does the following:
- Add local node ID to the SecureSession and fix all associated plumbing
- Use the local node ID for nonce generation in PrepareMessage rather
  than looking-up the fabric table (which may no longer hold the
  fabric that has that prior node ID)
- Improve CASE session establishment logging
- Fix the tests needed
- Fix bad comments in TestPairingSession tests

Testing done:
- Added a YAML test (TestSelfFabricRemoval.yaml) for this case
  - Validated it failed before code fixes with the previously seen
    crash.
  - Validated that it passes with the new fixes
- Added necessary tests to TestPairingSession for new methods
- Unit tests pass
- Cert tests pass
tcarmelveilleux added a commit that referenced this issue Apr 28, 2022
* Add TestSelfFabricRemoval.yaml test

* Fix crash on removal of accessing fabric

Because of an access to prior fabric data that is now deleted,
in SessionManager::PrepareMessage, while trying to reply to RemoveFabric,
applications crash when RemoveFabric is done on the accessing fabric.

This crash was awaiting full fix of  #16748 to be fixed, but that
issue is much bigger scope. We can actually fix the crash with a
suggestion made by @turon
(#16748 (comment))
to keep the *local node ID* in the SecureSession so that
SessionManager does not try to look-back at the FabricTable
whenever preparing a CASE message where the fabric may be gone.

This is a root cause fix for that very crash, but does not address
the other aspects of #16748 which relate to completely cleanly
handling fabric removal edge cases.

Issue #16748
Fixes #17579
Fixes #17680
Fixes #16729

This PR does the following:
- Add local node ID to the SecureSession and fix all associated plumbing
- Use the local node ID for nonce generation in PrepareMessage rather
  than looking-up the fabric table (which may no longer hold the
  fabric that has that prior node ID)
- Improve CASE session establishment logging
- Fix the tests needed
- Fix bad comments in TestPairingSession tests

Testing done:
- Added a YAML test (TestSelfFabricRemoval.yaml) for this case
  - Validated it failed before code fixes with the previously seen
    crash.
  - Validated that it passes with the new fixes
- Added necessary tests to TestPairingSession for new methods
- Unit tests pass
- Cert tests pass

* Restyled by whitespace

* Restyled by clang-format

* Regen ZAP after comment

* Address review comments

* Restyled by clang-format

* Reorder one argument used in test-only code

* Restyled by clang-format

Co-authored-by: Restyled.io <[email protected]>
@bzbarsky-apple
Copy link
Contributor Author

Fixed in #17815.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants