-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add background event handling for CASE establish #24099
Conversation
CASE session establishment has operations which are costly, such as checking certificate chains. The handshake messages are processed in the event thread, so while these operations occur, other events cannot be processed. This delays responses, and can cause the event queue to fill entirely, which is fatal. This commit adds support for background event processing, and uses it to process the most costly operations during CASESesssion::HandleSigma3. - add platform support for background event processing: ScheduleBackgroundWork, RunBackgroundEventLoop, etc. - add device config flags for enabling/disabling and configuration - add implementation for FreeRTOS platform - refactor some CASESession operations so they can be static, avoiding use of member variables - break HandlSigma3 into 3 parts A/B/C: - HandleSigma3a (foreground, processes incoming message) - HandleSigma3b (background, performs most costly operations) - HandleSigma3c (foreground, sends status report) This breakup of HandleSigma3 was done in a fairly straightforward manner so it could be clearer, during review, that behaviour has not substantially changed. A subsequent commit should clean it up further by introducing helper code for managing the foreground/background work, lifetime of work object, when to send status report and/or abort pending establish, etc. Also still to do is implementation for other platforms, and for other messages in CASESession (e.g. Sigma2), and for other costly operations (e.g. PASESession). Currently, CASE session establishment is simplified: - only one pairing session is active at a time - it's always the same CASESession object in CASEServer - the two classes are higly coupled (e.g. CASEServer relies upon CASESession aborting the pending establish if an error occurs) Therefore, HandleSigma3b can rely upon the lifetime of the CASESession object, use an additional state and sequence number to synchronize work across foreground/background, and avoid use of member variables. If and when CASE session establishment becomes more complex, assumptions should be revisited. TESTING Testing was performed on M5Stack (ESP32) by commissioning using the Google Home app on Android. First, baseline behaviour with background events disabled: - If no errors, commissioning succeeds as before - If HandleSigma3a fails and sends a status report, pairing retries promptly and succeeds - If HandleSigma3a fails and cannot send a status report, pairing retries after about a minute and succeeds - If HandleSigma3c succeeds but cannot send a status report, pairing retries after about a minute and succeeds Next, improved behaviour with background events enabled: - If no errors, commissioning succeeds as before - If HandleSigma3a fails and sends a status report, pairing retries promptly and succeeds - (this includes failure to schedule HandleSigma3b) - If HandleSigma3b fails and sends a status report, pairing retries promptly and succeeds - If HandleSigma3c fails and sends a status report, pairing retries promptly and succeeds - If HandleSigma3c succeeds but cannot send a status report, pairing retries after about a minute and succeeds - If HandleSigma3b is starved (scheduled but does not complete), after several minutes the failsafe timer fires, then Home app allows try again, which then succeeds - If HandleSigma3b is delayed (completes late), the sequence number is unexpected, so no status report is sent, then after several minutes the failsafe timer fires, then Home app allows try again, which then succeeds
PR #24099: Size comparison from b49c3ed to 3a80c45 Increases above 0.2%:
Increases (23 builds for bl602, bl702, cc13x2_26x2, efr32, esp32, k32w, qpg)
Decreases (6 builds for cc13x2_26x2)
Full report (23 builds for bl602, bl702, cc13x2_26x2, efr32, esp32, k32w, qpg)
|
src/include/platform/internal/GenericPlatformManagerImpl_FreeRTOS.ipp
Outdated
Show resolved
Hide resolved
src/include/platform/internal/GenericPlatformManagerImpl_FreeRTOS.ipp
Outdated
Show resolved
Hide resolved
Had more related changes, but they're all removed, so remove this change also.
Change error code also.
Cleanup can occur in a subsequent commit.
Cleanup can occur in a subsequent commit.
PR #24099: Size comparison from 02bc67e to 9ffd859 Increases (1 build for cc32xx)
Full report (1 build for cc32xx)
|
nRF/Zephyr tests don't like this not being cleaned up. May fix Darwin too?
PR #24099: Size comparison from b3c599d to 77ef3cc Increases (1 build for cc32xx)
Full report (1 build for cc32xx)
|
Seems needed on Darwin.
We would like to see the crypto abstraction itself admit asynchronous implementations, in particular of OperationalKeystore::SignWithOpKeyPair as that is the critical API to securing operational private keys. Doing it this way has a few limitations:
Changing the crypto APIs to be asynchronous will fix those limitations - SignOpWithKeyPair should take a callback to indicate when the work is done. |
PR #24099: Size comparison from a6d2883 to 42c8fbf Increases above 0.2%:
Increases (43 builds for bl602, bl702, cc13x2_26x2, cc32xx, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, psoc6, qpg, telink)
Decreases (10 builds for cc13x2_26x2, psoc6)
Full report (43 builds for bl602, bl702, cc13x2_26x2, cc32xx, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, psoc6, qpg, telink)
|
Since the task priorities were adjusted for project-chip#24099 (and maybe even before?) the factory reset RPC would time out, breaking any test which relies on the RPC successfully completing. This is because the reset is happening immediately, before whatever is handling the RPC can complete. Reducing the FreeRTOS task priority just before the reset "fixes" it. (Maybe there's a better fix?) Tested on Mighty Gecko SiLabs board.
Since the task priorities were adjusted for #24099 (and maybe even before?) the factory reset RPC would time out, breaking any test which relies on the RPC successfully completing. This is because the reset is happening immediately, before whatever is handling the RPC can complete. Reducing the FreeRTOS task priority just before the reset "fixes" it. (Maybe there's a better fix?) Tested on Mighty Gecko SiLabs board.
* Add background event handling for CASE establish CASE session establishment has operations which are costly, such as checking certificate chains. The handshake messages are processed in the event thread, so while these operations occur, other events cannot be processed. This delays responses, and can cause the event queue to fill entirely, which is fatal. This commit adds support for background event processing, and uses it to process the most costly operations during CASESesssion::HandleSigma3. - add platform support for background event processing: ScheduleBackgroundWork, RunBackgroundEventLoop, etc. - add device config flags for enabling/disabling and configuration - add implementation for FreeRTOS platform - refactor some CASESession operations so they can be static, avoiding use of member variables - break HandlSigma3 into 3 parts A/B/C: - HandleSigma3a (foreground, processes incoming message) - HandleSigma3b (background, performs most costly operations) - HandleSigma3c (foreground, sends status report) This breakup of HandleSigma3 was done in a fairly straightforward manner so it could be clearer, during review, that behaviour has not substantially changed. A subsequent commit should clean it up further by introducing helper code for managing the foreground/background work, lifetime of work object, when to send status report and/or abort pending establish, etc. Also still to do is implementation for other platforms, and for other messages in CASESession (e.g. Sigma2), and for other costly operations (e.g. PASESession). Currently, CASE session establishment is simplified: - only one pairing session is active at a time - it's always the same CASESession object in CASEServer - the two classes are higly coupled (e.g. CASEServer relies upon CASESession aborting the pending establish if an error occurs) Therefore, HandleSigma3b can rely upon the lifetime of the CASESession object, use an additional state and sequence number to synchronize work across foreground/background, and avoid use of member variables. If and when CASE session establishment becomes more complex, assumptions should be revisited. TESTING Testing was performed on M5Stack (ESP32) by commissioning using the Google Home app on Android. First, baseline behaviour with background events disabled: - If no errors, commissioning succeeds as before - If HandleSigma3a fails and sends a status report, pairing retries promptly and succeeds - If HandleSigma3a fails and cannot send a status report, pairing retries after about a minute and succeeds - If HandleSigma3c succeeds but cannot send a status report, pairing retries after about a minute and succeeds Next, improved behaviour with background events enabled: - If no errors, commissioning succeeds as before - If HandleSigma3a fails and sends a status report, pairing retries promptly and succeeds - (this includes failure to schedule HandleSigma3b) - If HandleSigma3b fails and sends a status report, pairing retries promptly and succeeds - If HandleSigma3c fails and sends a status report, pairing retries promptly and succeeds - If HandleSigma3c succeeds but cannot send a status report, pairing retries after about a minute and succeeds - If HandleSigma3b is starved (scheduled but does not complete), after several minutes the failsafe timer fires, then Home app allows try again, which then succeeds - If HandleSigma3b is delayed (completes late), the sequence number is unexpected, so no status report is sent, then after several minutes the failsafe timer fires, then Home app allows try again, which then succeeds * Remove WIP code * Address some comments from code review * Remove cruft from testing. * Remove some conditional compilation * Remove some conditional compilation * Move function back where it was Had more related changes, but they're all removed, so remove this change also. * Add some documentation Change error code also. * Use platform new/delete * Undo changes that are merely reordering Cleanup can occur in a subsequent commit. * Undo changes that are merely reordering Cleanup can occur in a subsequent commit. * Remove include file fix (C/C++) * Add documentation to background processing API * Use alternate fabrics table API * Improve documentation * Add assertion * Undo some unrelated cleanup * Update src/protocols/secure_channel/CASESession.cpp Co-authored-by: Boris Zbarsky <[email protected]> * Ensure root cert buf keeps span * Restyled by whitespace * Restyled by clang-format * Add new functions to GenericPlatformManagerImpl So all platforms build and work, even if they don't use the new feature. * Attempt to fix build errors on some platforms Apparently initializing structs with anonymous unions is challenging. * Improving host test environment This commit has a bunch of extra logging etc. to flush out any more CI issues. * Remove log statements and clean up * Update fake PlatformManagerImpl * Increase timeout on fake linux CI * Redo changes to make tests work Undo previous changes to test/app contexts, and go back to just fixing the tests more surgically and contained. Passes Linux host tests and Linux fake platform tests now. * Undo SetSystemLayerForTesting nRF/Zephyr tests don't like this not being cleaned up. May fix Darwin too? * Change fake linux tests timeout back to 15 mins * Restyle * Init/shutdown platform mgr in TestCASESession Seems needed on Darwin. --------- Co-authored-by: Boris Zbarsky <[email protected]> Co-authored-by: Restyled.io <[email protected]>
Since the task priorities were adjusted for project-chip#24099 (and maybe even before?) the factory reset RPC would time out, breaking any test which relies on the RPC successfully completing. This is because the reset is happening immediately, before whatever is handling the RPC can complete. Reducing the FreeRTOS task priority just before the reset "fixes" it. (Maybe there's a better fix?) Tested on Mighty Gecko SiLabs board.
Since the task priorities were adjusted for project-chip#24099 (and maybe even before?) the factory reset RPC would time out, breaking any test which relies on the RPC successfully completing. This is because the reset is happening immediately, before whatever is handling the RPC can complete. Reducing the FreeRTOS task priority just before the reset "fixes" it. (Maybe there's a better fix?) Tested on Mighty Gecko SiLabs board.
CASE session establishment has operations which are costly, such as checking certificate chains. The handshake messages are processed in the event thread, so while these operations occur, other events cannot be processed. This delays responses, and can cause the event queue to fill entirely, which is fatal.
This commit adds support for background event processing, and uses it to process the most costly operations during CASESesssion::HandleSigma3.
This breakup of HandleSigma3 was done in a fairly straightforward manner so it could be clearer, during review, that behaviour has not substantially changed. A subsequent commit should clean it up further by introducing helper code for managing the foreground/background work, lifetime of work object, when to send status report and/or abort pending establish, etc.
Also still to do is implementation for other platforms, and for other messages in CASESession (e.g. Sigma2), and for other costly operations (e.g. PASESession).
Currently, CASE session establishment is simplified:
Therefore, HandleSigma3b can rely upon the lifetime of the CASESession object, use an additional state and sequence number to synchronize work across foreground/background, and avoid use of member variables. If and when CASE session establishment becomes more complex, assumptions should be revisited.
TESTING
Testing was performed on M5Stack (ESP32) by commissioning using the Google Home app on Android.
First, baseline behaviour with background events disabled:
Next, improved behaviour with background events enabled: