-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement session recovery #18883
Implement session recovery #18883
Conversation
PR #18883: Size comparison from e7bb258 to afb4aa5 Increases above 0.2%:
Increases (37 builds for cc13x2_26x2, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, p6, telink)
Decreases (37 builds for cc13x2_26x2, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, p6, telink)
Full report (37 builds for cc13x2_26x2, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, p6, telink)
|
// | ||
// We don't need to reset the state all the way back to NeedsAddress since all that transpired | ||
// was just CASE connection failure. So let's re-use the cached address to re-do CASE again | ||
// if need-be. | ||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignoring the multicast traffic injected for mDNS and only considering the two peers involved in the session, CASE is the expensive part of the operation and address resolution is relatively cheap. Does it really make sense to undertake a potentially lengthy sequence of retries with a potentially stale address?
src/lib/core/CHIPConfig.h
Outdated
#ifndef CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_INITIAL_MS | ||
#define CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_INITIAL_MS (30 * 1000) | ||
#endif // CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_INITIAL_MS | ||
|
||
/** | ||
* @def CHIP_PEER_CONNECTION_TIMEOUT_CHECK_FREQUENCY_MS | ||
* @def CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_MULTIPLIER | ||
* | ||
* @brief How frequent are peer connections checked for timeouts. | ||
* @brief A multiplier which is used to increase exponential backoff timeout. | ||
*/ | ||
#ifndef CHIP_PEER_CONNECTION_TIMEOUT_CHECK_FREQUENCY_MS | ||
#define CHIP_PEER_CONNECTION_TIMEOUT_CHECK_FREQUENCY_MS 5000 | ||
#endif // CHIP_PEER_CONNECTION_TIMEOUT_CHECK_FREQUENCY_MS | ||
#ifndef CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_MULTIPLIER | ||
#define CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_MULTIPLIER (1.3) | ||
#endif // CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_MULTIPLIER | ||
|
||
/** | ||
* @def CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_MAX_MS | ||
* | ||
* @brief Max backoff timeout for session recovery. Default 5 minutes. | ||
*/ | ||
#ifndef CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_MAX_MS | ||
#define CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_MAX_MS (5 * 60 * 1000) | ||
#endif // CHIP_CONFIG_SESSION_RECOVERY_BACKOFF_MAX_MS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned about the potentially large stackup of retries at multiple layers here. We already have MRP at the exchange layer for CASE. And here I think we are multiplying that at another layer with session recovery, sometimes without resolution of a new address.
Won't this lead to extremely slow recovery in the case of renumbering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please let me know once the other bits of code this is on top of merge so this can be reviewed sanely....
So, I'm not super convinced this is something that we need... The trigger for initiating a session is usually an application intent to initiate some higher-level interaction with a peer (e.g subscribe, or send a command, or do BDX, etc). Consequently, it follows naturally that the re-establishment of a failed session should also fall similarly, on some demonstrated application intent. In some cases, this happens in the protocol layer. In the IM, we have a built-in retry mechanism for subscriptions that will do exponential back-off, etc. I'm concerned this session recovery construct conflicts with that. The same goes for something like BDX, where the requestor has a built in mechanism for figuring out when/how to initiate another interaction. This PR seems to assume that upon failure to receive a response from a peer, that there is still continued application interest and intent to interact with that peer. That is certainly not true in many cases (e.g sending a one-off command to turn on a light bulb, or an OTA query to a single provider). This ends up running a machinery that may spin and churn indefinitely attempting to connect to a peer that you in fact, do not want to continue connecting to. The application and protocol machinery are the best guides for what to do here. Feels like we should leave it to them to figure out when and to whom they want to establish a session perhaps? |
PR #18883: Size comparison from 8ef608d to afaab2c Increases (39 builds for cc13x2_26x2, cyw30739, efr32, esp32, k32w, linux, mbed, p6, telink)
Decreases (3 builds for cc13x2_26x2)
Full report (39 builds for cc13x2_26x2, cyw30739, efr32, esp32, k32w, linux, mbed, p6, telink)
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
Is this going to get updated @kghost ? |
After retry logic moved to application layer, I can't accomplish the goal in transport layer or messaging layer. I'll provide required API for applications to implement session recovery. |
Can we make sure the default implementation (without app intervention) has a sane retry policy. If apps want to override that, that's their choice, but we should make sure the default implementation is a workable one. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
This stale pull request has been automatically closed. Thank you for your contributions. |
Problem
Fixes #16202 #15766
Change overview
Start a new CASE pairing when RMP 3rd try timeout.Use an exponent backoff if CASE pairing failssession->DispatchSessionEvent(&SessionDelegate::OnRequestRecovery)
to trigger itImplemented inside
OperationalDeviceProxy
state machine, adds 2 stateRecovering
andRecoveringBackoff
RecoveringBackoff: if recover fails, start a timer then give another try.Note: this PR depends on following PRs
Testing
Passed unit-tests