Updated CASE Session lifetime and management logic #17298

mrjerryjohns · 2022-04-12T21:47:29Z

Problem

We don't have a clear understanding of when and how the lifetime of a CASE session should be managed, what it means to have multiple CASE sessions between a set of peers, or whether that is even permitted.

Proposal

Here are a set of axioms/tenets from a Slack conversation:

1) We should endeavor to re-use existing CASE sessions over creating new ones where possible.
2) Sessions should only be removed if we have conclusive evidence of session failure (a response timeout is the best proxy for this today), OR if they are evicted to make way for a new one due to LRU.
3) Of a set of active sessions to a peer, the one established later is likely to be more viable than one established earlier. When interacting with a peer, a node should prefer more recently established sessions over older ones.
4) While the 'ships-passing-in-the-night' scenario (i.e. two concurrently established CASE sessions) is possible, the time at which Sigma3 is received on either side is still a good enough tie-breaker. If either side gets that wrong (unlikely), we are still unlikely to end up with broken sessions, because 2) above ensures sessions aren't opportunistically terminated.
5) Asymmetric session state (i.e. a session that is deemed valid on one side but not on the other) is a valid condition that can occur at steady state, and all nodes have to tolerate it. It can happen more or less frequently depending on memory pressure, session establishment rates, reboots, etc.
6) While nodes do need to be tolerant, we should endeavor to minimize asymmetric sessions where possible, since the penalty for using such a session is an interaction timeout that takes a long time to discover and can result in wasteful CASE setup interactions.
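Tenet 2) can be illustrated with a small sketch. This is not the SDK's session table: `Session`, `SessionTable`, and all of their members are hypothetical names chosen for illustration, and the `lastActivity` tick is an assumed stand-in for whatever recency measure the real LRU uses. The point is only that a session leaves the table through exactly two paths: eviction to make room, or explicit removal after a conclusive failure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified session record (not the SDK's SecureSession).
struct Session
{
    uint16_t localSessionId;
    uint64_t peerNodeId;
    uint64_t lastActivity; // assumed monotonically increasing tick of last use
};

class SessionTable
{
public:
    explicit SessionTable(size_t capacity) : mCapacity(capacity) { assert(capacity > 0); }

    // Path 1: adding a session to a full table evicts the least recently
    // used entry first. Returns true if an eviction took place.
    bool Add(const Session & session)
    {
        bool evicted = false;
        if (mSessions.size() >= mCapacity)
        {
            auto lru = std::min_element(mSessions.begin(), mSessions.end(),
                                        [](const Session & a, const Session & b) { return a.lastActivity < b.lastActivity; });
            mSessions.erase(lru);
            evicted = true;
        }
        mSessions.push_back(session);
        return evicted;
    }

    // Path 2: removal on conclusive evidence of failure (e.g. a response
    // timeout). Returns true if the session was present.
    bool RemoveOnFailure(uint16_t localSessionId)
    {
        auto it = std::find_if(mSessions.begin(), mSessions.end(),
                               [localSessionId](const Session & s) { return s.localSessionId == localSessionId; });
        if (it == mSessions.end())
        {
            return false;
        }
        mSessions.erase(it);
        return true;
    }

    bool Contains(uint16_t localSessionId) const
    {
        return std::any_of(mSessions.begin(), mSessions.end(),
                           [localSessionId](const Session & s) { return s.localSessionId == localSessionId; });
    }

    size_t Size() const { return mSessions.size(); }

private:
    size_t mCapacity;
    std::vector<Session> mSessions;
};
```

Note that nothing in this sketch removes a session merely because a newer one to the same peer exists.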

Here are some implementation side-effects from the above that I think would help:

a. Once duplicate sessions are established, there is no real advantage to actively pruning them, since the setup cost is already sunk. Leaving them in place also ensures active exchanges do not get unnecessarily terminated.
b. When looking for an active session to a peer, we should return the most recently established session to that peer (if any).
c. When a new session has been established, SessionHolders holding onto an existing session to that peer should be updated to reference the new session (through the SessionReleaseDelegate API). This follows tenet 3) above.
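Items b. and c. can be sketched roughly as follows. Everything here is an assumption made for illustration: `Session`, `SessionHolder`, `FindLatestSession`, and `ShiftHolders` are hypothetical names, and `establishedAt` stands in for the local time at which session establishment completed. The real SDK performs the holder update through its own delegate mechanism, which this sketch does not attempt to reproduce.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified session record.
struct Session
{
    uint16_t localSessionId;
    uint64_t peerNodeId;
    uint64_t establishedAt; // assumed monotonic tick when establishment completed
};

// Hypothetical holder that references (but does not own) a session.
struct SessionHolder
{
    const Session * session = nullptr;
};

// Item b: return the most recently established session to the given peer,
// or nullptr if there is none.
const Session * FindLatestSession(const std::vector<Session> & sessions, uint64_t peerNodeId)
{
    const Session * latest = nullptr;
    for (const Session & s : sessions)
    {
        if (s.peerNodeId != peerNodeId)
        {
            continue;
        }
        if (latest == nullptr || s.establishedAt > latest->establishedAt)
        {
            latest = &s;
        }
    }
    return latest;
}

// Item c: point every holder that references another session to the same
// peer at the newly established session. Returns the number of holders shifted.
size_t ShiftHolders(std::vector<SessionHolder> & holders, const Session & newSession)
{
    size_t shifted = 0;
    for (SessionHolder & holder : holders)
    {
        if (holder.session != nullptr && holder.session != &newSession &&
            holder.session->peerNodeId == newSession.peerNodeId)
        {
            holder.session = &newSession;
            ++shifted;
        }
    }
    return shifted;
}
```

The older session is left in the table after shifting, in line with item a.: holders move, but nothing is pruned.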

Work Items:

mrjerryjohns · 2022-04-12T21:48:32Z

Materially speaking, this renders a lot of the spec material in this PR somewhat moot I believe (@msandstedt ?)

msandstedt · 2022-04-13T01:50:15Z

> Materially speaking, this renders a lot of the spec material in this PR somewhat moot I believe (@msandstedt ?)

I think so. The problem we're facing is that we are trying to find a deterministic algorithm that peers can use such that they will drop a second session if one exists and will both drop the same session.

I think this is actually impossible currently, because the input to that algorithm will always include the existence of 2+ sessions. That is exactly the piece of information that is not guaranteed to be shared by both peers: it is always possible for one peer to have two sessions open while the other peer has only one.

This is a consequence of tenet 5:

> Asymmetric session state (i.e a session that is deemed valid on one side but isn't on the other) is a valid condition that can happen at steady state that all nodes have to be tolerant to.
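A toy model of the argument above, under stated assumptions: sessions are represented by integer labels (higher meaning established later), and `PruneDuplicates` is a hypothetical "keep only the newest" rule, not anything proposed in the spec. Each peer can only apply the rule to its own view, so when the views differ, the rule can leave the peers with no session in common.

```cpp
#include <algorithm>
#include <iterator>
#include <set>

// Labels of the sessions one peer currently believes are valid.
// A higher label is assumed to mean "established later".
using SessionView = std::set<int>;

// Hypothetical deterministic pruning rule: if a peer sees 2+ sessions,
// it keeps only the newest one. It can only act on its own local view.
SessionView PruneDuplicates(SessionView view)
{
    if (view.size() >= 2)
    {
        int newest = *view.rbegin();
        view       = { newest };
    }
    return view;
}

// Sessions that both peers still consider valid.
SessionView SharedSessions(const SessionView & a, const SessionView & b)
{
    SessionView shared;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::inserter(shared, shared.begin()));
    return shared;
}
```

With symmetric views `{1, 2}` and `{1, 2}`, both peers keep session 2. With asymmetric views `{1, 2}` and `{1}`, the first peer drops session 1 while the second peer never sees a duplicate and keeps it, so the one session that worked on both sides is lost.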

mrjerryjohns · 2022-06-02T19:06:05Z

Meeting Minutes (Today's Call):
Attendees: Jerry, Mingjie, Michael Spang, Michael Sandstedt, Boris, Terence

Session Recovery: (Existing PR here).

Broad consensus that applications and protocols should be responsible for dictating when, and the cadence at which session recovery occurs.
They could set up a SessionHolderWithDelegate to be notified when we have experienced a transport failure (i.e. an MRP failure). @kghost to add a new method to that delegate to make this possible.
We should not forcibly evict a session that has encountered a transport failure as per the long discussions that have happened in Slack and transpired in this issue.
- Instead, we should mark it as ‘Defunct’ (or some equivalent). This ensures that any existing exchanges on that session can continue to function, and possibly even succeed.
- If they do eventually succeed, that will return that session back to a normal operational state, and can be used again.
There is diminished value in the SDK making a one-time session recovery attempt since, in most cases, the underlying protocol operation (e.g. sending an invoke) will eventually fail or time out anyway (re-subscriptions are an exception). This assumes the underlying cause of failure is genuinely transport-related (and isn’t a side-effect of single-threading issues on the target).
Fix up the IM re-subscription logic so that applications can manage session setup on each ensuing re-subscription attempt, and so that the IM can alternatively handle that on its own by talking directly to CASESessionManager.
- IM re-subscription logic should be tweaked so that the provided policy callback is now responsible for arming the timer as well as re-establishing the session.
- Default implementation can arm the timer + talk to CASESessionManager to establish new session.
- Applications can override that and do this themselves, and call directly into the IM to re-subscribe with a given session handle.
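The ‘Defunct’ marking described in these minutes could look something like the sketch below. It is a minimal illustration under assumptions, not the SDK's implementation: the `Session` class, its method names, and the `std::function` callback (standing in for the new delegate method mentioned above) are all hypothetical. Whether a defunct session should be skipped when choosing a session for new exchanges is also an assumption of this sketch; the minutes only say that existing exchanges keep running.

```cpp
#include <functional>
#include <utility>

enum class SessionState
{
    kActive,
    kDefunct,
};

// Hypothetical, simplified session object.
class Session
{
public:
    using TransportFailureCallback = std::function<void()>;

    // Stand-in for a delegate notification on transport failure.
    void SetTransportFailureCallback(TransportFailureCallback callback) { mOnFailure = std::move(callback); }

    // Called when reliable delivery gives up (e.g. retransmissions exhausted).
    // The session is marked defunct rather than evicted, so existing
    // exchanges on it can keep running.
    void OnTransportFailure()
    {
        mState = SessionState::kDefunct;
        if (mOnFailure)
        {
            mOnFailure();
        }
    }

    // Called when a message is successfully received on this session.
    // Evidence that the session works returns it to normal operation.
    void OnMessageReceived() { mState = SessionState::kActive; }

    SessionState State() const { return mState; }

    // Assumption of this sketch: prefer non-defunct sessions for new exchanges.
    bool IsUsableForNewExchange() const { return mState == SessionState::kActive; }

private:
    SessionState mState = SessionState::kActive;
    TransportFailureCallback mOnFailure;
};
```

An application-level recovery policy would hang off the callback, deciding for itself when and how often to attempt session re-establishment.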

Session Shifting: (Existing PR here)

CAT tags are really only used for ACLs.
Couldn’t come up with a good use-case for why CAT tags should be incorporated into session equivalence checking during session shifting.
For now, let’s not incorporate CAT tags into shifting until we have a demonstrated issue or problem that necessitates a fix.
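As a rough sketch of what leaving CATs out of the equivalence check means: two sessions would be treated as going to the same peer for shifting purposes when fabric and node ID match, regardless of their CAT values. `PeerIdentity` and `EquivalentForShifting` are hypothetical names, and the fixed array of three CAT slots is an assumption made for illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical, simplified view of a session's peer identity.
struct PeerIdentity
{
    uint8_t fabricIndex;
    uint64_t nodeId;
    std::array<uint32_t, 3> cats; // CASE Authenticated Tags; slot count is an assumption
};

// Equivalence used when deciding whether holders may shift from one session
// to another. CATs are deliberately not compared.
bool EquivalentForShifting(const PeerIdentity & a, const PeerIdentity & b)
{
    return a.fabricIndex == b.fabricIndex && a.nodeId == b.nodeId;
}
```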

woody-apple · 2022-07-05T17:45:56Z

SDK Review: Given the impact on reliability of CASE, this is required for SVE.

woody-apple · 2022-08-03T17:16:29Z

SVE/Cert Blocker Review: Does not appear to be blocking a test case at this time, removing SVE.

mrjerryjohns · 2022-09-15T19:00:09Z

All tasks are done.

mrjerryjohns added the secure channel label Apr 12, 2022

mrjerryjohns added the V1.0 label Apr 12, 2022

This was referenced Apr 20, 2022

Add session eviction logic to CASE session management #17568

Closed

Update existing SessionHolders to shift to a newly established session atomically #17569

Closed

Use-after-free when expiring secure sessions #17558

Closed

kghost mentioned this issue May 24, 2022

Re-subscription after restarting device #18761

Closed

mrjerryjohns mentioned this issue Jun 13, 2022

[2/3] CASE Eviction (Initial Impl) #19502

Merged

This was referenced Jun 13, 2022

Bound switch failed to connect with Light after only Light was rebooted #19521

Closed

CASE auth fail after ESP32 crash #14308

Closed

mrjerryjohns mentioned this issue Jun 17, 2022

Figure out interaction timeout behaviors and recovery logic #16202

Closed

bzbarsky-apple mentioned this issue Jun 18, 2022

How to reconnect device with app #10653

Closed

bzbarsky-apple added the request sve label Jul 2, 2022

woody-apple added sve and removed request sve labels Jul 5, 2022

bzbarsky-apple mentioned this issue Jul 14, 2022

CASE session reestablishment after power cycling a device #16376

Closed

woody-apple removed the sve label Aug 3, 2022

mrjerryjohns closed this as completed Sep 15, 2022
