Updated CASE Session lifetime and management logic #17298

Closed
3 tasks done
mrjerryjohns opened this issue Apr 12, 2022 · 6 comments

Comments

@mrjerryjohns
Contributor

mrjerryjohns commented Apr 12, 2022

Problem

We don't have a clear understanding of how and when the lifetime of a CASE session should be managed, what it means to have multiple CASE sessions between a set of peers, or whether that is even permitted.

Proposal

Here are a set of axioms/tenets from a Slack conversation:

  1. We should endeavor to re-use existing CASE sessions over creating a new one if possible.
  2. Sessions should only get removed if we have conclusive evidence of session failure (i.e., a response timeout, which is the best proxy for this today), OR they get evicted to make way for a new one due to LRU.
  3. Of a set of active sessions to a peer, the one that was established later is likely to be more viable than one established earlier. When interacting with a peer, a node should prefer more recently established sessions over older ones.
  4. While the 'ships-passing-in-the-night' scenario (i.e., two concurrently established CASE sessions) is possible, the time at which Sigma3 is received on either side is still a good enough tie-breaker. If either side gets that wrong (unlikely), it's still unlikely we'll have broken sessions because 2) above ensures sessions aren't opportunistically terminated.
  5. Asymmetric session state (i.e., a session that is deemed valid on one side but isn't on the other) is a valid condition that can happen in steady state, and all nodes have to tolerate it. It can happen more or less frequently depending on memory pressure, session establishment rates, reboots, etc.
  6. While nodes do need to be tolerant, we should endeavor to minimize asymmetric sessions where possible, since the penalty for using such a session is an interaction timeout that is slow to discover and can result in wasteful CASE setup interactions.
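
To make tenet 2 concrete, here is a minimal sketch of a fixed-capacity session table in which a session leaves the table only after an observed failure or when it is the least recently used entry and room is needed. The `Session` and `SessionTable` types and all member names are hypothetical and are not the SDK's actual classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified session record (not the SDK's real type).
struct Session {
    uint64_t peerNodeId;
    uint16_t localSessionId;
    uint64_t lastUsed; // logical tick of the most recent activity
};

class SessionTable {
public:
    explicit SessionTable(std::size_t capacity) : mCapacity(capacity) { assert(capacity > 0); }

    // Adds a session. Evicts the least recently used entry only when the
    // table is full. Duplicate sessions to the same peer are never pruned.
    // Returns true if an eviction happened.
    bool Add(const Session & session) {
        bool evicted = false;
        if (mSessions.size() == mCapacity) {
            std::size_t lru = 0;
            for (std::size_t i = 1; i < mSessions.size(); ++i) {
                if (mSessions[i].lastUsed < mSessions[lru].lastUsed) lru = i;
            }
            mSessions.erase(mSessions.begin() + static_cast<std::ptrdiff_t>(lru));
            evicted = true;
        }
        mSessions.push_back(session);
        return evicted;
    }

    // Removes a session after conclusive evidence of failure
    // (for example, a response timeout).
    void RemoveAfterFailure(uint16_t localSessionId) {
        for (std::size_t i = 0; i < mSessions.size(); ++i) {
            if (mSessions[i].localSessionId == localSessionId) {
                mSessions.erase(mSessions.begin() + static_cast<std::ptrdiff_t>(i));
                return;
            }
        }
    }

    bool Contains(uint16_t localSessionId) const {
        for (const Session & s : mSessions) {
            if (s.localSessionId == localSessionId) return true;
        }
        return false;
    }

    std::size_t Size() const { return mSessions.size(); }

private:
    std::size_t mCapacity;
    std::vector<Session> mSessions;
};
```

With a capacity of two, adding a second session to the same peer keeps both entries; only a third `Add` evicts the entry with the smallest `lastUsed` value.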

Here are some implementation side-effects from the above that I think would help:

a. Once established, there is no real advantage to actively pruning duplicate sessions since it's sunk cost. This also ensures active exchanges do not get unnecessarily terminated.
b. When looking for an active session to a peer, we should return the latest session to that peer amongst the set of sessions to that peer (if any).
c. When a session has been established, SessionHolders holding onto an existing session to that peer should get updated to now reference the new session (through the SessionReleaseDelegate API). This follows tenet 3) above.
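
Side-effects (b) and (c) could look roughly like the sketch below. It is an illustration under stated assumptions: `Session`, `Holder`, `FindLatestSession`, and `ShiftHolders` are hypothetical stand-ins, and the real SessionHolder and SessionReleaseDelegate interfaces are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified types for illustration only.
struct Session {
    uint64_t peerNodeId;
    uint16_t localSessionId;
    uint64_t establishedAt; // logical time at which Sigma3 was processed
};

// Stand-in for a session holder: it only remembers which session it targets.
struct Holder {
    const Session * session = nullptr;
};

// (b) Among all sessions to a peer, return the most recently established
// one, or nullptr if there is none.
const Session * FindLatestSession(const std::vector<Session> & sessions, uint64_t peerNodeId) {
    const Session * latest = nullptr;
    for (const Session & s : sessions) {
        if (s.peerNodeId != peerNodeId) continue;
        if (latest == nullptr || s.establishedAt > latest->establishedAt) latest = &s;
    }
    return latest;
}

// (c) Point every holder that targets an older session to the same peer at
// newSession. The older session is left in place, per (a).
// Returns the number of holders that were shifted.
int ShiftHolders(std::vector<Holder> & holders, const Session & newSession) {
    assert(newSession.peerNodeId != 0); // this sketch treats 0 as "no peer"
    int shifted = 0;
    for (Holder & h : holders) {
        if (h.session == nullptr || h.session == &newSession) continue;
        if (h.session->peerNodeId != newSession.peerNodeId) continue;
        if (h.session->establishedAt > newSession.establishedAt) continue;
        h.session = &newSession;
        ++shifted;
    }
    return shifted;
}
```

Holders that target other peers, or that hold no session at all, are left untouched, and no session is removed from the table as part of the shift.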

Work Items:

@mrjerryjohns
Contributor Author

Materially speaking, this renders a lot of the spec material in this PR somewhat moot I believe (@msandstedt ?)

@msandstedt
Contributor

msandstedt commented Apr 13, 2022

Materially speaking, this renders a lot of the spec material in this PR somewhat moot I believe (@msandstedt ?)

I think so. The problem we're facing is that we are trying to find a deterministic algorithm that peers can use such that they will drop a second session if one exists and will both drop the same session.

I think this is actually impossible currently because input to that algorithm will always include the existence of 2+ sessions. That is exactly the piece of information that is not guaranteed to be shared by both peers. It is always possible for one peer to have two sessions open, and the other peer to only have one.

This is a consequence of tenet 5:

Asymmetric session state (i.e a session that is deemed valid on one side but isn't on the other) is a valid condition that can happen at steady state that all nodes have to be tolerant to.
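
A small sketch of why such an algorithm breaks down when the two views differ. The rule below ("keep only the newest session in the local view") is just one hypothetical candidate, and identifying a session by a single shared ID is a simplification made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <set>

// Each peer's local view: the IDs of the sessions it believes are alive.
// A higher ID means the session was established later (assumption).
using View = std::set<uint32_t>;

// Candidate deterministic pruning rule: keep only the newest session.
View PruneToNewest(const View & view) {
    assert(!view.empty());
    return View{ *view.rbegin() };
}

// A session is usable end-to-end only if both peers still hold it.
View Usable(const View & a, const View & b) {
    View out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}
```

If one peer holds sessions {1, 2} and the other holds only {1} (for instance after an eviction or reboot), the first peer prunes down to 2, the second keeps 1, and `Usable` on the pruned views is empty. Before pruning, session 1 was still usable by both.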

@mrjerryjohns
Contributor Author

mrjerryjohns commented Jun 2, 2022

Meeting Minutes (Today's Call):
Attendees: Jerry, Mingjie, Michael Spang, Michael Sandstedt, Boris, Terence

Session Recovery: (Existing PR here).

  • Broad consensus that applications and protocols should be responsible for dictating when, and the cadence at which session recovery occurs.
  • They could set up a SessionHolderWithDelegate to be notified when we have experienced a transport failure (i.e., an MRP failure). @kghost to add a new method to that delegate to make this possible.
  • We should not forcibly evict a session that has encountered a transport failure, as per the long discussions that have happened in Slack and in this issue.
    • Instead, we should mark it as ‘Defunct’ (or some equivalent). This ensures that any existing exchanges on that session can continue to function, and possibly even succeed.
    • If they do eventually succeed, that will return that session back to a normal operational state, and can be used again.
  • There is diminished value in the SDK making a one-time session recovery attempt, since in most cases the underlying protocol operation (e.g., sending an invoke) will eventually fail or time out (re-subscriptions are an exception). This assumes the underlying cause of failure is genuinely transport related (and isn’t a side-effect of single-threading issues on the target).
  • Fix up the IM re-subscription logic to permit both applications managing the session setup on each ensuing re-subscription attempt, and the IM possibly handling that on its own by talking directly to CASESessionManager.
    • IM re-subscription logic should be tweaked so that the provided policy callback is now responsible for arming the timer as well as re-establishing the session.
    • Default implementation can arm the timer + talk to CASESessionManager to establish new session.
    • Applications can override that and do this themselves, and call directly into the IM to re-subscribe with a given session handle.
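
The ‘Defunct’ marking described above might be sketched as a two-state machine. This is illustrative only: the class, its method names, and the callback are hypothetical, and excluding defunct sessions from new exchanges is an assumption of this sketch, not something the minutes spell out.

```cpp
#include <cassert>
#include <functional>
#include <utility>

enum class SessionState { kActive, kDefunct };

// Hypothetical session wrapper, not the SDK's real class.
class Session {
public:
    // onTransportFailure stands in for the delegate notification that
    // lets the application decide when and how to recover.
    explicit Session(std::function<void()> onTransportFailure)
        : mOnTransportFailure(std::move(onTransportFailure)) {
        assert(mOnTransportFailure != nullptr); // a callback is required in this sketch
    }

    // Called when reliable messaging (MRP) gives up on a message.
    // The session is marked, not evicted, so existing exchanges keep running.
    void OnTransportFailure() {
        mState = SessionState::kDefunct;
        mOnTransportFailure();
    }

    // A message received on the session shows it still works, so it
    // returns to its normal operational state.
    void OnMessageReceived() { mState = SessionState::kActive; }

    // Assumption: new exchanges avoid defunct sessions.
    bool IsEligibleForNewExchange() const { return mState == SessionState::kActive; }

    SessionState State() const { return mState; }

private:
    SessionState mState = SessionState::kActive;
    std::function<void()> mOnTransportFailure;
};
```

The recovery cadence stays with the caller: the callback fires once per failure, and nothing in the session itself attempts to re-establish CASE.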

Session Shifting: (Existing PR here)

  • CAT tags are really only used for ACLs.
  • Couldn’t come up with a good use-case for why CAT tags should be incorporated into session equivalence checking during session shifting.
  • For now, let’s not incorporate CAT tags into shifting until we have a demonstrated issue or problem that necessitates a fix.
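
As a rough illustration of that decision, an equivalence check used for shifting could compare only the peer's identity and ignore its CATs. The `PeerIdentity` struct and the choice of fabric index plus node ID as the identity are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical view of the peer identity attached to a session.
struct PeerIdentity {
    uint8_t fabricIndex;
    uint64_t nodeId;
    std::vector<uint32_t> cats; // CASE Authenticated Tags, used for ACL checks
};

// Two sessions are treated as equivalent for shifting when they reach the
// same node on the same fabric. CATs are deliberately not compared.
bool IsEquivalentForShifting(const PeerIdentity & a, const PeerIdentity & b) {
    // This sketch assumes both sessions are bound to a fabric (index != 0).
    assert(a.fabricIndex != 0 && b.fabricIndex != 0);
    return a.fabricIndex == b.fabricIndex && a.nodeId == b.nodeId;
}
```

Under this check, a new session to the same node with a different set of CATs would still be a valid shift target.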

@woody-apple
Contributor

SDK Review: Given the impact on reliability of CASE, this is required for SVE.

@woody-apple
Contributor

SVE/Cert Blocker Review: Does not appear to be blocking a test case at this time, removing SVE.

@mrjerryjohns
Contributor Author

All tasks are done.
