
Proposal: Revision attribute for instance session affinity #9039

Closed
steren opened this issue Aug 12, 2020 · 11 comments
Labels

  • area/API: API objects and controllers
  • area/networking
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Well-understood/specified features, ready for coding.
  • kind/spec: Discussion of how a feature should be exposed to customers.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • triage/accepted: Issues which should be fixed (post-triage)

Comments

@steren (Contributor)

steren commented Aug 12, 2020

/area API
/kind spec

Summary

Add a Revision-level attribute to enable session affinity to container instances.

Use cases:

  • Real time multi-player apps: Libraries like socket.io default to long polling and upgrade to WebSockets if possible. When doing long polling or when needing to reconnect after loss of a WS connection, the library expects to reach the same server.

Proposal

Introduce a new Revision-level attribute (e.g. spec.sessionAffinity: true), defaulting to false.
When set to true, Knative routes sequential requests from a given user to the same Revision container instance.
Knative would inspect the value of a cookie to identify requests from the same user and direct all such requests to the same instance.
Of course, if the instance is rebooted, becomes unhealthy or overloaded, or is removed when the number of instances is scaled down, session affinity is broken and further requests are routed to a different instance. This is "best effort" affinity, a consequence of Knative's autoscaling nature.
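A minimal sketch of what the proposed attribute could look like on a Knative Service (the sessionAffinity field is the hypothetical attribute proposed in this issue, not an existing Knative API; the service name and image are placeholders):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: multiplayer-game          # hypothetical service
spec:
  template:
    spec:
      sessionAffinity: true      # hypothetical field proposed here
      containers:
        - image: gcr.io/my-project/game-server   # placeholder image
```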

@steren steren added the kind/feature Well-understood/specified features, ready for coding. label Aug 12, 2020
@knative-prow-robot knative-prow-robot added area/API API objects and controllers kind/spec Discussion of how a feature should be exposed to customers. labels Aug 12, 2020
@mattmoor (Member)

/area networking

cc @tcnghia @nak3 @ZhiminXiang
cc @evankanderson @dprotaso

I think it's probably prudent to defer API discussions to after we've nailed down the semantics. We basically provide zero guarantees today that a Pod even continues to exist outside the bounds of a request. There are also a wide variety of gotchas here:

  1. What happens when a user is using containerConcurrency and the pod they were hitting filled up?
  2. What happens when Pods are generally underutilized, do we allow folks to reconnect to pods we might scale down?
  3. Given the way our activator-based load-balancing works (cc @vagababov), allowing it to forward requests to pods outside of its range could wreak havoc on accounting (which is especially critical for good LB w/ CC).

I could see configuring the networking layer to try to prefer these sorts of things even without API changes, but "best effort" and "conformance" don't really mix for me. I'd love to hear others' thoughts.

@wlhee

wlhee commented Aug 20, 2020

My intuition: this is best effort affinity.

  1. What happens when a user is using containerConcurrency and the pod they were hitting filled up?
    Break the affinity

  2. What happens when Pods are generally underutilized, do we allow folks to reconnect to pods we might scale down?
    Break the affinity, return a different cookie

  3. Given the way our activator-based load-balancing works (cc @vagababov), allowing it to forward requests to pods outside of its range could wreak havoc on accounting (which is especially critical for good LB w/ CC).
    Not sure :)
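The best-effort semantics described in answers 1 and 2 can be sketched as routing logic (a hypothetical illustration; the names and data shapes are invented, and real Knative routing lives in the activator/networking layer):

```python
# Hypothetical sketch of "best effort" cookie affinity: reuse the
# instance named in the cookie while it is still alive and has spare
# capacity; otherwise break affinity and hand out a fresh cookie.

def route(cookie, pods, container_concurrency):
    """pods: {pod_name: in_flight_requests}. Returns (pod, cookie)."""
    if cookie in pods and pods[cookie] < container_concurrency:
        return cookie, cookie                     # affinity preserved
    # Affinity broken: the pod is gone, unhealthy, or full. Pick the
    # least loaded pod and issue a new cookie bound to it.
    pod = min(pods, key=pods.get)
    return pod, pod

pods = {"pod-a": 2, "pod-b": 0}
# Returning user whose pod still has room keeps their instance:
assert route("pod-a", pods, container_concurrency=3) == ("pod-a", "pod-a")
# Their pod is at containerConcurrency -> affinity breaks, new cookie:
assert route("pod-a", pods, container_concurrency=2) == ("pod-b", "pod-b")
```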

@steren (Contributor, Author)

steren commented Aug 20, 2020

Yep, I think I covered 1. and 2. in the proposal: "Of course, if the instance is rebooted, unhealthy, overloaded or becomes unavailable when the number of instances has been scaled down, session affinity will be broken and further requests are then routed to a different instance."

@mattmoor (Member)

My concern is the subjectivity of the semantics you're describing. How do we know if an implementation implements this properly or regresses? What's to stop me from claiming we do this today and just suck! 😉

I am guessing you're after Envoy's cookie semantics, which is already available in Istio and Contour (at least), but we have a process that needs to be followed to leverage new networking features in our public API surface. Especially since we have a non-Envoy-based implementation.
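For reference, the Envoy cookie semantics mentioned above are exposed in Istio roughly like this (a sketch of Istio's DestinationRule consistent-hash load balancing; the rule name, host, and cookie name are hypothetical, and this is not something Knative wires up today):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: game-server-affinity                      # hypothetical name
spec:
  host: game-server.default.svc.cluster.local     # hypothetical host
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:          # Envoy cookie-based sticky sessions
          name: session-affinity
          ttl: 3600s
```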

My chief concern is being able to test the semantics, but a close second is how we preserve those semantics with the activator (which isn't Envoy-based either) on the dataplane.

cc @tcnghia @nak3 @ZhiminXiang (for networking)
cc @vagababov (for activator)

I'd guess those are both solvable problems, but it's worth resolving them before we get too deep into the process above (or talk about the API surface).

@mattmoor (Member)

Oh, another important element that I forgot, which is pretty key to our expansion principles: Is this possible in Ingress v2?

@dprotaso (Member)

Given the failure modes, I don't understand why you wouldn't want to move your session state persistence to some external service, e.g. Memcached, Redis, or Apache Geode (GemFire).

E.g. if you're using Spring, there is tooling that abstracts this:
https://spring.io/projects/spring-session-data-redis
https://spring.io/projects/spring-session-data-geode
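The alternative described here, keeping session state in an external store so any instance can serve any request, can be sketched as follows (a hypothetical illustration: a dict stands in for a shared service such as Redis or Memcached, and the handler name is invented):

```python
# Sketch of externalized session state: every instance reads and writes
# the shared store, so no instance affinity is required. A module-level
# dict stands in for an external service such as Redis or Memcached.
shared_store = {}

def handle_request(session_id, instance_name, message):
    """Append a message to the session; works from any instance."""
    session = shared_store.setdefault(session_id, {"messages": []})
    session["messages"].append(message)
    session["last_instance"] = instance_name
    return session

# Two different instances serve the same session interchangeably:
handle_request("sess-1", "pod-a", "hello")
state = handle_request("sess-1", "pod-b", "world")
assert state["messages"] == ["hello", "world"]
assert state["last_instance"] == "pod-b"
```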

@nak3 (Contributor)

nak3 commented Sep 27, 2020

In Ingress v2, this seems to be under discussion (cc @hbagdi):
kubernetes-sigs/gateway-api#98 and kubernetes-sigs/gateway-api#196

@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2020
@mattmoor (Member)

mattmoor commented Jan 4, 2021

/lifecycle frozen

@knative-prow-robot knative-prow-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 4, 2021
@evankanderson (Member)

It looks like this is under discussion, but needs a concrete proposal which would explain how it fits with the different networking implementations (and how it fits with other features like traffic splits). If we want to make this part of the specification, we'd also need a way to express conformance.

/triage accepted

@knative-prow-robot knative-prow-robot added the triage/accepted Issues which should be fixed (post-triage) label Mar 22, 2021
@dprotaso dprotaso added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jun 12, 2021
@dprotaso dprotaso added this to the Icebox milestone Aug 2, 2022
@dprotaso (Member)

dprotaso commented Aug 2, 2022

Closing as dupe of #8160

@dprotaso dprotaso closed this as completed Aug 2, 2022
@dprotaso dprotaso removed this from the Icebox milestone Aug 2, 2022
7 participants