
Proposal: Revision attribute for instance session affinity #9039

Closed
steren opened this issue Aug 12, 2020 · 11 comments
Labels

  • area/API: API objects and controllers
  • area/networking
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Well-understood/specified features, ready for coding.
  • kind/spec: Discussion of how a feature should be exposed to customers.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • triage/accepted: Issues which should be fixed (post-triage)

Comments

@steren (Contributor)

steren commented Aug 12, 2020

/area API
/kind spec

Summary

Add a Revision-level attribute to enable session affinity to container instances.

Use cases:

  • Real time multi-player apps: Libraries like socket.io default to long polling and upgrade to WebSockets if possible. When doing long polling or when needing to reconnect after loss of a WS connection, the library expects to reach the same server.

Proposal

Introduce a new Revision-level attribute (e.g. spec.sessionAffinity: true), defaulting to false.
When set to true, Knative routes sequential requests from a given user to the same Revision container instance.
Knative would inspect the value of a cookie to identify requests from the same user and direct all such requests to the same instance.
Of course, if the instance is rebooted, becomes unhealthy or overloaded, or is removed when the number of instances is scaled down, session affinity is broken and further requests are routed to a different instance. This is "best effort" affinity, a consequence of Knative's autoscaling nature.
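A minimal sketch of what the proposed attribute could look like on a Knative Service (the sessionAffinity field is the hypothetical attribute proposed in this issue, not an existing Knative API; the service name and image are placeholders):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: multiplayer-game          # hypothetical service
spec:
  template:
    spec:
      sessionAffinity: true      # hypothetical field proposed here
      containers:
        - image: gcr.io/my-project/game-server   # placeholder image
```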

@steren steren added the kind/feature Well-understood/specified features, ready for coding. label Aug 12, 2020
@knative-prow-robot knative-prow-robot added area/API API objects and controllers kind/spec Discussion of how a feature should be exposed to customers. labels Aug 12, 2020
@mattmoor (Member)

/area networking

cc @tcnghia @nak3 @ZhiminXiang
cc @evankanderson @dprotaso

I think it's probably prudent to defer API discussions to after we've nailed down the semantics. We basically provide zero guarantees today that a Pod even continues to exist outside the bounds of a request. There are also a wide variety of gotchas here:

  1. What happens when a user is using containerConcurrency and the pod they were hitting filled up?
  2. What happens when Pods are generally underutilized, do we allow folks to reconnect to pods we might scale down?
  3. Given the way our activator-based load-balancing works (cc @vagababov), allowing it to forward requests to pods outside of its range could wreak havoc on accounting (which is especially critical for good LB w/ CC).

I could see configuring the networking layer to try to prefer these sorts of things even without API changes, but "best effort" and "conformance" don't really mix for me. I'd love to hear others' thoughts.

@wlhee

wlhee commented Aug 20, 2020

My intuition: this is best effort affinity.

  1. What happens when a user is using containerConcurrency and the pod they were hitting filled up?
    Break the affinity

  2. What happens when Pods are generally underutilized, do we allow folks to reconnect to pods we might scale down?
    Break the affinity, return a different cookie

  3. Given the way our activator-based load-balancing works (cc @vagababov), allowing it to forward requests to pods outside of its range could wreak havoc on accounting (which is especially critical for good LB w/ CC).
    Not sure :)
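The best-effort semantics described in answers 1 and 2 can be sketched as routing logic (a hypothetical illustration; the names and data shapes are invented, and real Knative routing lives in the activator/networking layer):

```python
# Hypothetical sketch of "best effort" cookie affinity: reuse the
# instance named in the cookie while it is still alive and has spare
# capacity; otherwise break affinity and hand out a fresh cookie.

def route(cookie, pods, container_concurrency):
    """pods: {pod_name: in_flight_requests}. Returns (pod, cookie)."""
    if cookie in pods and pods[cookie] < container_concurrency:
        return cookie, cookie                     # affinity preserved
    # Affinity broken: the pod is gone, unhealthy, or full. Pick the
    # least loaded pod and issue a new cookie bound to it.
    pod = min(pods, key=pods.get)
    return pod, pod

pods = {"pod-a": 2, "pod-b": 0}
# Returning user whose pod still has room keeps their instance:
assert route("pod-a", pods, container_concurrency=3) == ("pod-a", "pod-a")
# Their pod is at containerConcurrency -> affinity breaks, new cookie:
assert route("pod-a", pods, container_concurrency=2) == ("pod-b", "pod-b")
```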

@steren (Contributor, Author)

steren commented Aug 20, 2020

Yep, I think I covered 1. and 2. in the proposal: "Of course, if the instance is rebooted, unhealthy, overloaded or becomes unavailable when the number of instances has been scaled down, session affinity will be broken and further requests are then routed to a different instance."

@mattmoor (Member)

My concern is the subjectivity of the semantics you're describing. How do we know if an implementation implements this properly or regresses? What's to stop me from claiming we do this today and just suck! 😉

I am guessing you're after Envoy's cookie semantics, which is already available in Istio and Contour (at least), but we have a process that needs to be followed to leverage new networking features in our public API surface. Especially since we have a non-Envoy-based implementation.
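For reference, the Envoy cookie semantics mentioned above are exposed in Istio roughly like this (a sketch of Istio's DestinationRule consistent-hash load balancing; the rule name, host, and cookie name are hypothetical, and this is not something Knative wires up today):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: game-server-affinity                      # hypothetical name
spec:
  host: game-server.default.svc.cluster.local     # hypothetical host
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:          # Envoy cookie-based sticky sessions
          name: session-affinity
          ttl: 3600s
```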

My chief concern is being able to test the semantics, but a close second is how we preserve those semantics with the activator (which isn't Envoy-based either) on the dataplane.

cc @tcnghia @nak3 @ZhiminXiang (for networking)
cc @vagababov (for activator)

I'd guess those are both solvable problems, but it's worth resolving them before we get too deep into the process above (or talk about the API surface).

@mattmoor (Member)

Oh, another important element that I forgot, which is pretty key to our expansion principles: Is this possible in Ingress v2?

@dprotaso (Member)

Given the failure modes, I don't understand why you wouldn't want to move your session state persistence to some external service, e.g. Memcached, Redis, or Apache Geode (GemFire).

E.g. if you're using Spring, there is tooling that abstracts this:
https://spring.io/projects/spring-session-data-redis
https://spring.io/projects/spring-session-data-geode
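The alternative described here, keeping session state in an external store so any instance can serve any request, can be sketched as follows (a hypothetical illustration: a dict stands in for a shared service such as Redis or Memcached, and the handler name is invented):

```python
# Sketch of externalized session state: every instance reads and writes
# the shared store, so no instance affinity is required. A module-level
# dict stands in for an external service such as Redis or Memcached.
shared_store = {}

def handle_request(session_id, instance_name, message):
    """Append a message to the session; works from any instance."""
    session = shared_store.setdefault(session_id, {"messages": []})
    session["messages"].append(message)
    session["last_instance"] = instance_name
    return session

# Two different instances serve the same session interchangeably:
handle_request("sess-1", "pod-a", "hello")
state = handle_request("sess-1", "pod-b", "world")
assert state["messages"] == ["hello", "world"]
assert state["last_instance"] == "pod-b"
```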

@nak3 (Contributor)

nak3 commented Sep 27, 2020

In Ingress v2, this seems to be under discussion (cc @hbagdi):
kubernetes-sigs/gateway-api#98 and kubernetes-sigs/gateway-api#196

@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2020
@mattmoor (Member)

mattmoor commented Jan 4, 2021

/lifecycle frozen

@knative-prow-robot knative-prow-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 4, 2021
@evankanderson (Member)

It looks like this is under discussion, but needs a concrete proposal which would explain how it fits with the different networking implementations (and how it fits with other features like traffic splits). If we want to make this part of the specification, we'd also need a way to express conformance.

/triage accepted

@knative-prow-robot knative-prow-robot added the triage/accepted Issues which should be fixed (post-triage) label Mar 22, 2021
@dprotaso dprotaso added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jun 12, 2021
@dprotaso dprotaso added this to the Icebox milestone Aug 2, 2022
@dprotaso (Member)

dprotaso commented Aug 2, 2022

Closing as dupe of #8160

@dprotaso dprotaso closed this as completed Aug 2, 2022
@dprotaso dprotaso removed this from the Icebox milestone Aug 2, 2022
7 participants