Change upstream on error when sticky session balancer is used #4048

Conversation

fedunineyu
Contributor

What this PR does / why we need it:
Currently, with sticky sessions enabled, the next upstream is not requested on error (details are in #4035).

In this PR, the Lua script for the sticky session balancer is modified so that, on error, the consistent-hash key is regenerated to point to another upstream.

Which issue this PR fixes: fixes #4035
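As a minimal sketch of the idea (illustrative only, not the exact diff in this PR; get_cookie/set_cookie are the balancer's existing helpers and self.instance:find is the consistent-hash lookup, names assumed):

-- Illustrative sketch, not the exact change merged in this PR.
local function new_key()
  -- same shape of key the sticky balancer generates for first-time clients
  return string.format("%s.%s.%s", ngx.now(), ngx.worker.pid(), math.random(999999))
end

function _M.balance(self)
  local key = get_cookie(self)               -- key stored in the sticky cookie, if any
  local state_name = _M.get_last_failure()   -- nil on the first attempt or after a success

  if key == nil or state_name ~= nil then
    -- no cookie yet, or the previous attempt failed:
    -- regenerate the key so the consistent hash maps the request to another endpoint
    key = new_key()
    set_cookie(self, key)
  end

  return self.instance:find(key)             -- consistent-hash lookup over the endpoints
end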

@k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 29, 2019
@ElvinEfendi
Member

@fedunineyu thanks for your PR, can you add unit tests for this?

@k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 29, 2019
@fedunineyu
Contributor Author

@fedunineyu thanks for your PR, can you add unit tests for this?

Here you are: 0d7029f9468722671342fa0446740dbc29662443

@fedunineyu
Contributor Author

I've run several load tests using the scenario from #4035 and noticed that the proposed fix should be updated: it doesn't work well when the failing request arrives without a sticky cookie.
In such cases several attempts can be made against the same failing upstream, because previous_upstream = nil and any generated key seems to be fine.

I'm going to obtain the failing upstream from ngx.var.upstream_addr and regenerate the key until new_upstream is not equal to it.
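Roughly, that would look like this (an illustrative sketch, not the final code; the MAX_ATTEMPTS bound is my own addition so the loop cannot spin forever when the backend has a single endpoint, and self.instance:find is the consistent-hash lookup):

-- Illustrative sketch; MAX_ATTEMPTS and helper names are assumptions.
local MAX_ATTEMPTS = 10

-- ngx.var.upstream_addr lists the addresses tried so far, comma-separated
-- (e.g. "10.244.0.5:8080, 10.244.0.7:8080"); the last entry is the attempt
-- that just failed.
local function last_failed_upstream()
  local addrs = ngx.var.upstream_addr
  if not addrs then
    return nil
  end
  return addrs:match("([^,%s]+)%s*$")
end

local function pick_different_endpoint(self)
  local failed = last_failed_upstream()
  local key, endpoint
  for _ = 1, MAX_ATTEMPTS do
    key = string.format("%s.%s.%s", ngx.now(), ngx.worker.pid(), math.random(999999))
    endpoint = self.instance:find(key)
    if endpoint ~= failed then
      return key, endpoint
    end
  end
  return key, endpoint  -- give up after MAX_ATTEMPTS (e.g. single-endpoint backend)
end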

@fedunineyu
Contributor Author

@ElvinEfendi The recent e2e test run failed:

Error from server (ServerTimeout): No API token found for service account "ingress-nginx-e2e", retry after the token is automatically created and added to the service account
make: *** [e2e-test] Error 1
make: Leaving directory `/home/travis/build/kubernetes/ingress-nginx'
The command "test/e2e/run.sh" exited with 2.

Should I restart it? If yes, how can I do that without pushing a new commit?

@fedunineyu
Contributor Author

/retest

@k8s-ci-robot
Contributor

@fedunineyu: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ElvinEfendi
Member

/ok-to-test

@k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Apr 30, 2019
@@ -59,31 +67,82 @@ local function set_cookie(self, value)
end
end

function _M.get_last_failure()
Member

Does this have to be public function?

Contributor Author

This function is made "public" so that its behavior can be overridden to simulate request failures (see this test).
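For illustration, once the function lives on the module table, a busted-style test can swap it out to simulate a failed previous attempt (module path and surrounding fixtures are assumptions, not copied from this PR's test):

-- Illustrative only; the real test in this PR may be structured differently.
local sticky = require("balancer.sticky")

-- Pretend ngx.balancer reported that the previous attempt failed:
local original_get_last_failure = sticky.get_last_failure
sticky.get_last_failure = function() return "failed", 502 end

-- ...exercise sticky:new(backend):balance() here and assert that the returned
-- endpoint differs from the one stored in the sticky cookie...

sticky.get_last_failure = original_get_last_failure  -- restore after the test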


-- use previous upstream if this is the first attempt or previous attempt succeeded
if state_name == nil and upstream_from_cookie ~= nil then
do return upstream_from_cookie end
Member

why do ... end ?

Contributor Author

Yeah, right. They are not required.
Removed in ddf738fddadfb7ea9090ed2bd6fe277a34dcb81e.
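For context: in Lua a bare return is only legal as the last statement of a block, so do return x end is a trick for returning while further statements still follow in the same block. In this balancer the return was already the last statement of the if body, so the wrapper added nothing:

-- The wrapper is required only when statements follow the return in the same
-- block; without 'do ... end' the first function would be a syntax error.
local function demo()
  do return "early" end
  print("never reached, but legal to keep around")
end

-- Here the return already ends the if body, so a plain 'return' is enough:
local function pick(state_name, upstream_from_cookie)
  if state_name == nil and upstream_from_cookie ~= nil then
    return upstream_from_cookie
  end
end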

@ElvinEfendi
Member

@fedunineyu I have not had time to review this extensively yet, but I will get to it sometime this week.

@ElvinEfendi
Member

Looking more at this PR, it's suggesting a fundamental change - you're kinda doing passive healthchecking. Please read #4035 (comment).

@fedunineyu
Contributor Author

Looking more at this PR, it's suggesting a fundamental change - you're kinda doing passive healthchecking. Please read #4035 (comment).

I've posted a reply to your comment.
I'd like to note that in this PR we simply ensure that the new upstream differs from the failed one. The response result check is stateless. So, IMHO, it neither changes any base concepts nor introduces new mechanisms, as it relies on nginx's standard behaviour on upstream failure.

@ElvinEfendi
Member

What if there was a network blip and the request to the sticky endpoint fails? With your PR you will generate a new cookie and pick a new endpoint. But it is possible that, had you retried the same endpoint, the request could have succeeded, so you unnecessarily broke stickiness.

So, IMHO, it neither changes any base concepts nor introduces new mechanisms, as it relies on nginx's standard behaviour on upstream failure.

Nginx's standard behaviour does not dictate how you choose the upstream/endpoint on retry; it's left to the balancer to decide that. Therefore I'm saying you are changing the ingress-nginx sticky balancer implementation conceptually - now you are breaking stickiness on the first failure you see, and I'm not sure this is what most people expect.

The current idea behind the stickiness implementation is simple: proxy to the same endpoint as long as it is healthy. And healthiness is defined by the Kubernetes readiness probe.

Nginx Plus seems to have a sticky cookie; I wonder what it does when the server fails? From https://nginx.org/en/docs/http/ngx_http_upstream_module.html#sticky_cookie:

If the designated server cannot process a request, the new server is selected as if the client has not been bound yet.

But that does not define when it deems that a server "cannot process a request". Is it based on healthchecking? Is it based on the first failure it sees? Is it based on max_fails, so that it chooses a new server only after the existing server has failed max_fails times?

@fedunineyu
Contributor Author

What if there was a network blip and the request to the sticky endpoint fails? With your PR you will generate a new cookie and pick a new endpoint. But it is possible that, had you retried the same endpoint, the request could have succeeded, so you unnecessarily broke stickiness.

Right you are. Stickiness will be broken for those requests that were issued during the network blip, but they will be processed by another upstream. As applications should tolerate session loss (containers can stop working, nodes can restart, and so on, right?), I think this should not be a serious issue.

So, IMHO, it neither changes any base concepts nor introduces new mechanisms, as it relies on nginx's standard behaviour on upstream failure.

Nginx's standard behaviour does not dictate how you choose the upstream/endpoint on retry; it's left to the balancer to decide that. Therefore I'm saying you are changing the ingress-nginx sticky balancer implementation conceptually - now you are breaking stickiness on the first failure you see, and I'm not sure this is what most people expect.

Ah, I see. Now I understand your comment about the fundamental change.

Returning to people's expectations...
If you are sure that the current behaviour is what most nginx ingress users expect, what if cookie regeneration on failure were made optional?
For those who want to keep the current implementation the behaviour would stay the same. Others would set an annotation, for example session-cookie-new-on-failure: true (default value is false).

@yadolov
Contributor

yadolov commented May 8, 2019

@ElvinEfendi

Nginx Plus seems to have a sticky cookie; I wonder what it does when the server fails?…

I want to mention HAProxy here. It has a redispatch option:

option redispatch
  Enable or disable session redistribution in case of connection failure

So adding a similar configuration flag (annotation) is not such a bad idea?

@ElvinEfendi
Member

@fedunineyu @yadolov I like the idea of a new configuration option using an annotation 👍

@fedunineyu
Contributor Author

In 8b0944345e808c4f00a6a2bbb1b1c3fbefcecf4e I've added support for the annotation session-cookie-change-on-failure (it seems to sound fine 😃).
I've tested both options of the new annotation in our cluster. They work as expected.
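Conceptually the new flag just gates the re-balancing path; a simplified sketch of the condition (the change_on_failure field name is an assumption based on the annotation name, not necessarily the exact merged code):

-- Illustrative sketch of the gating logic only; field names assumed.
local function should_pick_new_upstream(self)
  local state_name = _M.get_last_failure()  -- nil unless the previous attempt failed
  local previous_attempt_failed = state_name ~= nil
  return previous_attempt_failed and self.cookie_session_affinity.change_on_failure
end

-- In balance():
--   if upstream_from_cookie ~= nil and not should_pick_new_upstream(self) then
--     return upstream_from_cookie  -- default: keep stickiness even after a failure
--   end
--   ...otherwise generate a new key and pick another endpoint...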

@fedunineyu
Contributor Author

@ElvinEfendi
It seems that sticky session hashing was broken in #3743: now a "plain text" value like 1557763189.195.5491.504753 is written into the cookie (see key = string.format("%s.%s.%s", ngx.now(), ngx.worker.pid(), math.random(999999)) here) instead of its sha1() value.

I'd like to fix it in a separate PR, but to avoid conflicts it would be better to merge this PR first.
Can you please share your plans for reviewing this PR?

@ElvinEfendi
Member

It seems that sticky session hashing was broken in #3743: now a "plain text" value like 1557763189.195.5491.504753 is written into the cookie

@fedunineyu that was an intentional change. We decided that since there's no revealing information or security risk, there was no reason to hash it unnecessarily. Let me know if you think that's concerning.

@fedunineyu
Contributor Author

@ElvinEfendi
Is there anything I can do to push this PR forward, or is it your turn now?

@aledbf
Member

aledbf commented May 27, 2019

/ok-to-test

@aledbf
Member

aledbf commented May 27, 2019

@fedunineyu please squash the commits and rebase

@codecov-io

codecov-io commented May 27, 2019

Codecov Report

Merging #4048 into master will increase coverage by 0.04%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #4048      +/-   ##
=========================================
+ Coverage   57.76%   57.8%   +0.04%     
=========================================
  Files          87      87              
  Lines        6459    7037     +578     
=========================================
+ Hits         3731    4068     +337     
- Misses       2296    2512     +216     
- Partials      432     457      +25
Impacted Files Coverage Δ
internal/ingress/types.go 0% <ø> (ø) ⬆️
internal/ingress/controller/controller.go 46.52% <100%> (-0.04%) ⬇️
...ternal/ingress/annotations/sessionaffinity/main.go 55.26% <33.33%> (-1.88%) ⬇️
cmd/nginx/main.go 6.09% <0%> (-0.62%) ⬇️
internal/ingress/zz_generated.deepcopy.go 0% <0%> (ø) ⬆️
internal/ingress/controller/endpoints.go 95.12% <0%> (+0.12%) ⬆️
internal/ingress/controller/config/config.go 100% <0%> (+1.43%) ⬆️
internal/ingress/controller/nginx.go 30.7% <0%> (+1.69%) ⬆️
internal/ingress/annotations/parser/main.go 86.66% <0%> (+3.03%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24cb0e5...254629c. Read the comment docs.

1. Session cookie is updated on previous attempt failure when `session-cookie-change-on-failure = true` (default value is `false`).
2. Added tests to check both cases.
3. Updated docs.

Co-Authored-By: Vladimir Grishin <[email protected]>
@fedunineyu force-pushed the change-upstream-on-error-with-sticky-session branch from 8b09443 to 254629c on May 27, 2019 10:10
@fedunineyu
Contributor Author

@fedunineyu please squash the commits and rebase

Done!

@aledbf
Member

aledbf commented May 27, 2019

/retest

@aledbf
Member

aledbf commented May 27, 2019

/approve

@k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2019
@ElvinEfendi
Member

Give me a few more days on this.

@ElvinEfendi
Member

/lgtm

Thanks @fedunineyu !

@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 6, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, ElvinEfendi, fedunineyu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 286ff13 into kubernetes:master Jun 6, 2019
@fedunineyu deleted the change-upstream-on-error-with-sticky-session branch June 10, 2019 06:46