Change upstream on error when sticky session balancer is used #4048

Conversation

fedunineyu
Contributor

What this PR does / why we need it:
Currently, with sticky sessions enabled, the next upstream is not requested on error (details are in #4035).

In this PR, the Lua script for the sticky session balancer is modified so that, on error, the consistent-hash key is regenerated to point to another upstream.

Which issue this PR fixes: fixes #4035
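As a minimal sketch of the idea (illustrative only, not the exact diff in this PR; get_cookie/set_cookie are the balancer's existing helpers and self.instance:find is the consistent-hash lookup, names assumed):

-- Illustrative sketch, not the exact change merged in this PR.
local function new_key()
  -- same shape of key the sticky balancer generates for first-time clients
  return string.format("%s.%s.%s", ngx.now(), ngx.worker.pid(), math.random(999999))
end

function _M.balance(self)
  local key = get_cookie(self)               -- key stored in the sticky cookie, if any
  local state_name = _M.get_last_failure()   -- nil on the first attempt or after a success

  if key == nil or state_name ~= nil then
    -- no cookie yet, or the previous attempt failed:
    -- regenerate the key so the consistent hash maps the request to another endpoint
    key = new_key()
    set_cookie(self, key)
  end

  return self.instance:find(key)             -- consistent-hash lookup over the endpoints
end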

@k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 29, 2019
@ElvinEfendi
Member

@fedunineyu thanks for your PR, can you add unit tests for this?

@k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 29, 2019
@fedunineyu
Contributor Author

@fedunineyu thanks for your PR, can you add unit tests for this?

Here you are: 0d7029f9468722671342fa0446740dbc29662443

@fedunineyu
Contributor Author

I've run several load tests using the scenario from #4035 and noticed that the proposed fix should be updated: it doesn't work well when the failing request arrives without a sticky cookie.
In such cases several attempts can be made against the same failing upstream, because previous_upstream = nil and any generated key seems to be fine.

I'm going to obtain the failing upstream from ngx.var.upstream_addr and regenerate the key until new_upstream is not equal to it.
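Roughly, that would look like this (an illustrative sketch, not the final code; the MAX_ATTEMPTS bound is my own addition so the loop cannot spin forever when the backend has a single endpoint, and self.instance:find is the consistent-hash lookup):

-- Illustrative sketch; MAX_ATTEMPTS and helper names are assumptions.
local MAX_ATTEMPTS = 10

-- ngx.var.upstream_addr lists the addresses tried so far, comma-separated
-- (e.g. "10.244.0.5:8080, 10.244.0.7:8080"); the last entry is the attempt
-- that just failed.
local function last_failed_upstream()
  local addrs = ngx.var.upstream_addr
  if not addrs then
    return nil
  end
  return addrs:match("([^,%s]+)%s*$")
end

local function pick_different_endpoint(self)
  local failed = last_failed_upstream()
  local key, endpoint
  for _ = 1, MAX_ATTEMPTS do
    key = string.format("%s.%s.%s", ngx.now(), ngx.worker.pid(), math.random(999999))
    endpoint = self.instance:find(key)
    if endpoint ~= failed then
      return key, endpoint
    end
  end
  return key, endpoint  -- give up after MAX_ATTEMPTS (e.g. single-endpoint backend)
end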

@fedunineyu
Contributor Author

@ElvinEfendi The recent e2e test run failed:

Error from server (ServerTimeout): No API token found for service account "ingress-nginx-e2e", retry after the token is automatically created and added to the service account
make: *** [e2e-test] Error 1
make: Leaving directory `/home/travis/build/kubernetes/ingress-nginx'
The command "test/e2e/run.sh" exited with 2.

Should I restart it? If yes, how can I do that without pushing a new commit?

@fedunineyu
Contributor Author

/retest

@k8s-ci-robot
Contributor

@fedunineyu: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ElvinEfendi
Member

/ok-to-test

@k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Apr 30, 2019
@@ -59,31 +67,82 @@ local function set_cookie(self, value)
end
end

function _M.get_last_failure()
Member

Does this have to be public function?

Contributor Author

This function is made "public" so that its behavior can be overridden to simulate request failures (see this test).
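For illustration, once the function lives on the module table, a busted-style test can swap it out to simulate a failed previous attempt (module path and surrounding fixtures are assumptions, not copied from this PR's test):

-- Illustrative only; the real test in this PR may be structured differently.
local sticky = require("balancer.sticky")

-- Pretend ngx.balancer reported that the previous attempt failed:
local original_get_last_failure = sticky.get_last_failure
sticky.get_last_failure = function() return "failed", 502 end

-- ...exercise sticky:new(backend):balance() here and assert that the returned
-- endpoint differs from the one stored in the sticky cookie...

sticky.get_last_failure = original_get_last_failure  -- restore after the test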


-- use previous upstream if this is the first attempt or previous attempt succeeded
if state_name == nil and upstream_from_cookie ~= nil then
do return upstream_from_cookie end
Member

why do ... end ?

Contributor Author

Yeah, right. They are not required.
Removed in ddf738fddadfb7ea9090ed2bd6fe277a34dcb81e.
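For context: in Lua a bare return is only legal as the last statement of a block, so do return x end is a trick for returning while further statements still follow in the same block. In this balancer the return was already the last statement of the if body, so the wrapper added nothing:

-- The wrapper is required only when statements follow the return in the same
-- block; without 'do ... end' the first function would be a syntax error.
local function demo()
  do return "early" end
  print("never reached, but legal to keep around")
end

-- Here the return already ends the if body, so a plain 'return' is enough:
local function pick(state_name, upstream_from_cookie)
  if state_name == nil and upstream_from_cookie ~= nil then
    return upstream_from_cookie
  end
end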

@ElvinEfendi
Member

@fedunineyu I have not had time to review this extensively yet, but I will get to it sometime this week.

@ElvinEfendi
Member

Looking more at this PR, it's suggesting a fundamental change - you're kinda doing passive healthchecking. Please read #4035 (comment).

@fedunineyu
Contributor Author

Looking more at this PR, it's suggesting a fundamental change - you're kinda doing passive healthchecking. Please read #4035 (comment).

I've posted a reply to your comment.
I'd like to note that in this PR we simply ensure that the new upstream differs from the failed one. The response result check is stateless. So, IMHO, it neither changes any base concepts nor introduces new mechanisms, as it relies on nginx's standard behaviour on upstream failure.

@ElvinEfendi
Member

What if there was a network blip and the request to the sticky endpoint fails? With your PR you will generate a new cookie and pick a new endpoint. But it is possible that, had you retried the same endpoint, the request could have succeeded, so you unnecessarily broke stickiness.

So, IMHO, it neither changes any base concepts nor introduces new mechanisms, as it relies on nginx's standard behaviour on upstream failure.

Nginx's standard behaviour does not dictate how you choose the upstream/endpoint on retry; it's left to the balancer to decide that. Therefore I'm saying you are changing the ingress-nginx sticky balancer implementation conceptually - now you are breaking stickiness on the first failure you see, and I'm not sure this is what most people expect.

The current idea behind the stickiness implementation is simple: proxy to the same endpoint as long as it is healthy. And healthiness is defined by the Kubernetes readiness probe.

Nginx Plus seems to have a sticky cookie; I wonder what it does when the server fails? From https://nginx.org/en/docs/http/ngx_http_upstream_module.html#sticky_cookie:

If the designated server cannot process a request, the new server is selected as if the client has not been bound yet.

But that does not define when it deems that a server "cannot process a request". Is it based on healthchecking? Is it based on the first failure it sees? Is it based on max_fails, so that it chooses a new server only after the existing server has failed max_fails times?

@fedunineyu
Contributor Author

What if there was a network blip and the request to the sticky endpoint fails? With your PR you will generate a new cookie and pick a new endpoint. But it is possible that, had you retried the same endpoint, the request could have succeeded, so you unnecessarily broke stickiness.

Right you are. Stickiness will be broken for those requests that were issued during the network blip, but they will be processed by another upstream. As applications should tolerate session loss (containers can stop working, nodes can restart, and so on, right?), I think this should not be a serious issue.

So, IMHO, it neither changes any base concepts nor introduces new mechanisms, as it relies on nginx's standard behaviour on upstream failure.

Nginx's standard behaviour does not dictate how you choose the upstream/endpoint on retry; it's left to the balancer to decide that. Therefore I'm saying you are changing the ingress-nginx sticky balancer implementation conceptually - now you are breaking stickiness on the first failure you see, and I'm not sure this is what most people expect.

Ah, I see. Now I understand your comment about the fundamental change.

Returning to people's expectations...
If you are sure that the current behaviour is what most nginx ingress users expect, what if cookie regeneration on failure were made optional?
For those who want to keep the current implementation the behaviour would stay the same. Others would set an annotation, for example session-cookie-new-on-failure: true (default value is false).

@yadolov
Contributor

yadolov commented May 8, 2019

@ElvinEfendi

Nginx Plus seems to have a sticky cookie; I wonder what it does when the server fails?…

I want to mention HAProxy here. It has a redispatch option:

option redispatch
  Enable or disable session redistribution in case of connection failure

So adding a similar configuration flag (annotation) is not such a bad idea?

@ElvinEfendi
Member

@fedunineyu @yadolov I like the idea of a new configuration option using an annotation 👍

@fedunineyu
Contributor Author

In 8b0944345e808c4f00a6a2bbb1b1c3fbefcecf4e I've added support for the annotation session-cookie-change-on-failure (it seems to sound fine 😃).
I've tested both options of the new annotation in our cluster. They work as expected.
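Conceptually the new flag just gates the re-balancing path; a simplified sketch of the condition (the change_on_failure field name is an assumption based on the annotation name, not necessarily the exact merged code):

-- Illustrative sketch of the gating logic only; field names assumed.
local function should_pick_new_upstream(self)
  local state_name = _M.get_last_failure()  -- nil unless the previous attempt failed
  local previous_attempt_failed = state_name ~= nil
  return previous_attempt_failed and self.cookie_session_affinity.change_on_failure
end

-- In balance():
--   if upstream_from_cookie ~= nil and not should_pick_new_upstream(self) then
--     return upstream_from_cookie  -- default: keep stickiness even after a failure
--   end
--   ...otherwise generate a new key and pick another endpoint...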

@fedunineyu
Contributor Author

@ElvinEfendi
It seems that sticky session hashing was broken in #3743: now a "plain text" value like 1557763189.195.5491.504753 is written into the cookie (see key = string.format("%s.%s.%s", ngx.now(), ngx.worker.pid(), math.random(999999)) here) instead of its sha1() value.

I'd like to fix it in a separate PR, but to avoid conflicts it would be better to merge this PR first.
Can you please share your plans for reviewing this PR?

@ElvinEfendi
Member

It seems that sticky session hashing was broken in #3743: now a "plain text" value like 1557763189.195.5491.504753 is written into the cookie

@fedunineyu that was an intentional change. We decided that since there's no revealing information or security risk, there was no reason to hash it unnecessarily. Let me know if you think that's concerning.

@fedunineyu
Contributor Author

@ElvinEfendi
Is there anything I can do to push this PR forward, or is it your turn now?

@aledbf
Member

aledbf commented May 27, 2019

/ok-to-test

@aledbf
Member

aledbf commented May 27, 2019

@fedunineyu please squash the commits and rebase

@codecov-io

codecov-io commented May 27, 2019

Codecov Report

Merging #4048 into master will increase coverage by 0.04%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #4048      +/-   ##
=========================================
+ Coverage   57.76%   57.8%   +0.04%     
=========================================
  Files          87      87              
  Lines        6459    7037     +578     
=========================================
+ Hits         3731    4068     +337     
- Misses       2296    2512     +216     
- Partials      432     457      +25
Impacted Files Coverage Δ
internal/ingress/types.go 0% <ø> (ø) ⬆️
internal/ingress/controller/controller.go 46.52% <100%> (-0.04%) ⬇️
...ternal/ingress/annotations/sessionaffinity/main.go 55.26% <33.33%> (-1.88%) ⬇️
cmd/nginx/main.go 6.09% <0%> (-0.62%) ⬇️
internal/ingress/zz_generated.deepcopy.go 0% <0%> (ø) ⬆️
internal/ingress/controller/endpoints.go 95.12% <0%> (+0.12%) ⬆️
internal/ingress/controller/config/config.go 100% <0%> (+1.43%) ⬆️
internal/ingress/controller/nginx.go 30.7% <0%> (+1.69%) ⬆️
internal/ingress/annotations/parser/main.go 86.66% <0%> (+3.03%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24cb0e5...254629c. Read the comment docs.

1. Session cookie is updated on previous attempt failure when `session-cookie-change-on-failure = true` (default value is `false`).
2. Added tests to check both cases.
3. Updated docs.

Co-Authored-By: Vladimir Grishin <[email protected]>
@fedunineyu force-pushed the change-upstream-on-error-with-sticky-session branch from 8b09443 to 254629c on May 27, 2019 10:10
@fedunineyu
Contributor Author

@fedunineyu please squash the commits and rebase

Done!

@aledbf
Member

aledbf commented May 27, 2019

/retest

@aledbf
Member

aledbf commented May 27, 2019

/approve

@k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2019
@ElvinEfendi
Member

Give me a few more days on this.

@ElvinEfendi
Member

/lgtm

Thanks @fedunineyu !

@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 6, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, ElvinEfendi, fedunineyu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 286ff13 into kubernetes:master Jun 6, 2019
@fedunineyu deleted the change-upstream-on-error-with-sticky-session branch June 10, 2019 06:46