Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The old revision pod is still in running state with 100% traffic routing to new revision #755

Closed
jessiezcc opened this issue Apr 26, 2018 · 12 comments
Assignees
Labels
area/API API objects and controllers area/autoscale area/build Build topics specifically related to Knative area/monitoring area/test-and-release It flags unit/e2e/conformance/perf test issues for product features kind/bug Categorizes issue or PR as related to a bug. kind/doc Something isn't clear kind/feature Well-understood/specified features, ready for coding.

Comments

@jessiezcc
Copy link
Contributor

jessiezcc commented Apr 26, 2018

Expected Behavior

After routing 100% to new revision and removing all references to the old revision, I expect the old revision pod to be torn down and disappear

Actual Behavior

old revision pod is still in running state

Steps to Reproduce the Problem

  1. bazel run sample/helloworld/everything.create
  2. bazel run sample/helloworld:updated_everything.apply

jessiezhu@gobaby:~/go/src/github.com/elafros/elafros$ kubectl -n ela-system get pods
NAME READY STATUS RESTARTS AGE
configuration-example-00001-autoscaler-77696d95c6-84wml 1/1 Running 0 8m
configuration-example-00002-autoscaler-6b4dbf566b-7rknq 1/1 Running 0 7m
ela-activator-6f9d78ff7c-rstwc 1/1 Running 0 2d
ela-controller-54dcdfb6-4qdkw 1/1 Running 0 18m
ela-webhook-7c4d5c5547-mvb2p 1/1 Running 0 2d

Additional Info

status:
conditions:
- state: Ready
status: "True"
domain: route-example.default.demo-domain.com
traffic:
- percent: 100
revisionName: configuration-example-00002

@google-prow-robot
Copy link

@jessiezcc: GitHub didn't allow me to assign the following users: user.

Note that only elafros members and repo collaborators can be assigned.

In response to this:

Expected Behavior

After routing 100% to new revision, I expect the old revision pod to be torn down and disappear

Actual Behavior

old revision pod is still in running state

Steps to Reproduce the Problem

  1. bazel run sample/helloworld/everything.create
  2. bazel run sample/helloworld:updated_everything.apply

jessiezhu@gobaby:~/go/src/github.com/elafros/elafros$ kubectl -n ela-system get pods
NAME READY STATUS RESTARTS AGE
configuration-example-00001-autoscaler-77696d95c6-84wml 1/1 Running 0 8m
configuration-example-00002-autoscaler-6b4dbf566b-7rknq 1/1 Running 0 7m
ela-activator-6f9d78ff7c-rstwc 1/1 Running 0 2d
ela-controller-54dcdfb6-4qdkw 1/1 Running 0 18m
ela-webhook-7c4d5c5547-mvb2p 1/1 Running 0 2d

Additional Info

status:
conditions:

  • state: Ready
    status: "True"
    domain: route-example.default.demo-domain.com
    traffic:
  • percent: 100
    revisionName: configuration-example-00002

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-prow-robot google-prow-robot added area/API API objects and controllers area/autoscale area/build Build topics specifically related to Knative area/monitoring area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features kind/bug Categorizes issue or PR as related to a bug. kind/feature Well-understood/specified features, ready for coding. kind/doc Something isn't clear labels Apr 26, 2018
@yanweiguo yanweiguo removed area/API API objects and controllers area/build Build topics specifically related to Knative area/monitoring area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features kind/bug Categorizes issue or PR as related to a bug. kind/doc Something isn't clear labels Apr 27, 2018
@yanweiguo
Copy link
Contributor

I think 1->0 scaling should fix this. @akyyy could you have a look?

@akyyy
Copy link
Contributor

akyyy commented Apr 27, 2018

When you change the traffic weights, the activator may not be involved if the revisions are active all the time. So activator could not fix this.

@akyyy
Copy link
Contributor

akyyy commented Apr 27, 2018

Actually, if the expectation is the old pod should be gone eventually (e.g. 5 minutes by default), then yes, the new 1->0 code path can fix this.
To enable activator, you can set enable-scale-to-zero to true.

@jessiezcc
Copy link
Contributor Author

jessiezcc commented Apr 27, 2018 via email

@josephburnett
Copy link
Contributor

yes, this should still work when enable-scale-to-zero is turned off. when the revision is no longer routable, it should be transitioned to the Retired state and be torn down. it sounds like this is not happening.

@google-prow-robot google-prow-robot added area/API API objects and controllers area/build Build topics specifically related to Knative area/monitoring area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features kind/bug Categorizes issue or PR as related to a bug. kind/doc Something isn't clear labels May 29, 2018
@glyn
Copy link
Contributor

glyn commented Jun 12, 2018

@josephburnett Unless @markusthoemmes is actively working on this, I'd like to investigate the issue.

@josephburnett josephburnett assigned glyn and unassigned markusthoemmes Jun 12, 2018
@glyn
Copy link
Contributor

glyn commented Jun 13, 2018

I've reproduced the issue (with enable-scale-to-zero turned off), but I'm wondering about the UX. You see, after the fact, the user can split the traffic between the old and new revisions and they will both continue to work. So automatically deleting a revision with 0% traffic routed to it presumes that the user will not want to switch traffic back to that revision later. For instance, they might deploy a new version of an application and after 100% of traffic is routed to the new revision, they could discover a problem and want to route the majority of traffic to the previous revision.

Another way of looking at the UX is in terms of ease of reversing actions. If the user splits traffic 99% to the new revision and 1% to the old revision, they can back out of this very easily. But if they end up with 100% of traffic going to the new revision and we automatically prune the old revision, it’s harder for them to reverse what they just did (suppose they made a mistake).

I wonder if we should make scaling to zero more intelligent and have it apply to revisions which are no longer routable, even if enable-scale-to-zero is turned off. That way, old revision pods could be resurrected if needed. Admittedly that adds complexity and might confuse some users.

Another option would be to leave the behaviour the way it is, especially now that enable-scale-to-zero is turned on by default. It may be reasonable for expect users who choose to turn off enable-scale-to-zero to manage their revisions more carefully and to cope with pruning unroutable revisions. I don't like this option as we are effectively leaking revision pods.

Thoughts?

@josephburnett
Copy link
Contributor

I don't plan to support turning off scale-to-zero. That flag is there just as a way to roll out the change. all revisions should scale to zero.

But there has been some more discussion and design around serving states which outlines a better architecture: #645 (comment) So please disregard my comment about transitioning to Retired.

@glyn
Copy link
Contributor

glyn commented Jun 13, 2018

I see, thanks. So the simplest fix for this issue is to wait until scaling to zero by default has "bedded in" and then delete the enable-scale-to-zero flag.

@mattmoor
Copy link
Member

mattmoor commented Jul 8, 2018

Scale to zero is enabled by default

@josephburnett
Copy link
Contributor

Created #1531 to track clean up Reserve Revisions no longer routable.

nak3 added a commit to nak3/serving that referenced this issue May 6, 2021
Since knative@e2a8237, `test/config` directory needs ytt command.

This patch uses ytt.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/API API objects and controllers area/autoscale area/build Build topics specifically related to Knative area/monitoring area/test-and-release It flags unit/e2e/conformance/perf test issues for product features kind/bug Categorizes issue or PR as related to a bug. kind/doc Something isn't clear kind/feature Well-understood/specified features, ready for coding.
Projects
None yet
Development

No branches or pull requests

9 participants