The old revision pod is still in running state with 100% traffic routing to new revision #755

jessiezcc · 2018-04-26T22:47:06Z

Expected Behavior

After routing 100% to new revision and removing all references to the old revision, I expect the old revision pod to be torn down and disappear

Actual Behavior

old revision pod is still in running state

Steps to Reproduce the Problem

bazel run sample/helloworld:everything.create
bazel run sample/helloworld:updated_everything.apply

jessiezhu@gobaby:~/go/src/github.com/elafros/elafros$ kubectl -n ela-system get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
configuration-example-00001-autoscaler-77696d95c6-84wml   1/1     Running   0          8m
configuration-example-00002-autoscaler-6b4dbf566b-7rknq   1/1     Running   0          7m
ela-activator-6f9d78ff7c-rstwc                            1/1     Running   0          2d
ela-controller-54dcdfb6-4qdkw                             1/1     Running   0          18m
ela-webhook-7c4d5c5547-mvb2p                              1/1     Running   0          2d

Additional Info

status:
  conditions:
  - state: Ready
    status: "True"
  domain: route-example.default.demo-domain.com
  traffic:
  - percent: 100
    revisionName: configuration-example-00002
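
The mismatch between the route status and the pod listing can be made concrete with a small check. This is only an illustrative sketch, not part of elafros: the helper name `unrouted_revisions` is hypothetical, and the revision names are inferred from the `-autoscaler-` pod names in the listing above.

```python
def unrouted_revisions(traffic, revisions):
    """Return the revisions that receive no traffic, in their original order."""
    # A revision counts as routed only if some traffic target gives it a non-zero share.
    routed = {t["revisionName"] for t in traffic if t.get("percent", 0) > 0}
    return [r for r in revisions if r not in routed]


# Traffic block copied from the route status in this report.
traffic = [{"percent": 100, "revisionName": "configuration-example-00002"}]

# Revisions that still have running pods (assumption: derived from the pod names).
revisions = ["configuration-example-00001", "configuration-example-00002"]

stale = unrouted_revisions(traffic, revisions)
print(stale)  # ['configuration-example-00001']
```

Under these assumptions, `configuration-example-00001` is the revision whose pod would be expected to go away.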

google-prow-robot · 2018-04-26T22:47:07Z

@jessiezcc: GitHub didn't allow me to assign the following users: user.

Note that only elafros members and repo collaborators can be assigned.


yanweiguo · 2018-04-27T16:42:44Z

I think 1->0 scaling should fix this. @akyyy could you have a look?

akyyy · 2018-04-27T17:28:46Z

When you change the traffic weights, the activator may not be involved if the revisions stay active the whole time, so the activator cannot fix this.

akyyy · 2018-04-27T17:42:22Z

Actually, if the expectation is that the old pod should be gone eventually (e.g. after 5 minutes by default), then yes, the new 1->0 code path can fix this.
To enable the activator, set enable-scale-to-zero to true.

jessiezcc · 2018-04-27T18:07:35Z

I tested with scale-to-zero turned on with Joe, and the old revision pod did go away. The question here is: should it work even when enable-scale-to-zero is turned off?


josephburnett · 2018-05-10T16:11:42Z

Yes, this should still work when enable-scale-to-zero is turned off. When the revision is no longer routable, it should be transitioned to the Retired state and be torn down. It sounds like this is not happening.

glyn · 2018-06-12T15:55:58Z

@josephburnett Unless @markusthoemmes is actively working on this, I'd like to investigate the issue.

glyn · 2018-06-13T14:58:05Z

I've reproduced the issue (with enable-scale-to-zero turned off), but I'm wondering about the UX. You see, after the fact, the user can split the traffic between the old and new revisions and they will both continue to work. So automatically deleting a revision with 0% traffic routed to it presumes that the user will not want to switch traffic back to that revision later. For instance, they might deploy a new version of an application and after 100% of traffic is routed to the new revision, they could discover a problem and want to route the majority of traffic to the previous revision.

Another way of looking at the UX is in terms of ease of reversing actions. If the user splits traffic 99% to the new revision and 1% to the old revision, they can back out of this very easily. But if they end up with 100% of traffic going to the new revision and we automatically prune the old revision, it’s harder for them to reverse what they just did (suppose they made a mistake).

I wonder if we should make scaling to zero more intelligent and have it apply to revisions which are no longer routable, even if enable-scale-to-zero is turned off. That way, old revision pods could be resurrected if needed. Admittedly that adds complexity and might confuse some users.

Another option would be to leave the behaviour the way it is, especially now that enable-scale-to-zero is turned on by default. It may be reasonable to expect users who choose to turn off enable-scale-to-zero to manage their revisions more carefully and to cope with pruning unroutable revisions themselves. I don't like this option, as we are effectively leaking revision pods.
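
The difference between the last two options can be sketched as a decision function. This is a hypothetical illustration of the trade-off, not elafros code: `Action`, `action_for`, and the `scale_unroutable` switch are made-up names, and the real autoscaler logic may look quite different.

```python
from enum import Enum


class Action(Enum):
    KEEP_RUNNING = "keep running"
    SCALE_TO_ZERO = "scale to zero"


def action_for(routable: bool, idle: bool,
               scale_to_zero_enabled: bool, scale_unroutable: bool) -> Action:
    """Decide what to do with a revision's pods.

    scale_unroutable models the proposal above: revisions that are no longer
    routable are scaled to zero even when enable-scale-to-zero is off.
    """
    if not routable and scale_unroutable:
        # Proposed behaviour: reclaim the pods but keep the revision itself,
        # so traffic can be switched back to it later.
        return Action.SCALE_TO_ZERO
    if scale_to_zero_enabled and idle:
        # Ordinary 1->0 path for revisions that have stopped receiving requests.
        return Action.SCALE_TO_ZERO
    return Action.KEEP_RUNNING


# Behaviour reported in this issue: flag off, old revision unroutable, pod stays up.
print(action_for(routable=False, idle=True,
                 scale_to_zero_enabled=False, scale_unroutable=False))
# With the proposal, the same revision would be scaled to zero instead.
print(action_for(routable=False, idle=True,
                 scale_to_zero_enabled=False, scale_unroutable=True))
```

In both cases the revision object itself is kept, which is what makes switching traffic back possible; only the pods are reclaimed.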

Thoughts?

josephburnett · 2018-06-13T15:21:36Z

I don't plan to support turning off scale-to-zero. That flag is there just as a way to roll out the change. All revisions should scale to zero.

But there has been some more discussion and design around serving states, which outlines a better architecture: #645 (comment). So please disregard my comment about transitioning to Retired.

glyn · 2018-06-13T15:37:01Z

I see, thanks. So the simplest fix for this issue is to wait until scaling to zero by default has "bedded in" and then delete the enable-scale-to-zero flag.

mattmoor · 2018-07-08T16:24:52Z

Scale to zero is enabled by default

josephburnett · 2018-07-09T15:58:27Z

Created #1531 to track cleaning up Reserve Revisions that are no longer routable.


josephburnett assigned markusthoemmes May 10, 2018

josephburnett assigned glyn and unassigned markusthoemmes Jun 12, 2018

glyn mentioned this issue Jun 14, 2018

RevisionServingStateRetired is never set #1203

Closed

tcnghia removed the area/networking label Jun 27, 2018

mattmoor closed this as completed Jul 8, 2018

josephburnett mentioned this issue Jul 9, 2018

Retire Revisions after no longer routable #1531

Closed

nicolaferraro mentioned this issue Dec 14, 2018

Old revisions running forever when using minScale #2720

Closed

nak3 added a commit to nak3/serving that referenced this issue May 6, 2021

Use ytt command to apply test/config (knative#755)

f810700

Since knative@e2a8237, `test/config` directory needs ytt command. This patch uses ytt.
