feature(serving): Add --autoscale-window #614
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: rhuss. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment. |
Not 100% sure about the option name:
I guess autoscale is the best choice here ... |
No idea about the IT test error:
Let's check whether this is a flake. /retest |
I believe this PR uses this annotation: https://github.com/knative/serving/blob/master/pkg/apis/autoscaling/register.go#L67 right? Note the comment in there says:
If so, does |
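For illustration only, here is a minimal Go sketch of how a value passed via `--autoscale-window` could end up as the autoscaling window annotation on the revision template. The annotation key matches the one linked above; the helper function, validation, and example CLI invocation are hypothetical and not the actual client code.

```go
// Hypothetical sketch: translate an --autoscale-window flag value into the
// autoscaling window annotation on a revision template's annotation map.
package main

import (
	"fmt"
	"time"
)

// Annotation key from knative/serving's pkg/apis/autoscaling/register.go.
const windowAnnotationKey = "autoscaling.knative.dev/window"

// applyAutoscaleWindow validates the flag value and writes it into the given
// annotation map (e.g. the revision template's annotations).
func applyAutoscaleWindow(annotations map[string]string, window string) error {
	d, err := time.ParseDuration(window)
	if err != nil {
		return fmt.Errorf("invalid --autoscale-window value %q: %w", window, err)
	}
	if d <= 0 {
		return fmt.Errorf("--autoscale-window must be positive, got %s", d)
	}
	annotations[windowAnnotationKey] = window
	return nil
}

func main() {
	annotations := map[string]string{}
	// e.g. "kn service update mysvc --autoscale-window 60s" (illustrative only)
	if err := applyAutoscaleWindow(annotations, "60s"); err != nil {
		panic(err)
	}
	fmt.Println(annotations)
}
```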
In fact it's not only for scaling down but for autoscaling in general (let me update that). If I understand correctly (but please correct me @markusthoemmes), this value specifies the window after which the traffic average is calculated and the:
So the shortest possible description is probably "Time window after which to determine whether to autoscale". Does this sound better? |
It's the window over which statistics are averaged. We actually recalculate every two seconds today. The side effect is that we only scale down to 0 after we receive straight 0s for the entire window. That's because of math 😂. The average will only be 0 if all values are 0. |
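A minimal sketch of the averaging behaviour described above (not the actual autoscaler code): the desired scale follows the average concurrency over the window, so it only drops to zero once every sample in the window is zero. The sample values and the per-pod concurrency target below are made up.

```go
// Sketch: desired scale from the average concurrency over the window.
package main

import (
	"fmt"
	"math"
)

// desiredScale averages the concurrency samples collected over the window and
// divides by the per-pod concurrency target, rounding up.
func desiredScale(samples []float64, target float64) int {
	if len(samples) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	avg := sum / float64(len(samples))
	return int(math.Ceil(avg / target))
}

func main() {
	// One non-zero sample anywhere in the window keeps the average above zero,
	// so the revision is not scaled to zero yet.
	fmt.Println(desiredScale([]float64{0, 0, 0, 4, 0, 0}, 10)) // 1
	// Only straight zeros over the whole window yield an average of 0.
	fmt.Println(desiredScale([]float64{0, 0, 0, 0, 0, 0}, 10)) // 0
}
```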
That's better and if we can't come up with anything else then that's ok for now. So, perhaps something more like: |
That's not true. It in fact directly affects when things are scaled to 0 because of the maths behind it. The distinction between checking after and calculating over is super important here. The window simply specifies the amount of time we look in the past for historic data, maybe that's more useful?
|
To harden my understanding: for the first scaling event to happen, you need to wait at least for this time window to pass, correct? I mean, if the window is 30 minutes, you have to wait 30 minutes until you scale up because of increased load. After that, the scaling decision is made every 2 seconds (with the data from the last 30 minutes). Is this correct? (Asking because that is what I actually observed, i.e. that no scale-up happens when I set this window really large and do a load test before that window has been reached for the first time.) |
I like that, maybe adding some hint that scale to zero happens when no request comes in during that time:
|
The following is the coverage report on the affected files.
|
I don't think this is true because I can see things scale up immediately when I set CC=1 and all pods are busy. This is why I said this window is "part" of the logic w.r.t. scaling up. But I'm sure @markusthoemmes will correct me if I'm wrong ;-) |
We don't wait for the window to fill first, no. |
Ok, then I will have to investigate further why upscale didn't work for me when setting the window to 30 minutes: with a concurrency-target of 10 I get no more than 2 containers for 100 concurrent clients. |
Depending on which version you're running, the behavior might be slightly different. As of our latest code, we assume that the entire window is "there". For a fresh revision we're therefore considering for example 29 minutes of straight 0s in your case. That'll heavily bias the scale towards the lower end. |
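A back-of-the-envelope illustration of that bias, using the numbers from this thread (30-minute window, concurrency-target 10, 100 concurrent clients). The assumption that the unobserved part of the window counts as zeros is taken from the comment above; the one minute of actual traffic is an illustrative guess, not measured data.

```go
// Illustration: a fresh revision's short burst of traffic gets averaged over
// the whole (assumed fully populated) window, biasing the scale downwards.
package main

import (
	"fmt"
	"math"
)

func main() {
	window := 30.0 * 60.0 // window length in seconds
	busy := 60.0          // seconds of actual traffic so far (assumed)
	concurrency := 100.0  // observed concurrent clients while busy
	target := 10.0        // per-pod concurrency target

	// The remaining window is treated as zeros, dragging the average down.
	avg := concurrency * busy / window
	fmt.Printf("average concurrency over the window: %.2f\n", avg) // ~3.33
	fmt.Printf("desired pods: %v\n", math.Ceil(avg/target))        // 1
}
```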
Ah, I see. But wouldn't it be better to just take what you have during the first 30 minutes to calculate the average (e.g. average only over 10 minutes if your service has been running for 10 minutes, instead of filling up 20 minutes with 0), maybe with some portion of leeway (e.g. soften it by multiplying with some weighting function or adding a shorter duration of zeros, like 5 minutes)? But that's just an autoscaling greenhorn speaking here :) |
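Purely as a sketch of that suggestion (not how the autoscaler actually works): average only over the elapsed lifetime of the revision, capped at the configured window, instead of padding the missing part with zeros. Function names and numbers here are illustrative.

```go
// Sketch: average concurrency over the elapsed lifetime, capped at the window.
package main

import (
	"fmt"
	"time"
)

// averageOverElapsed divides the accumulated concurrency-seconds by the
// smaller of the elapsed lifetime and the configured window.
func averageOverElapsed(concurrencySeconds float64, elapsed, window time.Duration) float64 {
	effective := elapsed
	if effective > window {
		effective = window
	}
	if effective <= 0 {
		return 0
	}
	return concurrencySeconds / effective.Seconds()
}

func main() {
	// 100 concurrent clients for the last minute, revision is 1 minute old,
	// window is 30 minutes: the average reflects the real load (100), not ~3.3.
	fmt.Println(averageOverElapsed(100*60, time.Minute, 30*time.Minute))
}
```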
/lgtm |
* Create presubmit jobs for knative/client. Also enable tide for the new repo.
* Add the missing coverage job/tabs. This is a workaround for knative#615.
Fixes #613