Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(serving): Add --autoscale-window #614

Merged
merged 3 commits into from
Jan 22, 2020

Conversation

rhuss
Copy link
Contributor

@rhuss rhuss commented Jan 21, 2020

Fixes #613

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Jan 21, 2020
@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 21, 2020
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhuss

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 21, 2020
@rhuss
Copy link
Contributor Author

rhuss commented Jan 21, 2020

Not 100% sure about the option name:

  • --autoscale-window
  • --scale-window
  • --window

'guess autoscale is the best choice here ...

@rhuss
Copy link
Contributor Author

rhuss commented Jan 21, 2020

No idea about the IT test error:

   --- FAIL: TestTrafficSplit/update_tag_from_testing_to_staging_for_@latest_revision (6.58s)
        traffic_split_test.go:372: assertion failed: 
            --- expectedTargets
            +++ formattedActualTargets
              []e2e.TargetFields{
              	{
            - 		Tag:      "staging",
            + 		Tag:      "testing",
              		Revision: "echo5-rev-1",
              		Percent:  100,
              		Latest:   true,
              	},
              }
            
    --- PASS: TestTrafficSplit/update_tag_from_testing_to_staging_for_a_revision_(non_@latest) (12.73s)

Let's check whether this is a flake.

/retest

@duglin
Copy link
Contributor

duglin commented Jan 21, 2020

I believe this PR uses this annotation: https://github.com/knative/serving/blob/master/pkg/apis/autoscaling/register.go#L67 right?

Note the comment in there says:

// WindowAnnotationKey is the annotation to specify the time
// interval over which to calculate the average metric

If so, does Autoscale window duration for when to scale down accurately describe it? When I read the cmd line option text it sounds like the value used will be the number of seconds before an idle pod will be killed. But is that true? I'm not fully up to speed on the usage of the 'window' but from the text in the golang code it looks more like it's used as part of the averaging logic, and not necessarily an upper-bound on the idle timeout of a pod. If I'm correct, then I think the option text needs to be changed to more accurately reflect what this value is used for.... or if we're looking to modify the idle timeout, we may need to tweak a different value in the yaml.

@rhuss
Copy link
Contributor Author

rhuss commented Jan 21, 2020

If so, does Autoscale window duration for when to scale down accurately describe it?

In fact its not only for scaling down but for autoscaling in general (let me update that). If I understand correctly (but please correct me @markusthoemmes ), this value specifies the window after which the traffic average is caculated and the:

  • If its 0 then scaled to 0
  • If its larger than the concurrency target scaled up (based on much the current average concurrency is over the target value)

So the shortest possible description is probably "Time window after which to determine whether to autoscale" Does this sound better ?

@markusthoemmes
Copy link
Contributor

It's the window over which statistics are averaged. We actually recalculate every two seconds today.

The sideffect is that we only scale down to 0 after we receive straight 0s for the entire window. That's because of math 😂 . The average will only be 0 if all values are 0.

@duglin
Copy link
Contributor

duglin commented Jan 21, 2020

Time window after which to determine whether to autoscale

That's better and if we can't come up with anything else then that's ok for now.
But, I do wonder if it's still a little misleading and your use of the word "granularity" in our slack chat got me wondering if we should make it clear that this flag doesn't necessarily have a direct relation to when things scale, but instead it controls how often we check "something" in our auto-scaling logic.

So, perhaps something more like:
Time window over which the autoscaling metrics are averaged.
A bit geeky, but sadly this flag is geeky :-)

@markusthoemmes
Copy link
Contributor

but instead it controls how often we check "something" in our auto-scaling logic.

That's not true.

It in fact directly affects when things are scaled to 0 because of the maths behind it. The distinction between checking after and calculating over is super important here. The window simply specifies the amount of time we look in the past for historic data, maybe that's more useful?

The amount of historic data that is taken into account for autoscaling decisions.

@rhuss
Copy link
Contributor Author

rhuss commented Jan 21, 2020

It in fact directly affects when things are scaled to 0 because of the maths behind it. The distinction between checking after and calculating over is super important here. The window simply specifies the amount of time we look in the past for historic data, maybe that's more useful?

To harden my understanding, for the first scaling event to happen, you need at least wait for this time window to pass, correct ? .

I mean, if the window is 30 minutes, you have to wait 30 minutes until you scale up because of increased load. After that, the scaling decision is made every 2 seconds (with the data from the last 30 minutes). Is this correct? (asking because that was what I actually observer, i.e. that no scale-up happens when I set this window really large and I'm doing a load test before that window has been reached for the first time).

@rhuss
Copy link
Contributor Author

rhuss commented Jan 21, 2020

The amount of historic data that is taken into account for autoscaling decisions.

I like that, maybe adding some hint that scale to zero happens when no request comes in during that time:

Duration to look back for making autoscaling decisions. The service is scaled to zero if no request was received in during that time.

@knative-metrics-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-knative-client-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/kn/commands/revision/describe.go 62.2% 61.6% -0.6
pkg/kn/commands/service/configuration_edit_flags.go 83.6% 82.5% -1.2
pkg/serving/config_changes.go 79.3% 79.5% 0.3
pkg/serving/revision_template.go 64.7% 62.9% -1.8

@duglin
Copy link
Contributor

duglin commented Jan 21, 2020

I mean, if the window is 30 minutes, you have to wait 30 minutes until you scale up because of increased load.

I don't think this is true because I can see things scale up immediately when I set CC=1 and all pods are busy. This is why I said this window is "part" of the logic w.r.t. scaling up. But I'm sure @markusthoemmes will correct me if I'm wrong ;-)

@markusthoemmes
Copy link
Contributor

We don't wait for the window to fill first, no.

@rhuss
Copy link
Contributor Author

rhuss commented Jan 22, 2020

We don't wait for the window to fill first, no.

ok, then I will have to investigate further why upscale didn't work for me when setting the window to 30 minutes, with a concurrency-target of 10 I get not more than 2 containers for 100 concurrent clients.

@markusthoemmes
Copy link
Contributor

Depending on which version you're running, the behavior might be slightly different.

As of our latest code, we assume that the entire window is "there". For a fresh revision we're therefore considering for example 29 minutes of straight 0s in your case. That'll heavily bias the scale towards the lower end.

@rhuss
Copy link
Contributor Author

rhuss commented Jan 22, 2020

As of our latest code, we assume that the entire window is "there". For a fresh revision we're therefore considering for example 29 minutes of straight 0s in your case. That'll heavily bias the scale towards the lower end.

Ah, I see. But wouldn't it be better that you just take what you have during the first 30 minutes to calculate the average (e.g. average only over 10minutes if your service is running for 10minutes, not filling up 20 minutes with 0), maybe with some portion of leeway (eg soften by multiplying with some weighting function or adding a shorter duration of zeros, like 5minutes) ? But that's just a autoscaling greenhorn speaking here :)

@navidshaikh
Copy link
Collaborator

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2020
@knative-prow-robot knative-prow-robot merged commit 5001dcd into knative:master Jan 22, 2020
coryrc pushed a commit to coryrc/client that referenced this pull request May 14, 2020
* Create presubmit jobs for knative/client

Also enable tide for the new repo.

* Add the missing coverage job/tabs

This is a workaround for knative#615.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add --autoscale-window for managing the autoscale window
7 participants