
Create api-availability measurement #1096

Open
mm4tt opened this issue Mar 3, 2020 · 38 comments
Labels: good first issue, help wanted, lifecycle/stale

Comments


mm4tt commented Mar 3, 2020

Justification

See this comment - #1086 (comment)

Milestones

V0

  1. Create a new "ApiAvailability" measurement.
  2. The measurement should periodically probe the apiserver's /healthz endpoint and record whether the API was available.
  3. The measurement should output the following stats (a rough sketch of how they could be computed follows this list):
    1. Availability percentage (i.e. % of OK responses over all responses)
    2. Longest consecutive unavailability period
    3. ...
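
For illustration only, here is a minimal sketch of how these two stats could be computed from recorded probe results. All names and types below are made up for the sketch and are not a prescribed implementation:

```go
// Illustrative sketch: computing the V0 stats from recorded /healthz probe results.
package apiavailability

import "time"

// probeResult records the outcome of a single /healthz probe.
type probeResult struct {
	timestamp time.Time
	healthy   bool
}

// summarize returns the availability percentage and the longest
// consecutive unavailability period observed in the given results.
func summarize(results []probeResult) (availabilityPercent float64, longestUnavailable time.Duration) {
	if len(results) == 0 {
		return 100.0, 0
	}
	healthyCount := 0
	var unavailableSince time.Time
	inOutage := false
	for _, r := range results {
		if r.healthy {
			healthyCount++
			inOutage = false
			continue
		}
		if !inOutage {
			inOutage = true
			unavailableSince = r.timestamp
		}
		// Extend the current outage window and track the longest one seen so far.
		if d := r.timestamp.Sub(unavailableSince); d > longestUnavailable {
			longestUnavailable = d
		}
	}
	availabilityPercent = 100.0 * float64(healthyCount) / float64(len(results))
	return availabilityPercent, longestUnavailable
}
```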

V1

  1. Make the measurement fail if the apiserver has been consecutively unavailable for the last XX min (e.g. 30 min).
  2. Make the error critical to ensure that test execution is stopped in such a case.

V2

  1. Come up with an exact SLI/SLO definition and make it available in perf-dash for further analysis.
  2. Make it a WIP Scalability SLO.
  3. Evaluate and promote it to an official SLO.

/good-first-issue

@k8s-ci-robot

@mm4tt:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

> /good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the good first issue and help wanted labels Mar 3, 2020

vamossagar12 commented Mar 13, 2020

/assign
I would like to take this up. Could you share more details?


mm4tt commented Mar 18, 2020

Great, thanks @vamossagar12

Let's start with V0. Could you let me know which parts of the description are unclear or require further details?

@vamossagar12

Thanks, @mm4tt. A couple of things:

  1. How frequently should the probe happen? Should it be configurable, or can we fix it at a single value?
  2. Should the availability percentage be computed over the entire perf test run, and should we track both the percentage and the longest unavailability period?

As a side question: typically, if the health endpoint is down for a configured time, the response would be to scale up or take some similar action. I just wanted to understand the rationale behind adding this particular measurement. Thanks!

@wojtek-t

> How frequently should the probe happen? Should it be configurable, or can we fix it at a single value?

That seems like an implementation detail. We can make it configurable, but I doubt we will use different values once we agree on something.

> Should the availability percentage be computed over the entire perf test run, and should we track both the percentage and the longest unavailability period?

Correct.

> As a side question: typically, if the health endpoint is down for a configured time, the response would be to scale up or take some similar action. I just wanted to understand the rationale behind adding this particular measurement. Thanks!

There are a couple of points:

  • we want to ensure that we understand what availability looks like (I can imagine all other SLOs being fine, but the cluster being periodically unavailable)
  • we can use it as an optimization - if the cluster is down for X minutes, it probably won't ever come back up, and we can shut down the test and save money

@vamossagar12

Thanks for the updates, @wojtek-t. I will start on this issue now. I'm pretty sure there will be a few more questions along the way though :)


vamossagar12 commented Mar 26, 2020

Hi, since I last commented I haven't had a chance to look at this. I will start over the next couple of days.

@vamossagar12

Hi, I started looking at this today. One thing I noticed is that there is already a measurement, metrics_for_e2e, which, among other things, fetches metrics from the apiserver.

So the new measurement can work along the same lines, except that instead of hitting the /metrics endpoint it would hit the apiserver's /healthz endpoint. That's from an implementation standpoint.

The other question I had: the measurements we define live within a Step, and a group of Steps forms a single test. So when we say that this new measurement measures the health of the apiserver for the duration of a test (mentioned above), what exactly does "a test" mean in this case - the Step which houses the measurement(s), or the overall test?


mm4tt commented Mar 30, 2020

Hey, @vamossagar12

It makes sense to make the implementation similar to the metrics_for_e2e. It's exactly as you said, instead of hitting /metrics, we'll be hitting /healthz.

Regarding your second question: usually a measurement has two actions, start and gather. This one should work the same way. The start action should start a goroutine that continuously pings the /healthz endpoint, and the gather action will stop this goroutine and wrap everything up (see the rough sketch below). Does it make sense?
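
For illustration, a minimal sketch of that start/gather shape, assuming a plain HTTP client and made-up type and method names rather than the actual clusterloader2 measurement interface:

```go
// Illustrative sketch of the start/gather pattern described above.
package apiavailability

import (
	"net/http"
	"sync"
	"time"
)

type apiAvailabilityMeasurement struct {
	stopCh       chan struct{}
	wg           sync.WaitGroup
	mu           sync.Mutex
	totalProbes  int
	healthyCount int
}

// start launches a goroutine that probes healthzURL every interval until gather is called.
func (m *apiAvailabilityMeasurement) start(healthzURL string, interval time.Duration) {
	m.stopCh = make(chan struct{})
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-m.stopCh:
				return
			case <-ticker.C:
				healthy := false
				if resp, err := http.Get(healthzURL); err == nil {
					healthy = resp.StatusCode == http.StatusOK
					resp.Body.Close()
				}
				m.mu.Lock()
				m.totalProbes++
				if healthy {
					m.healthyCount++
				}
				m.mu.Unlock()
			}
		}
	}()
}

// gather stops the probing goroutine and returns the availability percentage.
func (m *apiAvailabilityMeasurement) gather() float64 {
	close(m.stopCh)
	m.wg.Wait()
	if m.totalProbes == 0 {
		return 100.0
	}
	return 100.0 * float64(m.healthyCount) / float64(m.totalProbes)
}
```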

@wojtek-t

> It makes sense to make the implementation similar to the metrics_for_e2e. It's exactly as you said, instead of hitting /metrics, we'll be hitting /healthz.

Not sure I understand - for metrics_for_e2e, IIRC we're fetching the metrics once at the end of the test. Assuming that's true, it can't really work the same way...


mm4tt commented Mar 30, 2020

> It makes sense to make the implementation similar to the metrics_for_e2e. It's exactly as you said, instead of hitting /metrics, we'll be hitting /healthz.

> Not sure I understand - for metrics_for_e2e, IIRC we're fetching the metrics once at the end of the test. Assuming that's true, it can't really work the same way...

Good point, I should have checked how metrics_for_e2e works :) Still, you should be able to take some inspiration from that measurement. Let me know if you have more questions.

@vamossagar12

Actually, I meant to use metrics_for_e2e only as a baseline for how to interact with the apiserver. I hadn't looked at the internals, so what @wojtek-t pointed out is even more valuable :)

@vamossagar12

> Hey, @vamossagar12
>
> It makes sense to make the implementation similar to the metrics_for_e2e. It's exactly as you said, instead of hitting /metrics, we'll be hitting /healthz.
>
> Regarding your second question: usually a measurement has two actions, start and gather. This one should work the same way. The start action should start a goroutine that continuously pings the /healthz endpoint, and the gather action will stop this goroutine and wrap everything up. Does it make sense?

Regarding point 2, I still have a question. As far as I have understood, the hierarchy of a test is as follows:

Test -> Step -> Phase(s) or Measurement(s)

If we focus just on measurements, they sit two levels below a test. So when we hit the apiserver from start and gather within a measurement, that would still be within the context of that particular Step. Is the question clear now, or is my thinking totally off track here 😄


mm4tt commented Mar 30, 2020

Oh, I see where the confusion comes from. The hierarchy you listed is correct, but the measurement there should be treated as a measurement invocation. The measurement object's life spans the whole test. So if your test is:

Step1: call method start on measurement A
Step2: do something else
Step3: do something else
Step4: Call method gather on measurement A

then the start and gather methods will be called on the same measurement instance (see the config sketch below).
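
For illustration, a rough sketch of a test config fragment in which the same measurement instance, identified by Identifier, is started in one step and gathered in a later one. The measurement name APIAvailability is hypothetical, and the field names follow the usual clusterloader2 step/measurement shape from memory, so details may differ:

```yaml
# Illustrative config fragment only.
steps:
- name: Start API availability measurement
  measurements:
  - Identifier: APIAvailability
    Method: APIAvailability
    Params:
      action: start
- name: Run the actual test load
  # Placeholder for the real workload steps (phases, other measurements, etc.).
  measurements:
  - Identifier: Sleep
    Method: Sleep
    Params:
      duration: 5m
- name: Gather API availability measurement
  measurements:
  - Identifier: APIAvailability
    Method: APIAvailability
    Params:
      action: gather
```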


vamossagar12 commented Apr 7, 2020

@mm4tt I have taken an initial stab at this here: https://github.com/kubernetes/perf-tests/pull/1162/files

Please review when you have the bandwidth. I will also put more thought into improving it.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jul 6, 2020

mm4tt commented Jul 6, 2020

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Jul 6, 2020
@vamossagar12

@wojtek-t, the PR got merged. I guess the next task would be to write the config files? Are there any specific areas I should start looking at for that?

@fejta-bot
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Nov 15, 2020
@wojtek-t

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Nov 16, 2020
@fejta-bot
/lifecycle stale

k8s-ci-robot removed the lifecycle/stale label May 17, 2021

tosi3k commented May 17, 2021

We're in the middle of V1. We have already added a configurable availability percentage threshold below which the test fails.

The problem is that the API availability measurement makes the API call latency measurement fail. This is because the former periodically runs kubectl exec underneath, which pushes the POST pods' exec subresource API call latency above the 1s threshold (which is understandable and shouldn't be considered an API call latency SLO violation). @jupblb is currently working on fixing this.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Aug 15, 2021
@wojtek-t

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Aug 23, 2021
@k8s-triage-robot

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Nov 21, 2021
@wojtek-t

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Nov 22, 2021
@k8s-triage-robot

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Feb 20, 2022
@wojtek-t

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Feb 21, 2022
@k8s-triage-robot

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label May 22, 2022
@wojtek-t

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label May 23, 2022
@k8s-triage-robot

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Aug 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Sep 20, 2022
@wojtek-t

/remove-lifecycle rotten

k8s-ci-robot removed the lifecycle/rotten label Sep 22, 2022
@k8s-triage-robot

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Feb 8, 2023

wojtek-t commented Feb 8, 2023

/remove-lifecycle rotten

vamossagar12 removed their assignment Jun 2, 2023