Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kubernetes CI Policy: all prow.k8s.io jobs must have testgrid alerts configured #18599

Open
spiffxp opened this issue Aug 1, 2020 · 30 comments
Assignees
Labels
area/jobs area/testgrid help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@spiffxp
Copy link
Member

spiffxp commented Aug 1, 2020

Part of #18551

Why this is important:

  • In order to ensure effective use of community resources, we expect them to be spent on jobs that provide useful signal, and are actively maintained
  • Configuring testgrid alerts requires setting an e-mail address, which gives us a point of contact to escalate to if a job is deemed an ineffective use of community resources
  • We'll use this to implement a policy where we reserve the right to remove/disable jobs that are deemed an ineffective use of resources (e.g. perma-failing for O(weeks)) if the point of contact is unresponsive

TODO:

  • come up with test to enforce this policy (logging only)
  • come up with report to identify jobs and likely candidate sig owners
    • eg: if a job is on "sig-foo"'s dashboard, it's likely they want to own it
    • who to contact for jobs that aren't sigs?
  • send out notice to all sigs, give a deadline of N weeks
  • any jobs that don't have e-mail addresses configured should be removed
  • flip test to failing once all jobs meet the policy

Other thoughts / notes:

  • need to parse both prowjob config and testgrid config for this
    • there are some testgrids not populated by prow, we can probably ignore these?
    • there are some prowjobs that don't have all their testgrid config in annotations

/sig testing

@spiffxp
Copy link
Member Author

spiffxp commented Sep 10, 2020

Some prior art to serve as starting points:

  • go run ./experiment/prowjob-report/main.go --config ./config/prow/config.yaml --job-config ./config/jobs --format csv > jobs.csv which I periodically reimport into this spreadsheet
    • the owner_dash column is a guess on who should own the job based on which sig/wg-prefixed testgrid dashboard the job lives under
    • could run with --format json and pipe through to jq
    • there are some gaps; the big one I'm aware of is that it misses testgrid info that is solely in testgrid config
  • This test (which does more than its name lets on) should serve as a good starting point for how to fail or log jobs/tabs that don't have alerts configured:
    func TestReleaseBlockingJobsMustHaveTestgridDescriptions(t *testing.T) {
    // TODO(spiffxp): start with master, enforce for all release branches
    re := regexp.MustCompile("^sig-release-master-(blocking|informing)$")
    for _, dashboard := range cfg.Dashboards {
    if !re.MatchString(dashboard.Name) {
    continue
    }
    suffix := re.FindStringSubmatch(dashboard.Name)[1]
    for _, dashboardtab := range dashboard.DashboardTab {
    intro := fmt.Sprintf("dashboard_tab %v/%v is release-%v", dashboard.Name, dashboardtab.Name, suffix)
    if dashboardtab.Name == "" {
    t.Errorf("%v: - Must have a name", intro)
    }
    if dashboardtab.TestGroupName == "" {
    t.Errorf("%v: - Must have a test_group_name", intro)
    }
    if dashboardtab.Description == "" {
    t.Errorf("%v: - Must have a description", intro)
    }
    // TODO(spiffxp): enforce for informing as well
    if suffix == "informing" {
    if !strings.HasPrefix(dashboardtab.Description, "OWNER: ") {
    t.Logf("NOTICE: %v: - Must have a description that starts with OWNER: ", intro)
    }
    if dashboardtab.AlertOptions == nil {
    t.Logf("NOTICE: %v: - Must have alert_options (ensure informing dashboard is listed first in testgrid-dashboards)", intro)
    } else if dashboardtab.AlertOptions.AlertMailToAddresses == "" {
    t.Logf("NOTICE: %v: - Must have alert_options.alert_mail_to_addresses", intro)
    }
    } else {
    if dashboardtab.AlertOptions == nil {
    t.Errorf("%v: - Must have alert_options (ensure blocking dashboard is listed first in testgrid-dashboards)", intro)
    } else if dashboardtab.AlertOptions.AlertMailToAddresses == "" {
    t.Errorf("%v: - Must have alert_options.alert_mail_to_addresses", intro)
    }
    }
    }
    }
    }

@spiffxp
Copy link
Member Author

spiffxp commented Sep 11, 2020

/help

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Sep 11, 2020
@RobertKielty
Copy link
Member

RobertKielty commented Sep 17, 2020

@spiffxp, myself, @ScrapCodes and @rayandas will do work together on the test. Coordinating with them both now.

@RobertKielty
Copy link
Member

RobertKielty commented Sep 19, 2020

So @rayandas @ScrapCodes and myself met up to do exploratory work on this and as result we learned a little bazel!

To run the test above we need to invoke bazel as follows :

cd TEST_INFRA_REPO_ROOT
bazel test //config/tests/testgrids:go_default_test --config--test_output=all  

Worth noting that we must use bazel to run this test and not the go test runner.

The bazel build file is used to pull in test-grid runtime configuration using the
following dependencies :

        "@com_github_googlecloudplatform_testgrid//config:go_default_library",
        "@com_github_googlecloudplatform_testgrid//pb/config:go_default_library",

and also to pass in parameters to the test


        "--config=$(location testconf.pb)",
        "--prow-config=$(location //config/prow:config.yaml)",
        "--job-config=config/jobs",
 

@RobertKielty
Copy link
Member

/assign
/remove help-wanted

@RobertKielty
Copy link
Member

RobertKielty commented Sep 20, 2020

To run the specific test use

bazel test //config/tests/testgrids:go_default_test \
--test_output=all \
--test_filter=TestReleaseBlockingJobsShouldHaveTestgridDescriptions

@RobertKielty
Copy link
Member

/remove help-wanted

@spiffxp
Copy link
Member Author

spiffxp commented Nov 3, 2020

@RobertKielty the command is /remove help (not intuitive IMO)... but speaking of, are you still working on this?

@spiffxp
Copy link
Member Author

spiffxp commented Nov 3, 2020

Reviewed #19286 (review)

@RobertKielty
Copy link
Member

Reviewing this now.

@RobertKielty
Copy link
Member

I want to talk about to @spiffxp about this when he gets back.

@RobertKielty
Copy link
Member

Spoke with @spiffxp about this issue where I proposed writing a helper function to decouple selection of Kubernetes jobs from testing their policy conformance.

@RobertKielty
Copy link
Member

/remove help

@spiffxp
Copy link
Member Author

spiffxp commented Jan 8, 2021

/remove-help

@k8s-ci-robot k8s-ci-robot removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jan 8, 2021
@spiffxp
Copy link
Member Author

spiffxp commented Feb 9, 2021

/milestone v1.21
Checking back in on this

@RobertKielty
Copy link
Member

Might be a good idea to hand off to someone else.

@BenTheElder BenTheElder modified the milestones: v1.21, v1.23 Jun 25, 2021
@spiffxp
Copy link
Member Author

spiffxp commented Jul 27, 2021

/help

I don't really see any way around having to use the tests in config/tests/testgrid. It's the only way you can guarantee you're walking testgrid configs that are the result of both whatever prowjob annotations exist, and whatever manually defined testgrids there are. And testgrid config is where the alerting is defined.

Like, you have my full support if you want to try suggesting:

  • "hey what if we didn't allow manually defined testgrids for any prowjobs we define" or
  • "hey if I added support to configurator for this one testgrid-config-only-field as an annotation we wouldn't need the testgrid configs"

Because then if you could just use prowjob annotations it's way faster/easier to run go test ./config/tests/jobs/..., plus you don't have to worry about iterating over all testgrids, then all job configs (to make sure you didn't miss something that only showed up in one)...

But if you're not willing to tackle that to simplify the problem space, then...

The slowest-but-it-already-works way to do this is use bazel to run the tests. The first time takes a long time (if I've timed it, I've forgotten, but longer than 10min), you end up recompiling the world, and you get a tiny space heater for a while. But then it's faster and you can iterate relatively quickly.

There is a way you can use go to run the configurator to generate the testgrid proto, then run the testgrid tests point at that, but it's too easy to get tripped up and use stale data.

Things you could start with:

@k8s-ci-robot
Copy link
Contributor

@spiffxp:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

I don't really see any way around having to use the tests in config/tests/testgrid. It's the only way you can guarantee you're walking testgrid configs that are the result of both whatever prowjob annotations exist, and whatever manually defined testgrids there are. And testgrid config is where the alerting is defined.

Like, you have my full support if you want to try suggesting:

  • "hey what if we didn't allow manually defined testgrids for any prowjobs we define" or
  • "hey if I added support to configurator for this one testgrid-config-only-field as an annotation we wouldn't need the testgrid configs"

Because then if you could just use prowjob annotations it's way faster/easier to run go test ./config/tests/jobs/..., plus you don't have to worry about iterating over all testgrids, then all job configs (to make sure you didn't miss something that only showed up in one)...

But if you're not willing to tackle that to simplify the problem space, then...

The slowest-but-it-already-works way to do this is use bazel to run the tests. The first time takes a long time (if I've timed it, I've forgotten, but longer than 10min), you end up recompiling the world, and you get a tiny space heater for a while. But then it's faster and you can iterate relatively quickly.

There is a way you can use go to run the configurator to generate the testgrid proto, then run the testgrid tests point at that, but it's too easy to get tripped up and use stale data.

Things you could start with:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 27, 2021
@spiffxp
Copy link
Member Author

spiffxp commented Oct 1, 2021

/kind cleanup

@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Oct 1, 2021
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2021
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 29, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ameukam
Copy link
Member

ameukam commented Feb 28, 2022

/reopen
/lifecycle frozen

@k8s-ci-robot
Copy link
Contributor

@ameukam: Reopened this issue.

In response to this:

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Feb 28, 2022
@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Feb 28, 2022
@ameukam
Copy link
Member

ameukam commented Feb 28, 2022

/milestone clear

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/jobs area/testgrid help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/testing Categorizes an issue or PR as relevant to SIG Testing.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants