triage: bail on certain global clusters after 30s #17643

Merged (1 commit) on May 20, 2020

spiffxp
Member

@spiffxp spiffxp commented May 19, 2020

triage works by clustering test failures in two stages:

  • locally: create clusters of test failures for each unique test
  • globally: merge each test's clusters into a global set of clusters

The clustering/merging is done by computing the edit distance between the
failure text of each test failure or failure cluster, and accepting the
first pair whose edit distance is within 10% of their combined length.
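
To make the matching rule concrete, here is a minimal sketch in Python. It is not the triage implementation: `edit_distance`, `is_similar`, and `cluster` are hypothetical names, the distance is a plain Levenshtein computation, and reading the threshold as "distance at most 10% of the combined length" is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling row: O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # keep the row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def is_similar(a: str, b: str, ratio: float = 0.10) -> bool:
    """Assumed rule: distance is at most `ratio` of the combined length."""
    return edit_distance(a, b) <= ratio * (len(a) + len(b))


def cluster(failures, ratio: float = 0.10):
    """Greedy clustering: each failure joins the first cluster it matches."""
    clusters = {}  # representative text -> member texts
    for text in failures:
        for key in clusters:
            if is_similar(text, key, ratio):
                clusters[key].append(text)
                break
        else:
            # No existing cluster matched, so this text starts a new one.
            clusters[text] = [text]
    return clusters
```

Because the first acceptable match wins, the result depends on the order in which failures are seen.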

This can add up in the worst case, where edit distance is going to be
computed for every existing cluster before creating a new cluster.
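
A rough way to see how this adds up, as a back-of-the-envelope sketch: if none of `n` failures match each other, the i-th failure is compared against all i-1 existing clusters. The function names are hypothetical, and the per-comparison cost assumes a plain quadratic dynamic-programming edit distance, which may not be what triage uses.

```python
def worst_case_comparisons(n: int) -> int:
    """Comparisons when no failure matches any existing cluster."""
    # Failure i is checked against the i - 1 clusters created before it.
    return n * (n - 1) // 2


def worst_case_cell_updates(n: int, text_len: int) -> int:
    """Assumes each comparison fills a text_len x text_len DP table."""
    return worst_case_comparisons(n) * text_len * text_len
```

Under these assumptions the cost grows quadratically in both the number of distinct failures and the length of the failure text, which is why a cap on text length and on elapsed time both help.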

We've arbitrarily handled it thus far by:

  • truncating failure text to ~200k~ 10k chars
  • bailing out on local clustering after 60s per unique test
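
The two existing guards could be sketched roughly as follows. This is illustrative only: the names are hypothetical, the similarity check is passed in as `is_similar`, and what happens after the budget is exceeded (falling back to exact-text grouping) is an assumption, not necessarily what triage does.

```python
import time

MAX_TEXT_LEN = 10_000        # truncation limit from the description above
LOCAL_BUDGET_SECONDS = 60.0  # per-test budget from the description above


def truncate(text: str, limit: int = MAX_TEXT_LEN) -> str:
    """Cap failure text so a single comparison cannot grow unbounded."""
    return text[:limit]


def cluster_with_budget(failures, is_similar,
                        budget_seconds: float = LOCAL_BUDGET_SECONDS,
                        clock=time.monotonic):
    """Greedy clustering for one test that stops fuzzy matching on timeout."""
    clusters = {}  # representative (truncated) text -> original failures
    deadline = clock() + budget_seconds
    timed_out = False
    for raw in failures:
        text = truncate(raw)
        if not timed_out and clock() > deadline:
            timed_out = True
        if timed_out:
            # Assumed fallback: only group failures with identical text.
            clusters.setdefault(text, []).append(raw)
            continue
        for key in clusters:
            if is_similar(text, key):
                clusters[key].append(raw)
                break
        else:
            clusters[text] = [raw]
    return clusters
```

Passing `clock` as a parameter keeps the timeout path testable without actually waiting 60 seconds.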

This PR adds:

  • bailing out on global clustering of pathological / low value clusters after 30s
  • more logging to see where clustering is working vs. not
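
For illustration, a global merge with a per-cluster time budget and logging might look like the sketch below. The names, the data shapes, and the choice to start a new global cluster when the search times out are all assumptions; the PR's actual criteria for "pathological / low value" clusters are not reproduced here.

```python
import logging
import time

GLOBAL_BUDGET_SECONDS = 30.0  # budget from the description above


def find_match_with_deadline(text, keys, is_similar, budget_seconds,
                             clock=time.monotonic):
    """Return (matching key or None, timed_out)."""
    deadline = clock() + budget_seconds
    for key in keys:
        if clock() > deadline:
            return None, True  # give up on this search
        if is_similar(text, key):
            return key, False
    return None, False


def merge_globally(local_clusters, is_similar,
                   budget_seconds: float = GLOBAL_BUDGET_SECONDS,
                   clock=time.monotonic):
    """Merge {test: {text: [failures]}} into {text: {test: [failures]}}."""
    merged = {}
    for test_name, clusters in local_clusters.items():
        for text, failures in clusters.items():
            if text in merged:
                key, timed_out = text, False  # exact match, no search needed
            else:
                key, timed_out = find_match_with_deadline(
                    text, list(merged), is_similar, budget_seconds, clock)
            if timed_out:
                logging.warning("bailing on global clustering for %s after %ss",
                                test_name, budget_seconds)
            if key is None:
                key = text  # start a new global cluster
            merged.setdefault(key, {}).setdefault(test_name, []).extend(failures)
    return merged
```

The warning on each bail-out is what makes it possible to see afterwards which tests the merge gave up on.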

This should address #17625

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 19, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spiffxp


@k8s-ci-robot k8s-ci-robot added area/triage sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 19, 2020
@spiffxp
Member Author

spiffxp commented May 20, 2020

/test pull-test-infra-bazel
flake

ERROR: An error occurred during the fetch of repository 'bazel-base':
   Pull command failed: 2020/05/20 00:14:05 Running the Image Puller to pull images from a Docker Registry...
2020/05/20 00:14:06 Image pull was unsuccessful: unable to save remote image gcr.io/k8s-testimages/launcher.gcr.io/google/bazel@sha256:cefc822f93bb3dcf272ce3e4c5162b179d5c165584ace13856afed99662b87cd: unable to write image layers: unable to write image layer: unable to write the contents of layer 0 to /bazel-scratch/.cache/bazel/_bazel_root/05618a594cb3499a4b912817865867cd/external/bazel-base/image/000.tar.gz: read tcp 10.60.151.59:53786->74.125.70.128:443: read: connection reset by peer

@spiffxp
Member Author

spiffxp commented May 20, 2020

/cc @dims @BenTheElder
since you were involved with #17629

@dims
Member

dims commented May 20, 2020

LGTM this definitely helps with the additional logs :)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2020
@k8s-ci-robot k8s-ci-robot merged commit 1fcf814 into kubernetes:master May 20, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 20, 2020
@spiffxp spiffxp deleted the bail-on-triage branch May 21, 2020 04:36