go.k8s.io/triage is serving stale data #17069

spiffxp · 2020-04-02T17:35:14Z

What happened:
https://go.k8s.io/triage is showing
"2216 clusters of 92485 failures (6433 in last day) out of 149213 builds from 3/18/2020, 5:00:24 PM to 3/30/2020, 10:54:32 PM."

What you expected to happen:
The end date should be today (04/02/2020)

Please provide links to example occurrences, if any:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1

/area triage
/assign
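
The banner quoted above can be turned into a rough staleness check. This is only a sketch: it assumes the banner keeps exactly this wording and date format, and `parse_triage_summary`, `is_stale`, and the 24-hour threshold are hypothetical names and values, not part of triage. The dates are parsed as naive datetimes, so `now` has to be given in the same time zone as the banner.

```python
import re
from datetime import datetime, timedelta

# Assumed banner shape, copied from the text quoted in this issue.
SUMMARY_RE = re.compile(
    r"(?P<clusters>\d+) clusters of (?P<failures>\d+) failures "
    r"\((?P<last_day>\d+) in last day\) out of (?P<builds>\d+) builds "
    r"from (?P<start>.+?) to (?P<end>.+?)\.$"
)
DATE_FMT = "%m/%d/%Y, %I:%M:%S %p"


def parse_triage_summary(text):
    """Return the counts and the start/end datetimes from the banner text."""
    match = SUMMARY_RE.search(text.strip().strip('"'))
    if match is None:
        raise ValueError("unrecognized summary: %r" % text)
    fields = match.groupdict()
    summary = {key: int(fields[key])
               for key in ("clusters", "failures", "last_day", "builds")}
    summary["start"] = datetime.strptime(fields["start"], DATE_FMT)
    summary["end"] = datetime.strptime(fields["end"], DATE_FMT)
    return summary


def is_stale(summary, now, max_age=timedelta(hours=24)):
    """True when the newest build in the banner is older than max_age."""
    return now - summary["end"] > max_age
```

With the banner from this report and a `now` of April 2, `is_stale` returns True, which matches the symptom described here.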

spiffxp · 2020-04-02T17:36:52Z

https://testgrid.k8s.io/sig-testing-misc#triage is all green, so I'm guessing Kettle has stopped feeding data to the BigQuery dataset we're using.

spiffxp · 2020-04-02T17:36:59Z

/area kettle

spiffxp · 2020-04-02T17:44:23Z

Last log entries... seems like it just got stuck?

spiffxp@spiffxp-macbookpro:~$ k logs pod/kettle-5df45c4dcb-fqdlf | tail -n10

==== 2020-03-30 22:55:11 PDT ========================================
PULLED 658
ACK irrelevant 655
EXTEND-ACK  3
gs://kubernetes-jenkins/logs/ci-cri-containerd-e2e-gci-gce-serial/1244861611756228612 True True 2020-03-30 22:38:48 PDT FAILURE
gs://kubernetes-jenkins/logs/ci-test-infra-benchmark-demo/1244865638594252800 True True 2020-03-30 22:54:54 PDT SUCCESS
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-flaky-repro/1244857333914275840 True True 2020-03-30 22:21:44 PDT FAILURE
ACK "finished.json" 3
Downloading JUnit artifacts.

Not out of disk space

spiffxp@spiffxp-macbookpro:~$ k exec -ti pod/kettle-5df45c4dcb-fqdlf df
Filesystem      1K-blocks      Used Available Use% Mounted on
overlay          98868448   6373144  92478920   7% /
tmpfs               65536         0     65536   0% /dev
tmpfs            15439876         0  15439876   0% /sys/fs/cgroup
/dev/sdb       1033024192 537385356 494573876  53% /data
/dev/sda1        98868448   6373144  92478920   7% /etc/hosts
tmpfs            15439876         4  15439872   1% /etc/service-account
shm                 65536         0     65536   0% /dev/shm
tmpfs            15439876        12  15439864   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            15439876         0  15439876   0% /proc/acpi
tmpfs            15439876         0  15439876   0% /proc/scsi
tmpfs            15439876         0  15439876   0% /sys/firmware

Deleting the pod to get it to restart
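
The "stuck" diagnosis above boils down to the last `==== <timestamp> ====` banner in the log being days old. A minimal sketch of that check, assuming the banner format shown in the log excerpt; `last_banner_time`, `looks_stuck`, and the one-hour threshold are hypothetical, and the time zone label is ignored, so `now` must be in the same zone as the log.

```python
import re
from datetime import datetime, timedelta

# Matches lines like "==== 2020-03-30 22:55:11 PDT ====...".
BANNER_RE = re.compile(r"^==== (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \w+ =+$")


def last_banner_time(log_lines):
    """Return the timestamp of the last banner line, or None if there is none."""
    latest = None
    for line in log_lines:
        match = BANNER_RE.match(line.strip())
        if match:
            latest = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return latest


def looks_stuck(log_lines, now, max_quiet=timedelta(hours=1)):
    """True when no banner was logged within max_quiet of now."""
    latest = last_banner_time(log_lines)
    return latest is None or now - latest > max_quiet
```

Feeding it the tail of the log above with a `now` of April 2 flags the pod as stuck; a healthy pod that printed a banner a few minutes ago would not be flagged.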

spiffxp · 2020-04-02T17:59:09Z

It's running, but I suspect it'll be a few hours before it's ingesting data again. Currently at:

Loading builds from gs://kubernetes-jenkins/pr-logs

spiffxp · 2020-04-02T19:08:21Z

Now at:

Loading builds from gs://kubernetes-jenkins/logs/

spiffxp · 2020-04-02T21:49:47Z

Now at:

gs://istio-circleci/e2e-pilot-noauth-v1alpha3-v2-2/195022
gs://istio-circleci/e2e-pilot-noauth-v1alpha3-v2-2/195020
gs://istio-circleci/e2e-pilot-noauth-v1alpha3-v2-2/195019
gs://istio-circleci/e2e-pilot-noauth-v1alpha3-v2-2/195021
# etc

spiffxp · 2020-04-02T23:21:33Z

Now at:

24761/33852 gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-windows-containerd-gce/1245404572097187845 9 23369401
24762/33852 gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-windows-gce-1-17/1245177066807103488 9 2096838
24763/33852 gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-windows-gce-1-17/1245055389347614721 9 2032195
24764/33852 gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-windows-gce-1909-1-18/1245764443804012547 9 2051247
# etc
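
To estimate how far along the reload is, the leading counter on these lines can be read as done/total. That reading is an assumption based only on the log output above, and `progress` is a hypothetical helper; the trailing numeric columns are left uninterpreted.

```python
def progress(line):
    """Parse a leading 'done/total' counter and return (done, total, fraction)."""
    counter = line.split(None, 1)[0]
    done, total = (int(part) for part in counter.split("/"))
    return done, total, done / float(total)
```

For the first line shown above this gives 24761 of 33852, or roughly 73% of the builds processed.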

spiffxp · 2020-04-03T00:44:25Z

Now seeing a variety of fun errors in the log related to parsing JUnit files

ERROR:root:error on gs://kubernetes-jenkins/pr-logs/pull/kubeflow_pipelines/3397/kubeflow-pipeline-e2e-test/1245483970011860994
Traceback (most recent call last):
  File "make_json.py", line 194, in make_rows
    yield rowid, row_for_build(path, started, finished, results)
  File "make_json.py", line 110, in row_for_build
    for test in parse_junit(result):
  File "make_json.py", line 70, in parse_junit
    time = float(child.attrib['time'] or 0)
KeyError: 'time'

ERROR:root:error on gs://kubernetes-jenkins/pr-logs/pull/48943/pull-kubernetes-bazel/33447
Traceback (most recent call last):
  File "make_json.py", line 194, in make_rows
    yield rowid, row_for_build(path, started, finished, results)
  File "make_json.py", line 110, in row_for_build
    for test in parse_junit(result):
  File "make_json.py", line 38, in parse_junit
    tree = ET.fromstring(xml)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1311, in XML
    parser.feed(text)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1663, in feed
    self._raiseerror(v)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1517, in _raiseerror
    raise err
ParseError: not well-formed (invalid token): line 22, column 8

ERROR:root:error on gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-cos-gke-k8sstable1-alpha-cluster/1188068387603877893
Traceback (most recent call last):
  File "make_json.py", line 194, in make_rows
    yield rowid, row_for_build(path, started, finished, results)
  File "make_json.py", line 110, in row_for_build
    for test in parse_junit(result):
  File "make_json.py", line 38, in parse_junit
    tree = ET.fromstring(xml)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1312, in XML
    return parser.close()
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1675, in close
    self._raiseerror(v)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1517, in _raiseerror
    raise err
ParseError: unclosed token: line 361, column 6

ERROR:root:error on gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-alpha-api/1190048016136933380
Traceback (most recent call last):
  File "make_json.py", line 194, in make_rows
    yield rowid, row_for_build(path, started, finished, results)
  File "make_json.py", line 110, in row_for_build
    for test in parse_junit(result):
  File "make_json.py", line 38, in parse_junit
    tree = ET.fromstring(xml)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1312, in XML
    return parser.close()
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1675, in close
    self._raiseerror(v)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1517, in _raiseerror
    raise err
ParseError: no element found: line 2, column 0

ERROR:root:error on gs://kubernetes-jenkins/logs/ci-benchmark-scheduler-master/1191946794280423424
Traceback (most recent call last):
  File "make_json.py", line 194, in make_rows
    yield rowid, row_for_build(path, started, finished, results)
  File "make_json.py", line 110, in row_for_build
    for test in parse_junit(result):
  File "make_json.py", line 38, in parse_junit
    tree = ET.fromstring(xml)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1311, in XML
    parser.feed(text)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1663, in feed
    self._raiseerror(v)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1517, in _raiseerror
    raise err
ParseError: out of memory: line 1, column 0
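
The tracebacks above fall into two classes: a `KeyError` when a test case has no `time` attribute, and `ParseError` when the XML itself is malformed, truncated, or empty. A sketch of a parser that tolerates both, under the assumption that skipping a bad file is acceptable; `parse_junit_tolerant` is a hypothetical, simplified stand-in and not the `parse_junit` in `make_json.py`. It does nothing about the underlying memory pressure behind the "out of memory" case, it only keeps the error from propagating.

```python
import logging
import xml.etree.ElementTree as ET


def parse_junit_tolerant(xml):
    """Yield (test name, seconds) for each <testcase>; skip unparseable input."""
    try:
        tree = ET.fromstring(xml)
    except ET.ParseError as err:
        # Covers "not well-formed", "unclosed token" and "no element found".
        logging.warning("skipping malformed JUnit XML: %s", err)
        return
    for child in tree.iter("testcase"):
        # .get() returns None when the attribute is absent, so a missing
        # or empty 'time' falls back to 0 instead of raising KeyError.
        seconds = float(child.attrib.get("time") or 0)
        yield child.attrib.get("name", ""), seconds
```

The design choice here is to log and move on, so one bad artifact does not abort the row for the whole build.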

spiffxp · 2020-04-03T01:15:05Z

Now seeing

==== 2020-04-02 18:13:45 PDT ========================================
PULLED 5794
ACK irrelevant 5743
EXTEND-ACK  51
already present??
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-gci-ci-master/1245683912152190978 True True 2020-04-02 05:06:20 PDT FAILURE
# etc
gs://kubernetes-jenkins/logs/metrics-kettle/1244892319367303168 True True 2020-03-31 00:40:49 PDT FAILURE
already present??
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
gs://kubernetes-jenkins/pr-logs/pull/kops/8818/pull-kops-verify-generated/1244896089295818753 True True 2020-03-31 00:55:53 PDT SUCCESS
# etc

spiffxp · 2020-04-03T03:44:43Z

/close
"2187 clusters of 103890 failures (8197 in last day) out of 172861 builds from 3/19/2020, 5:00:04 PM to 4/2/2020, 7:52:43 PM."

k8s-ci-robot · 2020-04-03T03:45:09Z

@spiffxp: Closing this issue.


spiffxp · 2020-07-13T20:26:26Z

"1816 clusters of 193285 failures (22214 in last day) out of 162624 builds from 6/28/2020, 5:00:16 PM to 7/13/2020, 12:59:07 AM."

spiffxp · 2020-07-13T20:29:15Z

Kettle doesn't appear to be stuck

$ k --context=gke_k8s-gubernator_us-west1-b_g8r logs -lapp=kettle -f
==== 2020-07-13 13:28:33 PDT ========================================
PULLED 89
ACK irrelevant 84
EXTEND-ACK  5
gs://kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_descheduler/338/pull-descheduler-test-e2e-k8s-master/1282773415526141952 True True 2020-07-13 13:26:46 PDT FAILURE
gs://kubernetes-jenkins/pr-logs/pull/92349/pull-kubernetes-kubemark-e2e-gce-big/1282744096825282560 True True 2020-07-13 11:30:41 PDT FAILURE
gs://kubernetes-jenkins/pr-logs/pull/kops/9567/pull-kops-verify-gomod/1282772785252274176 True True 2020-07-13 13:28:07 PDT SUCCESS
gs://kubernetes-jenkins/pr-logs/pull/batch/pull-kubernetes-bazel-test/1282770790663589888 True True 2020-07-13 13:19:57 PDT SUCCESS
gs://kubernetes-jenkins/pr-logs/pull/92819/pull-kubernetes-dependencies-canary/1282768759748038657 True True 2020-07-13 13:13:31 PDT SUCCESS
ACK "finished.json" 5
Downloading JUnit artifacts.
^C

spiffxp · 2020-07-13T22:06:48Z

"1838 clusters of 205138 failures (23506 in last day) out of 168104 builds from 6/28/2020, 5:00:16 PM to 7/13/2020, 12:20:44 PM."

I changed nothing?

spiffxp · 2020-07-13T22:11:49Z

/close

https://testgrid.k8s.io/sig-testing-misc#metrics-kettle

This check was complaining that the build tables had fallen stale, but is happy now (not sure what caused it or why it's fixed)

https://testgrid.k8s.io/sig-testing-misc#triage

I forgot, triage runs somewhat infrequently now, so 12:20 PT is actually expected

k8s-ci-robot · 2020-07-13T22:12:03Z

@spiffxp: Closing this issue.


spiffxp · 2020-07-15T17:23:25Z

https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5

OK, so this is periodically recurring, and has been getting worse since the start of July (which is as far back as testgrid goes at this point).

This feels like #13432 resurfacing? We never root-caused why it was happening, or why it seemingly disappeared.

spiffxp added the kind/bug label Apr 2, 2020

k8s-ci-robot assigned spiffxp Apr 2, 2020

k8s-ci-robot added the area/triage label Apr 2, 2020

k8s-ci-robot added the area/kettle label Apr 2, 2020

k8s-ci-robot closed this as completed Apr 3, 2020

spiffxp mentioned this issue Jul 10, 2020

Extend Kettle build fields to be used for determining flakes #18197

Merged

spiffxp reopened this Jul 13, 2020

k8s-ci-robot closed this as completed Jul 13, 2020

BenTheElder mentioned this issue Aug 5, 2021

kettle appears to be down #23135

Closed
