go.k8s.io/triage is serving stale data #17069

Closed
spiffxp opened this issue Apr 2, 2020 · 17 comments
Labels: area/kettle, area/triage, kind/bug

Comments

Member

spiffxp commented Apr 2, 2020

What happened:
https://go.k8s.io/triage is showing
"2216 clusters of 92485 failures (6433 in last day) out of 149213 builds from 3/18/2020, 5:00:24 PM to 3/30/2020, 10:54:32 PM."

What you expected to happen:
The end date should be today (04/02/2020)

Please provide links to example occurrences, if any:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1

/area triage
/assign
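
The staleness check described above can be sketched in Python. This is a hypothetical helper, not part of triage or kettle: the names `summary_window` and `is_stale` and the one-day `max_age` default are assumptions, and the timestamps are treated as naive local times because the summary line carries no time zone.

```python
import re
from datetime import datetime, timedelta

# Matches the "from <start> to <end>" tail of the triage summary line.
SUMMARY_RE = re.compile(
    r"from (?P<start>[\d/]+, [\d:]+ [AP]M) to (?P<end>[\d/]+, [\d:]+ [AP]M)"
)
STAMP_FORMAT = "%m/%d/%Y, %I:%M:%S %p"


def summary_window(summary):
    """Return (start, end) datetimes parsed from a triage summary line."""
    match = SUMMARY_RE.search(summary)
    if match is None:
        raise ValueError("no date range found in summary")
    return (
        datetime.strptime(match.group("start"), STAMP_FORMAT),
        datetime.strptime(match.group("end"), STAMP_FORMAT),
    )


def is_stale(summary, now, max_age=timedelta(days=1)):
    """True when the end of the summary window is older than max_age (assumed threshold)."""
    _, end = summary_window(summary)
    return now - end > max_age
```

Passing `now` explicitly keeps the check easy to test; a real monitor would likely pass `datetime.now()` and pick `max_age` to match how often triage actually runs.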

spiffxp added the kind/bug label Apr 2, 2020
Member Author

spiffxp commented Apr 2, 2020

https://testgrid.k8s.io/sig-testing-misc#triage is all green, so I'm guessing kettle has stopped feeding data to the BigQuery dataset we're using.

Member Author

spiffxp commented Apr 2, 2020

/area kettle

Member Author

spiffxp commented Apr 2, 2020

Last log entries... seems like it just got stuck?

spiffxp@spiffxp-macbookpro:~$ k logs pod/kettle-5df45c4dcb-fqdlf | tail -n10

==== 2020-03-30 22:55:11 PDT ========================================
PULLED 658
ACK irrelevant 655
EXTEND-ACK  3
gs://kubernetes-jenkins/logs/ci-cri-containerd-e2e-gci-gce-serial/1244861611756228612 True True 2020-03-30 22:38:48 PDT FAILURE
gs://kubernetes-jenkins/logs/ci-test-infra-benchmark-demo/1244865638594252800 True True 2020-03-30 22:54:54 PDT SUCCESS
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-flaky-repro/1244857333914275840 True True 2020-03-30 22:21:44 PDT FAILURE
ACK "finished.json" 3
Downloading JUnit artifacts.

Not out of disk space

spiffxp@spiffxp-macbookpro:~$ k exec -ti pod/kettle-5df45c4dcb-fqdlf df
Filesystem      1K-blocks      Used Available Use% Mounted on
overlay          98868448   6373144  92478920   7% /
tmpfs               65536         0     65536   0% /dev
tmpfs            15439876         0  15439876   0% /sys/fs/cgroup
/dev/sdb       1033024192 537385356 494573876  53% /data
/dev/sda1        98868448   6373144  92478920   7% /etc/hosts
tmpfs            15439876         4  15439872   1% /etc/service-account
shm                 65536         0     65536   0% /dev/shm
tmpfs            15439876        12  15439864   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            15439876         0  15439876   0% /proc/acpi
tmpfs            15439876         0  15439876   0% /proc/scsi
tmpfs            15439876         0  15439876   0% /sys/firmware

Deleting the pod to get it to restart
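
A rough way to spot this kind of hang is to compare the last `==== <timestamp> ====` batch header in the log against the current time. The sketch below is illustrative only: `last_batch_time`, `looks_stuck`, and the one-hour `max_silence` default are assumptions, and the time zone abbreviation in the header is ignored, so `now` must be in the same zone as the log.

```python
import re
from datetime import datetime, timedelta

# Matches kettle's batch header, e.g. "==== 2020-03-30 22:55:11 PDT ====...".
HEADER_RE = re.compile(r"^==== (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \w+ =+$")


def last_batch_time(log_text):
    """Return the timestamp of the last batch header in the log, or None."""
    latest = None
    for line in log_text.splitlines():
        match = HEADER_RE.match(line.strip())
        if match:
            latest = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return latest


def looks_stuck(log_text, now, max_silence=timedelta(hours=1)):
    """True when no batch header is found or the last one is older than max_silence."""
    latest = last_batch_time(log_text)
    return latest is None or now - latest > max_silence
```

Fed the tail of the log above with `now` set to Apr 2, this would report the pod as stuck, since the last batch header is from 2020-03-30 22:55:11.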

Member Author

spiffxp commented Apr 2, 2020

It's running, but I suspect it'll be a few hours before it's ingesting data again. Currently at:

Loading builds from gs://kubernetes-jenkins/pr-logs

Member Author

spiffxp commented Apr 2, 2020

Now at:

Loading builds from gs://kubernetes-jenkins/logs/

Member Author

spiffxp commented Apr 2, 2020

Now at:

gs://istio-circleci/e2e-pilot-noauth-v1alpha3-v2-2/195022
gs://istio-circleci/e2e-pilot-noauth-v1alpha3-v2-2/195020
gs://istio-circleci/e2e-pilot-noauth-v1alpha3-v2-2/195019
gs://istio-circleci/e2e-pilot-noauth-v1alpha3-v2-2/195021
# etc

Member Author

spiffxp commented Apr 2, 2020

Now at:

24761/33852 gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-windows-containerd-gce/1245404572097187845 9 23369401
24762/33852 gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-windows-gce-1-17/1245177066807103488 9 2096838
24763/33852 gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-windows-gce-1-17/1245055389347614721 9 2032195
24764/33852 gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-windows-gce-1909-1-18/1245764443804012547 9 2051247
# etc

Member Author

spiffxp commented Apr 3, 2020

Now seeing a variety of fun errors in the log related to parsing JUnit files

ERROR:root:error on gs://kubernetes-jenkins/pr-logs/pull/kubeflow_pipelines/3397/kubeflow-pipeline-e2e-test/1245483970011860994
Traceback (most recent call last):
  File "make_json.py", line 194, in make_rows
    yield rowid, row_for_build(path, started, finished, results)
  File "make_json.py", line 110, in row_for_build
    for test in parse_junit(result):
  File "make_json.py", line 70, in parse_junit
    time = float(child.attrib['time'] or 0)
KeyError: 'time'
ERROR:root:error on gs://kubernetes-jenkins/pr-logs/pull/48943/pull-kubernetes-bazel/33447
Traceback (most recent call last):
  File "make_json.py", line 194, in make_rows
    yield rowid, row_for_build(path, started, finished, results)
  File "make_json.py", line 110, in row_for_build
    for test in parse_junit(result):
  File "make_json.py", line 38, in parse_junit
    tree = ET.fromstring(xml)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1311, in XML
    parser.feed(text)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1663, in feed
    self._raiseerror(v)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1517, in _raiseerror
    raise err
ParseError: not well-formed (invalid token): line 22, column 8
ERROR:root:error on gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-cos-gke-k8sstable1-alpha-cluster/1188068387603877893
Traceback (most recent call last):
  File "make_json.py", line 194, in make_rows
    yield rowid, row_for_build(path, started, finished, results)
  File "make_json.py", line 110, in row_for_build
    for test in parse_junit(result):
  File "make_json.py", line 38, in parse_junit
    tree = ET.fromstring(xml)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1312, in XML
    return parser.close()
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1675, in close
    self._raiseerror(v)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1517, in _raiseerror
    raise err
ParseError: unclosed token: line 361, column 6
ERROR:root:error on gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-alpha-api/1190048016136933380
Traceback (most recent call last):
  File "make_json.py", line 194, in make_rows
    yield rowid, row_for_build(path, started, finished, results)
  File "make_json.py", line 110, in row_for_build
    for test in parse_junit(result):
  File "make_json.py", line 38, in parse_junit
    tree = ET.fromstring(xml)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1312, in XML
    return parser.close()
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1675, in close
    self._raiseerror(v)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1517, in _raiseerror
    raise err
ParseError: no element found: line 2, column 0
ERROR:root:error on gs://kubernetes-jenkins/logs/ci-benchmark-scheduler-master/1191946794280423424
Traceback (most recent call last):
  File "make_json.py", line 194, in make_rows
    yield rowid, row_for_build(path, started, finished, results)
  File "make_json.py", line 110, in row_for_build
    for test in parse_junit(result):
  File "make_json.py", line 38, in parse_junit
    tree = ET.fromstring(xml)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1311, in XML
    parser.feed(text)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1663, in feed
    self._raiseerror(v)
  File "/opt/pypy-5.8-linux_x86_64-portable/lib-python/2.7/xml/etree/ElementTree.py", line 1517, in _raiseerror
    raise err
ParseError: out of memory: line 1, column 0
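
For reference, both failure modes in these tracebacks (a `<testcase>` with no `time` attribute, and XML that fails to parse) can be tolerated rather than raised. The following is a minimal sketch under assumptions, not the actual `parse_junit` in `make_json.py`: the function name `parse_junit_tolerant` and the returned `(name, seconds)` shape are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_junit_tolerant(xml_text):
    """Return [(name, seconds)] per <testcase>; [] if the XML cannot be parsed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Covers truncated, empty, or otherwise malformed JUnit files.
        return []
    results = []
    for case in root.iter("testcase"):
        name = case.attrib.get("name", "")
        # .get() avoids the KeyError when 'time' is absent; empty string also maps to 0.
        raw = case.attrib.get("time") or 0
        try:
            seconds = float(raw)
        except ValueError:
            seconds = 0.0
        results.append((name, seconds))
    return results
```

Whether a bad file should be skipped silently or still logged is a policy choice; this sketch skips it so that one malformed artifact does not abort the row for the whole build.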

Member Author

spiffxp commented Apr 3, 2020

Now seeing

==== 2020-04-02 18:13:45 PDT ========================================
PULLED 5794
ACK irrelevant 5743
EXTEND-ACK  51
already present??
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-gci-ci-master/1245683912152190978 True True 2020-04-02 05:06:20 PDT FAILURE
# etc
gs://kubernetes-jenkins/logs/metrics-kettle/1244892319367303168 True True 2020-03-31 00:40:49 PDT FAILURE
already present??
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: www.googleapis.com
gs://kubernetes-jenkins/pr-logs/pull/kops/8818/pull-kops-verify-generated/1244896089295818753 True True 2020-03-31 00:55:53 PDT SUCCESS
# etc

Member Author

spiffxp commented Apr 3, 2020

/close
"2187 clusters of 103890 failures (8197 in last day) out of 172861 builds from 3/19/2020, 5:00:04 PM to 4/2/2020, 7:52:43 PM."

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

In response to this:

/close
"2187 clusters of 103890 failures (8197 in last day) out of 172861 builds from 3/19/2020, 5:00:04 PM to 4/2/2020, 7:52:43 PM."

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member Author

spiffxp commented Jul 13, 2020

"1816 clusters of 193285 failures (22214 in last day) out of 162624 builds from 6/28/2020, 5:00:16 PM to 7/13/2020, 12:59:07 AM."

spiffxp reopened this Jul 13, 2020
Member Author

spiffxp commented Jul 13, 2020

Kettle doesn't appear to be stuck

$ k --context=gke_k8s-gubernator_us-west1-b_g8r logs -lapp=kettle -f
==== 2020-07-13 13:28:33 PDT ========================================
PULLED 89
ACK irrelevant 84
EXTEND-ACK  5
gs://kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_descheduler/338/pull-descheduler-test-e2e-k8s-master/1282773415526141952 True True 2020-07-13 13:26:46 PDT FAILURE
gs://kubernetes-jenkins/pr-logs/pull/92349/pull-kubernetes-kubemark-e2e-gce-big/1282744096825282560 True True 2020-07-13 11:30:41 PDT FAILURE
gs://kubernetes-jenkins/pr-logs/pull/kops/9567/pull-kops-verify-gomod/1282772785252274176 True True 2020-07-13 13:28:07 PDT SUCCESS
gs://kubernetes-jenkins/pr-logs/pull/batch/pull-kubernetes-bazel-test/1282770790663589888 True True 2020-07-13 13:19:57 PDT SUCCESS
gs://kubernetes-jenkins/pr-logs/pull/92819/pull-kubernetes-dependencies-canary/1282768759748038657 True True 2020-07-13 13:13:31 PDT SUCCESS
ACK "finished.json" 5
Downloading JUnit artifacts.
^C

Member Author

spiffxp commented Jul 13, 2020

"1838 clusters of 205138 failures (23506 in last day) out of 168104 builds from 6/28/2020, 5:00:16 PM to 7/13/2020, 12:20:44 PM."

I changed nothing?

Member Author

spiffxp commented Jul 13, 2020

/close

https://testgrid.k8s.io/sig-testing-misc#metrics-kettle

This check was complaining that the build tables had fallen stale, but is happy now (not sure what caused it and why it's fixed)

https://testgrid.k8s.io/sig-testing-misc#triage

I forgot, triage runs somewhat infrequently now, so 12:20 PT is actually expected

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

In response to this:

/close

https://testgrid.k8s.io/sig-testing-misc#metrics-kettle

This check was complaining that the build tables had fallen stale, but is happy now (not sure what caused it and why it's fixed)

https://testgrid.k8s.io/sig-testing-misc#triage

I forgot, triage runs somewhat infrequently now, so 12:20 PT is actually expected

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member Author

spiffxp commented Jul 15, 2020

https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5

OK, so this is periodically recurring, and has been getting worse since the start of July (which is as far back as testgrid goes at this point).

[Screenshot: Screen Shot 2020-07-15 at 10 21 31 AM]

This feels like #13432 resurfacing? We never root-caused why it was happening, or why it seemingly disappeared.
