Added guide for monitoring CI #5244
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,212 @@ | ||||||
# Monitoring Kubernetes Health | ||||||
|
||||||
**Table of Contents** | ||||||
|
||||||
- [Monitoring Kubernetes Health](#monitoring-kubernetes-health) | ||||||
- [Monitoring the health of Kubernetes with TestGrid](#monitoring-the-health-of-kubernetes-with-testgrid) | ||||||
- [What dashboards should I monitor?](#what-dashboards-should-i-monitor) | ||||||
- [What do I do when I see a TestGrid alert?](#what-do-i-do-when-i-see-a-testgrid-alert) | ||||||
- [Communicate your findings](#communicate-your-findings) | ||||||
- [Fill out the issue](#fill-out-the-issue) | ||||||
- [Iterate](#iterate) | ||||||
|
||||||
|
||||||
## Monitoring the health of Kubernetes with TestGrid | ||||||
|
||||||
TestGrid is a highly configurable, interactive dashboard for viewing your test | ||||||
results in a grid; see https://github.com/GoogleCloudPlatform/testgrid. | ||||||
Review comment: nit: A better transition would be nice here, like 'it is partially open-sourced so you can view the source code here' or something to that effect?
Review comment: To disambiguate even more "the back end is open-sourced"
Review comment: +1 to clarifying the repo contains the back-end components of testgrid, not the dashboard code itself
||||||
|
||||||
The Kubernetes community has its own instance of TestGrid, https://testgrid.k8s.io/, | ||||||
which we use to monitor and observe the health of the project. | ||||||
|
||||||
Each SIG has its own set of dashboards, and each dashboard is composed of | ||||||
different end-to-end (e2e) jobs. | ||||||
Review comment: there are more than e2e jobs
|
||||||
E2E jobs are in turn made up of test stages (e.g., bootstrapping a Kubernetes | ||||||
cluster, tearing down a Kubernetes cluster) and e2e tests (e.g., Kubectl client | ||||||
Kubectl logs should be able to retrieve and filter logs). | ||||||
These views allow different teams to monitor and understand how their areas | ||||||
are doing. | ||||||
|
||||||
We highly encourage anyone to periodically monitor these dashboards. | ||||||
Review comment: I don't want to encourage toil. SIGs should be periodically monitoring the dashboards related to subprojects they own.
||||||
If you see that a job or test has been failing, please raise an issue with the | ||||||
corresponding SIG in either their mailing list or in Slack. | ||||||
|
||||||
Help with maintaining tests, fixing broken tests, improving test success rates, and | ||||||
improving tests overall is always needed throughout the project. | ||||||
Review comment: grammar nit:
Review comment: @thejoycekung you sure on this one? 👀
Review comment: Yep! The sentence is basically
||||||
|
||||||
**Note**: It is important that all SIGs periodically monitor their jobs and | ||||||
tests. These are used to figure out when to release Kubernetes. | ||||||
Review comment on lines +37 to +38: This is too broad. Not all jobs/tests are used to figure out when to release Kubernetes.
||||||
Furthermore, if jobs or tests are failing or flaking, then pull requests will | ||||||
take a lot longer to be merged. | ||||||
Review comment on lines +39 to +40: Might be an opportune time to mention the difference between periodic / presubmit / postsubmit
Review comment: Also possibly mentioning metrics
||||||
|
||||||
|
||||||
### What dashboards should I monitor? | ||||||
|
||||||
This depends on what areas of Kubernetes you want to contribute to. | ||||||
You should monitor the dashboards owned by the SIG you are working with. | ||||||
Additionally, you should check: | ||||||
|
||||||
* https://testgrid.k8s.io/sig-release-master-blocking and | ||||||
* https://testgrid.k8s.io/sig-release-master-informing | ||||||
|
||||||
since these jobs run tests owned by other SIGs. | ||||||
Also, make sure to periodically check on the "blocking" and "informing" | ||||||
dashboards for past releases. | ||||||
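The dashboard names above follow a visible pattern, so the list of dashboards to check can be generated rather than typed by hand. The following is a minimal sketch, assuming that dashboards for past releases are named like `sig-release-1.20-blocking`; the function name `release_dashboards` and the `-informing` name for past releases are assumptions made for illustration, not part of TestGrid itself.

```python
from typing import List

TESTGRID_BASE = "https://testgrid.k8s.io"


def release_dashboards(versions: List[str]) -> List[str]:
    """Return the blocking and informing dashboard URLs for each version label.

    A version label is "master" or a release such as "1.20" (assumed naming).
    """
    urls = []
    for version in versions:
        for kind in ("blocking", "informing"):
            urls.append(f"{TESTGRID_BASE}/sig-release-{version}-{kind}")
    return urls


if __name__ == "__main__":
    # Print the dashboards for master and one past release.
    for url in release_dashboards(["master", "1.20"]):
        print(url)
```

Pass whichever release labels you care about; confirm that each generated URL exists before relying on it.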
|
||||||
--- | ||||||
|
||||||
## What do I do when I see a TestGrid alert? | ||||||
|
||||||
If you are part of a SIG's mailing list, occasionally you may see emails from | ||||||
TestGrid reporting that a job or a test has recently failed. | ||||||
If you are casually browsing through TestGrid, you may also see jobs labeled as | ||||||
"flaky" (in purple) or as "failing" (in red). | ||||||
This section explains what to do on these occasions. | ||||||
|
||||||
### Communicate your findings | ||||||
|
||||||
The number one thing to do is to communicate your findings: a test or job has | ||||||
been flaking or failing. | ||||||
If you saw a TestGrid alert on a mailing list, please reply to the thread and | ||||||
mention that you are looking into it. | ||||||
It is important to communicate to prevent duplicate work and to ensure CI | ||||||
problems get attention. | ||||||
|
||||||
In order to communicate with the rest of the community and to drive the work, | ||||||
please open up an issue on Kubernetes, | ||||||
https://github.com/kubernetes/kubernetes/issues/new/choose, and choose the appropriate issue | ||||||
template. | ||||||
Review comment on lines +68 to +78: I would suggest:
How do I decide which kubernetes/kubernetes issue template to use?
|
||||||
|
||||||
### Fill out the issue | ||||||
Review comment: Unsure if this is out of scope for this issue or whether it falls more under "revamping the issue template"? -> We should talk a little bit about how to title it, e.g. prefix with
Review comment: I said something similar in a different PR review that wrote instructions on how to file a flake issue #5205 (comment) Instructions on correctly filling out an issue are most likely to be read if they are part of the issue template itself. Alternatively, make a page dedicated just to how to file kubernetes/kubernetes issues, and link to that page from the issue template. I opened kubernetes/kubernetes#95528 to cover updating the flake template, maybe it should expand for both.
||||||
|
||||||
1. **Which job(s) are failing or flaking** | ||||||
|
||||||
The job is the tab in TestGrid that you are looking at. | ||||||
Review comment: Job name is different from tab name. Click on a testgrid cell to see what the actual job name is. Alternatively, tabs that don't have a e.g. https://testgrid.k8s.io/sig-release-1.20-blocking#verify-1.20
The difference is important because:
|
||||||
|
||||||
<img src="./testgrid-images/testgrid-jobs.png" height="50%" width="40%"> | ||||||
Review comment: question: is alt-text a thing we should do here?
Review comment: agree we should alt-text
||||||
|
||||||
The above example was taken from the SIG Release dashboard and we can see that | ||||||
* `conformance-ga-only` https://testgrid.k8s.io/sig-release-master-blocking#conformance-ga-only, | ||||||
* `skew-cluster-latest-kubectl-stable1-gce` https://testgrid.k8s.io/sig-release-master-blocking#skew-cluster-latest-kubectl-stable1-gce, | ||||||
* `gci-gce-ingress` https://testgrid.k8s.io/sig-release-master-blocking#gci-gce-ingress, | ||||||
* `kind-master-parallel` https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel | ||||||
|
||||||
are flaky (we should have some issues opened up for these to investigate why | ||||||
:smile:). | ||||||
Review comment: it will be useful to include the actual issues examples
|
||||||
2. **Which tests are failing or flaking** | ||||||
|
||||||
Let's grab an example from the SIG release dashboards and look at the | ||||||
`node-kubelet-features-master` job in | ||||||
https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master. | ||||||
|
||||||
<img src="./testgrid-images/failed-tests.png" height="70%" width="100%"> | ||||||
|
||||||
Here we see that at 16:07 EDT and 15:07 EDT the test | ||||||
``` | ||||||
[k8s.io] NodeProblemDetector [NodeFeature:NodeProblemDetector] [k8s.io] SystemLogMonitor should generate node condition and events for corresponding errors [ubuntu] | ||||||
``` | ||||||
failed for Kubernetes commit `9af86e8db` (this value is a row below the time; | ||||||
the value above it is the run ID). | ||||||
The corresponding test-infra commit was `11cb57d36` (the value below the commit | ||||||
for Kubernetes). | ||||||
Review comment on lines +109 to +112: The wording here feels a little awkward. I think putting this info with part 3 "Since when has it been failing or flaking" ~ L134 would be more useful, because:
Review comment: agree with @thejoycekung
||||||
|
||||||
At 15:07 EDT, the test | ||||||
``` | ||||||
[k8s.io] NodeProblemDetector [NodeFeature:NodeProblemDetector] [k8s.io] SystemLogMonitor should generate node condition and events for corresponding errors [cos-stable2] | ||||||
``` | ||||||
|
||||||
failed as well. | ||||||
|
||||||
If one or both of these tests continue failing, or if they fail frequently | ||||||
enough, we should open an issue and investigate. | ||||||
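The rule of thumb above (persistent failures or frequent failures both deserve an issue) can be made concrete with a small classifier. This is only a sketch under stated assumptions: the `classify_runs` name, the window of 10 runs, and the three labels are hypothetical choices for illustration and are not how TestGrid itself decides what to color purple or red.

```python
from typing import Sequence


def classify_runs(results: Sequence[bool], window: int = 10) -> str:
    """Classify recent runs of one test.

    results[0] is the newest run; True means the run passed.
    Only the newest `window` runs are considered (hypothetical default).
    """
    recent = list(results[:window])
    if not recent:
        return "unknown"
    failures = recent.count(False)
    if failures == 0:
        return "passing"
    if failures == len(recent):
        return "failing"
    # A mix of passes and failures in the window is treated as flaky.
    return "flaky"


if __name__ == "__main__":
    print(classify_runs([False, True, False, True]))
    print(classify_runs([False, False, False]))
```

Anything classified as `failing` or `flaky` over a reasonable window is a candidate for an issue.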
|
||||||
3. **Since when has it been failing or flaking** | ||||||
|
||||||
You can get this information from the header of the page that shows all the | ||||||
tests. | ||||||
Going from top to bottom, you will see: | ||||||
* date | ||||||
* time | ||||||
* job run ID | ||||||
* Kubernetes commit that was tested | ||||||
* (Most often) test-infra commit | ||||||
Review comment: question (partially for my own knowledge): what else could it be for besides test-infra?
Review comment: Depends on the test. I think there are some jobs that reference the kubeadm commit (for example).
Review comment: Quirk: jobs that use (deprecated) bootstrap.py have the test-infra commit; jobs that use pod-utils do not (e.g. https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel) Customizing docs: https://github.com/kubernetes/test-infra/blob/master/testgrid/config.md#column-headers
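When copying the header values into an issue, it can help to keep them in a small record so that none are forgotten. The sketch below assumes the header rows are supplied as plain strings in the top-to-bottom order listed above; the `ColumnHeader` and `parse_column_header` names are hypothetical, and the last row is treated as optional because some jobs do not show a test-infra commit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnHeader:
    date: str
    time: str
    run_id: str
    kubernetes_commit: str
    # Often the test-infra commit; absent for some jobs.
    extra_commit: Optional[str] = None


def parse_column_header(rows: List[str]) -> ColumnHeader:
    """Build a ColumnHeader from header rows listed top to bottom."""
    if len(rows) < 4:
        raise ValueError("expected date, time, run ID, and Kubernetes commit")
    extra = rows[4] if len(rows) > 4 else None
    return ColumnHeader(rows[0], rows[1], rows[2], rows[3], extra)


if __name__ == "__main__":
    # The date and run ID are placeholders; the commits come from the example above.
    header = parse_column_header(
        ["<date>", "16:07 EDT", "<run-id>", "9af86e8db", "11cb57d36"]
    )
    print(header)
```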
||||||
|
||||||
4. **Reason for failure** | ||||||
|
||||||
The aim of this issue is to begin investigating; you don't have to find the | ||||||
reason for the failure (or the solution) right away. | ||||||
However, do post any information you find useful. | ||||||
|
||||||
One way of getting useful information is to click on the failed runs (the red | ||||||
rectangles). | ||||||
This will send you to a page called [**Spyglass**](https://github.com/kubernetes/test-infra/tree/master/prow/spyglass). | ||||||
|
||||||
If we do this for the above test failures in `node-kubelet-features-master`, we | ||||||
will see the following: | ||||||
|
||||||
<img src="./testgrid-images/spyglass-summary.png" height="60%" width="100%"> | ||||||
|
||||||
Right away it will show you what tests failed. | ||||||
Here we see that 2 tests failed (both related to the node problem detector) and | ||||||
the `e2e.go: Node Tests` stage was marked as failed (because the node problem | ||||||
detector tests failed). | ||||||
|
||||||
You will often see "stages" (steps in an e2e job) mixed in with the tests | ||||||
themselves. | ||||||
The stages tell you what was going on in the e2e job when an error | ||||||
occurred. | ||||||
|
||||||
If we click on the first test error, we will see logs that will (hopefully) help | ||||||
us figure out why the test failed. | ||||||
|
||||||
<img src="./testgrid-images/spyglass-result.png" height="60%" width="100%"> | ||||||
|
||||||
Further down the page you will see all the logs for the entire test run. | ||||||
Please copy any information you think may be useful from here into the issue. | ||||||
|
||||||
Review comment: Maybe is worth mention the artifacts for more logging of other components.
||||||
5. **Anything else we need to know** | ||||||
|
||||||
There is this wonderful page built by SIG testing that often comes in handy: | ||||||
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1 | ||||||
Review comment: Please use the shortlink, the bucket name is going to change eventually (ref: kubernetes/k8s.io#1305)
|
||||||
This page is called **Triage**. | ||||||
We can use it to see if a test we see failing in a given job has been failing in | ||||||
others and, in general, to understand how jobs are behaving. | ||||||
|
||||||
For example, we can see how the job we have been looking at has been behaving | ||||||
recently. | ||||||
|
||||||
There is one important detail to mention at this point: the job names | ||||||
you see on TestGrid are often aliases. | ||||||
For example, when we clicked on a test run for | ||||||
`node-kubelet-features-master` | ||||||
( | ||||||
https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master | ||||||
), the top left corner of the Spyglass page tells us the real job name, | ||||||
Review comment: another question (partially for my own knowledge): when we log an issue for failure/flake should we use the
Review comment: man, i've been taking super long to get through this pr but either one is good. i think that most people will know where to look with either the full name of the alias (usually the full name just adds the "ci-kubernetes-" prefix).
Review comment: job name, not tab name
||||||
`ci-kubernetes-node-kubelet-features` (notice the "ci-kubernetes-" prefix). | ||||||
Review comment on lines +179 to +186: question: would it be better to intersperse this information throughout the doc? like, when we talk about the testgrid tab, mention that these are often aliases, when we introduce spyglass, mention that the title at the top is the "real" job name
Review comment: agree, I suggested earlier in the doc
||||||
Then we can use this full job name in Triage: | ||||||
|
||||||
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=ci-kubernetes-node-kubelet-features | ||||||
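A Triage link like the one above can be built from the full job name. This is a minimal sketch: it uses only the `pr` and `job` query parameters that appear in the URL above, and the `triage_url` helper name is hypothetical. Note that the base URL is the one quoted in this guide, and a review comment points out that the bucket name may change.

```python
from urllib.parse import urlencode

TRIAGE_BASE = "https://storage.googleapis.com/k8s-gubernator/triage/index.html"


def triage_url(job_name: str) -> str:
    """Return a Triage URL filtered to one full job name (not the tab alias)."""
    query = urlencode({"pr": 1, "job": job_name})
    return f"{TRIAGE_BASE}?{query}"


if __name__ == "__main__":
    print(triage_url("ci-kubernetes-node-kubelet-features"))
```

Pass the real job name shown in Spyglass, since the TestGrid tab name may be an alias.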
|
||||||
At the time of this writing we saw the following: | ||||||
|
||||||
<img src="./testgrid-images/triage.png" height="50%" width="100%"> | ||||||
|
||||||
**Note**: you can also refine your query by filtering or excluding | ||||||
results based on test name or failure text. | ||||||
|
||||||
Sometimes, Triage will help you find patterns to figure out what's wrong. | ||||||
In this instance, we can also see that this job has been failing rather | ||||||
frequently (about 2 times per hour). | ||||||
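The five headings walked through in this section can be assembled into a draft issue body before pasting it into GitHub. The sketch below is an illustration only: `render_issue_body` is a hypothetical helper, the section titles are copied from this guide, and the output is not guaranteed to match the actual kubernetes/kubernetes issue template.

```python
from typing import Dict

ISSUE_SECTIONS = [
    "Which job(s) are failing or flaking",
    "Which tests are failing or flaking",
    "Since when has it been failing or flaking",
    "Reason for failure",
    "Anything else we need to know",
]


def render_issue_body(answers: Dict[str, str]) -> str:
    """Render one bold heading per section, with TODO for missing answers."""
    parts = []
    for title in ISSUE_SECTIONS:
        answer = answers.get(title, "TODO")
        parts.append(f"**{title}**\n\n{answer}\n")
    return "\n".join(parts)


if __name__ == "__main__":
    print(render_issue_body({
        "Which job(s) are failing or flaking": "ci-kubernetes-node-kubelet-features",
    }))
```

Sections left as `TODO` are a reminder that it is fine to open the issue before the reason for the failure is known.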
|
||||||
Review comment: Unsure if this is out of scope for this issue or whether it falls more under "revamping the issue template"? -> We should also mention adding a relevant SIG through
||||||
### Iterate | ||||||
|
||||||
Once you have filled out the issue, please mention it in the appropriate mailing | ||||||
list thread (if you see an email from TestGrid mentioning a job or test | ||||||
failure) and share it with the appropriate SIG in the Kubernetes Slack. | ||||||
|
||||||
Don't worry if you are not sure how to debug further or how to resolve the | ||||||
issue! | ||||||
All issues are unique and require a bit of experience to figure out how to work | ||||||
on them. | ||||||
For the time being, reach out to people in Slack or the mailing list. |