Add script gathering and reporting health indicators #185
Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>
gh_username = os.environ["GITHUB_USERNAME"]
gh_token = os.environ["GITHUB_API_TOKEN"]
Nice-to-have: I'd like us to standardize on using the GitHub token from gh (the official GitHub CLI) when it's available.
I have this code snippet in one of my scripts. It would need to be adapted, because here you need the username as well.
import subprocess

def try_get_gh_auth_token() -> str:
    """Get the default authentication token from gh (the official GitHub client).

    Return an empty string if there is no such token or if gh is not available.
    """
    # TODO: allow specifying an alternative host name and user name
    try:
        output = subprocess.check_output(['gh', 'auth', 'token'],
                                         stderr=subprocess.DEVNULL)
        return output.strip().decode('ascii')
    except subprocess.CalledProcessError:
        # gh is installed but has no token (e.g. not logged in).
        return ''
    except FileNotFoundError:
        # gh is not installed.
        return ''
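One possible adaptation that also returns the username, as a rough sketch rather than a finished proposal. It assumes that `gh api user --jq .login` is an acceptable way to obtain the login of the authenticated user (this makes a network request), and that falling back to the `GITHUB_USERNAME` and `GITHUB_API_TOKEN` environment variables is the desired behaviour when gh gives no answer. The names `get_gh_credentials` and `_try_gh`, and the `env`/`run` parameters (there only to make the function testable), are hypothetical.

```python
import os
import subprocess


def _try_gh(args, run=subprocess.check_output) -> str:
    """Run a gh subcommand; return its stripped output, or '' on failure."""
    try:
        output = run(['gh'] + list(args), stderr=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, FileNotFoundError):
        # gh is missing, not logged in, or the command failed.
        return ''
    return output.strip().decode('utf-8')


def get_gh_credentials(env=os.environ, run=subprocess.check_output):
    """Return (username, token), preferring gh over environment variables.

    Raise KeyError if gh gives no answer and the matching environment
    variable is not set.
    """
    # Assumption: values from gh take precedence over the environment.
    token = _try_gh(['auth', 'token'], run) or env['GITHUB_API_TOKEN']
    # Assumption: the authenticated user's login is the username we want.
    username = (_try_gh(['api', 'user', '--jq', '.login'], run)
                or env['GITHUB_USERNAME'])
    return username, token
```

Note that with this sketch the token and the username can come from different sources (for example, the token from gh and the username from the environment) if only one of the gh calls succeeds. Whether that should be an error is a design decision left open here.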
Currently two indicators are reported:
1. Success rate of the nightly jobs. We don't expect "real" failures here,
so any failure is likely to be an infra issue or a flaky test.
2. Execution time of PR jobs.
Nice-to-have: an indicator for jobs that fail without a reported cause in the failure list. The failure list is an artifact called failures.csv or failures.csv.xz. “Without a reported cause” means: failures.csv.xz doesn't exist, and (failures.csv doesn't exist or failures.csv has size 0).
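To make the condition concrete, here is a minimal sketch of the predicate. It assumes the script can build, for each failed job, a mapping from artifact name to size in bytes; how that mapping is obtained from the CI is not shown, and the name `lacks_reported_cause` is hypothetical.

```python
from typing import Mapping


def lacks_reported_cause(artifact_sizes: Mapping[str, int]) -> bool:
    """Tell whether a failed job has no reported cause in its failure list.

    artifact_sizes maps artifact names to their sizes in bytes
    (an assumed representation, not something the script provides today).
    """
    # A compressed failure list counts as a reported cause if it exists.
    if 'failures.csv.xz' in artifact_sizes:
        return False
    # Otherwise the cause is unreported when failures.csv is absent or empty.
    return artifact_sizes.get('failures.csv', 0) == 0
```

For example, `lacks_reported_cause({'failures.csv': 0})` is true, while `lacks_reported_cause({'failures.csv.xz': 64})` is false.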
Even better, exclude jobs where the sole failure is that outcome analysis is unhappy.
With Mbed-TLS/mbedtls#9286, which adds an outcome line for running each component, this would count jobs that fail solely due to infrastructure problems (e.g. timeout, network glitches), as well as jobs that fail in outcome analysis. Thus this indicator could become a proxy for jobs that fail solely due to infrastructure problems.
I would ideally like to have an indicator that detects all infrastructure problems, but that seems hard.
This is a script I've been using to get some stats I see as health indicators for the CI.