jobs erroring out due to failure to fetch token #20816
Comments
cc @chaodaiG per go.k8s.io/oncall
I would move kubernetes/kubernetes#98768 here, but CI signal is probably tracking that one, and it predates this. Filing here with additional details and for visibility.
The initupload logs shown are only the tail; I may need to see the full log to understand this better. Which cluster does this use, and how can I get access to it? @BenTheElder
If the job is using …: owner access to this project (and all other prow-related projects in kubernetes.io) is defined by https://github.com/kubernetes/k8s.io/blob/5b2304f875d8b877764a7bde5398eb6f4ffb0494/groups/groups.yaml#L645-L665, and viewer access is defined by https://github.com/kubernetes/k8s.io/blob/5b2304f875d8b877764a7bde5398eb6f4ffb0494/groups/groups.yaml#L667-L701. I'll open a PR to reconcile the …
Thank you @spiffxp, please tag me on the PR.
@chaodaiG opened kubernetes/k8s.io#1636 and cc'ed you
/priority important-soon
Some stats from scanning through all prow jobs in the past 24 hours (~24,000), by downloading …
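One plausible way to gather stats like this — assuming the "downloading" refers to pulling each run's recorded pod state (podinfo.json, where prow's GCS reporter records it, if that reporting is enabled) from the results bucket — is sketched below; the bucket name and object prefix are placeholders rather than the actual values used:

```go
// Hypothetical sketch: walk recent prow job results in GCS and count runs whose
// recorded pod state mentions the token-fetch failure. Bucket and prefix are placeholders.
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"strings"
	"time"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Placeholder bucket/prefix: substitute the actual prow results bucket and job prefix.
	bucket := client.Bucket("example-prow-results")
	it := bucket.Objects(ctx, &storage.Query{Prefix: "logs/"})

	cutoff := time.Now().Add(-24 * time.Hour)
	matches := 0
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// Only look at recorded pod state (podinfo.json) for runs from the last 24 hours.
		if !strings.HasSuffix(attrs.Name, "/podinfo.json") || attrs.Created.Before(cutoff) {
			continue
		}
		r, err := bucket.Object(attrs.Name).NewReader(ctx)
		if err != nil {
			continue // object may have been pruned since listing; skip it
		}
		body, _ := io.ReadAll(r)
		r.Close()
		if strings.Contains(string(body), "cannot fetch token") {
			matches++
			fmt.Println(attrs.Name)
		}
	}
	fmt.Printf("runs whose recorded pod state mentions a token-fetch failure: %d\n", matches)
}
```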
https://console.cloud.google.com/logs/query;query=%22oauth2:%20cannot%20fetch%20token%22%0Aresource.labels.container_name%3D%22initupload%22;timeRange=P14D?project=k8s-infra-prow-build
But nothing of the sort for k8s-infra-prow-build-trusted, nor for k8s-prow or k8s-prow-builds. What changed?
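Roughly the same query can be run outside the console; below is a minimal sketch using the Go logadmin client, assuming application default credentials. The project ID (k8s-infra-prow-build), the "oauth2: cannot fetch token" string, the initupload container filter, and the 14-day window all come from the linked query; the tally-only output is just illustrative.

```go
// Sketch: count Cloud Logging entries matching the initupload token-fetch error
// over the last 14 days, mirroring the console query linked above.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"cloud.google.com/go/logging/logadmin"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := logadmin.NewClient(ctx, "k8s-infra-prow-build")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	since := time.Now().Add(-14 * 24 * time.Hour).Format(time.RFC3339)
	filter := fmt.Sprintf(
		`"oauth2: cannot fetch token" AND resource.labels.container_name="initupload" AND timestamp>=%q`, since)

	it := client.Entries(ctx, logadmin.Filter(filter))
	count := 0
	for {
		entry, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		count++
		_ = entry // could inspect entry.Timestamp / entry.Payload per hit
	}
	fmt.Printf("matching initupload log entries in the last 14 days: %d\n", count)
}
```

Running it against the other projects mentioned above (k8s-infra-prow-build-trusted, k8s-prow, k8s-prow-builds) is just a matter of swapping the project ID.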
Nodes from Jan 22nd still exist in greenhouse's pool, but the autoscaled node pool has all nodes from today. Checking the cluster operation logs: it looks like @spiffxp upgraded the cluster on the 10th, and I don't see any cluster operations around February 2nd at all, so I don't think this is the case.
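For reference, the cluster operation history can also be pulled programmatically; here is a minimal sketch using the GKE API's Go client (roughly equivalent to gcloud container operations list), assuming application default credentials and that k8s-infra-prow-build is the project in question:

```go
// Hypothetical sketch: list recent GKE cluster operations for the project, to see
// whether anything (upgrade, node-pool change) happened around when failures started.
package main

import (
	"context"
	"fmt"
	"log"

	container "google.golang.org/api/container/v1"
)

func main() {
	ctx := context.Background()
	svc, err := container.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// "-" requests operations across all zones/locations in the project.
	resp, err := svc.Projects.Zones.Operations.List("k8s-infra-prow-build", "-").Do()
	if err != nil {
		log.Fatal(err)
	}
	for _, op := range resp.Operations {
		fmt.Printf("%s\t%s\t%s\t%s\n", op.StartTime, op.OperationType, op.Status, op.TargetLink)
	}
}
```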
Is k8s-infra-prow-builds the only one with autoscaling? (I think so?) If so, I suspect this is an issue with autoscaling the in-cluster DNS. @chaodaiG mentioned a workaround for this applied to another prow cluster. EDIT: this has been a fairly obvious suspect for some time IMHO, but I forgot to mention it.
/close

How does this line up with updating …
So the problem stopped due to one of: …

It took about 4.5h to fully recreate the nodes; I opted not to try these out one-by-one to determine exactly which one solved it.
@spiffxp: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
Seeing jobs fail early on with no logs more and more often lately, and not the signs we've seen before like "failed scheduling" (need to increase nodes) or github clone flakes (kubernetes/kubernetes is a huge clone, still need to mitigate that, #18226).
Instead we're seeing errors like "oauth2: cannot fetch token" from the initupload container.
This is not really surfaced to the user unless they know to go digging into the recorded pod state. Instead you just see the job marked as failed, with no logs.
Setting aside the UX, we should fix this error; we're seeing it a lot, even on jobs that are running with guaranteed QoS.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
This is happening pretty often I think, but it's hard to pin down as the job state is just "failed" and the runtime is ~16m36s (which seems oddly specific?). With https://prow.k8s.io/?repo=kubernetes%2Fkubernetes&type=presubmit&state=failure I see 28 hits for 16m36s.
Please provide links to example occurrences, if any:
kubernetes/kubernetes#98768
kubernetes/kubernetes#98948 (comment)
Anything else we need to know?:
/area prow