Packaging job on ci.jenkins.io never completes - retries fail #4106

Closed
MarkEWaite opened this issue May 27, 2024 · 6 comments

Comments

@MarkEWaite

Service(s)

ci.jenkins.io

Summary

The packaging job on ci.jenkins.io has failed to allocate an agent for the last 8 hours.

Reproduction steps

  1. Open the packaging job, confirm that the agent is not allocated
@MarkEWaite added the triage label (Incoming issues that need review) on May 27, 2024
@dduportal self-assigned this on May 27, 2024
@dduportal removed the triage label (Incoming issues that need review) on May 27, 2024
@dduportal added this to the infra-team-sync-2024-05-28 milestone on May 27, 2024
@dduportal
Contributor

The pipeline logs show errors like the following, indicating that the pod agents failed to start (they were created in Kubernetes but never reached the "Running" state with an active connection to the controller):

06:17:06  ERROR: Failed to launch packaging-packaging-master-329-9hqpp-g6p96-v3n93
06:17:06  io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [1000000] milliseconds for [Pod] with name:[packaging-packaging-master-329-9hqpp-g6p96-v3n93] in namespace [jenkins-agents].
06:17:06  	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:939)
06:17:06  	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:921)
06:17:06  	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:97)
06:17:06  	at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:185)
06:17:06  	at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
06:17:06  	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
06:17:06  	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
06:17:06  	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
06:17:06  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
06:17:06  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
06:17:06  	at java.base/java.lang.Thread.run(Unknown Source)
06:17:06  Pod jenkins-agents/packaging-packaging-master-329-9hqpp-g6p96-v3n93 was just deleted

@dduportal
Contributor

FWIW, the last successful build was on May 10, 2024. Since then, we've moved the ci.jenkins.io Linux container agent workloads from AWS/DigitalOcean to Azure: this might (or might not) be related.

Currently checking the pipeline agent label and setup, along with other errors in the controller and pods.

@dduportal
Contributor

Oh: this pipeline uses a custom pod template definition: https://github.com/jenkinsci/packaging/blob/3bb91b8dcb84387f4c04497d118109f029681fe2/Jenkinsfile#L12 (https://github.com/jenkinsci/packaging/blob/master/KubernetesPod.yaml).

This has most probably been broken since #3954. Let's see if we can reuse the "all in one container image" here (cc @smerle33 FYI).

@dduportal
Contributor

Confirmed the issue: the "custom" pod template definition is missing tolerations, which prevents it from being scheduled.
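For readers unfamiliar with why a missing toleration keeps a pod pending, here is a minimal Python sketch of the Kubernetes taint/toleration matching rule. It is a simplified model for illustration, not the scheduler's code, and the taint key and value used in the example (`example.com/dedicated=jenkins-agents`) are hypothetical: the actual taints on the ci.jenkins.io node pools are not stated in this issue.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str      # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with "Exists" tolerates every taint
    operator: str = "Equal"  # "Equal" (the default) or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    # "Equal": key and value must both match the taint.
    return tol.key == taint.key and tol.value == taint.value


def can_schedule(taints: List[Taint], tolerations: List[Toleration]) -> bool:
    """A node is eligible only if every hard taint is tolerated.

    "PreferNoSchedule" is a soft preference, so it never blocks scheduling.
    """
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Hypothetical taint on a dedicated agent node pool (an assumption).
node_taints = [Taint("example.com/dedicated", "NoSchedule", "jenkins-agents")]

# A pod template without tolerations cannot land on the tainted nodes.
print(can_schedule(node_taints, []))

# The same pod with a matching toleration can.
matching = Toleration("example.com/dedicated", "Equal", "jenkins-agents", "NoSchedule")
print(can_schedule(node_taints, [matching]))
```

If every node that could host the pod carries such a taint, a pod without the toleration stays unscheduled until the launcher times out, which would be consistent with the `KubernetesClientTimeoutException` in the log above.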

As it uses the jenkinsciinfra/packaging Docker image, using the "All in one" image might be counter-productive as this job should have the same behavior in both CI and (private) CD.

We might have to update the definition to inherit from an existing pod template (to get the admin-defined parameters) and only override what is needed (at least the image).
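To make the "inherit and only override what is needed" idea concrete, here is a rough Python sketch. It is a simplified model and not the Jenkins Kubernetes plugin's actual merge logic; the function name `inherit_pod_template`, the parent template contents, the `jnlp` container name, and the image tags are all assumptions made for illustration.

```python
import copy
from typing import Any, Dict


def inherit_pod_template(parent: Dict[str, Any], child: Dict[str, Any]) -> Dict[str, Any]:
    """Start from the parent template and apply only what the child sets.

    Pod-level fields set by the child replace the parent's; containers are
    matched by name so a child can override a single field such as the image.
    """
    merged = copy.deepcopy(parent)
    for field, value in child.items():
        if field != "containers":
            merged[field] = copy.deepcopy(value)
    by_name = {c["name"]: c for c in merged.get("containers", [])}
    for container in child.get("containers", []):
        by_name.setdefault(container["name"], {}).update(copy.deepcopy(container))
    merged["containers"] = list(by_name.values())
    return merged


# Hypothetical admin-defined template carrying the scheduling parameters.
admin_template = {
    "nodeSelector": {"kubernetes.io/os": "linux"},
    "tolerations": [
        {"key": "example.com/dedicated", "operator": "Equal",
         "value": "jenkins-agents", "effect": "NoSchedule"},
    ],
    "containers": [
        {"name": "jnlp", "image": "example/agent:latest",
         "resources": {"limits": {"memory": "1Gi"}}},
    ],
}

# The job-specific definition only overrides the image (tag is a placeholder).
packaging_override = {
    "containers": [{"name": "jnlp", "image": "jenkinsciinfra/packaging:latest"}],
}

pod = inherit_pod_template(admin_template, packaging_override)
print(pod["containers"][0]["image"])
print(pod["tolerations"])
```

The point of this design is that the tolerations and node selector live in one admin-controlled place, so a future infrastructure move does not silently break jobs that ship their own pod definition.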

@dduportal
Contributor

PR opened in jenkinsci/packaging#465. I can't tell if it fails because I'm not a maintainer of the repo, or if it is an issue due to my PR.

@dduportal
Contributor

PR merged, agents are running
