Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not ask API server for a Pod during deletion unnecessarily #1304

Merged
merged 2 commits into from
Feb 1, 2023

Conversation

jglick
Copy link
Member

@jglick jglick commented Jan 31, 2023

Not tested yet, but hoping this would avoid

java.net.SocketTimeoutException: connect timed out
        at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
        at …
        at java.base/java.net.Socket.connect(Socket.java:609)
        at okhttp3.internal.platform.Platform.connectSocket(Platform.java:129)
        at …
Caused: java.io.IOException: connect timed out
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:533)
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:570)
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleGet(OperationSupport.java:482)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleGet(BaseOperation.java:742)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.getMandatory(BaseOperation.java:177)
Caused: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: […]  in namespace: […]  failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(Kub
ernetesClientException.java:159)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.getMandatory(BaseOperation.java:182)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:144)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:93)
        at org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave._terminate(KubernetesSlave.java:313)
        at …

Probably the actual deletion would fail later anyway, but why make two API calls when one suffices (assuming you have not actually configured OnFailure)?

@jglick jglick requested a review from a team as a code owner January 31, 2023 23:16
@jglick jglick added the bug Bug Fixes label Jan 31, 2023
sh 'mvn -B -ntp -Dset.changelist -Dmaven.test.failure.ignore clean install'
infra.prepareToPublishIncrementals()
junit 'target/surefire-reports/*.xml'
retry(count: 3, conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()]) {
Copy link
Member

@Vlatombe Vlatombe Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[offtopic] handleNonKubernetes: true feels weird, shouldn't the condition be renamed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well this is to handle the scenario on ci.jenkins.io where we are requesting a label which currently is handled by the kubernetes plugin but admins have left open the possibility of switching to some other cloud if performance merits it.

@Vlatombe
Copy link
Member

Vlatombe commented Feb 1, 2023

rebuilding

@Vlatombe Vlatombe closed this Feb 1, 2023
@Vlatombe Vlatombe reopened this Feb 1, 2023
@Vlatombe
Copy link
Member

Vlatombe commented Feb 1, 2023

Flake

 16.155 [run In Pod With Restart With Multiple Container Calls #1] run-in-pod-with-restart-with-multiple-container-calls-1-2-przqw has been removed for 15 sec, assuming it is not coming back

would that cause ABORTED build status even though it seems the build is able to complete its steps ?

Related: jenkinsci/workflow-durable-task-step-plugin#180

Prior to that

   1.124 [run In Pod With Restart With Multiple Container Calls #1] Waiting for reconnection of run-in-pod-with-restart-with-multiple-container-calls-1-2-przqw before proceeding with build
  13.929 [id=194]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #6 failed: java.io.EOFException
  14.047 [id=196]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #7 from /10.244.0.32:32956

Maybe 15 seconds here is a bit too short considering the 10 seconds delay between retries ? https://github.com/jenkinsci/remoting/blob/952d22bc5673950f53a949b49fbe1427d99324d9/src/main/java/hudson/remoting/Engine.java#L688

@Vlatombe Vlatombe merged commit a982397 into jenkinsci:master Feb 1, 2023
@jglick
Copy link
Member Author

jglick commented Feb 1, 2023

Maybe 15 seconds here is a bit too short

Probably; trying to make routine tests run in a reasonable amount of time. FYI jenkinsci/workflow-durable-task-step-plugin#284

@jglick jglick deleted the lazy-pod branch February 1, 2023 13:39
@jglick
Copy link
Member Author

jglick commented Feb 27, 2023

FTR some other PRs related to pod deletion: #1095, #1227

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants