Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-49707] (part): Fail the build if an agent has been removed #104

Merged
merged 11 commits into from
Jul 2, 2019

Conversation

jglick
Copy link
Member

@jglick jglick commented Apr 29, 2019

Improved behavior for one subcase of JENKINS-49707, that a disconnected agent is actually removed from Jenkins configuration for whatever reason. We assume that it is not going to be readded (under the same name), so there is no point in waiting for it to reconnect. The one exception that I know of is that a Swarm agent will remove itself upon disconnect but might readd itself later, so this might break cross-disconnect durability of Pipeline builds for Swarm, but that seems like a far less likely case than typical Clouds.

#47 and #48 do something similar across Jenkins restarts. I think the TIMEOUT_WAITING_FOR_NODE_MILLIS grace period there is irrelevant here, since that would be done before the program even resumes.

  • decide about FAILURE vs. ABORTED status
  • test on EC2 with spot instances did not manage to set up the ec2 plugin successfully
  • test against kubernetes plugin with a killed pod
  • consider implementing instead in ExecutorStepExecution (would work not just for DurableTaskStep)

@schottsfired

@basil
Copy link
Member

basil commented Apr 30, 2019

this might break cross-disconnect durability of Pipeline builds for Swarm, but that seems like a far less likely case than typical Clouds

I rely on cross-disconnect durability of Pipeline builds for Swarm across my entire Jenkins fleet, so this would be a serious regression from my point of view. Would you be willing to run PipelineJobTest#buildShellScriptAfterRestart from jenkinsci/swarm-plugin#81 with these changes to ensure this workflow keeps working?

@jglick
Copy link
Member Author

jglick commented Apr 30, 2019

Would you be willing to run PipelineJobTest#buildShellScriptAfterRestart from jenkinsci/swarm-plugin#81 with these changes

Sure, though I doubt that would reproduce any problem anyway, as this patch is about the case that a client is disconnected and removed while the Jenkins master is still running. To test this patch you would need to kill the client and then have it restart, in a single “story”. Reconnection of a Swarm client across restarts is what #47/#48 should have been affecting; in that case there was a grace period, needed since SwarmSlave is an EphemeralNode and so we actually have to wait for it to be reattached.


@basil this code looks wrong to me. It could all be replaced by simply

jenkins.addNode(slave);

since that method already replaces any existing node of the same name, without firing the deleted event. (Currently it always fires the created event, even when it ought to be firing updated; if I get a moment I will try to patch that.)

@jglick
Copy link
Member Author

jglick commented Apr 30, 2019

Currently it always fires the created event, even when it ought to be firing updated

jenkinsci/jenkins#4004

@basil
Copy link
Member

basil commented Apr 30, 2019

Sure, though I doubt that would reproduce any problem anyway, as this patch is about the case that a client is disconnected and removed while the Jenkins master is still running. To test this patch you would need to kill the client and then have it restart, in a single “story”.

My latest update to jenkinsci/swarm-plugin#81 adds a new test for this case in PipelineJobTest#buildShellScriptAfterDisconnect.

@basil
Copy link
Member

basil commented Apr 30, 2019

@basil this code looks wrong to me. It could all be replaced by simply

jenkins.addNode(slave);

since that method already replaces any existing node of the same name, without firing the deleted event.

That change makes sense to me and my new integration tests still pass with it, so I opened jenkinsci/swarm-plugin#82.

@jglick
Copy link
Member Author

jglick commented Apr 30, 2019

Even with jenkinsci/swarm-plugin#82, buildShellScriptAfterDisconnect will fail due to the removeNode call from SwarmSlave.SELF_CLEANUP_LAUNCHER.afterDisconnect. I will check if a grace period helps.

@jglick
Copy link
Member Author

jglick commented Jun 6, 2019

hudson.plugins.swarm.PipelineJobTest succeeds after

diff --git a/plugin/pom.xml b/plugin/pom.xml
index bfda46a..1a913a9 100644
--- a/plugin/pom.xml
+++ b/plugin/pom.xml
@@ -77,7 +77,7 @@
       </plugins>
     </build>
     <properties>
-        <workflow-support-plugin.version>2.13</workflow-support-plugin.version>
+        <workflow-support-plugin.version>3.3</workflow-support-plugin.version>
     </properties>
     <dependencyManagement>
         <dependencies>
@@ -90,19 +90,19 @@
             <dependency>
                 <groupId>org.jenkins-ci.plugins</groupId>
                 <artifactId>scm-api</artifactId>
-                <version>1.3</version>
+                <version>2.2.6</version>
                 <scope>test</scope>
             </dependency>
             <dependency>
                 <groupId>org.jenkins-ci.plugins</groupId>
                 <artifactId>structs</artifactId>
-                <version>1.5</version>
+                <version>1.18</version>
                 <scope>test</scope>
             </dependency>
             <dependency>
                 <groupId>org.jenkins-ci.plugins</groupId>
                 <artifactId>script-security</artifactId>
-                <version>1.24</version>
+                <version>1.58</version>
                 <scope>test</scope>
             </dependency>
         </dependencies>
@@ -111,7 +111,7 @@
         <dependency>
             <groupId>org.jenkins-ci.plugins.workflow</groupId>
             <artifactId>workflow-api</artifactId>
-            <version>2.12</version>
+            <version>2.33</version>
             <scope>test</scope>
         </dependency>
         <dependency>
@@ -123,13 +123,13 @@
         <dependency>
             <groupId>org.jenkins-ci.plugins.workflow</groupId>
             <artifactId>workflow-cps</artifactId>
-            <version>2.28</version>
+            <version>2.70</version>
             <scope>test</scope>
         </dependency>
         <dependency>
             <groupId>org.jenkins-ci.plugins.workflow</groupId>
             <artifactId>workflow-durable-task-step</artifactId>
-            <version>2.4</version>
+            <version>2.32-rc917.b70f8edaf896</version> <!-- https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/104 -->
             <scope>test</scope>
         </dependency>
         <dependency>
@@ -141,7 +141,7 @@
         <dependency>
             <groupId>org.jenkins-ci.plugins.workflow</groupId>
             <artifactId>workflow-step-api</artifactId>
-            <version>2.7</version>
+            <version>2.20</version>
             <scope>test</scope>
         </dependency>
         <dependency>
diff --git a/pom.xml b/pom.xml
index d5fbe91..4ce3480 100644
--- a/pom.xml
+++ b/pom.xml
@@ -24,7 +24,7 @@
     <version>3.18-SNAPSHOT</version>
 
     <properties>
-        <jenkins.version>2.60.3</jenkins.version>
+        <jenkins.version>2.150.1</jenkins.version>
         <java.level>8</java.level>
     </properties>
 

@jglick jglick marked this pull request as ready for review June 6, 2019 17:43
@jglick jglick requested a review from car-roll June 6, 2019 17:47
Copy link
Member

@dwnusbaum dwnusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general, added some comments/questions.

return null;
} else {
LOGGER.fine(() -> node + " has been removed for a while, assuming it is not coming back");
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ABORTED really the result we want in this case? My initial thought was that FAILURE would be more appropriate. Thoughts?

Looks like at least one reason for ABORTED is that over in RemovedNodeListener you call BodyExecution.cancel, and CpsBodyExecution always throws a FlowInterruptedException with result ABORTED in that method, so it makes sense to be consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally FAILURE is used for cases where something about the build itself broke. The typical scenario here is that the build started running, something like a spot instance got killed, and so the build could not complete due to external causes. Prior to this patch (plus downstream where appropriate) the build would just hang until manually ABORTED; now the abort is automatic.

// Part of JENKINS-49707: check whether an agent has been removed.
// (Note that a Computer may be missing because a Node is offline,
// and conversely after removing a Node its Computer may remain for a while.
// Therefore we only fail here if _both_ are absent.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that really the case here? It looks like we only check the Node, not the Computer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FilePathUtils.find above already checked nullness of Jenkins.getComputer.

* {@link ExecutorPickle} is written to the stream first, which holds a {@link PlaceholderTask},
* and the {@link BodyExecution} holds {@link PlaceholderTask.Callback} whose {@link WorkspaceList.Lease}
* is not processed by {@link WorkspaceListLeasePickle} since pickles are not recursive.
* So we make a best effort and only try to cancel a body within the current session.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So IIUC, if you restart Jenkins while a Pipeline is executing inside of a node block, and then remove the Node from Jenkins, that's how you get to the "Agent foo was deleted, but do not have a node body to cancel" branch in RemovedNodeListener? Or session here refers to something else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding is correct. The patch to DurableTaskStep should still suffice to abort the build if it was inside sh when the node was removed.

If you remove the node during the restart, a different bit of code in ExecutorPickle (existing long before this patch) kicks in and aborts the build before it even resumes the CPS VM thread.

jglick and others added 2 commits July 1, 2019 10:30
Co-Authored-By: Devin Nusbaum <[email protected]>
@dwnusbaum dwnusbaum merged commit 56d41b7 into jenkinsci:master Jul 2, 2019
@slaughter550
Copy link
Contributor

@jglick Quick question, we came across this and I was wondering how to set org.jenkinsci.plugins.workflow.support.pickles.ExecutorPickle.timeOutForNodeMillis correctly? We tried setting it on jenkins master with an arg on Jenkins startup with -D but it doesn't appear to be inheriting correctly. I'm new to Jenkins so apologies if it's a stupid question. Thanks!

@jglick
Copy link
Member Author

jglick commented Jul 24, 2020

-D is used for this sort of thing. Maybe try the users’ mailing list?

@slaughter550
Copy link
Contributor

For anyone who finds this later, its org.jenkinsci.plugins.workflow.support.pickles.ExecutorPickle.timeoutForNodeMillis where timeout is all lower case. The changelog here is incorrect

@jglick
Copy link
Member Author

jglick commented Jul 28, 2020

@slaughter550 you can file a PR for https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/master/CHANGELOG.md#232 as needed though I see no mention of the property to be misspelled.

@slaughter550
Copy link
Contributor

slaughter550 commented Sep 19, 2020

@jglick PR submitted, it's not misspelled, its incorrectly capitalized. #142

jglick added a commit to jglick/workflow-durable-task-step-plugin that referenced this pull request Jan 5, 2022
…ut `BodyExecution`.

When we do not use pickles, there is no need for `transient`.
Moving the reference to `ExecutorStepExecution` from `PlaceholderTask` avoids a memory leak.
This is only possible now that `Callback` holds the `ExecutorStepExecution`.
And we can now properly handle `AsynchronousExecution.interrupt`.
jglick added a commit that referenced this pull request Oct 17, 2022
…tion of `node` step, not kill whole build (#180)

* Sketch of non-`Pickle`-based resumption of `ExecutorStepExecution`

* Amending test from #34

* Fixing `Callback`

* Exploring where to detect a failure eligible for retry

* Making `normalNodeDisappearance` pass under somewhat altered circumstances

* First successful `node` block retry

* `FilePathDynamicContext` must take precedence over `ExecutorStepDynamicContext.FilePathTranslator`

* `ExecutorStepTest.unloadableExecutorPickle` as currently conceived no longer makes sense

* Fixing `ExecutorStepTest.nodeDisconnectMissingContextVariableException` by copying some diagnostic code from `FilePathDynamicContext`

* Fixing `ExecutorStepTest.contextualizeFreshFilePathAfterAgentReconnection`

* Marking `ExecutorStepRetryEligibility` beta

* Javadoc imports

* SpotBugs: does not make sense to allow `ExecutorStepDynamicContext.node` to be null

* Reasonable behavior of `ExecutorPickleTest.canceledQueueItem`

* User-visible logging about 5m timeout now that `ExecutorPickle.printWaitingMessage` is unused

* `ExecutorPickleTest` → `ExecutorStepDynamicContextTest`

* Move `TIMEOUT_WAITING_FOR_NODE_MILLIS` from `ExecutorPickle` to `ExecutorStepExecution`

* Some user-visible logging when `ExecutorStepRetryEligibility` is _not_ used

* Do not block on `FutureImpl`.
After jenkinsci/workflow-step-api-plugin#73 this caused `SecretsMasker` to block on a `KubernetesComputer`
which in turn blocked provisioning of a `TaskListener`, causing `RestartPipelineTest.terminatedPodAfterRestart` to fail to find a message:

```
FINE	o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: rediscovering that terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed and timeout has expired
FINE	o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: could not get TaskListener in CpsStepContext[9:sh]:Owner[terminated Pod After Restart/1:terminated Pod After Restart #1]
java.util.concurrent.TimeoutException
	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:102)
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:115)
Caused: java.io.IOException
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:118)
	at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:97)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet$1Delegate.doGet(ContextVariableSet.java:98)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:85)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:73)
	at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:95)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.getContextVariable(CpsThread.java:135)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:297)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.getListener(DefaultStepContext.java:127)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:87)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution._listener(DurableTaskStep.java:421)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.listener(DurableTaskStep.java:412)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:363)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:570)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:553)
	at …
FINE	o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener#getLogger: creating filtered stream
FINE	o.j.p.w.s.d.DurableTaskStep$Execution#_listener: terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed for 15 sec, assuming it is not coming back
```

* A `sh` step could fail to find a workspace at the moment it starts.
java.lang.NullPointerException
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:329)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
	at …

* Also handling `ClosedChannelException` as commonly thrown by `SimpleBuildStep`s and the like: jenkinsci/kubernetes-plugin#1083 (comment)

* Also retry after `MissingContextVariableException` on `FilePath`

* Remove `super.onResume` left over in #46; see jenkinsci/workflow-step-api-plugin#66

* Documenting need for JENKINS-30383

* jenkinsci/workflow-step-api-plugin#77 allows `retryNodeBlockSynchAcrossRestarts` to be fixed

* jenkinsci/workflow-step-api-plugin#77 released

* Expanding `MissingContextVariableException` type list after noticing that `isUnix` requests `Launcher` not `FilePath`

* Never log a message with only `PlaceholderTask.cookie`; also need `.runId` for context

* Suppressing uninteresting log message

* Maybe need to call `StepContext.saveState`?

* Deleting comment; for now it seems clearest to keep `ExecutorStepDynamicContext` as a distinct type

* Resolve longstanding tech debt from #104 & #117 about `BodyExecution`.
When we do not use pickles, there is no need for `transient`.
Moving the reference to `ExecutorStepExecution` from `PlaceholderTask` avoids a memory leak.
This is only possible now that `Callback` holds the `ExecutorStepExecution`.
And we can now properly handle `AsynchronousExecution.interrupt`.

* Sketch of updating `WorkspaceStepExecution`: #180 (comment)

* Allowing `FilePathDynamicContext` to block waiting for an agent to reconnect.
Not great since the CPS VM thread will only grant 5s.
Better would be to avoid unnecessary use of `StepContext.get(FilePath.class)`
(which implicitly requires a live `Channel`);
perhaps `FilePathRepresentation` should become an API.
Amends b65ff1c.
#180 (comment)

* Adding a comment about 8b08398

* 8b08398 seems to allow a968adf to be finished.

* Refining 8b08398 to avoid tail call

* `org.jenkinsci.plugins.workflow.support.concurrent.Futures` deprecated

* SpotBugs reminds me that `WorkspaceStepExecution.Callback` needs an SVUID; taking one from trunk

* Adding `NodeTranslator`

* Beginning rewrite to `AgentErrorCondition`

* Extracted `RetryExecutorStepTest` from `ExecutorStepTest`

* Rely on blocking `ExecutorStepExecution.onResume`

* Removing bogus `waitForMessage` (and unnecessary `interrupt`) from `unrestorableAgent`. Sometimes a failing `waitForMessage` seems to trigger https://maven.apache.org/surefire/maven-surefire-plugin/faq.html#corruptedstream though I cannot track down why.

* `PlaceholderTask` may not hold a reference to `ExecutorStepExecution`

* Equivalent of #184 for `ExecutorStepDynamicContext`

* Assert jenkinsci/workflow-api-plugin@b1778a9

* jenkinsci/workflow-cps-plugin#534 & jenkinsci/workflow-job-plugin#260 released

* Windows tests jenkinsci/workflow-basic-steps-plugin#203 (comment)

* Avoiding `powershell` jenkinsci/workflow-basic-steps-plugin#203 (comment)

* Better usage of Hamcrest; got a flake https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=6779178701 with `QueueTaskCancelled` instead of `RemovedNodeCause`

* In fact the flake was in a different test (`canceledQueueItem`)

* Forgot to make `canceledQueueItem` not use `sh`, though it should not really matter since this step is expected to fail anyway

* Test flake: https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=7240329028

* Use `AgentOfflineException` from `ExecutorStepDynamicContext` as well

* Can now run all of `AgentErrorConditionTest`

* Responding to `CancellationException` in `ExecutorStepDynamicContext` with `QueueTaskCancelled`, analogous to handling in `ExecutorPickle`

* Permit `canceledQueueItem` to pass on `RemovedNodeCause`

* Normalizing indentation

* jenkinsci/workflow-step-api-plugin#124 deprecates an overload which 2e95e3a only handles at one call site; TBD if others should be switched to `actualInterruption`

* Refined `ExecutorStepDynamicContext` behavior in case of `quietingDown` #180 (comment) or `terminating` #180 (comment)

* Add an explicit `serialVersionUID` to `ExecutorStepExecution.PlaceholderTask.Callback` #180 (comment)

* Optimizing `withExecution` to only consider steps in the current build #180 (comment)

* Pick up jenkinsci/workflow-api-plugin#256 #180 (comment)

* jenkinsci/workflow-api-plugin#256 released
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants