[JENKINS-67164] Call `StepExecution.onResume` in response to `WorkflowRun.onLoad` not `FlowExecutionList.ItemListenerImpl` #221

jglick · 2022-05-11T19:33:47Z

JENKINS-67164: Reapplies #178 and its amendment #188, reverting the reversion #198.

Fixes the problem identified in `SSHAgentStepWorkflowTest.sshAgentAvailableAfterRestart`, whereby a `sh` step began running while an `sshagent` step was still in the middle of an `onResume`. It does so by introducing a `FlowExecution.afterStepExecutionsResumed` hook, which `CpsFlowExecution` will implement to unpause the program near the end of `WorkflowRun.onLoad`, rather than unpausing immediately upon deserializing `program.dat`.
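To illustrate the ordering this hook is meant to guarantee, here is a minimal sketch. It does not use the real Jenkins classes: `SketchStep`, `SketchExecution`, and `ResumeOrderSketch` are hypothetical stand-ins, and only the name `afterStepExecutionsResumed` is taken from the description above. The point is simply that the program stays paused until every step's resume callback has returned.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a step execution; only the resume callback is modelled. */
interface SketchStep {
    void onResume(List<String> log);
}

/** Hypothetical stand-in for a flow execution that is loaded in a paused state. */
final class SketchExecution {
    private final List<SketchStep> steps;
    private boolean paused = true; // remains paused after "deserialization"
    final List<String> log = new ArrayList<>();

    SketchExecution(List<SketchStep> steps) {
        this.steps = steps;
    }

    /** Models the end of an onLoad-style callback: resume every step, then unpause. */
    void onLoad() {
        for (SketchStep step : steps) {
            step.onResume(log);
        }
        afterStepExecutionsResumed();
    }

    /** Called only once all resume callbacks have returned. */
    void afterStepExecutionsResumed() {
        paused = false;
        log.add("program unpaused");
    }

    boolean isPaused() {
        return paused;
    }
}

final class ResumeOrderSketch {
    static List<String> demo() {
        SketchExecution exec = new SketchExecution(List.of(
                log -> log.add("sshagent resumed"),
                log -> log.add("sh resumed")));
        exec.onLoad();
        return exec.log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model no program code can run between the two resume callbacks, which is the property the `sshagent` test needed.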

Seems to offer a nicer way to implement the ExecutorPickle replacement portion of jenkinsci/workflow-durable-task-step-plugin#180, by letting ExecutorStepExecution.onResume wait for an agent to reconnect before resuming program execution. Compared to ExecutorPickle, the advantage is that if there is in fact a timeout, rather than marking the entire program as a failure to load, the node step can throw a proper Groovy-level exception which can be handled gracefully.
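The "block in `onResume`, then fail with a catchable exception" idea can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the plugin's code: `AgentMissingException`, `awaitAgent`, and `resume` are hypothetical names, and a `CompletableFuture` stands in for whatever signals that the agent has reconnected.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical exception type standing in for a properly modelled step failure. */
final class AgentMissingException extends Exception {
    AgentMissingException(String message) {
        super(message);
    }
}

final class BlockingResumeSketch {
    /** Blocks until the (hypothetical) reconnect future completes or the timeout expires. */
    static String awaitAgent(CompletableFuture<String> reconnect, long timeoutMillis)
            throws AgentMissingException {
        try {
            return reconnect.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new AgentMissingException(
                    "agent did not reconnect within " + timeoutMillis + " ms");
        } catch (ExecutionException e) {
            throw new AgentMissingException("reconnect failed: " + e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AgentMissingException("interrupted while waiting for agent");
        }
    }

    /** Models a caller that can catch the failure instead of losing the whole program. */
    static String resume(CompletableFuture<String> reconnect, long timeoutMillis) {
        try {
            return "running on " + awaitAgent(reconnect, timeoutMillis);
        } catch (AgentMissingException e) {
            return "handled: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(resume(CompletableFuture.completedFuture("agent-1"), 50));
        System.out.println(resume(new CompletableFuture<>(), 50));
    }
}
```

The second call in `main` never sees a reconnect, so the timeout path is taken and the failure surfaces as an ordinary exception that the caller handles.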

A matching PR to workflow-cps-plugin is needed to avoid the class of regression seen with sshagent. ~~I would like to run some integration tests in advance.~~ jenkinsci/bom#1122 (comment)


dwnusbaum

Seems fine to me. For what it's worth, I would have preferred to see the old commits cherry-picked or a revert-the-revert commit with your changes applied as a new commit to make it easier to tell what the new changes are, although I guess the followup commits in #198 would have created a bunch of conflicts.

jglick · 2022-05-11T20:36:46Z

> a revert-the-revert commit with your changes applied as a new commit

Agreed, though it would have been tricky, since I had to play with various uncommitted edits before I found something that worked.


dwnusbaum · 2022-05-19T22:25:57Z

src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java

+     * Does so in parallel, but always completing enclosing blocks before the enclosed step.
+     * A simplified version of https://stackoverflow.com/a/67449067/12916, since this should be a tree not a general DAG.
+     */
+    private static final class ParallelResumer {


Is this required for correctness or is it an optimization?

jglick · 2022-05-19

It is an optimization. Previously all steps were resumed serially in a topological order. (Or I hope getCurrentExecutions enforced a topological order; the sorting by CpsThread is a bit opaque to me.) The problem in jenkinsci/workflow-durable-task-step-plugin#180 arises when you have a bunch of parallel branches all running node and all of them are going to fail due to missing agents. Previously, Jenkins would wait 5m for the first agent, report that it was gone, then wait 5m for the next agent, etc. Now all the node blocks are resumed in parallel, followed by the sh steps inside them, etc. I have a test demonstrating this in workflow-durable-task-step but only now got the incremental build from jenkinsci/workflow-cps-plugin#534 that I needed.

dwnusbaum · 2022-06-02T23:02:49Z

Ok, I guess it's kind of unusual for ExecutorStepExecution.onResume to block if no other steps do so, but looking through jenkinsci/workflow-durable-task-step-plugin#180 it seems like you already tried various alternatives, so this seems fine.

jglick · 2022-06-03

Basically this is the replacement for the TryRepeatedly pickle. I played with a lot of alternatives indeed. I am not claiming this approach is ideal but it seems to be straightforward enough and do the job.

Earlier, before hitting on the idea of having onResume block, I had hoped to arrange it so that the program would resume right away, and then a sh step running on the dead agent would eventually figure out there was no hope and abort. I ran into problems in functional tests, though, things like

    node('agent') {
        input 'Proceed?'
        // restart here
        sh 'true'
    }

The program would run past input to the sh step, which would then either fail with a MissingContextVariableException (confusing) or block the CPS VM thread inside DSL (which will throw an error if it blocks >5s). Admittedly this is an artificial case (you should not hold an executor like that) but a lot of tests were failing that I had to work around in artificial ways and it seemed uncomfortable. It would have been possible to fix all these cases but only in pretty intrusive ways: by defining a new StepContext object in lieu of FilePath (perhaps DynamicFilePathContext.Representation) which could be deserialized without pickles, but every step currently expecting FilePath would need to be adjusted to expect this instead, and then the step would need to have complicated logic to wait for the new contextual object to be ready (translatable to a live FilePath) before doing anything useful.

The compromise I settled on is closer to the original pickle behavior in that CPS code and steps do not run until an attempt is made to restore the original context: until all open node blocks have gotten their agent to reconnect, or timed out trying. The difference is only that in case this fails (with a timeout), the error is thrown out of StepExecution.onResume and thus StepContext.onFailure so it becomes a properly modeled Throwable with a CPS VM stack trace that we can catch and handle—unlike the original situation where the entire build would be effectively hard-killed with no possible cleanup.
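The parallel resumption discussed in this thread (sibling branches resumed concurrently, but an enclosing block always finished before the steps it encloses) can be sketched for a tree as follows. This is not the `ParallelResumer` implementation: `StepNode` and `ParallelTreeResumeSketch` are hypothetical, and recording a name stands in for calling `onResume`. The sketch also assumes the callback never throws; a real version would need error handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical tree node: a block step with the steps it encloses as children. */
final class StepNode {
    final String name;
    final List<StepNode> children = new ArrayList<>();

    StepNode(String name, StepNode... kids) {
        this.name = name;
        children.addAll(List.of(kids));
    }

    int size() {
        int n = 1;
        for (StepNode child : children) {
            n += child.size();
        }
        return n;
    }
}

final class ParallelTreeResumeSketch {
    /** Resumes every node; a node's children are submitted only after the node itself is done. */
    static List<String> resumeAll(StepNode root, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        ConcurrentLinkedQueue<String> order = new ConcurrentLinkedQueue<>();
        CountDownLatch done = new CountDownLatch(root.size());
        submit(root, pool, order, done);
        boolean finished = done.await(10, TimeUnit.SECONDS);
        pool.shutdown();
        if (!finished) {
            throw new IllegalStateException("timed out waiting for resumption");
        }
        return new ArrayList<>(order);
    }

    private static void submit(StepNode node, ExecutorService pool,
                               ConcurrentLinkedQueue<String> order, CountDownLatch done) {
        pool.execute(() -> {
            order.add(node.name); // stands in for onResume on this step
            for (StepNode child : node.children) {
                submit(child, pool, order, done); // siblings may now run concurrently
            }
            done.countDown();
        });
    }

    static StepNode demoTree() {
        return new StepNode("parallel",
                new StepNode("node A", new StepNode("sh A")),
                new StepNode("node B", new StepNode("sh B")));
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(resumeAll(demoTree(), 4));
    }
}
```

With this shape, two `node` blocks that each wait for a missing agent wait at the same time rather than one after the other, while each `sh` step is still resumed only after its enclosing `node`. The relative order of the two branches in the output is not deterministic.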

src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java

…r882853693 Example from `ExecutorStepDynamicContextTest.parallelNodeDisappearance`:

```
"Computer.threadPoolForRemoting [#3]" #89 daemon prio=5 os_prio=0 tid=0x00007f30d0c6a800 nid=0x6d0f9 in Object.wait() [0x00007f31047fe000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:460)
	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:97)
	- locked <0x00000000f83dd1e8> (a hudson.remoting.AsyncFutureImpl)
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext.resume(ExecutorStepDynamicContext.java:108)
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.onResume(ExecutorStepExecution.java:201)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ParallelResumer.lambda$run$5(FlowExecutionList.java:369)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ParallelResumer$$Lambda$350/265274739.run(Unknown Source)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
	at …
```

jglick · 2022-06-02T20:37:40Z

I believe this is ready to release if jenkinsci/workflow-cps-plugin#534 can be released at the same time (to avoid regressions in some corner cases). jenkinsci/workflow-durable-task-step-plugin#226 can then also be released (but there is no rush in doing so).



[JENKINS-67164] Call StepExecution.onResume in response to `WorkflowRun.onLoad` not `FlowExecutionList.ItemListenerImpl`

8fba087

jglick requested a review from dwnusbaum May 11, 2022 19:35

jglick added the bug label May 11, 2022

jglick mentioned this pull request May 11, 2022

Recover gracefully when a PlaceholderTask is in the queue but the associated build is complete (II) jenkinsci/workflow-durable-task-step-plugin#226

Merged

jglick requested a review from car-roll May 11, 2022 19:50

dwnusbaum approved these changes May 11, 2022


This was referenced May 11, 2022

[JENKINS-49707] Introduce ErrorCondition #217

Merged

[JENKINS-67164] Pretest changes related to StepExecution.onResume jenkinsci/bom#1122

Closed

jglick added breaking and removed bug labels May 12, 2022

car-roll approved these changes May 18, 2022


jglick added 2 commits May 19, 2022 17:22

Call StepExecution.onResume in parallel to the extent possible

b1778a9

Merge branch 'master' of https://github.com/jenkinsci/workflow-api-plugin into FlowExecutionList-JENKINS-67164

a643c5f

dwnusbaum reviewed May 19, 2022


jglick commented May 26, 2022


src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java (outdated, resolved)

jglick marked this pull request as ready for review June 2, 2022 20:35

dwnusbaum approved these changes Jun 2, 2022


Typo in comment

e10ac66

Co-authored-by: Devin Nusbaum <[email protected]>

jglick merged commit a1e4906 into jenkinsci:master Jun 3, 2022

jglick deleted the FlowExecutionList-JENKINS-67164 branch June 3, 2022 14:18

jglick added a commit to jglick/workflow-cps-plugin that referenced this pull request Jun 3, 2022

jenkinsci/workflow-api-plugin#221 released

3708876

jglick added a commit to jglick/workflow-durable-task-step-plugin that referenced this pull request Jun 3, 2022

jenkinsci/workflow-api-plugin#221 released

8bf158c

This was referenced Jun 3, 2022

[JENKINS-67164] Plugin updates related to FlowExecutionList changes jenkinsci/bom#1187

Merged

Require 2.332.x #226

Merged

jglick mentioned this pull request Jun 3, 2022

Block workflow-api at 1162.va_1e49062a_00e jenkins-infra/update-center2#599

Merged

jglick mentioned this pull request Oct 10, 2022

FlowExecutionList.ParallelResumer should wait until Jenkins startup is complete #256

Merged

dwnusbaum mentioned this pull request Aug 6, 2024

Prevent StepExecutionIterator from leaking memory in cases where a single processed execution has a stuck CPS VM thread #347

Merged


jglick mentioned this pull request Aug 9, 2024

Avoid infinite loops due to corrupted flow graphs in some cases and improve resumption error handling #349

Merged


jglick mentioned this pull request Jan 6, 2025

Resume Pipeline builds asynchronously #368

Draft
