Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-67164] Call StepExecution.onResume in response to WorkflowRun.onLoad not FlowExecutionList.ItemListenerImpl #221

Merged
merged 5 commits into from
Jun 3, 2022

Conversation

jglick
Copy link
Member

@jglick jglick commented May 11, 2022

JENKINS-67164: Reapplies #178 and its amendment #188, reverting the reversion #198.

Fixes the problem identified in SSHAgentStepWorkflowTest.sshAgentAvailableAfterRestart whereby a sh step began running while an sshagent step was still in the middle of an onResume, by introducing a FlowExecution.afterStepExecutionsResumed hook which CpsFlowExecution will implement to unpause the program near the end of WorkflowRun.onLoad, rather than unpausing immediately upon deserializing program.dat.

Seems to offer a nicer way to implement the ExecutorPickle replacement portion of jenkinsci/workflow-durable-task-step-plugin#180, by letting ExecutorStepExecution.onResume wait for an agent to reconnect before resuming progran execution. Compared to ExecutorPickle the advantage is that if there is in fact a timeout, rather than marking the entire program as a failure to load, the node step can throw a proper Groovy-level exception which can be handled gracefully.

A matching PR to workflow-cps-plugin is needed to avoid the class of regression seen with sshagent. I would like to run some integration tests in advance. jenkinsci/bom#1122 (comment)

…wRun.onLoad` not `FlowExecutionList.ItemListenerImpl`
Copy link
Member

@dwnusbaum dwnusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me. For what it's worth, I would have preferred to see the old commits cherry-picked or a revert-the-revert commit with your changes applied as a new commit to make it easier to tell what the new changes are, although I guess the followup commits in #198 would have created a bunch of conflicts.

@jglick
Copy link
Member Author

jglick commented May 11, 2022

a revert-the-revert commit with your changes applied as a new commit

Agreed, though it would have been tricky, since I had to play with various uncommitted edits before I found something that worked.

* Does so in parallel, but always completing enclosing blocks before the enclosed step.
* A simplified version of https://stackoverflow.com/a/67449067/12916, since this should be a tree not a general DAG.
*/
private static final class ParallelResumer {
Copy link
Member

@dwnusbaum dwnusbaum May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required for correctness or is it an optimization?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an optimization. Previously all steps were resumed serially in a topological order. (Or I hope getCurrentExecutions enforced a topological order; the sorting by CpsThread is a bit opaque to me.) The problem in jenkinsci/workflow-durable-task-step-plugin#180 arises when you have a bunch of parallel branches all running node and all of them are going to fail due to missing agents. Previously, Jenkins would wait 5m for the first agent, report that it was gone, then wait 5m for the next agent, etc. Now all the node blocks are resumed in parallel, followed by the sh steps inside them, etc. Have a test demonstrating this in workflow-durable-task-step but only now got an incremental build from jenkinsci/workflow-cps-plugin#534 that I needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I guess it's kind of unusual for ExecutorStepExecution.onResume to block if no other steps do so, but looking through jenkinsci/workflow-durable-task-step-plugin#180 it seems like you already tried various alternatives, so this seems fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically this is the replacement for the TryRepeatedly pickle. I played with a lot of alternatives indeed. I am not claiming this approach is ideal but it seems to be straightforward enough and do the job.

Earlier, before hitting on the idea of having onResume block, I had hoped to arrange it so that the program would resume right away, and then a sh step running on the dead agent would eventually figure out there was no hope and abort. I ran into problems in functional tests, though, things like

node('agent') {
  input 'Proceed?' // restart here
  sh 'true'
}

The program would run past input to the sh step, which would then either fail with a MissingContextVariableException (confusing) or block the CPS VM thread inside DSL (which will throw an error if it blocks >5s). Admittedly this is an artificial case (you should not hold an executor like that) but a lot of tests were failing that I had to work around in artificial ways and it seemed uncomfortable. It would have been possible to fix all these cases but only in pretty intrusive ways: by defining a new StepContext object in lieu of FilePath (perhaps DynamicFilePathContext.Representation) which could be deserialized without pickles, but every step currently expecting FilePath would need to be adjusted to expect this instead, and then the step would need to have complicated logic to wait for the new contextual object to be ready (translatable to a live FilePath) before doing anything useful.

The compromise I settled on is closer to the original pickle behavior in that CPS code and steps do not run until an attempt is made to restore the original context: until all open node blocks have gotten their agent to reconnect, or timed out trying. The difference is only that in case this fails (with a timeout), the error is thrown out of StepExecution.onResume and thus StepContext.onFailure so it becomes a properly modeled Throwable with a CPS VM stack trace that we can catch and handle—unlike the original situation where the entire build would be effectively hard-killed with no possible cleanup.

…r882853693

Example from `ExecutorStepDynamicContextTest.parallelNodeDisappearance`:

```
"Computer.threadPoolForRemoting [#3]" jenkinsci#89 daemon prio=5 os_prio=0 tid=0x00007f30d0c6a800 nid=0x6d0f9 in Object.wait() [0x00007f31047fe000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:460)
	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:97)
	- locked <0x00000000f83dd1e8> (a hudson.remoting.AsyncFutureImpl)
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext.resume(ExecutorStepDynamicContext.java:108)
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.onResume(ExecutorStepExecution.java:201)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ParallelResumer.lambda$run$5(FlowExecutionList.java:369)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ParallelResumer$$Lambda$350/265274739.run(Unknown Source)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
	at …
```
@jglick jglick marked this pull request as ready for review June 2, 2022 20:35
@jglick
Copy link
Member Author

jglick commented Jun 2, 2022

I believe this is ready to release if jenkinsci/workflow-cps-plugin#534 can be released at the same time (to avoid regressions in some corner cases). jenkinsci/workflow-durable-task-step-plugin#226 can then also be released (but there is no rush in doing so).

* Does so in parallel, but always completing enclosing blocks before the enclosed step.
* A simplified version of https://stackoverflow.com/a/67449067/12916, since this should be a tree not a general DAG.
*/
private static final class ParallelResumer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I guess it's kind of unusual for ExecutorStepExecution.onResume to block if no other steps do so, but looking through jenkinsci/workflow-durable-task-step-plugin#180 it seems like you already tried various alternatives, so this seems fine.

Co-authored-by: Devin Nusbaum <[email protected]>
@jglick jglick merged commit a1e4906 into jenkinsci:master Jun 3, 2022
@jglick jglick deleted the FlowExecutionList-JENKINS-67164 branch June 3, 2022 14:18
jglick added a commit to jglick/workflow-cps-plugin that referenced this pull request Jun 3, 2022
jglick added a commit to jglick/workflow-durable-task-step-plugin that referenced this pull request Jun 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants