[JENKINS-67164] Call StepExecution.onResume in response to WorkflowRun.onLoad, not FlowExecutionList.ItemListenerImpl #221
Conversation
…wRun.onLoad` not `FlowExecutionList.ItemListenerImpl`
Seems fine to me. For what it's worth, I would have preferred to see the old commits cherry-picked or a revert-the-revert commit with your changes applied as a new commit to make it easier to tell what the new changes are, although I guess the followup commits in #198 would have created a bunch of conflicts.
Agreed, though it would have been tricky, since I had to play with various uncommitted edits before I found something that worked.
…ugin into FlowExecutionList-JENKINS-67164
```java
 * Does so in parallel, but always completing enclosing blocks before the enclosed step.
 * A simplified version of https://stackoverflow.com/a/67449067/12916, since this should be a tree not a general DAG.
 */
private static final class ParallelResumer {
```
Is this required for correctness or is it an optimization?
It is an optimization. Previously all steps were resumed serially in a topological order. (Or I hope `getCurrentExecutions` enforced a topological order; the sorting by `CpsThread` is a bit opaque to me.) The problem in jenkinsci/workflow-durable-task-step-plugin#180 arises when you have a bunch of parallel branches all running `node` and all of them are going to fail due to missing agents. Previously, Jenkins would wait 5m for the first agent, report that it was gone, then wait 5m for the next agent, etc. Now all the `node` blocks are resumed in parallel, followed by the `sh` steps inside them, etc. I have a test demonstrating this in `workflow-durable-task-step`, but only now got an incremental build from jenkinsci/workflow-cps-plugin#534 that I needed.
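To make that ordering concrete, here is a minimal, self-contained sketch (not the actual `FlowExecutionList.ParallelResumer` code; `TreeResumerSketch` and `Node` are invented for illustration) of resuming a tree of step executions in parallel while still finishing each enclosing block's `onResume` before touching the steps nested inside it:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class TreeResumerSketch {

    /** Hypothetical node: one step execution plus the steps it directly encloses. */
    static final class Node {
        final Runnable onResume;   // stands in for StepExecution.onResume
        final List<Node> children; // enclosed steps (a tree, not a general DAG)
        Node(Runnable onResume, List<Node> children) {
            this.onResume = onResume;
            this.children = children;
        }
    }

    /** Resume this node, then fan out to its children in parallel once it has finished. */
    static CompletableFuture<Void> resume(Node node, ExecutorService pool) {
        return CompletableFuture.runAsync(node.onResume, pool)
            .thenCompose(ignored -> CompletableFuture.allOf(
                node.children.stream()
                    .map(child -> resume(child, pool))
                    .toArray(CompletableFuture[]::new)));
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        // Two parallel `node` blocks, each enclosing one `sh` step.
        Node root = new Node(() -> {}, List.of(
            new Node(() -> System.out.println("resume node #1"),
                List.of(new Node(() -> System.out.println("resume sh #1"), List.of()))),
            new Node(() -> System.out.println("resume node #2"),
                List.of(new Node(() -> System.out.println("resume sh #2"), List.of())))));
        resume(root, pool).join();
        pool.shutdown();
    }
}
```

With a layout like the one in `main`, the two `node` blocks resume concurrently, and each `sh` step waits only for its own enclosing block rather than for every other branch in turn.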
Ok, I guess it's kind of unusual for `ExecutorStepExecution.onResume` to block if no other steps do so, but looking through jenkinsci/workflow-durable-task-step-plugin#180 it seems like you already tried various alternatives, so this seems fine.
Basically this is the replacement for the `TryRepeatedly` pickle. I played with a lot of alternatives indeed. I am not claiming this approach is ideal, but it seems to be straightforward enough and does the job.

Earlier, before hitting on the idea of having `onResume` block, I had hoped to arrange it so that the program would resume right away, and then a `sh` step running on the dead agent would eventually figure out there was no hope and abort. I ran into problems in functional tests, though, with things like:
```groovy
node('agent') {
  input 'Proceed?' // restart here
  sh 'true'
}
```
The program would run past `input` to the `sh` step, which would then either fail with a `MissingContextVariableException` (confusing) or block the CPS VM thread inside `DSL` (which will throw an error if it blocks >5s). Admittedly this is an artificial case (you should not hold an executor like that), but a lot of tests were failing that I had to work around in artificial ways, and it seemed uncomfortable. It would have been possible to fix all these cases, but only in pretty intrusive ways: by defining a new `StepContext` object in lieu of `FilePath` (perhaps `DynamicFilePathContext.Representation`) which could be deserialized without pickles; but every step currently expecting `FilePath` would need to be adjusted to expect this instead, and then the step would need complicated logic to wait for the new contextual object to be ready (translatable to a live `FilePath`) before doing anything useful.
The compromise I settled on is closer to the original pickle behavior, in that CPS code and steps do not run until an attempt is made to restore the original context: until all open `node` blocks have gotten their agent to reconnect, or timed out trying. The difference is only that in case this fails (with a timeout), the error is thrown out of `StepExecution.onResume` and thus `StepContext.onFailure`, so it becomes a properly modeled `Throwable` with a CPS VM stack trace that we can catch and handle, unlike the original situation where the entire build would be effectively hard-killed with no possible cleanup.
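For illustration only, here is a hedged, self-contained sketch of that compromise using stand-in types (`SketchAgent` and `SketchStepContext` are invented here, not the real Jenkins `StepExecution`/`StepContext` API): the resume logic blocks until the agent is back or a timeout expires, and on timeout the failure is reported through the context, so the Pipeline program sees an ordinary catchable exception instead of the whole build being hard-killed:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ResumeCompromiseSketch {

    /** Stand-in for "the agent this node block was using"; not a real Jenkins type. */
    interface SketchAgent {
        boolean awaitReconnect(long timeout, TimeUnit unit) throws InterruptedException;
    }

    /** Stand-in for the step's context; loosely analogous in spirit to StepContext. */
    interface SketchStepContext {
        void onResumed();            // the step (and the code it encloses) may continue
        void onFailure(Throwable t); // surfaces as a catchable error in the Pipeline program
    }

    /** Block during resume, then either continue or fail with a modeled Throwable. */
    static void onResume(SketchAgent agent, SketchStepContext context) {
        try {
            if (agent.awaitReconnect(5, TimeUnit.MINUTES)) {
                context.onResumed();
            } else {
                context.onFailure(new TimeoutException("agent did not reconnect within 5m"));
            }
        } catch (InterruptedException x) {
            Thread.currentThread().interrupt();
            context.onFailure(x);
        }
    }
}
```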
…r882853693

Example from `ExecutorStepDynamicContextTest.parallelNodeDisappearance`:

```
"Computer.threadPoolForRemoting [#3]" jenkinsci#89 daemon prio=5 os_prio=0 tid=0x00007f30d0c6a800 nid=0x6d0f9 in Object.wait() [0x00007f31047fe000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:460)
    at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:97)
    - locked <0x00000000f83dd1e8> (a hudson.remoting.AsyncFutureImpl)
    at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext.resume(ExecutorStepDynamicContext.java:108)
    at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.onResume(ExecutorStepExecution.java:201)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ParallelResumer.lambda$run$5(FlowExecutionList.java:369)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ParallelResumer$$Lambda$350/265274739.run(Unknown Source)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
    at …
```
I believe this is ready to release if jenkinsci/workflow-cps-plugin#534 can be released at the same time (to avoid regressions in some corner cases). jenkinsci/workflow-durable-task-step-plugin#226 can then also be released (but there is no rush in doing so).
Co-authored-by: Devin Nusbaum <[email protected]>
JENKINS-67164: Reapplies #178 and its amendment #188, reverting the reversion #198.

Fixes the problem identified in `SSHAgentStepWorkflowTest.sshAgentAvailableAfterRestart`, whereby a `sh` step began running while an `sshagent` step was still in the middle of an `onResume`, by introducing a `FlowExecution.afterStepExecutionsResumed` hook which `CpsFlowExecution` will implement to unpause the program near the end of `WorkflowRun.onLoad`, rather than unpausing immediately upon deserializing `program.dat`.

Seems to offer a nicer way to implement the `ExecutorPickle` replacement portion of jenkinsci/workflow-durable-task-step-plugin#180, by letting `ExecutorStepExecution.onResume` wait for an agent to reconnect before resuming program execution. Compared to `ExecutorPickle`, the advantage is that if there is in fact a timeout, rather than marking the entire program as a failure to load, the `node` step can throw a proper Groovy-level exception which can be handled gracefully.

A matching PR to `workflow-cps-plugin` is needed to avoid the class of regression seen with `sshagent`. I would like to run some integration tests in advance: jenkinsci/bom#1122 (comment)