
[JENKINS-67351] Avoid deadlock when resuming Pipelines in some cases #188

Merged · 6 commits · Dec 17, 2021

Conversation

@dwnusbaum (Member) commented Dec 15, 2021:

See JENKINS-67351, #178, and jenkinsci/workflow-durable-task-step-plugin#185.

The RunMap lock is always held when WorkflowRun.onLoad calls FlowExecutionListener.fireResumed, so any step that needs to acquire another lock in StepExecution.onResume (e.g. ExecutorStepExecution.onResume must acquire the queue lock to schedule a build) may deadlock with other code that also attempts to acquire the RunMap lock.

To prevent these issues, we use a new ExecutorService so that the RunMap lock is not held when resuming step executions.

This PR also introduces a new API that will be used to update jenkinsci/workflow-durable-task-step-plugin#185 (I will mark this as ready for review once I have a PR up for that) to prevent Pipelines from being resumed before FlowExecutionList$ItemListenerImpl.onLoaded has even had a chance to execute, which can lead to similar issues.
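The shape of the fix can be sketched outside Jenkins with plain JDK locks. In this sketch, `runMapLock` and `queueLock` stand in for the RunMap lock and the Queue lock discussed above; `ResumeOffload` and both methods are illustrative names, not the plugin's actual code. The point is that handing the resume callback to an executor lets the first lock be released before the callback tries to take the second one:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ResumeOffload {
    // Stand-ins for the RunMap lock and the Queue lock in the discussion above.
    static final ReentrantLock runMapLock = new ReentrantLock();
    static final ReentrantLock queueLock = new ReentrantLock();
    static final ExecutorService resumeExecutor = Executors.newCachedThreadPool();

    /** Deadlock-prone shape: runs the resume callback while runMapLock is still held. */
    static void resumeInline(Runnable onResume) {
        runMapLock.lock();
        try {
            onResume.run(); // may block on queueLock while this thread holds runMapLock
        } finally {
            runMapLock.unlock();
        }
    }

    /** Fixed shape: hand the callback to an executor so runMapLock is released first. */
    static void resumeOffloaded(Runnable onResume) {
        runMapLock.lock();
        try {
            resumeExecutor.submit(onResume);
        } finally {
            runMapLock.unlock();
        }
    }

    /** Returns true if the offloaded callback acquired queueLock and completed. */
    public static boolean demo() throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        resumeOffloaded(() -> {
            queueLock.lock(); // acquired on a thread that does not hold runMapLock
            try {
                done.countDown();
            } finally {
                queueLock.unlock();
            }
        });
        boolean ok = done.await(5, TimeUnit.SECONDS);
        resumeExecutor.shutdown();
        return ok;
    }
}
```

In the real change the callback is StepExecution.onResume and the executor replaces the direct call made while WorkflowRun.onLoad holds the RunMap lock.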


@dwnusbaum dwnusbaum added the bug label Dec 15, 2021
/**
* Returns true if all executions that were present in this {@link FlowExecutionList} have been loaded and resumed.
*
* This takes place slightly after {@link InitMilestone#COMPLETED} is reached during Jenkins startup.
@dwnusbaum (Member Author) commented:

See https://github.com/jenkinsci/jenkins/blob/960925fc5055208d2b64a0388161ddcbcd5916aa/core/src/main/java/jenkins/model/Jenkins.java#L987-L1033. Perhaps we could instead add an API to core to indicate that Jenkins.<init> is truly complete, although that would not help us right at this moment, or maybe we could check for STARTUP_MARKER_FILE instead.

@dwnusbaum (Member Author) commented:

Another idea from @jglick is to refactor FlowExecutionList$ItemListenerImpl.onLoaded to instead be an @Initializer method so that we can then just use InitMilestone.COMPLETED.

@dwnusbaum (Member Author) commented:

See #188 (comment) and later comments. For now we plan to leave this method in place but mark it as @Restricted(Beta.class).

A Contributor commented:

I like the idea of using STARTUP_MARKER_FILE since it's already there, but this way seems more explicit without having to dig into core

@jglick (Member) left a comment:

TBD if we actually need the new API. Otherwise looks good.

}
}

// Using an unbounded thread pool to avoid potential deadlock as in JENKINS-25890. TODO: Unclear if this is actually necessary.
A Member commented:

Timer is probably good enough? Not sure.

@dwnusbaum (Member Author) commented:

Yeah, maybe I'll switch back to Timer for now, and then if there is a related deadlock we can introduce an ExecutorService. In this case we are at least guaranteed to release the RunMap lock as long as the callback does not use the current thread, so even if Timer is saturated and the task to resume the step executions is not scheduled quickly, I don't think anything else would be blocked waiting for it.

@dwnusbaum (Member Author) commented:

TBD if we actually need the new API. Otherwise looks good.

Yeah, I am currently testing a switch from ItemListener to @Initializer to avoid the new API. I think that would be cleaner, but it is TBD whether it breaks anything or causes other issues.

@dwnusbaum (Member Author) commented:

I think that would be cleaner but TBD on whether it breaks anything or causes other issues.

So far it causes ExecutorPickleTest.canceledQueueItem to consistently fail here because the queue is empty, but I'm not quite sure why. I'll investigate tomorrow.

@jglick (Member) commented Dec 15, 2021:

That test is edited in jenkinsci/workflow-durable-task-step-plugin#180 FYI.

@dwnusbaum (Member Author) commented:

The issue with that test seems to be that having Pipelines resume before this code runs changes the behavior slightly. Compare the log output of the tests; here is the current behavior:

   ...
   0.300 [p #1] Resuming build at Thu Dec 16 09:27:25 EST 2021 after Jenkins restart
Unblocking wait/1 as success
   0.341 [id=95]	WARNING	o.j.p.w.s.p.ExecutorPickleTest#lambda$canceledQueueItem$1: FROM TEST CODE: About to check queue length
   0.352 [p #1] Waiting to resume part of p #1: Finished waiting
   1.375 [id=58]	INFO	o.j.p.workflow.job.WorkflowRun#finish: p #1 completed: ABORTED
   1.406 [p #1] [Pipeline] End of Pipeline
   1.407 [p #1] Queue task was cancelled
   1.410 [p #1] Finished: ABORTED

Here is the behavior when @Initializer is used:

   ...
   0.413 [p #1] Resuming build at Thu Dec 16 09:29:53 EST 2021 after Jenkins restart
   0.414 [p #1] Waiting to resume part of p #1
Unblocking wait/1 as success
   0.430 [id=98]	WARNING	o.j.p.w.s.p.ExecutorPickleTest#lambda$canceledQueueItem$1: FROM TEST CODE: About to check queue length
   0.433 [id=98]	INFO	jenkins.model.Jenkins#cleanUp: Stopping Jenkins
   ... Assertion already failed, Pipeline resumes once Jenkins is already shutting down! ...
   2.382 [p #1] Waiting to resume part of p #1: ‘Jenkins’ doesn’t have label ‘slave0’
   5.412 [p #1] Waiting to resume part of p #1: ‘Jenkins’ doesn’t have label ‘slave0’
   8.437 [p #1] Waiting to resume part of p #1: ‘Jenkins’ doesn’t have label ‘slave0’
  13.443 [p #1] Waiting to resume part of p #1: ‘Jenkins’ doesn’t have label ‘slave0’
  19.445 [p #1] Waiting to resume part of p #1: ‘Jenkins’ doesn’t have label ‘slave0’
  26.450 [p #1] Killed part of p #1 after waiting for 25 sec because we assume unknown agent slave0 is never going to appear
  26.507 [p #1] [Pipeline] End of Pipeline
  26.515 [id=83]	INFO	o.j.p.workflow.job.WorkflowRun#finish: p #1 completed: ABORTED
  26.561 [p #1] Agent was removed
  26.561 [p #1] Finished: ABORTED

Easy enough to fix that test, but I will check workflow-job and workflow-cps as well.

@jglick (Member) commented Dec 16, 2021:

Not sure I follow. Do you mean that JenkinsSessionRule is running its then block too early?

@dwnusbaum (Member Author) commented:

Oh, actually it looks like the problem is that when using @Initializer, this condition from jenkinsci/workflow-durable-task-step-plugin#184 is always true the first time that tryResolve in ExecutorPickle.rehydrate runs. The first "Waiting to resume" message is therefore sent to the build log before the placeholder task is scheduled, so the queue is empty. Adding a sleep to ExecutorPickleTest, or waiting for a more precise message before checking the queue, fixes the test.

Tests in workflow-job and workflow-cps pass using @Initializer to resume Pipelines, so it seems OK, but I am a little concerned that it may cause issues for controllers that have a lot of Pipelines to resume after a restart, since anything waiting for InitMilestone.COMPLETED will be blocked until all Pipeline runs and their executions have been loaded from disk. Any thoughts on that? Maybe we try that for now to avoid a new API, and then if there are issues we add the new API and switch back to using an ItemListener?

@jglick (Member) commented Dec 16, 2021:

anything waiting for InitMilestone.COMPLETED will be blocked until all Pipeline runs and their executions have been loaded from disk

Hmm, true, that sounds dangerous. I tried to think of a cleaner system than the API currently proposed here, but have not come up with anything. Maybe at least mark it @Beta for now, since there is only one known caller and we may revisit the design?

@dwnusbaum (Member Author) commented:

Maybe at least mark it @Beta for now, since there is only one known caller and we may revisit the design?

Sure, that sounds good to me.

@dwnusbaum dwnusbaum marked this pull request as ready for review December 16, 2021 19:16
@car-roll car-roll merged commit 6d6de20 into jenkinsci:master Dec 17, 2021
@jglick (Member) commented Dec 17, 2021:

Note: not deployed due to apparent flake https://github.com/jenkinsci/workflow-api-plugin/runs/4555248728

@jglick (Member) commented Dec 17, 2021:

@@ -163,6 +167,18 @@ public static FlowExecutionList get() {
return l;
}

/**
* Returns true if all executions that were present in this {@link FlowExecutionList} have been loaded and resumed.
A Member commented:

Is that last part actually true? Now that onResume can be called asynchronously, this does not seem to be tracking that it completed.

@dwnusbaum (Member Author) commented Jan 14, 2022:

The WorkflowRun and CpsFlowExecution have been loaded, CpsFlowExecution.loadProgramAsync and FlowExecutionListener.onResumed have been called, but StepExecution.onResume may not have been called yet or may still be in progress.

@jglick (Member) commented Jan 14, 2022:

@dwnusbaum I think this needs to be reverted. It causes frequent test failures in SSHAgentStepWorkflowTest.sshAgentAvailableAfterRestart which I believe are legitimate. Using

diff --git src/main/java/com/cloudbees/jenkins/plugins/sshagent/SSHAgentStepExecution.java src/main/java/com/cloudbees/jenkins/plugins/sshagent/SSHAgentStepExecution.java
index d36e18d..41110d7 100644
--- src/main/java/com/cloudbees/jenkins/plugins/sshagent/SSHAgentStepExecution.java
+++ src/main/java/com/cloudbees/jenkins/plugins/sshagent/SSHAgentStepExecution.java
@@ -79,6 +79,7 @@ public class SSHAgentStepExecution extends AbstractStepExecutionImpl implements
     @Override
     public void onResume() {
         super.onResume();
+        listener.getLogger().println("=== onResume starts");
         try {
             purgeSockets();
             initRemoteAgent();
@@ -86,6 +87,7 @@ public class SSHAgentStepExecution extends AbstractStepExecutionImpl implements
             listener.getLogger().println(Messages.SSHAgentBuildWrapper_CouldNotStartAgent());
             x.printStackTrace(listener.getLogger());
         }
+        listener.getLogger().println("=== onResume ends");
     }
 
     static FilePath tempDir(FilePath ws) {

it is possible to see that with this patch,

[Pipeline] sh

is typically printed somewhere in the middle of onResume, whereas previously onResume consistently ran to completion before the sh step started (as you would likely expect).

@dwnusbaum dwnusbaum deleted the JENKINS-67351 branch January 14, 2022 19:06
@dwnusbaum (Member Author) commented:

@jglick Fine (I have no time to investigate), but this cannot be reverted on its own; #178 as well as jenkinsci/workflow-durable-task-step-plugin#188 and jenkinsci/workflow-durable-task-step-plugin#185 would all need to be reverted at the same time.

@jglick (Member) commented Jan 14, 2022:

Right, that is what I meant.

@dwnusbaum (Member Author) commented:

Perhaps adding some kind of API that CpsFlowExecution.loadProgramAsync could check to ensure that StepExecution.onResume has completed for all of its executions (or for all executions globally) before calling CpsThreadGroup.scheduleRun would prevent steps from running too early while still allowing StepExecution.onResume to run asynchronously. Another approach might be to get rid of ResumeStepExecutionListener and just have CpsFlowExecution.loadProgramAsync call StepExecution.onResume on its own steps before resuming the program.
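The first proposal could look roughly like the following JDK-only sketch. Everything here is hypothetical naming (`ResumeGate`, `resumeAll`, the `Runnable` stand-ins); the real code would need to coordinate CpsFlowExecution.loadProgramAsync with StepExecution.onResume, but the gating idea is the same: collect a future per onResume call and only schedule the program once all of them complete.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ResumeGate {
    /**
     * Runs every onResume callback asynchronously and returns a future that
     * completes only once all of them have finished.
     */
    static CompletableFuture<Void> resumeAll(List<Runnable> onResumes, ExecutorService pool) {
        CompletableFuture<?>[] futures = onResumes.stream()
                .map(r -> CompletableFuture.runAsync(r, pool))
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(futures);
    }

    /** Demonstrates that the "scheduleRun" stand-in only fires after onResume finishes. */
    public static String demo() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        StringBuffer log = new StringBuffer();
        resumeAll(List.of(() -> log.append("onResume;")), pool)
                .thenRun(() -> log.append("scheduleRun")) // program resumes only afterwards
                .join();
        pool.shutdown();
        return log.toString();
    }
}
```

This keeps onResume asynchronous (off the RunMap-lock-holding thread) while still guaranteeing the program does not start stepping before resumption is complete.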

@jglick (Member) commented Jan 14, 2022:

Either of those proposals sounds reasonable.

@dwnusbaum (Member Author) commented:

Well, I got sidetracked and looked into this a bit.

Another approach might be to get rid of ResumeStepExecutionListener and just have CpsFlowExecution.loadProgramAsync call StepExecution.onResume on its own steps before resuming the program.

This does not seem to work on its own, because at least in tests, CpsFlowExecution.loadProgramAsync runs on the same thread that calls WorkflowRun.onLoad and CpsFlowExecution.onLoad (so it is not actually async), and calling StepExecution.onResume inside of loadProgramAsync then loads the execution twice, which fails. (See the stack trace below.)

From a bit more testing, passing an Executor argument backed by a thread pool to the Futures.addCallback call in loadProgramAsync does seem to make things work, but that also seems like a significant change all on its own that might have other unanticipated effects.

java.lang.NullPointerException
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.initializeStorage(CpsFlowExecution.java:681)
// Second load of CpsFlowExecution
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:726)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:691)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:963)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getExecution(CpsStepContext.java:218)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getNode(CpsStepContext.java:303)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.scheduleNextRun(CpsStepContext.java:375)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.completed(CpsStepContext.java:327)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.onFailure(CpsStepContext.java:315)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionListTest$NonResumableStep$ExecutionImpl.onResume(FlowExecutionListTest.java:184)
// New code to call StepExecution.onResume here.
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$2.onSuccess(CpsFlowExecution.java:803)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$2.onSuccess(CpsFlowExecution.java:786)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:153)
// SameThreadExecutorService seems to be used because `CpsFlowExecution.loadProgramAsync` uses `Futures.addCallback(ListenableFuture, FutureCallback)` rather than `Futures.addCallback(ListenableFuture, FutureCallback, Executor)`.
	at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
	at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
	at com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:105)
	at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:155)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:163)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:93)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.loadProgramAsync(CpsFlowExecution.java:783)
// First load of CpsFlowExecution
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:750)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:691)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:550)
	at hudson.model.RunMap.retrieve(RunMap.java:225)
	at hudson.model.RunMap.retrieve(RunMap.java:57)
	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:501)
	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:483)
	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:381)
	at jenkins.model.lazy.LazyBuildMixIn.getBuildByNumber(LazyBuildMixIn.java:228)
	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:232)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionListTest.lambda$resumeStepExecutions$5(FlowExecutionListTest.java:156)
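The same-thread pitfall in the stack trace can be reproduced with plain java.util.concurrent, no Guava needed. `CallbackThreads` is an illustrative name; the analogy is that a callback attached to an already-completed future without an explicit executor runs synchronously on the attaching thread, just like Guava's two-argument Futures.addCallback with its same-thread executor, whereas supplying an executor hands the callback off to another thread:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CallbackThreads {
    /**
     * Returns { ranOnCallerThread (no executor), ranOnCallerThread (with executor) }.
     */
    public static boolean[] demo() throws Exception {
        Thread caller = Thread.currentThread();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CompletableFuture<String> done = CompletableFuture.completedFuture("loaded");

        // No executor: the callback on an already-completed future runs
        // synchronously on the thread that attaches it.
        CompletableFuture<Boolean> sameThread =
                done.thenApply(v -> Thread.currentThread() == caller);

        // Explicit executor: the callback is handed off to a pool thread.
        CompletableFuture<Boolean> offloaded =
                done.thenApplyAsync(v -> Thread.currentThread() == caller, pool);

        boolean[] result = { sameThread.join(), offloaded.join() };
        pool.shutdown();
        return result;
    }
}
```

This is why the onResume call above re-entered CpsFlowExecution.onLoad on the very thread that was still inside loadProgramAsync, and why supplying a thread-pool-backed Executor to the callback registration changes the behavior.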

@jglick (Member) commented May 9, 2022:

Perhaps adding some kind of API that CpsFlowExecution.loadProgramAsync could check to ensure that StepExecution.onResume has completed for all of its executions (or for all executions globally) before calling CpsThreadGroup.scheduleRun would prevent steps from running too early while still allowing StepExecution.onResume to run asynchronously.

Perhaps. The following seems to restore reasonable behavior to sshAgentAvailableAfterRestart:

diff --git src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
index b04be34..10eff35 100644
--- src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
+++ src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
@@ -173,10 +173,18 @@ public class FlowExecutionList implements Iterable<FlowExecution> {
      */
     @Extension
     public static class ItemListenerImpl extends ItemListener {
+        void pause(FlowExecution e, boolean b) {
+            try {
+                e.getClass().getMethod("pause", boolean.class).invoke(e, b);
+            } catch (Exception x) {
+                x.printStackTrace();
+            }
+        }
         @Override
         public void onLoaded() {
             FlowExecutionList list = FlowExecutionList.get();
             for (final FlowExecution e : list) {
+                pause(e, true);
                 // The call to FlowExecutionOwner.get in the implementation of iterator() is sufficent to load the Pipeline.
                 LOGGER.log(Level.FINE, "Eagerly loaded {0}", e);
                 Futures.addCallback(e.getCurrentExecutions(false), new FutureCallback<List<StepExecution>>() {
@@ -190,6 +198,7 @@ public class FlowExecutionList implements Iterable<FlowExecution> {
                                 se.getContext().onFailure(x);
                             }
                         }
+                        pause(e, false);
                     }
 
                     @Override
@@ -200,7 +209,7 @@ public class FlowExecutionList implements Iterable<FlowExecution> {
                             LOGGER.log(Level.WARNING, "Failed to load " + e, t);
                         }
                     }
-                }, MoreExecutors.directExecutor());
+                }, Timer.get());
             }
         }
     }

If I could get something like that to work, was there any other obstacle to restoring your fixes (basically reverting #198 & jenkinsci/workflow-durable-task-step-plugin#197)?

@dwnusbaum (Member Author) commented:

If I could get something like that to work, was there any other obstacle to restoring your fixes (basically reverting #198 & jenkinsci/workflow-durable-task-step-plugin#197)?

Not that I am aware of.
