Skip to content

Commit

Permalink
[JENKINS-49707] Agent missing after controller restart to fail resump…
Browse files Browse the repository at this point in the history
…tion of `node` step, not kill whole build (#180)

* Sketch of non-`Pickle`-based resumption of `ExecutorStepExecution`

* Amending test from #34

* Fixing `Callback`

* Exploring where to detect a failure eligible for retry

* Making `normalNodeDisappearance` pass under somewhat altered circumstances

* First successful `node` block retry

* `FilePathDynamicContext` must take precedence over `ExecutorStepDynamicContext.FilePathTranslator`

* `ExecutorStepTest.unloadableExecutorPickle` as currently conceived no longer makes sense

* Fixing `ExecutorStepTest.nodeDisconnectMissingContextVariableException` by copying some diagnostic code from `FilePathDynamicContext`

* Fixing `ExecutorStepTest.contextualizeFreshFilePathAfterAgentReconnection`

* Marking `ExecutorStepRetryEligibility` beta

* Javadoc imports

* SpotBugs: does not make sense to allow `ExecutorStepDynamicContext.node` to be null

* Reasonable behavior of `ExecutorPickleTest.canceledQueueItem`

* User-visible logging about 5m timeout now that `ExecutorPickle.printWaitingMessage` is unused

* `ExecutorPickleTest` → `ExecutorStepDynamicContextTest`

* Move `TIMEOUT_WAITING_FOR_NODE_MILLIS` from `ExecutorPickle` to `ExecutorStepExecution`

* Some user-visible logging when `ExecutorStepRetryEligibility` is _not_ used

* Do not block on `FutureImpl`.
After jenkinsci/workflow-step-api-plugin#73 this caused `SecretsMasker` to block on a `KubernetesComputer`
which in turn blocked provisioning of a `TaskListener`, causing `RestartPipelineTest.terminatedPodAfterRestart` to fail to find a message:

```
FINE	o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: rediscovering that terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed and timeout has expired
FINE	o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: could not get TaskListener in CpsStepContext[9:sh]:Owner[terminated Pod After Restart/1:terminated Pod After Restart #1]
java.util.concurrent.TimeoutException
	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:102)
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:115)
Caused: java.io.IOException
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:118)
	at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:97)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet$1Delegate.doGet(ContextVariableSet.java:98)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:85)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:73)
	at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:95)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.getContextVariable(CpsThread.java:135)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:297)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.getListener(DefaultStepContext.java:127)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:87)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution._listener(DurableTaskStep.java:421)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.listener(DurableTaskStep.java:412)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:363)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:570)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:553)
	at …
FINE	o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener#getLogger: creating filtered stream
FINE	o.j.p.w.s.d.DurableTaskStep$Execution#_listener: terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed for 15 sec, assuming it is not coming back
```

* A `sh` step could fail to find a workspace at the moment it starts.
java.lang.NullPointerException
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:329)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
	at …

* Also handling `ClosedChannelException` as commonly thrown by `SimpleBuildStep`s and the like: jenkinsci/kubernetes-plugin#1083 (comment)

* Also retry after `MissingContextVariableException` on `FilePath`

* Remove `super.onResume` left over in #46; see jenkinsci/workflow-step-api-plugin#66

* Documenting need for JENKINS-30383

* jenkinsci/workflow-step-api-plugin#77 allows `retryNodeBlockSynchAcrossRestarts` to be fixed

* jenkinsci/workflow-step-api-plugin#77 released

* Expanding `MissingContextVariableException` type list after noticing that `isUnix` requests `Launcher` not `FilePath`

* Never log a message with only `PlaceholderTask.cookie`; also need `.runId` for context

* Suppressing uninteresting log message

* Maybe need to call `StepContext.saveState`?

* Deleting comment; for now it seems clearest to keep `ExecutorStepDynamicContext` as a distinct type

* Resolve longstanding tech debt from #104 & #117 about `BodyExecution`.
When we do not use pickles, there is no need for `transient`.
Moving the reference to `ExecutorStepExecution` from `PlaceholderTask` avoids a memory leak.
This is only possible now that `Callback` holds the `ExecutorStepExecution`.
And we can now properly handle `AsynchronousExecution.interrupt`.

* Sketch of updating `WorkspaceStepExecution`: #180 (comment)

* Allowing `FilePathDynamicContext` to block waiting for an agent to reconnect.
Not great since the CPS VM thread will only grant 5s.
Better would be to avoid unnecessary use of `StepContext.get(FilePath.class)`
(which implicitly requires a live `Channel`);
perhaps `FilePathRepresentation` should become an API.
Amends b65ff1c.
#180 (comment)

* Adding a comment about 8b08398

* 8b08398 seems to allow a968adf to be finished.

* Refining 8b08398 to avoid tail call

* `org.jenkinsci.plugins.workflow.support.concurrent.Futures` deprecated

* SpotBugs reminds me that `WorkspaceStepExecution.Callback` needs an SVUID; taking one from trunk

* Adding `NodeTranslator`

* Beginning rewrite to `AgentErrorCondition`

* Extracted `RetryExecutorStepTest` from `ExecutorStepTest`

* Rely on blocking `ExecutorStepExecution.onResume`

* Removing bogus `waitForMessage` (and unnecessary `interrupt`) from `unrestorableAgent`. Sometimes a failing `waitForMessage` seems to trigger https://maven.apache.org/surefire/maven-surefire-plugin/faq.html#corruptedstream though I cannot track down why.

* `PlaceholderTask` may not hold a reference to `ExecutorStepExecution`

* Equivalent of #184 for `ExecutorStepDynamicContext`

* Assert jenkinsci/workflow-api-plugin@b1778a9

* jenkinsci/workflow-cps-plugin#534 & jenkinsci/workflow-job-plugin#260 released

* Windows tests jenkinsci/workflow-basic-steps-plugin#203 (comment)

* Avoiding `powershell` jenkinsci/workflow-basic-steps-plugin#203 (comment)

* Better usage of Hamcrest; got a flake https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=6779178701 with `QueueTaskCancelled` instead of `RemovedNodeCause`

* In fact the flake was in a different test (`canceledQueueItem`)

* Forgot to make `canceledQueueItem` not use `sh`, though it should not really matter since this step is expected to fail anyway

* Test flake: https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=7240329028

* Use `AgentOfflineException` from `ExecutorStepDynamicContext` as well

* Can now run all of `AgentErrorConditionTest`

* Responding to `CancellationException` in `ExecutorStepDynamicContext` with `QueueTaskCancelled`, analogous to handling in `ExecutorPickle`

* Permit `canceledQueueItem` to pass on `RemovedNodeCause`

* Normalizing indentation

* jenkinsci/workflow-step-api-plugin#124 deprecates an overload which 2e95e3a only handles at one call site; TBD if others should be switched to `actualInterruption`

* Refined `ExecutorStepDynamicContext` behavior in case of `quietingDown` #180 (comment) or `terminating` #180 (comment)

* Add an explicit `serialVersionUID` to `ExecutorStepExecution.PlaceholderTask.Callback` #180 (comment)

* Optimizing `withExecution` to only consider steps in the current build #180 (comment)

* Pick up jenkinsci/workflow-api-plugin#256 #180 (comment)

* jenkinsci/workflow-api-plugin#256 released
  • Loading branch information
jglick authored Oct 17, 2022
1 parent e663bd5 commit 8ad5f86
Show file tree
Hide file tree
Showing 15 changed files with 592 additions and 277 deletions.
1 change: 1 addition & 0 deletions pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,7 @@
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-api</artifactId>
<version>1200.v8005c684b_a_c6</version>
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -81,7 +81,6 @@
import org.jenkinsci.plugins.workflow.steps.StepExecution;
import org.jenkinsci.plugins.workflow.support.concurrent.Timeout;
import org.jenkinsci.plugins.workflow.support.concurrent.WithThreadName;
import org.jenkinsci.plugins.workflow.support.pickles.ExecutorPickle;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution;
import org.kohsuke.accmod.Restricted;
import org.kohsuke.accmod.restrictions.DoNotUse;
Expand Down Expand Up @@ -306,6 +305,9 @@ static final class Execution extends AbstractStepExecutionImpl implements Runnab
returnStatus = step.returnStatus;
StepContext context = getContext();
ws = context.get(FilePath.class);
if (ws == null) {
throw new AbortException("No workspace currently accessible");
}
node = FilePathUtils.getNodeName(ws);
DurableTask durableTask = step.task();
if (returnStdout) {
Expand Down Expand Up @@ -356,12 +358,13 @@ static final class Execution extends AbstractStepExecutionImpl implements Runnab
LOGGER.fine(() -> "discovered that " + node + " has been removed");
removedNodeDiscovered = System.nanoTime();
return null;
} else if (System.nanoTime() - removedNodeDiscovered < TimeUnit.MILLISECONDS.toNanos(ExecutorPickle.TIMEOUT_WAITING_FOR_NODE_MILLIS)) {
} else if (System.nanoTime() - removedNodeDiscovered < TimeUnit.MILLISECONDS.toNanos(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS)) {
LOGGER.fine(() -> "rediscovering that " + node + " has been removed");
return null;
} else {
LOGGER.fine(() -> node + " has been removed for a while, assuming it is not coming back");
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause());
LOGGER.fine(() -> "rediscovering that " + node + " has been removed and timeout has expired");
listener().getLogger().println(node + " has been removed for " + Util.getTimeSpanString(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS) + ", assuming it is not coming back");
throw new FlowInterruptedException(Result.ABORTED, /* TODO false probably more appropriate */true, new ExecutorStepExecution.RemovedNodeCause());
}
}
removedNodeDiscovered = 0; // something else; reset
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,12 +31,14 @@
import hudson.model.Computer;
import jenkins.model.Jenkins;
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;

/**
* Reference to {@link Computer}
*
* @author Kohsuke Kawaguchi
* @deprecated Normally now done via {@link ExecutorStepDynamicContext}.
*/
@Deprecated
public class ComputerPickle extends Pickle {
private final String slave;

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,7 @@

import com.google.common.util.concurrent.ListenableFuture;
import edu.umd.cs.findbugs.annotations.NonNull;
import edu.umd.cs.findbugs.annotations.SuppressFBWarnings;
import hudson.Extension;
import hudson.Main;
import hudson.Util;
import hudson.init.InitMilestone;
import hudson.model.Executor;
Expand All @@ -41,35 +39,30 @@
import hudson.model.queue.SubTask;

import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.logging.Level;
import java.util.logging.Logger;

import jenkins.model.Jenkins;
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
import org.jenkinsci.plugins.workflow.pickles.Pickle;
import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException;
import org.jenkinsci.plugins.workflow.steps.durable_task.Messages;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution;
import org.kohsuke.accmod.Restricted;
import org.kohsuke.accmod.restrictions.NoExternalUse;

/**
* Persists an {@link Executor} as the {@link hudson.model.Queue.Task} it was running.
* That task can in turn have some way of producing a display name, a special {@link hudson.model.Queue.Executable} with a custom {@code executorCell.jelly}, and so on.
* When rehydrated, the task is rescheduled, and when it starts executing the owning executor is produced.
* Typically the {@link SubTask#getAssignedLabel} should be a {@link Node#getSelfLabel} so that the rehydrated executor is in fact on the same node.
* @deprecated Normally now done via {@link ExecutorStepDynamicContext}.
*/
@Deprecated
public class ExecutorPickle extends Pickle {

private static final Logger LOGGER = Logger.getLogger(ExecutorPickle.class.getName());

private final Queue.Task task;

@SuppressFBWarnings(value = "MS_SHOULD_BE_FINAL", justification = "deliberately mutable")
@Restricted(NoExternalUse.class)
public static long TIMEOUT_WAITING_FOR_NODE_MILLIS = Main.isUnitTest ? /* fail faster */ TimeUnit.SECONDS.toMillis(15) : Long.getLong(ExecutorPickle.class.getName()+".timeoutForNodeMillis", TimeUnit.MINUTES.toMillis(5));

private ExecutorPickle(Executor executor) {
if (executor instanceof OneOffExecutor) {
throw new IllegalArgumentException("OneOffExecutor not currently supported");
Expand Down Expand Up @@ -105,7 +98,7 @@ protected Executor tryResolve() throws Exception {
throw new IllegalStateException("queue refused " + task);
}
itemID = item.getId();
endTimeNanos = System.nanoTime() + TIMEOUT_WAITING_FOR_NODE_MILLIS*1000000;
endTimeNanos = System.nanoTime() + ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS*1000000;
LOGGER.log(Level.FINE, "{0} scheduled {1}", new Object[] {ExecutorPickle.this, item});
} else {
item = Queue.getInstance().getItem(itemID);
Expand All @@ -129,7 +122,7 @@ protected Executor tryResolve() throws Exception {
if (System.nanoTime() > endTimeNanos) {
Queue.getInstance().cancel(item);
owner.getListener().getLogger().printf("Killed %s after waiting for %s because we assume unknown agent %s is never going to appear%n",
item.task.getDisplayName(), Util.getTimeSpanString(TIMEOUT_WAITING_FOR_NODE_MILLIS), placeholder.getAssignedLabel());
item.task.getDisplayName(), Util.getTimeSpanString(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS), placeholder.getAssignedLabel());
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause());
}
}
Expand Down Expand Up @@ -159,7 +152,7 @@ protected Executor tryResolve() throws Exception {
}
@Override protected void printWaitingMessage(@NonNull TaskListener listener) {
Queue.Item item = Queue.getInstance().getItem(itemID);
String message = Messages.ExecutorPickle_waiting_to_resume(task.getFullDisplayName());
String message = "Waiting to resume " + task.getFullDisplayName();
if (item == null) { // ???
listener.getLogger().println(message);
return;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,10 +31,14 @@
import org.jenkinsci.plugins.workflow.FilePathUtils;
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
import org.jenkinsci.plugins.workflow.pickles.Pickle;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;
import org.jenkinsci.plugins.workflow.support.steps.FilePathDynamicContext;

/**
* @author Kohsuke Kawaguchi
* @deprecated Normally now done via {@link ExecutorStepDynamicContext} or {@link FilePathDynamicContext}.
*/
@Deprecated
public class FilePathPickle extends Pickle {

private final String slave;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,12 @@
import org.jenkinsci.plugins.workflow.FilePathUtils;
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
import org.jenkinsci.plugins.workflow.pickles.Pickle;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;

/**
* @deprecated Normally now done via {@link ExecutorStepDynamicContext}.
*/
@Deprecated
public class WorkspaceListLeasePickle extends Pickle {

// Could perhaps just store the FilePath directly (thus using FilePathPickle implicitly), but we need the Computer anyway for its WorkspaceList:
Expand Down
Loading

0 comments on commit 8ad5f86

Please sign in to comment.