Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[JENKINS-49707] Agent missing after controller restart to fail resump…
…tion of `node` step, not kill whole build (#180) * Sketch of non-`Pickle`-based resumption of `ExecutorStepExecution` * Amending test from #34 * Fixing `Callback` * Exploring where to detect a failure eligible for retry * Making `normalNodeDisappearance` pass under somewhat altered circumstances * First successful `node` block retry * `FilePathDynamicContext` must take precedence over `ExecutorStepDynamicContext.FilePathTranslator` * `ExecutorStepTest.unloadableExecutorPickle` as currently conceived no longer makes sense * Fixing `ExecutorStepTest.nodeDisconnectMissingContextVariableException` by copying some diagnostic code from `FilePathDynamicContext` * Fixing `ExecutorStepTest.contextualizeFreshFilePathAfterAgentReconnection` * Marking `ExecutorStepRetryEligibility` beta * Javadoc imports * SpotBugs: does not make sense to allow `ExecutorStepDynamicContext.node` to be null * Reasonable behavior of `ExecutorPickleTest.canceledQueueItem` * User-visible logging about 5m timeout now that `ExecutorPickle.printWaitingMessage` is unused * `ExecutorPickleTest` → `ExecutorStepDynamicContextTest` * Move `TIMEOUT_WAITING_FOR_NODE_MILLIS` from `ExecutorPickle` to `ExecutorStepExecution` * Some user-visible logging when `ExecutorStepRetryEligibility` is _not_ used * Do not block on `FutureImpl`. After jenkinsci/workflow-step-api-plugin#73 this caused `SecretsMasker` to block on a `KubernetesComputer` which in turn blocked provisioning of a `TaskListener`, causing `RestartPipelineTest.terminatedPodAfterRestart` to fail to find a message: ``` FINE o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: rediscovering that terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed and timeout has expired FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: could not get TaskListener in CpsStepContext[9:sh]:Owner[terminated Pod After Restart/1:terminated Pod After Restart #1] java.util.concurrent.TimeoutException at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:102) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:115) Caused: java.io.IOException at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:118) at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:97) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet$1Delegate.doGet(ContextVariableSet.java:98) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75) at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:85) at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:73) at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:95) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139) at org.jenkinsci.plugins.workflow.cps.CpsThread.getContextVariable(CpsThread.java:135) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:297) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.getListener(DefaultStepContext.java:127) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:87) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution._listener(DurableTaskStep.java:421) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.listener(DurableTaskStep.java:412) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:363) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:570) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:553) at … FINE o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener#getLogger: creating filtered stream FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed for 15 sec, assuming it is not coming back ``` * A `sh` step could fail to find a workspace at the moment it starts. java.lang.NullPointerException at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:329) at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319) at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193) at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122) at … * Also handling `ClosedChannelException` as commonly thrown by `SimpleBuildStep`s and the like: jenkinsci/kubernetes-plugin#1083 (comment) * Also retry after `MissingContextVariableException` on `FilePath` * Remove `super.onResume` left over in #46; see jenkinsci/workflow-step-api-plugin#66 * Documenting need for JENKINS-30383 * jenkinsci/workflow-step-api-plugin#77 allows `retryNodeBlockSynchAcrossRestarts` to be fixed * jenkinsci/workflow-step-api-plugin#77 released * Expanding `MissingContextVariableException` type list after noticing that `isUnix` requests `Launcher` not `FilePath` * Never log a message with only `PlaceholderTask.cookie`; also need `.runId` for context * Suppressing uninteresting log message * Maybe need to call `StepContext.saveState`? * Deleting comment; for now it seems clearest to keep `ExecutorStepDynamicContext` as a distinct type * Resolve longstanding tech debt from #104 & #117 about `BodyExecution`. When we do not use pickles, there is no need for `transient`. Moving the reference to `ExecutorStepExecution` from `PlaceholderTask` avoids a memory leak. This is only possible now that `Callback` holds the `ExecutorStepExecution`. And we can now properly handle `AsynchronousExecution.interrupt`. * Sketch of updating `WorkspaceStepExecution`: #180 (comment) * Allowing `FilePathDynamicContext` to block waiting for an agent to reconnect. Not great since the CPS VM thread will only grant 5s. Better would be to avoid unnecessary use of `StepContext.get(FilePath.class)` (which implicitly requires a live `Channel`); perhaps `FilePathRepresentation` should become an API. Amends b65ff1c. #180 (comment) * Adding a comment about 8b08398 * 8b08398 seems to allow a968adf to be finished. * Refining 8b08398 to avoid tail call * `org.jenkinsci.plugins.workflow.support.concurrent.Futures` deprecated * SpotBugs reminds me that `WorkspaceStepExecution.Callback` needs an SVUID; taking one from trunk * Adding `NodeTranslator` * Beginning rewrite to `AgentErrorCondition` * Extracted `RetryExecutorStepTest` from `ExecutorStepTest` * Rely on blocking `ExecutorStepExecution.onResume` * Removing bogus `waitForMessage` (and unnecessary `interrupt`) from `unrestorableAgent`. Sometimes a failing `waitForMessage` seems to trigger https://maven.apache.org/surefire/maven-surefire-plugin/faq.html#corruptedstream though I cannot track down why. * `PlaceholderTask` may not hold a reference to `ExecutorStepExecution` * Equivalent of #184 for `ExecutorStepDynamicContext` * Assert jenkinsci/workflow-api-plugin@b1778a9 * jenkinsci/workflow-cps-plugin#534 & jenkinsci/workflow-job-plugin#260 released * Windows tests jenkinsci/workflow-basic-steps-plugin#203 (comment) * Avoiding `powershell` jenkinsci/workflow-basic-steps-plugin#203 (comment) * Better usage of Hamcrest; got a flake https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=6779178701 with `QueueTaskCancelled` instead of `RemovedNodeCause` * In fact the flake was in a different test (`canceledQueueItem`) * Forgot to make `canceledQueueItem` not use `sh`, though it should not really matter since this step is expected to fail anyway * Test flake: https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=7240329028 * Use `AgentOfflineException` from `ExecutorStepDynamicContext` as well * Can now run all of `AgentErrorConditionTest` * Responding to `CancellationException` in `ExecutorStepDynamicContext` with `QueueTaskCancelled`, analogous to handling in `ExecutorPickle` * Permit `canceledQueueItem` to pass on `RemovedNodeCause` * Normalizing indentation * jenkinsci/workflow-step-api-plugin#124 deprecates an overload which 2e95e3a only handles at one call site; TBD if others should be switched to `actualInterruption` * Refined `ExecutorStepDynamicContext` behavior in case of `quietingDown` #180 (comment) or `terminating` #180 (comment) * Add an explicit `serialVersionUID` to `ExecutorStepExecution.PlaceholderTask.Callback` #180 (comment) * Optimizing `withExecution` to only consider steps in the current build #180 (comment) * Pick up jenkinsci/workflow-api-plugin#256 #180 (comment) * jenkinsci/workflow-api-plugin#256 released
- Loading branch information