Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-49707] Agent missing after controller restart to fail resumption of node step, not kill whole build #180

Merged
merged 92 commits into from
Oct 17, 2022

Conversation

jglick
Copy link
Member

@jglick jglick commented Nov 16, 2021

JENKINS-49707

Upstream of jenkinsci/workflow-basic-steps-plugin#195 and jenkinsci/kubernetes-plugin#1211. Subsumes #231. Subsumes #249.

Relevant prior work: #104, jenkinsci/kubernetes-plugin#461, #47, #48, #115, #101, jenkinsci/workflow-basic-steps-plugin#86, #175, #176, #179.

Plan of record:

The idea is that users updating plugins would see a somewhat altered behavior for resumption inside node blocks by default, which I hope would go mostly unnoticed; and could adjust pipelines to look something like

podTemplate(…) {
  retry(count: 3, conditions: [kubernetesAgent(), nonresumable()]) {
    node(POD_LABEL) {
      checkout scm
      sh 'make world'
    }
  }
}

or

pipeline {
  agent {
    kubernetes {
      yaml: ''
      retries: 3
    }
  }
  stages {
    stage('main') {
      steps {
        sh 'make world'
      }
    }
  }
}

The key is to avoid relying on Pickles for ExecutorStepExecution to resume properly. (More precisely, pickles using TryRepeatedly; SecretPickle and the like are harmless in this context.) Instead we take explicit control over how the step is resumed and what happens when. The problem with Pickle here is that it is too magical; we would like to be able to detect after resumption that an agent is gone and then throw an exception at the Groovy level, which is not possible if the build is aborted at the CPS VM level due to an inability to rehydrate.


I had planned to offer a system property to revert to ExecutorPickle in case of catastrophic problems (along with a couple sanity tests of that mode) but after reviewing the changes to ExecutorStepExecution this does not seem straightforward: the major simplifications and fixes to it rely on moving away from ExecutorPickle.

*/
@Deprecated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the factories for these pickles be deleted or replaced with an implementation that just throws an exception if it encounters an object of the specified type? Or do you want to keep them around in case some other step or wild Groovy code is relying on them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to retain them for compatibility. Weird Pipeline script is one possibility. They could also be used in other Pipeline steps; I know PushdStep binds FilePathDynamicContext, so that is not affected, but there may be others. Seems harmless enough to leave them here.

After jenkinsci/workflow-step-api-plugin#73 this caused `SecretsMasker` to block on a `KubernetesComputer`
which in turn blocked provisioning of a `TaskListener`, causing `RestartPipelineTest.terminatedPodAfterRestart` to fail to find a message:

```
FINE	o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: rediscovering that terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed and timeout has expired
FINE	o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: could not get TaskListener in CpsStepContext[9:sh]:Owner[terminated Pod After Restart/1:terminated Pod After Restart jenkinsci#1]
java.util.concurrent.TimeoutException
	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:102)
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:115)
Caused: java.io.IOException
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:118)
	at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:97)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet$1Delegate.doGet(ContextVariableSet.java:98)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:85)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:73)
	at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:95)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.getContextVariable(CpsThread.java:135)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:297)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.getListener(DefaultStepContext.java:127)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:87)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution._listener(DurableTaskStep.java:421)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.listener(DurableTaskStep.java:412)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:363)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:570)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:553)
	at …
FINE	o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener#getLogger: creating filtered stream
FINE	o.j.p.w.s.d.DurableTaskStep$Execution#_listener: terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed for 15 sec, assuming it is not coming back
```
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause());
LOGGER.fine(() -> "rediscovering that " + node + " has been removed and timeout has expired");
listener().getLogger().println(node + " has been removed for " + Util.getTimeSpanString(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS) + ", assuming it is not coming back");
throw new FlowInterruptedException(Result.ABORTED, /* TODO false probably more appropriate */true, new ExecutorStepExecution.RemovedNodeCause());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that

if (t instanceof FlowInterruptedException && ((FlowInterruptedException) t).getCauses().stream().anyMatch(
c -> c instanceof ExecutorStepExecution.RemovedNodeCause || c instanceof ExecutorStepExecution.QueueTaskCancelled)) {
return true;
bypasses the actualInterruption check in https://github.com/jenkinsci/workflow-basic-steps-plugin/blob/d57e3ca46d24bb91b8410467284176b0f319c64e/src/main/java/org/jenkinsci/plugins/workflow/steps/RetryStepExecution.java#L93 so the distinction would only matter for other catching steps like https://github.com/jenkinsci/workflow-basic-steps-plugin/blob/d57e3ca46d24bb91b8410467284176b0f319c64e/src/main/java/org/jenkinsci/plugins/workflow/steps/CatchErrorStep.java#L229

Copy link
Member

@dwnusbaum dwnusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a brief look at the PR and added some questions, but in general it looks fine. I am planning on reviewing it more thoroughly tomorrow.

@@ -45,7 +45,7 @@
* Allows a step body to save a representation of a workspace
* without forcing a particular {@link FilePath#getChannel} to be used the whole time.
*/
@Extension public final class FilePathDynamicContext extends DynamicContext.Typed<FilePath> {
@Extension(ordinal = 100) public final class FilePathDynamicContext extends DynamicContext.Typed<FilePath> {
Copy link
Member

@dwnusbaum dwnusbaum Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ordinal = 100 needed for PushdStep to keep working, or something else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jglick jglick marked this pull request as draft September 12, 2022 13:40
@jglick

This comment was marked as resolved.

@jglick jglick requested a review from dwnusbaum October 10, 2022 22:23
@jglick jglick marked this pull request as ready for review October 17, 2022 13:57
@basil
Copy link
Member

basil commented May 21, 2023

Breaks io.jenkins.docker.pipeline.DockerNodeStepTest#withinNode and io.jenkins.docker.pipeline.DockerNodeStepTest#nodeWithinDockerNodeWithinNode as discovered in jenkinsci/docker-plugin#943.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants