Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-63164] Clear CpsBodyExecution.thread when the body completes #367

Closed
wants to merge 2 commits into from

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Jul 22, 2020

See JENKINS-63164. Kind of reimplements #245 after #279 replaced it with a different approach that works better than #245 but only affects the parallel and load steps.

JENKINS-63164 is the root cause of RestartingLoadStepTest.updatedBindingsOnRestart being flaky, see #366.

I am really not sure about the best way to fix the issue. See #367 (comment) for some discussion of other options.

// may still be reachable (e.g. closures that outlive the body). If the CpsBodyExecution is still part
// of the program, we need to make sure that we do not reference to a dead CpsThread because that may
// cause resumption issues (JENKINS-63164).
thread = null;
Copy link
Member Author

@dwnusbaum dwnusbaum Jul 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this. Some other ideas I had:

  • Update CpsThreadGroup.run so that when it removes a thread from CpsThreadGroup.threads it calls some new method CpsThread.destroy or similar that nulls out fields (mainly contextVariables) in CpsThread in case something is still referencing it, and so that if the CpsThread is still live then things work as they did before the patch. Might fix other issues as well that we aren't aware of.
  • Modify groovy-cps so that the Env captured by CpsClosure does not reference things like CallEnv.returnAddress from nested environments. Not sure about correctness/feasibility of this, seems complicated. At least in the simple case here we only need the locals from the nested environments and if we only stored those variables there would be no issues, but perhaps there could be a references from the locals to some of the fields in CallEnv, and maybe we need to maintain the actual Env references as-is for semantics.

If we stick with the current approach, some methods like CpsBodyExecution.isLaunched might need to be updated to account for the fact that thread can now go from null to non-null and then back to null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first option sounds better to me. Is there a downside to that?

Copy link
Member Author

@dwnusbaum dwnusbaum Jul 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first option sounds better to me. Is there a downside to that?

I'm not really sure, I can prototype it. If CpsBodyExecution.thread is actually accessible from the program as a captured variable in the closure's environment, then nulling out all of the fields in CpsThread could result in some weird behavior/NPEs, but I suspect that the CpsThread is not actually accessible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting some other ideas that came up in a discussion today in case they are useful to someone in the future:

  • Replacing CpsBodyExecution.thread with CpsBodyExecution.threadId and only holding a transient reference to the CpsThread itself, loading the value by looking up the threadId in CpsThreadGroup. Not sure if this would be possible, and would require some adjustments to maintain serialization compatiblity.
  • Hooking into RiverWriter or similar code in workflow-support and just not serializing CpsThreads that are not part of the root CpsThreadGroup. Seems complicated, and I'm not sure about compatibility.

Comment on lines +220 to +221
SemaphoreStep.success("wait/1", null);
r.assertBuildStatusSuccess(r.waitForCompletion(b));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird diff, but all I did here was removed the redundant call to SemaphoreStep.waitForStart and the comment/code regarding "Before the fix for JENKINS-53709" since jenkinsci/workflow-durable-task-step-plugin#104 changed the behavior.

@dwnusbaum
Copy link
Member Author

Incrementals publisher failed (InterruptedException in BourneShellScript.launchWithCookie) but the tests are passing so I am marking this as ready for review.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid I do not understand the issues well enough to comment.

Copy link
Contributor

@bitwiseman bitwiseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to Jesse, I'm not sure how to evaluate this or #368. They both look reasonable, but maybe we could discuss a bit further.

@dwnusbaum
Copy link
Member Author

Closing in favor of #368.

@dwnusbaum dwnusbaum closed this Jul 23, 2020
@dwnusbaum dwnusbaum deleted the JENKINS-63164 branch July 23, 2020 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants