Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding RemainingActivityListener for diagnosis #643

Merged
merged 2 commits into from
Aug 24, 2023

Conversation

jglick
Copy link
Member

@jglick jglick commented Aug 23, 2023

Helpful to see cases where #198 revealed something nondeterministic about test execution.

Copy link
Contributor

@MarkEWaite MarkEWaite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good to me.

As far as I can tell, this is an "always on" extension that will be run on every test. Let me know if I've misunderstood.

@jglick
Copy link
Member Author

jglick commented Aug 23, 2023

this is an "always on" extension that will be run on every test

Correct; the default behavior is to print a warning about a potential detected problem, but otherwise take no action.

@jglick
Copy link
Member Author

jglick commented Aug 23, 2023

ICYMI jenkinsci/copyartifact-plugin#206 (comment) (in that case .wait=true)

@jglick jglick merged commit 3efc797 into jenkinsci:master Aug 24, 2023
13 checks passed
@jglick jglick deleted the RemainingActivityListener branch August 24, 2023 19:11
jglick added a commit to jglick/copyartifact-plugin that referenced this pull request Aug 24, 2023
@basil
Copy link
Member

basil commented Aug 24, 2023

This flags builds that are executed with FreeStyleBuild b = j.buildAndAssertSuccess(f);. This does not seem right to me.

@jglick
Copy link
Member Author

jglick commented Aug 24, 2023

Not that I am aware of; do you have an example?

@basil
Copy link
Member

basil commented Aug 24, 2023

Try running the core test suite with org.jvnet.hudson.test.RemainingActivityListener.fatal=true; tons of examples. We might need to add waitForCompletion(r) to JenkinsRule#assertBuildStatus.

MarkEWaite pushed a commit to jenkinsci/copyartifact-plugin that referenced this pull request Aug 24, 2023
@jglick
Copy link
Member Author

jglick commented Aug 24, 2023

Possibly. Is this output visible somewhere? Some PR? Any specific test cases for example?

@basil
Copy link
Member

basil commented Aug 24, 2023

We might need to add waitForCompletion(r) to JenkinsRule#assertBuildStatus.

That doesn't seem to help. Even with

diff --git a/src/main/java/org/jvnet/hudson/test/JenkinsRule.java b/src/main/java/org/jvnet/hudson/test/JenkinsRule.java
index 424bad72..9c460f96 100644
--- a/src/main/java/org/jvnet/hudson/test/JenkinsRule.java
+++ b/src/main/java/org/jvnet/hudson/test/JenkinsRule.java
@@ -1442,6 +1442,7 @@ public class JenkinsRule implements TestRule, MethodRule, RootAction {
      * Asserts that the outcome of the build is a specific outcome.
      */
     public <R extends Run> R assertBuildStatus(Result status, R r) throws Exception {
+        waitForCompletion(r);
         if(status==r.getResult())
             return r;

I still get e.g.

[ERROR] Tests run: 5, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 33.21 s <<< FAILURE! -- in hudson.LauncherTest
[ERROR] hudson.LauncherTest.correctlyExpandEnvVars -- Time elapsed: 16.01 s <<< FAILURE!
java.lang.AssertionError: test0 #1 still seems to be running, which could break deletion of log files or metadata
        at org.jvnet.hudson.test.RemainingActivityListener.onTearDown(RemainingActivityListener.java:62)
        at org.jvnet.hudson.test.JenkinsRule.after(JenkinsRule.java:496)
        at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:625)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.lang.Thread.run(Thread.java:833)

where that test is only running

FreeStyleBuild build = rule.buildAndAssertSuccess(project);

I get dozens of similar errors. I am afraid this utility is not very useful with this high rate of false positives.

Note that I am running with -DforkCount=8 on an Intel Core i7-9700K with only 8 physical cores, so I am intentionally overtaxing the CPU here to create scheduling delays.

@jglick
Copy link
Member Author

jglick commented Aug 24, 2023

I will check if I can reproduce, and adjust accordingly.

@jglick
Copy link
Member Author

jglick commented Aug 24, 2023

I see some failures using -DforkCount=1C (which nearly melts my laptop) after

diff --git test/pom.xml test/pom.xml
index 75035b367d..19e08b39d0 100644
--- test/pom.xml
+++ test/pom.xml
@@ -116,7 +116,7 @@ THE SOFTWARE.
     <dependency>
       <groupId>${project.groupId}</groupId>
       <artifactId>jenkins-test-harness</artifactId>
-      <version>2058.va_7b_41a_286207</version>
+      <version>2062.v3efc79721e45</version>
       <scope>test</scope>
       <exclusions>
         <exclusion>
@@ -322,6 +322,7 @@ THE SOFTWARE.
           <systemPropertyVariables>
             <hudson.maven.debug>${mavenDebug}</hudson.maven.debug>
             <buildDirectory>${project.build.directory}</buildDirectory>
+            <org.jvnet.hudson.test.RemainingActivityListener.fatal>true</org.jvnet.hudson.test.RemainingActivityListener.fatal>
           </systemPropertyVariables>
           <reuseForks>false</reuseForks>
         </configuration>

mostly of queue items, which may or may not matter depending on the circumstances, and some executors. For now, fatal is intended to be used for diagnostic purposes rather than being configured permanently. Will dig into the cause more deeply in a couple weeks. Seems harmless in the meantime since by default it merely logs a warning.

@basil
Copy link
Member

basil commented Aug 24, 2023

I'm not sure I would really consider this change complete until it is proven that the fatal mode can be configured permanently in some significant test suite. I would personally not have approved this PR in its current form.

Another problem is these false positives:

[ERROR] hudson.model.QueueCrashTest.doNotPersistQueueOnCrashBeforeSave -- Time elapsed: 14.70 s <<< FAILURE!
java.lang.AssertionError: hudson.model.Queue$BuildableItem:hudson.model.FreeStyleProject@5ba1b54f[test0]:1 is still scheduled, which if it ever runs, could break deletion of log files or metadata
        at org.jvnet.hudson.test.RemainingActivityListener.onTearDown(RemainingActivityListener.java:62)
        at org.jvnet.hudson.test.JenkinsRule.after(JenkinsRule.java:496)
        at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:625)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.lang.Thread.run(Thread.java:833)

[ERROR] hudson.model.QueueCrashTest.persistQueueOnCrash -- Time elapsed: 6.571 s <<< FAILURE!
java.lang.AssertionError: hudson.model.Queue$BuildableItem:hudson.model.FreeStyleProject@6ddc90f[test0]:2 is still scheduled, which if it ever runs, could break deletion of log files or metadata
        at org.jvnet.hudson.test.RemainingActivityListener.onTearDown(RemainingActivityListener.java:62)
        at org.jvnet.hudson.test.JenkinsRule.after(JenkinsRule.java:496)
        at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:625)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.lang.Thread.run(Thread.java:833)


private static String problem() {
for (Computer c : Jenkins.get().getComputers()) {
for (Executor x : c.getAllExecutors()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is incorrect. QueueTaskFuture#get (which is what JenkinsRule uses to determine build completion) waits until AsyncFutureImpl#set is called to complete the future before notifying consumers. That call is done by WorkUnitContext#synchronizeEnd via Executor#finish1. At that point, Executor#isIdle is still false and the executor is still in Computer#getAllExecutors (which is what the code in this PR is checking). Executor#finish2 is called after Executor#finish1 and calls Computer#removeExecutor which (possibly asynchronously!) calls executors.remove(e) to remove the executor, at which point Computer#getAllExecutors will finally stop returning the non-idle executor. If Executor#finish1 completes before the executor is removed by Executor#finish2 (possible in certain scheduling scenarios) then the code in this PR will result in a false positive. Similarly, if Executor#finish2 decides to run the removal asynchronously (possible in certain scheduling scenarios) then the code in this PR will also result in a false positive.

There are really two things that need to be solved to make this code correct:

  • The code in this PR needs to wait until both Executor#finish1 and Executor#finish2 have been called before calling Computer#getAllExecutors. Currently this code runs after QueueTaskFuture#get which means that Executor#finish1 has been called but Executor#finish2 has not necessarily been called.
  • If Executor#finish2 has been called but has placed its task onto Computer#threadPoolForRemoting (which happens only sometimes) then the code in this PR needs to wait until that task has completed, for example by having Computer#removeExecutor keep track of how many pending removals are in progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants