  1. Jenkins
  2. JENKINS-76088

Tests are queuing runs but not cleaning them up, causing random failures


      Here's an example of the problem from the test output:

         3.743 [id=794]	WARNING	o.j.h.t.RemainingActivityListener#onTearDown: test0 #1 still seems to be running, which could break deletion of log files or metadata
         3.753 [id=794]	INFO	hudson.lifecycle.Lifecycle#onStatusUpdate: Stopping Jenkins
         3.755 [id=856]	INFO	hudson.model.Run#execute: test0 #1 aborted
      java.lang.InterruptedException: sleep interrupted
      	at java.base/java.lang.Thread.sleep(Native Method)
      	at org.jvnet.hudson.test.SleepBuilder.perform(SleepBuilder.java:52)
      	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
      	at hudson.model.Build$BuildExecution.build(Build.java:199)
      	at hudson.model.Build$BuildExecution.doRun(Build.java:164)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:527)
      	at hudson.model.Run.execute(Run.java:1840)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
      	at hudson.model.ResourceController.execute(ResourceController.java:101)
      	at hudson.model.Executor.run(Executor.java:446)
         3.772 [id=861]	INFO	o.j.h.test.SimpleCommandLauncher#afterDisconnect: killed Process[pid=1659906, exitValue=0] with {HUDSON_COOKIE=31b09490-6551-4c52-912f-398e61f35baa} for slave0
         3.772 [id=106]	INFO	o.j.h.test.SimpleCommandLauncher#afterDisconnect: no process for slave0
         3.773 [id=794]	INFO	hudson.lifecycle.Lifecycle#onStatusUpdate: Jenkins stopped
         3.847 [id=794]	INFO	o.j.h.t.TemporaryDirectoryAllocator#dispose: deleting /my/path/to/oss/jenkinsci/lenient-shutdown-plugin/target/tmp/j h15337751292702828254
         3.849 [id=856]	WARNING	jenkins.util.Listeners#lambda$notify$0
      java.lang.IllegalStateException: Jenkins.instance is missing. Read the documentation of Jenkins.getInstanceOrNull to see what you are doing wrong.
      	at jenkins.model.Jenkins.get(Jenkins.java:804)
      	at jenkins.triggers.ReverseBuildTrigger$RunListenerImpl.calculateCache(ReverseBuildTrigger.java:250)
      	at jenkins.triggers.ReverseBuildTrigger$RunListenerImpl.onCompleted(ReverseBuildTrigger.java:276)
      	at hudson.model.listeners.RunListener.lambda$fireCompleted$0(RunListener.java:223)
      	at jenkins.util.Listeners.lambda$notify$0(Listeners.java:59)
      	at jenkins.util.Listeners.notify(Listeners.java:67)
      	at hudson.model.listeners.RunListener.fireCompleted(RunListener.java:221)
      	at hudson.model.Run.execute(Run.java:1881)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
      	at hudson.model.ResourceController.execute(ResourceController.java:101)
      	at hudson.model.Executor.run(Executor.java:446)
         3.851 [id=856]	FINEST	hudson.XmlFile#write: Writing /my/path/to/oss/jenkinsci/lenient-shutdown-plugin/target/tmp/j h15337751292702828254/jobs/test0/builds/1/build.xml
      java.lang.Throwable
      	at hudson.XmlFile.write(XmlFile.java:205)
      	at hudson.model.Run.save(Run.java:2001)
      	at hudson.model.Run.execute(Run.java:1893)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
      	at hudson.model.ResourceController.execute(ResourceController.java:101)
      	at hudson.model.Executor.run(Executor.java:446)
      

      ...and here's the error reported in the Maven output:

      [ERROR] com.sonymobile.jenkins.plugins.lenientshutdown.ShutdownSlaveActionPermissionTest.testDoIndexPermissionAlice -- Time elapsed: 3.857 s <<< ERROR!
      java.nio.file.DirectoryNotEmptyException: /my/path/to/oss/jenkinsci/lenient-shutdown-plugin/target/tmp/j h15337751292702828254
      	at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:246)
      	at java.base/sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:110)
      	at java.base/java.nio.file.Files.deleteIfExists(Files.java:1191)
      	at org.jvnet.hudson.test.TemporaryDirectoryAllocator.delete(TemporaryDirectoryAllocator.java:145)
      	at org.jvnet.hudson.test.TemporaryDirectoryAllocator.dispose(TemporaryDirectoryAllocator.java:103)
      	at org.jvnet.hudson.test.TestEnvironment.dispose(TestEnvironment.java:83)
      	at org.jvnet.hudson.test.JenkinsRule.after(JenkinsRule.java:532)
      	at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:677)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.lang.Thread.run(Thread.java:840)
      	Suppressed: java.io.IOException: These files still exist : jobs
      		at org.jvnet.hudson.test.TemporaryDirectoryAllocator.delete(TemporaryDirectoryAllocator.java:149)
      		... 6 more
      

      ...in other words, the test ends and the JenkinsRule stops Jenkins, which aborts the run. However, a RunListener receives the onCompleted event after "Jenkins stopped", and the run's build.xml is then written to the temporary directory while the TemporaryDirectoryAllocator is trying to delete it, so the clean-up fails!

      In theory, the RemainingActivityListener warns us about the potential for a problem (there are 18 hits for WARNING o.j.h.t.RemainingActivityListener#onTearDown in 8 files), but the affected tests really need to fail with such a message, rather than having it buried in the test output while we take our chances (race condition) with other processes.

      Once we expose all the (randomly) defective tests, let's introduce an abstraction that makes it easy to terminate any queued and/or in-progress runs before the test completes, such that all onCompleted events are sent before the JenkinsRule tries to stop Jenkins.
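
      To make the idea concrete, here is a minimal sketch of what such an abstraction could look like. It uses only java.util.concurrent as a stand-in for Jenkins: TrackedRuns, schedule() and the thread pool are hypothetical names for illustration and are not part of Jenkins or the jenkins-test-harness. The point is the ordering: close() aborts every tracked run and then waits for the worker threads to actually finish (including anything that runs on completion) before returning, and it throws if something is still running instead of only logging a warning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not a Jenkins API): tracks every "run" a test starts
 * and guarantees they have all finished before teardown continues.
 */
public class TrackedRuns implements AutoCloseable {
    // Stand-in for Jenkins executors; an assumption made for this sketch.
    private final ExecutorService executor = Executors.newCachedThreadPool();
    private final List<Future<?>> runs = new ArrayList<>();
    private final long timeoutSeconds;

    public TrackedRuns(long timeoutSeconds) {
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Starts a run and remembers it so close() can clean it up. */
    public Future<?> schedule(Runnable body) {
        Future<?> run = executor.submit(body);
        runs.add(run);
        return run;
    }

    /** Aborts all tracked runs, then waits until their threads have really ended. */
    @Override
    public void close() throws InterruptedException {
        for (Future<?> run : runs) {
            run.cancel(true); // interrupts the run if it is in progress
        }
        executor.shutdown();
        // cancel() returns immediately; awaitTermination is what waits for
        // the worker threads (and their completion handling) to finish.
        if (!executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
            throw new IllegalStateException("runs still active at teardown");
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger completed = new AtomicInteger(); // stand-in for onCompleted
        CountDownLatch started = new CountDownLatch(1);
        try (TrackedRuns runs = new TrackedRuns(10)) {
            runs.schedule(() -> {
                started.countDown();
                try {
                    Thread.sleep(60_000); // a long-running build step
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    completed.incrementAndGet(); // "onCompleted" fires here
                }
            });
            started.await(); // the test body would make its assertions here
        }
        // Only reached once every run has fully completed.
        System.out.println("completion callbacks delivered: " + completed.get());
    }
}
```

      In a real test, the equivalent of close() would have to run before the JenkinsRule tears Jenkins down, so that no listener fires and no build.xml is written once the temporary directory is being deleted.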

      Bonus points if the runs we're queuing can stop using the SleepBuilder and instead use a latch of some sort so we can signal to the step across threads that we're done. It would likely speed up the tests AND reduce the potential for race conditions, especially if some operations randomly take longer.
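
      As a rough illustration of the latch idea, the sketch below replaces a fixed sleep with two CountDownLatch instances: one tells the test that the step has really started, the other lets the test release the step. LatchStep and its method names are hypothetical; a real implementation would presumably be a Builder used in place of the SleepBuilder, which this plain-Java sketch does not attempt to reproduce.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a build step that blocks until the test releases it. */
public class LatchStep {
    private final CountDownLatch started = new CountDownLatch(1);
    private final CountDownLatch release = new CountDownLatch(1);

    /** Runs on the executor thread; returns true if released, false on timeout. */
    public boolean perform(long timeoutSeconds) throws InterruptedException {
        started.countDown(); // tell the test we are now "building"
        return release.await(timeoutSeconds, TimeUnit.SECONDS);
    }

    /** Called from the test thread: blocks until the step is actually running. */
    public boolean awaitStarted(long timeoutSeconds) throws InterruptedException {
        return started.await(timeoutSeconds, TimeUnit.SECONDS);
    }

    /** Called from the test thread: lets the step finish normally. */
    public void release() {
        release.countDown();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            LatchStep step = new LatchStep();
            Callable<Boolean> task = () -> step.perform(30);
            Future<Boolean> run = executor.submit(task);

            step.awaitStarted(30); // no guessing how long start-up takes
            // ... the test would assert on the in-progress state here ...
            step.release();        // finish the step instead of interrupting it

            System.out.println("completed normally: " + run.get(30, TimeUnit.SECONDS));
        } finally {
            executor.shutdownNow();
        }
    }
}
```

      Because the step ends as soon as it is released, the test neither waits out a fixed sleep nor depends on the sleep being long enough on a slow machine.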

      From what I can tell, this problem wasn't so terrible before (i.e. circa 1816.v8138d8056949 of the jenkins-test-harness), when the failure to dispose of the temporary directories would be silently ignored.

            Assignee: Unassigned
            Reporter: Olivier Dagenais (oli)