JENKINS-10615: Workspaces seem to be removed prematurely on concurrent jobs

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component: core

      Workspaces of concurrent builds are sometimes removed before the build steps have completed.
      This is not seen on the primary workspace, only on those that get "@2" etc. appended.

      This causes the job in question to fail, with symptoms varying somewhat depending on the running process: generally a file-not-found error, followed by a process complaining that its current working directory has gone.

      Hard to reproduce, but the frequency seems to increase with uptime, from rare to more than daily.

      Affects builds running on both Linux and Windows slave nodes; not seen on builds running on the master.
      Disk space, memory, and CPU availability are more than adequate.

          [JENKINS-10615] Workspaces seem to be removed prematurely on concurrent jobs

          Danny Staple added a comment -

          Possible related exception.


          Danny Staple added a comment -

          Raised as critical as we have lost a number of longish tests this way.


          Danny Staple added a comment -

          Now looking at the Workspace clean-up log, now that I've found it. It doesn't have timestamps, but the timing must be close.


          Danny Staple added a comment - edited

          Okay - I'm finding some related threads - http://jenkins.361315.n4.nabble.com/workspace-cleanup-thread-deleting-all-active-workspaces-td392638.html - and thinking this may be related to JENKINS-3653.


          Danny Staple added a comment - edited

          I've found a Workspace clean-up.log that incriminates itself: it records that the cleanup thread tried, and failed, to clean up a workspace belonging to a running job. The log entry falls during the run of the job, and corresponds with the job failing shortly afterwards at the next operation that required a file from the workspace.

          ERROR: Failed to delete d:\hudson\workspace\myproject@2
          hudson.util.IOException2: remote file operation failed: d:\hudson\workspace\myproject@2 at hudson.remoting.Channel@9a5286f:testserver2
          	at hudson.FilePath.act(FilePath.java:749)
          	at hudson.FilePath.act(FilePath.java:735)
          	at hudson.FilePath.deleteRecursive(FilePath.java:819)
          --
          	at hudson.model.WorkspaceCleanupThread.execute(WorkspaceCleanupThread.java:74)
          	at hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:51)
          	at java.lang.Thread.run(Thread.java:619)
          Caused by: java.io.IOException: Unable to delete d:\hudson\workspace\myproject@2\testcode\Libraries\ext01.pyd
          	at hudson.Util.deleteFile(Util.java:261)
          	at hudson.Util.deleteRecursive(Util.java:303)
          	at hudson.Util.deleteContentsRecursive(Util.java:222)
          --
          	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          	at java.lang.Thread.run(Unknown Source)
          Deleting d:\hudson\workspace\myproject@3
          ERROR: Failed to delete d:\hudson\workspace\myproject@3
          
          hudson.util.IOException2: remote file operation failed: d:\hudson\workspace\myproject@3 at hudson.remoting.Channel@9a5286f:testserver2
          	at hudson.FilePath.act(FilePath.java:749)
          	at hudson.FilePath.act(FilePath.java:735)
          	at hudson.FilePath.deleteRecursive(FilePath.java:819)
          --
          	at hudson.model.WorkspaceCleanupThread.execute(WorkspaceCleanupThread.java:74)
          	at hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:51)
          	at java.lang.Thread.run(Thread.java:619)
          Caused by: java.io.IOException: Unable to delete d:\hudson\workspace\myproject@3\testcode\Libraries\ext01.pyd
          	at hudson.Util.deleteFile(Util.java:261)
          	at hudson.Util.deleteRecursive(Util.java:303)
          	at hudson.Util.deleteContentsRecursive(Util.java:222)
          

          It looks like the cleanup thread partially deleted the contents of that workspace - enough to break the job that was running in it. Sometimes, depending on where a job is in its run, it may fully remove the workspace too.


          Danny Staple added a comment -

          I've now found hudson.model.WorkspaceCleanupThread.disabled and set it to true. I should know soon (give it a week or so) whether this improves system stability.


          Mandeep Rai added a comment -

          from an IRC conversation:

          abayer: AbstractBuild.AbstractRunner.run does the workspace lease acquisition and release, but publishers (such as archiving artifacts) run via AbstractBuild.AbstractRunner.post, which gets called after run.
          The workspace lease gets released before the post-build actions run.

          I think the problem is that when a build has finished all its build steps, it prematurely releases its lock on the workspace; then, while artifact archiving takes place, another executor starts up, sees the workspace as unlocked, and tries to clear it.


          Danny Staple added a comment -

          Unfortunately, this can occur before the post-build stages have started, during the actual run steps. Disabling that thread has completely stopped the problem from occurring in my setup, but it would be better to find out how to fix it properly.


          sharon xia added a comment -

          I hit this issue frequently. Whenever I start more than two concurrent builds of one job, as soon as one of them (for example, job@2) finishes, all of the files under job@2, job@3, and job@4 are lost. This has caused me a lot of trouble.


          Rob Petti added a comment -

          Workspace cleanup can be disabled entirely until this is fixed in Jenkins. Just start Jenkins with the following system property set:

          java -Dhudson.model.WorkspaceCleanupThread.disabled=true -jar jenkins.war
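
          For context, a disable flag like this is a plain JVM system property read at startup, which is why it has to go on the java command line rather than be set later. Below is a minimal sketch of how such a flag is typically consulted (simplified and hypothetical; this is not the actual WorkspaceCleanupThread source):

          // Hypothetical sketch: how a "<class name>.disabled" system property
          // flag is typically read. Not the real Jenkins core code.
          public class DisabledFlagSketch {
              public static void main(String[] args) {
                  // Boolean.getBoolean returns true only if the named system
                  // property exists and equals "true".
                  boolean disabled = Boolean.getBoolean("hudson.model.WorkspaceCleanupThread.disabled");
                  if (disabled) {
                      System.out.println("Workspace cleanup disabled; periodic deletion skipped.");
                  } else {
                      System.out.println("Workspace cleanup would run on its normal schedule.");
                  }
              }
          }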


          Rob Petti added a comment -

          Additional logging information for this problem can be obtained by adding a new system logger through the Loggers interface with the following name: 'hudson.model.WorkspaceCleanupThread'. That should give us insight into why the workspaces are being deleted.
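
          As a rough java.util.logging equivalent of what that Loggers UI entry configures (a sketch using only standard j.u.l calls; the handler and level choices here are assumptions, not Jenkins API):

          import java.util.logging.ConsoleHandler;
          import java.util.logging.Level;
          import java.util.logging.Logger;

          // Sketch: capture fine-grained output from the workspace cleanup
          // thread's logger. Plain java.util.logging; nothing Jenkins-specific.
          public class CleanupLoggerSketch {
              public static void main(String[] args) {
                  Logger log = Logger.getLogger("hudson.model.WorkspaceCleanupThread");
                  log.setLevel(Level.FINE);
                  ConsoleHandler handler = new ConsoleHandler();
                  handler.setLevel(Level.FINE); // ConsoleHandler defaults to INFO
                  log.addHandler(handler);
                  log.fine("Logger attached; cleanup decisions will appear here.");
              }
          }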


          Danny Staple added a comment -

          We have been using the startup flag for some time - I can confirm that it stops the issue from occurring. Thanks for the tip on the loggers - this may assist me in finding other issues - https://wiki.jenkins-ci.org/display/JENKINS/Logging.


          Matt Gumbel added a comment -

          I have a reproducible test case for this issue...happy to test any proposed fixes that come along.


          Danny Staple added a comment -

          For us it was intermittent, but often enough to be a serious problem: when you run as many builds daily as we do, intermittent problems become a daily certainty. Still, we could not reproduce it on demand.

          Matt, can you describe your setup? Having a locally 100% reproducible case (or close to it) would help someone make a fix for it.


          Matt Gumbel added a comment -

          1. Configure a slave with 2 executors
          2. Create a test job restricted to run only on that slave
          3. Check Execute concurrent builds
          4. Use shell script below for job execution
          5. Add post-build step to delete workspace when job is done
          6. Download testdir.tar.gz and place it in /tmp on the build slave
          7. Click build-now about 15 times to queue up a bunch of instances of this job

          When the shell script exits nonzero, it has found a concurrency issue. Different failure conditions in the script will be hit depending on timing.

          #!/bin/bash
          # Each concurrent build instance extracts a known file tree into the
          # workspace, then verifies every file is still present; a missing file
          # or a vanished lock directory means something else touched this workspace.

          pwd
          ls -la
          
          if [[ -d delete-in-progress ]] ; then
              find delete-in-progress
              echo "fail (delete-in-progress)"
              exit 1
          fi
          
          if [[ -f testdir/build_number ]]; then
              echo "oops...found existing testdir from run not yet cleaned up"
              cat testdir/build_number
              rm -rf testdir
              exit 1
          fi
          
          if ! mkdir -m 0777 lockdir; then
              echo "fail: lockdir already exists from concurrent job instance $?"
              exit 1
          fi
          
          tar xzf /tmp/testdir.tar.gz
          echo $BUILD_NUMBER >> testdir/build_number
          
          cd testdir
          for i in 21 5; do
              cd $i
              for j in `seq 1 100`; do
                  cd $j
                  for k in `seq 1 1000`; do
                      if [[ ! -f $k ]]; then
                          echo "failed to find: $i $j $k"
                          pwd
                          ls -l
                          exit 1
                      fi
                  done
                  cd ..
              done
              cd ..
          done
          cd ..
          
          if [[ ! -d lockdir ]] ; then
              echo "fail: lockdir disappeared"
              exit 1
          fi
          
          rmdir lockdir
          
          if ! mkdir -m 0777 delete-in-progress; then
              echo "fail: somebody else is deleting here"
              exit 1
          fi
          
          touch delete-in-progress/$BUILD_NUMBER
          
          exit 0
          


          Matt Gumbel added a comment -

          I should also note: if you turn on the log timestamps plugin, you can look at the failed job and see via the timestamps that it was running at the same instant as another instance's cleanup phase.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          test/src/test/groovy/hudson/model/AbstractProjectTest.groovy
          http://jenkins-ci.org/commit/jenkins/61cf2df0660e507fab20442f44e917d75f946917
          Log:
          JENKINS-10615

          Reproduced the problem. Workspace is getting released prematurely, before publishers run. This has all sorts of serious problems.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          changelog.html
          core/src/main/java/hudson/model/AbstractBuild.java
          http://jenkins-ci.org/commit/jenkins/d183345007ecd6fae01565975fb48ddec6e47af4
          Log:
          [FIXED JENKINS-10615]

          Hold on to the lease until the very end.
          Previously, the lease was only held until the main build section is over, before publishers start running.

          Compare: https://github.com/jenkinsci/jenkins/compare/7648a95a7c57...d183345007ec


          Kohsuke Kawaguchi added a comment -

          The root cause is that the workspace used for a build was released once the main build section was over, so the publishers ended up running in a workspace that the build no longer owned.

          As a result, the workspace cleanup thread might try to recover this "no longer in use" workspace, or another build might try to use it, resulting in various failure modes.
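
          To illustrate the ordering described above, here is a hypothetical, heavily simplified sketch (Lease, buildSteps, and runPublishers are stand-ins, not the real AbstractBuild API):

          // Hypothetical sketch of the lease-ordering bug and its fix.
          public class LeaseOrderingSketch {
              static class Lease {
                  void release() { System.out.println("lease released"); }
              }

              // Buggy ordering: the lease is released as soon as the build
              // steps finish, so the publishers run in a workspace the build
              // no longer owns and the cleanup thread may reclaim it meanwhile.
              static void buggy(Lease lease) {
                  try {
                      buildSteps();
                  } finally {
                      lease.release(); // too early
                  }
                  runPublishers(); // runs without the lease
              }

              // Fixed ordering: the lease is held until the very end,
              // covering the publishers as well.
              static void fixed(Lease lease) {
                  try {
                      buildSteps();
                      runPublishers();
                  } finally {
                      lease.release(); // released only after everything is done
                  }
              }

              static void buildSteps()    { System.out.println("build steps"); }
              static void runPublishers() { System.out.println("publishers"); }

              public static void main(String[] args) {
                  buggy(new Lease());
                  fixed(new Lease());
              }
          }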


          dogfood added a comment -

          Integrated in jenkins_main_trunk #3036

          Result = SUCCESS


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          test/src/test/groovy/hudson/model/AbstractProjectTest.groovy
          http://jenkins-ci.org/commit/jenkins/8c8102efd4fa324d34dcb9dc156c37aacb1feb1a
          Log:
          JENKINS-10615

          Reproduced the problem. Workspace is getting released prematurely, before publishers run. This has all sorts of serious problems.

          (cherry picked from commit 61cf2df0660e507fab20442f44e917d75f946917)


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          core/src/main/java/hudson/model/AbstractBuild.java
          http://jenkins-ci.org/commit/jenkins/d95e16428c47933a6a81c5508f92d977ab2efa4d
          Log:
          [FIXED JENKINS-10615]

          Hold on to the lease until the very end.
          Previously, the lease was only held until the main build section is over, before publishers start running.

          (cherry picked from commit d183345007ecd6fae01565975fb48ddec6e47af4)

          Conflicts:
          changelog.html


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          test/src/test/groovy/hudson/model/AbstractProjectTest.groovy
          http://jenkins-ci.org/commit/jenkins/85e9e126773c0bb20a8529a2e6591dde17d7e209
          Log:
          JENKINS-10615 AbstractProjectTest.testWorkspaceLock frequently fails on jenkins.ci due to InterruptedException in HudsonTestCase.setUp.
          Possibly because it is sorted after JENKINS-15156 testGetBuildAfterGC and the test suite times out.


          dogfood added a comment -

          Integrated in jenkins_main_trunk #3111
          JENKINS-10615 AbstractProjectTest.testWorkspaceLock frequently fails on jenkins.ci due to InterruptedException in HudsonTestCase.setUp. (Revision 85e9e126773c0bb20a8529a2e6591dde17d7e209)

          Result = SUCCESS
          Jesse Glick : 85e9e126773c0bb20a8529a2e6591dde17d7e209
          Files :

          • test/src/test/groovy/hudson/model/AbstractProjectTest.groovy


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          test/src/test/groovy/hudson/model/AbstractProjectTest.groovy
          http://jenkins-ci.org/commit/jenkins/389a565de417170f586830ee9fa7a7ec9749fc68
          Log:
          JENKINS-10615 AbstractProjectTest.testWorkspaceLock frequently fails on jenkins.ci due to InterruptedException in HudsonTestCase.setUp.
          Possibly because it is sorted after JENKINS-15156 testGetBuildAfterGC and the test suite times out.

          (cherry picked from commit 85e9e126773c0bb20a8529a2e6591dde17d7e209)


          dogfood added a comment -

          Integrated in jenkins_main_trunk #3715
          JENKINS-10615 AbstractProjectTest.testWorkspaceLock frequently fails on jenkins.ci due to InterruptedException in HudsonTestCase.setUp. (Revision 389a565de417170f586830ee9fa7a7ec9749fc68)

          Result = SUCCESS
          Jesse Glick : 389a565de417170f586830ee9fa7a7ec9749fc68
          Files :

          • test/src/test/groovy/hudson/model/AbstractProjectTest.groovy


            Assignee: Kohsuke Kawaguchi
            Reporter: Danny Staple
            Votes: 6
            Watchers: 16