[JENKINS-10615] Workspaces seem to be removed prematurely on concurrent jobs

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component: core

      Concurrent Builds workspaces are sometimes removed before the build steps have completed.
      This is not seen on the primary workspace, only those that get "@2" etc appended.

      This causes the job in question to fail, with symptoms that vary depending on the process running. Generally there is a "file not found" error, followed by a process complaining that its current working directory has gone.

      Hard to reproduce, but the frequency seems to increase with uptime, from rare to more than once a day.

      Affects builds running on both Linux and Windows slave nodes. Not seen on builds running on master.
      Disk space, memory and cpu availability more than adequate.


          Danny Staple created issue -

          Danny Staple added a comment -

          Possibly related exception.

          Danny Staple made changes -
          Attachment New: jenkins_log_exception.txt [ 20685 ]

          Danny Staple added a comment -

          Raised as critical as we have lost a number of longish tests this way.


          Danny Staple added a comment -

          Now that I've found the Workspace clean-up log, I'm looking at it. It doesn't have timestamps, but the timing must be close.


          Danny Staple added a comment - - edited

          Okay - finding some related threads - http://jenkins.361315.n4.nabble.com/workspace-cleanup-thread-deleting-all-active-workspaces-td392638.html, and thinking this may be related to #JENKINS-3653.


          Danny Staple added a comment - - edited

          I've found a Workspace clean-up.log that incriminates itself: it states that it tried, and failed, to clean up a workspace belonging to a running job. The log was written during the run of the job, and it definitely corresponds with the job failing shortly afterwards at the next operation that required a file from the workspace.

          ERROR: Failed to delete d:\hudson\workspace\myproject@2
          hudson.util.IOException2: remote file operation failed: d:\hudson\workspace\myproject@2 at hudson.remoting.Channel@9a5286f:testserver2
          	at hudson.FilePath.act(FilePath.java:749)
          	at hudson.FilePath.act(FilePath.java:735)
          	at hudson.FilePath.deleteRecursive(FilePath.java:819)
          --
          	at hudson.model.WorkspaceCleanupThread.execute(WorkspaceCleanupThread.java:74)
          	at hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:51)
          	at java.lang.Thread.run(Thread.java:619)
          Caused by: java.io.IOException: Unable to delete d:\hudson\workspace\myproject@2\testcode\Libraries\ext01.pyd
          	at hudson.Util.deleteFile(Util.java:261)
          	at hudson.Util.deleteRecursive(Util.java:303)
          	at hudson.Util.deleteContentsRecursive(Util.java:222)
          --
          	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          	at java.lang.Thread.run(Unknown Source)
          Deleting d:\hudson\workspace\myproject@3
          ERROR: Failed to delete d:\hudson\workspace\myproject@3
          
          hudson.util.IOException2: remote file operation failed: d:\hudson\workspace\myproject@3 at hudson.remoting.Channel@9a5286f:testserver2
          	at hudson.FilePath.act(FilePath.java:749)
          	at hudson.FilePath.act(FilePath.java:735)
          	at hudson.FilePath.deleteRecursive(FilePath.java:819)
          --
          	at hudson.model.WorkspaceCleanupThread.execute(WorkspaceCleanupThread.java:74)
          	at hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:51)
          	at java.lang.Thread.run(Thread.java:619)
          Caused by: java.io.IOException: Unable to delete d:\hudson\workspace\myproject@3\testcode\Libraries\ext01.pyd
          	at hudson.Util.deleteFile(Util.java:261)
          	at hudson.Util.deleteRecursive(Util.java:303)
          	at hudson.Util.deleteContentsRecursive(Util.java:222)
          

          This looks like it has partially deleted the contents of that workspace, enough to break the job that was running in it. Sometimes, depending on where the job is in its run, it may fully remove the workspace too.


          Danny Staple added a comment -

          I've now found hudson.model.WorkspaceCleanupThread.disabled and set this to true. I should soon know (give it a week or so) if this improves the system stability.
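For readers unfamiliar with this kind of switch: the name looks like a JVM system property, which would normally be passed at startup as `-Dhudson.model.WorkspaceCleanupThread.disabled=true`. The sketch below only shows how a boolean flag of that kind can be read with the standard `Boolean.getBoolean` call. The class `CleanupFlagSketch` and its method are hypothetical and are not taken from the Jenkins source.

```java
public class CleanupFlagSketch {
    // Property name as quoted in the comment above.
    static final String FLAG = "hudson.model.WorkspaceCleanupThread.disabled";

    /** Returns true only when the system property is set to "true" (case-insensitive). */
    static boolean cleanupDisabled() {
        return Boolean.getBoolean(FLAG);
    }

    public static void main(String[] args) {
        // Hypothetical check a periodic task might make before doing any work.
        if (cleanupDisabled()) {
            System.out.println("Workspace clean-up is disabled; skipping this pass.");
        } else {
            System.out.println("Workspace clean-up is enabled.");
        }
    }
}
```

Running this with and without the `-D` option above shows the two branches; an unset property reads as `false`.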


          Mandeep Rai added a comment -

          from an IRC conversation:

          abayer: AbstractBuild.AbstractRunner.run does the workspace lease acquisition and release, but publishers (such as archiving artifacts) are run via AbstractBuild.AbstractRunner.post, which gets called after run.
          the workspace lease gets released before the post-build actions run

          I think what happens is this: when a build has finished all its build steps, it prematurely releases its lock on the workspace. Then, while artifacts are being archived, another executor starts up, sees the workspace as unlocked, and tries to clear it.
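The ordering problem described above can be illustrated with a small, single-threaded sketch. It is a toy model under stated assumptions, not Jenkins code: the class `LeaseOrderingSketch`, the `leased` set, and both methods are hypothetical names, and the "cleanup" is reduced to a check of whether anyone still holds a lease.

```java
import java.util.HashSet;
import java.util.Set;

public class LeaseOrderingSketch {
    /** Hypothetical stand-in for the set of workspaces currently leased by builds. */
    static final Set<String> leased = new HashSet<>();

    /** Hypothetical cleanup rule: a workspace is a deletion candidate when nobody holds a lease on it. */
    static boolean cleanupWouldDelete(String workspace) {
        return !leased.contains(workspace);
    }

    /**
     * Simulates one build and reports whether a cleanup pass that arrives during
     * the post-build phase would treat the workspace as deletable.
     */
    static boolean exposedDuringPost(String workspace, boolean releaseBeforePost) {
        leased.add(workspace);              // lease acquired at the start of the run phase
        // ... build steps would execute here ...
        if (releaseBeforePost) {
            leased.remove(workspace);       // ordering described in the comment above
        }
        // Post-build phase (for example archiving artifacts) is in progress here.
        boolean exposed = cleanupWouldDelete(workspace);
        leased.remove(workspace);           // after the build ends the lease is gone either way
        return exposed;
    }

    public static void main(String[] args) {
        System.out.println("release before post: exposed=" + exposedDuringPost("myproject@2", true));
        System.out.println("release after post:  exposed=" + exposedDuringPost("myproject@2", false));
    }
}
```

In this model the workspace is exposed only when the lease is dropped before the post-build phase, which matches the suspicion that holding the lease until the publishers finish would close the window.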

          Mandeep Rai made changes -
          Link New: This issue is related to JENKINS-7827 [ JENKINS-7827 ]

            Assignee: Kohsuke Kawaguchi
            Reporter: Danny Staple
            Votes: 6
            Watchers: 16