WorkspaceCleanupThread may delete workspaces of running jobs

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Environment: Linux host; Linux, OSX, Windows slaves. Jenkins version 1.602.

      The problem is as described in JENKINS-4501. As requested in JENKINS-4501, I am creating a new issue as this problem still exists in 1.602.

      In short, Jenkins silently and erroneously deletes workspaces on slaves for matrix projects that are not old.

      Over the time I've worked with Jenkins, this behavior has cost literally days of work and waiting on very long-running builds that rely on cached workspaces to be manageable. It cost me more hours again today, after restoring Jenkins to a new server following a hardware failure. The setting was reset because it lives outside the normally recommended backup files, and I didn't think to add it when I "fixed" this last time.

      Would it not be easier to have hudson.model.WorkspaceCleanupThread.disabled default to true? Having the default behavior be "destroy my data" seems bad, especially with how cheap disk is now. I'm sure when this option was implemented it made a lot of sense, but when I can get a 1TB for $50, it just seems wrong-headed. Let the fallow workspaces lie. I can clean them up if I need to.

      If that's not an acceptable solution, could it not be moved to a config location in the Jenkins home? That way we can be relatively sure that the setting will be propagated in backups and will not bite someone who thought they had solved this problem and then forgot about it.

          [JENKINS-27329] WorkspaceCleanupThread may delete workspaces of running jobs

          Matthew Webber added a comment -

          >> Jenkins silently and erroneously deletes workspaces on slaves for matrix projects that are not old
          We just hit this problem (or what appears to be this problem) as well.

          An extract from $JENKINS_HOME/Workspace clean-up.log:

          Deleting /Users/dlshudson/jenkins_slave/workspace/dials_distribute on dials-mac-mini
          Deleting /scratch/jenkins_slave/workspace/dials_distribute on dials-ws133
          Deleting /scratch/jenkins_slave/workspace/dials_distribute on dials-ws154

          Side note: it's a shame those log lines are not time-stamped.
          The 3 mentioned workspaces are from a matrix project, and all workspaces had been accessed recently.

          Presumably there is a bug in the workspace cleanup code that means it does not handle matrix projects correctly.

          Note that the job configuration specifies 4 slaves: 1 by label, and 3 by individual nodes. The workspaces that were deleted were those on the 3 slaves that were specified as individual nodes, but the workspace on the slave that was specified by label was not deleted. Possibly a clue to the bug?

          The workaround is to set hudson.model.WorkspaceCleanupThread.disabled=true.

          Matthew Webber added a comment -

          Daniel knows about this area, so assigning to him for comment (sorry, Daniel!)

          Ingo Weinhold added a comment -

          Since JENKINS-30916 has been closed as a duplicate: Here the ticket description only says that workspaces that aren't old are deleted. In fact a workspace can even be deleted while a build using the workspace is in progress. The lines from the system log for such a case:

          Okt 13, 2015 3:29:27 AM INFORMATION hudson.slaves.CommandLauncher launch
          slave agent launched for BonefishMac-Ubuntu-12.04
          Okt 13, 2015 3:31:15 AM INFORMATION hudson.model.AsyncPeriodicWork$1 run
          Started Workspace clean-up
          Okt 13, 2015 3:31:21 AM INFORMATION hudson.model.Run execute
          Bar-Nightly/label=Ubuntu-12.04 #222 main build action completed: FAILURE
          


          Daniel Beck added a comment -

          bonefish Same reason, workspace cleanup uses the root workspace directory modification date to determine whether it's old. As matrix jobs only build in subdirectories (corresponding to axes), it's trivial for these to appear unmodified for a long time.
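The effect described above can be reproduced outside Jenkins. The following is a minimal sketch, not the Jenkins implementation: the class name, the `label/Ubuntu-12.04` directory layout, and the 60-day backdating are assumptions made for illustration. It shows that writing a file deep inside a directory tree does not touch the modification time of the top-level directory, so a check that looks only at the root can see a busy workspace as stale. A tree-wide scan (`newestInTree`) is shown only for contrast; the thread does not say the fix works that way.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.stream.Stream;

public class WorkspaceAgeDemo {

    /** Age signal based only on the top-level directory's own mtime. */
    static Instant rootModified(Path root) throws IOException {
        return Files.getLastModifiedTime(root).toInstant();
    }

    /** For contrast: the newest mtime of any file or directory in the tree. */
    static Instant newestInTree(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.map(p -> {
                try {
                    return Files.getLastModifiedTime(p).toInstant();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }).max(Instant::compareTo).get(); // walk() always yields the root itself
        }
    }

    /** Builds a hypothetical matrix-style workspace whose root looks 60 days old. */
    static Path buildStaleLookingWorkspace() throws IOException {
        Path root = Files.createTempDirectory("workspace");
        Path axis = Files.createDirectories(root.resolve("label").resolve("Ubuntu-12.04"));
        // Backdate the root, as if nothing had been created directly in it for 60 days.
        Files.setLastModifiedTime(root, FileTime.from(Instant.now().minus(60, ChronoUnit.DAYS)));
        // A "build" writes only inside the axis subdirectory.
        Files.write(axis.resolve("build.log"), "fresh output".getBytes(StandardCharsets.UTF_8));
        return root;
    }

    public static void main(String[] args) throws IOException {
        Path root = buildStaleLookingWorkspace();
        System.out.println("root mtime:     " + rootModified(root));
        System.out.println("newest in tree: " + newestInTree(root));
    }
}
```

Running `main` prints two timestamps: the root's is roughly 60 days in the past, while the newest entry in the tree is from the current run.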


          David Aldrich added a comment -

          I experienced this bug last night. The workspace of a matrix job was deleted while the job was running.

          Jan 12, 2017 9:00:00 PM INFO hudson.triggers.SCMTrigger$Runner run
          SCM changes detected in tml_system_level_regression_tests_linux_all_branches_and_trunk » branches/TRY_TML_LEDA_17May2016. Triggering #125
          Jan 12, 2017 9:00:00 PM INFO hudson.triggers.SCMTrigger$Runner run
          SCM changes detected in tml_system_level_regression_tests_linux_all_branches_and_trunk » trunk. Triggering #153
          Jan 12, 2017 10:35:00 PM INFO hudson.triggers.SCMTrigger$Runner run
          SCM changes detected in Regression_test_TestParams_VisualStudio. Triggering #403
          Jan 12, 2017 10:41:41 PM INFO hudson.model.Run execute
          Regression_test_TestParams_VisualStudio #403 main build action completed: SUCCESS
          Jan 12, 2017 11:03:20 PM INFO hudson.model.AsyncPeriodicWork$1 run
          Started Workspace clean-up
          Jan 12, 2017 11:03:48 PM INFO hudson.model.AsyncPeriodicWork$1 run
          Finished Workspace clean-up. 27,337 ms
          Jan 12, 2017 11:04:51 PM INFO hudson.model.Run execute
          tml_system_level_regression_tests_linux_all_branches_and_trunk/trunk #153 main build action completed: FAILURE

          We are running Jenkins 2.40 with Multi-Branch Project Plugin 0.3.


          David Aldrich added a comment -

          Please could this issue be assigned to someone?


          Matthew Webber added a comment -

          davida2009 In the absence of a proper fix, we are using the workaround described in my earlier comment, and that works for us.

          David Aldrich added a comment -

          Matthew, please forgive my ignorance but where do I set hudson.model.WorkspaceCleanupThread.disabled=true?


          Matthew Webber added a comment -

          davida2009 It's passed as a Java system property when you start Jenkins. Something like:

          java -Dhudson.model.WorkspaceCleanupThread.disabled=true -jar jenkins.war

          See https://wiki.jenkins-ci.org/display/JENKINS/Features+controlled+by+system+properties

          Exactly how you change this on your system will depend on how you installed Jenkins, and what scripts you use to start it.
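To check whether the flag actually reached the JVM, it helps to recall how `-D` options surface inside a Java process. The snippet below is a minimal sketch using only the JDK; the class name is hypothetical, and the thread does not show how Jenkins itself reads the property, so treat this as an illustration of the mechanism rather than of Jenkins internals.

```java
public class CleanupFlagCheck {

    /** Property name taken from the workaround described in this issue. */
    static final String PROPERTY = "hudson.model.WorkspaceCleanupThread.disabled";

    /** True only if the JVM was started with -D<PROPERTY>=true (case-insensitive). */
    static boolean cleanupDisabled() {
        return Boolean.getBoolean(PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println(PROPERTY + " = " + cleanupDisabled());
    }
}
```

Started as `java -Dhudson.model.WorkspaceCleanupThread.disabled=true CleanupFlagCheck`, the check reports `true`; without the flag it reports `false`. The same principle applies when Jenkins runs inside a servlet container: the option has to be on the container's JVM command line (for Tomcat this is commonly done through `CATALINA_OPTS`), not on a separate `java -jar` invocation.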

          David Aldrich added a comment -

          Thanks for your help Matthew.


          Tony Wallace added a comment -

          Thanks to all who wrote on this bug in 2017. I think the recent activity helped me find this bug when I searched for something to explain what was happening. This workaround does seem to work and I'm very grateful.  

          Respectfully, I only wish I'd found it when I searched for the same, last year. 

           


          Reinhold Füreder added a comment - - edited

          Also experienced for scripted pipeline (on master-only Jenkins installation): I think this issue should really just be fixed (instead of being addressed by implementing one of the various possible and more or less nice workarounds), especially because I hope it should not be too difficult for a Jenkins (core) developer? => I dare to put my money on jglick

          According to https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/WorkspaceCleanupThread.java (see #shouldBeDeleted() method) there is "only special" support for AbstractProject (and thus FreeStyleProject) – but even that is IMHO not 100% safe (only the workspace on the node of the last build is kept, i.e. in case of concurrently running builds it may delete the non-last still running one too)...

          My naive search in Jenkins JavaDoc only showed a very easy (but unfortunately also non-perfect) possibility, based on
          http://javadoc.jenkins-ci.org/hudson/model/Job.html#isBuilding--

          if (item instanceof Job<?,?>) {
            Job<?,?> j = (Job<?,?>) item;
            if (j.isBuilding()) {
              return false;
            }
          }
          

          Here the problem might be that old workspaces on other nodes might never be deleted, I think. (Maybe that might be nonetheless still better than the current behaviour.)

          => Actually only in case when Job#isBuilding() returns true, then all the (possible concurrent) running builds need to be checked and only their (active) workspace should be skipped from deletion? => Therefore still hoping and praying for Jesse...

          (Very) naive PR: https://github.com/jenkinsci/jenkins/pull/3444
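The stricter idea raised in this comment (skip only the workspaces of builds that are actually running, instead of skipping every workspace of a building job) can be modelled without Jenkins. This is a sketch under stated assumptions: `BuildInfo`, `activeWorkspaces`, and this `shouldBeDeleted` signature are hypothetical stand-ins, not Jenkins classes or the code of the PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ActiveWorkspaceGuard {

    /** Hypothetical stand-in for a build record; not a Jenkins type. */
    static final class BuildInfo {
        final String node;
        final String workspace;
        final boolean running;

        BuildInfo(String node, String workspace, boolean running) {
            this.node = node;
            this.workspace = workspace;
            this.running = running;
        }
    }

    static String key(String node, String workspace) {
        return node + ":" + workspace;
    }

    /** Collects a node:workspace key for every build that is still running. */
    static Set<String> activeWorkspaces(List<BuildInfo> builds) {
        Set<String> active = new HashSet<>();
        for (BuildInfo b : builds) {
            if (b.running) {
                active.add(key(b.node, b.workspace));
            }
        }
        return active;
    }

    /** A workspace is a deletion candidate only if it looks old AND no running build uses it. */
    static boolean shouldBeDeleted(String node, String workspace, boolean looksOld, List<BuildInfo> builds) {
        if (!looksOld) {
            return false;
        }
        return !activeWorkspaces(builds).contains(key(node, workspace));
    }

    public static void main(String[] args) {
        List<BuildInfo> builds = Arrays.asList(
                new BuildInfo("agent-1", "/ws/job", true),   // still running
                new BuildInfo("agent-2", "/ws/job", false)); // finished
        System.out.println("agent-1: " + shouldBeDeleted("agent-1", "/ws/job", true, builds));
        System.out.println("agent-2: " + shouldBeDeleted("agent-2", "/ws/job", true, builds));
    }
}
```

With this shape, an old workspace on a node where nothing is running remains eligible for cleanup, which addresses the concern above that stale workspaces on other nodes might never be deleted.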


          Kim Abbott added a comment -

          I, too, have been seeing this happen recently.  It's happening on slave jobs (they are restricted to which slave they run on and this configuration never changes) but also on jobs utilizing Publish over SSH plugin for copying files to/from other machines and executing commands via SSH Publishers.

          We run under Tomcat - I'm not sure how to effect a change in this troubling behavior.  The workarounds mentioned here don't look like something I can use.  If anyone has guidance, I'm all ears.


          Matthew Webber added a comment -

          rupunzlkim can you please add the version of Jenkins you are running to your problem report?

          Adam Hong added a comment -

          also seeing this happen recently, version 2.60.3


          Kim Abbott added a comment -

          Sorry for the delay.  The version that we've noticed this on is 2.7.4


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Reinhold Füreder
          Path:
          core/src/main/java/hudson/model/WorkspaceCleanupThread.java
          http://jenkins-ci.org/commit/jenkins/f258aff7a736a81306ecb7d3c56cacc9b3a09a68
          Log:
          JENKINS-27329 Less aggressive WorkspaceCleanupThread (#3444)

          I dare to claim that the default behaviour of WorkspaceCleanupThread is too aggressive => this little change is by no means perfect (or admittedly even far from perfect), but IMHO a saner or slightly more defensive default behaviour.

          Mind that according to https://github.com/jenkinsci/jenkins/blob/9e64bcdcb4a2cf12d59dfa334e09ffb448d361e9/core/src/main/java/hudson/model/Job.java#L301 this "only" checks whether or not the last build of a job is in progress, while the JavaDoc says "Returns true if a build of this project is in progress." (cf. http://javadoc.jenkins-ci.org/hudson/model/Job.html#isBuilding--)

          • Fix compilation
          • Dummy commit to trigger pipeline

          Previous pipeline execution (https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-3444/2/tests) failed with one failing test that at first glance appears to be unrelated to my change(s) and looks like a flaky test?

          • Add fine logging message

          NOTE: This service has been marked for deprecation: https://developer.github.com/changes/2018-04-25-github-services-deprecation/

          Functionality will be removed from GitHub.com on January 31st, 2019.
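The caveat in the commit message, that the check looks only at the last build, is easy to see with a toy model. The sketch below uses a plain list of booleans (oldest build first, `true` meaning "still running") as a hypothetical stand-in for a job's builds; it makes no calls into Jenkins and only contrasts the two semantics.

```java
import java.util.Arrays;
import java.util.List;

public class IsBuildingSemantics {

    /** "Last build is running": the behaviour the commit message attributes to the check. */
    static boolean lastBuildRunning(List<Boolean> running) {
        return !running.isEmpty() && running.get(running.size() - 1);
    }

    /** "Any build is running": what the quoted JavaDoc sentence suggests. */
    static boolean anyBuildRunning(List<Boolean> running) {
        return running.contains(Boolean.TRUE);
    }

    public static void main(String[] args) {
        // Build #1 is still running; the newer build #2 has already finished.
        List<Boolean> builds = Arrays.asList(true, false);
        System.out.println("last build running: " + lastBuildRunning(builds));
        System.out.println("any build running:  " + anyBuildRunning(builds));
    }
}
```

For concurrent builds where an older build outlives a newer one, the two answers differ, which is the case in which a guard based on the last build alone would still allow a workspace in use to be cleaned up.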

          Oleg Nenashev added a comment - - edited

          Fix has been applied in 2.125. IMHO the fix is not complete for parallel AbstractProject builds, but it is better than nothing. Will create a follow-up ticket


          Oleg Nenashev added a comment -

          danielbeck this thing is marked as RFE in the changelog, but I think this is a bug. Would you agree if I recategorize it?


            Assignee: Unassigned
            Reporter: Quentin Hartman (qhartman)
            Votes: 13
            Watchers: 20