  Jenkins / JENKINS-27329

WorkspaceCleanupThread may delete workspaces of running jobs


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component: core
    • Environment: Linux host; Linux, OSX, and Windows slaves. Jenkins version 1.602.

    Description

      The problem is as described in JENKINS-4501. As requested in JENKINS-4501, I am creating a new issue as this problem still exists in 1.602.

      In short, Jenkins silently and erroneously deletes workspaces on slaves for matrix projects that are not old.

      Over the time I've worked with Jenkins, this behavior has created literally days of extra work and waiting, because our very long-running builds rely on cached workspaces to be manageable. It has cost me more hours again today, after restoring Jenkins to a new server following a hardware failure. This setting was reset, since it exists outside the normally recommended backup files, and I didn't think to add it when I "fixed" this last time.

      Would it not be easier to have hudson.model.WorkspaceCleanupThread.disabled default to true? Having the default behavior be "destroy my data" seems bad, especially with how cheap disk is now. I'm sure this option made a lot of sense when it was implemented, but when I can get a 1 TB drive for $50, it just seems wrong-headed. Let the fallow workspaces lie. I can clean them up if I need to.

      If that's not an acceptable solution, could it not be moved to a config location in the Jenkins home? That way we can be relatively sure that the setting will be propagated in backups and not bite someone who thought they solved this problem and had forgotten about it?

      Attachments

        Issue Links

          Activity

            mwebber Matthew Webber added a comment -

            >> Jenkins silently and erroneously deletes workspaces on slaves for matrix projects that are not old
            We just hit this problem (or what appears to be this problem) as well.

            An extract from $JENKINS_HOME/Workspace clean-up.log:

            Deleting /Users/dlshudson/jenkins_slave/workspace/dials_distribute on dials-mac-mini
            Deleting /scratch/jenkins_slave/workspace/dials_distribute on dials-ws133
            Deleting /scratch/jenkins_slave/workspace/dials_distribute on dials-ws154
            

            side-note: it's a shame those log lines are not time-stamped
            The 3 mentioned workspaces are from a matrix project, and all workspaces had been accessed recently.

            Presumably there is a bug in the workspace cleanup code that means it does not handle matrix projects correctly.

            Note that the job configuration specifies 4 slaves: 1 by label, and 3 by individual nodes. The workspaces that were deleted were those on the 3 slaves that were specified as individual nodes, but the workspace on the slave that was specified by label was not deleted. Possibly a clue to the bug?

            The workaround is to set hudson.model.WorkspaceCleanupThread.disabled=true.


            mwebber Matthew Webber added a comment -

            Daniel knows about this area, so assigning to him for comment (sorry, Daniel!)

            bonefish Ingo Weinhold added a comment -

            Since JENKINS-30916 has been closed as a duplicate of this one: the description here only says that workspaces that aren't old are deleted. In fact, a workspace can even be deleted while a build using that workspace is in progress. The lines from the system log for such a case:

            Okt 13, 2015 3:29:27 AM INFORMATION hudson.slaves.CommandLauncher launch
            slave agent launched for BonefishMac-Ubuntu-12.04
            Okt 13, 2015 3:31:15 AM INFORMATION hudson.model.AsyncPeriodicWork$1 run
            Started Workspace clean-up
            Okt 13, 2015 3:31:21 AM INFORMATION hudson.model.Run execute
            Bar-Nightly/label=Ubuntu-12.04 #222 main build action completed: FAILURE
            
            danielbeck Daniel Beck added a comment -

            bonefish Same reason, workspace cleanup uses the root workspace directory modification date to determine whether it's old. As matrix jobs only build in subdirectories (corresponding to axes), it's trivial for these to appear unmodified for a long time.
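
            For illustration, a minimal standalone sketch of the effect described above. This is not Jenkins code; the class and paths are invented for the example, and it assumes typical filesystem semantics where a directory's modification time only changes when entries are created, deleted, or renamed directly inside it. That is why a matrix job, which only ever writes into its axis subdirectories, can leave the workspace root looking "old" to a check based on the root's modification date:

            import java.nio.file.Files;
            import java.nio.file.Path;
            import java.nio.file.attribute.FileTime;

            public class DirMtimeDemo {
                public static void main(String[] args) throws Exception {
                    // Stand-ins for a matrix workspace root and one axis subdirectory
                    Path root = Files.createTempDirectory("workspace-root");
                    Path axis = Files.createDirectory(root.resolve("label=linux"));

                    FileTime before = Files.getLastModifiedTime(root);
                    Thread.sleep(2000); // let the clock move past coarse mtime resolution

                    // "Build" activity happens only inside the axis subdirectory...
                    Files.write(axis.resolve("build.log"), "building...".getBytes());

                    // ...so the root's mtime is unchanged, and a cleanup that looks only at
                    // the root's modification date would consider the workspace stale.
                    FileTime after = Files.getLastModifiedTime(root);
                    System.out.println("root mtime before: " + before + ", after: " + after);
                }
            }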

            davida2009 David Aldrich added a comment -

            I experienced this bug last night. The workspace of a matrix job was deleted while the job was running.

            Jan 12, 2017 9:00:00 PM INFO hudson.triggers.SCMTrigger$Runner run
            SCM changes detected in tml_system_level_regression_tests_linux_all_branches_and_trunk » branches/TRY_TML_LEDA_17May2016. Triggering #125
            Jan 12, 2017 9:00:00 PM INFO hudson.triggers.SCMTrigger$Runner run
            SCM changes detected in tml_system_level_regression_tests_linux_all_branches_and_trunk » trunk. Triggering #153
            Jan 12, 2017 10:35:00 PM INFO hudson.triggers.SCMTrigger$Runner run
            SCM changes detected in Regression_test_TestParams_VisualStudio. Triggering #403
            Jan 12, 2017 10:41:41 PM INFO hudson.model.Run execute
            Regression_test_TestParams_VisualStudio #403 main build action completed: SUCCESS
            Jan 12, 2017 11:03:20 PM INFO hudson.model.AsyncPeriodicWork$1 run
            Started Workspace clean-up
            Jan 12, 2017 11:03:48 PM INFO hudson.model.AsyncPeriodicWork$1 run
            Finished Workspace clean-up. 27,337 ms
            Jan 12, 2017 11:04:51 PM INFO hudson.model.Run execute
            tml_system_level_regression_tests_linux_all_branches_and_trunk/trunk #153 main build action completed: FAILURE

            We are running Jenkins 2.40 with Multi-Branch Project Plugin 0.3.

            davida2009 David Aldrich added a comment -

            Please could this issue be assigned to someone?


            mwebber Matthew Webber added a comment -

            davida2009 In the absence of a proper fix, we are using the workaround described in my earlier comment, and that works for us.

            davida2009 David Aldrich added a comment -

            Matthew, please forgive my ignorance but where do I set hudson.model.WorkspaceCleanupThread.disabled=true?


            mwebber Matthew Webber added a comment -

            davida2009 It's passed as a Java system property when you start Jenkins. Something like:

            java -Dhudson.model.WorkspaceCleanupThread.disabled=true -jar jenkins.war
            

            See https://wiki.jenkins-ci.org/display/JENKINS/Features+controlled+by+system+properties

            Exactly how you change this on your system will depend on how you installed Jenkins, and what scripts you use to start it.

            davida2009 David Aldrich added a comment -

            Thanks for your help Matthew.

            hushp1pt Tony Wallace added a comment -

            Thanks to all who wrote on this bug in 2017. I think the recent activity helped me find this bug when I searched for something to explain what was happening. This workaround does seem to work and I'm very grateful.  

            Respectfully, I only wish I'd found it when I searched for the same thing last year.

             

            reinholdfuereder Reinhold Füreder added a comment - - edited

            Also experienced with a scripted pipeline (on a master-only Jenkins installation): I think this issue should really just be fixed (instead of being addressed by one of the various more or less nice workarounds), especially because I hope it should not be too difficult for a Jenkins (core) developer => I dare to put my money on jglick

            According to https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/WorkspaceCleanupThread.java (see #shouldBeDeleted() method) there is "only special" support for AbstractProject (and thus FreeStyleProject) – but even that is IMHO not 100% safe (only the workspace on the node of the last build is kept, i.e. in case of concurrently running builds it may delete the non-last still running one too)...

            My naive search in Jenkins JavaDoc only showed a very easy (but unfortunately also non-perfect) possibility, based on
            http://javadoc.jenkins-ci.org/hudson/model/Job.html#isBuilding--

            // Proposed check inside WorkspaceCleanupThread#shouldBeDeleted():
            // never delete a workspace while its job reports a build in progress.
            if (item instanceof Job<?,?>) {
              Job<?,?> j = (Job<?,?>) item;
              if (j.isBuilding()) {
                return false;
              }
            }
            

            The problem here might be that old workspaces on other nodes would then never be deleted, I think. (Though that might nonetheless still be better than the current behaviour.)

            => Actually, when Job#isBuilding() returns true, all the (possibly concurrent) running builds would need to be checked, and only their (active) workspaces should be skipped from deletion? => Therefore still hoping and praying for Jesse...

            (Very) naive PR: https://github.com/jenkinsci/jenkins/pull/3444
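
            A rough, hypothetical sketch of the per-build check described above, assuming an AbstractProject whose builds know their node and workspace. This is illustrative only and is not taken from the PR; the class and method names are invented:

            import hudson.FilePath;
            import hudson.model.AbstractBuild;
            import hudson.model.AbstractProject;
            import hudson.model.Node;

            class WorkspaceInUseCheck {
                /** Returns true if any in-progress build of the project is using dir on node. */
                static <P extends AbstractProject<P, B>, B extends AbstractBuild<P, B>>
                boolean workspaceInUse(P project, Node node, FilePath dir) {
                    for (B build : project.getBuilds()) {
                        if (!build.isBuilding()) {
                            continue; // only running builds can be using a workspace
                        }
                        // Keep the directory if this running build executes on the node that
                        // owns it and its workspace is the very directory about to be deleted.
                        if (node.getNodeName().equals(build.getBuiltOnStr())
                                && dir.equals(build.getWorkspace())) {
                            return true;
                        }
                    }
                    return false;
                }
            }

            Note that this would only cover AbstractProject-style jobs; Pipeline builds do not extend AbstractBuild, so they would need a different kind of check.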

            rupunzlkim Kim Abbott added a comment -

            I, too, have been seeing this happen recently. It's happening on jobs that are restricted to a particular slave (and this configuration never changes), but also on jobs that use the Publish over SSH plugin to copy files to/from other machines and execute commands via SSH publishers.

            We run under Tomcat, so I'm not sure how to effect a change in this troubling behavior. The workarounds mentioned here don't look like something I can use. If anyone has guidance, I'm all ears.


            mwebber Matthew Webber added a comment -

            rupunzlkim To your problem report, could you please add the version of Jenkins you are running?

            adamhong Adam Hong added a comment -

            also seeing this happen recently, version 2.60.3

            rupunzlkim Kim Abbott added a comment -

            Sorry for the delay.  The version that we've noticed this on is 2.7.4


            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Reinhold Füreder
            Path:
            core/src/main/java/hudson/model/WorkspaceCleanupThread.java
            http://jenkins-ci.org/commit/jenkins/f258aff7a736a81306ecb7d3c56cacc9b3a09a68
            Log:
            JENKINS-27329 Less aggressive WorkspaceCleanupThread (#3444)

            I dare to claim that the default behaviour of WorkspaceCleanupThread is too aggressive => this little change is by no means perfect (or admittedly even far from perfect), but IMHO a saner or slightly more defensive default behaviour.

            Mind that according to https://github.com/jenkinsci/jenkins/blob/9e64bcdcb4a2cf12d59dfa334e09ffb448d361e9/core/src/main/java/hudson/model/Job.java#L301 this "only" checks whether or not the last build of a job is in progress, while the JavaDoc says "Returns true if a build of this project is in progress." (cf. http://javadoc.jenkins-ci.org/hudson/model/Job.html#isBuilding--)

            • Fix compilation
            • Dummy commit to trigger pipeline

            Previous pipeline execution (https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-3444/2/tests) failed with one failing test that at first glance appears to be unrelated with my change(s) and looks like a flaky test?

            • Add fine logging message

            NOTE: This service has been marked for deprecation: https://developer.github.com/changes/2018-04-25-github-services-deprecation/

            Functionality will be removed from GitHub.com on January 31st, 2019.

            oleg_nenashev Oleg Nenashev added a comment - - edited

            Fix has been applied in 2.125. IMHO the fix is not complete for parallel AbstractProject builds, but it is better than nothing. Will create a follow-up ticket

            oleg_nenashev Oleg Nenashev added a comment -

            danielbeck this thing is marked as RFE in the changelog, but I think this is a bug. Would you agree if I recategorize it?


            People

              Assignee: Unassigned
              Reporter: Quentin Hartman (qhartman)
              Votes: 13
              Watchers: 20

              Dates

                Created:
                Updated:
                Resolved: