[JENKINS-2548] Node does not come back online after disk space cleared

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • remoting
    • None
    • Platform: All, OS: All

      We are using Hudson as a single master server, and it went offline because less
      than 1GB of disk space was available.

      After we clear some disk space Hudson does not come back online until we restart
      the servlet container. Could it not detect that there is enough disk space
      available and come back online automatically?

          dogfood added a comment -

          Integrated in jenkins_main_trunk #1335
          Revert "[FIXED JENKINS-2548] Slaves taken offline for low disk space will now"

          Kohsuke Kawaguchi : 706b2dfd71904224399e52843233c12e219803e4
          Files :

          • core/src/main/java/hudson/node_monitors/AbstractNodeMonitorDescriptor.java
          • changelog.html
          • core/src/main/resources/hudson/node_monitors/Messages.properties
          • core/src/main/java/hudson/node_monitors/AbstractDiskSpaceMonitor.java

          Andrew Bayer added a comment -

          kohsuke - what would be the best way to record in the DiskSpace OfflineCause which specific monitor is the reason? Subclassing it further, or adding a flag of some sort?

          Kohsuke Kawaguchi added a comment -

          I think we need Computers to treat NodeMonitors as something special. We can have Computers remember the set of NodeMonitors that are raising a red flag, and isOffline() would check if this set is empty. This leaves the "temporarily offline" concept for the administrator's use alone.

          This also means NodeMonitors should have a backdoor to raise/drop this red flag, and existing NodeMonitors should be modified to use this mechanism so that automatic on/off and administrative manual on/off will not collide with each other.

          I think such a distinction is the only way to make it work correctly in the presence of multiple node monitors reporting problems.
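
The proposal above can be illustrated with a minimal Java sketch. All names here (`MonitoredComputer`, `raiseFlag`, `dropFlag`, the monitor name strings) are hypothetical and chosen for illustration; this is not the actual `hudson.model.Computer` API, and it is not necessarily how the issue was eventually fixed. It only shows the idea of keeping monitor-raised flags separate from the administrator's manual "temporarily offline" switch.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for a Computer; not the real Jenkins class. */
class MonitoredComputer {
    // Names of the monitors that currently report a problem on this computer.
    private final Set<String> redFlags = new LinkedHashSet<>();

    // Manual switch, reserved for the administrator. Monitors never touch it.
    private boolean temporarilyOffline;

    /** Called by a monitor when it detects a problem (e.g. low disk space). */
    synchronized void raiseFlag(String monitorName) {
        redFlags.add(monitorName);
    }

    /** Called by the same monitor once its check passes again. */
    synchronized void dropFlag(String monitorName) {
        redFlags.remove(monitorName);
    }

    /** Called only from the administrator's manual on/off action. */
    synchronized void setTemporarilyOffline(boolean offline) {
        temporarilyOffline = offline;
    }

    /** Offline if the admin said so, or if any monitor still has a flag up. */
    synchronized boolean isOffline() {
        return temporarilyOffline || !redFlags.isEmpty();
    }
}
```

With this shape, a node flagged by two monitors stays offline until both have dropped their flags, and a node that an administrator took offline by hand stays offline no matter what the monitors report.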

          kolos added a comment -

          Hello,

          I do not mean to moan, but I've been trying to get my Jenkins nodes to come back online for the last 30-40 minutes, and it makes no difference whether I click the 'back online' button or restart Jenkins itself.

          One node just on its own managed to come back online, not sure how/why.

          I'm a bit surprised that this isn't biting many more people.

          May I suggest increasing the priority of this issue?

          Kolos

          kolos added a comment -

          Ah, ok, it looks like I need to do these to bring a node back online:

          In order:

          1. clean up disk space
          2. restart Jenkins itself
          3. mark the offline node as being back online

          Kolos

          Marc Günther added a comment -

          @kolos: we are using swarm clients, and all I need to do is press the "bring this node back online" button. I have to do this all the time, and I never had to restart Jenkins.

          hiteswar kumar added a comment -

          Another way to bring a slave online is to run the CLI:
          java -jar $CLIJAR -s $URL -i $sshkey online-node $SLAVE

          But in my Jenkins environment (Jenkins 1.447.2 LTS on Ubuntu 10.04.3 LTS),
          running the above CLI command brings the slave online, and then within a few seconds Jenkins takes it offline again.
          After running the CLI command a few times (more than 2 or 3), the slave appears online.

          It seems a change in Jenkins core is needed to make this stable.

          Please share your comments, if any.

          Ari Hyttinen added a comment - - edited

          Hit this today, too.

          The master has many slaves on many hosts, all running Windows and Jenkins 1.458, and so far this is happening on two slave hosts. Both affected slave hosts have 2 slaves in total. One slave node on each host works fine, showing plenty of free disk space on the Build Executor Status page. But the other slave node on both hosts refuses to come online and shows free disk space under 1 gigabyte. Other data on Build Executor Status seems to be valid. Both nodes go offline again even after being brought back online, due to the perceived lack of disk space.

          Cloned a third slave on one of the hosts, and ran a test job there. The job ran and the node seems to work normally. However, in the Build Executor Status, this node reports N/A for all the status items, including Free Disk Space.

          Also, as far as is known, the disk space has not really been low on these hosts, i.e. the low disk space warning itself is bogus.

          Have not yet tried to restart Jenkins, or update to the latest release, so don't know if that will fix the issue, will comment further once that is done.

          Update: situation fixed itself without reboot. Or so it seems, at least.

          Marc Günther added a comment -

          Fixed by pull request 514: https://github.com/jenkinsci/jenkins/pull/514

          Ayyappan subramanian added a comment -

          Hi,
          Is this issue fixed or not? If it is fixed, please share the fixed version of Jenkins.

            abayer Andrew Bayer
            manderson23 manderson23
            Votes: 8
            Watchers: 9