• Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Component: docker-plugin
    • Labels: None

      Jenkins contains functionality where it monitors the "health" of all its static slaves - it periodically checks that they have enough disk space available, and (most importantly) it takes the slave "offline" if the slave has insufficient disk space available.
      This means that Jenkins avoids running builds on slaves that have low disk space.

      I'd like similar functionality built into the docker-plugin.

      At present, the docker-plugin code counts the number of containers that a docker host is currently running, compares that to the "Container Cap" number, and refuses to start new containers on that docker host if it has reached that cap.
      I would like to see an additional check of that docker host's filesystem space, so that the plugin also refuses to start new containers on a host that has insufficient disk space.
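
      For illustration, a minimal sketch of the kind of gating logic I mean (the class name, method names and thresholds are hypothetical, not the plugin's actual code; how the free-space figure for the docker host is obtained is left open):

          // Minimal sketch of the requested gating logic - hypothetical names and
          // thresholds, not the plugin's actual code.  How freeDiskSpaceBytes is
          // obtained (e.g. "df" on the docker host, or a monitoring agent) is left open.
          public final class ProvisioningGate {
              private ProvisioningGate() {}

              public static boolean mayStartNewContainer(int runningContainers,
                                                         int containerCap,
                                                         long freeDiskSpaceBytes,
                                                         long minimumFreeDiskSpaceBytes) {
                  boolean belowCap = runningContainers < containerCap;                    // existing check
                  boolean enoughDisk = freeDiskSpaceBytes >= minimumFreeDiskSpaceBytes;   // requested check
                  return belowCap && enoughDisk;
              }

              public static void main(String[] args) {
                  // cap of 10 containers, at least 2 GiB free required:
                  System.out.println(mayStartNewContainer(3, 10, 512L << 20, 2L << 30)); // false - only 512 MiB free
                  System.out.println(mayStartNewContainer(3, 10, 8L << 30, 2L << 30));   // true
              }
          }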

      This RFE exists because we have had builds fail because they were run in a container that was started on a docker host that had insufficient disk space, and I'd like Jenkins to (automatically) stop using docker hosts that have insufficient space (this would then allow Jenkins to continue to run builds on other docker hosts until someone has resolved the disk space problem).

          [JENKINS-30052] Take Docker host "offline" if it is unhealthy

          Kanstantsin Shautsou added a comment -

          The monitor checks exactly what the Jenkins slave.jar sees; for a cloud there is no way to predict what will be mounted in the container.

          pjdarton added a comment -

          Would it be possible to make Jenkins check "exactly what the Jenkins slave.jar sees" before it runs the rest of the build?
          (or, if not before, at least before the slave is disposed of)

          If the docker-plugin could see "exactly what the Jenkins slave.jar sees" after it had started a container, it could then see whether the container was unhealthy and take that cloud entry offline if it saw that newly-created containers on that cloud were unhealthy.
          It would be acceptable if this were done after a build had failed, if that would then prevent future builds from failing from the same problem - what I want is for Jenkins not to make the same mistake twice, so that if a machine runs out of space then Jenkins will run fewer builds at once instead of having all builds fail.

          (I admit that this is not an easy thing to do, but running out of disk space is a problem that I frequently encounter with Jenkins - this is a real problem when there are many builds and many users)


          Kanstantsin Shautsou added a comment -

          In theory it should force a monitoring check after the slave has connected, but before a build has been accepted on an executor.

          pjdarton added a comment -

          If that is true then the "force monitoring check" is failing to take the slave offline.
          I have had builds started on containers where "df -h" has shown the "/" partition (which contains /home/jenkins/, which is where the build slave process runs) to be 100% full.
          So, either the "monitoring check" does not believe that having ${workspace} on a partition that is 100% full is a problem, or it is failing to take problem slaves offline.

          For static slaves (not docker slaves), I have seen builds fail due to lack of disk space and then the slave be taken offline after a build has failed. I would guess that this "monitoring check" is done in a different thread to the one running the build and so the build can be started before the "monitoring check" has finished. This means that you get one build fail, but later builds avoid using that slave (until someone fixes the slave and brings it back online).

          What I would like is for the docker-plugin to be aware of these monitoring checks and, if a container fails the check, the plugin should avoid starting new containers of that type on that host. Ideally, the check should be run before the build starts, and the build not be started on a failing container at all, but it would be acceptable to have a single failed build as long as subsequent builds avoided the problem.


          Kanstantsin Shautsou added a comment -

          Sorry, I meant "how it can be implemented".

          Kanstantsin Shautsou added a comment -

          stephenconnolly could you suggest something about these async monitors?

          Stephen Connolly added a comment -

          So one of the issues with NodeMonitors is that they are designed to run in a background thread, and Core does not provide an API to await their completion.

          In theory calling AbstractNodeMonitorDescriptor.monitor(Computer) would allow getting the results in a blocking mode, but that method is protected, and thus you would need either hacks (bad) or a change to core (slow) to invoke it.
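
          For completeness, the reflection "hack" would look roughly like the sketch below - shown only to illustrate why it is the bad option, not as a recommendation:

              // The "hack" route mentioned above, for illustration only: invoking the
              // protected AbstractNodeMonitorDescriptor.monitor(Computer) via reflection.
              // It is fragile (tied to a protected method's name/signature) and blocks the
              // calling thread, which is why a core change would be the cleaner option.
              import hudson.model.Computer;
              import hudson.node_monitors.AbstractNodeMonitorDescriptor;

              import java.lang.reflect.Method;

              public final class BlockingMonitorHack {
                  private BlockingMonitorHack() {}

                  /** Runs a single monitor synchronously against one computer and returns its raw result. */
                  public static Object monitorNow(AbstractNodeMonitorDescriptor<?> descriptor,
                                                  Computer computer) throws Exception {
                      Method m = AbstractNodeMonitorDescriptor.class
                              .getDeclaredMethod("monitor", Computer.class);
                      m.setAccessible(true); // may be rejected depending on JVM/security settings
                      return m.invoke(descriptor, computer);
                  }
              }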

          Additionally the Super Crappy Cloud API doesn't give any hooks to do this, so specific support would have to be woven into each cloud implementation... until you find out that some of the NodeMonitor implementations need to convert the Computer into a Node which means that the node has to have completed provisioning (for correctly written cloud implementations)...

          OK, so we'll dodge that rabbit hole of an approach, the cloud implementation cannot invoke the NodeMonitor before completing provisioning... perhaps then the way is to make Core handle this?

          If I were making such a change in core, I would suggest adding a method to Node or Computer that returns a Set<AbstractNodeMonitorDescriptor> of node monitors that must return success for the node to be available. I would add a Set<AbstractNodeMonitorDescriptor> instance field to Computer to track the node monitors that have a successful result, along with the time of that result. Then have either Node or Computer's isAcceptingTasks() method return false if that instance field does not containsAll() the required set. When testing containsAll(), if the field is null and there are any required node monitors, launch a thread for that specific computer (this will require a helper method in AbstractNodeMonitorDescriptor) and set the field to an empty set. Then have AbstractNodeMonitorDescriptor update the field with results as well.
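
          Roughly, the shape of that change would be as follows (a sketch only: MonitorDescriptor stands in for AbstractNodeMonitorDescriptor, and names such as passedMonitors and getRequiredNodeMonitors() are invented here, not real core API):

              import java.util.Collections;
              import java.util.Set;
              import java.util.concurrent.ConcurrentHashMap;

              public class MonitorGatedComputer {

                  /** Stand-in for hudson.node_monitors.AbstractNodeMonitorDescriptor. */
                  public interface MonitorDescriptor {}

                  /** Monitors that have reported success since the last (re)connect; null = not checked yet. */
                  private volatile Set<MonitorDescriptor> passedMonitors;

                  /** Monitors that must pass before this node accepts tasks; clouds/slaves would override this. */
                  protected Set<MonitorDescriptor> getRequiredNodeMonitors() {
                      return Collections.emptySet();
                  }

                  public boolean isAcceptingTasks() {
                      Set<MonitorDescriptor> required = getRequiredNodeMonitors();
                      if (required.isEmpty()) {
                          return true; // no gating configured - behave as today
                      }
                      Set<MonitorDescriptor> passed = passedMonitors;
                      if (passed == null) {
                          // First query since (re)connect: record an empty result set, kick off an
                          // on-demand run of just these monitors for just this computer, and refuse
                          // tasks until results arrive.
                          passedMonitors = ConcurrentHashMap.newKeySet();
                          triggerOnDemandMonitoring(required);
                          return false;
                      }
                      return passed.containsAll(required);
                  }

                  /** Called by the monitoring machinery whenever a monitor reports success for this computer. */
                  void recordMonitorSuccess(MonitorDescriptor d) {
                      Set<MonitorDescriptor> passed = passedMonitors;
                      if (passed != null) {
                          passed.add(d);
                      }
                  }

                  /** Placeholder for the per-computer helper that would be added to AbstractNodeMonitorDescriptor. */
                  protected void triggerOnDemandMonitoring(Set<MonitorDescriptor> required) {
                      // intentionally left empty in this sketch
                  }

                  /** Reset on connect/disconnect so that a reconnected node is re-checked. */
                  void onConnectionStateChange() {
                      passedMonitors = null;
                  }
              }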

          The net effect of that is that a Cloud implementation (or even regular slaves) could select a subset of node monitors that must pass before any job is accepted by overriding the method.

          The alternative to all that is to add specific code into the docker plugin that runs the required checks via the shell before even returning the node... the disadvantage is that you would have to specifically code up each check... the advantage is that you would not poison the provisioning statistics... for this latter reason I would recommend that integer go with the dedicated support... by all means file an RFE against core for the enhancement I described above as that would be generally useful, but it would require some changes in the provisioning stats to reduce the risk of provisioning stats poisoning
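
          As a sketch of that shell-based alternative, a disk-space check might parse the output of "df -P" as below; in the real plugin the command would be run through the newly-launched agent rather than locally (ProcessBuilder is used only to keep the example self-contained), and the 90% threshold is arbitrary:

              import java.io.BufferedReader;
              import java.io.IOException;
              import java.io.InputStreamReader;
              import java.nio.charset.StandardCharsets;

              public final class DiskSpaceCheck {
                  private DiskSpaceCheck() {}

                  /** Returns true if the filesystem containing {@code dir} is at or below the given use%. */
                  public static boolean hasHeadroom(String dir, int maxUsePercent)
                          throws IOException, InterruptedException {
                      Process p = new ProcessBuilder("df", "-P", dir).start();
                      try (BufferedReader r = new BufferedReader(
                              new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                          r.readLine(); // skip the header line
                          String data = r.readLine();
                          p.waitFor();
                          if (data == null) {
                              return false; // could not read df output - treat as unhealthy
                          }
                          // POSIX "df -P" data columns: filesystem, blocks, used, available, capacity%, mount point
                          String capacity = data.trim().split("\\s+")[4].replace("%", "");
                          return Integer.parseInt(capacity) <= maxUsePercent;
                      }
                  }

                  public static void main(String[] args) throws Exception {
                      // e.g. refuse to hand the node to Jenkins if the workspace filesystem is more than 90% full
                      System.out.println(hasHeadroom("/", 90));
                  }
              }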

          HTH


          Stephen Connolly added a comment -

          Aside:

          Clarification on the RFE:

          • You add the method returning Set<AbstractNodeMonitorDescriptor> to Node because Node represents the configuration state and the list of required node monitors should be configurable. Probably you would introduce the UI via NodeProperty, so perhaps all that method would do is build the Set by iterating all NodeProperty instances of the Node, store the resulting Set in a field (for performance), and rebuild the set on Node.save() (see the sketch after this list).
          • You add the instance field containing all the passing NodeMonitor checks to Computer because Computer represents the live state of the connection. Reset the instance field to null on connect/disconnect.
          • You update the Computer instance field (via a method) from AbstractNodeMonitorDescriptor for visibility and also so that the background checks will keep updating. There is great potential for deadlock, so there would likely be a need to rework how that code works so that we can update a subset of node monitors for a specific computer on demand while also being able to update all the node monitors for all computers periodically. Thus you want to consolidate that threading logic in AbstractNodeMonitorDescriptor.
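
          A sketch of the first bullet's configuration lookup (stand-in types and invented names only, not real core API):

              import java.util.HashSet;
              import java.util.List;
              import java.util.Set;

              public class MonitorGatedNode {

                  /** Stand-in for hudson.slaves.NodeProperty. */
                  public interface NodeProperty {}

                  /** Hypothetical property an admin would add to require certain monitors to pass. */
                  public static final class RequiredMonitorsProperty implements NodeProperty {
                      final Set<String> monitorIds;
                      public RequiredMonitorsProperty(Set<String> monitorIds) { this.monitorIds = monitorIds; }
                  }

                  private final List<NodeProperty> nodeProperties;
                  private volatile Set<String> requiredMonitors; // cached for performance

                  public MonitorGatedNode(List<NodeProperty> nodeProperties) {
                      this.nodeProperties = nodeProperties;
                      rebuildRequiredMonitors();
                  }

                  /** Equivalent of the rebuild that would happen in Node.save(). */
                  public void rebuildRequiredMonitors() {
                      Set<String> required = new HashSet<>();
                      for (NodeProperty p : nodeProperties) {
                          if (p instanceof RequiredMonitorsProperty) {
                              required.addAll(((RequiredMonitorsProperty) p).monitorIds);
                          }
                      }
                      requiredMonitors = required;
                  }

                  /** The method the first bullet proposes adding to Node. */
                  public Set<String> getRequiredNodeMonitors() {
                      return requiredMonitors;
                  }
              }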


          pjdarton added a comment -

          Sort-of done, albeit indirectly.
          The plugin automatically disables a docker cloud or template if provisioning fails. Disablement is only temporary, but it means that subsequent attempts to provision an agent would then choose another template.
          That's been good enough IME to remove the need to do anything more complicated.


            Assignee: Unassigned
            Reporter: pjdarton
            Votes: 0
            Watchers: 3
