• Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • core, remoting
    • None

      As of 1.520, AbstractNodeMonitorDescriptor monitors nodes sequentially. As the number of slaves goes up, this takes a long time to complete, and it also makes the monitoring susceptible to hangs.

      While a ping thread is there to detect unresponsive nodes, its interval is 10 minutes and its timeout is 4 minutes, so a few unresponsive nodes can quickly push the total running time of node monitoring beyond the default monitoring cycle of 1 hour.

      A better approach is to make asynchronous remoting calls to all the slaves at once, then wait for the results to come back. This way, we still get results back from the nodes that are functioning, even if others hang.
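
      The idea can be sketched in plain Java, without any Jenkins classes. This is only an illustration of "fire all calls at once, then collect under a shared deadline"; it is not the code from the actual fix. The class name ParallelMonitorSketch, the method monitorAll, and the node names are all hypothetical, and a thread pool stands in for asynchronous remoting calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: not a Jenkins class and not the committed implementation.
public class ParallelMonitorSketch {

    /**
     * Starts every probe at once, then waits for all of them against one shared
     * deadline. Nodes that time out or fail are reported with a null value.
     */
    static <T> Map<String, T> monitorAll(Map<String, Callable<T>> probes, long timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // Phase 1: submit all probes so they run concurrently.
            Map<String, Future<T>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, Callable<T>> e : probes.entrySet()) {
                pending.put(e.getKey(), pool.submit(e.getValue()));
            }

            // Phase 2: collect results; the deadline is shared, not per node.
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            Map<String, T> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<T>> e : pending.entrySet()) {
                long remaining = Math.max(0L, deadline - System.nanoTime());
                try {
                    results.put(e.getKey(), e.getValue().get(remaining, TimeUnit.NANOSECONDS));
                } catch (TimeoutException | ExecutionException ex) {
                    e.getValue().cancel(true);     // give up on this node only
                    results.put(e.getKey(), null); // null means "no data"
                }
            }
            return results;
        } finally {
            pool.shutdownNow(); // interrupt any probe that is still stuck
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Callable<String>> probes = new LinkedHashMap<>();
        probes.put("node-a", () -> "amd64");
        probes.put("node-hung", () -> { Thread.sleep(60_000); return "never"; });
        probes.put("node-b", () -> "x86");
        // The hung node costs at most the shared timeout; the others still report.
        System.out.println(monitorAll(probes, 500));
    }
}
```

      With this shape, the total running time is bounded by the single timeout rather than by the timeout multiplied by the number of unresponsive nodes.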

          [JENKINS-18438] Node monitoring should run in parallel

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          changelog.html
          core/src/main/java/hudson/FilePath.java
          core/src/main/java/hudson/model/Node.java
          core/src/main/java/hudson/model/Slave.java
          core/src/main/java/hudson/node_monitors/AbstractAsyncNodeMonitorDescriptor.java
          core/src/main/java/hudson/node_monitors/AbstractNodeMonitorDescriptor.java
          core/src/main/java/hudson/node_monitors/ArchitectureMonitor.java
          core/src/main/java/hudson/node_monitors/ClockMonitor.java
          core/src/main/java/hudson/node_monitors/DiskSpaceMonitor.java
          core/src/main/java/hudson/node_monitors/DiskSpaceMonitorDescriptor.java
          core/src/main/java/hudson/node_monitors/ResponseTimeMonitor.java
          core/src/main/java/hudson/node_monitors/SwapSpaceMonitor.java
          core/src/main/java/hudson/node_monitors/TemporarySpaceMonitor.java
          core/src/main/java/jenkins/model/Jenkins.java
          core/src/test/java/hudson/slaves/NodeListTest.java
          http://jenkins-ci.org/commit/jenkins/735713801b130fe247cf17bbca7b4561e41b1d13
          Log:
          [FIXED JENKINS-18438]

          Node monitoring should run in parallel to reduce the total round-trip
          time in large instances.

          dogfood added a comment -

          Integrated in jenkins_main_trunk #2671
          [FIXED JENKINS-18438] (Revision 735713801b130fe247cf17bbca7b4561e41b1d13)

          Result = SUCCESS
          kohsuke : 735713801b130fe247cf17bbca7b4561e41b1d13
          Files :

          • core/src/test/java/hudson/slaves/NodeListTest.java
          • core/src/main/java/hudson/node_monitors/AbstractNodeMonitorDescriptor.java
          • core/src/main/java/hudson/FilePath.java
          • core/src/main/java/hudson/node_monitors/TemporarySpaceMonitor.java
          • core/src/main/java/jenkins/model/Jenkins.java
          • core/src/main/java/hudson/model/Node.java
          • core/src/main/java/hudson/node_monitors/ClockMonitor.java
          • core/src/main/java/hudson/node_monitors/DiskSpaceMonitorDescriptor.java
          • core/src/main/java/hudson/node_monitors/DiskSpaceMonitor.java
          • core/src/main/java/hudson/node_monitors/ArchitectureMonitor.java
          • core/src/main/java/hudson/node_monitors/SwapSpaceMonitor.java
          • core/src/main/java/hudson/model/Slave.java
          • core/src/main/java/hudson/node_monitors/AbstractAsyncNodeMonitorDescriptor.java
          • changelog.html
          • core/src/main/java/hudson/node_monitors/ResponseTimeMonitor.java


          Vincent Latombe added a comment -

          I already commented on https://github.com/jenkinsci/jenkins/commit/735713801b130fe247cf17bbca7b4561e41b1d13 , but this change breaks time-based monitoring (clock difference and response time) and causes some slaves to time out, whereas they used to connect in acceptable time before the change.

          Philippe Jandot added a comment -

          This change breaks the node monitoring report.
          See the previous comments.

          Jesse Glick added a comment -

          JENKINS-18671 was filed separately; leave this closed and use linked issues for any regressions.


            kohsuke Kohsuke Kawaguchi
            kohsuke Kohsuke Kawaguchi
            Votes: 2
            Watchers: 9
