JENKINS-18438

Node monitoring should run in parallel


Details

    Description

      As of 1.520, AbstractNodeMonitorDescriptor monitors nodes sequentially. As the number of slaves goes up, this takes a long time to complete, and it also makes the monitoring susceptible to hangs.

      While a ping thread exists to detect unresponsive nodes, its interval is 10 minutes and its timeout is 4 minutes, so a few unresponsive nodes can quickly push the total running time of node monitoring beyond the default monitoring cycle of 1 hour.

      A better approach is to make asynchronous remoting calls to all the slaves at once, then wait for the results to come back. This way, we still get results from the slaves that are functioning, even if others hang.
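
The idea above can be illustrated with a minimal sketch. It uses plain java.util.concurrent rather than Jenkins remoting, so it is not the code from the actual fix; the class name ParallelMonitorSketch, the method monitorAll, and the use of null for "no result" are assumptions made purely for illustration. Each probe stands in for one remote call to a slave.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of Jenkins.
public class ParallelMonitorSketch {

    /**
     * Starts every probe at once and waits at most timeoutMillis in total.
     * A node whose probe failed or did not finish in time maps to null.
     */
    public static <T> Map<String, T> monitorAll(Map<String, Callable<T>> probes,
                                                long timeoutMillis)
            throws InterruptedException {
        List<String> names = new ArrayList<>();
        List<Callable<T>> tasks = new ArrayList<>();
        for (Map.Entry<String, Callable<T>> e : probes.entrySet()) {
            names.add(e.getKey());
            tasks.add(e.getValue());
        }

        Map<String, T> results = new LinkedHashMap<>();
        if (tasks.isEmpty()) {
            return results;
        }

        // One thread per probe, so a hung node cannot delay the others.
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size());
        try {
            // invokeAll returns when all probes are done or the timeout
            // expires; probes still running at that point are cancelled.
            List<Future<T>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (int i = 0; i < futures.size(); i++) {
                Future<T> f = futures.get(i);
                T value = null;
                if (!f.isCancelled()) {
                    try {
                        value = f.get(); // already complete, does not block
                    } catch (ExecutionException ex) {
                        value = null;    // probe threw: treat as "no data"
                    }
                }
                results.put(names.get(i), value);
            }
        } finally {
            pool.shutdownNow();
        }
        return results;
    }
}
```

With this shape the total running time is bounded by the single timeout rather than by the sum of per-node delays, which is the property the description asks for. A real implementation would presumably issue asynchronous remoting calls instead of dedicating a local thread to each node.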

          Activity

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Kohsuke Kawaguchi
            Path:
            changelog.html
            core/src/main/java/hudson/FilePath.java
            core/src/main/java/hudson/model/Node.java
            core/src/main/java/hudson/model/Slave.java
            core/src/main/java/hudson/node_monitors/AbstractAsyncNodeMonitorDescriptor.java
            core/src/main/java/hudson/node_monitors/AbstractNodeMonitorDescriptor.java
            core/src/main/java/hudson/node_monitors/ArchitectureMonitor.java
            core/src/main/java/hudson/node_monitors/ClockMonitor.java
            core/src/main/java/hudson/node_monitors/DiskSpaceMonitor.java
            core/src/main/java/hudson/node_monitors/DiskSpaceMonitorDescriptor.java
            core/src/main/java/hudson/node_monitors/ResponseTimeMonitor.java
            core/src/main/java/hudson/node_monitors/SwapSpaceMonitor.java
            core/src/main/java/hudson/node_monitors/TemporarySpaceMonitor.java
            core/src/main/java/jenkins/model/Jenkins.java
            core/src/test/java/hudson/slaves/NodeListTest.java
            http://jenkins-ci.org/commit/jenkins/735713801b130fe247cf17bbca7b4561e41b1d13
            Log:
            [FIXED JENKINS-18438]

            Node monitoring should run in parallel to reduce the total round-trip
            time in large instances.

            dogfood dogfood added a comment -

            Integrated in jenkins_main_trunk #2671
            [FIXED JENKINS-18438] (Revision 735713801b130fe247cf17bbca7b4561e41b1d13)

            Result = SUCCESS
            kohsuke : 735713801b130fe247cf17bbca7b4561e41b1d13
            Files :

            • core/src/test/java/hudson/slaves/NodeListTest.java
            • core/src/main/java/hudson/node_monitors/AbstractNodeMonitorDescriptor.java
            • core/src/main/java/hudson/FilePath.java
            • core/src/main/java/hudson/node_monitors/TemporarySpaceMonitor.java
            • core/src/main/java/jenkins/model/Jenkins.java
            • core/src/main/java/hudson/model/Node.java
            • core/src/main/java/hudson/node_monitors/ClockMonitor.java
            • core/src/main/java/hudson/node_monitors/DiskSpaceMonitorDescriptor.java
            • core/src/main/java/hudson/node_monitors/DiskSpaceMonitor.java
            • core/src/main/java/hudson/node_monitors/ArchitectureMonitor.java
            • core/src/main/java/hudson/node_monitors/SwapSpaceMonitor.java
            • core/src/main/java/hudson/model/Slave.java
            • core/src/main/java/hudson/node_monitors/AbstractAsyncNodeMonitorDescriptor.java
            • changelog.html
            • core/src/main/java/hudson/node_monitors/ResponseTimeMonitor.java

            vlatombe Vincent Latombe added a comment -

            I already commented on https://github.com/jenkinsci/jenkins/commit/735713801b130fe247cf17bbca7b4561e41b1d13 , but this change breaks time-based monitoring (clock difference and response time) and causes some slaves to time out, whereas they used to connect in an acceptable time before the change.

            zfil Philippe Jandot added a comment -

            This change breaks the node monitoring report.
            See previous comments.

            jglick Jesse Glick added a comment -

            JENKINS-18671 was filed separately; leave this closed and use linked issues for any regressions.


            People

              kohsuke Kohsuke Kawaguchi
              Votes: 2
              Watchers: 9