• Bug
    • Resolution: Unresolved
    • Critical
    • remoting
    • None

      This is to track the problem originally reported here: http://n4.nabble.com/Polling-hung-td1310838.html#a1310838
      The referenced thread has since moved to http://jenkins.361315.n4.nabble.com/Polling-hung-td1310838.html

      What the problem boils down to is that many remote operations are performed synchronously, which keeps the channel object locked until the response returns. When a lengthy remote operation is using the channel, SCM polling can be blocked waiting for the monitor on the channel to be released. In extreme situations, all the polling threads can wind up waiting on the monitors of channel objects, preventing any further processing of polling tasks.

      Furthermore, if the slave dies, the locked channel object still exists in the master JVM. If no IOException is thrown to indicate that the connection to the pipe has terminated, the channel can never be closed, because Channel.close() is itself a synchronized operation.
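The locking pattern described above can be reproduced in a few lines of plain Java. This is only a sketch under stated assumptions: `FakeChannel`, `ChannelLockDemo` and all of their methods are hypothetical stand-ins written for illustration, not the real hudson.remoting.Channel. The point it tries to show is that while one thread sits inside a synchronized call waiting for a response, a second thread that calls a synchronized close() on the same object is stuck in the BLOCKED state.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a remoting channel; this is NOT hudson.remoting.Channel.
class FakeChannel {
    private final CountDownLatch callStarted = new CountDownLatch(1);
    private final CountDownLatch responseArrived = new CountDownLatch(1);

    // A synchronous "remote call": the channel monitor is held until the response arrives.
    synchronized void call() {
        callStarted.countDown();
        try {
            responseArrived.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Like the close() described above, this needs the same monitor as call().
    synchronized void close() {
    }

    // Deliberately unsynchronized so the demo can end the call from outside.
    void deliverResponse() {
        responseArrived.countDown();
    }

    void awaitCallStarted() throws InterruptedException {
        callStarted.await();
    }
}

class ChannelLockDemo {
    // Returns the state of a thread that tries to close() while call() is still in flight.
    static Thread.State closerStateDuringCall() {
        FakeChannel channel = new FakeChannel();
        Thread caller = new Thread(channel::call, "long-remote-call");
        Thread closer = new Thread(channel::close, "closer");
        try {
            caller.start();
            channel.awaitCallStarted();      // the caller now owns the channel monitor
            closer.start();
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (closer.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            Thread.State observed = closer.getState();
            channel.deliverResponse();       // only now can close() acquire the monitor
            caller.join();
            closer.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("closer while call is in flight: " + closerStateDuringCall());
    }
}
```

In the sketch the response is eventually delivered, so close() completes. If the response never arrives, which is the dead-slave case described above, the closing thread would stay blocked indefinitely.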

        1. DUMP1.txt
          57 kB
        2. hung_scm_pollers_02.PNG
          145 kB
        3. thread_dump_02.txt
          92 kB
        4. threads.vetted.txt
          163 kB

          [JENKINS-5413] SCM polling getting hung

          sharon xia added a comment -

          02:36:16 Started by upstream project "echidna-patch-quality" build number 335
          02:36:16 originally caused by:
          02:36:16 Started by command line by xxx
          02:36:16 [EnvInject] - Loading node environment variables.
          02:36:17 Building remotely on ECHIDNA-QUALITY (6.1 windows-6.1 windows amd64-windows amd64-windows-6.1 amd64) in workspace c:\buildfarm-slave\workspace\echidna-patch-compile
          02:36:18 > git rev-parse --is-inside-work-tree
          02:36:19 Fetching changes from the remote Git repository
          02:36:19 > git config remote.origin.url ssh://*@...:*/ghts/ta
          02:36:20 Fetching upstream changes from ssh://*@...:*/ghts/ta
          02:36:20 > git --version
          02:36:20 > git fetch --tags --progress ssh://*@...:/ghts/ta +refs/heads/:refs/remotes/origin/*
          02:56:20 ERROR: Timeout after 20 minutes
          02:56:20 FATAL: Failed to fetch from ssh://*@...:*/ghts/ta
          02:56:20 hudson.plugins.git.GitException: Failed to fetch from ssh://bmcdiags@10.110.61.117:30000/ghts/ta
          02:56:20 at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:623)
          02:56:20 at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:855)
          02:56:20 at hudson.plugins.git.GitSCM.checkout(GitSCM.java:880)
          02:56:20 at hudson.model.AbstractProject.checkout(AbstractProject.java:1414)
          02:56:20 at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:671)
          02:56:20 at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
          02:56:20 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:580)
          02:56:20 at hudson.model.Run.execute(Run.java:1684)
          02:56:20 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          02:56:20 at hudson.model.ResourceController.execute(ResourceController.java:88)
          02:56:20 at hudson.model.Executor.run(Executor.java:231)
          02:56:20 Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress ssh://*@...:/ghts/ta +refs/heads/:refs/remotes/origin/*" returned status code -1:
          02:56:20 stdout:
          02:56:20 stderr: Could not create directory 'c/Users/Administrator/.ssh'.
          02:56:20
          02:56:20 at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1325)
          02:56:20 at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1186)
          02:56:20 at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$200(CliGitAPIImpl.java:87)
          02:56:20 at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:257)
          02:56:20 at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
          02:56:20 at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
          02:56:20 at hudson.remoting.UserRequest.perform(UserRequest.java:118)
          02:56:20 at hudson.remoting.UserRequest.perform(UserRequest.java:48)
          02:56:20 at hudson.remoting.Request$2.run(Request.java:326)
          02:56:20 at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
          02:56:20 at java.util.concurrent.FutureTask.run(Unknown Source)
          02:56:20 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          02:56:20 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          02:56:20 at hudson.remoting.Engine$1$1.run(Engine.java:63)
          02:56:20 at java.lang.Thread.run(Unknown Source)


          Daniel Beck added a comment -

          sharon_xia: That's a completely different issue. This issue is about polling that NEVER finishes, yours aborts after 20 minutes. It even seems to tell you what the problem is: Could not create directory 'c/Users/Administrator/.ssh'.

          To request further assistance, please ask on the jenkinsci-users mailing list or in #jenkins on Freenode. This thread is long enough already.


          Brian Smith added a comment -

          I haven't had this issue since we started doing weekly reboots of the whole system (master and nodes).


          mark 3000 added a comment -

          We encountered this issue for the first time (that I'm aware of) after upgrading to 1.583 from 1.578.


          marlene cote added a comment -

          We are seeing this too! It is having a huge impact on our productivity!! We too upgraded to 1.583.

          Please help.


          Morten Engelhardt Olsen added a comment - - edited

          At Atmel we're now managing this issue by having the following system groovy script run every couple of minutes to monitor the processor load:

          import java.lang.management.*;
          
          def threadBean = ManagementFactory.getThreadMXBean();
          def osBean     = ManagementFactory.getOperatingSystemMXBean();
          
          println "\n\n\n[Checking state of (master)]";
          
          println "Current CPU Time used by Jenkins: " + threadBean.getCurrentThreadCpuTime() + "ns";
          
          double processLoad = (osBean.getProcessCpuLoad() * 100).round(2);
          double cpuLoad = (osBean.getSystemCpuLoad() * 100).round(2);
          println "Process CPU Load: " + processLoad + "%";
          println "CPU Load: " + cpuLoad + "%";
          
          if (processLoad < 90) {
            println "\n\n\n === Load is less than 90%, nothing to do ===\n\n\n";
            println "\n\n\n[Done checking: CPU Load: " + cpuLoad + "%]\n\n\n";
            return;
          } else {
            println "\n\n\n === Load is more than 90%, checking for stuck threads! ===\n\n\n";
          }
          
          
          println "\n\n\n[Checking all threads]\n\n\n";
          def threadNum = 0;
          def killThreadNum = 0;
          
          def stacktraces = Thread.getAllStackTraces();
          stacktraces.each { thread, stack ->
            if (thread.getName().contains("trigger/TimerTrigger/check") ) {
              println "=== Interrupting thread " + thread.getName()+ " ===";
              thread.interrupt();
              killThreadNum++;
            }
            threadNum++;
          }
          
          println "\n\n\n[Done checking: " + threadNum + " threads, killed " + killThreadNum + "]\n\n\n";
          
          return; // Suppress groovy state dump

          Note that we had to check for TimerTrigger, not SCM Polling as the original code did. This is currently running on 1.580.2.
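For readers more at home in Java than Groovy, the thread lookup used by the script above can be sketched as follows. `StuckThreadReport` and `describe` are hypothetical names chosen for illustration. The sketch only reports each matching thread's state and interrupts nothing. The state is worth checking first because Thread.interrupt() wakes a thread that is waiting or sleeping, but merely sets the interrupt flag on a thread that is BLOCKED trying to enter a synchronized section, so interrupting such a thread has no immediate effect.

```java
import java.util.ArrayList;
import java.util.List;

class StuckThreadReport {
    // Returns "name [STATE]" for every live thread whose name contains the fragment.
    static List<String> describe(String nameFragment) {
        List<String> lines = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().contains(nameFragment)) {
                lines.add(t.getName() + " [" + t.getState() + "]");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // The name fragment is the one used in the Groovy script above.
        for (String line : describe("trigger/TimerTrigger/check")) {
            System.out.println(line);
        }
    }
}
```

Run outside a Jenkins JVM, the main method prints nothing, since no thread carries that name there.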


          Nathaniel Irons added a comment -

          The script provided on Jan 13 seems to be solving a different problem. On our instance, we see stuck SCM polling threads even when the CPU load is zero. With three SCM polling processes stuck as of this moment, the thread names reported by Thread.getAllStackTraces() are main, Finalizer, Signal Dispatcher, and Reference Handler.

          I'm pig-ignorant of groovy, and have yet to figure out where its access to Jenkins thread innards is documented, but previous iterations of scripts that did identify a stuck thread to interrupt were ineffective for us. We've yet to find an effective workaround that doesn't rely on restarting the Jenkins daemon.

          We're using 1.590, and looking to switch to LTS releases as soon as they pass us by.

          Andrew Hoffmann added a comment -

          We are experiencing git polling getting hung as well. We have ~15 jobs that poll every 5 minutes. It gets hung roughly 24 hours after a service restart. We also have the BitBucket pull request builder polling every 5 minutes for another ~15 jobs.

          Jenkins v1.622
          git plugin 2.4.0
          git-client plugin 1.18.0
          bitbucket-pullrequest-builder plugin 1.4.7

          0-30 minutes prior to being hung, I see this exception:

          WARNING: Process leaked file descriptors. See http://wiki.jenkins-ci.org/display/JENKINS/Spawning+processes+from+build for more information
          java.lang.Exception
          	at hudson.Proc$LocalProc.join(Proc.java:329)
          	at hudson.Proc.joinWithTimeout(Proc.java:168)
          	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1596)
          	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1576)
          	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1572)
          	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1233)
          	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$4.execute(CliGitAPIImpl.java:583)
          	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1310)
          	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1261)
          	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1252)
          	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.getHeadRev(CliGitAPIImpl.java:2336)
          	at hudson.plugins.git.GitSCM.compareRemoteRevisionWithImpl(GitSCM.java:583)
          	at hudson.plugins.git.GitSCM.compareRemoteRevisionWith(GitSCM.java:527)
          	at hudson.scm.SCM.compareRemoteRevisionWith(SCM.java:381)
          	at hudson.scm.SCM.poll(SCM.java:398)
          	at hudson.model.AbstractProject._poll(AbstractProject.java:1461)
          	at hudson.model.AbstractProject.poll(AbstractProject.java:1364)
          	at jenkins.triggers.SCMTriggerItem$SCMTriggerItems$Bridge.poll(SCMTriggerItem.java:119)
          	at hudson.triggers.SCMTrigger$Runner.runPolling(SCMTrigger.java:510)
          	at hudson.triggers.SCMTrigger$Runner.run(SCMTrigger.java:539)
          	at hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:118)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
          	at java.util.concurrent.FutureTask.run(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          	at java.lang.Thread.run(Unknown Source)

          I will be happy to provide more configuration details and logs if requested.

          Yves Martin added a comment - - edited

          When investigating a Subversion SCM polling issue (JENKINS-31192), I found that a global lock on hudson.scm.SubversionSCM$ModuleLocation prevents threads from working concurrently. Is that "big lock" really necessary? Maybe it is possible to reduce the code section in which the lock is held.
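The general idea of shrinking a locked section can be sketched in Java as below. Everything here is an assumption made for illustration: `RevisionCache`, `pollWide` and `pollNarrow` are hypothetical names and this is not the Subversion plugin's code. The wide variant holds the lock across the slow remote lookup, while the narrow variant does the lookup first and takes the lock only to update shared state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical example class; not taken from any Jenkins plugin.
class RevisionCache {
    private final Object lock = new Object();
    private final Map<String, String> lastSeen = new HashMap<>();

    // Wide lock: the slow remote lookup runs while the lock is held,
    // so every other poller waits for it to finish.
    boolean pollWide(String module, Supplier<String> fetchRemoteRevision) {
        synchronized (lock) {
            String remote = fetchRemoteRevision.get();
            String previous = lastSeen.put(module, remote);
            return !remote.equals(previous);
        }
    }

    // Narrow lock: the slow lookup runs unlocked; the lock only guards the map update.
    boolean pollNarrow(String module, Supplier<String> fetchRemoteRevision) {
        String remote = fetchRemoteRevision.get();
        synchronized (lock) {
            String previous = lastSeen.put(module, remote);
            return !remote.equals(previous);
        }
    }
}
```

Both methods return true when the revision differs from the one recorded by the previous poll. The trade-off of the narrow variant is that two concurrent polls of the same module may both run the remote lookup, and the result of whichever finishes last is the one recorded.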


          Meng Xin Zhu added a comment -

          Still happening on Jenkins LTS 2.19.4.

          My job periodically polls a git repo (every 5 minutes). However, the SCM polling can hang indefinitely, with no timeout. Subsequent manually triggered builds of the job are then blocked by the SCM polling as well. It's definitely a critical issue that impacts the usability of Jenkins.


            Assignee: Unassigned
            Reporter: Dean Yu (dty)
            Votes: 141
            Watchers: 147