• Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • core
    • None
    • Platform: All, OS: All

      A few days ago we had a Mercurial server outage, with the result that all Hg
      processes running at the time hung. (For technical reasons relating to network
      config, the connections do not time out - they just hang forever.)

      For those jobs running on master, the Hg polling was killed after an hour due to
      issue #4461.

      But for those jobs running on a slave,
      SCMTrigger.DescriptorImpl.queue.inProgress shows them still active, even though
      their polling log claims they were killed after an hour. A thread dump on master
      confirms this:

      "SCM polling for hudson.model.FreeStyleProject@164e3e2[apitest]" prio=10 tid=0xa0e0a400 nid=0x746e in Object.wait() [0xf77ff000..0xf77ff554]
         java.lang.Thread.State: WAITING (on object monitor)
          at java.lang.Object.wait(Native Method)
          at java.lang.Object.wait(Object.java:485)
          at hudson.remoting.Request$1.get(Request.java:185)
          - locked <0x69424868> (a hudson.remoting.UserRequest)
          at hudson.remoting.Request$1.get(Request.java:165)
          at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
          at hudson.Proc$RemoteProc.join(Proc.java:290)
          at hudson.plugins.mercurial.MercurialSCM.joinWithTimeout(MercurialSCM.java:233)
          at hudson.plugins.mercurial.MercurialSCM.pollChanges(MercurialSCM.java:192)
          at hudson.model.AbstractProject.pollSCMChanges(AbstractProject.java:1032)
          at hudson.triggers.SCMTrigger$Runner.runPolling(SCMTrigger.java:317)
          at hudson.triggers.SCMTrigger$Runner.run(SCMTrigger.java:344)
          at hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:114)

      It seems that even though proc.kill() was called in another thread, proc.join()
      is still waiting.

      Looking at the implementation, it is no wonder kill() does not work:
      Request.callAsynch's Future.cancel just returns false and does nothing!

      Shouldn't it call channel.send(new Cancel(id)) or abort(...) or something like this?
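      To make the expected behaviour concrete, here is a minimal sketch of a Future whose cancel() actually releases a thread blocked in get(). It is not the hudson.remoting code: the class name CancellableRequestSketch and the sendCancel hook (standing in for something like sending a cancel message over the channel) are assumptions made for illustration only.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a pending remote request; NOT the hudson.remoting API. */
final class CancellableRequestSketch<V> implements Future<V> {
    // Assumed hook: whatever tells the remote side to abort the running command.
    private final Runnable sendCancel;
    private V result;
    private boolean done;
    private boolean cancelled;

    CancellableRequestSketch(Runnable sendCancel) {
        this.sendCancel = sendCancel;
    }

    /** Called when the response arrives from the remote side. */
    synchronized void complete(V value) {
        if (done) return;
        result = value;
        done = true;
        notifyAll();
    }

    @Override
    public synchronized boolean cancel(boolean mayInterruptIfRunning) {
        if (done) return false;   // too late, nothing to cancel
        cancelled = true;
        done = true;
        sendCancel.run();         // notify the remote side
        notifyAll();              // wake every thread blocked in get()
        return true;
    }

    @Override
    public synchronized boolean isCancelled() {
        return cancelled;
    }

    @Override
    public synchronized boolean isDone() {
        return done;
    }

    @Override
    public synchronized V get() throws InterruptedException, ExecutionException {
        while (!done) wait();
        if (cancelled) throw new CancellationException();
        return result;
    }

    @Override
    public synchronized V get(long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (!done) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) throw new TimeoutException();
            TimeUnit.NANOSECONDS.timedWait(this, remaining);
        }
        if (cancelled) throw new CancellationException();
        return result;
    }

    public static void main(String[] args) throws Exception {
        CancellableRequestSketch<String> request =
                new CancellableRequestSketch<>(() -> System.out.println("cancel sent to remote side"));
        Thread waiter = new Thread(() -> {
            try {
                request.get();
            } catch (CancellationException e) {
                System.out.println("waiter released by cancel()");
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        });
        waiter.start();
        request.cancel(true);
        waiter.join();
    }
}
```

      The point of the sketch is only that cancel() has to do two things: tell the other side to stop, and wake up local waiters. A cancel() that returns false and does neither leaves join() blocked exactly as in the thread dump above.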

          [JENKINS-4611] hudson.proc.RemoteProc.kill() does not work


          Jesse Glick added a comment -

          Workaround: identify the hung jobs, then

          for (thread in Thread.currentThread().threadGroup.threads) {
            // The thread array may contain null slots, hence the null check.
            if (thread != null && thread.name.matches('SCM polling for .*(job-1|job-2|...).*')) {
              thread.interrupt()
            }
          }
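          For reference, the same idea can be sketched in plain Java using only public API. This is an illustrative sketch, not code from Hudson: the class name PollingThreadInterrupter and the method interruptMatching are hypothetical. It relies on the fact that the hung thread is parked in Object.wait(), which throws InterruptedException when the thread is interrupted. Unlike the script above, Thread.getAllStackTraces() covers live threads in every thread group, not just the current one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of Hudson. */
final class PollingThreadInterrupter {

    /**
     * Interrupts every live thread whose whole name matches the pattern
     * and returns the names of the threads that were interrupted.
     */
    static List<String> interruptMatching(Pattern namePattern) {
        List<String> interrupted = new ArrayList<>();
        // getAllStackTraces() is a public way to enumerate all live threads.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (namePattern.matcher(t.getName()).matches()) {
                t.interrupt();
                interrupted.add(t.getName());
            }
        }
        return interrupted;
    }

    public static void main(String[] args) {
        // job-1 and job-2 are placeholder job names.
        Pattern hung = Pattern.compile("SCM polling for .*(job-1|job-2).*");
        for (String name : interruptMatching(hung)) {
            System.out.println("interrupted: " + name);
        }
    }
}
```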


          Jesse Glick added a comment -

          This is turning out to be a major problem, requiring at least weekly intervention
          to log in to the slave and run 'killall hg'. Worse than the immediate impact of
          the bug is the fact that it is not obvious from the Hudson GUI that anything is
          wrong: the jobs are all blue, and you have to notice that they have not run in days.


          Code changed in hudson
          User: jglick
          Path:
          trunk/hudson/main/core/src/main/java/hudson/Proc.java
          trunk/hudson/main/core/src/test/java/hudson/LauncherTest.java
          trunk/hudson/main/remoting/src/main/java/hudson/remoting/Request.java
          trunk/hudson/main/remoting/src/test/java/hudson/remoting/SimpleTest.java
          http://fisheye4.cenqua.com/changelog/hudson/?cs=23526
          Log:
          [FIXED JENKINS-4611] hudson.proc.RemoteProc.kill() was a no-op.


          Code changed in hudson
          User: jglick
          Path:
          trunk/www/changelog.html
          http://fisheye4.cenqua.com/changelog/hudson/?cs=23527
          Log:
          JENKINS-4611 Noting.


            Assignee: Unassigned
            Reporter: Jesse Glick (jglick)