JENKINS-4611

hudson.proc.RemoteProc.kill() does not work

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Component/s: core
    • Labels: None
    • Environment: Platform: All, OS: All

    Description

      A few days ago we had a Mercurial server outage, with the result that all Hg
      processes running at the time hung. (For technical reasons relating to network
      config, the connections do not time out - they just hang forever.)

      For those jobs running on master, the Hg polling was killed after an hour due to
      issue #4461.

      But for those jobs running on a slave,
      SCMTrigger.DescriptorImpl.queue.inProgress shows them still active, even though
      their polling log claims they were killed after an hour. A thread dump on master
      confirms this:

      "SCM polling for hudson.model.FreeStyleProject@164e3e2[apitest]" prio=10
      tid=0xa0e0a400 nid=0x746e in Object.wait() [0xf77ff000..0xf77ff554]
      java.lang.Thread.State: WAITING (on object monitor)
      at java.lang.Object.wait(Native Method)
      at java.lang.Object.wait(Object.java:485)
      at hudson.remoting.Request$1.get(Request.java:185)
      - locked <0x69424868> (a hudson.remoting.UserRequest)
      at hudson.remoting.Request$1.get(Request.java:165)
      at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
      at hudson.Proc$RemoteProc.join(Proc.java:290)
      at hudson.plugins.mercurial.MercurialSCM.joinWithTimeout(MercurialSCM.java:233)
      at hudson.plugins.mercurial.MercurialSCM.pollChanges(MercurialSCM.java:192)
      at hudson.model.AbstractProject.pollSCMChanges(AbstractProject.java:1032)
      at hudson.triggers.SCMTrigger$Runner.runPolling(SCMTrigger.java:317)
      at hudson.triggers.SCMTrigger$Runner.run(SCMTrigger.java:344)
      at hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:114)

      It seems that even though proc.kill() was called in another thread, proc.join()
      is still waiting.

      Looking at the implementation, it is no wonder kill() does not work: the
      Future returned by Request.callAsync has a cancel method that just returns
      false and does nothing!

      Shouldn't it call channel.send(new Cancel(id)) or abort(...) or something like this?
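
      To illustrate the failure mode described above, here is a minimal sketch of
      a waiter blocked on a pending result, once with a cancel that does nothing
      and once with a cancel that records the cancellation and wakes the waiter.
      CancellableWait and all of its members are hypothetical names; this is not
      Hudson's Request class and not the fix that was committed. The 500 ms
      timeout exists only so the demo terminates; per the report, the real join()
      had no such limit.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a pending remote request (not Hudson's Request). */
public class CancellableWait {
    private boolean done;       // guarded by "this"
    private boolean cancelled;  // guarded by "this"
    private final boolean cancelWorks;

    CancellableWait(boolean cancelWorks) {
        this.cancelWorks = cancelWorks;
    }

    /** With cancelWorks == false this mimics the reported bug: return false, change nothing. */
    public synchronized boolean cancel() {
        if (!cancelWorks || done) {
            return false;
        }
        cancelled = true;
        done = true;
        notifyAll();  // wake every thread blocked in get()
        return true;
    }

    /** Waits until the request is cancelled, or gives up after timeoutMillis. */
    public synchronized void get(long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!done) {
            long remaining = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remaining <= 0) {
                throw new TimeoutException("still waiting");
            }
            wait(remaining);
        }
        if (cancelled) {
            throw new CancellationException();
        }
    }

    /** Starts a waiter, calls cancel() from another thread, and reports how the waiter ended. */
    static String demo(boolean cancelWorks) throws InterruptedException {
        CancellableWait pending = new CancellableWait(cancelWorks);
        final String[] outcome = new String[1];
        Thread waiter = new Thread(() -> {
            try {
                pending.get(500);
                outcome[0] = "completed";
            } catch (CancellationException e) {
                outcome[0] = "cancelled";
            } catch (TimeoutException e) {
                outcome[0] = "timed out";
            } catch (InterruptedException e) {
                outcome[0] = "interrupted";
            }
        });
        waiter.start();
        Thread.sleep(100);  // give the waiter time to block
        pending.cancel();   // the "kill" arriving from another thread
        waiter.join();
        return outcome[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("no-op cancel:   " + demo(false));
        System.out.println("working cancel: " + demo(true));
    }
}
```

      With the no-op cancel the waiter is released only by the demo's own
      timeout; with the working cancel it is released as soon as cancel() runs.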

          Activity

            jglick Jesse Glick added a comment -

            Workaround: identify the hung jobs, then

            // Interrupt the polling threads of the hung jobs
            // (list their names in place of job-1|job-2|...).
            for (thread in Thread.currentThread().threadGroup.threads) {
              if (thread != null && thread.name.matches('SCM polling for .*(job-1|job-2|...).*')) {
                thread.interrupt()
              }
            }
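
            The workaround relies on standard Java behaviour: interrupting a
            thread that is blocked in Object.wait() makes wait() throw
            InterruptedException, so the thread can unwind. Below is a minimal
            Java sketch of that mechanism. InterruptWaiter and the thread name
            "SCM polling for demo-job" are made up for illustration, and how
            Hudson's own code reacts to the interrupt is not shown here.

```java
/** Hypothetical demo: find a blocked thread by name and interrupt it. */
public class InterruptWaiter {

    static String demo() throws InterruptedException {
        final Object lock = new Object();
        final String[] outcome = {"still waiting"};

        // A thread that waits forever unless interrupted, like the hung join().
        Thread poller = new Thread(() -> {
            synchronized (lock) {
                try {
                    while (true) {
                        lock.wait();  // loop guards against spurious wakeups
                    }
                } catch (InterruptedException e) {
                    outcome[0] = "interrupted";
                }
            }
        }, "SCM polling for demo-job");
        poller.start();

        // Enumerate the threads in the current group and interrupt matches by name.
        ThreadGroup group = Thread.currentThread().getThreadGroup();
        Thread[] threads = new Thread[group.activeCount() + 8];
        int count = group.enumerate(threads);
        for (int i = 0; i < count; i++) {
            if (threads[i].getName().matches("SCM polling for .*(demo-job).*")) {
                threads[i].interrupt();
            }
        }

        poller.join(5000);  // bounded wait so the demo cannot hang
        return outcome[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());
    }
}
```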

            jglick Jesse Glick added a comment -

            This is turning out to be a major problem, requiring at least weekly
            intervention to log in to the slave and run 'killall hg'. Worse than the
            immediate impact of the bug is that nothing in the Hudson GUI shows that
            anything is wrong: the jobs are all blue, and you have to notice that
            they have not run in days.


            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in hudson
            User: jglick
            Path:
            trunk/hudson/main/core/src/main/java/hudson/Proc.java
            trunk/hudson/main/core/src/test/java/hudson/LauncherTest.java
            trunk/hudson/main/remoting/src/main/java/hudson/remoting/Request.java
            trunk/hudson/main/remoting/src/test/java/hudson/remoting/SimpleTest.java
            http://fisheye4.cenqua.com/changelog/hudson/?cs=23526
            Log:
            [FIXED JENKINS-4611] hudson.proc.RemoteProc.kill() was a no-op.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in hudson
            User: jglick
            Path:
            trunk/www/changelog.html
            http://fisheye4.cenqua.com/changelog/hudson/?cs=23527
            Log:
            JENKINS-4611 Noting.

            People

              Assignee: Unassigned
              Reporter: jglick Jesse Glick
              Votes: 0
              Watchers: 1
