• Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: remoting
    • Labels: None

      This is to track the problem originally reported here: http://n4.nabble.com/Polling-hung-td1310838.html#a1310838
      The referenced thread is relocated to http://jenkins.361315.n4.nabble.com/Polling-hung-td1310838.html

      What the problem boils down to is that many remote operations are performed synchronously, causing the channel object to be locked until the response returns. When a lengthy remote operation is using the channel, SCM polling can be blocked waiting for the monitor on the channel to be released. In extreme situations, all the polling threads can wind up waiting on the object monitors of the channel objects, preventing further processing of polling tasks.

      Furthermore, if the slave dies, the locked channel object still exists in the master JVM. If no IOException is thrown to indicate that the connection to the pipe has terminated, the channel can never be closed, because Channel.close() is itself a synchronized operation.
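
      To make the failure mode concrete, the following is a minimal, self-contained Java sketch. The FakeChannel class is hypothetical (it is not Hudson's hudson.remoting.Channel); it only illustrates, under that assumption, how a channel whose remote calls and close() share a single object monitor can leave polling threads blocked indefinitely.

      public class ChannelMonitorSketch {
          // Hypothetical stand-in for a remoting channel: remote calls and close()
          // are both synchronized, so they compete for one object monitor.
          static class FakeChannel {
              synchronized void remoteCall() {
                  try {
                      // Simulates a remote operation whose response never arrives,
                      // e.g. the slave died without an IOException reaching the master.
                      Thread.sleep(Long.MAX_VALUE);
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                  }
              }

              synchronized void close() {
                  // Never runs while remoteCall() still holds the monitor.
                  System.out.println("channel closed");
              }
          }

          public static void main(String[] args) throws InterruptedException {
              FakeChannel channel = new FakeChannel();

              // A lengthy remote operation (e.g. a build step) takes the channel monitor...
              daemon(channel::remoteCall, "lengthy-remote-operation").start();
              Thread.sleep(200); // give it time to acquire the lock

              // ...so an SCM polling thread touching the same channel parks on that
              // monitor, and even close() cannot break the logjam.
              Thread poller = daemon(channel::remoteCall, "SCM-polling-thread");
              Thread closer = daemon(channel::close, "channel-closer");
              poller.start();
              closer.start();

              Thread.sleep(500);
              System.out.println("poller state: " + poller.getState()); // BLOCKED
              System.out.println("closer state: " + closer.getState()); // BLOCKED
          }

          private static Thread daemon(Runnable task, String name) {
              Thread t = new Thread(task, name);
              t.setDaemon(true);
              return t;
          }
      }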

        1. DUMP1.txt
          57 kB
        2. hung_scm_pollers_02.PNG
          145 kB
        3. thread_dump_02.txt
          92 kB
        4. threads.vetted.txt
          163 kB

          [JENKINS-5413] SCM polling getting hung

          Dean Yu created issue -

          rsteele added a comment -

          I believe I'm seeing this problem as well and have been for a couple (maybe more?) weeks. One difference: I'm using the ClearCase plugin as my SCM provider, but otherwise the symptoms seem to be the same: one of my slaves, though still "online", seems to get stuck while polling for changes (though I don't see any ClearCase processes running in Process Explorer). Furthermore, killing the slave doesn't seem to do any good and the master doesn't even notice the slave has died.


          Carl Quinn added a comment -

          And I'm seeing it as well with a Perforce SCM.


          mdillon added a comment -

          Here is a stack dump from a Hudson master we were running after all 10 asynchronous polling threads were hung. The job names, executor names, and internal class names have been munged just in case. This thread dump appears to be missing the stack for the main thread for some reason, but I don't think that is a big deal.

          Once our server got itself into this state, we were not able to unstick polling without a restart. Disconnecting the affected executor did not cause these threads to go away and reconnecting the executor did not cause polling to resume.

          Our workaround has been to add a setting to revert the subversion plugin back to master-only polling on the affected installations. FWIW, we seem to see this on high-load Hudson installations.

          mdillon made changes -
          Attachment New: threads.vetted.txt [ 19185 ]

          mdillon added a comment -

          BTW, that thread dump was from a Hudson master running the equivalent of Hudson 1.322. I don't know if anyone else in the company has a thread dump from a more recent Hudson version.


          dshields777 added a comment -

          I'm seeing the same behavior.

          mdillon, you mention reverting to master-only polling as a workaround. How did you do that? Is there a config setting that I'm missing? Or do you mean you went back to an earlier version of the SVN plugin?


          Dean Yu added a comment -

          @dshields777: We build our Hudson installation from source with some modifications. We added a switch to the Subversion plugin to poll from the master. We can certainly contribute this patch upstream so other people can use the same workaround.

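          A hedged sketch of what such a switch could look like follows. The class and property name here are illustrative assumptions, not the actual patch: the idea is simply that the polling code consults a system property and, when it is set, runs the poll in the master JVM instead of over the slave channel.

          public class PollingLocationSwitch {
              // The property name is an assumption for illustration; the real patch may differ.
              private static final String POLL_FROM_MASTER_PROPERTY =
                      "hudson.scm.SubversionSCM.pollFromMaster";

              // Returns true when SCM polling should be executed in the master JVM.
              public static boolean pollFromMaster() {
                  return Boolean.getBoolean(POLL_FROM_MASTER_PROPERTY);
              }

              public static void main(String[] args) {
                  // Example: start the master with -Dhudson.scm.SubversionSCM.pollFromMaster=true
                  System.out.println("Poll from master? " + pollFromMaster());
              }
          }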

          wgracelee added a comment -

          Hi, we have the same problem on Hudson 1.352 using Subversion. Is the patch available for download now?


          daniel_franzen added a comment -

          Hi, I've been seeing this issue too (since late January - I wish I could give you an exact version number). I haven't found any way to reliably reproduce it (it happens at random every other day). It occurred as late as yesterday, running Hudson 1.353 with Subversion Plugin 1.16.

          The typical scenario in our case is as follows
          1) A job hangs.*
          2) The node becomes unusable. A job starting on the node stops at "Building remotely on MySlave".
          3) SVN polling gets stuck.
          4) The node can be made usable by disconnecting and reconnecting in Hudson's node management.
          5) Polling only resumes after a Hudson restart.

          *) Some background on how these job hang-ups manifest themselves:
          There's one particular job that hangs often, but I can't determine what's special about it. It typically hangs at "Recording plot data". In other words, it's not during actual job execution, but just at the end. This occurs on any of our slave nodes - nodes that are running other Hudson jobs without a hitch. When I've removed plotting it hangs at archiving/fingerprinting instead. I suspect one of the code analysis plugins we run at the end (e.g. Findbugs) might be responsible. If you believe this is of interest to the SVN polling issue I'll be happy to provide more detailed information.


            Assignee: Unassigned
            Reporter: Dean Yu (dty)
            Votes: 141
            Watchers: 147
