
Regression introduced with Slave Side ChannelPinger

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Component: core

      From a user perspective:
      When a multi-config project finishes building on a Windows slave, the job hangs at the "Archiving Artifacts" step.

      From my perspective:

      STDOUT FROM JENKINS
      [Winstone 2011/11/02 09:59:31] - AJP13 Listener started: port=8009
      [Winstone 2011/11/02 09:59:31] - Winstone Servlet Engine v0.9.10
      running: controlPort=disabled
      Nov 2, 2011 9:59:32 AM hudson.model.Hudson$5 onAttained
      INFO: Started initialization
      [Winstone 2011/11/02 09:59:32] - HTTPS Listener started: port=8443
      Nov 2, 2011 9:59:33 AM hudson.model.Hudson$5 onAttained
      INFO: Listed all plugins
      Nov 2, 2011 9:59:33 AM hudson.model.Hudson$5 onAttained
      INFO: Prepared all plugins
      Nov 2, 2011 9:59:33 AM hudson.model.Hudson$5 onAttained
      INFO: Started all plugins
      Nov 2, 2011 9:59:33 AM hudson.model.Hudson$5 onAttained
      INFO: Augmented all extensions
      Nov 2, 2011 9:59:35 AM hudson.model.Hudson$5 onAttained
      INFO: Loaded all jobs
      Nov 2, 2011 9:59:39 AM hudson.model.Hudson$5 onAttained
      INFO: Completed initialization
      Nov 2, 2011 9:59:39 AM hudson.TcpSlaveAgentListener <init>
      INFO: JNLP slave agent listener started on TCP port 54055
      Nov 2, 2011 9:59:47 AM hudson.slaves.SlaveComputer tryReconnect
      INFO: Attempting to reconnect w32
      Nov 2, 2011 9:59:47 AM hudson.slaves.SlaveComputer tryReconnect
      INFO: Attempting to reconnect w64
      Nov 2, 2011 10:23:32 AM hudson.model.Run run
      INFO: SargeProduct-Build » Windows64 #7 main build action completed: SUCCESS
      Nov 2, 2011 10:28:26 AM hudson.model.Run run
      INFO: SargeProduct-Build » Windows32 #7 main build action completed: SUCCESS
      Nov 2, 2011 10:29:10 AM hudson.slaves.ChannelPinger$1 onDead
      INFO: Ping failed. Terminating the channel.
      Nov 2, 2011 10:29:45 AM hudson.slaves.SlaveComputer tryReconnect
      INFO: Attempting to reconnect w64
      Nov 2, 2011 10:34:09 AM hudson.slaves.ChannelPinger$1 onDead
      INFO: Ping failed. Terminating the channel.
      Nov 2, 2011 10:34:45 AM hudson.slaves.SlaveComputer tryReconnect
      INFO: Attempting to reconnect w32
      

      It appears that ChannelPinger does not reliably receive responses while the build is running. As soon as the build completes, ChannelPinger closes the channel, and the slave connection (ssh) is then re-established. As a result, the next step of the running build job hangs; in this case, it hangs while attempting to archive artifacts.

      My best guess is that the archiving task can no longer use the now-closed channel. The issue reproduces every time in our environment with the following versions of Jenkins:

      LTS 1.409.2
      1.436
      GIT CSET: [0afa0ea7773d8998b7662384562521812f8743ae] fixed l10n(ja).

      Description of my build slave environment and job:

      • vCenter virtual machine: 1 vCPU, 4GB RAM (32-bit) or 8GB RAM (64-bit), 50GB disk, 1 vmxnet3 NIC.
      • Windows Server 2003 running cygwin w/ssh.
      • Jenkins connecting to build slaves via built-in ssh connector.
      • Build job executes shell script which manages build and related tasks from start to finish. (See scrubbed/attached config.xml)

      Expected behavior:
      Subsequent build steps within a running job should complete successfully, as should the job itself. Moreover, the channel should not be closed, because it is clearly functional throughout the build: the job output in the log is updated frequently, and communication is never actually broken until after the job successfully completes.

      Additional Notes:

      Please note that I am currently running a git bisect to determine the exact changeset(s) that introduced the regression. The current bisect log is as follows:

      git bisect log
      git bisect start
      # good: [9a143cff6f462f621732dbe70c5659d7e322a3e2] [maven-release-plugin] prepare release hudson-1_354
      git bisect good 9a143cff6f462f621732dbe70c5659d7e322a3e2
      # bad: [a3f7f8d316969148810719ed8ba6865b67cf75c9] [maven-release-plugin] prepare release pom-1.409.2
      git bisect bad a3f7f8d316969148810719ed8ba6865b67cf75c9
      # good: [745e0114e5212b0696eed030abb7d4ddd8b99105] integrated back the RC branch
      git bisect good 745e0114e5212b0696eed030abb7d4ddd8b99105
      # skip: [7da580e86ea89308603754d517ce0e2b95ee96ca] making the code work with JDK1.5, too.
      git bisect skip 7da580e86ea89308603754d517ce0e2b95ee96ca
      # good: [feefea927fbf1bad09569d7eef6a589899e00484] Removing maven.hudson-labs.org
      git bisect good feefea927fbf1bad09569d7eef6a589899e00484
      # good: [47a39f9fcec63f84b06af03453af88196171b92e] take care of missing / for file path
      git bisect good 47a39f9fcec63f84b06af03453af88196171b92e
      # bad: [0afa0ea7773d8998b7662384562521812f8743ae] fixed l10n(ja).
      git bisect bad 0afa0ea7773d8998b7662384562521812f8743ae
      # good: [283cb18658420b1964d19a901011001ddc478ed1] oups missed to add a file for test
      git bisect good 283cb18658420b1964d19a901011001ddc478ed1
      # good: [f8b8b3d3b9d220883fbaf6c7eb6b327ac31facbc] Describing the fix made in b9be0b980e69187ee7d9727adb94a7100121cffb
      git bisect good f8b8b3d3b9d220883fbaf6c7eb6b327ac31facbc
      

      Given the current bisection results, it seems highly probable that the bisect will show the regression was introduced with the following cset:
      http://jenkins-ci.org/commit/jenkins/18327e9de69b2937ce29730071ba818899c7ac51

          [JENKINS-11586] Regression introduced with Slave Side ChannelPinger

          Ryan Hass added a comment -

          I have determined the regression was introduced in the following cset:

          commit cff80e297cd11f4572ce4bb9763509c697d3a607
          Author: Kohsuke Kawaguchi <kk@kohsuke.org>
          Date:   Wed Mar 23 12:04:45 2011 -0700
          
              ping should be set up from both directions
          
          Completed git bisect log
          git bisect start
          # good: [9a143cff6f462f621732dbe70c5659d7e322a3e2] [maven-release-plugin] prepare release hudson-1_354
          git bisect good 9a143cff6f462f621732dbe70c5659d7e322a3e2
          # bad: [a3f7f8d316969148810719ed8ba6865b67cf75c9] [maven-release-plugin] prepare release pom-1.409.2
          git bisect bad a3f7f8d316969148810719ed8ba6865b67cf75c9
          # good: [745e0114e5212b0696eed030abb7d4ddd8b99105] integrated back the RC branch
          git bisect good 745e0114e5212b0696eed030abb7d4ddd8b99105
          # skip: [7da580e86ea89308603754d517ce0e2b95ee96ca] making the code work with JDK1.5, too.
          git bisect skip 7da580e86ea89308603754d517ce0e2b95ee96ca
          # good: [feefea927fbf1bad09569d7eef6a589899e00484] Removing maven.hudson-labs.org
          git bisect good feefea927fbf1bad09569d7eef6a589899e00484
          # good: [47a39f9fcec63f84b06af03453af88196171b92e] take care of missing / for file path
          git bisect good 47a39f9fcec63f84b06af03453af88196171b92e
          # bad: [0afa0ea7773d8998b7662384562521812f8743ae] fixed l10n(ja).
          git bisect bad 0afa0ea7773d8998b7662384562521812f8743ae
          # good: [283cb18658420b1964d19a901011001ddc478ed1] oups missed to add a file for test
          git bisect good 283cb18658420b1964d19a901011001ddc478ed1
          # good: [f8b8b3d3b9d220883fbaf6c7eb6b327ac31facbc] Describing the fix made in b9be0b980e69187ee7d9727adb94a7100121cffb
          git bisect good f8b8b3d3b9d220883fbaf6c7eb6b327ac31facbc
          # bad: [fbfdaef4327be93be85452bc8e53bb5ca05ab06a] made to work with JBoss6
          git bisect bad fbfdaef4327be93be85452bc8e53bb5ca05ab06a
          # bad: [6c12a64cbac41e25d8b46f7616a032a99c14080c] added methods to help visualization
          git bisect bad 6c12a64cbac41e25d8b46f7616a032a99c14080c
          # bad: [ddbcff6927bcdf8ee0494794f5248d283a723ce7] recording 156350286b79da748091d45349cb0f33e11210f8
          git bisect bad ddbcff6927bcdf8ee0494794f5248d283a723ce7
          # good: [18327e9de69b2937ce29730071ba818899c7ac51] [FIXED JENKINS-8990] Configurable ping interval
          git bisect good 18327e9de69b2937ce29730071ba818899c7ac51
          # bad: [9c8fcf2e75f13312b1c7067716aaef0b5b95d61e] recording JENKINS-8990 fix
          git bisect bad 9c8fcf2e75f13312b1c7067716aaef0b5b95d61e
          # good: [42a3b57e9d50e9ef4f2dff87ef39ae91f55411bb] doc/licensing header setup
          git bisect good 42a3b57e9d50e9ef4f2dff87ef39ae91f55411bb
          

          I have confirmed my results by reverting the bad change on a 1.409.2 base and running the same jobs that previously failed. The regression no longer occurs once the changeset is removed from the build.
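
          For readers unfamiliar with the procedure behind the bisect log above, the following is a minimal sketch of the underlying binary search. It assumes a linear history in which every commit after the first bad one is also bad; real `git bisect` works on a commit graph and supports `skip`, which this sketch does not model. The class name `BisectSketch`, the method `firstBad`, and the commit labels are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: assumes a linear history ordered oldest to newest.
class BisectSketch {

    /**
     * Returns the index of the first bad commit, assuming the first commit is
     * good, the last commit is bad, and badness never reverts in between.
     */
    static <T> int firstBad(List<T> commits, Predicate<T> isBad) {
        int good = 0;                  // index of a commit known to be good
        int bad = commits.size() - 1;  // index of a commit known to be bad
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (isBad.test(commits.get(mid))) {
                bad = mid;   // regression is at mid or earlier
            } else {
                good = mid;  // regression is after mid
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Hypothetical commit labels; "c5" plays the role of the offending changeset.
        List<String> history = Arrays.asList("c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7");
        int index = firstBad(history, c -> c.compareTo("c5") >= 0);
        System.out.println("first bad commit: " + history.get(index));
    }
}
```

          Each test halves the remaining range, which is why a history of thousands of commits can be narrowed down in roughly a dozen build-and-test rounds.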


          Kohsuke Kawaguchi added a comment -

          Removing this changeset would eliminate the ping check, so it makes sense that it no longer terminates the channel.

          The ping is currently supposed to wait for 4 minutes before it gives up and kills the channel. But I have a hard time believing that the channel really did clog for 4 minutes. I suspect this is the same as JENKINS-11097, as the version of remoting in 1.409.x is affected by it.

          If possible, I recommend using the latest remoting.jar. It might not be trivial to have 1.409.x use it, but I think it'll solve the problem. (If not, at least it'll report more information about the activity of the ping thread that led to the channel shutdown.)
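
          The timeout behavior described in the comment above (wait a fixed time for a ping reply, then kill the channel) can be sketched roughly as follows. This is a simplified illustration and not the actual hudson.slaves.ChannelPinger source: the `Channel` interface, the class `PingWatchdogSketch`, and the method `pingOnce` are hypothetical stand-ins, and the timeouts are shortened so the demo finishes quickly.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified illustration only; NOT the real ChannelPinger implementation.
class PingWatchdogSketch {

    /** Hypothetical stand-in for a remoting channel. */
    interface Channel {
        void ping() throws Exception;  // round trip to the other side
        void close();                  // tear the channel down
    }

    /**
     * Sends one ping and waits up to timeoutMillis for it to return.
     * Returns true if the ping completed in time; otherwise closes the
     * channel and returns false.
     */
    static boolean pingOnce(Channel channel, long timeoutMillis) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Void> reply = executor.submit((Callable<Void>) () -> {
                channel.ping();
                return null;
            });
            try {
                reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException | ExecutionException e) {
                reply.cancel(true);
                channel.close();  // the "Ping failed. Terminating the channel." case
                return false;
            }
        } finally {
            executor.shutdownNow();  // also interrupts a ping that is still blocked
        }
    }

    public static void main(String[] args) throws Exception {
        Channel healthy = new Channel() {
            public void ping() { /* answers immediately */ }
            public void close() { System.out.println("healthy channel closed"); }
        };
        Channel stuck = new Channel() {
            public void ping() throws Exception { Thread.sleep(5_000); }
            public void close() { System.out.println("stuck channel closed"); }
        };
        System.out.println("healthy alive: " + pingOnce(healthy, 1_000));
        System.out.println("stuck alive: " + pingOnce(stuck, 100));
    }
}
```

          In a design like this, a ping that is delayed for reasons unrelated to the channel's health (for example, a starved ping thread) is indistinguishable from a dead channel, which is one way a working connection could be torn down.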

          Ryan Hass added a comment -

          Taking ownership while working on suggestion from Kohsuke.


          Oliver Bock added a comment -

          Any update on this?

          Thanks


          Daniel Beck added a comment -

          Is anyone still experiencing this issue on recent (no older than eight weeks) versions of Jenkins? If so, please specify the Jenkins version and exact behavior you're seeing.


          Daniel Beck added a comment -

          No response to comment asking for updated information in three weeks, so resolving as Cannot Reproduce.

          Given the age of this issue, please file a new report if a similar issue occurs again.


            Unassigned Unassigned
            rtyler R. Tyler Croy
            Votes: 12
            Watchers: 16