JENKINS-11586

Regression introduced with Slave Side ChannelPinger

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Component: core

      From a user perspective:
      When completing the build process of a multi-config project from a Windows slave, the job hangs at the "Archiving Artifacts" step.

      From my perspective:

      STDOUT FROM JENKINS
      [Winstone 2011/11/02 09:59:31] - AJP13 Listener started: port=8009
      [Winstone 2011/11/02 09:59:31] - Winstone Servlet Engine v0.9.10
      running: controlPort=disabled
      Nov 2, 2011 9:59:32 AM hudson.model.Hudson$5 onAttained
      INFO: Started initialization
      [Winstone 2011/11/02 09:59:32] - HTTPS Listener started: port=8443
      Nov 2, 2011 9:59:33 AM hudson.model.Hudson$5 onAttained
      INFO: Listed all plugins
      Nov 2, 2011 9:59:33 AM hudson.model.Hudson$5 onAttained
      INFO: Prepared all plugins
      Nov 2, 2011 9:59:33 AM hudson.model.Hudson$5 onAttained
      INFO: Started all plugins
      Nov 2, 2011 9:59:33 AM hudson.model.Hudson$5 onAttained
      INFO: Augmented all extensions
      Nov 2, 2011 9:59:35 AM hudson.model.Hudson$5 onAttained
      INFO: Loaded all jobs
      Nov 2, 2011 9:59:39 AM hudson.model.Hudson$5 onAttained
      INFO: Completed initialization
      Nov 2, 2011 9:59:39 AM hudson.TcpSlaveAgentListener <init>
      INFO: JNLP slave agent listener started on TCP port 54055
      Nov 2, 2011 9:59:47 AM hudson.slaves.SlaveComputer tryReconnect
      INFO: Attempting to reconnect w32
      Nov 2, 2011 9:59:47 AM hudson.slaves.SlaveComputer tryReconnect
      INFO: Attempting to reconnect w64
      Nov 2, 2011 10:23:32 AM hudson.model.Run run
      INFO: SargeProduct-Build » Windows64 #7 main build action completed: SUCCESS
      Nov 2, 2011 10:28:26 AM hudson.model.Run run
      INFO: SargeProduct-Build » Windows32 #7 main build action completed: SUCCESS
      Nov 2, 2011 10:29:10 AM hudson.slaves.ChannelPinger$1 onDead
      INFO: Ping failed. Terminating the channel.
      Nov 2, 2011 10:29:45 AM hudson.slaves.SlaveComputer tryReconnect
      INFO: Attempting to reconnect w64
      Nov 2, 2011 10:34:09 AM hudson.slaves.ChannelPinger$1 onDead
      INFO: Ping failed. Terminating the channel.
      Nov 2, 2011 10:34:45 AM hudson.slaves.SlaveComputer tryReconnect
      INFO: Attempting to reconnect w32
      

      It appears that ChannelPinger does not reliably receive ping responses while the build itself is running. As soon as the build completes, ChannelPinger closes the channel, and the slave connection (ssh) is then re-established. As a result, the next step of the running build job hangs. In this case, it hangs while attempting to archive artifacts.
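
      The failure mode described above can be illustrated with a minimal, self-contained sketch. This is not the Jenkins ChannelPinger code: the class PingWatchdogSketch, the FakeChannel stand-in, and the delay and timeout values are all hypothetical. It only shows how a watchdog that waits a fixed time for a ping reply will close a channel that is still functional if the reply is merely slow, for example because the remote side is busy with a build.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration only; not the actual hudson.slaves.ChannelPinger. */
public class PingWatchdogSketch {

    /** Stand-in for a master/slave channel. Name and fields are assumptions. */
    static final class FakeChannel {
        private final long replyDelayMillis;
        private volatile boolean closed;

        FakeChannel(long replyDelayMillis) {
            this.replyDelayMillis = replyDelayMillis;
        }

        /** Simulates a ping round trip that takes replyDelayMillis to answer. */
        void ping() throws InterruptedException {
            Thread.sleep(replyDelayMillis);
        }

        void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    /**
     * Sends one ping and waits up to timeoutMillis for the reply.
     * If no reply arrives in time, the channel is closed, mirroring the
     * "Ping failed. Terminating the channel." behavior seen in the log.
     *
     * @return true if the reply arrived within the timeout
     */
    static boolean pingOnce(FakeChannel channel, long timeoutMillis) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // The block lambda returns a value, so it is submitted as a Callable
            // and may throw the checked InterruptedException from ping().
            Future<?> reply = executor.submit(() -> {
                channel.ping();
                return null;
            });
            try {
                reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException | ExecutionException e) {
                // A slow reply is indistinguishable from a dead peer here.
                channel.close();
                return false;
            }
        } finally {
            // Interrupts a ping that is still waiting for its reply.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        FakeChannel idleSlave = new FakeChannel(10);    // answers quickly
        FakeChannel busySlave = new FakeChannel(2000);  // answers, but too late

        System.out.println("idle slave answered in time: " + pingOnce(idleSlave, 500));
        System.out.println("busy slave answered in time: " + pingOnce(busySlave, 500));
        System.out.println("busy slave channel closed:   " + busySlave.isClosed());
    }
}
```

      In this sketch the busy slave would have answered eventually, yet its channel is closed anyway. Whether the real ping replies are delayed, lost, or blocked behind build traffic is not established by this report.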

      My best guess at the cause is that the archiving task is no longer able to connect to the now closed/defunct channel. The issue reproduces 100% of the time in our environment with the following versions of Jenkins:

      LTS 1.409.2
      1.436
      GIT CSET: [0afa0ea7773d8998b7662384562521812f8743ae] fixed l10n(ja).

      Description of my build slave environment and job:

      • vCenter Virtual Machine, 1 vCPU, 4GB RAM (32-bit) or 8GB RAM (64-bit), 50GB disk, 1 vmxnet3 NIC.
      • Windows Server 2003 running Cygwin with ssh.
      • Jenkins connecting to build slaves via the built-in ssh connector.
      • Build job executes a shell script which manages the build and related tasks from start to finish. (See scrubbed/attached config.xml)

      Expected behavior:
      The subsequent build steps within a running job should complete successfully, as should the job itself. Moreover, the channel should not be closed, since it is clearly functional throughout the build: the output from the job is frequently updated in the log, and communication is never actually broken until after the job successfully completes.

      Additional Notes:

      Please note that I am currently running a git bisect to determine the exact changeset(s) that introduced the regression. The current bisect log is as follows:

      git bisect log
      git bisect start
      # good: [9a143cff6f462f621732dbe70c5659d7e322a3e2] [maven-release-plugin] prepare release hudson-1_354
      git bisect good 9a143cff6f462f621732dbe70c5659d7e322a3e2
      # bad: [a3f7f8d316969148810719ed8ba6865b67cf75c9] [maven-release-plugin] prepare release pom-1.409.2
      git bisect bad a3f7f8d316969148810719ed8ba6865b67cf75c9
      # good: [745e0114e5212b0696eed030abb7d4ddd8b99105] integrated back the RC branch
      git bisect good 745e0114e5212b0696eed030abb7d4ddd8b99105
      # skip: [7da580e86ea89308603754d517ce0e2b95ee96ca] making the code work with JDK1.5, too.
      git bisect skip 7da580e86ea89308603754d517ce0e2b95ee96ca
      # good: [feefea927fbf1bad09569d7eef6a589899e00484] Removing maven.hudson-labs.org
      git bisect good feefea927fbf1bad09569d7eef6a589899e00484
      # good: [47a39f9fcec63f84b06af03453af88196171b92e] take care of missing / for file path
      git bisect good 47a39f9fcec63f84b06af03453af88196171b92e
      # bad: [0afa0ea7773d8998b7662384562521812f8743ae] fixed l10n(ja).
      git bisect bad 0afa0ea7773d8998b7662384562521812f8743ae
      # good: [283cb18658420b1964d19a901011001ddc478ed1] oups missed to add a file for test
      git bisect good 283cb18658420b1964d19a901011001ddc478ed1
      # good: [f8b8b3d3b9d220883fbaf6c7eb6b327ac31facbc] Describing the fix made in b9be0b980e69187ee7d9727adb94a7100121cffb
      git bisect good f8b8b3d3b9d220883fbaf6c7eb6b327ac31facbc
      

      Given the current bisection results, it seems highly probable that the bisect will identify the following cset as the one that introduced the regression:
      http://jenkins-ci.org/commit/jenkins/18327e9de69b2937ce29730071ba818899c7ac51

            Assignee: Unassigned
            Reporter: rtyler (R. Tyler Croy)
            Votes: 12
            Watchers: 16
