JENKINS-19619

Collecting findbugs analysis results occasionally causes ssh slave to go offline causing job to abort

      Some of our legacy builds have a large number of FindBugs warnings.
      Parsing these results in the post-build action sometimes causes the slave to go offline. Jenkins quickly brings the slave back online, but by then the job has already failed.

      I notice that FindBugs writes a lot of output to slave.log on the Jenkins master.
      By the time I check, slave.log has rotated to slave.log.1 but a new slave.log has not been created.

      From tailing slave.log on Jenkins master:

      Sep 17, 2013 11:52:39 AM hudson.plugins.findbugs.parser.FindBugsParser findSourceFile
      WARNING: Can't resolve absolute file name for file CallbackInterceptorConfigurer.java, dir list = [/tr/j/jh/workspace/tws_trunk_nightly_build/com.aepona.tws.build/build/test-reports/findbugs.xml/src/main/java, /tr/j/jh/workspace/tws_trunk_nightly_build/com.aepona.tws.build/build/test-reports/findbugs.xml/src/test/java, /tr/j/jh/workspace/tws_trunk_nightly_build/com.aepona.tws.build/build/test-reports/findbugs.xml/src]
      tail: `slave.log' has become inaccessible: No such file or directory
      

      List of slave logs. Note that slave.log no longer exists; it has rolled over to slave.log.1:

      $ ll -h
      total 764K
      -rw-r--r-- 1 rcbuild_user cs_sl025 461K Sep 17 11:52 slave.log.1
      -rw-r--r-- 1 rcbuild_user cs_sl025 3.1K Sep 11 10:28 slave.log.10
      -rw-r--r-- 1 rcbuild_user cs_sl025 2.6K Sep 17 08:42 slave.log.2
      -rw-r--r-- 1 rcbuild_user cs_sl025 2.6K Sep 17 08:03 slave.log.3
      -rw-r--r-- 1 rcbuild_user cs_sl025 2.8K Sep 17 07:53 slave.log.4
      -rw-r--r-- 1 rcbuild_user cs_sl025 267K Sep 13 10:33 slave.log.5
      -rw-r--r-- 1 rcbuild_user cs_sl025 2.6K Sep 12 10:03 slave.log.6
      -rw-r--r-- 1 rcbuild_user cs_sl025 2.6K Sep 12 09:37 slave.log.7
      

      From Jenkins log:

      Sep 17, 2013 11:52:38 AM hudson.model.Run execute
      INFO: tws_trunk_nightly_build #306 main build action completed: SUCCESS
      Sep 17, 2013 11:52:39 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
      SEVERE: I/O error in channel Neshi
      java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      Caused by: java.io.EOFException
              at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
              at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1316)
              at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
              at hudson.remoting.Command.readFrom(Command.java:92)
              at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      
      Sep 17, 2013 11:52:39 AM hudson.model.AbstractBuild$AbstractBuildExecution performAllBuildSteps
      WARNING: Publisher hudson.plugins.findbugs.FindBugsPublisher aborted due to exception
      hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
              at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
              at hudson.remoting.Request.call(Request.java:174)
              at hudson.remoting.Channel.call(Channel.java:714)
              at hudson.FilePath.act(FilePath.java:898)
              at hudson.FilePath.act(FilePath.java:882)
              at hudson.plugins.findbugs.FindBugsPublisher.perform(FindBugsPublisher.java:161)
              at hudson.plugins.analysis.core.HealthAwarePublisher.perform(HealthAwarePublisher.java:144)
              at hudson.plugins.analysis.core.HealthAwareRecorder.perform(HealthAwareRecorder.java:333)
              at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
              at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:782)
              at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:754)
              at hudson.model.Build$BuildExecution.post2(Build.java:183)
              at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:707)
              at hudson.model.Run.execute(Run.java:1629)
              at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
              at hudson.model.ResourceController.execute(ResourceController.java:88)
              at hudson.model.Executor.run(Executor.java:246)
      Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.Request.abort(Request.java:299)
              at hudson.remoting.Channel.terminate(Channel.java:774)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
      Caused by: java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      Caused by: java.io.EOFException
              at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
              at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1316)
              at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
              at hudson.remoting.Command.readFrom(Command.java:92)
              at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      Slave reconnect from jenkins.log:

      Sep 17, 2013 11:53:09 AM hudson.slaves.SlaveComputer tryReconnect
      INFO: Attempting to reconnect xxx
      

      Log from build:

      11:52:38 BUILD SUCCESSFUL
      11:52:38 Total time: 8 seconds
      11:52:38 [FINDBUGS] Collecting findbugs analysis files...
      11:52:39 ERROR: Publisher hudson.plugins.findbugs.FindBugsPublisher aborted due to exception
      11:52:39 hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      11:52:39 	at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
      11:52:39 	at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
      11:52:39 	at hudson.remoting.Request.call(Request.java:174)
      11:52:39 	at hudson.remoting.Channel.call(Channel.java:714)
      11:52:39 	at hudson.FilePath.act(FilePath.java:898)
      11:52:39 	at hudson.FilePath.act(FilePath.java:882)
      11:52:39 	at hudson.plugins.findbugs.FindBugsPublisher.perform(FindBugsPublisher.java:161)
      11:52:39 	at hudson.plugins.analysis.core.HealthAwarePublisher.perform(HealthAwarePublisher.java:144)
      11:52:39 	at hudson.plugins.analysis.core.HealthAwareRecorder.perform(HealthAwareRecorder.java:333)
      11:52:39 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
      11:52:39 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:782)
      11:52:39 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:754)
      11:52:39 	at hudson.model.Build$BuildExecution.post2(Build.java:183)
      11:52:39 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:707)
      11:52:39 	at hudson.model.Run.execute(Run.java:1629)
      11:52:39 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      11:52:39 	at hudson.model.ResourceController.execute(ResourceController.java:88)
      11:52:39 	at hudson.model.Executor.run(Executor.java:246)
      11:52:39 Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      11:52:39 	at hudson.remoting.Request.abort(Request.java:299)
      11:52:39 	at hudson.remoting.Channel.terminate(Channel.java:774)
      11:52:39 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
      11:52:39 Caused by: java.io.IOException: Unexpected termination of the channel
      11:52:39 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      11:52:39 Caused by: java.io.EOFException
      11:52:39 	at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
      11:52:39 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1316)
      11:52:39 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
      11:52:39 	at hudson.remoting.Command.readFrom(Command.java:92)
      11:52:39 	at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
      11:52:39 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      11:52:39 [PMD] Skipping publisher since build result is FAILURE
      


          Geoff Cummings added a comment -

          This is an old project, and the warning just means I cannot see the FindBugs errors in the context of the Java file, so that part is not a high priority.
          I am more concerned that it causes the slave to go offline and fails the build.

          The main problem might be how Jenkins core manages the slave logs.
          Changing the logging level just stops the FindBugs plugin from triggering the core issue...

          Ulli Hafner added a comment -

          I see. In the first project, it is clear that the file resolution does not work. It could help here to provide the source path for FindBugs in the Ant script; I use that information afterwards to resolve the file names.

          But you are right, the actual problem still remains. I only suggested the workaround since I can't change the affected code in Jenkins core (or in the SSH plug-in). I already posted a question on the dev list but have had no answer so far from the developers. (Actually, I think the only one who can help here is Kohsuke, since he wrote most of that code.) So if there is anything I can do in my plug-in, let me know. Otherwise we need to wait until someone else has the time to look into that part of Jenkins.

          Geoff Cummings added a comment -

          Thanks Ulli, I have built a copy of the FindBugs plugin with that logging set to FINE to work around it for now.
          It would be better to get it fixed in core.
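
          As an illustration of why lowering the level works as a workaround, here is a minimal sketch using plain java.util.logging. It is not the plugin's code: the class, the logger name, and the handler are hypothetical, and it assumes the logger's effective level is INFO. Under that assumption, records logged at WARNING reach the handler, while the same records logged at FINE are dropped before anything is written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch; none of these names come from the FindBugs plugin.
class LogLevelSketch {

    /** Collects every record that actually reaches a handler. */
    static final class CollectingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override public void publish(LogRecord record) {
            if (isLoggable(record)) {
                records.add(record);
            }
        }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /** Logs count messages at the given level and returns how many were emitted. */
    static int emitted(Level level, int count) {
        Logger logger = Logger.getLogger("sketch.parser." + level.getName());
        logger.setUseParentHandlers(false); // keep the demo output off the console
        logger.setLevel(Level.INFO);        // assumption: effective level is INFO
        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(Level.ALL);
        logger.addHandler(handler);
        for (int i = 0; i < count; i++) {
            logger.log(level, "Can't resolve absolute file name for file File" + i + ".java");
        }
        logger.removeHandler(handler);
        return handler.records.size();
    }

    public static void main(String[] args) {
        System.out.println("emitted at WARNING: " + emitted(Level.WARNING, 1000));
        System.out.println("emitted at FINE:    " + emitted(Level.FINE, 1000));
    }
}
```

          This only reduces the amount of slave-side output; it does not address the underlying connection problem discussed in this issue.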

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          src/com/trilead/ssh2/channel/Channel.java
          http://jenkins-ci.org/commit/trilead-ssh2/f1353cc0e0aa1b1e6bc845236e4a2530ea3103fd
          Log:
          [FIXED JENKINS-18836][FIXED JENKINS-18879][FIXED JENKINS-19619] remove double call of freeupWindow(len); when using ssh-slaves 0.27+

          • the more performant code path is only followed when using SSH Slaves 0.27+
          • the double call causes the channel to get torn down
          • thus excessive logging to stderr on the slave side of the connection will cause the connection to tear down
          • removing the duplicate call resolves the issue

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          changelog.html
          core/pom.xml
          http://jenkins-ci.org/commit/jenkins/bb265c5e95b0fe39128720b903914236962db41b
          Log:
          [FIXED JENKINS-18836][FIXED JENKINS-18879][FIXED JENKINS-19619] Upgrade trilead-ssh to version with the fix

          Stephen Connolly added a comment -

          Fixed in Jenkins core 1.536

          dogfood added a comment -

          Integrated in jenkins_main_trunk #2938
          [FIXED JENKINS-18836][FIXED JENKINS-18879][FIXED JENKINS-19619] Upgrade trilead-ssh to version with the fix (Revision bb265c5e95b0fe39128720b903914236962db41b)

          Result = UNSTABLE
          Stephen Connolly : bb265c5e95b0fe39128720b903914236962db41b
          Files :

          • changelog.html
          • core/pom.xml


          Ulli Hafner added a comment -

          Thanks for fixing that Stephen!


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          core/pom.xml
          http://jenkins-ci.org/commit/jenkins/1bb06ada301496ebed6d212188d1b7c9d006317b
          Log:
          [FIXED JENKINS-18836][FIXED JENKINS-18879][FIXED JENKINS-19619] Upgrade trilead-ssh to version with the fix

          (cherry picked from commit bb265c5e95b0fe39128720b903914236962db41b)

          Conflicts:
          changelog.html

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          src/com/trilead/ssh2/channel/Channel.java
          http://jenkins-ci.org/commit/trilead-ssh2/5811ddd7ae15670a4f9ad345352613b3f2f2db97
          Log:
          JENKINS-22938 SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis

          The fix for JENKINS-18836, JENKINS-18879, JENKINS-19619 was incorrect in its analysis.

          • There is no call to getChannelData() on the new code path, so thus you cannot have two calls of freeupWindow()
          • The problem with the original call to freeupWindow() is that it is on the receiver thread. You should not mix the responsibilities. Blocking the receiver thread to send a message will negatively impact performance and connection stability.
          • The correct solution is to push the freeupWindow onto the async queue thus the ACK gets sent and the purity of the receiving thread can be maintained.
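
          The idea in the last bullet of that commit message can be sketched roughly as follows. This is not the trilead-ssh2 code: the class, the AckSender interface, the half-window threshold, and the window size are all hypothetical, chosen only to show a receiver thread handing the window-adjust ACK to a single-threaded queue instead of sending it itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the actual Channel.java from trilead-ssh2.
class WindowAckSketch {

    /** Hypothetical stand-in for whatever sends the window-adjust message. */
    interface AckSender {
        void sendWindowAdjust(int bytes) throws Exception;
    }

    // A single worker thread preserves the order in which ACKs were queued.
    private final ExecutorService asyncQueue = Executors.newSingleThreadExecutor();
    private final AckSender sender;
    private final int windowSize;
    private int remaining;

    WindowAckSketch(int windowSize, AckSender sender) {
        this.windowSize = windowSize;
        this.remaining = windowSize;
        this.sender = sender;
    }

    /** Called on the receiver thread for each chunk of incoming data. */
    void onData(int len) {
        remaining -= len;
        if (remaining <= windowSize / 2) { // assumed threshold for this sketch
            final int increment = windowSize - remaining;
            remaining = windowSize;
            // Queue the ACK; the receiver thread returns without blocking on a send.
            asyncQueue.execute(() -> {
                try {
                    sender.sendWindowAdjust(increment);
                } catch (Exception e) {
                    // A real implementation would tear down the channel here.
                }
            });
        }
    }

    /** Stops accepting new ACKs and waits for the queued ones to be sent. */
    void shutdown() throws InterruptedException {
        asyncQueue.shutdown();
        asyncQueue.awaitTermination(5, TimeUnit.SECONDS);
    }

    /** Feeds ten 300-byte chunks through a 1000-byte window and records the ACKs. */
    static List<Integer> demo() throws InterruptedException {
        List<Integer> acks = Collections.synchronizedList(new ArrayList<>());
        WindowAckSketch channel = new WindowAckSketch(1000, bytes -> acks.add(bytes));
        for (int i = 0; i < 10; i++) {
            channel.onData(300);
        }
        channel.shutdown();
        return acks;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("window adjusts sent: " + demo());
    }
}
```

          In this sketch the receiver only does bookkeeping and enqueues work, so a slow or blocked send cannot stall the thread that reads incoming data.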

            drulli Ulli Hafner
            gcummings Geoff Cummings