Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-19619

Collecting findbugs analysis results occasionally causes ssh slave to go offline causing job to abort

      Collecting findbugs analysis results occasionally causes ssh slave to go offline causing job to abort

      Some of our legacy builds have a large number of Findbugs warnings.
      Parsing these results in the post build action sometimes causes the slave to go offline, and jenkins quickly brings the slave back online but it causes the job to have failed.

      I notice that findbugs outputs a lot of information to the slave.log on the jenkins master
      By the time I check, the slave.log has rotated to slave.log.1 but a new slave.log has not been created.

      From tailing slave.log on Jenkins master:

      Sep 17, 2013 11:52:39 AM hudson.plugins.findbugs.parser.FindBugsParser findSourceFile
      WARNING: Can't resolve absolute file name for file CallbackInterceptorConfigurer.java, dir list = [/tr/j/jh/workspace/tws_trunk_nightly_build/com.aepona.tws.build/build/test-reports/findbugs.xml/src/main/java, /tr/j/jh/workspace/tws_trunk_nightly_build/com.aepona.tws.build/build/test-reports/findbugs.xml/src/test/java, /tr/j/jh/workspace/tws_trunk_nightly_build/com.aepona.tws.build/build/test-reports/findbugs.xml/src]
      tail: `slave.log' has become inaccessible: No such file or directory
      

      list of slave logs: notice how slave.log no longer exists
      slave.log has rolled over to slave.log.1

      $ ll -h
      total 764K
      -rw-r--r-- 1 rcbuild_user cs_sl025 461K Sep 17 11:52 slave.log.1
      -rw-r--r-- 1 rcbuild_user cs_sl025 3.1K Sep 11 10:28 slave.log.10
      -rw-r--r-- 1 rcbuild_user cs_sl025 2.6K Sep 17 08:42 slave.log.2
      -rw-r--r-- 1 rcbuild_user cs_sl025 2.6K Sep 17 08:03 slave.log.3
      -rw-r--r-- 1 rcbuild_user cs_sl025 2.8K Sep 17 07:53 slave.log.4
      -rw-r--r-- 1 rcbuild_user cs_sl025 267K Sep 13 10:33 slave.log.5
      -rw-r--r-- 1 rcbuild_user cs_sl025 2.6K Sep 12 10:03 slave.log.6
      -rw-r--r-- 1 rcbuild_user cs_sl025 2.6K Sep 12 09:37 slave.log.7
      

      From Jenkins log:

      Sep 17, 2013 11:52:38 AM hudson.model.Run execute
      INFO: tws_trunk_nightly_build #306 main build action completed: SUCCESS
      Sep 17, 2013 11:52:39 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
      SEVERE: I/O error in channel Neshi
      java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      Caused by: java.io.EOFException
              at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
              at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1316)
              at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
              at hudson.remoting.Command.readFrom(Command.java:92)
              at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      
      Sep 17, 2013 11:52:39 AM hudson.model.AbstractBuild$AbstractBuildExecution performAllBuildSteps
      WARNING: Publisher hudson.plugins.findbugs.FindBugsPublisher aborted due to exception
      hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
              at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
              at hudson.remoting.Request.call(Request.java:174)
              at hudson.remoting.Channel.call(Channel.java:714)
              at hudson.FilePath.act(FilePath.java:898)
              at hudson.FilePath.act(FilePath.java:882)
              at hudson.plugins.findbugs.FindBugsPublisher.perform(FindBugsPublisher.java:161)
              at hudson.plugins.analysis.core.HealthAwarePublisher.perform(HealthAwarePublisher.java:144)
              at hudson.plugins.analysis.core.HealthAwareRecorder.perform(HealthAwareRecorder.java:333)
              at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
              at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:782)
              at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:754)
              at hudson.model.Build$BuildExecution.post2(Build.java:183)
              at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:707)
              at hudson.model.Run.execute(Run.java:1629)
              at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
              at hudson.model.ResourceController.execute(ResourceController.java:88)
              at hudson.model.Executor.run(Executor.java:246)
      Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.Request.abort(Request.java:299)
              at hudson.remoting.Channel.terminate(Channel.java:774)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
      Caused by: java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      Caused by: java.io.EOFException
              at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
              at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1316)
              at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
              at hudson.remoting.Command.readFrom(Command.java:92)
              at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      Slave Reconnect from jenkins.log

      Sep 17, 2013 11:53:09 AM hudson.slaves.SlaveComputer tryReconnect
      INFO: Attempting to reconnect xxx
      

      Log from build:

      11:52:38 BUILD SUCCESSFUL
      11:52:38 Total time: 8 seconds
      11:52:38 [FINDBUGS] Collecting findbugs analysis files...
      11:52:39 ERROR: Publisher hudson.plugins.findbugs.FindBugsPublisher aborted due to exception
      11:52:39 hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      11:52:39 	at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
      11:52:39 	at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
      11:52:39 	at hudson.remoting.Request.call(Request.java:174)
      11:52:39 	at hudson.remoting.Channel.call(Channel.java:714)
      11:52:39 	at hudson.FilePath.act(FilePath.java:898)
      11:52:39 	at hudson.FilePath.act(FilePath.java:882)
      11:52:39 	at hudson.plugins.findbugs.FindBugsPublisher.perform(FindBugsPublisher.java:161)
      11:52:39 	at hudson.plugins.analysis.core.HealthAwarePublisher.perform(HealthAwarePublisher.java:144)
      11:52:39 	at hudson.plugins.analysis.core.HealthAwareRecorder.perform(HealthAwareRecorder.java:333)
      11:52:39 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
      11:52:39 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:782)
      11:52:39 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:754)
      11:52:39 	at hudson.model.Build$BuildExecution.post2(Build.java:183)
      11:52:39 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:707)
      11:52:39 	at hudson.model.Run.execute(Run.java:1629)
      11:52:39 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      11:52:39 	at hudson.model.ResourceController.execute(ResourceController.java:88)
      11:52:39 	at hudson.model.Executor.run(Executor.java:246)
      11:52:39 Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      11:52:39 	at hudson.remoting.Request.abort(Request.java:299)
      11:52:39 	at hudson.remoting.Channel.terminate(Channel.java:774)
      11:52:39 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
      11:52:39 Caused by: java.io.IOException: Unexpected termination of the channel
      11:52:39 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      11:52:39 Caused by: java.io.EOFException
      11:52:39 	at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
      11:52:39 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1316)
      11:52:39 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
      11:52:39 	at hudson.remoting.Command.readFrom(Command.java:92)
      11:52:39 	at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
      11:52:39 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      11:52:39 [PMD] Skipping publisher since build result is FAILURE
      

          [JENKINS-19619] Collecting findbugs analysis results occasionally causes ssh slave to go offline causing job to abort

          For some of our jos the following logging in FindBugsParser.java causes a lot of logging in the slave.log on the jenkins master.

          Logger.getLogger(getClass().getName()).log(Level.WARNING,
              "Can't resolve absolute file name for file " + sourceLine.getSourceFile()
              + ", dir list = " + project.getSourceDirList().toString());
          

          This logging eventually causes problems with Jenkins causing the ssh slave to be disconnected.
          Not sure if it is related to slave.log size, or if it may be the large amount of logging being transferred causes issues with the jenkins master slave communication>

          This may be a core issue around the slave.log, but changing this log level to FINE does not result in the log entries being recorded in the slave.log
          As a result we no longer experience the errors reported and the slave does not get disconnected.

          Geoff Cummings added a comment - For some of our jos the following logging in FindBugsParser.java causes a lot of logging in the slave.log on the jenkins master. Logger.getLogger(getClass().getName()).log(Level.WARNING, "Can't resolve absolute file name for file " + sourceLine.getSourceFile() + ", dir list = " + project.getSourceDirList().toString()); This logging eventually causes problems with Jenkins causing the ssh slave to be disconnected. Not sure if it is related to slave.log size, or if it may be the large amount of logging being transferred causes issues with the jenkins master slave communication> This may be a core issue around the slave.log, but changing this log level to FINE does not result in the log entries being recorded in the slave.log As a result we no longer experience the errors reported and the slave does not get disconnected.

          Ulli Hafner added a comment -

          Can you please check why FindBugs can't resolve the file names? Normally, FindBugs reports the absolute paths so that the files will be found. How is FindBugs started? I think this would help to track down the problem...

          Ulli Hafner added a comment - Can you please check why FindBugs can't resolve the file names? Normally, FindBugs reports the absolute paths so that the files will be found. How is FindBugs started? I think this would help to track down the problem...

          First Project which causes this.

          hmm.. it's an old project that uses ant, so I had to go have a look around... and I'm Not overly familiar with how findbugs works with ant.

          The workspace has ~400 projects which each create a jar file.
          The ant build script builds these each project and copies the jars to a build folder and later runs findbugs on the jars.

          workspace/
           - build_project
             - build 
               - test_reports
                 - findbugs.xml  (findbugs report)
               - project_1.jar
               - project_2.jar
           - project_1
           - project_2
           - project_3
          

          looking at the findbugs.xml results, the sourcepath is relative and not absolute.
          so for a file that is in project_1, the sourcepath would be com/company/xxx/MyJavaFile.java
          This causes the plugin to try looking in
          $workspace/build_project/build/test-reports/findbugs.xml/src/test/java

          The file is really in
          $workspace/project_1/src/test/java

          i'll try and get time to get it working correctly...

          Geoff Cummings added a comment - First Project which causes this. hmm.. it's an old project that uses ant, so I had to go have a look around... and I'm Not overly familiar with how findbugs works with ant. The workspace has ~400 projects which each create a jar file. The ant build script builds these each project and copies the jars to a build folder and later runs findbugs on the jars. workspace/ - build_project - build - test_reports - findbugs.xml (findbugs report) - project_1.jar - project_2.jar - project_1 - project_2 - project_3 looking at the findbugs.xml results, the sourcepath is relative and not absolute. so for a file that is in project_1, the sourcepath would be com/company/xxx/MyJavaFile.java This causes the plugin to try looking in $workspace/build_project/build/test-reports/findbugs.xml/src/test/java The file is really in $workspace/project_1/src/test/java i'll try and get time to get it working correctly...

          The other project which causes this is giving the following error:

          WARNING: Can't resolve absolute file name for file <Unknown>

          I'll need to spend some time looking into this one in more detail.

          Geoff Cummings added a comment - The other project which causes this is giving the following error: WARNING: Can't resolve absolute file name for file <Unknown> I'll need to spend some time looking into this one in more detail.

          This is an old project and it just means I cannot see the findbugs errors in context of the java file, so not a high priority.
          More concerned that it causes the slave to go offline and fail the build.

          main problem might be how jenkins core manages the slave logs.
          changing the logging level just stops findbugs plugin from triggering the core issue...

          Geoff Cummings added a comment - This is an old project and it just means I cannot see the findbugs errors in context of the java file, so not a high priority. More concerned that it causes the slave to go offline and fail the build. main problem might be how jenkins core manages the slave logs. changing the logging level just stops findbugs plugin from triggering the core issue...

          Ulli Hafner added a comment -

          I see. In the first project, it is clear that the file resolution does not work. Here it could help, that in the ant script the source path is provided for findbugs. I'm using that information afterwards to resolve the file names.

          But you are right, the actual problem still remains. I only suggested the workaround since I can't change the affected code in Jenkins core (or in the ssh plug-in). I already posted a question in the dev list but got no answer so far from the developers. (Actually I think the only one who can help here is Kohsuke since he wrote most of that code). So if there is anything I can do in my plug-in, let me know. Otherwise we need to wait until someone else has the time to look into that part of Jenkins.

          Ulli Hafner added a comment - I see. In the first project, it is clear that the file resolution does not work. Here it could help, that in the ant script the source path is provided for findbugs. I'm using that information afterwards to resolve the file names. But you are right, the actual problem still remains. I only suggested the workaround since I can't change the affected code in Jenkins core (or in the ssh plug-in). I already posted a question in the dev list but got no answer so far from the developers. (Actually I think the only one who can help here is Kohsuke since he wrote most of that code). So if there is anything I can do in my plug-in, let me know. Otherwise we need to wait until someone else has the time to look into that part of Jenkins.

          Thanks Ulli, I have built a copy of the findbugs plugin with that logging set to FINE to work around it for now.
          Better to get it fixed in core.

          Geoff Cummings added a comment - Thanks Ulli, I have built a copy of the findbugs plugin with that logging set to FINE to work around it for now. Better to get it fixed in core.

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          src/com/trilead/ssh2/channel/Channel.java
          http://jenkins-ci.org/commit/trilead-ssh2/f1353cc0e0aa1b1e6bc845236e4a2530ea3103fd
          Log:
          [FIXED JENKINS-18836][FIXED JENKINS-18879][FIXED JENKINS-19619] remove double call of freeupWindow(len); when using ssh-slaves 0.27+

          • the more performant code path is only followed when using SSH Slaves 0.27+
          • the double call causes the channel to get torn down
          • thus excessive logging to stderr on the slave side of the connection will cause the connection to tear down
          • removing the duplicate call resolves the issue

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: src/com/trilead/ssh2/channel/Channel.java http://jenkins-ci.org/commit/trilead-ssh2/f1353cc0e0aa1b1e6bc845236e4a2530ea3103fd Log: [FIXED JENKINS-18836] [FIXED JENKINS-18879] [FIXED JENKINS-19619] remove double call of freeupWindow(len); when using ssh-slaves 0.27+ the more performant code path is only followed when using SSH Slaves 0.27+ the double call causes the channel to get torn down thus excessive logging to stderr on the slave side of the connection will cause the connection to tear down removing the duplicate call resolves the issue

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          changelog.html
          core/pom.xml
          http://jenkins-ci.org/commit/jenkins/bb265c5e95b0fe39128720b903914236962db41b
          Log:
          [FIXED JENKINS-18836][FIXED JENKINS-18879][FIXED JENKINS-19619] Upgrade trilead-ssh to version with the fix

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: changelog.html core/pom.xml http://jenkins-ci.org/commit/jenkins/bb265c5e95b0fe39128720b903914236962db41b Log: [FIXED JENKINS-18836] [FIXED JENKINS-18879] [FIXED JENKINS-19619] Upgrade trilead-ssh to version with the fix

          Fixed in Jenkins core 1.536

          Stephen Connolly added a comment - Fixed in Jenkins core 1.536

          dogfood added a comment -

          Integrated in jenkins_main_trunk #2938
          [FIXED JENKINS-18836][FIXED JENKINS-18879][FIXED JENKINS-19619] Upgrade trilead-ssh to version with the fix (Revision bb265c5e95b0fe39128720b903914236962db41b)

          Result = UNSTABLE
          Stephen Connolly : bb265c5e95b0fe39128720b903914236962db41b
          Files :

          • changelog.html
          • core/pom.xml

          dogfood added a comment - Integrated in jenkins_main_trunk #2938 [FIXED JENKINS-18836] [FIXED JENKINS-18879] [FIXED JENKINS-19619] Upgrade trilead-ssh to version with the fix (Revision bb265c5e95b0fe39128720b903914236962db41b) Result = UNSTABLE Stephen Connolly : bb265c5e95b0fe39128720b903914236962db41b Files : changelog.html core/pom.xml

          Ulli Hafner added a comment -

          Thanks for fixing that Stephen!

          Ulli Hafner added a comment - Thanks for fixing that Stephen!

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          core/pom.xml
          http://jenkins-ci.org/commit/jenkins/1bb06ada301496ebed6d212188d1b7c9d006317b
          Log:
          [FIXED JENKINS-18836][FIXED JENKINS-18879][FIXED JENKINS-19619] Upgrade trilead-ssh to version with the fix

          (cherry picked from commit bb265c5e95b0fe39128720b903914236962db41b)

          Conflicts:
          changelog.html

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: core/pom.xml http://jenkins-ci.org/commit/jenkins/1bb06ada301496ebed6d212188d1b7c9d006317b Log: [FIXED JENKINS-18836] [FIXED JENKINS-18879] [FIXED JENKINS-19619] Upgrade trilead-ssh to version with the fix (cherry picked from commit bb265c5e95b0fe39128720b903914236962db41b) Conflicts: changelog.html

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          src/com/trilead/ssh2/channel/Channel.java
          http://jenkins-ci.org/commit/trilead-ssh2/5811ddd7ae15670a4f9ad345352613b3f2f2db97
          Log:
          JENKINS-22938 SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis

          The fix for JENKINS-18836, JENKINS-18879, JENKINS-19619 was incorrect in its analysis.

          • There is no call to getChannelData() on the new code path, so thus you cannot have two calls of freeupWindow()
          • The problem with the original call to freeupWindow() is that it is on the receiver thread. You should not mix the responsibilities. Blocking the receiver thread to send a message will negatively impact performance and connection stability.
          • The correct solution is to push the freeupWindow onto the async queue thus the ACK gets sent and the purity of the receiving thread can be maintained.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: src/com/trilead/ssh2/channel/Channel.java http://jenkins-ci.org/commit/trilead-ssh2/5811ddd7ae15670a4f9ad345352613b3f2f2db97 Log: JENKINS-22938 SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis The fix for JENKINS-18836 , JENKINS-18879 , JENKINS-19619 was incorrect in its analysis. There is no call to getChannelData() on the new code path, so thus you cannot have two calls of freeupWindow() The problem with the original call to freeupWindow() is that it is on the receiver thread. You should not mix the responsibilities. Blocking the receiver thread to send a message will negatively impact performance and connection stability. The correct solution is to push the freeupWindow onto the async queue thus the ACK gets sent and the purity of the receiving thread can be maintained.

            drulli Ulli Hafner
            gcummings Geoff Cummings
            Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: