JENKINS-48850

Random java.io.IOException: Unexpected termination of the channel

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Component/s: remoting, ssh-slaves-plugin
    • Labels:
      None
    • Environment:
      Jenkins server/slave OS: Ubuntu 14.04.5 LTS
      Jenkins server/slave openJDK: 8u141-b15-3~14.04
      Jenkins: 2.89.2
      SSH-slave-plugin: 1.23

      Description

      Related to: JENKINS-25858 and JENKINS-48810

      Per a suggestion from Oleg Nenashev,
      I'm opening a separate bug ticket for further investigation.

      Jenkins Server log:

      Dec 21, 2017 12:17:09 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
      SEVERE: I/O error in channel jenkins-smoke-slave03(192.168.100.94)
      java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
      Caused by: java.io.EOFException
              at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2638)
              at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3113)
              at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:853)
              at java.io.ObjectInputStream.<init>(ObjectInputStream.java:349)
              at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
      

      Jenkins Slave log:

      Dec 21, 2017 12:15:09 PM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
      INFO: rate(1min) = 381.9±905.3/sec; rate(5min) = 363.6±923.4/sec; rate(15min) = 335.3±927.4/sec; rate(total) = 100.3±521.0/sec; N = 35,086
      Dec 21, 2017 12:16:09 PM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
      INFO: rate(1min) = 272.0±705.3/sec; rate(5min) = 324.8±863.5/sec; rate(15min) = 322.8±905.9/sec; rate(total) = 100.3±521.0/sec; N = 35,098
      Dec 21, 2017 12:17:09 PM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
      INFO: rate(1min) = 321.9±768.9/sec; rate(5min) = 333.2±865.8/sec; rate(15min) = 326.3±905.0/sec; rate(total) = 100.4±521.2/sec; N = 35,110
      ERROR: Connection terminated
      java.io.EOFException
              at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2638)
              at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3113)
              at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:853)
              at java.io.ObjectInputStream.<init>(ObjectInputStream.java:349)
              at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
      Caused: java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
      ERROR: Socket connection to SSH server was lost
      java.io.IOException: Peer sent DISCONNECT message (reason code 2): Packet corrupt
              at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:779)
              at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:502)
              at java.lang.Thread.run(Thread.java:748)
      Slave JVM has not reported exit code before the socket was lost
      [12/21/17 12:17:09] [SSH] Connection closed.
      

      This "Unexpected termination of the channel" error has happened every day (3 days in a row) on random slaves since I updated Jenkins core and all the plugins to the latest versions on Dec 19, 2017.

      Jenkins core and the plugins were previously updated in April 2017:

      Jenkins Core: 2.46.2
      SSH-slave plugin: 1.16

      Because the random "Unexpected termination of the channel" errors were more frequent than usual,
      on Dec 22, 2017 I downgraded Jenkins core and the SSH-slave plugin to:

      Jenkins Core: 2.60.3 (whose remoting should be the same as in 2.46.2, based on the changelog)
      SSH-slave plugin: 1.16

      The issue has eased since the downgrade,
      but the random "Unexpected termination of the channel" error has still happened a couple of times so far.
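
A note on the bottom frames shared by both traces above: java.io.ObjectInputStream reads a 4-byte stream header in its constructor, so opening one on a stream that has already ended fails with EOFException before any command is read. The sketch below reproduces only that JDK behaviour in isolation. The class and method names are made up for illustration, and it says nothing about why the underlying connection was closed.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;

public class EmptyStreamDemo {
    /** Returns true if opening an ObjectInputStream on the given bytes fails with EOFException. */
    static boolean failsWithEof(byte[] bytes) {
        // The constructor calls readStreamHeader(), which needs 4 bytes (0xACED, 0x0005).
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return false; // header was read successfully
        } catch (EOFException e) {
            return true;  // the stream ended before the header was complete
        } catch (IOException e) {
            return false; // some other I/O problem, e.g. a corrupt header
        }
    }

    public static void main(String[] args) {
        // An empty stream stands in for a connection that was closed by the peer.
        System.out.println(failsWithEof(new byte[0]));
    }
}
```

In other words, the EOFException in these traces is a symptom of the transport having gone away, which is consistent with the "Socket connection to SSH server was lost" line in the slave log.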

            Activity

            jthompson Jeff Thompson added a comment -

            Andrew Marlow, a lot of people have good success keeping channels alive over many builds or long builds. Certainly there are also a number of cases where people have reliability problems for a wide variety of reasons. Sometimes they're able to stabilize or strengthen their environment and these problems disappear. Most of the time they don't provide enough information for anyone who isn't local to diagnose anything. There isn't much reason to keep multiple, duplicate tickets open that all lack information or continued response.

            With your acknowledged network unreliability, you may also want to give the remoting-kafka-plugin a try. I'd suggest running a test build environment or trying it with a few jobs. One of the reasons for creating the new plugin was a hope for improved reliability. Your unreliable network may be a good test case for that.

            My network is pretty reliable so I can't reproduce any of these reports or give a good workout to the remoting-kafka-plugin.

            jthompson Jeff Thompson added a comment -

            There haven't been any updates to this for a while, and there is insufficient information and diagnostics to make any progress on this report. If anyone tries out the remoting-kafka-plugin to see if it provides improvements, that would be good information. Otherwise, we may just have to consider this as due to unreliable networks and close it as Cannot Reproduce.

            jthompson Jeff Thompson added a comment -

            It's been two months since any information was provided that might give hints on the cause or reproduction, and we're no closer to having any verification that this is actually a code rather than environment issue, nor whether the remoting-kafka-plugin helps. I'm going to close it out as Cannot Reproduce. If someone is able to provide additional information, please do and we can re-open it.

            dshvedchenko Denis Shvedchenko added a comment - edited

            Jeff Thompson

            Hello, we also hit this issue, and I think I can offer a partial explanation. In our case it is related to the slave startup routines

            (related issue JENKINS-38487)

            The agent went offline after several seconds:

            XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1  -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=512m -Dgroovy.use.classvalue=true -jar remoting.jar -workDir /data/pentaho/jenkins/test-head01
            Jan 24, 2019 3:16:49 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
            INFO: Using /data/xxxxxxxxxxxxxxxx/remoting as a remoting work directory
            Both error and output logs will be printed to /data/xxxxxxxxxxxxxxx/remoting
            <===[JENKINS REMOTING CAPACITY]===>channel started
            Remoting version: 3.27
            This is a Unix agent
            Evacuated stdout
            just before slave ph-slave-01 gets online ...
            executing prepare script ...
            setting up slave ph-slave-01 ...
            slave setup done.
            Jan 24, 2019 3:16:51 AM org.jenkinsci.remoting.util.AnonymousClassWarnings warn
            WARNING: Attempt to (de-)serialize anonymous class org.jenkinsci.plugins.envinject.EnvInjectComputerListener$2; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
            ERROR: null
            java.util.concurrent.CancellationException
            	at java.util.concurrent.FutureTask.report(FutureTask.java:121)
            	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
            	at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:902)
            	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:294)
            	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
            	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at java.lang.Thread.run(Thread.java:748)
            [01/24/19 03:17:08] Launch failed - cleaning up connection
            [01/24/19 03:17:08] [SSH] Connection closed.
            ERROR: Connection terminated
            java.io.EOFException
            	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
            	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
            	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
            	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
            	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
            	at hudson.remoting.Command.readFrom(Command.java:140)
            	at hudson.remoting.Command.readFrom(Command.java:126)
            	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
            	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
            Caused: java.io.IOException: Unexpected termination of the channel
            	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77) 

             

            I found that some jobs even managed to start, but were then abruptly terminated.

            I found a possible explanation in the master log:

            The mail watcher could not send the email about the slave having started. Some mailer configuration had been changed recently, and after that change the slave could not survive startup.

            Jan 24, 2019 3:16:45 AM org.jenkinsci.plugins.mailwatcher.MailWatcherNotification log
            INFO: mail-watcher-plugin: unable to notify
            javax.mail.MessagingException: Could not connect to SMTP host: xxxxxxxxx, port: 465;
              nested exception is:
                    java.net.SocketTimeoutException: connect timed out
                    at com.sun.mail.smtp.SMTPTransport.openServer(SMTPTransport.java:1934)
                    at com.sun.mail.smtp.SMTPTransport.protocolConnect(SMTPTransport.java:638)
                    at javax.mail.Service.connect(Service.java:295)
                    at javax.mail.Service.connect(Service.java:176)
                    at javax.mail.Service.connect(Service.java:125)
                    at javax.mail.Transport.send0(Transport.java:194)
                    at javax.mail.Transport.send(Transport.java:124)
                    at org.jenkinsci.plugins.mailwatcher.MailWatcherMailer.send(MailWatcherMailer.java:135)
                    at org.jenkinsci.plugins.mailwatcher.MailWatcherMailer.send(MailWatcherMailer.java:128)
                    at org.jenkinsci.plugins.mailwatcher.MailWatcherNotification.send(MailWatcherNotification.java:156)
                    at org.jenkinsci.plugins.mailwatcher.WatcherComputerListener$Notification$Builder.send(WatcherComputerListener.java:181)
                    at org.jenkinsci.plugins.mailwatcher.WatcherComputerListener.onOnline(WatcherComputerListener.java:101)
                    at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:693)
                    at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:432)
                    at hudson.plugins.sshslaves.SSHLauncher.startAgent(SSHLauncher.java:1034)
                    at hudson.plugins.sshslaves.SSHLauncher.access$500(SSHLauncher.java:128)
                    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:868)
                    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:833)
                    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                    at java.lang.Thread.run(Thread.java:748)
            Caused by: java.net.SocketTimeoutException: connect timed out
                    at java.net.PlainSocketImpl.socketConnect(Native Method)
             

             

            So Jenkins could not send the notification mail.

            The mail server used to listen on both ports 25 and 465; after the changes, it listens only on port 25/tcp.

            Once I fixed the mailer configuration on the master, all the errors went away and the slave could start.
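
A quick way to rule out this kind of failure is to check, from the master host, whether the configured SMTP port accepts TCP connections at all. The following is a minimal sketch using only the JDK. The class name, host name, and timeout are placeholders for illustration, not values from this ticket.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // connect() throws SocketTimeoutException (an IOException) when the timeout expires
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or host could not be resolved
        }
    }

    public static void main(String[] args) {
        // Placeholder host; substitute the SMTP server configured on the master.
        String host = args.length > 0 ? args[0] : "smtp.example.com";
        for (int port : new int[] {25, 465}) {
            System.out.println(host + ":" + port + " reachable = " + canConnect(host, port, 5000));
        }
    }
}
```

A port that reports as unreachable here would be expected to produce the same "connect timed out" or a "connection refused" error in the mail-watcher log.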

             

            <===[JENKINS REMOTING CAPACITY]===>channel started
            Remoting version: 3.27
            This is a Unix agent
            Evacuated stdout
            just before slave ph-slave-01 gets online ...
            executing prepare script ...
            setting up slave ph-slave-01 ...
            slave setup done.
            Jan 24, 2019 3:19:44 AM org.jenkinsci.remoting.util.AnonymousClassWarnings warn
            WARNING: Attempt to (de-)serialize anonymous class org.jenkinsci.plugins.envinject.EnvInjectComputerListener$2; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
            [StartupTrigger] - Scanning jobs for node ph-slave-01
            Agent successfully connected and online 

             

            Note the "Agent successfully connected and online" line at the end now.

            I hope this gives you some information on how to reproduce some cases.

            It seems to mean that while the master cannot send the notification, and so cannot cleanly finish the startup procedure for the slave start event, it breaks the connection with the slave.
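
The CancellationException in the failed launch log is what Future.get() throws once a task has been cancelled, for example by ExecutorService.invokeAll with a timeout. The sketch below shows that JDK mechanism only. It assumes, without the plugin source having been checked here, that the SSH launcher waits for the launch task under a timeout. The sleep is a hypothetical stand-in for a listener blocked on an unreachable SMTP port, and all names are made up for illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class LaunchTimeoutDemo {
    /** Runs a task lasting taskMillis under a timeout and reports how it ended. */
    static String launch(long taskMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> launchTask = () -> {
                Thread.sleep(taskMillis); // hypothetical stand-in for a blocking listener call
                return true;
            };
            // invokeAll cancels every task that has not completed when the timeout expires.
            List<Future<Boolean>> results =
                    pool.invokeAll(Collections.singletonList(launchTask), timeoutMillis, TimeUnit.MILLISECONDS);
            return results.get(0).get() ? "online" : "failed";
        } catch (CancellationException e) {
            return "cancelled"; // the task was still running at the timeout
        } catch (ExecutionException e) {
            return "error";     // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(launch(2000, 100)); // slow task, short timeout
        System.out.println(launch(10, 2000));  // fast task, generous timeout
    }
}
```

If the launcher does work this way, anything that delays the startup listeners past the launch timeout would end in the same cancellation and connection cleanup, regardless of network quality.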

            jthompson Jeff Thompson added a comment -

            Thanks for providing that information, Denis Shvedchenko. Hopefully it will be useful for others who encounter connection failures. I'm glad you were able to track it down and resolve it.


              People

              Assignee:
              jthompson Jeff Thompson
              Reporter:
              totoroliu Rick Liu
              Votes:
              5
              Watchers:
              11
