
INFO: Failed to synchronize IO streams on the channel

      I'm seeing the following error on our agents. When it occurs, the job hangs and makes no further progress; we have to disconnect and reconnect the agent for the job to continue. On the agent we see the following in the logs:

       

      Jul 07, 2020 6:45:12 PM org.jenkinsci.remoting.util.AnonymousClassWarnings warn
      WARNING: Attempt to (de-)serialize anonymous class org.jenkinsci.plugins.pipeline.utility.steps.fs.TeeStep$1; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
      Jul 07, 2020 7:45:42 PM hudson.Launcher$RemoteLaunchCallable$1 join
      INFO: Failed to synchronize IO streams on the channel hudson.remoting.Channel@ed17bee:channel
      java.lang.InterruptedException
          at java.lang.Object.wait(Native Method)
          at hudson.remoting.Request.call(Request.java:177)
          at hudson.remoting.Channel.call(Channel.java:997)
          at hudson.remoting.Channel.syncIO(Channel.java:1730)
          at hudson.Launcher$RemoteLaunchCallable$1.join(Launcher.java:1328)
          at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:498)
          at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:931)
          at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:905)
          at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:857)
          at hudson.remoting.UserRequest.perform(UserRequest.java:211)
          at hudson.remoting.UserRequest.perform(UserRequest.java:54)
          at hudson.remoting.Request$2.run(Request.java:369)
          at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)

      Jul 07, 2020 7:46:12 PM hudson.Launcher$RemoteLaunchCallable$1 join
      INFO: Failed to synchronize IO streams on the channel hudson.remoting.Channel@ed17bee:channel
      java.lang.InterruptedException
          [same backtrace as above]

      There are several more occurrences of the last error message in the logs, all with the same backtrace.

          [JENKINS-62999] INFO: Failed to synchronize IO streams on the channel

          Spenser Gilliland added a comment -

          After removing the "tee" step from my Jenkinsfile, I'm no longer seeing this problem.

          Jeff Thompson added a comment -

          Interesting. I'm glad you were able to figure something out, to at least start to troubleshoot this. Often things that look like Remoting failures are caused by plugins (or various environmental, system, or configuration issues). Sometimes it's an interaction between different plugins or plugin operations that causes problems. These can be very difficult to diagnose and reproduce.

          If you can generate a simple reproduction case, someone might be able to take a look at it. Or you can dig into the details yourself.

          My guess is that something is going on with the tee operation and it's getting hung up. That then leads to other problems.
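
          A minimal reproduction sketch along those lines might look like the following. It is untested, the output volume is an arbitrary guess, and the tee step here is the one provided by the pipeline-utility-steps plugin (the same TeeStep seen in the warning above):

          node {
              // Hypothetical reproducer: wrap a very chatty command in the
              // tee step and watch for the channel hang described above.
              tee('chatty.log') {
                  // Roughly 1 GB of output; the exact volume needed to
                  // trigger the hang is a guess, not from this issue.
                  sh 'yes "some very verbose test output" | head -c 1000000000'
              }
          }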


          Jeff Thompson added a comment -

          [I've edited the original description to use the accepted terminology. Please use "agent".]


          Spenser Gilliland added a comment -

          Thanks for taking a look at this. I haven't fully verified this, but we have a shared library function called record which looks like this:

           

          // Import needed for the catch below (from workflow-step-api).
          import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException

          /**
           * Record a function and archive its log
           *
           * @param filename         filename to store the log in
           * @param func             function to record
           * @param abort_on_failure propagate error on failure
           */
          def record(filename, func, abort_on_failure = false) {
              try {
                  //tee(filename) {
                  func()
                  //}
              } catch (FlowInterruptedException e) {
                  // If interrupted (e.g. by a timeout), rethrow
                  throw e
              } catch (e) {
                  unstable(e.toString())
                  if (abort_on_failure) throw e
              } finally {
                  //archiveArtifacts artifacts: filename
              }
          }

           

          To work around the problem, you can see how we commented out the "tee" step (and archiveArtifacts). It was hanging only when "func()" pushed a lot of console output (this happened during a failing test with overly verbose logging and no limiter). After removing the "tee", this function was able to push an 8 GB log file without issue. Typically this is the lowest-level function in a stack that looks something like this:

          try {
            node(...) {
              image.inside(...) {
                utils.record('test-0.log') {
                  sh "pytest"
                }
              }
            }
          } catch(e) {
            emailext(...)
            throw e
          }
              

           

           


          Cristian added a comment -

          FWIW I was able to reproduce the issue; simply replacing the tee Jenkins step with the tee shell command fixed it. A few things I have noticed, which may be real or in my imagination:

          • It seems to happen when there is a lot of data to "tee".
          • It may happen more in low-bandwidth situations.
          • I could see the agent start sending ~13 Mbps of data constantly, until the job was stopped.
          • "Attempt to (de-)serialize anonymous class" is the only thing I see at first; I don't see the "Failed to synchronize IO streams on the channel" message until I stop the job.
          • Other jobs started after this happened showed strange behaviour, for example the last X lines of a step log appearing in a loop (so the jobs never finished).

          My setup was two non-virtualized machines connected via SSH, running the latest LTS Jenkins with up-to-date plugins and Java 11 on both the controller and the agent.

           

          My impression is that the tee step has problems when it receives data faster than it can send it.
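
          For illustration, a minimal sketch of that shell-side workaround. The bash shebang and the pipefail handling are assumptions added here, not something from this thread (plain /bin/sh may not support pipefail, and without it the sh step would report tee's exit status rather than pytest's):

          node {
              // Run the POSIX tee on the agent so the log duplication stays
              // off the remoting channel entirely.
              sh '''#!/bin/bash
                  set -o pipefail
                  pytest 2>&1 | tee test-0.log
              '''
              // Archive from the workspace file afterwards.
              archiveArtifacts artifacts: 'test-0.log'
          }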


          dor s added a comment - edited

          I have a Jenkins controller that runs as a pod in my EKS cluster; the pod is behind an ALB (a load balancer with a 30-minute timeout).
          I have a Jenkins agent on a Windows 10 VM at my office, connected via WebSocket.
          My job runs a Selenium test, and after 3 hours I got the same error, without any retry.

          Is there any way to add additional retries here? Or to skip the problematic streamed data?
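
          One hedged option is the built-in retry step, sketched below. Note that it re-runs the whole block from scratch rather than resuming or skipping the problematic stream, and the agent label and test command here are placeholders:

          retry(3) {
              node('windows-agent') { // placeholder label
                  // Placeholder for the actual Selenium test invocation.
                  bat 'run_selenium_tests.cmd'
              }
          }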


            Assignee: Unassigned
            Reporter: Spenser Gilliland (spenser309)
            Votes: 1
            Watchers: 8