
JENKINS-67434: Intermittent unstable connection between agent and master

      Our application team is trying to connect their agents to the Jenkins master. The initial connectivity was good, but after some time we see a disruption.

      The build nodes connect to the Jenkins master successfully, but sometimes the Jenkins master does not reply back (showing a timeout). We are using the WebSocket method.

       

      Jenkins version: 2.319.1 (LTS)

      Launch method: Launch agent by connecting it to the controller.

      Agent name: HSM - Pipe Plaza

      Agent version: 4.11.2

      JDK version: 1.8

      Agent logs: Refer to the attachment

      Jenkins Master logs: Refer to the attachment. 
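
      For reference, an inbound agent of this version connecting over WebSocket is typically launched with a command along the following lines. The controller URL, agent name, secret, and work directory below are placeholders, not values from this installation:

          java -jar agent.jar \
            -jnlpUrl https://jenkins.example.com/computer/AGENT_NAME/jenkins-agent.jnlp \
            -secret <agent-secret> \
            -workDir /home/jenkins/agent \
            -webSocket

      The -webSocket option tunnels the remoting channel over the controller's HTTP(S) port rather than the dedicated inbound TCP port, which is why the channel-closed events in the log excerpt below appear on Jetty (winstone) threads.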

       

      2021-12-21 04:20:34.830+0000 [id=181293] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Jetty (winstone)-181293 for HSM - Dry Dry Desert terminated: java.nio.channels.ClosedChannelException

       

      During agent connection we can see the warning below in the logs. What does this warning refer to?

      2021-12-21 04:25:21.601+0000 [id=181571] WARNING hudson.model.Slave#reportLauncherCreateError: Issue with creating launcher for agent HSM - Grumble Volcano. The agent has not been fully initialized yetProbably there is a race condition with Agent reconnection or disconnection, check other log entries
      java.lang.IllegalStateException: No remoting channel to the agent OR it has not been fully initialized yet
      at hudson.model.Slave.reportLauncherCreateError(Slave.java:540)
      at hudson.model.Slave.createLauncher(Slave.java:512)
      at org.jenkinsci.plugins.workflow.support.DefaultStepContext.makeLauncher(DefaultStepContext.java:160)
      at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:83)
      at org.jenkinsci.plugins.docker.workflow.WithContainerStep$Callback.finished(WithContainerStep.java:391)
      at org.jenkinsci.plugins.workflow.steps.BodyExecutionCallback$TailCall.onFailure(BodyExecutionCallback.java:128)
      at org.jenkinsci.plugins.workflow.cps.CpsBodyExecution$FailureAdapter.receive(CpsBodyExecution.java:361)
      at com.cloudbees.groovy.cps.impl.ThrowBlock$1.receive(ThrowBlock.java:68)
      at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
      at com.cloudbees.groovy.cps.Next.step(Next.java:83)
      at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
      at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
      at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
      at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
      at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
      at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
      at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
      at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
      at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:402)
      at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
      at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:314)
      at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:278)
      at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)

       

        1. agent_logs
          76 kB
        2. jenkins.log-20211221.gz
          601 kB
        3. screenshot.PNG
          103 kB


          Jeff Thompson added a comment -

          Generally, disconnects and other transport problems result from some local issue, usually something in the system, networking, or environment conditions. Sometimes they are caused by bad plugin interactions.

          It appears that there are occasionally issues with disconnects when using WebSockets. There are several issues filed here about that, but I haven't seen that anyone has figured them out yet.

          You can try asking for other Jenkins users' experiences on the Jenkins user email list.


          Aswin Raj added a comment -

          Everything worked when using SSH instead of WebSocket. Also, this appears to be related to load (the higher the workload, the more likely it is to happen).

          Any ideas why that happened and how to mitigate it?


          Jeff Thompson added a comment -

          Generally, SSH agents are the most reliable agent type in Jenkins, for a variety of reasons. They are also more secure.

          The implementations between SSH and WebSocket are very different, so it's not surprising that there are differences in behavior under load.

          There are a lot of reasons why higher load can cause connection problems. As I said previously, most of them are related to the local setup. I'm not familiar with how the underlying WebSocket library handles timeouts and related issues; possibly that degrades under heavy load.
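
          One setting that is sometimes tuned when channels time out under load is the controller's built-in channel ping thread. The system properties below are the standard hudson.slaves.ChannelPinger properties; the values are only illustrative, and the defaults are on the order of a few minutes:

              java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=120 \
                   -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=90 \
                   -jar jenkins.war

          A shorter interval means a dead channel is detected, and the agent reconnected, sooner, at the cost of slightly more keep-alive traffic.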


          Aswin Raj added a comment -

          According to the article https://www.jenkins.io/blog/2020/02/02/web-socket/, the WebSocket method of connecting agents has not yet been tested under heavy build loads. Could you please confirm whether WebSocket is still in the beta phase in Jenkins 2.319?

          Also, how do we enable debug mode in Jenkins?


          Zuhair Haider added a comment -

          A similar issue is impacting our Jenkins installation. We had the same setup running smoothly for a long time, and only recently did the agents start getting disconnected abruptly.

          The only thing I could observe is that it started occurring after the JDK 8u311 update. The same agents run fine with the same load when jobs use a JDK 11 environment.
          Failed to send back a reply to the request hudson.remoting.Request$2@19445a87: hudson.remoting.ChannelClosedException: Channel "unknown": Protocol stack cannot write data anymore. It is not open for write
          Dec 27, 2021 10:07:59 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Terminated
          Dec 27, 2021 10:08:01 PM hudson.util.ProcessTree getKillers
          WARNING: Failed to obtain killers
          hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@4e6277a3:JNLP4-connect connection to jenkins.bravurasolutions.net/192.168.156.80:45097": Remote call on JNLP4-connect connection to jenkins.bravurasolutions.net/192.168.156.80:45097 failed. The channel is closing down or has closed down
          Any update on resolving this would be very helpful.

          Thanks

           


          Vishal added a comment -

          We still see this issue with Jenkins version 2.401.1 (and JDK 11).
          Is there any workaround?


            Assignee: Unassigned
            Reporter: Aswin Raj (wri1kor)
            Votes: 1
            Watchers: 5
