  1. Jenkins
  2. JENKINS-64992

Cannot connect agents to Jenkins on Debian

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels:
      None

      Description

      I installed the jenkins package from http://pkg.jenkins.io/debian-stable on a freshly installed Debian Buster machine using apt. After the initial configuration with the recommended plugins, I added a new node and tried to connect this agent.

      Initially everything looks fine, and the output of the agent.jar process shows

      INFO: Connected

      But a few minutes later the agent.jar process ends with the exception shown below:

      INFO: Protocol JNLP4-connect encountered an unexpected exception
      java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Remote closed connection without specifying reason
          at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
          at hudson.remoting.Engine.innerRun(Engine.java:744)
          at hudson.remoting.Engine.run(Engine.java:519)
      Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Remote closed connection without specifying reason
          at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:440)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
          at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:288)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
          at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
          at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:48)
          at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:247)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:118)
          at java.base/java.lang.Thread.run(Unknown Source)
          Suppressed: java.nio.channels.ClosedChannelException
          ... 7 more

      März 02, 2021 2:00:31 NACHM. hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: The server rejected the connection: None of the protocols were accepted
      java.lang.Exception: The server rejected the connection: None of the protocols were accepted
          at hudson.remoting.Engine.onConnectionRejected(Engine.java:829)
          at hudson.remoting.Engine.innerRun(Engine.java:769)
          at hudson.remoting.Engine.run(Engine.java:519)

      More agent.jar log output can be found in the attached Agent.log.

      On the Jenkins server side I can see the following exception in the log:

      2021-03-02 13:00:31.137+0000 [id=322] WARNING o.j.r.p.i.SSLEngineFilterLayer#onRecv: [JNLP4-connect connection from jenkins.windows.node/10.119.64.14:60826]
      java.lang.NullPointerException
          at jenkins.slaves.DefaultJnlpSlaveReceiver.afterProperties(DefaultJnlpSlaveReceiver.java:127)
          at org.jenkinsci.remoting.engine.JnlpConnectionState$2.invoke(JnlpConnectionState.java:394)
          at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:337)
          at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterProperties(JnlpConnectionState.java:391)
          at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler.onReceiveHeaders(JnlpProtocol4Handler.java:323)
          at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecv(ConnectionHeadersFilterLayer.java:196)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:668)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:369)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:117)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:668)
          at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:136)
          at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:160)
          at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:834)
      2021-03-02 13:00:31.137+0000 [id=322] SEVERE o.j.r.p.impl.NIONetworkLayer#ready: [JNLP4-connect connection from jenkins.windows.node/10.119.64.14:60826] Uncaught NullPointerException
      java.lang.NullPointerException
          at jenkins.slaves.DefaultJnlpSlaveReceiver.afterProperties(DefaultJnlpSlaveReceiver.java:127)
          at org.jenkinsci.remoting.engine.JnlpConnectionState$2.invoke(JnlpConnectionState.java:394)
          at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:337)
          at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterProperties(JnlpConnectionState.java:391)
          at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler.onReceiveHeaders(JnlpProtocol4Handler.java:323)
          at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecv(ConnectionHeadersFilterLayer.java:196)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:668)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:369)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:117)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:668)
          at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:136)
          at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:160)
          at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:834)

      More of the server's log output can be found in Jenkins.log and Node.log.

       

      From the timing of the events I assume the following behavior:

      1. The agent connects to the server without issues
      2. The server and agent communicate briefly, then the server hits a NullPointerException and closes the TCP channel
      3. The agent tries to send data over the closed channel and notices that it is already disconnected
      4. The agent tries to reconnect, but despite the NullPointerException from 2), the server still thinks the agent is connected and refuses the reconnect
      5. The agent crashes, since it cannot connect to the server

      So in my opinion the root cause is this NullPointerException on the Jenkins server side, a few seconds after an agent connects.
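
Editorial note: when comparing the agent-side and server-side traces, it can help to list only the exception headers and their "Caused by" links. The following is a minimal sketch in Python, not part of Jenkins; the function name exception_chain and the matching rule are assumptions chosen for illustration, and it only recognizes header lines that start with a fully qualified class name ending in "Exception" or "Error".

```python
import re

# Header line of a Java exception, optionally prefixed with "Caused by:".
# Stack frames ("at ...") and ordinary log lines do not match this pattern.
_HEADER = re.compile(
    r"^\s*(?:Caused by:\s*)?((?:[a-z_][\w$]*\.)+[A-Z][\w$]*)(?::\s*(.*))?\s*$"
)


def exception_chain(log_text):
    """Return (class_name, message) pairs for each exception header, in order."""
    chain = []
    for line in log_text.splitlines():
        match = _HEADER.match(line)
        if match and match.group(1).endswith(("Exception", "Error")):
            chain.append((match.group(1), match.group(2) or ""))
    return chain
```

Run against the agent trace above, this would list the ExecutionException followed by the ConnectionRefusalException that caused it, which makes it easier to line up with the NullPointerException on the server side.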

      I can reproduce this behavior with different agent PCs, and I have even reinstalled the Jenkins package on my Debian machine twice.

      The Windows PC was previously connected to a different Jenkins server (also on Debian Buster), where it operated without any issues. To switch to the new server, I deleted the agent.jar, the workspace and the Java cache in %temp% before downloading the new agent.jar from the new server. I also updated the JRE on the Windows agent during troubleshooting (from JRE 8 to JRE 11) to match the major version of the Java JRE.

        Attachments

        1. Agent.47jar.log
          3 kB
        2. Agent.log
          5 kB
        3. Jenkins.47jar.log
          34 kB
        4. Jenkins.log
          4 kB
        5. Node.47jar.log
          3 kB
        6. Node.log
          6 kB

          Issue Links

            Activity

            jthompson Jeff Thompson added a comment -

            The NullPointerException is probably a red herring. At least it was when I investigated it recently though it's possible there is some other sequence where it causes a real problem. This section of code is trying to deal with reconnects for Inbound TCP Agents. It hasn't worked correctly for some unknown, probably long, period of time. That's not as surprising or serious as it sounds because there are other reasons why the reconnects have limited value.

            I recently worked on a PR to avoid this NullPointerException and to get reconnects working, at least as well as they once did. See JENKINS-64510 and https://github.com/jenkinsci/jenkins/pull/5138 . This PR has been merged and should be in the upcoming weekly release.

            The real problem is probably something different, but it's not clear from the report.

            Which agent version are you using?

            I'm going to mark this as a duplicate of JENKINS-64510 based upon the NullPointerException, but I think the issue is really something different. Please try it again with Jenkins 2.282+ (when released) and agent version 4.7.

            markewaite Mark Waite added a comment -

            Tobias I run Jenkins from the Debian Buster Docker image with ssh agents connected from many different operating systems (CentOS, Debian, FreeBSD, OpenBSD, openSUSE, Oracle Linux, Ubuntu, and Windows) and hardware platforms (AMD64, ARM64, PPC64LE, s390x), a swarm agent connected from Debian 10, and an inbound agent connected from Windows 10. I've run the controller and the agents with both JDK 8 and JDK 11 for an extended period without any connection loss or exceptions.

            Can you describe precisely how you're connecting the agent to Jenkins so that others can attempt to duplicate the configuration? For example, are you using websocket for the connection? Are you using the -tcp flag to run the agent? Is the agent launched by Jenkins or launched by a separate process?

            tobiaszellner Tobias added a comment -

            Hello Mark,

            Hello Jeff,

            thanks for the replies. 

            @Mark: 

            Well, I'm also surprised by this behavior, because on my other Jenkins instance (basically the same installation with Debian Buster and Java 11) I don't have problems connecting agents either. To your questions:

            I'm starting the agent as a separate process from the Windows command line, where I simply run

            java -jar agent.jar -jnlpUrl https://jenkins.server:9090/computer/BASEIT31/slave-agent.jnlp -secret <secret> -workDir "D:\jenkins\work"

            I'm not using WebSocket for the connection, but the JNLP4 protocol over TLS (Inbound TCP Agent Protocol/4 (TLS encryption)).

             

            @Jeff:

            At the moment I'm using agent.jar version 4.5, since this is the one that I can download from my Jenkins instance.  I will try to upgrade to the weekly release of Jenkins tomorrow and then tell you if the behavior changed afterwards.

             

            tobiaszellner Tobias added a comment - - edited

            Hello Jeff,

            I tried out the weekly release today, so I'm running Jenkins 2.282 now. I also updated the agent.jar on my node machine with the new version, so I'm running agent version 4.7 now.

            But the node is still not working properly. On the other hand, the behavior changed slightly and the logs changed a lot, so I have also attached the output after the update:

            • Agent.47jar.log   output of the command line on the agent
            • Jenkins.47jar.log output of jenkins on the server (/var/log/jenkins/jenkins.log)
            • Node.47jar.log   output of the node log (/var/lib/jenkins/logs/slaves/BASEIT31/slave.log)

            When I start the agent by running java -jar agent.jar -jnlpUrl https://jenkins.server:9090/computer/BASEIT31/slave-agent.jnlp -secret <secret> -workDir "D:\jenkins\work" in cmd, it connects and everything looks fine.

            • The WebUI shows the node as connected
            • jenkins.log shows "Accepted JNLP4-connect connection #4 from /10.119.64.14"
            • node.log shows "Agent successfully connected and online"
            • The agent's cmd output shows "INFO: Connected"

            But when I now open the node's systemInfo page, something strange happens:

            • The WebUI shows three empty tables and reports "agent, version Unknown (agent is offline)"
            • The agent's cmd output shows "INFO: Terminated" and the agent reconnects within the same second (see Agent.47jar.log, line 34 ff.)
            • slave.log shows an exception entry starting with "ERROR: Connection terminated" (see Node.47jar.log, line 5 ff.)
            • jenkins.log shows an exception entry starting with "Caught exception evaluating: it.oSDescription in /computer/BASEIT31/systemInfo" (see Jenkins.47jar.log, line 7 ff.)

            As far as I can see, the differences from the behavior with the LTS version are:

            1. The agent.jar process no longer crashes. It just gets disconnected and reconnects on every RPC call (ping thread or manually requesting the agent's system data)
            2. The stack trace in jenkins.log is much more verbose and contains several inner exceptions
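
Editorial note: the disconnect/reconnect loop described here can be quantified from the agent's console output. The following is a minimal sketch in Python that counts how often an "INFO: Connected" line is later followed by an "INFO: Terminated" line. The two message strings are taken from the report; the function name count_reconnect_cycles is hypothetical, and the sketch assumes each message appears on a line of its own.

```python
def count_reconnect_cycles(agent_log_text):
    """Count how often a 'Connected' line is followed by a 'Terminated' line."""
    cycles = 0
    connected = False
    for line in agent_log_text.splitlines():
        text = line.strip()
        if text == "INFO: Connected":
            connected = True
        elif text == "INFO: Terminated" and connected:
            # One full connect -> terminate cycle has completed.
            cycles += 1
            connected = False
    return cycles
```

A count that grows with every ping interval or every visit to the systemInfo page would match the behavior described above, where each RPC call ends the connection.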

             

            I also configured Jenkins to run without HTTPS encryption, but this has no effect on the behavior at all.

            So this is still not a working setup for me. Do you have any more suggestions?

            Thanks 

             

            jthompson Jeff Thompson added a comment -

            HTTP vs. HTTPS shouldn't make any difference here. It would if you were running WebSockets. You could try running with the WebSocket implementation. It might give different results.

            Unfortunately, this is about what I expected you to observe with the latest update. As I stated originally, the NullPointerException was a red herring. The behavior was that the agent would connect, something would break the connection, and then the controller would attempt to run through the reconnect code, which has been broken for a long time. Fixing the reconnect code doesn't really help your case, because the fundamental problem still exists.

            Now, you're more clearly seeing the ChannelClosedException. Unfortunately, this type of issue is almost always system, environment, or networking related. This means that it works fine for other people and there isn't much capability for remotely troubleshooting this. For some reason, the channel has been closed. This almost always occurs from something outside Jenkins breaking the connection. In rare cases, people have tracked this down to a bad plugin interaction.

            The problem likely occurs from some networking issue that breaks the connection. It could also be some other resource issue. I recommend standard troubleshooting approaches. Try creating a very minimal reproduction case. Make sure it works when run somewhere besides a single machine. Try varying different aspects of the installation to see if anything changes.
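
Editorial note: one way to build such a minimal reproduction outside Jenkins is to open a plain TCP connection between the two machines and measure how long it survives. The following is a minimal sketch in Python; the function name idle_connection_lifetime and its parameters are hypothetical. It assumes a simple test listener on the server side that accepts the connection and keeps it open. Pointing it at the real Jenkins agent port is less conclusive, because Jenkins itself may close a connection that never sends a protocol header. A connection that is silently dropped by a firewall without a close or reset will also not be detected by this probe, since it sends no data.

```python
import socket
import time


def idle_connection_lifetime(host, port, max_seconds=600.0, poll_interval=1.0):
    """Open a TCP connection, send nothing, and report how long it stays open.

    Returns the seconds elapsed until the connection was closed or reset,
    or None if it was still open after max_seconds.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.settimeout(poll_interval)
        while time.monotonic() - start < max_seconds:
            try:
                data = sock.recv(1)
            except socket.timeout:
                continue  # nothing received yet; keep waiting
            except OSError:
                return time.monotonic() - start  # connection reset
            if data == b"":
                return time.monotonic() - start  # orderly close by the peer
    return None
```

Running the same probe once over the affected network and once over a simple local network gives a direct comparison that does not involve Jenkins or any plugins.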

            tobiaszellner Tobias added a comment -

            Hello Jeff,

            thanks a lot for the help.

            I have now switched my agent/master connection to a simple local network, and it is stable. It does not even matter whether the client runs on Java 8 or Java 11.

            I will now have to investigate with my IT department why these connections always get disconnected in our domain network.

             

            Greetings Tobias


              People

              Assignee:
              jthompson Jeff Thompson
              Reporter:
              tobiaszellner Tobias
              Votes:
              0
              Watchers:
              3
