• Bug
    • Resolution: Fixed
    • Major
    • kubernetes-plugin, remoting
    • None
    • Jenkins v2.89.2
      Kubernetes Plugin v1.3.3
    • Remoting 3.28

      While provisioning slaves from a private Kubernetes instance, we've found that a lot of slaves terminate with the following (or similar) stack trace on the slave's side:

       

      INFO: Setting up slave: kube1-medium-r9zf4
      Apr 10, 2018 11:02:05 AM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Apr 10, 2018 11:02:05 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using /home/<user>/workDir/remoting as a remoting work directory
      Apr 10, 2018 11:02:06 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server ...
      Apr 10, 2018 11:02:06 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, CLI2-connect, JNLP-connect, Ping, CLI-connect, JNLP2-connect]
      Apr 10, 2018 11:02:06 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful <...>
      Apr 10, 2018 11:02:06 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Apr 10, 2018 11:02:06 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to <Jenkins Master>
      Apr 10, 2018 11:02:06 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect
      Apr 10, 2018 11:02:07 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Remote identity confirmed: <...>
      Apr 10, 2018 11:02:07 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected
      Apr 10, 2018 11:02:14 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      Apr 10, 2018 11:02:14 AM hudson.remoting.UserRequest perform
      WARNING: LinkageError while performing UserRequest:jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2@3e708317
      java.lang.NoClassDefFoundError: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller$2$1
              at jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2.call(JnlpSlaveRestarterInstaller.java:71)
              at jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2.call(JnlpSlaveRestarterInstaller.java:53)
              at hudson.remoting.UserRequest.perform(UserRequest.java:207)
              at hudson.remoting.UserRequest.perform(UserRequest.java:53)
              at hudson.remoting.Request$2.run(Request.java:358)
              at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
              at java.util.concurrent.FutureTask.run(Unknown Source)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
              at hudson.remoting.Engine$1$1.run(Engine.java:98)
              at java.lang.Thread.run(Unknown Source)
      Caused by: java.lang.ClassNotFoundException: jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1
              at java.net.URLClassLoader.findClass(Unknown Source)
              at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:159)
              at java.lang.ClassLoader.loadClass(Unknown Source)
              at java.lang.ClassLoader.loadClass(Unknown Source)
              ... 11 more
      
      

      The class reported as not found is not consistently the same. I've seen `FilePathFilter`, `LaunchConfiguration`, `StringBuilderWriter`, and some others reported as well. Sometimes there are also exceptions related to `JarCacheSupport` not being able to resolve jars (I don't have the exact stack trace at hand; I will post it if I find it again).
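Because the missing class varies between occurrences, it can help to tally which classes are named across collected agent logs. The following is a minimal sketch, not part of the original report: the function name `tally_missing_classes` and the `sample` text are illustrative assumptions, and the sample lines are copied from the traces quoted in this issue.

```python
import re
from collections import Counter

# Matches lines such as
# "java.lang.NoClassDefFoundError: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller$2$1"
MISSING_CLASS = re.compile(r"java\.lang\.NoClassDefFoundError:\s*(\S+)")


def tally_missing_classes(log_text):
    """Count how often each class is named in a NoClassDefFoundError line.

    Hypothetical helper for triage; it only reads text and knows nothing
    about Jenkins or Remoting internals.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = MISSING_CLASS.search(line)
        if match:
            # The JVM prints the internal name with '/'; convert to dotted form.
            counts[match.group(1).replace("/", ".")] += 1
    return counts


# Sample lines taken from the agent logs quoted in this issue.
sample = """\
WARNING: LinkageError while performing UserRequest:jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2@3e708317
java.lang.NoClassDefFoundError: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller$2$1
Caused by: java.lang.ClassNotFoundException: jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1
SEVERE: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller
java.lang.NoClassDefFoundError: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller
"""

if __name__ == "__main__":
    for name, count in tally_missing_classes(sample).most_common():
        print(count, name)
```

Only the `NoClassDefFoundError` lines are counted; the nested `Caused by: ClassNotFoundException` lines are skipped so that each failure is tallied once.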

      On the master's side, these exceptions generally manifest as `ChannelClosedException`s, or weird Exception-less failures in pipeline branches.

      ERROR: Issue with creating launcher for agent kube1-medium-r9zf4. The agent has not been fully initialized yet
      
      ERROR: Issue with creating launcher for agent kube1-medium-r9zf4. The agent has not been fully initialized yet
      
      remote file operation failed: /home/<user>/workspace/<job_name> at hudson.remoting.Channel@6639429c:JNLP4-connect connection from <some-host-name>/<ip-address>:60326: hudson.remoting.ChannelClosedException: Remote call on JNLP4-connect connection from <some-host-name>/<ip-address>:60326 failed. The channel is closing down or has closed down
      

      I haven't been able to consistently reproduce the error, but it does manifest enough to be causing major pain to users (especially since we extensively use pipelines with a large number of parallel nodes, and a failure in any one of the nodes causes the entire pipeline to fail).

          [JENKINS-50730] NoClassDefFound errors in Cloud Slaves

          Jeff Thompson added a comment -

          alonlavi, there are different opinions on how issue reports like this should be handled. I tend to follow the approach that if an issue cannot be sufficiently described and reproduced, it should be closed as Cannot Reproduce and re-opened if someone obtains better information, particularly when a significant number of the reporters have seen the problem go away after environmental or version changes.

          But, let's keep this one open for a while longer and see if we get any better information.


          Jeff Thompson added a comment -

          I saw this ClassNotFoundException yesterday. It was one of the last things in a string of exceptions and messages in my agent logs on my Windows 10 system when it went to sleep. I couldn't see that it caused any failures or problems. Things started up without any problem when the system woke up.

          This might not have any relation to other instances where people are seeing this exception. However, given other reports, the exception may not be causing any problems itself but may instead follow from other problems. Perhaps the real problem is elsewhere (environmental or configuration, as noted by many and as in my experience) and this particular message is a distraction.


          Jeff Thompson added a comment -

          Since the original reporter no longer observes this issue, and no one has provided additional information to describe, diagnose, or reproduce it in quite some time, I am going to close this report as Cannot Reproduce. If anyone can provide further reports and information that could move this forward, we can re-open it at that time.


          Thomas COAN added a comment - - edited

          I have been facing the same problem for a couple of days now.

          The slave agent is launched with the following command:

          java -jar agent.jar -jnlpUrl https://MY_MASTER_HOST:MY_MASTER_PORT/computer/slave-number1/slave-agent.jnlp -secret ***** -workDir "/var/lib/jenkins"

          It was working fine for months, and I don't know what has changed (I don't remember any updates or modifications on the master or the slave).

          The slave works for several minutes or hours and then dies again with the following error:

          INFO: Connecting to MY_MASTER_HOST:MY_MASTER_PORT
          Oct 24, 2018 10:36:35 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Trying protocol: JNLP4-connect
          Oct 24, 2018 10:36:35 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Remote identity confirmed: 47:2d:42:ae:61:c6:79:b1:69:79:27:d7:26:b8:15:ec
          Oct 24, 2018 10:36:36 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connected
          Oct 24, 2018 10:51:36 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Terminated
          Oct 24, 2018 10:51:46 AM hudson.remoting.jnlp.Main$CuiListener error
          SEVERE: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller
          java.lang.NoClassDefFoundError: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller
          at jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1.onReconnect(JnlpSlaveRestarterInstaller.java:97)
          at hudson.remoting.EngineListenerSplitter.onReconnect(EngineListenerSplitter.java:49)
          at hudson.remoting.Engine.innerRun(Engine.java:662)
          at hudson.remoting.Engine.run(Engine.java:469)
          Caused by: java.lang.ClassNotFoundException: jenkins.slaves.restarter.JnlpSlaveRestarterInstaller
          at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
          at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:171)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
          ... 4 more

          Config : 

          • java : 1.8.0_181
          • os : ubuntu 16.04.4 LTS (Xenial Xerus)
          • jenkins master version : 2.138.2

           


          Jeff Thompson added a comment -

          As far as I've been able to tell so far, this is not a cause of failures or even a direct symptom. It results after an earlier connection failure and just indicates that a reconnection attempt has failed.

          The stack traces all begin with a failure in onReconnect(). This indicates the connection has terminated and Remoting is attempting to reconnect. Unfortunately, the reconnect attempt fails and this message results.

          Connection failures and disconnects can happen for many reasons, commonly associated with system, network, or environment issues. Particularly if nothing has changed and this suddenly starts occurring it is likely to be one of these external causes.

          It might be possible to improve the retry logic but without further information on what is occurring it is difficult to know what changes might improve it.

          One thing I could easily do is to change the log messaging and downgrade the severity of this particular message.
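To see how long a connection lasted before the disconnect that precedes these messages, the timestamps of the `Connected` and `Terminated` entries can be compared. The following is a rough sketch only: the function name `connection_lifetimes` and the `sample` text are illustrative assumptions, the sample lines are copied from the two agent logs quoted in this issue, and the parsing assumes the default English `java.util.logging` header format shown there.

```python
import re
from datetime import datetime

# java.util.logging header, e.g. "Oct 24, 2018 10:36:36 AM hudson.remoting..."
HEADER = re.compile(
    r"^\s*([A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\b"
)


def connection_lifetimes(log_text):
    """Return (connected_at, terminated_at, seconds) per Connected/Terminated pair.

    Hypothetical helper for log triage. Month names are parsed with the
    process locale, so this assumes an English (or default "C") locale.
    """
    lifetimes = []
    last_stamp = None      # timestamp of the most recent header line
    connected_at = None    # timestamp of the pending "Connected" entry
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            last_stamp = datetime.strptime(
                header.group(1), "%b %d, %Y %I:%M:%S %p"
            )
            continue
        message = line.strip()
        if message == "INFO: Connected":
            connected_at = last_stamp
        elif message == "INFO: Terminated" and connected_at is not None:
            seconds = (last_stamp - connected_at).total_seconds()
            lifetimes.append((connected_at, last_stamp, seconds))
            connected_at = None
    return lifetimes


# Sample lines taken from the agent logs quoted in this issue.
sample = """\
Apr 10, 2018 11:02:07 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Apr 10, 2018 11:02:14 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
Oct 24, 2018 10:36:36 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Oct 24, 2018 10:51:36 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
"""

if __name__ == "__main__":
    for start, end, seconds in connection_lifetimes(sample):
        print(start, "->", end, f"{seconds:.0f}s")
```

On the quoted logs this yields lifetimes of 7 seconds and 900 seconds. A script like this only measures the interval; it does not identify why the connection dropped.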


          Thomas COAN added a comment -

          Thanks Jeff,

          Yes, changing the log message would be great.

          => As a workaround, I have changed the way the slave connects: it now uses the SSH agent method instead of Java Web Start.

          There have been no connection failures from slave to master for a couple of days.

          Thomas 


          Jeff Thompson added a comment -

          tcoan, I've got a PR in review to tweak the messaging: https://github.com/jenkinsci/remoting/pull/295. If you have any comments on that proposal, please share.

          That's great news on the improved reliability using ssh agent. I wish I knew why that was the case.


          Thomas COAN added a comment -

          jthompson, I have reviewed the PR, it seems good to me. Thanks Jeff.

          One remark: the PR comments say this case should be rare, but for some weeks it was happening 5-10 times per day on my side. A web search suggests that other people are facing the same issue.

          With the new message, we will be able to investigate outside Jenkins. The previous message seemed to indicate an internal error.

           


          Jeff Thompson added a comment -

          tcoan, if you have a GitHub account, you would be welcome to add your comment directly to the PR. At this point, though, I hope to merge that today so no need to bother. 

          I think the question of rarity was on the order of minutes or less. If the occurrence was that frequent these log messages could still be annoying. At 5-10 per day the disconnects are very annoying but the log messages may be helpful.


          Jeff Thompson added a comment -

          As discussed in the comments, we decided to improve the log messages on reconnect to reduce the log level and improve the clarity. This won't do anything to help reduce the original problem but we've received insufficient information to reproduce those situations.

          This will be picked up by a Jenkins weekly build soon.


            jthompson Jeff Thompson
            karthikduddu Karthik Duddu