JENKINS-67062

Jenkins fails to resume builds during restarts when the Agent is connected with WebSockets

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Jenkins: 2.303.3 JDK11 (latest LTS to date)
      Kubernetes Plugin: 1.30.6 (latest to date)
      jenkins/inbound-agent:4.11-1 (latest to date)
    • 2.338

      No error is shown in the Jenkins logs themselves, but the agent fails with:

      ❯ kubectl logs -f default-728vq
      Warning: SECRET is defined twice in command-line arguments and the environment variable
      Warning: AGENT_NAME is defined twice in command-line arguments and the environment variable
      Nov 04, 2021 7:45:59 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up agent: default-728vq
      Nov 04, 2021 7:45:59 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Nov 04, 2021 7:45:59 PM hudson.remoting.Engine startEngine
      INFO: Using Remoting version: 4.11
      Nov 04, 2021 7:45:59 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using /home/jenkins/agent/remoting as a remoting work directory
      Nov 04, 2021 7:45:59 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
      INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
      Nov 04, 2021 7:46:00 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: WebSocket connection open
      Nov 04, 2021 7:46:00 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Write side closed
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Read side closed
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Read side closed
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Read side closed
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: http://jenkins.default.svc.cluster.local:8080/login is not ready: 503
      Nov 04, 2021 7:46:38 PM hudson.remoting.Engine lambda$new$1
      SEVERE: Uncaught exception in Engine thread Thread[Thread-0,5,main]
      java.lang.NoClassDefFoundError: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller
              at jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1.onReconnect(JnlpSlaveRestarterInstaller.java:91)
              at hudson.remoting.EngineListenerSplitter.onReconnect(EngineListenerSplitter.java:54)
              at hudson.remoting.Engine.runWebSocket(Engine.java:687)
              at hudson.remoting.Engine.run(Engine.java:496)
      Caused by: java.lang.ClassNotFoundException: jenkins.slaves.restarter.JnlpSlaveRestarterInstaller
              at java.base/java.net.URLClassLoader.findClass(Unknown Source)
              at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:215)
              at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
              at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
              ... 4 more
      

      The issue can be easily reproduced in any environment:

      $ kind create cluster
      
      $ helm repo add jenkins https://charts.jenkins.io
      
      $ helm repo update
      
      $ helm upgrade jenkins jenkins/jenkins --install --wait --debug -f- <<'EOF'
      controller:
        adminPassword: admin
        agentListenerEnabled: false
        # specifying plugins without version makes sure to use the latest
        installPlugins:
          - kubernetes
          - workflow-aggregator
          - git
          - configuration-as-code
          - job-dsl
          - saferestart
        JCasC:
          configScripts:
            my-jobs: |
              jobs:
                - script: |
                    pipelineJob('testjob') {
                      definition {
                        cps {
                          script("""\
                            pipeline {
                              agent any
                              stages {
                                stage ('test') {
                                  steps {
                                    sleep 1000
                                  }
                                }
                              }
                            }""".stripIndent())
                          sandbox()
                        }
                      }
                    }
      agent:
        websocket: true
        tag: 4.11-1
      EOF
      
      $ echo http://127.0.0.1:8080 && kubectl --namespace default port-forward svc/jenkins 8080:8080
      

      Then:

      1. Go to the Jenkins UI at http://127.0.0.1:8080
      2. Log in with "admin" as both user and password
      3. Trigger a build of "testjob"
      4. Wait for a pod to be assigned to the build and for the sleep step to start running
      5. Start following the pod logs with kubectl logs -f <name-of-pod>
      6. Go to the Jenkins home page, click Restart Safely, and confirm
      7. Watch the pod logs: the agent fails with the stack trace shown above, and the build fails as well.

          Felipe Santos added a comment -

          This is a follow-up of the linked issue.

          Felipe Santos added a comment - edited

          This should be helpful for someone debugging it at the code level:

          This exception is thrown when the agent runs its reconnect flow, starting here, and the specific exception is thrown here. It seems that the r variable points to a type that is no longer known to the agent JVM after the remoting channel has been broken?

          From https://github.com/falldamagestudio/UE-Jenkins-Images/issues/5

          Felipe Santos added a comment -

          Another problem is that the jenkins-agent itself exits with 0 as error code, which is certainly very misleading.
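
Until the exit code is fixed, one way to stop a failed agent from looking successful is to wrap the agent command and turn the fatal log line into a non-zero exit status. The following is only a minimal sketch under stated assumptions: the function name run_checked and the FATAL_MARKER variable are hypothetical, and the marker string is copied from the agent log in this report. It is not necessarily what the linked workaround does.

```shell
# Marker copied from the SEVERE line in the agent log of this report.
FATAL_MARKER='SEVERE: Uncaught exception in Engine thread'

# Hypothetical wrapper: run a command, stream its output, and return a
# non-zero status if the marker appears even though the command exited 0.
run_checked() {
    log_file=$(mktemp)
    status_file=$(mktemp)
    # A POSIX pipeline only reports the status of tee, so the status of the
    # wrapped command is written to a temporary file instead.
    { rc=0; "$@" 2>&1 || rc=$?; echo "$rc" >"$status_file"; } | tee "$log_file"
    status=$(cat "$status_file")
    if [ "$status" -eq 0 ] && grep -F -q "$FATAL_MARKER" "$log_file"; then
        status=1
    fi
    rm -f "$log_file" "$status_file"
    return "$status"
}

# Example call (assumes the agent entrypoint is named jenkins-agent):
# run_checked jenkins-agent "$@"
```

The wrapper keeps the original exit status whenever it is already non-zero, so it only changes the outcome in the misleading case described above.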

          Felipe Santos added a comment - edited

          A workaround can be implemented as I did in https://github.com/felipecrs/jenkins-agent-dind/pull/36. However, while better than nothing, it is not perfect. The caveats stated in the last paragraph of https://github.com/felipecrs/jenkins-agent-dind/pull/36 also apply here.

          Kim K added a comment -

          Just had this happen when someone restarted our master over the weekend. Our native Windows agents, which run as a service under the Local System user, have the same issue.

          Is the workaround to not use WebSockets for now?

          Felipe Santos added a comment -

          That's sad, but I think so.
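
For the Helm-based reproduction in the description, switching back to TCP would presumably mean inverting the two settings used there. This is a sketch, not a verified fix: the file name values-tcp.yaml is an assumption, and the keys are simply the ones from the reproduction values with their booleans flipped.

```shell
# Write a hypothetical override file (the file name is an assumption).
cat > values-tcp.yaml <<'EOF'
controller:
  # Re-enable the inbound TCP agent listener that the reproduction disabled.
  agentListenerEnabled: true
agent:
  # Connect agents over TCP instead of WebSocket.
  websocket: false
EOF

# Then apply it on top of the existing release, for example:
# helm upgrade jenkins jenkins/jenkins --reuse-values -f values-tcp.yaml
```

This only helps where the agents can reach the controller's TCP agent port, which is not the case in every network setup.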

          Łukasz Kubisiak added a comment -

          I'd just add that we see the same error when running many builds: 40 parallel pods (containerCap=40) and about 2500 build stages in the build queue (1 build = 1 stage = 1 pod). Kubernetes was moderately loaded.

          Additionally, I've noticed that once one pod fails, more pods go on to fail. I ran ~10 builds, which generate the ~2500 build stages mentioned above, and everything worked fine for about 1h; then, after the first pod failed, more of them started failing.

          I'm also using a script to cancel previous builds; I'm not sure how much it helps to reproduce the issue:

          def cancelPreviousBuilds() {
              def jobName = env.JOB_NAME
              def buildNumber = env.BUILD_NUMBER.toInteger()
              def currentJob = Jenkins.instance.getItemByFullName(jobName)
              // Stop every older build of this job that is still running
              for (def build : currentJob.builds) {
                  if (build.isBuilding() && build.number.toInteger() < buildNumber) {
                      build.doStop()
                      echo "Aborting build ${build.number.toInteger()}"
                  }
              }
          }

          To sum up, a Jenkins restart is not needed to reproduce the issue.

          Temporarily, we are going to use a TCP connection instead of WebSockets.

          Can you increase the priority of this issue?

          Felipe Santos added a comment -

          I did increase the priority to Critical, and for me it's indeed critical. TCP is not much of an option given my network constraints.

          I'm not a Jenkins maintainer though, so if the priority is not correct, please let me know.

          Basil Crow added a comment -

          Duplicates JENKINS-66446, which was fixed in jenkinsci/jenkins#6315 and jenkinsci/jenkins#6329 toward 2.338.
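
To check whether a given controller is already at or past 2.338, the version numbers can be compared in the shell. This is a rough sketch under assumptions: the helper name version_ge is hypothetical, sort -V requires a sort implementation that supports version ordering (for example GNU coreutils), and the comparison knows nothing about LTS backports, which this thread does not cover.

```shell
# Version named in the comment above as the target of the fix.
FIXED_VERSION="2.338"

# Hypothetical helper: succeeds when version $1 is greater than or equal
# to version $2, using version-aware sorting.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (URL from the reproduction; needs curl and a reachable controller).
# Jenkins reports its version in the X-Jenkins response header:
# current=$(curl -sI http://127.0.0.1:8080/login | tr -d '\r' | awk -F': ' 'tolower($1) == "x-jenkins" {print $2}')
# if version_ge "$current" "$FIXED_VERSION"; then echo "at or past $FIXED_VERSION"; fi
```

With the versions from this report, version_ge 2.303.3 2.338 fails, which matches the environment in which the bug was observed.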

            Assignee: Unassigned
            Reporter: Felipe Santos (felipecassiors)
            Votes: 8
            Watchers: 8