Jenkins / JENKINS-67062

Jenkins fails to resume builds during restarts when the Agent is connected with WebSockets

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • None
    • Environment: Jenkins: 2.303.3 JDK11 (latest LTS to date)
      Kubernetes Plugin: 1.30.6 (latest to date)
      jenkins/inbound-agent:4.11-1 (latest to date)
    • Released As: 2.338

    Description

      No error appears in the Jenkins logs themselves, but the agent fails with:

      ❯ kubectl logs -f default-728vq
      Warning: SECRET is defined twice in command-line arguments and the environment variable
      Warning: AGENT_NAME is defined twice in command-line arguments and the environment variable
      Nov 04, 2021 7:45:59 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up agent: default-728vq
      Nov 04, 2021 7:45:59 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Nov 04, 2021 7:45:59 PM hudson.remoting.Engine startEngine
      INFO: Using Remoting version: 4.11
      Nov 04, 2021 7:45:59 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using /home/jenkins/agent/remoting as a remoting work directory
      Nov 04, 2021 7:45:59 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
      INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
      Nov 04, 2021 7:46:00 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: WebSocket connection open
      Nov 04, 2021 7:46:00 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Write side closed
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Read side closed
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Read side closed
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Read side closed
      Nov 04, 2021 7:46:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: http://jenkins.default.svc.cluster.local:8080/login is not ready: 503
      Nov 04, 2021 7:46:38 PM hudson.remoting.Engine lambda$new$1
      SEVERE: Uncaught exception in Engine thread Thread[Thread-0,5,main]
      java.lang.NoClassDefFoundError: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller
              at jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1.onReconnect(JnlpSlaveRestarterInstaller.java:91)
              at hudson.remoting.EngineListenerSplitter.onReconnect(EngineListenerSplitter.java:54)
              at hudson.remoting.Engine.runWebSocket(Engine.java:687)
              at hudson.remoting.Engine.run(Engine.java:496)
      Caused by: java.lang.ClassNotFoundException: jenkins.slaves.restarter.JnlpSlaveRestarterInstaller
              at java.base/java.net.URLClassLoader.findClass(Unknown Source)
              at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:215)
              at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
              at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
              ... 4 more
      

      The issue can be easily reproduced in any environment:

      $ kind create cluster
      
      $ helm repo add jenkins https://charts.jenkins.io
      
      $ helm repo update
      
      $ helm upgrade jenkins jenkins/jenkins --install --wait --debug -f- <<'EOF'
      controller:
        adminPassword: admin
        agentListenerEnabled: false
        # specifying plugins without version makes sure to use the latest
        installPlugins:
          - kubernetes
          - workflow-aggregator
          - git
          - configuration-as-code
          - job-dsl
          - saferestart
        JCasC:
          configScripts:
            my-jobs: |
              jobs:
                - script: |
                    pipelineJob('testjob') {
                      definition {
                        cps {
                          script("""\
                            pipeline {
                              agent any
                              stages {
                                stage ('test') {
                                  steps {
                                    sleep 1000
                                  }
                                }
                              }
                            }""".stripIndent())
                          sandbox()
                        }
                      }
                    }
      agent:
        websocket: true
        tag: 4.11-1
      EOF
      
      $ echo http://127.0.0.1:8080 && kubectl --namespace default port-forward svc/jenkins 8080:8080
      

      Then:

      1. Go to the Jenkins UI at http://127.0.0.1:8080
      2. Log in with "admin" as both user and password
      3. Trigger a build of "testjob"
      4. Wait for a pod to be assigned to the build and the sleep command to start running
      5. Start following the pod logs with kubectl logs -f <name-of-pod>
      6. Go to the Jenkins home page, click Restart Safely, and confirm
      7. Watch the pod logs: the agent fails with the stack trace shown above, and the build fails as well.
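
Steps 5 to 7 can be partly automated. The following is a minimal shell sketch and not part of the original report: `saw_restarter_failure` is a hypothetical helper name, and the string it matches is copied from the stack trace above.

```shell
# Hypothetical helper: reads agent log lines on stdin and succeeds as soon as
# the NoClassDefFoundError from the stack trace above shows up.
saw_restarter_failure() {
  grep -qF 'java.lang.NoClassDefFoundError: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller'
}

# Example with canned input; for a live check, pipe the output of
# `kubectl logs -f <name-of-pod>` into the helper instead.
if printf '%s\n' 'INFO: Connected' \
    'java.lang.NoClassDefFoundError: jenkins/slaves/restarter/JnlpSlaveRestarterInstaller' \
    | saw_restarter_failure; then
  echo "reproduced"
fi
```

Because the helper returns as soon as the line is seen, it can be started before the safe restart and left running until the agent reconnects.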

          Activity

            felipecassiors Felipe Santos created issue -
            felipecassiors Felipe Santos added a comment -

            This is a follow-up of the linked issue.

            felipecassiors Felipe Santos made changes -
            Field Original Value New Value
            Link This issue duplicates JENKINS-52283 [ JENKINS-52283 ]
            felipecassiors Felipe Santos added a comment - - edited

            This should be helpful for someone debugging it at the code level:

            This exception is thrown when the agent runs its reconnect flow, starting here and the specific exception is thrown here. It seems that the r variable is pointing to a type that is no longer known to the agent JVM after the remoting channel has been broken?

            From https://github.com/falldamagestudio/UE-Jenkins-Images/issues/5

            felipecassiors Felipe Santos made changes -
            Environment (original value):
              Jenkins: 2.303.3 JDK11 (latest LTS to date)
              Kubernetes Plugin: 1.30.6 (latest to date)
              jenkins/inbound-agent:4.11-1
            Environment (new value):
              Jenkins: 2.303.3 JDK11 (latest LTS to date)
              Kubernetes Plugin: 1.30.6 (latest to date)
              jenkins/inbound-agent:4.11-1 (latest to date)
            felipecassiors Felipe Santos added a comment -

            Another problem is that jenkins-agent itself exits with exit code 0 after this failure, which is very misleading.
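
One way a wrapper could turn this silent failure into a non-zero exit is sketched below. It is an illustration only and not necessarily what any existing workaround does: `run_agent_checked` is a hypothetical name, and the marker string is taken from the agent log in the description.

```shell
# Hypothetical wrapper: runs the given command, replays its output, and returns
# 1 if the Engine thread crash marker was printed; otherwise it returns the
# command's own exit status.
run_agent_checked() {
  local log status
  log="$(mktemp)"
  status=0
  # Capture stdout and stderr in a temporary file; a real wrapper would
  # likely stream the output instead of buffering it.
  "$@" >"$log" 2>&1 || status=$?
  cat "$log"
  if grep -qF 'SEVERE: Uncaught exception in Engine thread' "$log"; then
    status=1
  fi
  rm -f "$log"
  return "$status"
}
```

In a container entrypoint this might be called as `run_agent_checked jenkins-agent "$@"`, so that the pod is reported as failed rather than completed.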

            felipecassiors Felipe Santos added a comment - - edited

            A workaround can be implemented as I did in https://github.com/felipecrs/jenkins-agent-dind/pull/36. However, while better than nothing, it is not perfect. The caveats stated in the last paragraph of https://github.com/felipecrs/jenkins-agent-dind/pull/36 also apply here.
            drrobblebobble Kim K added a comment -

            Just had this happen when someone restarted our master over the weekend. Our native Windows agents, which run as a service under the Local System user, have the same issue.

            Is the workaround to not use WebSockets for now?
            felipecassiors Felipe Santos added a comment -

            That's sad, but I think so.
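
For the Helm-based reproduction in the description, turning WebSockets off would presumably mean flipping the two settings that its values file sets, `agent.websocket` and `controller.agentListenerEnabled`. A sketch under that assumption follows; the file name `tcp-values.yaml` is arbitrary, and whether the inbound TCP port is reachable depends on the network.

```shell
# Write an override file that reverses the two settings from the reproduction values.
cat > tcp-values.yaml <<'EOF'
controller:
  agentListenerEnabled: true   # re-enable the inbound TCP agent listener
agent:
  websocket: false             # agents connect over TCP instead of WebSockets
EOF

# Apply it on top of the existing release (left commented out so the snippet
# has no side effects when pasted):
# helm upgrade jenkins jenkins/jenkins --reuse-values -f tcp-values.yaml
```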


            lukasz_kubisiak Łukasz Kubisiak added a comment -

            I'd just add that we see the same error when running many builds: 40 parallel pods (containerCap=40) and about 2500 build stages in the build queue (1 build = 1 stage = 1 pod). Kubernetes was moderately loaded.

            Additionally, I've noticed that once one pod fails, more pods follow. I ran ~10 builds, which generated the ~2500 build stages mentioned above, and everything worked fine for about 1h; then the first pod failed and more of them started failing.

            I'm also using a script to cancel previous builds. I'm not sure how much it helps to reproduce the issue:

            // Abort every older build of the current job that is still running.
            def cancelPreviousBuilds() {
                def jobName = env.JOB_NAME
                def buildNumber = env.BUILD_NUMBER.toInteger()
                def currentJob = Jenkins.instance.getItemByFullName(jobName)
                for (def build : currentJob.builds) {
                    if (build.isBuilding() && build.number.toInteger() < buildNumber) {
                        build.doStop()
                        echo "Aborting build ${build.number.toInteger()}"
                    }
                }
            }

            To sum up, a Jenkins restart is not needed to reproduce the issue.

            For now, we are going to use a TCP connection instead of WebSockets.

            Can you increase the priority for this issue?
            felipecassiors Felipe Santos made changes -
            Priority Minor [ 4 ] Critical [ 2 ]
            felipecassiors Felipe Santos added a comment -

            I did increase the priority to Critical, and for me it's indeed critical. TCP is not much of an option given my network constraints.

            I'm not a Jenkins maintainer though, so if the priority is not correct, please let me know.

            felipecassiors Felipe Santos made changes -
            Link This issue duplicates JENKINS-66446 [ JENKINS-66446 ]
            basil Basil Crow added a comment -

            Duplicates JENKINS-66446, which was fixed in jenkinsci/jenkins#6315 and jenkinsci/jenkins#6329 toward 2.338.
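
To check whether a given controller version is at or past 2.338, a version comparison along these lines could be used. This is a sketch: `version_at_least` is a hypothetical helper, it relies on `sort -V` from GNU coreutils, and it assumes the fix was not backported to an earlier release line.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Uses `sort -V` (version sort) to find the smaller of the two.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# 2.303.3 is the version from the Environment field of this issue.
if version_at_least "2.303.3" "2.338"; then
  echo "2.303.3 is at or past 2.338"
else
  echo "2.303.3 predates 2.338"
fi
```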

            basil Basil Crow made changes -
            Released As 2.338
            Assignee Jeff Thompson [ jthompson ]
            Resolution Duplicate [ 3 ]
            Status Open [ 1 ] Closed [ 6 ]
            basil Basil Crow made changes -
            Link This issue duplicates JENKINS-52283 [ JENKINS-52283 ]

            People

              Assignee: Unassigned
              Reporter: Felipe Santos (felipecassiors)
              Votes: 8
              Watchers: 8
