Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-56939

pipeline gets stuck: root cause unclear

    XMLWordPrintable

Details

    • Bug
    • Status: Resolved (View Workflow)
    • Major
    • Resolution: Duplicate
    • kubernetes-plugin
    • None
    • Jenkins ver. 2.164.1, running on kuberenetes pod with 6 Cpu request and alot of memory.
      Kubernetes plugin 1.14.10-SNAPSHOT 8/4/19 (also tested with 1.14.9)

    Description

      I'm running following pipeline in 10 build at the same time (so 10 *30 = 300 "stages")

      def label = "v5-$USER"
      def batchList = "batchList"
      def batchListArray = []
      
      timestamps {
      podTemplate(
          label: label,
          serviceAccount: "jenkins",
          idleMinutes: 0,
          podRetention: never(),
          containers: [
              containerTemplate(
                  name: 'kubectl',
                  image: 'lachlanevenson/k8s-kubectl:v1.10.7',
                  ttyEnabled: true,
                  command: 'cat'
              ),
              containerTemplate(
                  name: 'helm',
                  image: 'lachlanevenson/k8s-helm:v2.9.1',
                  ttyEnabled: true,
                  command: 'cat'
              )
          ]
      ){node(label) {
          stage('Setup') {
              stage('Get list batches') {
                  container("kubectl") {
                      sh """
                          echo 1 > $batchList
                          echo 2 >> $batchList
                          echo 3 >> $batchList
                          echo 4 >> $batchList
                          echo 5 >> $batchList
                          echo 6 >> $batchList
                          echo 7 >> $batchList
                          echo 8 >> $batchList
                          echo 9 >> $batchList
                          echo 10 >> $batchList
                          echo 11 >> $batchList
                          echo 12 >> $batchList
                          echo 13 >> $batchList
                          echo 14 >> $batchList
                          echo 15 >> $batchList
                          echo 16 >> $batchList
                          echo 17 >> $batchList
                          echo 18 >> $batchList
                          echo 19 >> $batchList
                          echo 20 >> $batchList
                          echo 21 >> $batchList
                          echo 22 >> $batchList
                          echo 23 >> $batchList
                          echo 24 >> $batchList
                          echo 25 >> $batchList
                          echo 26 >> $batchList
                          echo 27 >> $batchList
                          echo 28 >> $batchList
                          echo 29 >> $batchList
                          echo 30 >> $batchList
                      """
                  }            batchListFile = readFile "$batchList"
                  lines = batchListFile.readLines()
                  for (batch in lines) {
                      batchListArray.add(batch)
                  }
              }
          }    stage ('Parallel') {
              def parallelData = [:]        batchListArray.each { batch ->
                  parallelData[batch] = {
                      stage("Deploy - $batch") {
                          container("helm") {
                              sh """
                                  sleep 5
                              """
                          }
                      }                stage("Queue - $batch") {
                          container("kubectl") {
                              sh """
                                  sleep 600
                              """
                          }
                      }
                  }
              }
              parallel parallelData
          }    stage('Clean up') {
              container("helm") {
                  sh """
                      sleep 10
                  """
              }
          }
      }
      }
      }
      

      This is a simplified version of the real pipeline we're running.

      For 1 build, this runs fine, but when you run 10 builds at once, it starts degrading.

      What is observed is that the first job is quickly able to launch all the sleeps, even the sleeps in the later stages of the pipeline, but when the amount of build increase, the stages don't seem to be able to start any longer, slowing down the build drastically and failing eventually.

      Note that the agent pods and containers are starting fine.

      Also:   java.io.IOException: Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds
      		at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:369)
      		at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:252)
      		at hudson.Launcher$ProcStarter.start(Launcher.java:455)
      		at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:206)
      		at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:99)
      		at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:305)
      		at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:268)
      		at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:176)
      		at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
      		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      		at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
      		at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      		at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      		at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
      		at WorkflowScript.run(WorkflowScript:88)
      java.io.IOException: Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds
      	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:369)
      	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:252)
      	at hudson.Launcher$ProcStarter.start(Launcher.java:455)
      	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:206)
      	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:99)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:305)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:268)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:176)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
      	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
      	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
      	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
      	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      	at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
      	at WorkflowScript.run(WorkflowScript:80)
      	at ___cps.transform___(Native Method)
      	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
      	at sun.reflect.GeneratedMethodAccessor274.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
      	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
      	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
      	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
      	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:182)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:332)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:83)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:244)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:232)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Finished: FAILURE
      

      So, with the latest SNAPSHOT, you do get following output:

      Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds

      The strange thing is that the node/pod and its containers are already there, so don't see a good reason why it would help?

       

      I've tried many different settings and env tweaks, but none seem to make any difference.

      I'm out of ideas on how to do proper root cause analysis on this, any ideas?

      br,

      Jan

      ps: When I run a similar pipeline, but no containers, all 'on the master', there is no degradation in performance seen:

      def batchList = "batchList"
      def batchListArray = []
      
      node('master') {
          stage('Setup') {
              stage('Get list batches') {
                      sh """
                          echo 1 > $batchList
                          echo 2 >> $batchList
                          echo 3 >> $batchList
                          echo 4 >> $batchList
                          echo 5 >> $batchList
                          echo 6 >> $batchList
                          echo 7 >> $batchList
                          echo 8 >> $batchList
                          echo 9 >> $batchList
                          echo 10 >> $batchList
                          echo 11 >> $batchList
                          echo 12 >> $batchList
                          echo 13 >> $batchList
                          echo 14 >> $batchList
                          echo 15 >> $batchList
                          echo 16 >> $batchList
                          echo 17 >> $batchList
                          echo 18 >> $batchList
                          echo 19 >> $batchList
                          echo 20 >> $batchList
                          echo 21 >> $batchList
                          echo 22 >> $batchList
                          echo 23 >> $batchList
                          echo 24 >> $batchList
                          echo 25 >> $batchList
                          echo 26 >> $batchList
                          echo 27 >> $batchList
                          echo 28 >> $batchList
                          echo 29 >> $batchList
                          echo 30 >> $batchList
                      """
                  batchListFile = readFile "$batchList"
                  lines = batchListFile.readLines()
                  for (batch in lines) {
                      batchListArray.add(batch)
                  }
              }
          }    stage ('Parallel') {
              def parallelData = [:]        batchListArray.each { batch ->
                  parallelData[batch] = {
                      stage("Deploy - $batch") {
                              sh """
                                  sleep 5
                              """
                      }                stage("Queue - $batch") {
                              sh """
                                  sleep 600
                              """
                      }
                  }
              }
              parallel parallelData
          }    stage('Clean up') {
                  sh """
                      sleep 10
                  """
          }
      }
      

       

       

      Attachments

        Issue Links

          Activity

            heyleke Jan Heylen added a comment -

            After further investigation it looks like the container statements are the ones causing the performance issue.
            container("some container") {
            Once you run the same pipeline, but without the 'container' statements (so everything just on the jnlp container?), the scaling performance issue is no longer seen.

            heyleke Jan Heylen added a comment - After further investigation it looks like the container statements are the ones causing the performance issue. container("some container") { Once you run the same pipeline, but without the 'container' statements (so everything just on the jnlp container?), the scaling performance issue is no longer seen.
            nuzz Matt Nuzzaco added a comment -

            I'm also seeing this same behavior. A single build works fine (runs about 40 parallel stages/pods per build, as soon as you start more than 1-2 concurrent builds this websocket timeout issue starts showing up across a random set of the test pods

            Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds 

             

            org.jenkins-ci.main:jenkins-war:2.173

            kubernetes:1.15.1

            jenkinsci/jnlp-slave:3.27-1

             

            nuzz Matt Nuzzaco added a comment - I'm also seeing this same behavior. A single build works fine (runs about 40 parallel stages/pods per build, as soon as you start more than 1-2 concurrent builds this websocket timeout issue starts showing up across a random set of the test pods Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds   org.jenkins-ci.main:jenkins-war:2.173 kubernetes:1.15.1 jenkinsci/jnlp-slave:3.27-1  
            karolgil Karol Gil added a comment -

            Did you try upgrading the property that is mentioned in the error message? I've also encountered that for pipelines that are running ~100 pods but when we run Jenkins with
            -D org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=60
            the error no longer appears and pipelines are running normally.

            karolgil Karol Gil added a comment - Did you try upgrading the property that is mentioned in the error message? I've also encountered that for pipelines that are running ~100 pods but when we run Jenkins with -D org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=60 the error no longer appears and pipelines are running normally.
            jglick Jesse Glick added a comment -

            Probably addressed by JENKINS-58290.

            jglick Jesse Glick added a comment - Probably addressed by JENKINS-58290 .

            People

              Unassigned Unassigned
              heyleke Jan Heylen
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: