
"Pipe not connected" errors when running multiple builds simultaneously

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: Jenkins 2.60
      Kubernetes plugin 0.12
      Kubernetes 1.5.1 on GKE
      Kubernetes 1.7.0 on AWS

      Hi there,

      We have Jenkins running in Kubernetes with the Kubernetes plugin, and have been experiencing `java.io.IOException: Pipe not connected` errors when running multiple builds simultaneously. This seems to consistently happen when we run 8 or more builds (on the same pipeline). About 50% of the builds will succeed, and the other 50% will fail with the `Pipe not connected` exception. Most of the time it will fail at stage 1, but sometimes at stage 2.

      We're using the following pipeline:

      podTemplate(label: 'mypod', containers: [
        containerTemplate(name: 'debian', image: 'debian', ttyEnabled: true, command: 'cat'),
        containerTemplate(name: 'ubuntu', image: 'ubuntu', ttyEnabled: true, command: 'cat')
      ]) {
        node('mypod') {
          container('debian') {
            stage('stage 1') {
              sh 'echo hello'
              sh 'sleep 30'
              sh 'echo world'
            }
      
            stage('stage 2') {
              sh 'echo hello'
              sh 'sleep 30'
              sh 'echo world'
            }
          }
        }
      }
      

      And this is the log of such a failed build:

      [Pipeline] podTemplate
      [Pipeline] {
      [Pipeline] node
      Still waiting to schedule task
      Waiting for next available executor on mypod
      Running on kubernetes-a0e59102b59b48ad99693ca32b94ab38-11a5bcd7df12e4 in /home/jenkins/workspace/kubernetes-test-3
      [Pipeline] {
      [Pipeline] container
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (stage 1)
      [Pipeline] sh
      [kubernetes-test-3] Running shell script
      Executing shell script inside container [debian] of pod [kubernetes-a0e59102b59b48ad99693ca32b94ab38-11a5bcd7df12e4]
      Executing command: sh -c echo $$ > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/pid'; jsc=durable-7534cabf595ac7f32ca72b4db83e0af1; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/script.sh' > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/jenkins-result.txt' 
      # cd /home/jenkins/workspace/kubernetes-test-3
      sh -c echo $$ > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/pid'; jsc=durable-7534cabf595ac7f32ca72b4db83e0af1; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/script.sh' > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/jenkins-result.txt' 
      exit
      # # + echo hello
      hello
      [Pipeline] sh
      [kubernetes-test-3] Running shell script
      Executing shell script inside container [debian] of pod [kubernetes-a0e59102b59b48ad99693ca32b94ab38-11a5bcd7df12e4]
      Executing command: sh -c echo $$ > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/pid'; jsc=durable-7534cabf595ac7f32ca72b4db83e0af1; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/script.sh' > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/jenkins-result.txt' 
      # cd /home/jenkins/workspace/kubernetes-test-3
      sh -c echo $$ > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/pid'; jsc=durable-7534cabf595ac7f32ca72b4db83e0af1; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/script.sh' > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/jenkins-result.txt' 
      exit
      # + sleep 30
      # [Pipeline] sh
      [kubernetes-test-3] Running shell script
      Executing shell script inside container [debian] of pod [kubernetes-a0e59102b59b48ad99693ca32b94ab38-11a5bcd7df12e4]
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // container
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] }
      [Pipeline] // podTemplate
      [Pipeline] End of Pipeline
      java.io.IOException: Pipe not connected
      	at java.io.PipedOutputStream.write(PipedOutputStream.java:140)
      	at java.io.OutputStream.write(OutputStream.java:75)
      	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:125)
      	at hudson.Launcher$ProcStarter.start(Launcher.java:384)
      	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:147)
      	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:61)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:158)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:184)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:126)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
      	at groovy.lang.GroovyObject$invokeMethod.call(Unknown Source)
      	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:151)
      	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:21)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:115)
      	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:149)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:146)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:123)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:123)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:123)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:123)
      	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:16)
      	at WorkflowScript.run(WorkflowScript:10)
      	at ___cps.transform___(Native Method)
      	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
      	at sun.reflect.GeneratedMethodAccessor521.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
      	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
      	at com.cloudbees.groovy.cps.Next.step(Next.java:58)
      	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:154)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:33)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:30)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:30)
      	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:163)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:324)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:78)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:236)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:224)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:63)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Finished: FAILURE
      

      Something seems to be going wrong around https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java#L125.

          [JENKINS-40825] "Pipe not connected" errors when running multiple builds simultaneously

          Mike Splain added a comment -

          We're seeing this as well with the same scenario.


          Lars Lawoko added a comment -

          Still happening for us.
          Could this be related to "java.io.Piped*Stream are not threads friendly and cause 'Pipe is broken' issue when jenkins pool the writing threads"? https://issues.jenkins-ci.org/browse/JENKINS-23958?focusedCommentId=228900&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-228900


          Lars Lawoko added a comment -

          We have been having this issue and did an investigation. It seems to fail on the pipe write for the first "cd" command used to enter the workspace. From what we understand, the underlying kubernetes-client library does not expose a stable connection, and the latch appears to be implemented incorrectly (we think). The latch is initialized once in the `container` step's "start" method, but a new connection is opened for each `sh` step. This causes "waitQuietly" to pass through straight away, without waiting, on all sh connections but the first. Once we fixed that, the websocket's "OnOpen" callback was never called, indicating the connection is not made, and the latch was only released on thread interrupt.
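
          To illustrate (a hypothetical Groovy sketch of the suspected pattern, not the plugin's actual code; all names are made up):

          import java.util.concurrent.CountDownLatch
          import java.util.concurrent.TimeUnit

          class ContainerExecSketch {
              // created once when the container step starts; counted down by the
              // first connection's onOpen() callback
              private final CountDownLatch started = new CountDownLatch(1)

              void onOpen() {
                  started.countDown()
              }

              void launch(Closure writeToPipe) {
                  // every sh step opens a NEW websocket but awaits the SAME latch;
                  // after the first countDown() this returns immediately, so later
                  // steps can write before their own connection is open
                  // -> java.io.IOException: Pipe not connected
                  started.await(30, TimeUnit.SECONDS)
                  writeToPipe()
              }
          }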

          At this stage we didn't dig deeper, but hopefully this is a good starting point. Instead we switched to embedding jnlp into our main container using the "jnlp" container name "hack" to work around this.

          TL;DR: Embedding jnlp into your main container using the "jnlp" container name "hack" works around this issue.


          gilgamez added a comment -

          larslawoko Can you explain your workaround, perhaps with an example pipeline snippet? I'm experiencing this issue too and have tried to apply your workaround by using our previous heavy-weight build image with the name 'jnlp' as a containerTemplate, but that doesn't seem to work.


          Jean Mertz added a comment -

          I'm also interested in this workaround, to see if it fits our setup. Of course I'm also hoping for a permanent solution.


          Lars Lawoko added a comment - - edited

          Essentially the bug is in the "container" step, so if you avoid it, this bug should be worked around. Of course, if you rely on the shared disk of one pod with multiple containers, this won't work.

          Instead of having

           

           podTemplate(label: 'mypod', containers: [
             containerTemplate(name: 'debian', image: 'debian', ttyEnabled: true, command: 'cat')
           ]) {
             node('mypod') {
               container('debian') {
                 stage('stage 1') {
                   sh 'echo hello'
                   sh 'sleep 30'
                   sh 'echo world'
                 }

                 stage('stage 2') {
                   sh 'echo hello'
                   sh 'sleep 30'
                   sh 'echo world'
                 }
               }
             }
           }

          (pseudocode, might need more tweaking)

          have this (note the new image, the name is jnlp (see https://issues.jenkins-ci.org/browse/JENKINS-40847), and no container step):

           

           podTemplate(label: 'mypod', containers: [
             containerTemplate(name: 'jnlp', image: 'custom/debian-with-jnlp'),
           ]) {
             node('mypod') {
               stage('stage 1') {
                 sh 'echo hello'
                 sh 'sleep 30'
                 sh 'echo world'
               }

               stage('stage 2') {
                 sh 'echo hello'
                 sh 'sleep 30'
                 sh 'echo world'
               }
             }
           }
            

          Not at work now, but if someone needs a more in-depth walkthrough, comment here.


          Jesse Redl added a comment -

          larslawoko thanks so much for posting the workaround. I was about to throw in the towel on Jenkins today and this saved me.

          For reference to others, rather than pulling directly from the official jenkinsci images, I brought in the relevant bits from the published Dockerfiles:

          https://github.com/jenkinsci/docker-slave
          https://github.com/jenkinsci/docker-jnlp-slave

          Another win for this workaround is that by dropping the container block you also cut out all of the noise being generated from: https://issues.jenkins-ci.org/browse/JENKINS-42048

          Jenkins 2.51
          Kubernetes plugin 0.11 
          Kubernetes 1.5.3 on GKE

           


          Gytis Ramanauskas added a comment -

          Steven Oud's proposed workaround does not seem to work; it runs multiple retries and eventually fails.

          Using only a single container is kind of difficult for us. Any more quick fixes?

           


          Jean Mertz added a comment -

          I'm interested to hear csanchez' thoughts about this. I'd consider this a critical/blocking issue for serious/heavy use of this plugin, but am not familiar enough with the codebase to know where to start debugging this. The unpredictability of this error also makes it more difficult to pinpoint.


          Jean Mertz added a comment -

          For those interested, we've worked around this issue for all sh commands by using the default JNLP connections, and then tunnelling the command to the right container. Something like this:

           

          def ksh(command) {
            if (env.CONTAINER_NAME) {
              if ((command instanceof String) || (command instanceof GString)) {
                command = kubectl(command)
              }
          
              if (command instanceof LinkedHashMap) {
                command["script"] = kubectl(command["script"])
              }
            }
          
            sh(command)
          }
          
          def kubectl(command) {
            "kubectl exec -i ${env.HOSTNAME} -c ${env.CONTAINER_NAME} -- /bin/sh -c 'cd ${env.WORKSPACE} && ${command}'"
          }
          
          def customContainer(String name, Closure body) {
            withEnv(["CONTAINER_NAME=$name"]) {
              body()
            }
          }
          

           

          This way, you can do something like:

           

          node('my-pod') {
            customContainer('container-1') {
              ksh 'echo hello world'
              ref = ksh returnStdout: true, script: 'git rev-parse --short HEAD'
            }
          }
          

           

           

          You do need a custom JNLP container with kubectl for this to work, which we built using this:

          FROM jenkinsci/jnlp-slave:2.62-alpine
          
          USER root
          ADD https://storage.googleapis.com/kubernetes-release/release/v1.6.1/bin/linux/amd64/kubectl /usr/local/bin/kubectl
          RUN chmod +x /usr/local/bin/kubectl
          USER jenkins
          

          This only works for commands using sh, so for example checkout(scm) doesn't work. But your home folder is shared across containers using a volume anyway, so you can simply do the checkout in the JNLP container and have the files available in the other containers as well.
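
          For example, something like this (a sketch that reuses the ksh/customContainer helpers above; the step contents are just placeholders):

          node('my-pod') {
            // the checkout runs in the default jnlp container; the workspace volume is
            // shared, so the sources are visible to the other containers as well
            checkout scm

            customContainer('container-1') {
              ksh 'make test'
            }
          }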

          I hope this helps anyone, until there is a proper fix in place.


          Carlos Sanchez added a comment -

          I think this is caused because several assumptions were made in the multiple container execution model, particularly in concurrent executions. I'm fixing JENKINS-42048 and have a better understanding now. It's a matter of time to get through all the issues

          In this particular case I guess the same Jenkins agent is reused for the concurrent builds, and the plugin does not consider this case, killing connections from different executions


          Joan Goyeau added a comment - - edited

          Hi,

          This Pipe not connected issue is failing half of our builds.
          This makes the Kubernetes plugin quite unusable. I'm interested in how CloudBees is managing this.
          We are happy to help fix the bug, but we have no idea where to start. csanchez, could we be of any help?

          Cheers


          Carlos Sanchez added a comment -

          The container idiom is an alpha feature unique to Kubernetes, and it still has some glitches to be fixed, in progress in JENKINS-42048.

          If you use agents in Kubernetes without executing into other containers, it works as expected.


          Ioannis Canellos added a comment -

          csanchez shouldn't the OnceRetentionStrategy, which seems to be the default for the kubernetes-plugin, prevent the node from being reused by other jobs?

           


          Carlos Sanchez added a comment -

          No, IIRC even with OnceRetentionStrategy there is some time (seconds) during which the agent can receive more work.

          More info in https://wiki.jenkins-ci.org/display/JENKINS/One-Shot+Executor

          For a true one-job-per-agent setup we'd need to integrate with the one-shot plugin, which is on my roadmap; I just lack the time.


          Ioannis Canellos added a comment -

          What if you added the build number as part of the label? Would that prevent the reuse?

           

          def label = "${env.JOB_NAME}.${env.BUILD_NUMBER}".replace('-', '_').replace('/', '_')
          podTemplate(label: "$label" ...) {
              node("$label") {
                 //do stuff
              }
          }

           


          Joan Goyeau added a comment - - edited

          iocanel in my case I put a random UUID and it's the same.


          Jon Whitcraft added a comment -

          We are seeing them randomly as well, even on jobs that only run a podTemplate with a single container + the jnlp container.


          James Rawlings added a comment -

          I seem to be bouncing the Jenkins master pod around 2 times per day after getting the "Pipe not connected" error.

          FWIW we use the GitHub org plugin and can have 5-15 open PR jobs on our repos. I also tried disabling concurrent builds in the Jenkinsfile to see if that helped, but it didn't.

          node {
            properties([
              disableConcurrentBuilds()
            ])
          }
          

           


          Ioannis Canellos added a comment -

          csanchez I am not sure if the issue is related to the pod being reused. I have been using a version derived from the current master for a while, and I haven't hit the issue, even when a lot of concurrent builds take place. Is there any chance that the root cause is something else? (e.g. a client bug that might have been fixed in later versions?)

          If you are confident that the way to go is the `one shot executor`, I would like to volunteer, if you have any pointers (existing docs are really limited).

          jrawlings: I assume that you are using a forked version of the plugin, right?


          Carlos Sanchez added a comment -

          It is due to the container step, and I thought it was mixed up with the container reuse, but maybe not.

          JENKINS-42048 is proving to be more convoluted than I thought; there are a lot of assumptions that Pipeline makes about the execution of commands. I will try to find some time with one of the core devs there to fix it.


          Jean Mertz added a comment -

          I can confirm that the latest master does not solve this problem, and we already have our configuration tweaked in such a way that all our containers are truly one-shot. It does indeed only happen when using the `container` step (and as I posted a couple of comments above, we have a workaround, using sh/kubectl instead of the custom pipeline step).


          Ioannis Canellos added a comment -

          jeanmertz: Thanks for the feedback!

          It seems really weird that I can't reproduce the issue myself. Can you bump the kubernetes-client version to 2.3.1?


          Lars Lawoko added a comment - - edited

          We haven't used this method in a while, so it might be fixed like you mentioned. But when we investigated, the error was definitely happening due to a websocket (jnlp container to secondary container) issue. It seems to be triggered by a new sh step not getting a new websocket connection or recovering the existing one.

           

          Only 50% sure about this now, but hopefully it's a jumping off point.


          Carlos Sanchez added a comment -

          I believe this is caused by JENKINS-42048; once that's merged we can try again.


          Corey O'Brien added a comment -

          I added a comment to the PR for JENKINS-42048: https://github.com/jenkinsci/kubernetes-plugin/pull/157#issuecomment-310416604

          Running that PR-157 code seems to reduce the frequency of the Pipe not connected errors, but doesn't seem to remove them entirely.


          Brian Wallace added a comment - - edited

          I rebuilt the kubernetes-plugin from SHA 050e559 and continue to see the Pipe Not Connected error. Stack trace is in the PR. We are running Jenkins 2.46.2-alpine.

          https://github.com/jenkinsci/kubernetes-plugin/pull/157#issuecomment-316504564

          UPDATE:  We saw this on v0.11 as well.  That is why I tried building the plugin from master at SHA 050e559.


          Bruce Bradley added a comment -

          I ran into this today and began investigating. I can reproduce the error reliably at will. Running 6 parallel tasks, each spinning up a pod template made up of three containers, I always see one of the pods fail outright as soon as the sh step begins. Here's what I know:

          • Everything appears to be fine launching the pods
          • Jenkins debug logging shows a successful HTTP GET request on the doomed pod, and this occurs during the same second that the shell command attempts to run (I haven't checked the code but I suspect that the shell command attempts to execute just after a successful pod get request)
          • Almost exactly one minute after the doomed pod comes up (the HTTP request comes back successful and we attempt to run sh) I see the "Error while pumping stream" message in the Jenkins server log:

           

          Jul 19, 2017 4:21:42 PM INFO okhttp3.internal.platform.Platform log
          <-- END HTTP (7866-byte body)
          Jul 19, 2017 4:22:40 PM SEVERE io.fabric8.kubernetes.client.utils.InputStreamPumper run
          Error while pumping stream.
          java.io.IOException: Pipe broken
           at java.io.PipedInputStream.read(PipedInputStream.java:321)
           at java.io.PipedInputStream.read(PipedInputStream.java:377)
           at java.io.InputStream.read(InputStream.java:101)
           at io.fabric8.kubernetes.client.utils.InputStreamPumper.run(InputStreamPumper.java:57)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:745)
          Jul 19, 2017 4:22:41 PM SEVERE io.fabric8.kubernetes.client.utils.InputStreamPumper run
          Error while pumping stream.
          java.io.IOException: Pipe broken
           at java.io.PipedInputStream.read(PipedInputStream.java:321)
           at java.io.PipedInputStream.read(PipedInputStream.java:377)
           at java.io.InputStream.read(InputStream.java:101)
           at io.fabric8.kubernetes.client.utils.InputStreamPumper.run(InputStreamPumper.java:57)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:745)
          

          Following that, it seems like there's a 5-minute timeout because almost exactly five minutes afterward Jenkins terminates the slave because the task is supposedly "complete".

          This probably doesn't help Carlos since he seems to have already narrowed down the root of the problem, but in the off chance that this does help I'm happy to provide.

          For what it's worth, we're running Jenkins 2.46.3-alpine with the Kubernetes plugin compiled @ b266a49e (we needed port forwarding functionality badly). I just tried running master today but ran into a nasty permissions error regarding a nohup file so I reverted to our previous plugin.

           


          Michael Andrews added a comment - - edited

          We have the exact same issue with multiple K8s agents running. We have a long running shell step in a container in a podTemplate. The step starts... and we eventually get a broken pipe log message BUT the step is still running. We then see the step return SUCCESS! But then we try to switch containers to run the NEXT shell step -  which hangs for 5mins and the job breaks because of the broken pipe.  We built the master branch locally to get all the newest bug fixes (which are great!). And we're using Jenkins 2.60.1. Really need this fixed. 


          Martin Sander added a comment -

          Happens on 0.11 as well.


          Martin Sander added a comment -

          I reproduced this with a good ol' debugger connected, and this is what I found out:

          • This happens while waiting here.
          • When it happens, alive is false
          • I.e. either onClose or onFailure have been called
          • for onFailure, we should be able to see a stacktrace somewhere

          I will keep investigating..


          Martin Sander added a comment -

          Seems that it also happens with alive being true. So no new information here, I guess.


          Michael Andrews added a comment -

          0x89 - Thank you for digging into this. This issue is killing us.


          Martin Sander added a comment -

          killdash9: Unfortunately, I did not find anything conclusive yet (probably because I am just starting to get familiar with the source code), but I will go on. The master branch already has a bit of additional logging added that might help pinpoint this issue.

          Have you tried building the plugin yourself from master to check if it improves the situation? I know it doesn't fix the issue completely, but you may get hit less often.


          Martin Sander added a comment - - edited

          Btw, maybe useful information for everyone here:

          This is one of the scripts I use to reproduce this:

          def label = env.BUILD_TAG.drop(env.BUILD_TAG.length() - 63)
          podTemplate(
                  label: label,
                  containers: [
                          containerTemplate(name: 'mvn', image: 'maven', ttyEnabled: true, command: 'cat'),
                  ],
          ) {
              node(label) {
                  container('mvn') {
                      sh "sleep 200"
                  }
              }
          }
          

          The interesting part is that I was not able to reproduce this with sleep values up to 150 seconds, but 200 triggers this.

          Was able to reproduce with 120 seconds as well, so no progress here.


          Michael Andrews added a comment -

          0x89 - we built from master a few weeks ago. Might be time to do it again. We also use sleep in our pipeline shell steps. But I also see the broken pipe on long-running shell commands.


          Michael Andrews added a comment -

          0x89 There are several commits to deal with agent connection, read, and idle timeouts. Read timeout seems to default to 100s. https://github.com/jenkinsci/kubernetes-plugin/blob/master/CHANGELOG.md


          Michael Andrews added a comment -

          0x89 actually that's for the connection:

          private static final int DEFAULT_SLAVE_JENKINS_CONNECTION_TIMEOUT = 100;


          Michael Andrews added a comment - - edited

          So our magic number is 4. That's the number of agents I can run simultaneously. If I start a 5th one...they all get the broken pipe. We are using the 0.12 release.


          Martin Sander added a comment -

          I don't know if this is related, but while investigating this, I think I found a resource leak.
          After running a few of those builds (more than 20), I have about 70 threads stuck here:

          pool-317-thread-1
          
          "pool-317-thread-1" Id=2820 Group=main TIMED_WAITING on java.io.PipedInputStream@2863d70
          	at java.lang.Object.wait(Native Method)
          	-  waiting on java.io.PipedInputStream@2863d70
          	at java.io.PipedInputStream.read(PipedInputStream.java:326)
          	at java.io.PipedInputStream.read(PipedInputStream.java:377)
          	at java.io.InputStream.read(InputStream.java:101)
          	at io.fabric8.kubernetes.client.utils.InputStreamPumper.run(InputStreamPumper.java:57)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:748)
          
          	Number of locked synchronizers = 1
          	- java.util.concurrent.ThreadPoolExecutor$Worker@783e3cd1
          

          I.e. seems that ExecWebSocketListener does not properly clean up its pumper...

          csanchez, iocanel:
          Do you think this is related or should I open a new ticket for that?


          Ioannis Canellos added a comment -

          0x89 The thread leak has been fixed as part of https://github.com/jenkinsci/kubernetes-plugin/pull/177 and should be part of the 0.12 release.


          Martin Sander added a comment -

          iocanel: It isn't, I see it both with 0.12 and with the current master.


          Ioannis Canellos added a comment -

          That's weird, it did solve the issue for me.

          Before that commit, our Jenkins would choke due to this thread leak every few hours, and now it's doing great.

          Will need to check if there are more causes of this.


          Martin Sander added a comment -

          iocanel:
          It looks like I have made a bit of progress (if I am not completely mistaken).

          Are you aware that the DecoratedLauncher returned by decorate is used by Jenkins not only to execute the original command(s) (sleep in the above example), but also re-used to check if the process is still running?

          I.e. launch is called several times, with executions possibly (or maybe even certainly) overlapping.
          So my current assumption is that it is not safe to use members of the wrapping ContainerExecDecorator inside the DecoratedLauncher, especially launcher, watch, and proc.

          I will try to validate this assumption and might send you a (probably crude) pull request.
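
          To illustrate the concern (a hypothetical Groovy sketch, not the plugin's code): if the current watch/proc lives in a field of a shared object, an overlapping launch from the liveness check can clobber the one belonging to the still-running step.

          import java.util.concurrent.Executors

          class SharedStateLauncher {
              def watch   // state of the "current" exec, shared by all launches

              def launch(String cmd) {
                  // a later launch replaces the watch of a step that is still running
                  watch = [cmd: cmd]
                  return watch
              }
          }

          def launcher = new SharedStateLauncher()
          def pool = Executors.newFixedThreadPool(2)
          pool.execute { launcher.launch('sleep 200') }        // the user's durable task
          pool.execute { launcher.launch('ps -o pid,args') }   // the periodic liveness check
          pool.shutdown()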


          Martin Sander added a comment -

          I did some quite extensive testing yesterday, and I was able to get rid of the resource leak (I think).

          Pull request here: https://github.com/jenkinsci/kubernetes-plugin/pull/180. I recommend also viewing it with whitespace changes ignored.

          I don't expect you to merge it like that, but would be happy to get feedback.

          Unfortunately, it does not completely get rid of the "pipe not connected" errors, but

          • it seems to fix the resource leak
          • the "pipe not connected" error seems to fail the build much less often
          • it seems that most of the time it comes from org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep, which
            • runs ps every ten seconds or so to check if the process is still alive
            • just prints a single error to the build log, even if that check fails multiple times (set the logger for that class to FINE to see all failures; see the snippet below)
            • luckily does not fail the build if one of those checks fails
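
          For reference, the FINE logging mentioned above can be enabled from the Jenkins script console (plain java.util.logging, nothing plugin-specific); you also need a log recorder at FINE for that logger under Manage Jenkins > System Log to actually see the messages:

          import java.util.logging.Level
          import java.util.logging.Logger

          // show every failed liveness check from the durable-task step
          Logger.getLogger('org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep')
                .setLevel(Level.FINE)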


          Ioannis Canellos added a comment -

          0x89 Your assumption (that the decorator is called multiple times) is valid and is aligned with what I've seen so far.

          That was where https://github.com/jenkinsci/kubernetes-plugin/pull/177 was aiming (to close() the listeners opened by the liveness checks).

          But it seems that this is affecting us in more ways, and I feel you are on the right track. Let me review your pull request and I'll get back to you.


          Martin Sander added a comment - - edited

          iocanel:

          I might be on the right track, but I think I didn't go far enough.

          It is actually not only the Decorator that is reused; even the Launcher is used more than once, i.e. launch is called more than once.
          I will verify this and probably issue another pull request from a different branch tomorrow.


          Martin Sander added a comment -

          New pull request: https://github.com/jenkinsci/kubernetes-plugin/pull/182.

          Jesse Redl added a comment -

          Thanks for the fix, we've re-enabled our multi-container workflows within the Jenkins Kubernetes plugin after upgrading to the most recent release!


          Andras Kovi added a comment -

          We started seeing this issue again: exceptions.txt

          Jenkins ver. 2.107.3, kubernetes-plugin:1.10.1

          The wait for the started latch is interrupted at org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java:328.
          How is this possible? What is interrupting it, and what parameters may need to be tweaked to get it working?

          The error happens when we spawn a relatively large number, about 25 parallel executions.


          Claes Buckwalter added a comment -

          akovi can you share a simple Pipeline script that reproduces the problem?


          Andras Kovi added a comment -

          Seems like the 'Max connections to Kubernetes API' parameter was set to a very low number, causing this error.

          So, for the record: if one encounters this issue, the 'Max connections to Kubernetes API' config parameter should be increased.

          For planning purposes it would still be good to know the relation between this parameter and the possible number of parallel executions in a pipeline.
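
          For anyone who wants to inspect or raise it without the UI, something like this from the script console should work (the setMaxRequestsPerHostStr setter name is an assumption based on the plugin source around these versions; verify it against your installed plugin):

          import jenkins.model.Jenkins
          import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

          Jenkins.instance.clouds.getAll(KubernetesCloud).each { cloud ->
              // corresponds to 'Max connections to Kubernetes API' in the cloud configuration
              cloud.setMaxRequestsPerHostStr('64')
          }
          Jenkins.instance.save()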

           


          Raviteja A added a comment -

          Increasing the 'Max connections to Kubernetes API' parameter didn't resolve the issue on its own. We had to restart the master; after that, things started working.


            Assignee: Carlos Sanchez (csanchez)
            Reporter: Steven Oud (soud)
            Votes: 20
            Watchers: 38