
"Pipe not connected" errors when running multiple builds simultaneously

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: Jenkins 2.60
      Kubernetes plugin 0.12
      Kubernetes 1.5.1 on GKE
      Kubernetes 1.7.0 on AWS

      Hi there,

      We have Jenkins running in Kubernetes with the Kubernetes plugin, and have been experiencing `java.io.IOException: Pipe not connected` errors when running multiple builds simultaneously. This seems to consistently happen when we run 8 or more builds (on the same pipeline). About 50% of the builds will succeed, and the other 50% will fail with the `Pipe not connected` exception. Most of the time it will fail at stage 1, but sometimes at stage 2.

      We're using the following pipeline:

      podTemplate(label: 'mypod', containers: [
        containerTemplate(name: 'debian', image: 'debian', ttyEnabled: true, command: 'cat'),
        containerTemplate(name: 'ubuntu', image: 'ubuntu', ttyEnabled: true, command: 'cat')
      ]) {
        node('mypod') {
          container('debian') {
            stage('stage 1') {
              sh 'echo hello'
              sh 'sleep 30'
              sh 'echo world'
            }
      
            stage('stage 2') {
              sh 'echo hello'
              sh 'sleep 30'
              sh 'echo world'
            }
          }
        }
      }
      

      And this is the log of such a failed build:

      [Pipeline] podTemplate
      [Pipeline] {
      [Pipeline] node
      Still waiting to schedule task
      Waiting for next available executor on mypod
      Running on kubernetes-a0e59102b59b48ad99693ca32b94ab38-11a5bcd7df12e4 in /home/jenkins/workspace/kubernetes-test-3
      [Pipeline] {
      [Pipeline] container
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (stage 1)
      [Pipeline] sh
      [kubernetes-test-3] Running shell script
      Executing shell script inside container [debian] of pod [kubernetes-a0e59102b59b48ad99693ca32b94ab38-11a5bcd7df12e4]
      Executing command: sh -c echo $$ > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/pid'; jsc=durable-7534cabf595ac7f32ca72b4db83e0af1; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/script.sh' > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/jenkins-result.txt' 
      # cd /home/jenkins/workspace/kubernetes-test-3
      sh -c echo $$ > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/pid'; jsc=durable-7534cabf595ac7f32ca72b4db83e0af1; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/script.sh' > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-f201019b/jenkins-result.txt' 
      exit
      # # + echo hello
      hello
      [Pipeline] sh
      [kubernetes-test-3] Running shell script
      Executing shell script inside container [debian] of pod [kubernetes-a0e59102b59b48ad99693ca32b94ab38-11a5bcd7df12e4]
      Executing command: sh -c echo $$ > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/pid'; jsc=durable-7534cabf595ac7f32ca72b4db83e0af1; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/script.sh' > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/jenkins-result.txt' 
      # cd /home/jenkins/workspace/kubernetes-test-3
      sh -c echo $$ > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/pid'; jsc=durable-7534cabf595ac7f32ca72b4db83e0af1; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/script.sh' > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/kubernetes-test-3@tmp/durable-0eb192c0/jenkins-result.txt' 
      exit
      # + sleep 30
      # [Pipeline] sh
      [kubernetes-test-3] Running shell script
      Executing shell script inside container [debian] of pod [kubernetes-a0e59102b59b48ad99693ca32b94ab38-11a5bcd7df12e4]
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // container
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] }
      [Pipeline] // podTemplate
      [Pipeline] End of Pipeline
      java.io.IOException: Pipe not connected
      	at java.io.PipedOutputStream.write(PipedOutputStream.java:140)
      	at java.io.OutputStream.write(OutputStream.java:75)
      	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:125)
      	at hudson.Launcher$ProcStarter.start(Launcher.java:384)
      	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:147)
      	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:61)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:158)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:184)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:126)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
      	at groovy.lang.GroovyObject$invokeMethod.call(Unknown Source)
      	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:151)
      	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:21)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:115)
      	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:149)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:146)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:123)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:123)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:123)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:123)
      	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:16)
      	at WorkflowScript.run(WorkflowScript:10)
      	at ___cps.transform___(Native Method)
      	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
      	at sun.reflect.GeneratedMethodAccessor521.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
      	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
      	at com.cloudbees.groovy.cps.Next.step(Next.java:58)
      	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:154)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:33)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:30)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:30)
      	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:163)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:324)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:78)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:236)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:224)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:63)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Finished: FAILURE
      

      Something seems to be going wrong around https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java#L125.

          [JENKINS-40825] "Pipe not connected" errors when running multiple builds simultaneously

          Mike Splain added a comment -

          We're seeing this as well with the same scenario.


          Lars Lawoko added a comment -

          Still happening for us.
          Could this be related to "java.io.Piped*Stream are not threads friendly and cause 'Pipe is broken' issue when jenkins pool the writing threads"? https://issues.jenkins-ci.org/browse/JENKINS-23958?focusedCommentId=228900&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-228900


          Lars Lawoko added a comment -

          We have been having this issue and did an investigation. It seems to fail on the pipe write for the first "cd" command used to enter the workspace. From what we understand, the underlying kubernetes-client library does not expose a stable connection, and the latch appears to be implemented incorrectly (we think). The latch is initialized once in the `container` step's "start" method, but a new connection is opened for each `sh` step. This causes "waitQuietly" to pass through straight away, without waiting, on all sh connections but the first. Once we fixed that, the websocket's "OnOpen" callback was never called, indicating the connection is not made, and the latch was only released on thread interrupt.
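
          To illustrate (a hypothetical Groovy sketch of the suspected pattern, not the plugin's actual code; all names are made up):

          import java.util.concurrent.CountDownLatch
          import java.util.concurrent.TimeUnit

          class ContainerExecSketch {
              // created once when the container step starts; counted down by the
              // first connection's onOpen() callback
              private final CountDownLatch started = new CountDownLatch(1)

              void onOpen() {
                  started.countDown()
              }

              void launch(Closure writeToPipe) {
                  // every sh step opens a NEW websocket but awaits the SAME latch;
                  // after the first countDown() this returns immediately, so later
                  // steps can write before their own connection is open
                  // -> java.io.IOException: Pipe not connected
                  started.await(30, TimeUnit.SECONDS)
                  writeToPipe()
              }
          }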

          At this stage we didn't dig deeper, but hopefully this is a good starting point. Instead we switched to embedding jnlp into our main container using the "jnlp" container name "hack" to work around this.

          TL;DR: Embedding jnlp into your main container using the "jnlp" container name "hack" works around this issue.


          gilgamez added a comment -

          larslawoko Can you explain your workaround, perhaps with an example pipeline snippet? I'm experiencing this issue too and have tried to apply your workaround by using our previous heavy-weight build image with the name 'jnlp' as a containerTemplate, but that doesn't seem to work.


          Jean Mertz added a comment -

          I'm also interested in this workaround, to see if it fits our setup. Of course I'm also hoping for a permanent solution.


          Lars Lawoko added a comment - - edited

          Essentially the bug is in the "container" step, so if you avoid it, this bug should be worked around. Of course, if you rely on the shared disk of one pod with multiple containers, this won't work.

          Instead of having

           

           podTemplate(label: 'mypod', containers: [
             containerTemplate(name: 'debian', image: 'debian', ttyEnabled: true, command: 'cat')
           ]) {
             node('mypod') {
               container('debian') {
                 stage('stage 1') {
                   sh 'echo hello'
                   sh 'sleep 30'
                   sh 'echo world'
                 }

                 stage('stage 2') {
                   sh 'echo hello'
                   sh 'sleep 30'
                   sh 'echo world'
                 }
               }
             }
           }

          (pseudocode, might need more tweaking)

          have this (note the new image, the name is jnlp (see https://issues.jenkins-ci.org/browse/JENKINS-40847), and no container step):

           

           podTemplate(label: 'mypod', containers: [
             containerTemplate(name: 'jnlp', image: 'custom/debian-with-jnlp'),
           ]) {
             node('mypod') {
               stage('stage 1') {
                 sh 'echo hello'
                 sh 'sleep 30'
                 sh 'echo world'
               }

               stage('stage 2') {
                 sh 'echo hello'
                 sh 'sleep 30'
                 sh 'echo world'
               }
             }
           }
            

          Not at work now, but if someone needs a more in-depth walkthrough, comment here.


          Jesse Redl added a comment -

          larslawoko thanks so much for posting the workaround. I was about to throw in the towel on Jenkins today and this saved me.

          For reference to others, rather than pulling directly from the official jenkinsci images, I brought in the relevant bits from the published Dockerfiles:

          https://github.com/jenkinsci/docker-slave
          https://github.com/jenkinsci/docker-jnlp-slave

          Another win for this workaround is that by dropping the container block you also cut out all of the noise being generated from: https://issues.jenkins-ci.org/browse/JENKINS-42048

          Jenkins 2.51
          Kubernetes plugin 0.11 
          Kubernetes 1.5.3 on GKE

           


          Gytis Ramanauskas added a comment -

          Steven Oud's proposed workaround does not seem to work; it runs multiple retries and eventually fails.

          Using only a single container is kind of difficult for us. Any more quick fixes?

           


          Jean Mertz added a comment -

          I'm interested to hear csanchez' thoughts about this. I'd consider this a critical/blocking issue for serious/heavy use of this plugin, but am not familiar enough with the codebase to know where to start debugging this. The unpredictability of this error also makes it more difficult to pinpoint.


          Jean Mertz added a comment -

          For those interested, we've worked around this issue for all sh commands by using the default JNLP connections, and then tunnelling the command to the right container. Something like this:

           

          def ksh(command) {
            if (env.CONTAINER_NAME) {
              if ((command instanceof String) || (command instanceof GString)) {
                command = kubectl(command)
              }
          
              if (command instanceof LinkedHashMap) {
                command["script"] = kubectl(command["script"])
              }
            }
          
            sh(command)
          }
          
          def kubectl(command) {
            "kubectl exec -i ${env.HOSTNAME} -c ${env.CONTAINER_NAME} -- /bin/sh -c 'cd ${env.WORKSPACE} && ${command}'"
          }
          
          def customContainer(String name, Closure body) {
            withEnv(["CONTAINER_NAME=$name"]) {
              body()
            }
          }
          

           

          This way, you can do something like:

           

          node('my-pod') {
            customContainer('container-1') {
              ksh 'echo hello world'
              ref = ksh returnStdout: true, script: 'git rev-parse --short HEAD'
            }
          }
          

           

           

          You do need a custom JNLP container with kubectl for this to work, which we built using this:

          FROM jenkinsci/jnlp-slave:2.62-alpine
          
          USER root
          ADD https://storage.googleapis.com/kubernetes-release/release/v1.6.1/bin/linux/amd64/kubectl /usr/local/bin/kubectl
          RUN chmod +x /usr/local/bin/kubectl
          USER jenkins
          

          This only works for commands using sh, so for example checkout(scm) doesn't work. But your home folder is shared across containers using a volume anyway, so you can simply do the checkout in the JNLP container and have the files available in the other containers as well.
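
          For example, something like this (a sketch that reuses the ksh/customContainer helpers above; the step contents are just placeholders):

          node('my-pod') {
            // the checkout runs in the default jnlp container; the workspace volume is
            // shared, so the sources are visible to the other containers as well
            checkout scm

            customContainer('container-1') {
              ksh 'make test'
            }
          }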

          I hope this helps anyone, until there is a proper fix in place.


          Carlos Sanchez added a comment -

          I think this is caused because several assumptions were made in the multiple container execution model, particularly in concurrent executions. I'm fixing JENKINS-42048 and have a better understanding now. It's a matter of time to get through all the issues

          In this particular case I guess the same Jenkins agent is reused for the concurrent builds, and the plugin does not consider this case, killing connections from different executions


          Joan Goyeau added a comment - - edited

          Hi,

          This Pipe not connected issue is failing half of our builds.
          This makes the Kubernetes plugin quite unusable. I'm interested in how CloudBees is managing this.
          We are happy to help fix the bug, but we have no idea where to start. csanchez, could we be of any help?

          Cheers


          Carlos Sanchez added a comment -

          The container idiom is an alpha feature unique to Kubernetes, and it still has some glitches to be fixed, in progress in JENKINS-42048.

          If you use agents in Kubernetes without executing into other containers, it works as expected.


          Ioannis Canellos added a comment -

          csanchez shouldn't the OnceRetentionStrategy, which seems to be the default for the kubernetes-plugin, prevent the node from being reused by other jobs?

           


          Carlos Sanchez added a comment -

          No, IIRC even with OnceRetentionStrategy there is some time (seconds) during which the agent can receive more work.

          More info in https://wiki.jenkins-ci.org/display/JENKINS/One-Shot+Executor

          For a true one-job-per-agent setup we'd need to integrate with the one-shot plugin, which is on my roadmap; I just lack the time.


          Ioannis Canellos added a comment -

          What if you added the build number as part of the label? Would that prevent the reuse?

           

          def label = "${env.JOB_NAME}.${env.BUILD_NUMBER}".replace('-', '_').replace('/', '_')
          podTemplate(label: "$label" ...) {
              node("$label") {
                 //do stuff
              }
          }

           


          Joan Goyeau added a comment - - edited

          iocanel in my case I put a random UUID and it's the same.


          Jon Whitcraft added a comment -

          We are seeing them randomly as well, even on jobs that only run a podTemplate with a single container + the jnlp container.


          James Rawlings added a comment -

          I seem to be bouncing the Jenkins master pod around 2 times per day after getting the "Pipe not connected" error.

          FWIW we use the GitHub org plugin and can have 5-15 open PR jobs on our repos. I also tried disabling concurrent builds in the Jenkinsfile to see if that helped, but it didn't.

          node {
            properties([
              disableConcurrentBuilds()
            ])
          }
          

           


          Ioannis Canellos added a comment -

          csanchez I am not sure if the issue is related to the pod being reused. I have been using a version derived from the current master for a while, and I haven't hit the issue, even when a lot of concurrent builds take place. Is there any chance that the root cause is something else? (e.g. a client bug that might have been fixed in later versions?)

          If you are confident that the way to go is the `one shot executor`, I would like to volunteer, if you have any pointers (existing docs are really limited).

          jrawlings: I assume that you are using a forked version of the plugin, right?


          Carlos Sanchez added a comment -

          It is due to the container step, and I thought it was mixed up with the container reuse, but maybe not.

          JENKINS-42048 is proving to be more convoluted than I thought; there are a lot of assumptions that Pipeline makes about the execution of commands. I will try to find some time with one of the core devs there to fix it.


          Jean Mertz added a comment -

          I can confirm that the latest master does not solve this problem, and we already have our configuration tweaked in such a way that all our containers are truly one-shot. It does indeed only happen when using the `container` step (and as I posted a couple of comments above, we have a workaround, using sh/kubectl instead of the custom pipeline step).


          Ioannis Canellos added a comment -

          jeanmertz: Thanks for the feedback!

          It seems really weird that I can't reproduce the issue myself. Can you bump the kubernetes-client version to 2.3.1?


          Lars Lawoko added a comment - - edited

          We haven't used this method in a while, so it might be fixed like you mentioned. But when we investigated, the error was definitely happening due to a websocket (jnlp container to secondary container) issue. It seems to be triggered by a new sh step not getting a new websocket connection or recovering the existing one.

           

          Only 50% sure about this now, but hopefully it's a jumping off point.


          Carlos Sanchez added a comment -

          I believe this is caused by JENKINS-42048; once that's merged we can try again.


          Corey O'Brien added a comment -

          I added a comment to the PR for JENKINS-42048: https://github.com/jenkinsci/kubernetes-plugin/pull/157#issuecomment-310416604

          Running that PR-157 code seems to reduce the frequency of the Pipe not connected errors, but doesn't seem to remove them entirely.


          Brian Wallace added a comment - - edited

          I rebuilt the kubernetes-plugin from SHA 050e559 and continue to see the Pipe Not Connected error. Stack trace is in the PR. We are running Jenkins 2.46.2-alpine.

          https://github.com/jenkinsci/kubernetes-plugin/pull/157#issuecomment-316504564

          UPDATE:  We saw this on v0.11 as well.  That is why I tried building the plugin from master at SHA 050e559.


          Bruce Bradley added a comment -

          I ran into this today and began investigating. I can reproduce the error reliably at will. Running 6 parallel tasks, each spinning up a pod template made up of three containers, I always see one of the pods fail outright as soon as the sh step begins. Here's what I know:

          • Everything appears to be fine launching the pods
          • Jenkins debug logging shows a successful HTTP GET request on the doomed pod, and this occurs during the same second that the shell command attempts to run (I haven't checked the code but I suspect that the shell command attempts to execute just after a successful pod get request)
          • Almost exactly one minute after the doomed pod comes up (the HTTP request comes back successful and we attempt to run sh) I see the "Error while pumping stream" message in the Jenkins server log:

           

          Jul 19, 2017 4:21:42 PM INFO okhttp3.internal.platform.Platform log
          <-- END HTTP (7866-byte body)
          Jul 19, 2017 4:22:40 PM SEVERE io.fabric8.kubernetes.client.utils.InputStreamPumper run
          Error while pumping stream.
          java.io.IOException: Pipe broken
           at java.io.PipedInputStream.read(PipedInputStream.java:321)
           at java.io.PipedInputStream.read(PipedInputStream.java:377)
           at java.io.InputStream.read(InputStream.java:101)
           at io.fabric8.kubernetes.client.utils.InputStreamPumper.run(InputStreamPumper.java:57)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:745)
          Jul 19, 2017 4:22:41 PM SEVERE io.fabric8.kubernetes.client.utils.InputStreamPumper run
          Error while pumping stream.
          java.io.IOException: Pipe broken
           at java.io.PipedInputStream.read(PipedInputStream.java:321)
           at java.io.PipedInputStream.read(PipedInputStream.java:377)
           at java.io.InputStream.read(InputStream.java:101)
           at io.fabric8.kubernetes.client.utils.InputStreamPumper.run(InputStreamPumper.java:57)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:745)
          

          Following that, it seems like there's a 5-minute timeout because almost exactly five minutes afterward Jenkins terminates the slave because the task is supposedly "complete".

          This probably doesn't help Carlos since he seems to have already narrowed down the root of the problem, but in the off chance that this does help I'm happy to provide.

          For what it's worth, we're running Jenkins 2.46.3-alpine with the Kubernetes plugin compiled @ b266a49e (we needed port forwarding functionality badly). I just tried running master today but ran into a nasty permissions error regarding a nohup file so I reverted to our previous plugin.

           


          Michael Andrews added a comment - - edited

          We have the exact same issue with multiple K8s agents running. We have a long running shell step in a container in a podTemplate. The step starts... and we eventually get a broken pipe log message BUT the step is still running. We then see the step return SUCCESS! But then we try to switch containers to run the NEXT shell step -  which hangs for 5mins and the job breaks because of the broken pipe.  We built the master branch locally to get all the newest bug fixes (which are great!). And we're using Jenkins 2.60.1. Really need this fixed. 


          Martin Sander added a comment -

          Happens on 0.11 as well.


          Martin Sander added a comment -

          I reproduced this with a good ol' debugger connected, and this is what I found out:

          • This happens while waiting here.
          • When it happens, alive is false
          • I.e. either onClose or onFailure have been called
          • for onFailure, we should be able to see a stacktrace somewhere

          I will keep investigating..


          Martin Sander added a comment -

          Seems that it also happens with alive being true. So no new information here, I guess.


          Michael Andrews added a comment -

          0x89 - Thank you for digging into this. This issue is killing us.


          Martin Sander added a comment -

          killdash9: Unfortunately, I did not find anything conclusive yet (probably because I am just starting to get familiar with the source code), but I will go on. The master branch already has a bit of additional logging added that might help pinpoint this issue.

          Have you tried building the plugin yourself from master to check if it improves the situation? I know it doesn't fix the issue completely, but you may get hit less often.


          Martin Sander added a comment - - edited

          Btw, maybe useful information for everyone here:

          This is one of the scripts I use to reproduce this:

          def label = env.BUILD_TAG.drop(env.BUILD_TAG.length() - 63)
          podTemplate(
                  label: label,
                  containers: [
                          containerTemplate(name: 'mvn', image: 'maven', ttyEnabled: true, command: 'cat'),
                  ],
          ) {
              node(label) {
                  container('mvn') {
                      sh "sleep 200"
                  }
              }
          }
          

          The interesting part is that I was not able to reproduce this with sleep values up to 150 seconds, but 200 triggers this.

          Was able to reproduce with 120 seconds as well, so no progress here.


          Michael Andrews added a comment -

          0x89 - we built from master a few weeks ago. Might be time to do it again. We also use sleep in our pipeline shell steps. But I also see the broken pipe on long-running shell commands.


          Michael Andrews added a comment -

          0x89 There are several commits to deal with agent connection, read, and idle timeouts. Read timeout seems to default to 100s. https://github.com/jenkinsci/kubernetes-plugin/blob/master/CHANGELOG.md


          Michael Andrews added a comment -

          0x89 actually that's for the connection:

          private static final int DEFAULT_SLAVE_JENKINS_CONNECTION_TIMEOUT = 100;


          Michael Andrews added a comment - - edited

          So our magic number is 4. That's the number of agents I can run simultaneously. If I start a 5th one...they all get the broken pipe. We are using the 0.12 release.


          Martin Sander added a comment -

          I don't know if this is related, but while investigating this, I think I found a resource leak.
          After running a few of those builds (more than 20), I have about 70 threads stuck here:

          pool-317-thread-1
          
          "pool-317-thread-1" Id=2820 Group=main TIMED_WAITING on java.io.PipedInputStream@2863d70
          	at java.lang.Object.wait(Native Method)
          	-  waiting on java.io.PipedInputStream@2863d70
          	at java.io.PipedInputStream.read(PipedInputStream.java:326)
          	at java.io.PipedInputStream.read(PipedInputStream.java:377)
          	at java.io.InputStream.read(InputStream.java:101)
          	at io.fabric8.kubernetes.client.utils.InputStreamPumper.run(InputStreamPumper.java:57)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:748)
          
          	Number of locked synchronizers = 1
          	- java.util.concurrent.ThreadPoolExecutor$Worker@783e3cd1
          

          I.e. seems that ExecWebSocketListener does not properly clean up its pumper...

          csanchez, iocanel:
          Do you think this is related or should I open a new ticket for that?


          Ioannis Canellos added a comment -

          0x89 The thread leak has been fixed as part of https://github.com/jenkinsci/kubernetes-plugin/pull/177 and should be part of the 0.12 release.


          Martin Sander added a comment -

          iocanel: It isn't, I see it both with 0.12 and with the current master.


          Ioannis Canellos added a comment -

          That's weird, it did solve the issue for me.

          Before that commit, our Jenkins would choke due to this thread leak every few hours, and now it's doing great.

          Will need to check if there are more causes of this.


          Martin Sander added a comment -

          iocanel:
          It looks like I have made a bit of progress (if I am not completely mistaken).

          Are you aware that the DecoratedLauncher returned by decorate is used by Jenkins not only to execute the original command(s) (sleep in the above example), but also re-used to check if the process is still running?

          I.e. launch is called several times, with executions possibly (or maybe even certainly) overlapping.
          So my current assumption is that it is not safe to use members of the wrapping ContainerExecDecorator inside the DecoratedLauncher, especially launcher, watch, and proc.

          I will try to validate this assumption and might send you a (probably crude) pull request.
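
          To illustrate the concern (a hypothetical Groovy sketch, not the plugin's code): if the current watch/proc lives in a field of a shared object, an overlapping launch from the liveness check can clobber the one belonging to the still-running step.

          import java.util.concurrent.Executors

          class SharedStateLauncher {
              def watch   // state of the "current" exec, shared by all launches

              def launch(String cmd) {
                  // a later launch replaces the watch of a step that is still running
                  watch = [cmd: cmd]
                  return watch
              }
          }

          def launcher = new SharedStateLauncher()
          def pool = Executors.newFixedThreadPool(2)
          pool.execute { launcher.launch('sleep 200') }        // the user's durable task
          pool.execute { launcher.launch('ps -o pid,args') }   // the periodic liveness check
          pool.shutdown()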


          Martin Sander added a comment -

          I did some quite extensive testing yesterday, and I was able to get rid of the resource leak (I think).

          Pull request here: https://github.com/jenkinsci/kubernetes-plugin/pull/180. I recommend also viewing it with whitespace changes ignored.

          I don't expect you to merge it like that, but would be happy to get feedback.

          Unfortunately, it does not completely get rid of the "pipe not connected" errors, but

          • it seems to fix the resource leak
          • the "pipe not connected" error seems to fail the build much less often
          • it seems that most of the time it comes from org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep, which
            • runs ps every ten seconds or so to check if the process is still alive
            • just prints a single error to the build log, even if that check fails multiple times (set the logger for that class to FINE to see all failures; see the snippet below)
            • luckily does not fail the build if one of those checks fails
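
          For reference, the FINE logging mentioned above can be enabled from the Jenkins script console (plain java.util.logging, nothing plugin-specific); you also need a log recorder at FINE for that logger under Manage Jenkins > System Log to actually see the messages:

          import java.util.logging.Level
          import java.util.logging.Logger

          // show every failed liveness check from the durable-task step
          Logger.getLogger('org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep')
                .setLevel(Level.FINE)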


          Ioannis Canellos added a comment -

          0x89 Your assumption (that the decorator is called multiple times) is valid and is aligned with what I've seen so far.

          That was where https://github.com/jenkinsci/kubernetes-plugin/pull/177 was aiming (to close() the listeners opened by the liveness checks).

          But it seems that this is affecting us in more ways, and I feel you are on the right track. Let me review your pull request and I'll get back to you.


          Martin Sander added a comment - - edited

          iocanel:

          I might be on the right track, but I think I didn't go far enough.

          It is actually not only the Decorator that is reused; even the Launcher is used more than once, i.e. launch is called more than once.
          I will verify this and probably issue another pull request from a different branch tomorrow.


          Martin Sander added a comment -

          New pull request: https://github.com/jenkinsci/kubernetes-plugin/pull/182.

          Jesse Redl added a comment -

          Thanks for the fix, we've re-enabled our multi-container workflows within the Jenkins Kubernetes plugin after upgrading to the most recent release!


          Andras Kovi added a comment -

          We started seeing this issue again: exceptions.txt

          Jenkins ver. 2.107.3, kubernetes-plugin:1.10.1

          The wait for the started latch is interrupted at org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java:328.
          How is this possible? What is interrupting it, and what parameters may need to be tweaked to get it working?

          The error happens when we spawn a relatively large number, about 25 parallel executions.


          Claes Buckwalter added a comment -

          akovi can you share a simple Pipeline script that reproduces the problem?


          Andras Kovi added a comment -

          Seems like the 'Max connections to Kubernetes API' parameter was set to a very low number, causing this error.

          So, for the record: if one encounters this issue, the 'Max connections to Kubernetes API' config parameter should be increased.

          For planning purposes it would still be good to know the relation between this parameter and the possible number of parallel executions in a pipeline.
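
          For anyone who wants to inspect or raise it without the UI, something like this from the script console should work (the setMaxRequestsPerHostStr setter name is an assumption based on the plugin source around these versions; verify it against your installed plugin):

          import jenkins.model.Jenkins
          import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

          Jenkins.instance.clouds.getAll(KubernetesCloud).each { cloud ->
              // corresponds to 'Max connections to Kubernetes API' in the cloud configuration
              cloud.setMaxRequestsPerHostStr('64')
          }
          Jenkins.instance.save()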

           


          Raviteja A added a comment -

          Increasing the 'Max connections to Kubernetes API' parameter didn't resolve the issue on its own. We had to restart the master; after that, things started working.


            Assignee: Carlos Sanchez (csanchez)
            Reporter: Steven Oud (soud)
            Votes: 20
            Watchers: 38