[JENKINS-67167] In a Kubernetes pod, sh steps inside container() are failing sporadically

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: kubernetes-plugin
    • Environment:
      Jenkins 2.303.3
      Kubernetes plugin 1.30.6
      Durable Task Plugin 1.39
      jnlp via jenkins/inbound-agent:4.11-1-alpine-jdk8

      The issue is reproducible using the attached pipeline jnlpcontainer_tests.groovy.

      Description of the test (a rough sketch of the pipeline follows the list):

      • running inside a k8s pod, with multiple containers
        • a jnlp container
        • a build container
      • the pipeline starts 3 parallel branches
        • jnlp branch - runs sh inside container('jnlp'){}
        • build branch - runs sh inside container('build'){} ('build' is the name of the second container in the pod)
        • noContainer() branch  – runs sh outside any container(){} closure
      • in each of the parallel branches a simple sh call is executed
      • in the jnlp and build branches sh is called inside a container() closure
        • in these 2 branches sh is failing sporadically
      • in the noContainer branch sh is called outside any container() closure
        • not a single failure was observed in this branch across all the runs I started
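
      A minimal scripted-pipeline sketch of what the attached jnlpcontainer_tests.groovy does. The jnlp image and the 100 iterations per branch are taken from this report; the 'build' image, the echo command and the loop structure are assumptions, so the attached file may differ in detail:

      // hypothetical reconstruction of jnlpcontainer_tests.groovy
      podTemplate(containers: [
          containerTemplate(name: 'jnlp', image: 'jenkins/inbound-agent:4.11-1-alpine-jdk8'),
          containerTemplate(name: 'build', image: 'alpine', ttyEnabled: true, command: 'cat')  // assumed image
      ]) {
          node(POD_LABEL) {
              parallel(
                  jnlp: {
                      for (int i = 0; i < 100; i++) {
                          container('jnlp') { sh 'echo test' }    // fails sporadically
                      }
                  },
                  build: {
                      for (int i = 0; i < 100; i++) {
                          container('build') { sh 'echo test' }   // fails sporadically
                      }
                  },
                  noContainer: {
                      for (int i = 0; i < 100; i++) {
                          sh 'echo test'                           // no failures observed
                      }
                  }
              )
          }
      }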

      Mainly two exceptions were thrown:

      [2021-11-18T10:49:57.920Z] java.io.EOFException
      [2021-11-18T10:49:57.921Z] 	at okio.RealBufferedSource.require(RealBufferedSource.java:61)
      [2021-11-18T10:49:57.921Z] 	at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
      [2021-11-18T10:49:57.921Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      [2021-11-18T10:49:57.921Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      [2021-11-18T10:49:57.921Z] 	at java.lang.Thread.run(Thread.java:748)
      [2021-11-18T10:49:57.921Z] ERROR: Process exited immediately after creation. See output below
      [2021-11-18T10:49:57.921Z] Executing sh script inside container jnlp of pod test-multiplecontainers-in-node-5d914e4e-3023-4bf0-845d-2-pcxs5
      [2021-11-18T10:49:57.921Z] 
      Process exited immediately after creation. Check logs above for more details.
      

      and

      [2021-11-18T10:49:58.203Z] java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
      [2021-11-18T10:49:58.205Z] 	at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
      [2021-11-18T10:49:58.205Z] 	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
      [2021-11-18T10:49:58.205Z] 	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
      [2021-11-18T10:49:58.205Z] 	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
      [2021-11-18T10:49:58.205Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      [2021-11-18T10:49:58.205Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      [2021-11-18T10:49:58.205Z] 	at java.lang.Thread.run(Thread.java:748)
      io.fabric8.kubernetes.client.KubernetesClientException: error dialing backend: dial tcp 192.168.3.11:10250: connect: connection refused
      
      • NOTE: the test consists of 100 iterations per branch, all executed in the same agent pod. So if we get a KubernetesClientException with a "connection refused" error and retry on the same container, it eventually works again; see the retry sketch below and the comments that follow.
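
          A possible mitigation sketch based on that observation (not a fix; the retry count and the echo command are arbitrary placeholders):

          // re-run the sh step when the exec connection to the container fails
          container('build') {
              retry(3) {
                  sh 'echo test'
              }
          }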


          Jesse Glick added a comment -

          Possible duplicate of JENKINS-59652. The implementation of the container step is known to be poor and due for a rewrite.


          Yacine added a comment - edited

          Hi jglick, thanks for your reply.

          I don't think that in this particular case the pod was evicted or stopped, or even that the connection to it was lost, and this is why:

          • as you can see in jnlpcontainer_tests.groovy, the 3rd parallel branch executes the same sh call but not wrapped in a container() closure. This branch did not have a single sh-step failure
            • without the container() closure, the commands are executed in the jnlp container, the same jnlp container where we get issues if we explicitly select it with container('jnlp')
          • the pod was alive and reachable without issues for whatever was executed outside container()
          • this affects only sh calls

          so this is problematic:

          container('jnlp') {
              sh("echo test")
          }
          

          and this is not:

          // no container(){} closure
          sh("echo test")
          

          while running at the same time on the same pod (and in the same container as well).

          This forces us to use one (custom) jnlp container to run the pipeline, which is kind of against the recommendation in the docs (https://github.com/jenkinsci/kubernetes-plugin#configuration): "We do not recommend overriding the jnlp container except under unusual circumstances."

          I am not sure how I can debug further to identify a possible workaround. Any hints?


          Jesse Glick added a comment -

          I do not know of any workaround beyond avoiding container.


          Yacine added a comment - edited

          Hi jglick,

          In a scripted pipeline: until the container() step is fixed/refactored, would it be possible to somehow select a container other than jnlp as the default for the execution of sh steps, so that we don't have to use container()?

          so, if in a pod we have:

          • a 'jnlp' container
          • a 'build' container

          we select the 'build' container as the default (maybe in the podTemplate definition), so that we don't have to do

          container('build') {
              sh("..")
          }
          

          I think that if we want to keep the default jnlp container, most of the commands need to be executed somewhere else.

          Or is this then the same effort as refactoring the container() step?


          Jesse Glick added a comment -

          would it be possible to somehow select a different container than the jnlp as default for the execution of sh steps?

          No, it is not possible.

          The workaround is to use a pod with a single container (jnlp) whose image contains both a Jenkins agent (and JRE), and whatever other tools you might need.
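
          A minimal sketch of that workaround, assuming a hypothetical custom image (my-registry/agent-with-tools) built on top of jenkins/inbound-agent that also bundles the required build tools:

          // single-container pod: every sh step runs in 'jnlp', so no container() closure is needed
          podTemplate(containers: [
              containerTemplate(name: 'jnlp', image: 'my-registry/agent-with-tools:latest')  // placeholder image
          ]) {
              node(POD_LABEL) {
                  sh 'echo test'
              }
          }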


          Adam Placzek added a comment -

          Hi,
          We have the same problem, and it has even become more frequent recently after Kubernetes client and plugin upgrades.

          Is there a permanent solution planned, or should we be using ONLY the JNLP container? If so, please update the documentation, which still does not recommend overriding JNLP.


          Francisco Aguiar added a comment -

          I'm facing the same issue; I think it is related to the cluster being private. I have two GKE clusters (private and public, same version and same Jenkins chart). In the public one everything runs smoothly, but in the private cluster the job (nothing special, just an sh echo... command) fails randomly. I figured out that it always fails when the pod runs on the same node as Jenkins's pod.

          I thought the issue could be that I need to add a rule to my cluster network, like other services that need access to the control plane through a certain port, but I couldn't find any reference to this...

          Any idea how to solve this?


            Assignee: Unassigned
            Reporter: Yacine (ysmaoui)
            Votes: 3
            Watchers: 19