Jenkins / JENKINS-67167

In a Kubernetes pod, sh steps inside container() are failing sporadically


    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: kubernetes-plugin
    • Labels:
    • Environment:
      Jenkins 2.303.3
      Kubernetes plugin 1.30.6
      Durable Task Plugin: 1.39
      jnlp via jenkins/inbound-agent:4.11-1-alpine-jdk8

      Description

      The issue is reproducible using the attached pipeline: jnlpcontainer_tests.groovy

      Description of the test:

      • running inside a k8s pod, with multiple containers
        • a jnlp container
        • a build container
      • the pipeline starts 3 parallel branches
        • jnlp branch – runs sh inside container('jnlp'){}
        • build branch – runs sh inside container('build'){}  // 'build' is the name of the second container in the pod
        • noContainer branch – runs sh outside any container(){} closure
      • in each of the parallel branches a simple sh call is executed
      • in the jnlp and build branches, sh is called inside a container() closure
        • in these two branches, sh fails sporadically
      • in the noContainer branch, sh is called outside any container() closure
        • not a single failure was observed in this branch across all the runs I started
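
      For reference, a minimal sketch of what the attached jnlpcontainer_tests.groovy presumably looks like. The 'build' image and the per-branch loop are assumptions based on the description above; only the container names and the jnlp image come from this report:

```groovy
// Hypothetical reconstruction of the test pipeline described above.
podTemplate(containers: [
    // jnlp image taken from the Environment field of this issue
    containerTemplate(name: 'jnlp', image: 'jenkins/inbound-agent:4.11-1-alpine-jdk8'),
    // the build image is an assumption; any tool image would do
    containerTemplate(name: 'build', image: 'alpine')
]) {
    node(POD_LABEL) {
        parallel(
            jnlp: {
                for (int i = 0; i < 100; i++) {
                    container('jnlp') { sh 'echo test' }   // fails sporadically
                }
            },
            build: {
                for (int i = 0; i < 100; i++) {
                    container('build') { sh 'echo test' }  // fails sporadically
                }
            },
            noContainer: {
                for (int i = 0; i < 100; i++) {
                    sh 'echo test'                         // no failures observed
                }
            }
        )
    }
}
```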

      Mainly two exceptions were thrown:

      [2021-11-18T10:49:57.920Z] java.io.EOFException
      [2021-11-18T10:49:57.921Z] 	at okio.RealBufferedSource.require(RealBufferedSource.java:61)
      [2021-11-18T10:49:57.921Z] 	at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
      [2021-11-18T10:49:57.921Z] 	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
      [2021-11-18T10:49:57.921Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      [2021-11-18T10:49:57.921Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      [2021-11-18T10:49:57.921Z] 	at java.lang.Thread.run(Thread.java:748)
      [2021-11-18T10:49:57.921Z] ERROR: Process exited immediately after creation. See output below
      [2021-11-18T10:49:57.921Z] Executing sh script inside container jnlp of pod test-multiplecontainers-in-node-5d914e4e-3023-4bf0-845d-2-pcxs5
      [2021-11-18T10:49:57.921Z] 
      Process exited immediately after creation. Check logs above for more details.
      

      and

      [2021-11-18T10:49:58.203Z] java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
      [2021-11-18T10:49:58.205Z] 	at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
      [2021-11-18T10:49:58.205Z] 	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
      [2021-11-18T10:49:58.205Z] 	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
      [2021-11-18T10:49:58.205Z] 	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
      [2021-11-18T10:49:58.205Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      [2021-11-18T10:49:58.205Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      [2021-11-18T10:49:58.205Z] 	at java.lang.Thread.run(Thread.java:748)
      io.fabric8.kubernetes.client.KubernetesClientException: error dialing backend: dial tcp 192.168.3.11:10250: connect: connection refused
      
      • NOTE: the test consists of 100 iterations per branch, all executed in the same agent pod. So if we get a KubernetesClientException with a connection-refused error, retrying on the same container will eventually work again.
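
      Since retrying on the same container eventually works, one possible mitigation (an assumption on my part, not something from the plugin docs) is to wrap the flaky call in the built-in retry() step:

```groovy
// Sketch: retry a sh step that may hit the sporadic WebSocket/kubelet failure.
container('build') {
    retry(3) {          // up to 3 attempts before the branch fails for real
        sh 'echo test'
    }
}
```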

            Activity

            jglick Jesse Glick added a comment -

            Possible duplicate of JENKINS-59652. The implementation of the container step is known to be poor and due for a rewrite.

            ysmaoui Yacine added a comment - - edited

            Hi Jesse Glick, thanks for your reply.

            I don't think that in this particular case the pod was evicted or stopped, or even that the connection to it was lost, and here is why:

            • as you can see in jnlpcontainer_tests.groovy, the 3rd parallel branch executes the same sh call, but not wrapped in a container() closure. This branch did not have a single sh-step failure.
              • without the container() closure, the commands are executed in the jnlp container, the same jnlp container where we get issues if we explicitly select it with container('jnlp')
            • the pod was alive and reachable without issues for whatever was executed outside container()
            • this affects only sh calls

            so this is problematic:

            container('jnlp') {
                sh("echo test")
            }
            

            and this is not:

            // no container(){} closure
            sh("echo test")
            

            while running at the same time on the same pod (and the same container as well).

            This forces us to use one (custom) jnlp container to run the pipeline, which is kind of against the recommendation in the docs (https://github.com/jenkinsci/kubernetes-plugin#configuration): "We do not recommend overriding the jnlp container except under unusual circumstances."

            I am not sure how I can debug this further to identify a possible workaround. Any hints?
            jglick Jesse Glick added a comment -

            I do not know of any workaround beyond avoiding container.

            ysmaoui Yacine added a comment - - edited

            Hi Jesse Glick,

            In a scripted pipeline: until the container() step is fixed/refactored, would it be possible to somehow select a container other than jnlp as the default for the execution of sh steps, so that we don't have to use container()?

            so, if in a pod we have:

            • a 'jnlp' container
            • a 'build' container

            we select the 'build' container as the default (maybe in the podTemplate definition), so that we don't have to do

            container('build') {
                sh("..")
            }
            

            I think if we want to keep the default jnlp container, most of the commands would need to be executed somewhere else.

            Or is this then the same effort as refactoring the container() step?

            jglick Jesse Glick added a comment -

            would it be possible to somehow select a different container than the jnlp as default for the execution of sh steps?

            No, it is not possible.

            The workaround is to use a pod with a single container (jnlp) whose image contains both a Jenkins agent (and JRE), and whatever other tools you might need.
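
            As a sketch, that workaround could look like this. The image name 'my-registry/agent-with-tools' is hypothetical; it stands for an image bundling the inbound agent, a JRE, and whatever build tools you need:

```groovy
// Single-container pod: everything runs in 'jnlp', so no container() step is needed.
podTemplate(containers: [
    containerTemplate(name: 'jnlp', image: 'my-registry/agent-with-tools:latest')
]) {
    node(POD_LABEL) {
        sh 'echo test'   // executes in the jnlp container without container()
    }
}
```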


              People

              Assignee:
              Unassigned
              Reporter:
              ysmaoui Yacine
              Votes:
              0
              Watchers:
              4
