[JENKINS-65602] Durable task pipeline failed at sh initialisation

      I am using Jenkins version: 2.249.1

      Durable Task plugin: 1.35

      We run builds in a Kubernetes farm and our builds are dockerised. After upgrading Jenkins and the durable-task plugin we saw multiple issues raised about "sh" initialisation breaking inside the container, so we applied the suggested solution and set workingDir: "/home/jenkins/agent"; builds succeeded after making that change.
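      For context, this is roughly the kind of pod template change described above; a minimal sketch only, where the container name and image are illustrative and the relevant part is the workingDir field on the build container:

      podTemplate(yaml: """
      apiVersion: v1
      kind: Pod
      spec:
        containers:
        - name: build
          image: ubuntu:20.04
          workingDir: /home/jenkins/agent
          command:
          - sleep
          args:
          - 9999999
      """) {
        node(POD_LABEL) {
          container('build') {
            sh 'ls'
          }
        }
      }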

       

      However, some builds are still failing randomly with the same error:

      [2021-05-10T15:16:33.046Z] [Pipeline] sh
      [2021-05-10T15:22:08.073Z] process apparently never started in /home/jenkins/agent/workspace/CORE-CommitStage@tmp/durable-f6a728e7
      [2021-05-10T15:22:08.087Z] [Pipeline] }

      We had also already enabled the following, as per earlier suggestions:
      -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true

      The issue is not consistently reproducible, but jobs still fail randomly. We are looking for a permanent fix.
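      For anyone reproducing this: the LAUNCH_DIAGNOSTICS property above is read by the durable-task plugin on the controller, so it can be passed as a JVM argument at startup (as shown), or, assuming the flag is still exposed as a public static field in the installed plugin version, flipped at runtime from the script console so that newly launched sh steps pick it up. A minimal sketch:

      // Jenkins script console (Manage Jenkins » Script Console), no restart needed.
      // Assumes LAUNCH_DIAGNOSTICS is still a public static boolean field in this durable-task version.
      org.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS = true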


          Hitesh kumar created issue -
          Hitesh kumar made changes -
          Priority: Minor → Major

          Carroll Chiou added a comment -

          If you already have LAUNCH_DIAGNOSTICS available, can you share a fuller log for when the shell step fails?


          Olle added a comment - edited

          We are experiencing the same issue. In our case, we are executing a shell command that is expected to run for approximately 3 hours. For us, the problem is pretty much persistent. Some single jobs may pass but in 95% of the cases, we fail with the same error ("process apparently never started in ...").

          What we have tried (some of these approaches can be found in earlier reports of this bug):

           - Downgrading plugins (durable and docker-workflow)
           - Downgrading master
           - Checking for empty environment variables
           - Not using double user definitions when running the Docker container
           - Increasing HEARTBEAT_CHECK_INTERVAL
           - Setting alwaysPull to true

          Setting LAUNCH_DIAGNOSTICS does not give us any information when the job exits. We have set up a logger for durable-task (and docker-workflow) but only get entries like the ones presented below:

          org.jenkinsci.plugins.durabletask.BourneShellScript
          launching [nohup, sh, -c, { while [ -d '/home/[redacted]/durable-659c0cf1' -a \! -f '/home/[redacted]/durable-659c0cf1/jenkins-result.txt' ]; do touch '/home/[redacted]/durable-659c0cf1/jenkins-log.txt'; sleep 3; done } & jsc=durable-[redacted]; JENKINS_SERVER_COOKIE=$$jsc 'sh' -xe '/home/[redacted]/durable-659c0cf1/script.sh' > '/home/[redacted]/durable-659c0cf1/jenkins-log.txt' 2>&1; echo $$? > '/home/[redacted]/durable-659c0cf1/jenkins-result.txt.tmp'; mv '/home/[redacted]/durable-659c0cf1/jenkins-result.txt.tmp' '/home/[redacted]/durable-659c0cf1/jenkins-result.txt'; wait]
          org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#6399] already finished, no need to interrupt
          May 20, 2021 2:29:02 PM FINER org.jenkinsci.plugins.workflow.support.concurrent.Timeout
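          A logger like the one described above can also be raised from the script console; a minimal sketch using plain java.util.logging (a "System Log" recorder covering the same packages is still what makes the FINE/FINER output visible in the Jenkins UI):

          import java.util.logging.Level
          import java.util.logging.Logger

          // Raise verbosity for the packages that appear in the excerpt above.
          // Note: keep references to these loggers somewhere (e.g. a log recorder),
          // since java.util.logging loggers can otherwise be garbage collected.
          Logger.getLogger('org.jenkinsci.plugins.durabletask').setLevel(Level.FINER)
          Logger.getLogger('org.jenkinsci.plugins.workflow.steps.durable_task').setLevel(Level.FINER)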
          

           

          To add to the confusion, the exact same job (Dockerfile and Jenkinsfile) works perfectly fine on another master. The differences between these masters are the following:

           

           +----------------+------------------+-----------------+
           |                | Master A         | Master B        |
           +----------------+------------------+-----------------+
           | Plugins        |         Slightly different         |
           |                | (same durable and docker-workflow) |
           +----------------+------------------+-----------------+
           | Host           | VM               | k8s             |
           +----------------+------------------+-----------------+
           | Authentication | LDAP             | Azure           |
           +----------------+------------------+-----------------+
          

           


          Carroll Chiou added a comment -

          ollehu thanks for the feedback. I'm assuming your host is some x86 Linux system; you can try enabling the binary wrapper:
          org.jenkinsci.plugins.durabletask.BourneShellScript.FORCE_BINARY_WRAPPER=true

          Even if it fails, the diagnostic log is a bit more informative.
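          As with LAUNCH_DIAGNOSTICS, this is a JVM system property read by the durable-task plugin on the controller; a sketch of passing both flags at startup (the java invocation itself is illustrative, adjust for however the controller is actually launched):

          java \
            -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true \
            -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.FORCE_BINARY_WRAPPER=true \
            -jar jenkins.war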


          Olle added a comment -

          carroll Thanks for getting back to me.

          FORCE_BINARY_WRAPPER did not really add that much information, at least nothing looking like a stack trace. The log shows the following message repeated about every minute:

          remote transcoding charset: null
          May 21, 2021 9:46:12 PM FINE org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep still running in /home/[redacted]/73_jenkins2-linux-pipeline on [agent]
          May 21, 2021 9:46:27 PM FINER org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [agent] seems to be online so using /home/[redacted]/73_jenkins2-linux-pipeline
          May 21, 2021 9:46:27 PM FINE org.jenkinsci.plugins.durabletask.FileMonitoringTask remote transcoding charset: null
          

          Then, finally we see the following when the job exits (with the error message):

          May 21, 2021 9:46:42 PM FINE org.jenkinsci.plugins.durabletask.FileMonitoringTask remote transcoding charset: null
          May 21, 2021 9:46:42 PM FINE org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep calling close with nl=true
          

          Is this normal? Especially the fact that the first code block is printed roughly every minute.


          Ben added a comment -

          I'm having the same issue. It seems to be related to the container running the shell. For example, with the following Jenkinsfile, the first two stages execute the shell step as expected, but the third exhibits this problem.

          podTemplate(yaml: """
          apiVersion: v1
          kind: Pod
          spec:
            containers:
            - name: golang
              image: golang:1.8.0
              command:
              - sleep
              args:
              - 9999999
            - name: node-default
              image: node:12.14
              command:
              - sleep
              args:
              - 9999999
            - name: node-circle
              image: circleci/node:12.14
              command:
              - sleep
              args:
              - 9999999
          """
            ) {  
          
            node(POD_LABEL) {    
              stage('golang project') {
                container('golang') {
                sh '''
                    ls
                  '''
                }
              }
          
              stage('node project') {
                container('node-default') {
                sh '''
                    ls
                  '''
                }
              }
          
              stage('node project') {
                container('node-circle') {
                sh '''
                    ls
                  '''
                }
              }
            }
          }
          

           

          Version Info:

          • Jenkins: v2.7.2
          • Durable Task: v1.35
          • Kubernetes: v1.29.2


          Carroll Chiou added a comment -

          ollehu can you confirm that the launch script was actually launching the binary and not the script wrapper? You should NOT see

          [nohup, sh, -c, { while [ -d 
          


          Jonathon Lamon added a comment -

          I believe there is a parsing error in the shell script processing when running in a container. I had previously submitted JENKINS-65759, where an apostrophe in the job name causes a similar issue. Our users also see similar issues when using the docker.withImage(...).inside {} and docker.withImage(..).withRun(...) {} syntaxes, but not with docker.withImage(...).withRun { c => }.
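          For reference, a sketch of the three invocation patterns being compared above, written against the standard Docker Pipeline docker.image(...) step (presumably what withImage refers to here); the image name and run arguments are illustrative:

          node {
            def img = docker.image('node:12.14')

            // Run the pipeline body inside the container (the first form reported as problematic above).
            img.inside {
              sh 'ls'
            }

            // Start the container with run arguments, run steps against it, then stop it
            // (the second form reported as problematic above).
            img.withRun('-u root') { c ->
              sh "docker logs ${c.id}"
            }

            // withRun taking only the closure (the form reported as working above).
            img.withRun { c ->
              sh "docker logs ${c.id}"
            }
          }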


          Carroll Chiou added a comment -

          thanks for the tip jonl_percsol! That's very interesting behavior, to say the least... I'll take a look.


            Assignee: Unassigned
            Reporter: Hitesh kumar
            Votes: 4
            Watchers: 9