[JENKINS-65602] Durable task pipeline failed at sh initialisation

      I am using Jenkins version: 2.249.1

      Durable Task plugin: 1.35

      We run builds in a Kubernetes farm and our builds are dockerised. After upgrading Jenkins and the Durable Task plugin we saw multiple reported issues where "sh" initialisation inside the container breaks, so we applied the suggested workaround of setting workingDir: "/home/jenkins/agent", and builds succeeded after making that change.
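
      For reference, a minimal sketch of that pod template change, assuming a scripted pipeline using the kubernetes plugin; the container name and image are placeholders, not the real configuration:

      podTemplate(yaml: '''
      apiVersion: v1
      kind: Pod
      spec:
        containers:
        - name: build                      # placeholder container name
          image: node:12.14                # placeholder image
          workingDir: /home/jenkins/agent  # match the agent's default working directory
          command:
          - sleep
          args:
          - 9999999
      ''') {
        node(POD_LABEL) {
          container('build') {
            sh 'ls'
          }
        }
      }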

       

      However, some builds are still failing randomly with the same error:

      [2021-05-10T15:16:33.046Z] [Pipeline] sh
      [2021-05-10T15:22:08.073Z] process apparently never started in /home/jenkins/agent/workspace/CORE-CommitStage@tmp/durable-f6a728e7
      [2021-05-10T15:22:08.087Z] [Pipeline] }

      We had also already enabled launch diagnostics as suggested:
      -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true

      The issue is not consistently reproducible, but jobs still fail randomly. We are looking for a permanent fix for this issue.
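
      For completeness, a hedged sketch of how this property can also be toggled at runtime from the script console (Manage Jenkins -> Script Console). System.setProperty itself is standard Java, but depending on when the plugin reads the property a controller restart with the -D option may still be required:

      // Script console sketch; the property name is the one quoted above.
      // Only lasts until the controller restarts.
      System.setProperty('org.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS', 'true')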


          Olle added a comment -

          carroll Thanks for getting back to me.

          FORCE_BINARY_WRAPPER did not really add that much information, at least nothing looking like a stacktrace. The log shows the following message repeated about every minute

          remote transcoding charset: null
          May 21, 2021 9:46:12 PM FINE org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep still running in /home/[redacted]/73_jenkins2-linux-pipeline on [agent]
          May 21, 2021 9:46:27 PM FINER org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [agent] seems to be online so using /home/[redacted]/73_jenkins2-linux-pipeline
          May 21, 2021 9:46:27 PM FINE org.jenkinsci.plugins.durabletask.FileMonitoringTask remote transcoding charset: null
          

          Then, finally we see the following when the job exits (with the error message):

          May 21, 2021 9:46:42 PM FINE org.jenkinsci.plugins.durabletask.FileMonitoringTask remote transcoding charset: null
          May 21, 2021 9:46:42 PM FINE org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep calling close with nl=true
          

          Is this normal? Especially the fact that the first code block is printed every minute.


          Ben added a comment -

          I'm having the same issue. It seems to be related to the container running the shell. For example, with the following Jenkinsfile, the first two containers execute the shell step as expected, but the third exhibits this problem.

          podTemplate(yaml: """
          apiVersion: v1
          kind: Pod
          spec:
            containers:
            - name: golang
              image: golang:1.8.0
              command:
              - sleep
              args:
              - 9999999
            - name: node-default
              image: node:12.14
              command:
              - sleep
              args:
              - 9999999
            - name: node-circle
              image: circleci/node:12.14
              command:
              - sleep
              args:
              - 9999999
          """) {

            node(POD_LABEL) {
              stage('golang project') {
                container('golang') {
                  sh '''
                    ls
                  '''
                }
              }

              stage('node-default project') {
                container('node-default') {
                  sh '''
                    ls
                  '''
                }
              }

              stage('node-circle project') {
                container('node-circle') {
                  sh '''
                    ls
                  '''
                }
              }
            }
          }
          

           

          Version Info:

          • Jenkins: v2.7.2
          • Durable Task: v1.35
          • Kubernetes: v1.29.2
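
          One difference worth noting, though it is not confirmed anywhere in this thread: unlike golang and node, the circleci/* convenience images run as a non-root user by default, so the durable-task control directory under the shared workspace may not be writable from that container. A sketch of aligning the UID via a pod-level securityContext, assuming UID 1000 matches the agent user (both the value and the need for fsGroup are assumptions):

          podTemplate(yaml: """
          apiVersion: v1
          kind: Pod
          spec:
            securityContext:
              runAsUser: 1000   # assumption: uid of the default jnlp/agent user
              fsGroup: 1000     # assumption: make mounted volumes writable for that uid
            containers:
            - name: node-circle
              image: circleci/node:12.14
              command:
              - sleep
              args:
              - 9999999
          """) {
            node(POD_LABEL) {
              container('node-circle') {
                sh 'id && ls'
              }
            }
          }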


          Carroll Chiou added a comment -

          ollehu can you confirm that the launch script was actually launching the binary and not the script wrapper? You should NOT see

          [nohup, sh, -c, { while [ -d 
          


          Jonathon Lamon added a comment -

          I believe there is a parsing error in the shell script processing when running in a container. I had previously submitted JENKINS-65759, where an apostrophe in the job name causes a similar issue. Our users also see similar issues when using the docker.image(...).inside {} and docker.image(...).withRun(...) {} syntaxes, but not with docker.image(...).withRun { c -> ... }.
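
          For reference, the call styles being compared look roughly like this (images and commands are placeholders):

          node {
            // Runs the body inside the container; the workspace is expected to be mounted into it.
            docker.image('node:12.14').inside {
              sh 'npm --version'
            }

            // Starts a detached side container for the duration of the body; the closure receives the container.
            docker.image('mysql:5.7').withRun('-e MYSQL_ALLOW_EMPTY_PASSWORD=yes') { c ->
              sh "docker logs ${c.id}"
            }
          }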


          Carroll Chiou added a comment -

          Thanks for the tip jonl_percsol! That's very interesting behavior, to say the least... I'll take a look.


          Olle added a comment - - edited

          carroll Have any changes been made related to this problem? For us, the problem seems to have disappeared, at least for the affected pipeline.


          m t added a comment - - edited

          We are running a Jenkins Agent in a Docker Container on Debian and we just hit this after upgrading to Debian 11.

          After enabling LAUNCH_DIAGNOSTICS as described, the following errors appeared:

          09:50:30 sh: 1: cannot create /home/jenkins/agent/workspace/<job-name>@tmp/durable-fe1d0f21/jenkins-log.txt: Directory nonexistent
          09:50:30 sh: 1: cannot create /home/jenkins/agent/workspace/<job-name>@tmp/durable-fe1d0f21/jenkins-result.txt.tmp: Directory nonexistent
          09:50:30 mv: cannot stat '/home/jenkins/agent/workspace/<job-name>@tmp/durable-fe1d0f21/jenkins-result.txt.tmp': No such file or directory

          Then I noticed that these directories were created on the Docker host instead of inside the container...

          The actual issue was the following:

          <docker-hostname> does not seem to be running inside a container

          Jenkins failed to detect that the Docker agent was running in a container. I guess that is why it created the directories on the Docker host instead of inside the container.

          This is caused by Debian 11 switching to cgroup v2 by default, which breaks the container detection. Looking at the code in docker-workflow-plugin, it tries to get the container id from /proc/self/cgroup, but on cgroup v2 this just returns "0::/".

          I worked around this by booting Debian with "systemd.unified_cgroup_hierarchy=false". Weirdly enough, it's also necessary to rebuild the container of the jenkins agent to fix it completely (the issue also didn't appear immediately after upgrading to Debian 11, but only after re-creating the agent container).

          See also this related issue: "Container detection fails on cgroup v2 devices" (GoogleContainerTools/kaniko issue #1592 on GitHub).

          In this case, they seem to have fixed this by detecting whether /.dockerenv exists. But as far as I can see, this doesn't allow access to the container id (I don't know if Jenkins actually needs it though, currently it's being logged at least).
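
          A quick way to see what the detection logic has to work with on a given agent; the node label is a placeholder:

          node('debian11-docker-agent') {  // placeholder label for the affected agent
            // On cgroup v2 this prints only "0::/", so no container id can be parsed from it.
            sh 'cat /proc/self/cgroup'
            // The marker file kaniko checks for; present inside Docker containers.
            sh 'ls -l /.dockerenv || true'
          }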

           

          edit: there is already a bug (and another workaround) for this specific issue: JENKINS-64608, "Detection 'running inside container' fails with cgroup namespace 'private' for docker daemon".


          Carroll Chiou added a comment -

          Thanks for the data mus65! One of the many reasons the docker-workflow-plugin is so challenging. A lot of the time a durable-task-plugin failure is a symptom of an underlying issue, but with limited error output it's really hard to tell. I wonder if this issue is present when you don't use the docker-workflow-plugin, i.e. just run the docker commands through shell?


          m t added a comment -

          carroll I would expect docker commands to fail as well for most use cases. I had a closer look at why this fails exactly:

          When the container for the pipeline is started with "docker run", the docker workflow plugin usually passes the volume of the agent with "--volumes-from=<agent container id>", so the pipeline container has access to the workspace (which was checked out inside the agent container). But it only does this when it detects that the agent itself is running in a container. Since the container detection fails, the volume with the workspace is not passed to the pipeline container and the aforementioned issues happen because the workspace doesn't exist.

          So in theory, using docker commands on the shell only works if you either use "skipDefaultCheckout()" and do the git clone inside the stage yourself, or you pass "--volumes-from=<agent container id>" yourself with the "args" parameter in the declarative pipeline (see the sketch below).
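
          A rough sketch of that second variant in declarative form; the image is a placeholder and <agent container id> is left as a placeholder exactly as above, since it depends on how the agent container was started:

          pipeline {
            agent {
              docker {
                image 'node:12.14'                           // placeholder image
                args  '--volumes-from=<agent container id>'  // mount the agent's workspace volume manually
              }
            }
            stages {
              stage('build') {
                steps {
                  sh 'ls "$WORKSPACE"'
                }
              }
            }
          }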

          By the way: the workaround from JENKINS-64608 of running the agent container with "--cgroupns host" also works fine for me, and is much better than reverting to cgroup v1 for the whole host system.


          Jesse Glick added a comment -

          Recommend uninstalling the docker-workflow plugin and running docker CLI commands directly.

          Unless there is something broken here which does not involve that plugin, I think this could be closed as a duplicate.
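
          For reference, a minimal sketch of driving docker from a plain shell step instead of through the plugin (the image tag and commands are placeholders):

          node {
            // Plain docker CLI instead of docker.image(...).inside / withRun
            sh '''
              docker build -t myapp:build .
              docker run --rm -v "$WORKSPACE":/src -w /src myapp:build ls
            '''
          }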


            Assignee: Unassigned
            Reporter: Hitesh kumar
            Votes: 4
            Watchers: 9