Type: Bug
Resolution: Unresolved
Priority: Major
Labels: None
Jenkins version: 2.249.1
Durable task: 1.35
I am using Jenkins version: 2.249.1
Durable task: 1.35
We run builds in a Kubernetes farm and our builds are dockerised. After the upgrade of Jenkins and the durable-task plugin, we saw the multiple issues reported where "sh" initialisation inside the container breaks, so we applied the suggested fix and set workingDir: "/home/jenkins/agent" (roughly as sketched below). Builds succeed after making that change.
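A minimal sketch of the kind of pod template change described above, assuming the Kubernetes plugin's yaml pod definition (the container name and image are illustrative; workingDir is the relevant part):

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: jnlp
    image: jenkins/inbound-agent    # illustrative; use your actual agent image
    workingDir: /home/jenkins/agent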
However, some of the builds are still failing randomly with the same error:
[2021-05-10T15:16:33.046Z] [Pipeline] sh
[2021-05-10T15:22:08.073Z] process apparently never started in /home/jenkins/agent/workspace/CORE-CommitStage@tmp/durable-f6a728e7
[2021-05-10T15:22:08.087Z] [Pipeline] }
We had also already enabled the following, as per the suggestions:
-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true \
The issue is not persistent, but jobs still fail randomly. We are looking for a permanent fix for this.
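For context, a property like this is typically passed to the Jenkins controller JVM at startup; a minimal sketch, assuming Jenkins is launched directly with java (adapt to your init system or container entrypoint):

java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true \
     -jar jenkins.war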
Relates to: JENKINS-64608 Detection "running inside container" fails with cgroup namespace "private" for docker daemon (Resolved)
[JENKINS-65602] Durable task pipeline failed at sh initialisation
We are experiencing the same issue. In our case, we are executing a shell command that is expected to run for approximately 3 hours. For us, the problem is pretty much persistent. Some single jobs may pass but in 95% of the cases, we fail with the same error ("process apparently never started in ...").
What we have tried (some of these approaches can be found in earlier reports of this bug):
- Downgrading plugins (durable and docker-workflow)
- Downgrading master
- Checking for empty environment variables
- Not using double user definitions when running the Docker container
- Increasing HEARTBEAT_CHECK_INTERVAL
- Setting alwaysPull to true
Setting LAUNCH_DIAGNOSTICS does not give us any information when the job exits. We have set up a logger for durable-task (and docker-workflow) but only get output like the lines presented below:
org.jenkinsci.plugins.durabletask.BourneShellScript launching [nohup, sh, -c, { while [ -d '/home/[redacted]/durable-659c0cf1' -a \! -f '/home/[redacted]/durable-659c0cf1/jenkins-result.txt' ]; do touch '/home/[redacted]/durable-659c0cf1/jenkins-log.txt'; sleep 3; done } & jsc=durable-[redacted]; JENKINS_SERVER_COOKIE=$$jsc 'sh' -xe '/home/[redacted]/durable-659c0cf1/script.sh' > '/home/[redacted]/durable-659c0cf1/jenkins-log.txt' 2>&1; echo $$? > '/home/[redacted]/durable-659c0cf1/jenkins-result.txt.tmp'; mv '/home/[redacted]/durable-659c0cf1/jenkins-result.txt.tmp' '/home/[redacted]/durable-659c0cf1/jenkins-result.txt'; wait]
org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#6399] already finished, no need to interrupt
May 20, 2021 2:29:02 PM FINER org.jenkinsci.plugins.workflow.support.concurrent.Timeout
To add to the confusion, the exact same job (Dockerfile and Jenkinsfile) works perfectly fine on another master. The differences between these masters are the following:
+----------------+-----------------------------+--------------------------+
|                | Master A                    | Master B                 |
+----------------+-----------------------------+--------------------------+
| Plugins        | Slightly different (same durable and docker-workflow)  |
+----------------+-----------------------------+--------------------------+
| Host           | VM                          | k8s                      |
+----------------+-----------------------------+--------------------------+
| Authentication | LDAP                        | Azure                    |
+----------------+-----------------------------+--------------------------+
ollehu thanks for the feedback. I'm assuming your host is some x86 Linux system; you can try enabling the binary wrapper:
org.jenkinsci.plugins.durabletask.BourneShellScript.FORCE_BINARY_WRAPPER=true
Even if it fails, the diagnostic log is a bit more informative.
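For completeness, this flag is passed the same way as the diagnostics property; a rough sketch, again assuming the controller JVM is started directly with java:

java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.FORCE_BINARY_WRAPPER=true \
     -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true \
     -jar jenkins.war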
carroll Thanks for getting back to me.
FORCE_BINARY_WRAPPER did not really add that much information, at least nothing looking like a stack trace. The log shows the following message repeated about every minute:
remote transcoding charset: null
May 21, 2021 9:46:12 PM FINE org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep
still running in /home/[redacted]/73_jenkins2-linux-pipeline on [agent]
May 21, 2021 9:46:27 PM FINER org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep
[agent] seems to be online so using /home/[redacted]/73_jenkins2-linux-pipeline
May 21, 2021 9:46:27 PM FINE org.jenkinsci.plugins.durabletask.FileMonitoringTask
remote transcoding charset: null
Then, finally we see the following when the job exits (with the error message):
May 21, 2021 9:46:42 PM FINE org.jenkinsci.plugins.durabletask.FileMonitoringTask
remote transcoding charset: null
May 21, 2021 9:46:42 PM FINE org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep
calling close with nl=true
Is this normal? Especially the fact that the first block is printed every minute.
I'm having the same issue. It seems to be related to the container running the shell. For example, with the following Jenkinsfile, the first two stages execute the shell step as expected, but the third exhibits this problem.
podTemplate(yaml: """
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: golang
    image: golang:1.8.0
    command:
    - sleep
    args:
    - 9999999
  - name: node-default
    image: node:12.14
    command:
    - sleep
    args:
    - 9999999
  - name: node-circle
    image: circleci/node:12.14
    command:
    - sleep
    args:
    - 9999999
"""
) {
  node(POD_LABEL) {
    stage('golang project') {
      container('golang') {
        sh '''
        ls
        '''
      }
    }
    stage('node project') {
      container('node-default') {
        sh '''
        ls
        '''
      }
    }
    stage('node project') {
      container('node-circle') {
        sh '''
        ls
        '''
      }
    }
  }
}
Version Info:
- Jenkins: v2.7.2
- Durable Task: v1.35
- Kubernetes: v1.29.2
ollehu can you confirm that the launch script was actually launching the binary and not the script wrapper? You should NOT see
[nohup, sh, -c, { while [ -d
I believe there is a parsing error in the shell script processing when running in a container. I had previously submitted JENKINS-65759, where an apostrophe in the job name causes a similar issue. Our users also see similar issues when using the docker.image(...).inside {} and docker.image(...).withRun(...) {} syntaxes, but not with docker.image(...).withRun { c -> ... }.
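For reference, the invocation styles being compared look roughly like the following (a minimal sketch; the image name and commands are illustrative, not from the original report):

docker.image('node:12.14').inside {
    sh 'ls'                    // step body runs inside the container
}
docker.image('node:12.14').withRun('-v /tmp:/tmp') {
    sh 'ls'                    // container started with extra run args; closure ignores the container object
}
docker.image('node:12.14').withRun { c ->
    sh "docker logs ${c.id}"   // closure receives the running container object
}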
thanks for the tip jonl_percsol! That's very interesting behavior to say the least...I'll take a look.
carroll Have there been any changes made related to this problem? For us, the problem seems to have disappeared, at least for the affected pipeline.
We are running a Jenkins Agent in a Docker Container on Debian and we just hit this after upgrading to Debian 11.
After enabling LAUNCH_DIAGNOSTICS as described, the following errors appeared:
09:50:30 sh: 1: cannot create /home/jenkins/agent/workspace/<job-name>@tmp/durable-fe1d0f21/jenkins-log.txt: Directory nonexistent
09:50:30 sh: 1: cannot create /home/jenkins/agent/workspace/<job-name>@tmp/durable-fe1d0f21/jenkins-result.txt.tmp: Directory nonexistent
09:50:30 mv: cannot stat '/home/jenkins/agent/workspace/<job-name>@tmp/durable-fe1d0f21/jenkins-result.txt.tmp': No such file or directory
Then I noticed that these directories were created on the Docker host instead of inside the container...
The actual issue was the following:
<docker-hostname> does not seem to be running inside a container
Jenkins failed to detect that the Docker agent was running in a container. I guess that is why it created the directories on the Docker host instead of inside the container.
This is caused by Debian 11 changing to cgroup v2 by default, which breaks the container detection. Looking at the code in docker-workflow-plugin, it tries to get the container id from /proc/self/cgroup, but on cgroup v2 this just returns "0::/" (see the sketch below).
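To illustrate what the detection code reads (the container id below is made up):

# cgroup v1: lines contain the container's cgroup path, which ends with its id
$ cat /proc/self/cgroup
12:pids:/docker/3f8a1c2b4d5e...
# cgroup v2: a single unified entry with no container id at all
$ cat /proc/self/cgroup
0::/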
I worked around this by booting Debian with "systemd.unified_cgroup_hierarchy=false" (rough sketch below). Weirdly enough, it was also necessary to rebuild the Jenkins agent container to fix it completely (the issue also didn't appear immediately after upgrading to Debian 11, but only after re-creating the agent container).
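On a stock Debian 11 install that boils down to adding the parameter to the kernel command line; a rough sketch, assuming GRUB is the bootloader:

# append the flag to GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the config and reboot
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&systemd.unified_cgroup_hierarchy=false /' /etc/default/grub
sudo update-grub
sudo reboot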
See also this related issue: Container detection fails on cgroup v2 devices · Issue #1592 · GoogleContainerTools/kaniko (github.com)
In this case, they seem to have fixed this by detecting whether /.dockerenv exists. But as far as I can see, this doesn't allow access to the container id (I don't know if Jenkins actually needs it though, currently it's being logged at least).
edit: there already is a bug (and another workaround) for this specific issue JENKINS-64608 Detection "running inside container" fails with cgroup namespace "private" for docker daemon - Jenkins Jira
Thanks for the data mus65! One of the many reasons the docker-workflow-plugin is so challenging. A lot of the time the durable-task-plugin failure is a symptom of the underlying issue, but with limited error output, it's really hard to tell. I wonder if this issue is present when you don't use the docker-workflow-plugin, i.e. just running the docker commands through shell?
carroll I would expect docker commands to fail as well for most use cases. I had a closer look on why this fails exactly:
When the container for the pipeline is started with "docker run", the docker workflow plugin usually passes the volume of the agent with "--volumes-from=<agent container id>", so the pipeline container has access to the workspace (which was checked out inside the agent container). But it only does this when it detects that the agent itself is running in a container. Since the container detection fails, the volume with the workspace is not passed to the pipeline container and the aforementioned issues happen because the workspace doesn't exist.
So in theory, using docker commands in the shell only works if you either use "skipDefaultCheckout()" and do the git clone inside the stage yourself, or pass "--volumes-from=<agent container id>" yourself with the "args" parameter in the declarative pipeline (see the sketch below).
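A minimal sketch of that second variant in a declarative pipeline (the image is illustrative, and <agent container id> must be replaced with the id of the container the agent runs in):

pipeline {
    agent {
        docker {
            image 'node:12.14'
            // pass the agent's volumes so the workspace exists inside this container
            args '--volumes-from=<agent container id>'
        }
    }
    stages {
        stage('build') {
            steps {
                sh 'ls'
            }
        }
    }
}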
By the way: the workaround from JENKINS-64608 to run the agent container with "--cgroupns host" also works fine for me and is much better than reverting to cgroup v1 for the whole host system.
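That workaround amounts to starting the agent container in the host cgroup namespace; a rough sketch (the image name and the usual agent connection arguments are illustrative/omitted):

docker run --cgroupns host \
    -v /var/run/docker.sock:/var/run/docker.sock \
    jenkins/inbound-agent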
Recommend uninstalling the docker-workflow plugin and running docker CLI commands directly.
Unless there is something broken here which does not involve that plugin, I think this could be closed as a duplicate.
If you already have LAUNCH_DIAGNOSTICS available, can you share a fuller log for when the shell step fails?