Type: Bug
Resolution: Unresolved
Priority: Major
Labels: None
We have a pipeline that creates and launches jobs in parallel inside Docker containers, with the workload spread across multiple machines. When the pipeline below launches 0-5 jobs in parallel, they complete successfully 100% of the time, but when it launches 35 jobs, 1 or 2 of them fail 100% of the time AFTER the workload itself completes successfully (the workload is marked by "python job runs here" in the pipeline below).
The error is always the same: the Docker plugin fails to stop a container. The Docker logs show that the container exited with code 137, meaning Docker was eventually able to stop the container with kill -9 (SIGKILL).
pipeline {
    agent { node { label 'master' } }
    stages {
        stage('prev stage running docker containers') { /* elided */ }
        stage('problematic stage') {
            steps {
                script {
                    unstash name: 'playbook'
                    def playbook = readJSON file: 'playbook.json'
                    def python_jobs = [:]
                    int counter = 0
                    playbook.each { job ->
                        python_jobs["worker ${counter++}"] = {
                            node(label: 'label') {
                                ws(dir: 'workspace/python') {
                                    script {
                                        docker.withRegistry(env.PROJECT_DOCKER_REGISTRY, env.PROJECT_DOCKER_REGISTRY_CREDENTIAL_ID) {
                                            docker.image(env.PROJECT_DOCKER_IMAGE).inside('-e http_proxy -e https_proxy -e no_proxy') {
                                                // python job runs here
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                    python_jobs.failFast = false
                    parallel python_jobs
                }
            }
        }
    }
}
Found unhandled java.io.IOException exception:
Failed to kill container 'fd4059a173c0bbf107e9231194747ecfc28595f9579ecbd77b82209cf5b219eb'.
    org.jenkinsci.plugins.docker.workflow.client.DockerClient.stop(DockerClient.java:187)
    org.jenkinsci.plugins.docker.workflow.WithContainerStep.destroy(WithContainerStep.java:111)
    org.jenkinsci.plugins.docker.workflow.WithContainerStep$Callback.finished(WithContainerStep.java:415)
    org.jenkinsci.plugins.workflow.steps.BodyExecutionCallback$TailCall.onSuccess(BodyExecutionCallback.java:119)
    org.jenkinsci.plugins.workflow.cps.CpsBodyExecution$SuccessAdapter.receive(CpsBodyExecution.java:375)
    com.cloudbees.groovy.cps.Outcome.resumeFrom(Outcome.java:70)
    com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:144)
    org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:17)
    org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:49)
    org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:180)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:423)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:331)
    org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:295)
    org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$wrap$4(CpsVmExecutorService.java:136)
    java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
    jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
    jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
    java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:53)
    org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:50)
    org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
    org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
    org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$categoryThreadFactory$0(CpsVmExecutorService.java:50)
    java.base/java.lang.Thread.run(Unknown Source)
[JENKINS-73567] docker-workflow-plugin Failed to kill container
Hello, we're also experiencing this behavior when executing a Python-based script in Docker as part of a pipeline. We have been investigating it for two months.
It's really unclear where the root cause lies: on the Docker side or on the Jenkins side?
The final error we found is in the Docker logs; it is the reason for this Jenkins exception:
msg="Container failed to exit within 10s of kill - trying direct SIGKILL" container=bcca3c1624e019cc7950a344924c6bda9075bc3d5196ff232f1da1bdbb6f4391 error="context deadline exceeded"
The error on the Jenkins side is: java.io.IOException: Failed to kill container '3c2590bf8920936ef88515294d88a9b2e315fcaebe9bf4a7181011cfe21181ed'.
We also ran ps -aux as the last step inside the container, and there were no differences in its output between a failed job and a successful one.
As a workaround, we tried to override some Docker options in order to extend the maximum stop timeout, but it did not help:
docker {
    image 'xxxxxxxxxxxxxxxxxxxxxxxxxx_job_launcher:2.9.0'
    // adding extra arguments here, in order to extend the stop time for the LAVA test container, because of sporadic issues
    args '--stop-timeout 360'
}
So far we suspect the way Jenkins does this wrapping (with a cat command) to manage/start/stop the Docker container.
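For context, what docker.image(...).inside { ... } does is roughly equivalent to the following sketch (simplified; the real plugin also passes volume mounts, workdir, user and environment flags, and the exact stop timeout may differ):

node('label') {
    def img = 'alpine:3.19'   // placeholder image, for illustration only
    // 1. start a long-lived container whose main process is `cat`, so it just sits and waits
    def cid = sh(script: "docker run -d -t ${img} cat", returnStdout: true).trim()
    try {
        // 2. each step of the body is executed inside that container via `docker exec`
        sh "docker exec ${cid} sh -c 'echo the python job would run here'"
    } finally {
        // 3. teardown: `docker stop` with a short timeout, then a forced remove; when that
        //    stop call fails or times out, the plugin throws
        //    java.io.IOException: Failed to kill container '...'
        sh "docker stop --time=1 ${cid}"
        sh "docker rm -f ${cid}"
    }
}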
Also, it more or less correlates with the load on the build servers, i.e. the number of running containers. We utilize our servers quite heavily on the weekend, and this error happens mostly during weekend processing.
We are able to reproduce the bug. It's due to very high IO load, so it seems to be Docker-related and to have nothing to do with the plugin.
For some reason 8 parallel unzips on a 64-core machine cripple the Docker daemon, and we're unable to start or stop a container even when done by hand in the terminal. emb, tdx_automation, could you please try to reproduce? Create 8 different 5 GB files, zip them, then try to unzip all 8 in parallel and try to `docker run --rm -it <whatever image> /bin/sh`. Or try to start a container, leave it sitting there, start the unzipping, and try to `docker stop` the container (a pipeline-form sketch of this repro is at the end of this comment).
Both of these operations hung until the unzipping completed, plus several seconds.
What's strange is that we're unable to reproduce this on laptops.
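For convenience, here is roughly the same repro as a scripted-pipeline sketch (the node label, file sizes and the alpine image are placeholders, not from the original setup):

node('build-server') {   // placeholder label
    // one-time setup: 8 archives of ~5 GB of random data each
    sh '''
        for i in $(seq 1 8); do
            [ -f big$i.zip ] || { dd if=/dev/urandom of=big$i.bin bs=1M count=5120 && zip -q big$i.zip big$i.bin; }
        done
    '''
    // start a throwaway container and leave it sitting idle (kept alive by `cat`)
    def cid = sh(script: 'docker run -d -t alpine:3.19 cat', returnStdout: true).trim()
    // kick off the 8 unzips in the background to generate heavy disk IO,
    // then time how long `docker stop` takes while the host is under that load
    sh """
        for i in \$(seq 1 8); do unzip -o big\$i.zip -d out\$i & done
        time docker stop ${cid}
        wait
        docker rm -f ${cid} || true
    """
}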
Hi bence, yep, we will try to reproduce it on our servers in the way you're describing.
>>it seems to be docker related and nothing to do with the plugin.
OK, if so, give us more control over how we can affect the container's behavior. If the host machine is heavily loaded, it's fine for us to wait until the load normalizes, OR to set a maximum CPU for the container we are spinning up, OR to override some of the arguments the plugin passes to the Docker daemon, or even the whole entrypoint of the container.
Right now we are just launching this 'cat' command, which wraps all the user processes.
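For what it's worth, at least the CPU and memory limits can already be passed as extra docker run flags through .inside() (or args in a declarative docker agent); a minimal sketch, with arbitrary example values:

// Sketch only: extra `docker run` flags passed through .inside(); the values are arbitrary examples
docker.image(env.PROJECT_DOCKER_IMAGE).inside('-e http_proxy -e https_proxy -e no_proxy --cpus 4 --memory 8g') {
    // python job runs here
}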
Update: we've changed the filesystem from xfs to ext4 and the problem seems to be gone. We'll roll out this change to each machine, and I'll keep you updated on the stability of the cluster. Out of curiosity, what filesystem are you using?
>>Out of curiosity, what file system are you using?
In our case it's ext4 on a local hard drive, and the failed container uses it.
Also, generally on the builder machines we use some shared locations, mapped as local drives, with the nfs4 filesystem.
Hi sbence92, emb. We were able to reproduce the issue on one of our builders.
Preconditions:
- We are unzipping 10 files of ~3 GB each in parallel on the server
- We have the Docker container 'yeasy/simple-web:latest' with a bin/sh session opened
- We execute docker stop on this container
- Everything is executed on ext4 on a local filesystem
Below are the logs we have from the Docker daemon, with my comments highlighted by ******
************
unzip of 8 files in parallel is running
**************
Aug 20 17:25:17 cibuilder1 dockerd[918901]: time="2024-08-20T17:25:17.816448768Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=6169d597f2936838c7baa52e4b36d9f24906186ca4b9ac34b554eec431e8dc2a
Aug 20 17:25:17 cibuilder1 systemd[1]: docker-6169d597f2936838c7baa52e4b36d9f24906186ca4b9ac34b554eec431e8dc2a.scope: Deactivated successfully.
Aug 20 17:25:17 cibuilder1 dockerd[918901]: time="2024-08-20T17:25:17.989691349Z" level=info msg="ignoring event" container=6169d597f2936838c7baa52e4b36d9f24906186ca4b9ac34b554eec431e8dc2a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
************
I have an active bin/bash session running inside the container that I want to stop; maybe this warning relates to that
**************
Aug 20 17:25:18 cibuilder1 dockerd[918901]: time="2024-08-20T17:25:18.006435113Z" level=warning msg="failed to close stdin: NotFound: task 6169d597f2936838c7baa52e4b36d9f24906186ca4b9ac34b554eec431e8dc2a not found: not found"
************
The same error message, >>context deadline exceeded, which leads to the failure in the Jenkins plugin. The user session in the container is not responding. The whoami command inside the container does not return any result. The docker ps output still shows this container as running
**************
Aug 20 17:25:27 cibuilder1 dockerd[918901]: time="2024-08-20T17:25:27.977545681Z" level=warning msg="Container failed to exit within 10s of kill - trying direct SIGKILL" container=6169d597f2936838c7baa52e4b36d9f24906186ca4b9ac34b554eec431e8dc2a error="context deadline exceeded"
************
Three minutes later the container is finally stopped (!), and we have these records in the log
**************
Aug 20 17:28:03 cibuilder1 kernel: [12204984.243829] INFO: task dockerd:1651478 blocked for more than 120 seconds.
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245425] task:dockerd state stack:0 pid:1651478 ppid:1 flags:0x00004002
My summary: as we can see, after some period of glitches and slowdown the Docker daemon is finally able to stop this container; on the Jenkins side, however, we end up with the reported Java exception and a failed pipeline as a result.
sbence92 Can you please file a feature request for smoother interaction with the Docker daemon, so that such scenarios are covered without a Java-based Jenkins exception? Thanks
Updated
Added the full log output from when the Docker container is stopped:
Aug 20 17:28:03 cibuilder1 kernel: [12204984.243829] INFO: task dockerd:1651478 blocked for more than 120 seconds.
Aug 20 17:28:03 cibuilder1 kernel: [12204984.244406] Tainted: G W I 6.1.0-18-amd64 #1 Debian 6.1.76-1
Aug 20 17:28:03 cibuilder1 kernel: [12204984.244920] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245425] task:dockerd state stack:0 pid:1651478 ppid:1 flags:0x00004002
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245436] Call Trace:
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245440] <TASK>
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245447] __schedule+0x34d/0x9e0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245463] schedule+0x5a/0xd0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245470] wb_wait_for_completion+0x82/0xb0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245482] ? cpuusage_read+0x10/0x10
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245490] sync_inodes_sb+0xda/0x2b0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245495] ? inode_to_bdi+0x34/0x50
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245505] ? filemap_fdatawrite_wbc+0x19/0x80
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245513] ? __filemap_fdatawrite_range+0x58/0x80
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245520] sync_filesystem+0x60/0x90
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245531] ovl_sync_fs+0x56/0x90 [overlay]
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245557] sync_filesystem+0x7a/0x90
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245563] generic_shutdown_super+0x22/0x130
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245573] kill_anon_super+0x14/0x30
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245580] deactivate_locked_super+0x2f/0xa0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245588] cleanup_mnt+0xbd/0x150
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245597] task_work_run+0x59/0x90
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245604] exit_to_user_mode_prepare+0x1dd/0x1e0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245611] syscall_exit_to_user_mode+0x17/0x40
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245619] do_syscall_64+0x67/0xc0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245625] ? exit_to_user_mode_prepare+0x40/0x1e0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245630] ? syscall_exit_to_user_mode+0x27/0x40
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245636] ? do_syscall_64+0x67/0xc0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245641] ? ksys_read+0xd4/0xf0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245647] ? exit_to_user_mode_prepare+0x40/0x1e0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245656] ? syscall_exit_to_user_mode+0x27/0x40
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245662] ? do_syscall_64+0x67/0xc0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245666] ? do_syscall_64+0x67/0xc0
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245670] entry_SYSCALL_64_after_hwframe+0x64/0xce
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245681] RIP: 0033:0x55c8697c600e
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245687] RSP: 002b:000000c001bb8a30 EFLAGS: 00000212 ORIG_RAX: 00000000000000a6
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245693] RAX: 0000000000000000 RBX: 000000c001431490 RCX: 000055c8697c600e
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245696] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000c001431490
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245699] RBP: 000000c001bb8a70 R08: 0000000000000000 R09: 0000000000000000
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245702] R10: 0000000000000000 R11: 0000000000000212 R12: 0000000000000000
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245705] R13: 000000c00108ac00 R14: 000000c004220000 R15: 000000000000002f
Aug 20 17:28:03 cibuilder1 kernel: [12204984.245711] </TASK>
Last week we faced this issue twice on one of our builders. It's causing an overall red status for our complex pipelines. We are considering running the Docker container directly from the shell, omitting this plugin as the runtime.
sbence92, any news on this topic? Maybe you can improve the plugin to handle such behavior.
We are seeing the same issue, or at least the same root cause, for a failed job.
Our failure resulted in a zombie job we are unable to stop. The job keeps spamming the log with the following error message:
org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: ce10d002-a2fc-4086-b894-8e1aeea7388e
java.io.IOException: Failed to kill container '4cbe9de2b030bcdb4e4926b3439c07fa8bf358667fe511d206c51cb7a2241d10'.
    at PluginClassLoader for docker-workflow//org.jenkinsci.plugins.docker.workflow.client.DockerClient.stop(DockerClient.java:187)
    at PluginClassLoader for docker-workflow//org.jenkinsci.plugins.docker.workflow.WithContainerStep.destroy(WithContainerStep.java:111)
    at PluginClassLoader for docker-workflow//org.jenkinsci.plugins.docker.workflow.WithContainerStep$Execution.stop(WithContainerStep.java:235)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsThread.stop(CpsThread.java:315)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1246)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1235)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:995)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$wrap$2(CpsVmExecutorService.java:85)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
    at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:53)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:50)
    at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
    at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$categoryThreadFactory$0(CpsVmExecutorService.java:50)
    at java.base/java.lang.Thread.run(Thread.java:840)
org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: c77711b9-2f6c-443d-bf72-e2b7eeca7b6e
java.io.IOException: failed to run kill
    at PluginClassLoader for docker-workflow//org.jenkinsci.plugins.docker.workflow.WithContainerStep$Decorator$1.kill(WithContainerStep.java:385)
    at PluginClassLoader for durable-task//org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.stop(FileMonitoringTask.java:483)
    at PluginClassLoader for workflow-durable-task-step//org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.stop(DurableTaskStep.java:519)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsThread.stop(CpsThread.java:315)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1246)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1235)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:995)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$wrap$2(CpsVmExecutorService.java:85)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
    at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:53)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:50)
    at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
    at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
    at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$categoryThreadFactory$0(CpsVmExecutorService.java:50)
    at java.base/java.lang.Thread.run(Thread.java:840)
At the current time we are unable to stop the job at all!
based on
https://github.com/jenkinsci/docker-workflow-plugin/pull/312#issuecomment-2142546345
https://github.com/jenkinsci/docker-workflow-plugin/pull/312#issuecomment-2400747311
I say there's no hope for an official fix.
There have already been attempts:
https://github.com/jenkinsci/docker-workflow-plugin/pull/93
pointing out that 137 and 143 return codes on docker stop are fine:
https://github.com/docker-library/tomcat/issues/57#issuecomment-266556676
There are a few options we have:
1)
Hack on the plugin itself and install a customized version of it. You can open a PR to https://github.com/jenkinsci/docker-workflow-plugin and the pipeline will build an .hpi that you can install on your Jenkins server.
2)
As the (not) maintainer suggested in https://github.com/jenkinsci/docker-workflow-plugin/pull/312#issuecomment-2400747311, stop using this plugin and just use docker CLI commands in the jobs. You could still rely on the working parts like withDockerRegistry, and do the mounting and env var passing by hand with a plain docker run (a rough sketch is after this list).
edit:
3)
Reduce the load on the disk. We went from RAID 5 to RAID 0 on the build machines' storage disks; that helped. Also, you may try to move the Docker root directory to a different physical disk that's not used by the build. I'd expect Docker to be more responsive, and this whole issue might go away.
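A rough sketch of option 2, assuming the same PROJECT_DOCKER_* environment variables as in the original pipeline (the mount paths and the job command are placeholders):

node('label') {
    // keep the plugin's registry authentication, but run the container with the plain docker CLI,
    // so a slow `docker stop` only slows the step down instead of failing the build the way
    // WithContainerStep.destroy() does
    withDockerRegistry(url: env.PROJECT_DOCKER_REGISTRY, credentialsId: env.PROJECT_DOCKER_REGISTRY_CREDENTIAL_ID) {
        sh '''
            docker run --rm \
                -e http_proxy -e https_proxy -e no_proxy \
                -v "$WORKSPACE":"$WORKSPACE" -w "$WORKSPACE" \
                "$PROJECT_DOCKER_IMAGE" \
                python job.py   # placeholder for "python job runs here"
        '''
    }
}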
emb, do you have any hunch why it's happening on your setup? We speculated that some process was still holding on to some resource, but the ps -aux check above disproved that.