
docker-workflow-plugin Failed to kill container

      We have a pipeline that creates and launches jobs in parallel inside Docker containers, with the workload spread across multiple machines. When the pipeline below launches 0-5 jobs in parallel, they complete successfully 100% of the time; but when it launches 35 jobs, 1 or 2 of them fail every time, AFTER the workload itself has completed successfully (shown by "python job runs here" in the pipeline below).

      The error is always the same: the Docker plugin fails to stop a container. The Docker logs show that the container exited with code 137, meaning Docker was finally able to stop the container with kill -9 (SIGKILL).

      pipeline {
        agent {
          node {
            label 'master'
          }
        }

        stages {
          stage('prev stage running docker containers') {}

          stage('problematic stage') {
            steps {
              script {
                unstash name: 'playbook'

                def playbook = readJSON file: 'playbook.json'
                def python_jobs = [:]

                int counter = 0
                playbook.each { job ->
                  python_jobs["worker ${counter++}"] = {
                    node(label: 'label') {
                      ws(dir: 'workspace/python') {
                        script {
                          docker.withRegistry(env.PROJECT_DOCKER_REGISTRY, env.PROJECT_DOCKER_REGISTRY_CREDENTIAL_ID) {
                            docker.image(env.PROJECT_DOCKER_IMAGE).inside('-e http_proxy -e https_proxy -e no_proxy') {
                              // python job runs here
                            }
                          }
                        }
                      }
                    }
                  }
                }

                python_jobs.failFast = false
                parallel python_jobs
              }
            }
          }
        }
      }
      

       

      Found unhandled java.io.IOException exception:
      Failed to kill container 'fd4059a173c0bbf107e9231194747ecfc28595f9579ecbd77b82209cf5b219eb'.
      	org.jenkinsci.plugins.docker.workflow.client.DockerClient.stop(DockerClient.java:187)
      	org.jenkinsci.plugins.docker.workflow.WithContainerStep.destroy(WithContainerStep.java:111)
      	org.jenkinsci.plugins.docker.workflow.WithContainerStep$Callback.finished(WithContainerStep.java:415)
      	org.jenkinsci.plugins.workflow.steps.BodyExecutionCallback$TailCall.onSuccess(BodyExecutionCallback.java:119)
      	org.jenkinsci.plugins.workflow.cps.CpsBodyExecution$SuccessAdapter.receive(CpsBodyExecution.java:375)
      	com.cloudbees.groovy.cps.Outcome.resumeFrom(Outcome.java:70)
      	com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:144)
      	org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:17)
      	org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:49)
      	org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:180)
      	org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:423)
      	org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:331)
      	org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:295)
      	org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$wrap$4(CpsVmExecutorService.java:136)
      	java.base/java.util.concurrent.FutureTask.run(Unknown Source)
      	hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
      	jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
      	jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
      	java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
      	java.base/java.util.concurrent.FutureTask.run(Unknown Source)
      	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      	org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:53)
      	org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:50)
      	org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
      	org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
      	org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$categoryThreadFactory$0(CpsVmExecutorService.java:50)
      	java.base/java.lang.Thread.run(Unknown Source)

          [JENKINS-73567] docker-workflow-plugin Failed to kill container

          bence added a comment -

          emb, do you have any hunch why it's happening on your setup? We speculated that some process was still holding on to a resource, but the ps -aux output above disproved that.


          Toradex added a comment - edited

          Hello, we're also experiencing this behavior when executing a Python-based script in Docker as part of a pipeline. We have been investigating it for two months.

          It's really questionable where the root cause lies: on the Docker side or on the Jenkins side?

          The final error we found is in the Docker logs; it is the reason for this Jenkins exception:

          Container failed to exit within 10s of kill - trying direct SIGKILL" container=bcca3c1624e019cc7950a344924c6bda9075bc3d5196ff232f1da1bdbb6f4391 error="context deadline exceeded"

           

          The error on the Jenkins side is: java.io.IOException: Failed to kill container '3c2590bf8920936ef88515294d88a9b2e315fcaebe9bf4a7181011cfe21181ed'.

          We also ran ps -aux as the last step inside the container, and there were no differences in output between a failed job and a successful one.

          As a workaround, we tried to override some Docker arguments in order to extend the maximum stop timeout, but it did not help:

          docker {
            image 'xxxxxxxxxxxxxxxxxxxxxxxxxx_job_launcher:2.9.0'
            // adding extra arguments here in order to extend the stop time for the LAVA test container, because of sporadic issues
            args '--stop-timeout 360'
          }
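
          For the scripted docker.image(...).inside(...) form used in the pipeline at the top of this issue, the same extra docker run arguments can be appended to the inside() argument string. A minimal sketch (the 360 s value is illustrative and, as noted above, extending the timeout did not help in our case):

          docker.image(env.PROJECT_DOCKER_IMAGE).inside('-e http_proxy -e https_proxy -e no_proxy --stop-timeout 360') {
            // python job runs here
          }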

          So far we are pointing at how Jenkins does this wrapping (with a cat command) to manage / start / stop the Docker container.

          It also more or less correlates with the load on the build servers, i.e. the number of running containers. We utilize our servers quite heavily on the weekend, and this error happens mostly during weekend processing.

          bence added a comment -

          We are able to reproduce the bug. It's due to very high IO load, so it seems to be Docker-related and nothing to do with the plugin.

          For some reason, unzipping 8 files in parallel on a 64-core machine cripples the Docker daemon, and we're unable to start or stop a container even by hand in the terminal. emb, tdx_automation, could you please try to reproduce? Create 8 different 5 GB files, zip them, then unzip all 8 in parallel and try `docker run --rm -it <whatever image> /bin/sh`. Or start a container, leave it sitting there, start the unzipping, and try to `docker stop` the container.
          Both of these operations hung until the unzipping completed, plus several seconds.

          What's strange is that we're unable to reproduce this on laptops.
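
          A rough sketch of the reproduction described above, wrapped in a throwaway scripted pipeline so it can run on a build node; the node label and the alpine image are placeholders, and the same commands can equally be run by hand in a shell:

          node('label') {
              // create eight ~5 GB archives once; they are reused on later runs
              sh '''
                  for i in $(seq 1 8); do
                      [ -f big_$i.zip ] || { dd if=/dev/zero of=big_$i.bin bs=1M count=5120; zip big_$i.zip big_$i.bin; }
                  done
              '''
              // start a long-lived idle container (stands in for the plugin's `cat` wrapper)
              sh 'docker run -d --name io-load-test alpine:3 sleep 3600'
              sh '''
                  # unzip all eight archives in parallel to saturate disk IO
                  for i in $(seq 1 8); do unzip -o big_$i.zip -d out_$i & done
                  sleep 5
                  # while the load is running, observe how long a plain docker stop takes;
                  # on an overloaded host this is where the 10s / SIGKILL timeouts show up
                  docker stop io-load-test
                  wait
              '''
              sh 'docker rm -f io-load-test || true'
          }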


          Toradex added a comment - edited

          Hi bence, yes, we will try to reproduce it on our servers in the way you describe.

          >> it seems to be docker related and nothing to do with the plugin.

          OK, if so, give us more control over how we can affect the container's behavior. If the host machine is heavily loaded, it's fine for us to wait until the load normalizes, OR to set a maximum CPU for the container we are spinning up, OR to override some of the arguments the plugin passes to the Docker daemon, or even the whole entrypoint of the container.

          Right now the plugin just launches this 'cat' command, which wraps all the user processes.


          bence added a comment -

          Update: we've changed the filesystem from xfs to ext4 and the problem seems to be gone. We'll roll out this change to every machine, and I'll keep you updated on the stability of the cluster. Out of curiosity, what filesystem are you using?


          Toradex added a comment - edited

          >> Out of curiosity, what file system are you using?

          In our case it's ext4 on a local hard drive, and the failed container uses it.

          Also, on the builder machines we generally use some shared locations, mapped as local drives, with the nfs4 filesystem.


          Toradex added a comment - edited

          Hi sbence92, emb. We were able to reproduce the issue on one of our builders.

          Preconditions:

          • We are unzipping 10 files of ~3 GB each in parallel on the server
          • We have the Docker container 'yeasy/simple-web:latest' running with a /bin/sh session open
          • We execute docker stop on this container
          • Everything is executed on ext4 on a local filesystem

          Below are the logs we have from the Docker daemon, with my comments highlighted by ******.

          ************
          unzip of 8 files in parallel is running
          **************

          Aug 20 17:25:17 cibuilder1 dockerd[918901]: time="2024-08-20T17:25:17.816448768Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=6169d597f2936838c7baa52e4b36d9f24906186ca4b9ac34b554eec431e8dc2a

          Aug 20 17:25:17 cibuilder1 systemd[1]: docker-6169d597f2936838c7baa52e4b36d9f24906186ca4b9ac34b554eec431e8dc2a.scope: Deactivated successfully.
          Aug 20 17:25:17 cibuilder1 dockerd[918901]: time="2024-08-20T17:25:17.989691349Z" level=info msg="ignoring event" container=6169d597f2936838c7baa52e4b36d9f24906186ca4b9ac34b554eec431e8dc2a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

          ************
          I have an active /bin/bash session running inside the container that I want to stop; maybe this warning relates to that.
          **************

          Aug 20 17:25:18 cibuilder1 dockerd[918901]: time="2024-08-20T17:25:18.006435113Z" level=warning msg="failed to close stdin: NotFound: task 6169d597f2936838c7baa52e4b36d9f24906186ca4b9ac34b554eec431e8dc2a not found: not found"

          ************
          The same error message, >> context deadline exceeded, which leads to the failure in the Jenkins plugin. The user session in the container is not responding. The command whoami inside the container does not return any result. The docker ps output shows this container as running.
          **************

          Aug 20 17:25:27 cibuilder1 dockerd[918901]: time="2024-08-20T17:25:27.977545681Z" level=warning msg="Container failed to exit within 10s of kill - trying direct SIGKILL" container=6169d597f2936838c7baa52e4b36d9f24906186ca4b9ac34b554eec431e8dc2a error="context deadline exceeded"

          ************
          3 minutes later, the container is finally stopped(!), and we have these records in the log:
          **************
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.243829] INFO: task dockerd:1651478 blocked for more than 120 seconds.
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245425] task:dockerd         state stack:0     pid:1651478 ppid:1      flags:0x00004002

           

          My summary: as we can see, after some period of glitches and slowdown the Docker daemon is eventually able to stop this container; on the Jenkins side, however, the result is a reported Java exception and a failed pipeline.

           

          sbence92, can you please file a feature request for smoother interaction with the Docker daemon, so that such scenarios are handled without a Java-based Jenkins exception? Thanks.

           

           

          Updated: added the full log output from when the Docker container was finally stopped.

          Aug 20 17:28:03 cibuilder1 kernel: [12204984.243829] INFO: task dockerd:1651478 blocked for more than 120 seconds.
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.244406]       Tainted: G        W I        6.1.0-18-amd64 #1 Debian 6.1.76-1
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.244920] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245425] task:dockerd         state stack:0     pid:1651478 ppid:1      flags:0x00004002
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245436] Call Trace:
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245440]  <TASK>
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245447]  __schedule+0x34d/0x9e0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245463]  schedule+0x5a/0xd0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245470]  wb_wait_for_completion+0x82/0xb0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245482]  ? cpuusage_read+0x10/0x10
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245490]  sync_inodes_sb+0xda/0x2b0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245495]  ? inode_to_bdi+0x34/0x50
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245505]  ? filemap_fdatawrite_wbc+0x19/0x80
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245513]  ? __filemap_fdatawrite_range+0x58/0x80
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245520]  sync_filesystem+0x60/0x90
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245531]  ovl_sync_fs+0x56/0x90 [overlay]
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245557]  sync_filesystem+0x7a/0x90
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245563]  generic_shutdown_super+0x22/0x130
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245573]  kill_anon_super+0x14/0x30
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245580]  deactivate_locked_super+0x2f/0xa0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245588]  cleanup_mnt+0xbd/0x150
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245597]  task_work_run+0x59/0x90
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245604]  exit_to_user_mode_prepare+0x1dd/0x1e0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245611]  syscall_exit_to_user_mode+0x17/0x40
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245619]  do_syscall_64+0x67/0xc0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245625]  ? exit_to_user_mode_prepare+0x40/0x1e0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245630]  ? syscall_exit_to_user_mode+0x27/0x40
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245636]  ? do_syscall_64+0x67/0xc0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245641]  ? ksys_read+0xd4/0xf0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245647]  ? exit_to_user_mode_prepare+0x40/0x1e0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245656]  ? syscall_exit_to_user_mode+0x27/0x40
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245662]  ? do_syscall_64+0x67/0xc0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245666]  ? do_syscall_64+0x67/0xc0
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245670]  entry_SYSCALL_64_after_hwframe+0x64/0xce
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245681] RIP: 0033:0x55c8697c600e
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245687] RSP: 002b:000000c001bb8a30 EFLAGS: 00000212 ORIG_RAX: 00000000000000a6
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245693] RAX: 0000000000000000 RBX: 000000c001431490 RCX: 000055c8697c600e
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245696] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000c001431490
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245699] RBP: 000000c001bb8a70 R08: 0000000000000000 R09: 0000000000000000
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245702] R10: 0000000000000000 R11: 0000000000000212 R12: 0000000000000000
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245705] R13: 000000c00108ac00 R14: 000000c004220000 R15: 000000000000002f
          Aug 20 17:28:03 cibuilder1 kernel: [12204984.245711]  </TASK>

           


          Toradex added a comment -

          Last week we faced this issue twice on one of our builders. It causes an overall red status for our complex pipelines. We are considering running the Docker container directly from the shell, bypassing this plugin as the runtime.

          sbence92, any news on this topic? Maybe you can improve the plugin to handle such behavior.


          Kenni Thomasberg added a comment -

          We are seeing the same issue, or at least the same root cause, for a failed job.

          Our failure resulted in a zombie job that we are unable to stop. The job keeps spamming the log with the following error message:

          org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: ce10d002-a2fc-4086-b894-8e1aeea7388e
          java.io.IOException: Failed to kill container '4cbe9de2b030bcdb4e4926b3439c07fa8bf358667fe511d206c51cb7a2241d10'.
          	at PluginClassLoader for docker-workflow//org.jenkinsci.plugins.docker.workflow.client.DockerClient.stop(DockerClient.java:187)
          	at PluginClassLoader for docker-workflow//org.jenkinsci.plugins.docker.workflow.WithContainerStep.destroy(WithContainerStep.java:111)
          	at PluginClassLoader for docker-workflow//org.jenkinsci.plugins.docker.workflow.WithContainerStep$Execution.stop(WithContainerStep.java:235)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsThread.stop(CpsThread.java:315)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1246)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1235)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:995)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$wrap$2(CpsVmExecutorService.java:85)
          	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
          	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
          	at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
          	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
          	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
          	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:53)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:50)
          	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
          	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$categoryThreadFactory$0(CpsVmExecutorService.java:50)
          	at java.base/java.lang.Thread.run(Thread.java:840)
          org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: c77711b9-2f6c-443d-bf72-e2b7eeca7b6e
          java.io.IOException: failed to run kill
          	at PluginClassLoader for docker-workflow//org.jenkinsci.plugins.docker.workflow.WithContainerStep$Decorator$1.kill(WithContainerStep.java:385)
          	at PluginClassLoader for durable-task//org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.stop(FileMonitoringTask.java:483)
          	at PluginClassLoader for workflow-durable-task-step//org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.stop(DurableTaskStep.java:519)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsThread.stop(CpsThread.java:315)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1246)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1235)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:995)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$wrap$2(CpsVmExecutorService.java:85)
          	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
          	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
          	at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
          	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
          	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
          	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:53)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:50)
          	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
          	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
          	at PluginClassLoader for workflow-cps//org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$categoryThreadFactory$0(CpsVmExecutorService.java:50)
          	at java.base/java.lang.Thread.run(Thread.java:840) 

          At the current time we are unable to stop the job!


          bence added a comment - edited

          kth, tdx_automation:

          based on
          https://github.com/jenkinsci/docker-workflow-plugin/pull/312#issuecomment-2142546345
          https://github.com/jenkinsci/docker-workflow-plugin/pull/312#issuecomment-2400747311

          I say there's no hope for an official fix.

          There have already been attempts:
          https://github.com/jenkinsci/docker-workflow-plugin/pull/93

          and comments pointing out that 137 and 143 return codes on docker stop are fine:
          https://github.com/docker-library/tomcat/issues/57#issuecomment-266556676

          There are a few options we have:

          1)
          Hack on the plugin itself and install a customized version of it. You can open a PR to https://github.com/jenkinsci/docker-workflow-plugin and the CI pipeline will build an .hpi that you can install on your Jenkins server.

          2)
          As the (not a) maintainer suggested in https://github.com/jenkinsci/docker-workflow-plugin/pull/312#issuecomment-2400747311, stop using this plugin and just use docker CLI commands in the jobs. You can still rely on the working parts like withDockerRegistry, do the mounting and env var passing by hand, and just do a plain docker run (see the sketch after this list).

           

          Edit:

          3)
          Reduce the load on the disk. We went from RAID 5 to RAID 0 on the build machines' storage disks, and that helped. Also, you may try to move the Docker root directory to a different physical disk that is not used by the build. I'd expect Docker to be more responsive, and this whole issue might go away.
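
          A rough sketch of option 2, reusing the registry and image environment variables from the pipeline at the top of this issue. The workspace mount and the python command are illustrative placeholders, and it assumes the docker login performed by withRegistry/withDockerRegistry applies to plain docker CLI calls made inside the block:

          node('label') {
              docker.withRegistry(env.PROJECT_DOCKER_REGISTRY, env.PROJECT_DOCKER_REGISTRY_CREDENTIAL_ID) {
                  sh "docker pull ${env.PROJECT_DOCKER_IMAGE}"
                  // run the workload with a plain `docker run`; --rm removes the container itself,
                  // so the plugin's stop/kill logic is never involved
                  sh """
                      docker run --rm \\
                          -e http_proxy -e https_proxy -e no_proxy \\
                          -v "\$WORKSPACE:/workspace" -w /workspace \\
                          ${env.PROJECT_DOCKER_IMAGE} \\
                          python run_job.py
                  """
              }
          }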


            Assignee: Unassigned
            Reporter: bence (sbence92)
            Votes: 1
            Watchers: 5