[JENKINS-73567] docker-workflow-plugin: Failed to kill container

      We have a pipeline that creates and launches jobs in parallel inside Docker containers. The workload is spread across multiple machines. When the pipeline below launches 0-5 jobs in parallel, they complete successfully 100% of the time, but when it launches 35 jobs, 1 or 2 of them fail 100% of the time AFTER the workload itself completes successfully (the part marked "python job runs here" in the pipeline below).

      The error is always the same: the Docker plugin fails to stop a container. The Docker logs show that the container exited with code 137, meaning Docker was ultimately only able to stop the container with kill -9 (137 = 128 + 9, i.e. the process died from SIGKILL).

      pipeline {
        agent {
          node {
            label 'master'
          }
        }
      
        stages {
          stage('prev stage running docker containers') {}
      
          stage('problematic stage') {
            steps {
              script {
                unstash name: 'playbook'
      
                def playbook = readJSON file: 'playbook.json'
                def python_jobs = [:]
      
                int counter = 0
                playbook.each { job ->
                  python_jobs["worker ${counter++}"] = {
                    node(label: 'label') {
                      ws(dir: 'workspace/python') {
                        script {
                          docker.withRegistry(env.PROJECT_DOCKER_REGISTRY, env.PROJECT_DOCKER_REGISTRY_CREDENTIAL_ID) {
                            docker.image(env.PROJECT_DOCKER_IMAGE).inside('-e http_proxy -e https_proxy -e no_proxy') {
      
                              // python job runs here
                            }
                          }
                        }
                      }
                    }
                  }
                }
                python_jobs.failFast = false
                parallel python_jobs
              }
            }
          }
        }
      }
      

       

      Found unhandled java.io.IOException exception:
      Failed to kill container 'fd4059a173c0bbf107e9231194747ecfc28595f9579ecbd77b82209cf5b219eb'.
        org.jenkinsci.plugins.docker.workflow.client.DockerClient.stop(DockerClient.java:187)
        org.jenkinsci.plugins.docker.workflow.WithContainerStep.destroy(WithContainerStep.java:111)
        org.jenkinsci.plugins.docker.workflow.WithContainerStep$Callback.finished(WithContainerStep.java:415)
        org.jenkinsci.plugins.workflow.steps.BodyExecutionCallback$TailCall.onSuccess(BodyExecutionCallback.java:119)
        org.jenkinsci.plugins.workflow.cps.CpsBodyExecution$SuccessAdapter.receive(CpsBodyExecution.java:375)
        com.cloudbees.groovy.cps.Outcome.resumeFrom(Outcome.java:70)
        com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:144)
        org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:17)
        org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:49)
        org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:180)
        org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:423)
        org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:331)
        org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:295)
        org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$wrap$4(CpsVmExecutorService.java:136)
        java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
        jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
        jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
        jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
        java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:53)
        org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.call(CpsVmExecutorService.java:50)
        org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
        org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
        org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.lambda$categoryThreadFactory$0(CpsVmExecutorService.java:50)
        java.base/java.lang.Thread.run(Unknown Source)
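      Exit code 137 can be read back from Docker itself. A minimal diagnostic sketch, assuming the failed container has not yet been removed (the container ID is the one from the exception above):

        # Read back the recorded exit state of the container named in the
        # IOException; this only works while the container still exists.
        CONTAINER=fd4059a173c0bbf107e9231194747ecfc28595f9579ecbd77b82209cf5b219eb
        docker inspect --format 'exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' "$CONTAINER"
        docker logs --tail 50 "$CONTAINER"   # last output of the wrapped process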


          bence created issue -

          Emna added a comment -

          Any news regarding this bug? It's happening more and more often for my project, about 10% of the time.

          bence added a comment - edited

          update: we've saved the ps -aux output at the very end of the job (after the "python job runs here" step) and it shows nothing interesting for jobs that die with the failed-to-stop-container error.

          All processes that the job started have completed, so nothing should prevent stopping the container.

          USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
          ccjrbp         1  0.0  0.0   4420   516 pts/0    Ss+  11:52   0:00 cat
          ccjrbp        13  0.0  0.0      0     0 ?        Z    11:53   0:00 [sh] <defunct>
          ccjrbp        28  0.0  0.0      0     0 ?        Z    11:53   0:00 [sh] <defunct>
          ccjrbp        42  0.0  0.0      0     0 ?        Z    11:53   0:00 [sh] <defunct>
          ccjrbp       975  0.0  0.0   2624   108 ?        S    12:10   0:00 sh -c (cp '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/script.sh' '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/script.sh.copy'; { while [ -d '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd' -a \! -f '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-result.txt' ]; do touch '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-log.txt'; sleep 3; done } & jsc=durable-1a3de1372cb4edad46ab5c314c42f4ab57c261a8ecbfae1ae98ee95300865407; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/script.sh.copy' > '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-log.txt' 2>&1; echo $? > '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-result.txt.tmp'; mv '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-result.txt.tmp' '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-result.txt'; wait) >&- 2>&- &
          ccjrbp       977  0.0  0.0   2624   596 ?        S    12:10   0:00 sh -c (cp '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/script.sh' '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/script.sh.copy'; { while [ -d '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd' -a \! -f '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-result.txt' ]; do touch '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-log.txt'; sleep 3; done } & jsc=durable-1a3de1372cb4edad46ab5c314c42f4ab57c261a8ecbfae1ae98ee95300865407; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/script.sh.copy' > '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-log.txt' 2>&1; echo $? > '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-result.txt.tmp'; mv '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-result.txt.tmp' '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-0dfcb6cd/jenkins-result.txt'; wait) >&- 2>&- &
          ccjrbp      1287  0.0  0.0   4276   524 ?        S    12:12   0:00 sleep 3
          ccjrbp      1295  0.0  0.0   2624   112 ?        S    12:12   0:00 sh -c (cp '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/script.sh' '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/script.sh.copy'; { while [ -d '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f' -a \! -f '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-result.txt' ]; do touch '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-log.txt'; sleep 3; done } & jsc=durable-1a3de1372cb4edad46ab5c314c42f4ab57c261a8ecbfae1ae98ee95300865407; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/script.sh.copy' > '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-log.txt' 2>&1; echo $? > '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-result.txt.tmp'; mv '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-result.txt.tmp' '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-result.txt'; wait) >&- 2>&- &
          ccjrbp      1297  0.0  0.0   2624   112 ?        S    12:12   0:00 sh -c (cp '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/script.sh' '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/script.sh.copy'; { while [ -d '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f' -a \! -f '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-result.txt' ]; do touch '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-log.txt'; sleep 3; done } & jsc=durable-1a3de1372cb4edad46ab5c314c42f4ab57c261a8ecbfae1ae98ee95300865407; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/script.sh.copy' > '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-log.txt' 2>&1; echo $? > '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-result.txt.tmp'; mv '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-result.txt.tmp' '/net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/jenkins-result.txt'; wait) >&- 2>&- &
          ccjrbp      1298  0.0  0.0   2624   608 ?        S    12:12   0:00 sh -xe /net/abts55184/fs0/jenkins_love/workspace/ra6_sim@4@tmp/durable-f4f5411f/script.sh.copy
          ccjrbp      1300  0.0  0.0   7660  3240 ?        R    12:12   0:00 ps -aux
          ccjrbp      1301  0.0  0.0   4276   588 ?        S    12:12   0:00 sleep 3
          

          also, journalctl -u docker shows that the container failed to respond to the stop in time; that probably results in the nonzero return code:

          Aug 05 12:12:41 abts55184.de.bosch.com dockerd[49350]: time="2024-08-05T12:12:41.775666396+02:00" level=info msg="Container failed to exit within 1s of signal 15 - using the force" container=6f34194455e62a83f1f5e53f914b5f39e254398cfa7ec9a50b1bdf14c9a1164f
          Aug 05 12:12:42 abts55184.de.bosch.com dockerd[49350]: time="2024-08-05T12:12:42.221565413+02:00" level=info msg="ignoring event" container=6f34194455e62a83f1f5e53f914b5f39e254398cfa7ec9a50b1bdf14c9a1164f module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
          Aug 05 12:12:52 abts55184.de.bosch.com dockerd[49350]: time="2024-08-05T12:12:52.208809976+02:00" level=warning msg="Container failed to exit within 10s of kill - trying direct SIGKILL" container=6f34194455e62a83f1f5e53f914b5f39e254398cfa7ec9a50b1bdf14c9a1164f error="context deadline exceeded"
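
          For context, the DockerClient.stop frame in the stack trace corresponds roughly to the CLI sequence sketched below. This is an approximation, not the plugin's literal code; the 1-second grace period is an assumption that matches the "failed to exit within 1s of signal 15" line above. Note that an explicit timeout given to docker stop takes precedence over a container's --stop-timeout, which may explain why the --stop-timeout workaround mentioned later in this thread did not help:

          # Rough shell equivalent of the plugin's container teardown (assumed,
          # not copied from DockerClient.java): SIGTERM, short grace, then SIGKILL.
          CONTAINER=6f34194455e62a83f1f5e53f914b5f39e254398cfa7ec9a50b1bdf14c9a1164f
          docker stop --time=1 "$CONTAINER" || echo "docker stop exited with status $?"
          # A nonzero status from this sequence surfaces in Jenkins as
          # java.io.IOException: Failed to kill container '...'
          docker rm -f "$CONTAINER"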
          


          bence added a comment -

          emb, do you have any hunch why it's happening on your setup? We speculated that some process was still holding on to a resource, but the ps -aux output above disproved that.


          Toradex added a comment - edited

          Hello, we're also experiencing this behavior when executing a Python-based script in Docker as part of a pipeline. We have been investigating it for two months.

          It's really questionable where the root cause lies: on the Docker side or on the Jenkins side?

          The final error we found is in the Docker logs; it is the reason for this Jenkins exception:

          "Container failed to exit within 10s of kill - trying direct SIGKILL" container=bcca3c1624e019cc7950a344924c6bda9075bc3d5196ff232f1da1bdbb6f4391 error="context deadline exceeded"

          The error on the Jenkins side is: java.io.IOException: Failed to kill container '3c2590bf8920936ef88515294d88a9b2e315fcaebe9bf4a7181011cfe21181ed'.

          We also ran the ps -aux command as a last step inside the container, and there were no differences in output between failed and successful jobs.

          As a workaround, we tried to override some Docker variables in order to extend the maximum stop timeout, but it did not help:

          docker {
              image 'xxxxxxxxxxxxxxxxxxxxxxxxxx_job_launcher:2.9.0'
              // adding extra arguments here, in order to extend the stop time for the
              // LAVA test container, because of sporadic issues
              args '--stop-timeout 360'
          }

          So far we suspect the way Jenkins does this wrapping (with a cat command) to manage/start/stop the Docker container.

          It also more or less correlates with the load on the build servers, i.e. the number of running containers. We utilize our servers quite heavily on the weekend, and this error happens mostly during weekend processing.


          bence added a comment -

          We are able to reproduce the bug. It's due to very high IO load, so it seems to be Docker-related and to have nothing to do with the plugin.

          For some reason, 8 parallel unzip operations on a 64-core machine cripple the Docker daemon, and we're unable to start or stop a container even when doing it by hand in the terminal. emb, tdx_automation, could you please try to reproduce? Create 8 different 5 GB files, zip them, then unzip all 8 in parallel and try to `docker run --rm -it <whatever image> /bin/sh`. Or start a container, leave it sitting there, start the unzipping, and try to `docker stop` the container (see the sketch below).
          Both of these operations hung until the unzipping completed, plus several seconds.

          What's strange is that we're unable to reproduce this on laptops.
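
          A minimal sketch of that reproduction; the file sizes, names, and the alpine image are placeholders, not taken from the report:

          # Hypothetical reproduction of the IO-load scenario described above.
          for i in $(seq 1 8); do
            dd if=/dev/zero of="big$i.bin" bs=1M count=5120   # 8 x 5 GB files
            zip "big$i.zip" "big$i.bin"
          done
          docker run -d --name io-test alpine sleep infinity   # container to stop later
          for i in $(seq 1 8); do unzip -o "big$i.zip" -d "out$i" & done   # parallel IO load
          time docker stop io-test    # under heavy IO this hangs until the unzips finish
          wait
          docker rm -f io-test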


          Toradex added a comment - edited

          Hi bence, yep, we will try to reproduce it on our servers the way you describe.

          >>it seems to be docker related and nothing to do with the plugin.

          OK, if so, give us more control over how we can influence the container's behavior. If the host machine is heavily loaded, it's fine for us to wait until the load normalizes, OR to set a maximum CPU for the container we are spinning up, OR to override some of the commands sent to the Docker daemon via an argument, or even the whole entrypoint of the container.

          Right now we are just launching this 'cat' command, which wraps all the user processes.


          bence added a comment -

          update: we've changed the filesystem from xfs to ext4 and the problem seems to be gone. We'll roll out this change to every machine, and I'll keep you updated on the stability of the cluster. Out of curiosity, what filesystem are you using?
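
          For anyone comparing setups, the filesystem backing Docker's data root can be checked like this (a sketch; /var/lib/docker is the default data root and may differ per host):

          docker info --format 'root={{.DockerRootDir}} driver={{.Driver}}'
          df -T /var/lib/docker    # filesystem type (xfs, ext4, ...)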


          Toradex added a comment - edited

          >>Out of curiosity, what file system are you using? 

          In our case it's ext4 on a local hard drive, and the failing container uses it.

          Also, generally on the builder machines we use some shared locations, mapped as local drives, with the nfs4 filesystem.

          Sayyed made changes -
          Resolution: Fixed
          Status: Open → Closed

            Assignee: Unassigned
            Reporter: bence (sbence92)