JENKINS-49710

Pipelines run under heavy load sometimes hang running Docker

      We run load tests overnight in loops, about 50 at a time, which adds up to thousands of tests per night. About 1% of them hang forever and must be killed manually.

      Jenkins log:

      Started by upstream project "tools/release-validator" build number 92
      originally caused by:
       Started by timer
      Obtained Jenkinsfile from git [...]
      Running in Durability level: MAX_SURVIVABILITY
      Loading library TestRunner@master
      Attempting to resolve master from remote references...
       > git --version # timeout=10
       > git ls-remote -h -t [...] # timeout=10
      Found match: refs/heads/master revision 4f9f1287a87cedcccbe456d96176084fbfb2500c
       > git rev-parse --is-inside-work-tree # timeout=10
      Fetching changes from the remote Git repository
       > git config remote.origin.url [...] # timeout=10
      Fetching without tags
      Fetching upstream changes from [...]
       > git --version # timeout=10
       > git fetch --no-tags --progress [...] +refs/heads/*:refs/remotes/origin/*
      Checking out Revision 4f9f1287a87cedcccbe456d96176084fbfb2500c (master)
       > git config core.sparsecheckout # timeout=10
       > git checkout -f 4f9f1287a87cedcccbe456d96176084fbfb2500c
      Commit message: "[...]"
       > git rev-list --no-walk 4f9f1287a87cedcccbe456d96176084fbfb2500c # timeout=10
      [Pipeline] node
      Running on Jenkins in /var/jenkins_home/workspace/staging-load-tests/load-native_android_eu@10
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (checkout)
      [Pipeline] checkout
       > git rev-parse --is-inside-work-tree # timeout=10
      Fetching changes from the remote Git repository
       > git config remote.origin.url [...] # timeout=10
      Fetching upstream changes from [...]
       > git --version # timeout=10
       > git fetch --tags --progress [...] +refs/heads/*:refs/remotes/origin/*
       > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
       > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
      Checking out Revision 4d6e39a68e488aa7c9e130d664326af6c646d1cb (refs/remotes/origin/master)
       > git config core.sparsecheckout # timeout=10
       > git checkout -f 4d6e39a68e488aa7c9e130d664326af6c646d1cb
      Commit message: "Merge pull request #31 from [...]"
       > git rev-list --no-walk 4d6e39a68e488aa7c9e130d664326af6c646d1cb # timeout=10
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] stage
      [Pipeline] { (run test)
      [Pipeline] sh
      [load-native_android_eu@10] Running shell script
      + docker inspect -f . maven:3.5.2
      .
      [Pipeline] withDockerContainer
      Jenkins seems to be running inside container 5c894538586c4a19e2a60ca784403fbfda24cc75781a52ea8ae54028fecbe5ff
      $ docker run -t -d -u 0:0 -v /root/.m2:/root/.m2 -w /var/jenkins_home/workspace/staging-load-tests/load-native_android_eu@10 --volumes-from 5c894538586c4a19e2a60ca784403fbfda24cc75781a52ea8ae54028fecbe5ff -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** maven:3.5.2 cat
      $ docker top 901d717402c013afccae3074ec7e46c6ec70ce2e66f3e7e773ba9015d58c3cfa -eo pid,comm
      [Pipeline] // withDockerContainer
      [spinning wheel here]
      

      I notice that `// withDockerContainer` seems out of place: here it appears right after `docker top`, whereas normally it doesn't occur until much later.

      Thread dump:

      Thread #6
      	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:129)
      	at org.jenkinsci.plugins.docker.workflow.Docker.node(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:66)
      	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:123)
      	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:51)
      	at DSL.stage(Native Method)
      	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:44)
      	at DSL.node(running on )
      	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:36)
      	at TestRunner.call(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:17)
      	at WorkflowScript.run(WorkflowScript:8)
      

      The pipeline itself is driven by a pipeline library script. Here's the script that triggers it:

      #!groovy
      @Library('TestRunner') _
      
      def test = { sh "mvn -q clean test -DthreadCount=${env.PARALLEL_TESTS ?: 5} -Dtest=${env.TESTS}" }
      
      TestRunner {
          steps = test
      }
      

      TestRunner has a bunch of code for flexibility but essentially runs something like:

      node {
        // checkout
        stage("test") {
          // image tag taken from the log above
          docker.image("maven:3.5.2").inside {
             steps()
          }
        }
      }
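
Because the hung builds otherwise have to be killed by hand, one possible stopgap is to bound the container block with the `timeout` step. This is only a sketch: the 30-minute limit is an assumed value, `sh "mvn -q clean test"` stands in for the real steps, and whether `timeout` can actually interrupt this particular hang has not been verified here.

```groovy
node {
  stage("test") {
    // Assumed upper bound; pick a value comfortably above the slowest healthy run.
    timeout(time: 30, unit: 'MINUTES') {
      docker.image("maven:3.5.2").inside {
        // Placeholder for the real test steps.
        sh "mvn -q clean test"
      }
    }
  }
}
```

If the limit is reached, the build is aborted instead of waiting indefinitely, which at least frees the executor for the next loop iteration.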
      

      I can provide more detail if needed.


          Abdulla Hawara added a comment (edited)

          My theory about the issue:

          Jenkins gets a `Text file busy` error from `durable`, which is used by the `pipeline nodes and processes` plugin.
          The error is easy to reproduce by running any command, e.g. `sh "echo 'Hello'"`, in any Jenkins job many times in parallel.
          The hang occurs when a Docker container is started through the Docker plugin (the container stays alive). When we run `docker` manually with `sh 'docker ...'` it never hangs, but the `Text file busy` error still appears because we run a lot of jobs at the same time.
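
For comparison, running `docker` manually from `sh` could look roughly like the sketch below. The flags mirror the `docker run` line in the issue's log; `JENKINS_CONTAINER_ID` is a hypothetical environment variable holding the ID of the container Jenkins itself runs in, and the `mvn` command is a placeholder for the real test invocation.

```groovy
node {
    stage("test") {
        // Plain `docker run` through sh instead of the Docker plugin's container block.
        // --rm removes the container as soon as mvn exits, so nothing is left running.
        // JENKINS_CONTAINER_ID is hypothetical; WORKSPACE is set by Jenkins for the sh step.
        sh '''
            docker run --rm -u 0:0 \
              -v /root/.m2:/root/.m2 \
              --volumes-from "$JENKINS_CONTAINER_ID" \
              -w "$WORKSPACE" \
              maven:3.5.2 mvn -q clean test
        '''
    }
}
```

Here the whole test run is a single `sh` step, so the container's lifetime is tied to the `mvn` process rather than to a long-lived `cat`.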

           

          To reproduce the `Text file busy` error, follow these steps:

           

          1. Run Jenkins locally (ver. 2.89.4): `docker run -p 8080:8080 -p 50000:50000 jenkins/jenkins:lts`
          2. Install just the recommended plugins.
          3. Create two new pipeline projects called `hello` and `hello2`.
          4. Put this code in `hello`:

          node{
              sh "echo 'hello'"
          }
          

          and this in `hello2`:

          node{
              sh "echo 'hello'"
              sleep(2)
          }
          

          5. Create a new pipeline project called `runner` and put this inside:

          COUNTER = 0
          
          node{
              def jobs = [:]
              
              // add 24 triggers for `hello` to run later in parallel
              24.times {
                  jobs[('runner' + COUNTER++)] = {triggerProject('hello')()}
              }
              
              // add 24 triggers for `hello2` to run later in parallel
              24.times {
                  jobs[('runner' + COUNTER++)] = {triggerProject('hello2')()}
              }
              
              // run the whole batch of 48 parallel builds 20 times in a row
              20.times {
                  parallel jobs
              }
          }
          
          def triggerProject(jobName) {
              return {
                  try{
                      build job: jobName, parameters: [string(name: 'VALUE', value: String.valueOf(COUNTER++))]
                  } catch (ex){
                      println ex
                  }
              }
          }
          

          6. In the Jenkins configuration, change the number of executors to `50`.
          7. Run `runner` once. If you get a sandbox exception, go to In-process Script Approval on the Manage Jenkins page and approve all the commands.
          8. Run `runner` again and you will see that some of the `hello` and `hello2` builds fail with the `Text file busy` error.

          The logs you get afterwards:

          Running in Durability level: MAX_SURVIVABILITY
          [Pipeline] node
          Running on Jenkins in /var/jenkins_home/workspace/hello@6
          [Pipeline] {
          [Pipeline] sh
          [hello@6] Running shell script
          sh: 1: /var/jenkins_home/workspace/hello@6@tmp/durable-a771d7dd/script.sh: Text file busy
          [Pipeline] }
          [Pipeline] // node
          [Pipeline] End of Pipeline
          ERROR: script returned exit code 2
          Finished: FAILURE
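
A possible stopgap for the sporadic failure, not a fix: wrap the `sh` step in `retry` so that a failed attempt is run again. The count of 3 is an arbitrary assumption, and this sketch assumes a retried attempt does not hit the same error, which the steps above do not establish.

```groovy
node {
    // Assumed retry count; each attempt re-runs the sh step from scratch.
    retry(3) {
        sh "echo 'hello'"
    }
}
```

This only hides the symptom in the `hello` jobs; the failed attempts still appear in the build log, so the error rate can still be tracked.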
          


          Constantin Bugneac added a comment

          I'm experiencing the same issue sporadically.

            Assignee: Unassigned
            Reporter: malcolm crum (crummy)
            Votes: 2
            Watchers: 4