[JENKINS-49710] Pipelines run under heavy load sometimes hang running Docker

      We run load tests overnight in loops, about 50 at a time, which adds up to thousands of test runs per night. About 1% of them hang indefinitely and must be killed manually.

      Jenkins log:

      Started by upstream project "tools/release-validator" build number 92
      originally caused by:
       Started by timer
      Obtained Jenkinsfile from git [...]
      Running in Durability level: MAX_SURVIVABILITY
      Loading library TestRunner@master
      Attempting to resolve master from remote references...
       > git --version # timeout=10
       > git ls-remote -h -t [...] # timeout=10
      Found match: refs/heads/master revision 4f9f1287a87cedcccbe456d96176084fbfb2500c
       > git rev-parse --is-inside-work-tree # timeout=10
      Fetching changes from the remote Git repository
       > git config remote.origin.url [...] # timeout=10
      Fetching without tags
      Fetching upstream changes from [...]
       > git --version # timeout=10
       > git fetch --no-tags --progress [...] +refs/heads/*:refs/remotes/origin/*
      Checking out Revision 4f9f1287a87cedcccbe456d96176084fbfb2500c (master)
       > git config core.sparsecheckout # timeout=10
       > git checkout -f 4f9f1287a87cedcccbe456d96176084fbfb2500c
      Commit message: "[...]"
       > git rev-list --no-walk 4f9f1287a87cedcccbe456d96176084fbfb2500c # timeout=10
      [Pipeline] node
      Running on Jenkins in /var/jenkins_home/workspace/staging-load-tests/load-native_android_eu@10
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (checkout)
      [Pipeline] checkout
       > git rev-parse --is-inside-work-tree # timeout=10
      Fetching changes from the remote Git repository
       > git config remote.origin.url [...] # timeout=10
      Fetching upstream changes from [...]
       > git --version # timeout=10
       > git fetch --tags --progress [...] +refs/heads/*:refs/remotes/origin/*
       > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
       > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
      Checking out Revision 4d6e39a68e488aa7c9e130d664326af6c646d1cb (refs/remotes/origin/master)
       > git config core.sparsecheckout # timeout=10
       > git checkout -f 4d6e39a68e488aa7c9e130d664326af6c646d1cb
      Commit message: "Merge pull request #31 from [...]"
       > git rev-list --no-walk 4d6e39a68e488aa7c9e130d664326af6c646d1cb # timeout=10
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] stage
      [Pipeline] { (run test)
      [Pipeline] sh
      [load-native_android_eu@10] Running shell script
      + docker inspect -f . maven:3.5.2
      .
      [Pipeline] withDockerContainer
      Jenkins seems to be running inside container 5c894538586c4a19e2a60ca784403fbfda24cc75781a52ea8ae54028fecbe5ff
      $ docker run -t -d -u 0:0 -v /root/.m2:/root/.m2 -w /var/jenkins_home/workspace/staging-load-tests/load-native_android_eu@10 --volumes-from 5c894538586c4a19e2a60ca784403fbfda24cc75781a52ea8ae54028fecbe5ff -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** maven:3.5.2 cat
      $ docker top 901d717402c013afccae3074ec7e46c6ec70ce2e66f3e7e773ba9015d58c3cfa -eo pid,comm
      [Pipeline] // withDockerContainer
      [spinning wheel here]
      

      I notice that the "[Pipeline] // withDockerContainer" line seems out of place: normally it does not appear until much later, once the steps inside the container have finished.

      Thread dump:

      Thread #6
      	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:129)
      	at org.jenkinsci.plugins.docker.workflow.Docker.node(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:66)
      	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:123)
      	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:51)
      	at DSL.stage(Native Method)
      	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:44)
      	at DSL.node(running on )
      	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:36)
      	at TestRunner.call(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:17)
      	at WorkflowScript.run(WorkflowScript:8)
      

      The pipeline itself is implemented in a shared pipeline library. Here is the Jenkinsfile that invokes it:

      #!groovy
      @Library('TestRunner') _
      
      def test = { sh "mvn -q clean test -DthreadCount=${env.PARALLEL_TESTS ?: 5} -Dtest=${env.TESTS}" }
      
      TestRunner {
          steps = test
      }
      

      TestRunner has a bunch of code for flexibility but essentially runs something like:

      node {
        // checkout
        stage("test") {
          // Run the test steps inside a Maven container (image tag as seen in the log above)
          docker.image("maven:3.5.2").inside {
             steps()
          }
        }
      }
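
      As a possible stopgap for the manual kills, the container block could be wrapped in the standard Pipeline timeout step so that a hung build is aborted automatically. This is only a sketch of a Jenkinsfile fragment: the 30-minute limit is an arbitrary assumption, steps() stands in for the closure passed to TestRunner, and I have not verified that timeout can actually interrupt this particular hang.

```groovy
// Sketch only. The 30-minute limit is an assumed value, and it is untested
// whether timeout can interrupt the hang described in this issue.
node {
  // checkout
  stage("test") {
    // Abort the build if the container block does not finish in time
    timeout(time: 30, unit: 'MINUTES') {
      docker.image("maven:3.5.2").inside {
        steps()
      }
    }
  }
}
```

      This would not fix the underlying hang; at best it turns an indefinite hang into a failed build that can be rerun.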
      

      I can provide more detail if needed.

      Assignee: Unassigned
      Reporter: malcolm crum (crummy)
      Votes: 2
      Watchers: 4