Bug
Resolution: Unresolved
Minor
None
Jenkins ver. 2.89.3 and 2.89.4, Docker Commons 1.9 and 1.11, Docker Pipeline 1.15 and 1.15.1
We have some load tests that run ~50 tests at a time overnight, in loops - so thousands of tests in a night. About 1% of them hang forever and must be manually killed.
Jenkins log:
Started by upstream project "tools/release-validator" build number 92 originally caused by: Started by timer Obtained Jenkinsfile from git [...] Running in Durability level: MAX_SURVIVABILITY Loading library TestRunner@master Attempting to resolve master from remote references... > git --version # timeout=10 > git ls-remote -h -t [...] # timeout=10 Found match: refs/heads/master revision 4f9f1287a87cedcccbe456d96176084fbfb2500c > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url [...] # timeout=10 Fetching without tags Fetching upstream changes from [...] > git --version # timeout=10 > git fetch --no-tags --progress [...] +refs/heads/*:refs/remotes/origin/* Checking out Revision 4f9f1287a87cedcccbe456d96176084fbfb2500c (master) > git config core.sparsecheckout # timeout=10 > git checkout -f 4f9f1287a87cedcccbe456d96176084fbfb2500c Commit message: "[...]" > git rev-list --no-walk 4f9f1287a87cedcccbe456d96176084fbfb2500c # timeout=10 [Pipeline] node Running on Jenkins in /var/jenkins_home/workspace/staging-load-tests/load-native_android_eu@10 [Pipeline] { [Pipeline] stage [Pipeline] { (checkout) [Pipeline] checkout > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url [...] # timeout=10 Fetching upstream changes from [...] > git --version # timeout=10 > git fetch --tags --progress [...] 
+refs/heads/*:refs/remotes/origin/* > git rev-parse refs/remotes/origin/master^{commit} # timeout=10 > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10 Checking out Revision 4d6e39a68e488aa7c9e130d664326af6c646d1cb (refs/remotes/origin/master) > git config core.sparsecheckout # timeout=10 > git checkout -f 4d6e39a68e488aa7c9e130d664326af6c646d1cb Commit message: "Merge pull request #31 from [...]" > git rev-list --no-walk 4d6e39a68e488aa7c9e130d664326af6c646d1cb # timeout=10 [Pipeline] } [Pipeline] // stage [Pipeline] stage [Pipeline] { (run test) [Pipeline] sh [load-native_android_eu@10] Running shell script + docker inspect -f . maven:3.5.2 . [Pipeline] withDockerContainer Jenkins seems to be running inside container 5c894538586c4a19e2a60ca784403fbfda24cc75781a52ea8ae54028fecbe5ff $ docker run -t -d -u 0:0 -v /root/.m2:/root/.m2 -w /var/jenkins_home/workspace/staging-load-tests/load-native_android_eu@10 --volumes-from 5c894538586c4a19e2a60ca784403fbfda24cc75781a52ea8ae54028fecbe5ff -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** maven:3.5.2 cat $ docker top 901d717402c013afccae3074ec7e46c6ec70ce2e66f3e7e773ba9015d58c3cfa -eo pid,comm [Pipeline] // withDockerContainer [spinning wheel here]
I notice that `[Pipeline] // withDockerContainer` seems out of place here: normally it doesn't occur until much later.
Thread dump:
Thread #6
	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:129)
	at org.jenkinsci.plugins.docker.workflow.Docker.node(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:66)
	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:123)
	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:51)
	at DSL.stage(Native Method)
	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:44)
	at DSL.node(running on )
	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:36)
	at TestRunner.call(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:17)
	at WorkflowScript.run(WorkflowScript:8)
The pipeline script itself runs through a pipeline library script. Here's what triggers it:
#!groovy
@Library('TestRunner') _

def test = {
    sh "mvn -q clean test -DthreadCount=${env.PARALLEL_TESTS ?: 5} -Dtest=${env.TESTS}"
}

TestRunner {
    steps = test
}
TestRunner has a bunch of code for flexibility but essentially runs something like:
node {
    // checkout
    stage("test") {
        docker.image("maven").inside {
            steps()
        }
    }
}
I can provide more detail if needed.
My theory about the issue:
Jenkins hits a `Text file busy` error that comes from the `durable-task` plugin, which is used by the `Pipeline: Nodes and Processes` plugin.
The error is easy to reproduce by running any command, e.g. `sh "echo 'Hello'"`, in any Jenkins job many times in parallel.
The hang occurs only when the container is started through the Docker plugin (and the container stays alive afterwards). When we run `docker` manually with `sh 'docker ...'`, the build never hangs, although the `Text file busy` error still appears because we run many jobs at the same time.
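For illustration, running the container manually from `sh` could look roughly like the sketch below. This is not the TestRunner code: the image and the Maven cache mount are taken from the log above, and it assumes the workspace path is visible to the Docker daemon under the same path (when Jenkins itself runs in a container, as in the log, `--volumes-from` with the Jenkins container ID would be needed instead of the workspace mount).

```groovy
node {
    // checkout
    stage('test') {
        // Assumption: start the container directly instead of using the Docker plugin.
        // $WORKSPACE is expanded by the shell, not by Groovy (single-quoted string).
        sh '''
            docker run --rm \
                -v /root/.m2:/root/.m2 \
                -v "$WORKSPACE":"$WORKSPACE" \
                -w "$WORKSPACE" \
                maven:3.5.2 \
                mvn -q clean test
        '''
    }
}
```

With `--rm` the container is removed when Maven exits, so nothing stays alive after the step finishes.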
To reproduce the error, please follow these steps:
1. Run Jenkins (ver. 2.89.4) locally: `docker run -p 8080:8080 -p 50000:50000 jenkins/jenkins:lts`
2. Install just the recommended plugins on it
3. Create new pipeline projects called `hello` and `hello2`
4. Put this code in `hello`:
node {
    sh "echo 'hello'"
}
and this in `hello2`:
node {
    sh "echo 'hello'"
    sleep(2)
}
5. Create a new pipeline project called `runner` and put this inside:
6. Go to the Jenkins system configuration and change the number of executors to `50`
7. Try to run `runner` once; if you get a sandbox exception, go to In-process Script Approval on the Manage Jenkins page and approve all commands
8. Run `runner` again and you will see that some of the `hello` and `hello2` builds fail with a `Text file busy` error
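The `runner` script from step 5 is not included above. As a rough idea of what such a job could look like, here is a minimal sketch that assumes it simply triggers `hello` and `hello2` many times in parallel using the `build` step. The branch count and branch names are illustrative, and this is not the original script.

```groovy
// Hypothetical runner: fan out 50 parallel branches, alternating between the two jobs.
def branches = [:]
for (int i = 0; i < 50; i++) {
    // Declared inside the loop so each closure captures its own job name.
    def jobName = (i % 2 == 0) ? 'hello' : 'hello2'
    branches['run-' + i] = {
        // propagate: false keeps one failed downstream build from failing the runner.
        build job: jobName, propagate: false
    }
}
parallel branches
```

Note that Jenkins may merge identical queued builds of the same job, so the number of builds that actually start can be lower than the number of branches.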
The logs you get afterwards: