[JENKINS-46969] Docker container closes prematurely

Type: Bug
Resolution: Duplicate
Priority: Major
Component/s: docker-workflow-plugin
Labels:
None
Environment:
Jenkins 2.46.3 / 2.60.2
Docker Workflow plugin v1.10/1.12

I've been seeing intermittent failures in some of our Jenkins Pipeline builds when running build steps within Docker containers. The environments in question build a Docker image from a Dockerfile on the fly, then run the build steps within an instance of that image using the docker.inside() method. From what I can tell, the operations run within the container occasionally cease execution before completion. If I sound unsure of the exact cause, it's because the output produced in the build logs is largely useless. Below is an example from one of my trial cases:

10:13:54 [ksp_delme] Running shell script
10:13:54 + for i in '{1..24}'
10:13:54 + sleep 1200
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
10:16:07 $ docker stop --time=1 d8d8049ea42e9b7ee12a6363ea1a31439c18d0f9f3ea0068d605eb77562e4208
[Pipeline] // withDockerContainer
[Pipeline] }
[Pipeline] // timestamps
[Pipeline] }
ERROR: script returned exit code -1

As the timestamps in the output show, a container is launched and an "sh 'sleep'" step runs inside it. Although the sleep should have lasted 20 minutes, the container exited about 2 minutes later, and the log indicates that the "script" exited with a return code of "-1".
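To make the "about 2 minutes" figure concrete, here is a small sketch that computes the gap between the two timestamps in the log excerpt. The helper name to_seconds is made up for illustration and is not part of Jenkins or Docker.

```shell
#!/bin/sh
# Hypothetical helper: convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
    old_ifs=$IFS
    IFS=:
    set -- $1            # split "HH:MM:SS" into $1 $2 $3
    IFS=$old_ifs
    # Strip one leading zero so values like "08" are not parsed as octal.
    h=${1#0}; m=${2#0}; s=${3#0}
    echo $(( h * 3600 + m * 60 + s ))
}

start=$(to_seconds 10:13:54)   # "sleep 1200" is started
stop=$(to_seconds 10:16:07)    # "docker stop" appears in the log
elapsed=$(( stop - start ))
echo "elapsed: ${elapsed}s of an expected 1200s"   # prints "elapsed: 133s of an expected 1200s"
```

So the container was stopped 133 seconds into a sleep that should have taken 1200 seconds.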

I can assure you that the sleep operation did not error out with a -1, and even if it had, the error value reported by Jenkins would have been 255, since the negative integer appears to be converted to an unsigned 8-bit value. However, I did discover that if I log in to our build box, rerun this test case, and manually force-terminate the container running the build, I get the exact same error code and result in the log. So I'm guessing that something somewhere is causing the container to terminate prematurely.
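The 255 point can be checked with a short sketch, assuming a POSIX shell. A process's exit status is reported to its parent as a value from 0 to 255, so a script cannot itself hand back -1. The function name to_wait_status is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: keep only the low 8 bits of a status value,
# which is what a parent process sees for a child's exit status.
to_wait_status() {
    echo $(( $1 & 255 ))
}

to_wait_status -1    # prints 255
to_wait_status 0     # prints 0

# A subshell exiting with 255 is reported as 255, never as -1.
( exit 255 )
status=$?
echo "subshell status: $status"   # prints "subshell status: 255"
```

This suggests the -1 in the log is produced by Jenkins itself rather than by the script, which fits the manual force-termination test described above.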

To make matters worse, I can run and re-run this test case dozens of times before I get the error, so it is very difficult to reproduce. Based on my preliminary review of our production systems, I believe the load on the affected agents plays a part, although I'm not entirely sure how. Running many parallel builds on the same agent, all using Docker containers, may be a factor, but I'm not certain.

Any help debugging this problem further would be appreciated.

duplicates

JENKINS-35370 Workflow shell step ERROR: script returned exit code -1

Reopened

is related to

JENKINS-47822 docker pipeline finish beforehand when tcp socket is used

Closed

relates to

JENKINS-42322 Docker rm/stop/... commands killed by the timeout, failing builds

Resolved

JENKINS-40101 Different behavior between debian container using docker.inside

Open

JENKINS-35370 Workflow shell step ERROR: script returned exit code -1

Reopened

JENKINS-34289 docker.image.inside fails unexpectedly with Jenkinsfile

Resolved

JENKINS-42166 ProcessLiveness.workingLaunchers heuristic is flaky

Resolved

Kevin Phillips created issue - 2017-09-19 18:28

Kevin Phillips made changes - 2017-09-19 18:28

Link

New: This issue relates to ~~JENKINS-42322~~ [ ~~JENKINS-42322~~ ]

Kevin Phillips made changes - 2017-09-19 18:48

Description

Original: An earlier revision of the description above, with the log excerpt run together on fewer lines and without the paragraph on reproducibility and agent load.

New: The description as shown at the top of this issue.

Kevin Phillips added a comment - 2017-09-19 18:58

For reference, the Pipeline DSL code I was using for my test above looks like this:

node () {
    catchError {
        timestamps {
            def build_env
            stage ("Init") {
                // git checkout ...
                build_env = docker.build("build_env", './docker')
            }
            build_env.inside {
                stage ("Test") {
                    sh 'for i in {1..24}; do sleep 1200; echo "Still Running"; done'
                }
            }
        }
    }
}
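A side note on the test script: the trace line "+ for i in '{1..24}'" in the log suggests that the shell running the step did not expand the brace range, so the loop body would run once rather than 24 times. That is my reading of the log, and it does not explain the early exit, since the container stopped well inside the first sleep. A minimal sketch of the difference, with function names invented for illustration:

```shell
#!/bin/sh
# Count iterations when the brace word is taken literally, which is how
# the trace line in the log shows it.
count_literal() {
    n=0
    for i in '{1..24}'; do n=$((n + 1)); done
    echo "$n"
}

# Count iterations with a counter loop that works in any POSIX shell.
count_posix() {
    n=0
    i=1
    while [ "$i" -le 24 ]; do
        n=$((n + 1))
        i=$((i + 1))
    done
    echo "$n"
}

count_literal   # prints 1
count_posix     # prints 24
```

If 24 iterations are intended, a counter loop like the second one, or starting the script with a bash shebang line, would be one way to get them.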

Kevin Phillips made changes - 2017-09-20 11:31

Link

New: This issue relates to JENKINS-40101 [ JENKINS-40101 ]

Kevin Phillips made changes - 2017-09-20 11:33

Link

New: This issue relates to JENKINS-35370 [ JENKINS-35370 ]

Kevin Phillips made changes - 2017-09-20 11:45

Link

New: This issue relates to ~~JENKINS-42166~~ [ ~~JENKINS-42166~~ ]

Kevin Phillips made changes - 2017-09-20 11:46

Link

New: This issue relates to ~~JENKINS-34289~~ [ ~~JENKINS-34289~~ ]

Kevin Phillips made changes - 2017-09-20 12:33

Link

New: This issue duplicates JENKINS-35370 [ JENKINS-35370 ]

Kevin Phillips made changes - 2017-09-20 12:34

Resolution		New: Duplicate [ 3 ]
Status	Original: Open [ 1 ]	New: Resolved [ 5 ]

Assignee: Unassigned
Reporter: Kevin Phillips
Votes: 0
Watchers: 1
Created: 2017-09-19 18:28
Updated: 2017-11-03 20:36
Resolved: 2017-09-20 12:34
