Docker container closes prematurely


      I've been seeing intermittent build failures with some of our Jenkins Pipeline builds when running build steps within Docker containers. The environments in question build a Docker image from a Dockerfile on the fly, then run the build steps within an instance of that image using the docker.inside() method. From what I can tell, the operations run within the container occasionally cease execution before completion. If I sound unsure of the exact cause, it's because the output produced in the build logs is largely useless. Below is an example from one of my trial cases:

       
      10:13:54 [ksp_delme] Running shell script
      10:13:54 + for i in '{1..24}'

      10:13:54 + sleep 1200
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }

      10:16:07 $ docker stop --time=1 d8d8049ea42e9b7ee12a6363ea1a31439c18d0f9f3ea0068d605eb77562e4208
      [Pipeline] // withDockerContainer
      [Pipeline] }
      [Pipeline] // timestamps
      [Pipeline] }
      ERROR: script returned exit code -1

      As the timestamps in the output show, a container gets launched and an "sh 'sleep 1200'" operation then runs within it. The sleep should have lasted 20 minutes, but the container was stopped a little over 2 minutes later (the "docker stop" at 10:16:07), and the log indicates that the "script" exited with a return code of "-1".
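      To make the gap concrete, here is a small sketch of the arithmetic behind the two timestamps in the log above. The helper name to_seconds is made up for this illustration and is not part of Jenkins or Docker; it only converts the HH:MM:SS values copied from the log into seconds.

```shell
# Hypothetical helper (not a Jenkins or Docker command): HH:MM:SS -> seconds.
to_seconds() {
    # Split on ':' and strip one leading zero from each field so that
    # values such as "08" are not rejected as invalid octal numbers.
    old_ifs=$IFS
    IFS=:
    set -- $1
    IFS=$old_ifs
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

start=$(to_seconds 10:13:54)   # "sleep 1200" begins
stop=$(to_seconds 10:16:07)    # "docker stop" is issued
elapsed=$(( stop - start ))

echo "elapsed: ${elapsed}s of 1200s requested"
```

      The sleep ran for only a small fraction of its requested duration, so it cannot have finished on its own.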

      I can assure you that the sleep operation did not error out with a -1, and even if it had, the error value reported by Jenkins would have been 255, since the negative integer appears to be converted to an unsigned 8-bit value. On the other hand, I did discover that if I log in to our build box, rerun this test case, and manually force-terminate the container running the build, I get the exact same error code and result in the log. So I'm guessing that something somewhere is causing the container to terminate prematurely.
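      As a rough illustration of the exit code reasoning above, the following sketch shows what a parent shell sees when a child exits with -1 and when a child is killed outright. It assumes bash is installed (some other shells reject a negative exit argument), and it says nothing about where Jenkins itself gets its "-1" from; it only shows that a plain shell never reports a literal -1 in either case.

```shell
# Assumption: bash is available; other shells may reject "exit -1".
status=0
bash -c 'exit -1' || status=$?
echo "exit -1 is seen by the parent as: $status"

# A child killed by SIGKILL (signal 9) is reported by the shell as 128 + 9.
killed=0
sh -c 'kill -9 $$' || killed=$?
echo "SIGKILL is seen by the parent as: $killed"
```

      On a typical Linux box this reports 255 for the first case and 137 for the second, which is consistent with the "-1" in the build log being produced somewhere other than the script itself.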

      To make matters worse, I can run and re-run this test case dozens of times before I get the error, so it is very difficult to reproduce. Based on a preliminary review of our production systems, I believe the load on the affected agents plays a part, although I'm not sure how. Running many parallel builds on the same agent, each in its own Docker container, may be a factor, but I'm not certain.

      Any help debugging this problem further would be appreciated.

            Assignee:
            Unassigned
            Reporter:
            Kevin Phillips
            Archiver:
            Jenkins Service Account
