Jenkins / JENKINS-58656

Wrapper process leaves zombie when no init process present


Details

    Description

      The merge of PR-98 moved the wrapper process to the background so that the launching process can exit quickly. However, that very act orphans the wrapper process. This is only a problem in environments where there is no init process (e.g. Docker containers run without the --init flag).
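To illustrate the mechanism, here is a minimal Python sketch (Linux-only, and not the wrapper itself): a child that exits before anyone calls wait() shows up in state Z. In a container with no init process, the orphaned wrapper is reparented to PID 1, and if PID 1 never reaps, the zombie persists indefinitely.

```python
import os
import time

# Fork a child that terminates immediately.
pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits right away

time.sleep(0.2)  # parent deliberately does not wait() yet

# On Linux, the field after the comm in /proc/<pid>/stat is the
# process state; 'Z' means the exit status is waiting to be reaped.
with open(f"/proc/{pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)  # Z

os.waitpid(pid, 0)  # reaping (what an init process would do) clears it
```

Once `waitpid` runs, the kernel discards the zombie entry; an init such as tini does exactly this for every orphan it inherits.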

      Unit tests did not discover this bug due to a race condition between when the last ps was called and when the wrapper process exited. If another ps is called after the test detects that the script has finished running, the zombie state of the wrapper process is revealed.

      I'm not sure how much of an issue this really is, as there are numerous solutions for enabling zombie reaping in containers, but since there is an explicit check for zombies in the unit tests, it seemed worth mentioning.


          Activity

            We use the Kubernetes plugin with Kubernetes 1.15.3 on-premises. We have only around 10 sh steps per pod, but we invoke our build system via shell scripts, which then work through Makefiles and invoke bash for every step of the Makefiles. It sums up to around 27k defunct bash processes that are not getting reaped, and eventually we run into the error I mentioned above.

            stephankirsten Stephan Kirsten added a comment -
            jglick Jesse Glick added a comment -

            So reading between the lines, repeatedly running a build which has one sh step that runs a script that launches a thousand subprocesses should eventually result in an error. That is something that can be tested and, if true, worked around.
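That reproduction idea can be sketched in Python (a hedged stand-in for the sh step's script, assuming a Linux machine; N is an arbitrary small count here): spawn children without reaping them and count how many linger as zombies.

```python
import os
import time

N = 50  # stand-in for "a thousand subprocesses"
pids = []
for _ in range(N):
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # each child exits immediately
    pids.append(pid)

time.sleep(0.5)  # give every child time to terminate

def state(pid):
    # Field after the comm in /proc/<pid>/stat is the state character.
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]

# None of the children have been wait()ed on, so all are zombies.
zombies = sum(state(p) == "Z" for p in pids)
print(zombies)  # 50

for p in pids:
    os.waitpid(p, 0)  # clean up
```

Each zombie pins a PID, so repeating this without reaping eventually exhausts the PID space and new fork() calls fail with EAGAIN ("Resource temporarily unavailable").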


            I have been unable to reproduce the `Resource temporarily unavailable.` error when attempting to run pipelines that simulate the situation described.

            I created a cluster in GKE using the gcloud CLI: `gcloud container clusters create <cluster-name> --machine-type=n1-standard-2 --cluster-version=latest`. I installed CloudBees Core for Modern Platforms version 2.204.3.7 (the latest public release at the time I started testing) using Helm, used `kubectl get nodes` to find the names of the nodes and `gcloud beta compute ssh` to connect to the nodes via ssh, then ran `watch 'ps fauxwww | fgrep Z'` to watch for zombie processes on each node.
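The same zombie check can be scripted instead of eyeballing ps output; a minimal Python sketch (Linux-only, scanning /proc directly) that counts processes in state Z:

```python
import os

def count_zombies():
    """Count processes currently in the zombie (Z) state on this host."""
    count = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-PID entries like /proc/meminfo
        try:
            with open(f"/proc/{entry}/stat") as f:
                # Field after the comm in /proc/<pid>/stat is the state.
                if f.read().rsplit(")", 1)[1].split()[0] == "Z":
                    count += 1
        except OSError:
            pass  # process exited while we were scanning
    return count

print(count_zombies())
```

Run on the node (not inside the agent pod) this sees every zombie in the host PID namespace, the same population the `watch` command above reports.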

            Using the Groovy loop while (true) { sh 'sleep 1' } I was able to produce zombie processes on the node the build agent was assigned to. The build ran for 5 hours 17 minutes before exhausting the available process resources. After the processes were exhausted, the job exited with an error message that no processes were available. After the pod running the job exited, the zombie processes on the node were removed and the node continued to function.

            Using `while :; do /usr/bin/sleep .01; done` to generate subprocesses, I tested it as the direct parameter of an `sh` step in a pipeline using both the `jenkins/jnlp-slave` and `cloudbees/cloudbees-core-agent` images. Neither produced any zombie processes on the worker nodes of the Kubernetes cluster. To introduce another layer of subprocess, I also put that `while` line into a file and had the `sh` step execute that file, but it also did not produce any zombie processes on the worker nodes. Additionally, I made that while loop a step in a Makefile and executed it that way, which also did not produce any zombies on the nodes.

            kerogers Kenneth Rogers added a comment -

            I have been observing some issues on nodes. I have been using Amazon EKS for the cluster, and once in a while a node ends up either with a soft lockup or flips to NodeNotReady. I have tried a lot of troubleshooting, but so far nothing concrete has been figured out. I was working with AWS support, and they told me they had seen a couple of other cases reporting similar behavior with Jenkins pods. Another pattern I observed is that all the nodes which had issues had a very high number of zombie processes, at least 4000+. I still don't have conclusive evidence that the issue is due to zombie processes/Jenkins, but the patterns all indicate that there could be something in the Kubernetes plugin of Jenkins which may be causing the issue.

            Did any of you face the same issue?

            cshivashankar chetan shivashankar added a comment -
            ningjunwei junwei ning added a comment -

            I deployed bitnami/jenkins with Helm and hit this issue as well.


            People

              Assignee: Unassigned
              Reporter: carroll Carroll Chiou
              Votes: 2
              Watchers: 15
