
Wrapper process leaves zombie when no init process present

      The merge of PR-98 moved the wrapper process to the background to allow the launching process to quickly exit. However, that very act will orphan the wrapper process. This is only a problem in environments where there is no init process (e.g. docker containers that are run with no --init flag).
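      For illustration, here is a minimal shell sketch of the pattern described above (this is not the plugin's actual wrapper script; the commands are stand-ins):

          # Stand-in "launcher" that backgrounds a stand-in "wrapper" and exits
          # immediately; this is the behavior introduced by PR-98.
          sh -c '
            ( sleep 2; echo "wrapper: script finished" ) &   # backgrounded wrapper
            exit 0                                           # launcher returns right away
          '
          # The wrapper is now orphaned and reparented to PID 1. On a normal host, init
          # reaps it when it exits; in a container started without an init process
          # (no --init), nothing calls wait(), so it lingers as a zombie (<defunct>).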

      Unit tests did not discover this bug due to a race between when the last ps was called and when the wrapper process exited. If another ps is called after the test detects that the script has finished running, the zombie state of the wrapper process is revealed.
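      For reference, this is the kind of follow-up check that exposes the state (illustrative only; the real unit test uses its own harness and assertions):

          # Run after the script is observed to have finished: a procps-style ps
          # lists any zombie entries, which show a "Z" in the STAT column.
          ps -eo pid,ppid,stat,args | awk '$3 ~ /^Z/'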

      I'm not sure how much of an issue this really is as there are numerous solutions on enabling zombie-reaping for containers, but as there is an explicit check for zombies in the unit tests, it seemed worth mentioning.

          [JENKINS-58656] Wrapper process leaves zombie when no init process present

          Carroll Chiou created issue -
          Carroll Chiou made changes -
          Description edited
          Devin Nusbaum made changes -
          Labels New: pipeline
          Jesse Glick made changes -
          Link New: This issue is caused by JENKINS-58290

          Jesse Glick added a comment - edited

          Adding to kubernetes plugin as it is important to check whether there is a practical impact on a Kubernetes node. Does something in K8s itself reap zombies? Can we reproduce a PID exhaustion error by repeatedly running brief sh steps? The defense for command-launcher + jenkins/slave is documented (just use docker run --init) if not enforced at runtime, but it is unknown at this time whether this affects a typical Kubernetes pod using jenkins/jnlp-slave.
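          For concreteness, the documented defense looks like the following (the image name comes from the discussion above; the remaining arguments are whatever the agent normally takes and are elided here):

              # --init makes Docker inject a small init (tini) as PID 1 inside the
              # container, so orphaned wrapper processes are reaped instead of
              # accumulating as zombies.
              docker run --init jenkins/slave <usual agent arguments>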

          Jesse Glick made changes -
          Component/s New: kubernetes-plugin [ 20639 ]
          Jesse Glick made changes -
          Labels Original: pipeline New: pipeline regression

          Carroll Chiou added a comment - edited

          It looks like if you enable PID namespace sharing, the pause container will handle zombie reaping (Kubernetes 1.7+ and Docker 1.13.1+). Otherwise, each container will have to handle zombie reaping on its own.

          https://www.ianlewis.org/en/almighty-pause-container
          https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-pid-namespace.md

          Update: ran a quick test to confirm this does work (see the attached Jenkinsfile and console output):
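          For anyone trying to reproduce this, a rough way to verify the setting and its effect on a running agent pod (sketch only: the pod name is a placeholder, "jnlp" is the plugin's default agent container name, and it assumes ps is available in the image):

              # Is PID namespace sharing enabled for this pod? Prints "true" when the
              # pod spec sets shareProcessNamespace.
              kubectl get pod <agent-pod> -o jsonpath='{.spec.shareProcessNamespace}'

              # With sharing enabled, the pause container is PID 1 inside every container
              # and reaps orphans, so no "Z" (defunct) entries should linger:
              kubectl exec <agent-pod> -c jnlp -- ps -eo pid,stat,comm | awk '$2 ~ /^Z/'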


          Jesse Glick added a comment -

          But it seems that PID namespace sharing is off by default, so most users with Jenkins running in a K8s cluster would not be able to rely on that.

          As per this comment it seems that to avoid a memory leak for kubernetes plugin users, we would need to patch docker-jnlp-slave to run Tini (or equivalent). But if I understand correctly, that would only help for subprocesses of the default agent container, not for other containers listed in the pod template and used via the container step: to fix these, we would need to ensure that the command run via (the API equivalent of) kubectl exec by ContainerExecDecorator waits for all its children. Right?

          Is there a known easy way to check for this condition in a realistic cluster, say GKE? Run some straightforward Pipeline build (using podTemplate + node + container + sh) a bunch of times, and then somehow get a root shell into the raw node and scan for zombie processes?
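          One low-tech way to do that scan, assuming you can get a shell on the node (the access method is illustrative; nothing here has been measured):

              # From a root shell on the node (e.g. SSH, or a privileged debug pod with
              # host PID visibility), list zombie processes and their parents:
              ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

              # A quick count is often enough to see whether zombies accumulate across
              # repeated sh steps:
              ps -eo stat | grep -c '^Z'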


          Carroll Chiou added a comment - edited

          Yes, just to clarify, I ran my Jenkinsfile on a Jenkins instance that was deployed to GKE. The Jenkinsfile has two stages, with each stage running a different pod template. The only difference between the two pod templates is that I add shareProcessNamespace: true to the last stage. You have to look a bit carefully, but in the ps output for the first stage you will see a zombie process, whereas in the second stage there is no zombie process.
          This instance is running my latest version of `durable-task` from PR-106. I can also confirm that the behavior is the same with the latest version on `master`.

          I only need to run sh once and wait a bit to turn up a zombie. durable-task is guaranteed to create a zombie every time it is executed due to the background process requirement. This only happens within the container itself, so once the container goes away, so do the zombies. My understanding of zombie processes is that the only resource they consume is their entry in the process table. So I guess if you have a long-running container that is executing a serious number of shell steps, then you can run into trouble? For reference, I looked at /proc/sys/kernel/pid_max for the jenkins/jnlp-slave image and got 99,999. Apparently on 64-bit systems pid_max can be configured up to about 4 million (2^22) entries. And this is all for just one container.
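          For reference, the figures above can be checked from inside the agent container (standard /proc paths; the count command assumes ps is present in the image):

              cat /proc/sys/kernel/pid_max      # 99,999 was the value observed in jenkins/jnlp-slave
              ps -eo stat,comm | grep -c '^Z'   # each completed sh step leaves one more zombie
                                                # until the container itself exits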


            Assignee: Unassigned
            Reporter: Carroll Chiou
            Votes: 2
            Watchers: 15
