Jenkins / JENKINS-47615

failing container in pod triggers multiple restarts

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Environment: Jenkins 2.73.2, Kubernetes plugin 1.1

      We saw a regression when upgrading to v1.1 of the Kubernetes plugin (with Jenkins 2.73.2): a crashing container (not the slave container) in a pod retriggers a restart of the pod every ten seconds (our guess is that this is related to activeDeadlineSeconds).

      The job had to be stopped manually to stop this pod creation from spiralling out of control, and the crashed pods had to be deleted manually.
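
      For reference, a minimal sketch of the scenario described above as it could be expressed in a pipeline. The label, image and deadline are hypothetical, not taken from this report; the point is only that a side container which exits immediately makes the pod fail while the agent container itself is healthy, so the plugin keeps provisioning replacement pods.

{code:groovy}
// Hypothetical reproduction sketch: the 'crasher' side container exits at once
// while the agent (jnlp) container is fine. With the plugin's default
// RestartPolicy of Never, the whole pod fails and the plugin provisions a new one.
podTemplate(
    label: 'crash-demo',                   // hypothetical label
    activeDeadlineSeconds: 120,            // pod-level deadline, as suspected above
    containers: [
        containerTemplate(name: 'crasher', // side container that crashes immediately
                          image: 'busybox',
                          command: 'sh -c "exit 1"')
    ]
) {
    node('crash-demo') {
        sh 'echo the agent container itself is fine'
    }
}
{code}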


          J Knurek added a comment -

          > Carlos Sanchez added a comment - 2017-12-22 10:32
          > Looks like it works as designed

          That's seriously your position on this bug!?!?!

          Something as simple as a single commit from a developer on a feature branch, or an external system being unreachable, can take down Jenkins for an entire organization. This defeats all the benefits of having ephemeral, self-contained slaves to run jobs. What's more, there is no visible indication in the Jenkins UI of what is happening, so the developers who are impacted have no way to know about it or how to stop it.

          Elaborate monitoring/alerting can help the ops team with reaction time, but doesn't prevent the issue from continuing. 

          If you don't want to make RestartPolicy and BackoffLimit configurable, then maybe addressing how the container cap works could at least limit the impact on the entire development team.


          Carlos Sanchez added a comment -

          > That's seriously your position on this bug!?!?!

          As I said above, RestartPolicy is set to Never because Jenkins will start new pods if they fail. I'm not sure whether 'Always' would conflict, since both Kubernetes and Jenkins would then try to start pods; that would need some testing.

          Have you tried that?

          > addressing how the container cap works

          What do you suggest?
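
          A sketch of that experiment, assuming a plugin version new enough to accept the raw yaml parameter on podTemplate (the label here is hypothetical). Whether the plugin's own generated values win over the merged YAML, and whether Always then fights with the plugin's re-provisioning, is exactly what would need testing.

{code:groovy}
// Sketch only: let Kubernetes restart the crashing side container in place
// instead of letting the whole pod fail. The plugin's default is Never;
// whether OnFailure/Always conflicts with the plugin re-provisioning pods
// is untested here.
podTemplate(
    label: 'restart-test',    // hypothetical label
    yaml: '''
        apiVersion: v1
        kind: Pod
        spec:
          restartPolicy: OnFailure    # or Always
    '''.stripIndent()
) {
    node('restart-test') {
        sh 'echo testing restart behaviour'
    }
}
{code}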


          Witold Konior added a comment - edited

          csanchez - I think that besides the global container cap there should be a per-job cap, or maybe even only a per-job cap. This would prevent a single job from wreaking havoc on the whole Jenkins instance. One pod per job is usually what we can afford.


          Carlos Sanchez added a comment -

          There is no simple way to set a per-job setting (cap or anything else) because agent provisioning runs as a totally separate service from jobs. Agents are created when jobs need them, but they are not tied together in any way other than by labels. There is no 1-to-1 relationship.

          There was some work done in https://wiki.jenkins.io/display/JENKINS/One-Shot+Executor but it is not ready AFAIK.
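
          For completeness, the cap referred to above is the cloud-wide container cap on the Kubernetes cloud configuration; a rough script-console sketch of tightening it is below. The cloud name 'kubernetes' and the setter are assumptions about the plugin's KubernetesCloud class, not checked against a particular release.

{code:groovy}
// Rough script-console sketch: tighten the cloud-wide container cap, the only
// cap that exists today (there is no per-job equivalent).
import jenkins.model.Jenkins
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

def cloud = Jenkins.instance.clouds.getByName('kubernetes') // hypothetical cloud name
if (cloud instanceof KubernetesCloud) {
    cloud.setContainerCapStr('10')   // limit concurrent agent containers across all jobs
    Jenkins.instance.save()
}
{code}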


          ASHOK MOHANTY added a comment -

          Any plans to fix this issue?!


          Julian Wilkinson added a comment - edited

          This is still an issue. Are there any plans to fix this? In a system where a developer isn't intimately acquainted with the Jenkins Kubernetes Plugin this could prove to be a huge headache.


          Chintan added a comment - edited

          Is there a way to fail the build after a certain number of retries, together with the logs of the container that caused the failure and restart?
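
          Not a plugin feature as far as I know, but a pipeline-level workaround sketch: bound the retries yourself with the core retry step and dump the side container's log on failure via the plugin's containerLog step. Labels, images and the build step are hypothetical. This bounds what the job does; it does not by itself stop the plugin from re-provisioning pods that fail before the node ever starts.

{code:groovy}
// Workaround sketch, not a plugin feature: bound retries in the pipeline and
// surface the failing container's log. Names and images are hypothetical.
retry(3) {
    podTemplate(label: 'bounded-demo', containers: [
        containerTemplate(name: 'app', image: 'busybox', command: 'cat', ttyEnabled: true)
    ]) {
        node('bounded-demo') {
            try {
                container('app') {
                    sh './run-tests.sh'   // hypothetical build step
                }
            } catch (err) {
                // kubernetes-plugin step that fetches a container's log from the current pod
                echo containerLog(name: 'app', returnLog: true, tailingLines: 100)
                throw err
            }
        }
    }
}
{code}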


          Mike Nau added a comment -

          Upvoting this as well. This is something we consistently struggle with. A user configures an invalid pod template (for whatever reason) and the Jenkins Kubernetes plugin keeps trying to spin up the pod over and over again. Ideally there would be some way to cap the number of retries or something similar.


          Pierson Yieh added a comment - edited

          In response to mikenau_intuit's comment, it looks like the master branch of the plugin has resolved this issue with the pod termination here: https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pod/retention/Reaper.java#L243-L245

          vlatombe, is there an expected release date for this?


          Mark Waite added a comment -

          pyieh, you can use the incremental build of the kubernetes plugin now (the last successful build). That will let you confirm that it works the way you want, and you can help others by reporting your result.


            Assignee: Unassigned
            Reporter: Robin Bartholdson (robin_b)
            Votes: 4
            Watchers: 12