JENKINS-47615

failing container in pod triggers multiple restarts

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Environment: Jenkins 2.73.2, Kubernetes plugin 1.1

      We saw a regression when upgrading to v. 1.1 of the Kubernetes plugin (plus Jenkins 2.73.2): a crashing container (not the slave) in a pod retriggers a restart of the pod every ten seconds (presumably related to activeDeadlineSeconds).

      The job had to be stopped manually to keep this pod creation from spiralling out of control, and the crashed pods had to be deleted manually.

          [JENKINS-47615] failing container in pod triggers multiple restarts

          Robin Bartholdson created issue -

          Carlos Sanchez added a comment -

          Can you provide a pipeline example?

          J Knurek added a comment -

          here's an example:

          podTemplate(label: 'test-fail', containers: [
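              // intentionally broken command: the 'golang' container exits immediately, so the pod ends up in Error state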
              containerTemplate(name: 'golang', image: 'golang:1.9', command: 'kill pod')
          ]) {
              node('test-fail') {
                  container('golang') {
                      stage('prove failure') {
                          sh "ls"
                      }
                  }
              }
          }


          J Knurek added a comment -

          these repeating failed pods use up the node resources and block actual jobs from being executed, and the Jenkins UI doesn't show any indication of what's going on (aka, this has been impacting our developers for a few days now)

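          Since the UI gives no indication, one way to see what is happening is to query the cluster from the Script Console. A minimal sketch, assuming the cloud is named 'kubernetes' in the Jenkins configuration and that agent pods carry the plugin's jenkins=slave label (both are assumptions about the local setup):

          import jenkins.model.Jenkins
          import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

          // Assumption: the Kubernetes cloud is named 'kubernetes' in Manage Jenkins
          def cloud = (KubernetesCloud) Jenkins.instance.clouds.getByName('kubernetes')
          def client = cloud.connect()

          // Assumption: agent pods are labelled jenkins=slave by the plugin
          client.pods().withLabel('jenkins', 'slave').list().items
                .findAll { it.status?.phase == 'Failed' }
                .each { println "${it.metadata.name}: ${it.status.phase}" }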

          Robin Bartholdson added a comment -

          csanchez — We are still seeing this a lot.

          Any idea how this could be fixed?

          Carlos Sanchez added a comment -

          Looks like it works as designed, what's your expectation?

          • You are launching a pod and one of the containers crashes
          • Pod will end in error state
          • jenkins will try to start more pods until reaching container cap

          To me it just means your pod template needs fixing
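          For reference, the cap mentioned above is the cloud-level Container Cap; lowering it bounds how many pods a crash-looping template can pile up before provisioning stops. A minimal Script Console sketch, again assuming a cloud named 'kubernetes' and that setContainerCapStr is the setter behind the UI field:

          import jenkins.model.Jenkins
          import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

          def cloud = (KubernetesCloud) Jenkins.instance.clouds.getByName('kubernetes')
          cloud.setContainerCapStr('10')   // same setting as the 'Container Cap' field in the UI
          Jenkins.instance.save()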

          Robin Bartholdson added a comment -

          The pods can depend on external systems that might be down for the moment and thus fail, so no: it's definitely not only a pod template that needs fixing.

          I think there are two things that should be done:

          • Like a k8s Job, there should be a configurable RestartPolicy as well as a BackoffLimit.
          • The failing slave zombies need to be cleaned up somehow, as they wreak havoc and eat up a lot of resources (see the sketch after this comment).
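          A minimal cleanup sketch for the second point, using core Jenkins API plus the plugin's agent class: remove Kubernetes agents whose computers are offline so they stop occupying the node list. This only cleans the Jenkins side; the errored pods themselves would still need a manual kubectl delete.

          import jenkins.model.Jenkins
          import org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave

          // Remove Kubernetes agents that are offline (e.g. their pod crashed and never connected).
          Jenkins.instance.nodes
                 .findAll { it instanceof KubernetesSlave && it.toComputer()?.offline }
                 .each { Jenkins.instance.removeNode(it) }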

          Carlos Sanchez added a comment -

          RestartPolicy is set to Never because Jenkins will start new pods if they fail. I'm not sure if 'Always' would be conflicting, as both k8s and Jenkins would try to start pods; something that would need some testing.

          Errored pods are left around for inspection and to ensure that no more are created once the cap is reached. Errored pods do not consume resources. If they got deleted, more resources would be used, as Jenkins would continue to spin up new ones.

          J Knurek added a comment -

          > Carlos Sanchez added a comment - 2017-12-22 10:32
          > Looks like it works as designed

          That's seriously your position on this bug!?!?!

          Something as simple as a single commit from a developer in a feature branch, or an external system being unreachable, can take down Jenkins for an entire organization. This defeats all the benefits of having ephemeral, self-contained slaves to run jobs. What's more, there is no visible indication in the Jenkins UI of what is happening, so the developers who are impacted have no way to know about it or how to stop it.

          Elaborate monitoring/alerting can help the ops team with reaction time, but doesn't prevent the issue from continuing.

          If you don't want to make RestartPolicy and BackoffLimit configurable, then maybe addressing how the container cap works can at least limit the impact on the entire development team.

          Carlos Sanchez added a comment -

          > That's seriously your position on this bug!?!?!

          As I said above, RestartPolicy is set to Never because Jenkins will start new pods if they fail. I'm not sure if 'Always' would be conflicting, as both k8s and Jenkins would try to start pods; something that would need some testing.

          Have you tried that?

          > addressing how the container cap works

          What do you suggest?

            Assignee: Unassigned
            Reporter: Robin Bartholdson
            Votes: 4
            Watchers: 12