JENKINS-47615

failing container in pod triggers multiple restarts

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Environment: Jenkins 2.73.2, Kubernetes plugin 1.1

      We saw a regression when upgrading to v. 1.1 of the Kubernetes plugin (plus Jenkins 2.73.2): a crashing container (not the slave) in a pod retriggers a restart of the pod every ten seconds (presumably related to activeDeadlineSeconds).

      The job had to be stopped manually to keep this pod creation from spiralling out of control, and the crashed pods had to be deleted manually.

          [JENKINS-47615] failing container in pod triggers multiple restarts

          Robin Bartholdson created issue -

          Carlos Sanchez added a comment -

          Can you provide a pipeline example?

          J Knurek added a comment -

          here's an example:

          podTemplate(label: 'test-fail', containers: [
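              // intentionally broken command: the 'golang' container exits immediately, so the pod ends up in Error state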
              containerTemplate(name: 'golang', image: 'golang:1.9', command: 'kill pod')
          ]) {
              node('test-fail') {
                  container('golang') {
                      stage('prove failure') {
                          sh "ls"
                      }
                  }
              }
          }


          J Knurek added a comment -

          these repeating failed pods use up the node resources and block actual jobs from being executed, and the Jenkins UI doesn't show any indication of what's going on (aka, this has been impacting our developers for a few days now)

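          Since the UI gives no indication, one way to see what is happening is to query the cluster from the Script Console. A minimal sketch, assuming the cloud is named 'kubernetes' in the Jenkins configuration and that agent pods carry the plugin's jenkins=slave label (both are assumptions about the local setup):

          import jenkins.model.Jenkins
          import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

          // Assumption: the Kubernetes cloud is named 'kubernetes' in Manage Jenkins
          def cloud = (KubernetesCloud) Jenkins.instance.clouds.getByName('kubernetes')
          def client = cloud.connect()

          // Assumption: agent pods are labelled jenkins=slave by the plugin
          client.pods().withLabel('jenkins', 'slave').list().items
                .findAll { it.status?.phase == 'Failed' }
                .each { println "${it.metadata.name}: ${it.status.phase}" }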

          Robin Bartholdson added a comment -

          csanchez — We are still seeing this a lot.

          Any idea how this could be fixed?

          Carlos Sanchez added a comment -

          Looks like it works as designed, what's your expectation?

          • You are launching a pod and one of the containers crashes
          • Pod will end in error state
          • jenkins will try to start more pods until reaching container cap

          To me it just means your pod template needs fixing
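          For reference, the cap mentioned above is the cloud-level Container Cap; lowering it bounds how many pods a crash-looping template can pile up before provisioning stops. A minimal Script Console sketch, again assuming a cloud named 'kubernetes' and that setContainerCapStr is the setter behind the UI field:

          import jenkins.model.Jenkins
          import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

          def cloud = (KubernetesCloud) Jenkins.instance.clouds.getByName('kubernetes')
          cloud.setContainerCapStr('10')   // same setting as the 'Container Cap' field in the UI
          Jenkins.instance.save()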

          Robin Bartholdson added a comment -

          The pods can depend on external systems that might be down for the moment and thus fail, so no: it's definitely not only a pod template that needs fixing.

          I think there are two things that should be done:

          • Like a k8s Job, there should be a configurable RestartPolicy as well as a BackoffLimit.
          • The failing slave zombies need to be cleaned up somehow, as they wreak havoc and eat up a lot of resources (see the sketch after this comment).
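          A minimal cleanup sketch for the second point, using core Jenkins API plus the plugin's agent class: remove Kubernetes agents whose computers are offline so they stop occupying the node list. This only cleans the Jenkins side; the errored pods themselves would still need a manual kubectl delete.

          import jenkins.model.Jenkins
          import org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave

          // Remove Kubernetes agents that are offline (e.g. their pod crashed and never connected).
          Jenkins.instance.nodes
                 .findAll { it instanceof KubernetesSlave && it.toComputer()?.offline }
                 .each { Jenkins.instance.removeNode(it) }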

          Carlos Sanchez added a comment -

          RestartPolicy is set to Never because Jenkins will start new pods if they fail. I'm not sure if 'Always' would be conflicting, as both k8s and Jenkins would try to start pods; something that would need some testing.

          Errored pods are left around for inspection and to ensure that no more are created once the cap is reached. Errored pods do not consume resources. If they got deleted, more resources would be used, as Jenkins would continue to spin up new ones.

          J Knurek added a comment -

          > Carlos Sanchez added a comment - 2017-12-22 10:32
          > Looks like it works as designed

          That's seriously your position on this bug!?!?!

          Something as simple as a single commit from a developer in a feature branch, or an external system being unreachable, can take down Jenkins for an entire organization. This defeats all the benefits of having ephemeral, self-contained slaves to run jobs. What's more, there is no visible indication in the Jenkins UI of what is happening, so the developers who are impacted have no way to know about it or how to stop it.

          Elaborate monitoring/alerting can help the ops team with reaction time, but doesn't prevent the issue from continuing.

          If you don't want to make RestartPolicy and BackoffLimit configurable, then maybe addressing how the container cap works can at least limit the impact on the entire development team.

          Carlos Sanchez added a comment -

          > That's seriously your position on this bug!?!?!

          As I said above, RestartPolicy is set to Never because Jenkins will start new pods if they fail. I'm not sure if 'Always' would be conflicting, as both k8s and Jenkins would try to start pods; something that would need some testing.

          Have you tried that?

          > addressing how the container cap works

          What do you suggest?

            Assignee: Unassigned
            Reporter: Robin Bartholdson
            Votes: 4
            Watchers: 12