Type: Improvement
Resolution: Unresolved
Priority: Minor
I am running Jenkins in Kubernetes with the Kubernetes plugin. Versions are as follows:
~ # kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:17:28Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:00:59Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Jenkins ver. 2.138.2 from https://hub.docker.com/r/jenkins/jenkins/
[centos@k8s-master-0 ~]$ sudo docker version
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 02:21:36 2017
OS/Arch: linux/amd64
Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 02:21:36 2017
OS/Arch: linux/amd64
Experimental: false
The majority of my builds run as expected and we run many builds per day. The podTemplate for my builds is:
podTemplate(cloud: 'k8s-houston', label: 'api-build',
  yaml: """
apiVersion: v1
kind: Pod
metadata:
  name: maven
spec:
  containers:
  - name: maven
    image: maven:3-jdk-8-alpine
    volumeMounts:
    - name: volume-0
      mountPath: /mvn/.m2nrepo
    command:
    - cat
    tty: true
    resources:
      requests:
        memory: "512Mi"
        cpu: "500m"
  securityContext:
    runAsUser: 10000
    fsGroup: 10000
""",
  containers: [
    containerTemplate(name: 'jnlp', image: 'jenkins/jnlp-slave:3.23-1-alpine', args: '${computer.jnlpmac} ${computer.name}', resourceRequestCpu: '250m', resourceRequestMemory: '512Mi'),
    containerTemplate(name: 'pmd', image: 'stash.trinet-devops.com:8443/pmd:pmd-bin-5.5.4', alwaysPullImage: false, ttyEnabled: true, command: 'cat'),
    containerTemplate(name: 'owasp-zap', image: 'stash.trinet-devops.com:8443/owasp-zap:2.7.0', ttyEnabled: true, command: 'cat'),
    containerTemplate(name: 'kubectl', image: 'lachlanevenson/k8s-kubectl:v1.8.7', ttyEnabled: true, command: 'cat'),
    containerTemplate(name: 'dind', image: 'docker:18.01.0-ce-dind', privileged: true, resourceRequestCpu: '20m', resourceRequestMemory: '512Mi'),
    containerTemplate(name: 'docker-cmds', image: 'docker:18.01.0-ce', ttyEnabled: true, command: 'cat', envVars: [envVar(key: 'DOCKER_HOST', value: 'tcp://localhost:2375')]),
  ],
  volumes: [
    persistentVolumeClaim(claimName: 'jenkins-pv-claim', mountPath: '/mvn/.m2nrepo'),
    emptyDirVolume(mountPath: '/var/lib/docker', memory: false)
  ]
)
However, sometimes a build Pod gets stuck in the Error state in Kubernetes:
~ # kubectl get pod -o wide
NAME                                  READY   STATUS    RESTARTS   AGE   IP               NODE
jenkins-deployment-7849487c9b-nlhln   2/2     Running   4          12d   10.233.92.12     k8s-node-hm-3
jenkins-slave-7tj0d-ckwbs             11/11   Running   0          31s   10.233.69.176    k8s-node-1
jenkins-slave-7tj0d-qn3s6             11/11   Running   0          2m    10.233.77.230    k8s-node-hm-2
jenkins-slave-gz4pw-2dnn5             6/7     Error     0          2d    10.233.123.239   k8s-node-hm-1
jenkins-slave-m825p-1hjt7             5/5     Running   0          1m    10.233.123.196   k8s-node-hm-1
jenkins-slave-r59w1-qs283             6/7     Error     0          6d    10.233.76.104    k8s-node-2
You can see from the above listing of current pods that one Pod has been sitting in the Error state for 6 days. I have never seen a Pod in this state recover or get cleaned up; manual intervention is always necessary.
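For illustration, the manual cleanup amounts to deleting the stuck pod by hand, e.g. (pod name from the listing above, jenkins namespace from the describe output below):

kubectl delete pod jenkins-slave-r59w1-qs283 -n jenkins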
When I describe the pod, I see that the "jnlp" container is in a bad state (snippet provided):
~ # kubectl describe pod jenkins-slave-r59w1-qs283
Name:           jenkins-slave-r59w1-qs283
Namespace:      jenkins
Node:           k8s-node-2/10.0.40.9
Start Time:     Thu, 01 Nov 2018 12:20:06 +0000
Labels:         jenkins=slave
                jenkins/api-build=true
Annotations:    kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container owasp-zap; cpu limit for container owasp-zap; cpu limit for container dind; cpu limit for container maven; cpu request for ...
Status:         Running
IP:             10.233.76.104
Containers:
  ...
  jnlp:
    Container ID:   docker://a08af23511d01c5f9a249c7f8f8383040a5cc70c25a0680fb0bec4c80439ec7e
    Image:          jenkins/jnlp-slave:3.23-1-alpine
    Image ID:       docker-pullable://jenkins/jnlp-slave@sha256:3cffe807013fece5182124b1e09e742f96b084ae832406a287283a258e79391c
    Port:           <none>
    Host Port:      <none>
    Args:
      b39461cef6e0c9a0ab970bf7f6ff664b463d119e8ddc4c8e966f8a77c2dc055f
      jenkins-slave-r59w1-qs283
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 01 Nov 2018 12:20:12 +0000
      Finished:     Thu, 01 Nov 2018 12:21:01 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:     250m
      memory:  512Mi
    Environment:
      JENKINS_SECRET:      b39461cef6e0c9a0ab970bf7f6ff664b463d119e8ddc4c8e966f8a77c2dc055f
      JENKINS_TUNNEL:      jenkins-service:50000
      JENKINS_AGENT_NAME:  jenkins-slave-r59w1-qs283
      JENKINS_NAME:        jenkins-slave-r59w1-qs283
      JENKINS_URL:         http://jenkins-service:8080/
      HOME:                /home/jenkins
    Mounts:
      /home/jenkins from workspace-volume (rw)
      /mvn/.m2nrepo from volume-0 (rw)
      /var/lib/docker from volume-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kmrnj (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  volume-0:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  jenkins-pv-claim
    ReadOnly:   false
  volume-1:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  workspace-volume:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-kmrnj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kmrnj
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
The jnlp container is in the Terminated state with reason Error and exit code 255.
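As an illustration of how this could be detected programmatically, the same container state can be read with a jsonpath query (pod name and jenkins namespace as in the describe output above):

kubectl get pod jenkins-slave-r59w1-qs283 -n jenkins \
  -o jsonpath='{.status.containerStatuses[?(@.name=="jnlp")].state}'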
When I look at the logs for the above failed container (see attached) and compare them to a healthy container's logs, they look the same up until the failed container shows this message:
Nov 01, 2018 12:20:49 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
Nov 01, 2018 12:20:59 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1 onReconnect
INFO: Restarting agent via jenkins.slaves.restarter.UnixSlaveRestarter@53d577ce
The log then seems to repeat the first connection attempt before printing a stack trace, at which point the container enters the state described above.
I have also attached the Console Output from the build job associated with this pod. The build job spent "7 hr 41 min waiting" and ended up in a failed state.
It would be nice to fix this so that the Error state is never reached, but the bug I'm pointing out here is that the Pod should be cleaned up when it enters the Error state. Shouldn't the Jenkins Kubernetes plugin keep track of this and clean up Pods that end up in this state?
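As a stop-gap, a cleanup along the following lines could be run periodically outside the plugin (a sketch only, assuming the jenkins namespace and the jenkins=slave label shown in the describe output above; it deletes any agent pod whose STATUS column reports Error):

kubectl get pods -n jenkins -l jenkins=slave --no-headers \
  | awk '$3 == "Error" {print $1}' \
  | xargs -r kubectl delete pod -n jenkins

The cleaner fix, though, would be for the plugin itself to reap these pods, which is what this issue is asking for.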
is duplicated by:
- JENKINS-55860 Unable to clean up the pod with the status Error (Closed)
- JENKINS-56400 Pods left in Error state when slave-master communication fails (Closed)