-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Jenkins 2.138.1, Kubernetes plugin 1.12.6, Kubernetes 1.11
Occasionally, we observe an issue with resuming jobs after a Jenkins Master restart.
Our configuration. Jenkins 2.138.1 (some older versions had the same problem) with the init.groovy.d configuration + persistent jobs running on top of the Kubernetes cluster. Slaves/executors are setup with the Kubernetes plugin 1.12.6 on Kubernetes 1.11.
After the Jenkins master restart a simple job (a git clone + a shell script) sometimes resume properly:
15:19:18 Commit message: "Quick commit" 15:19:18 > git rev-list --no-walk aaaaaa9bd4a093e0364df0f52f5447c15f3785f0 # timeout=10 [Pipeline] } [Pipeline] // dir [Pipeline] } [Pipeline] // stage [Pipeline] stage [Pipeline] { (Execute my fancy shell script) [Pipeline] sh Resuming build at Wed Oct 01 13:19:59 GMT 2018 after Jenkins restart Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline 15:19:19 [xxxxx-xxxxx] Running shell script 15:19:22 + vault login '-method=aws' 'role=jenkins' Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline Waiting to resume part of xxxxx-xxxxx #40 Ready to run at Wed Oct 01 13:20:13 GMT 2018 15:20:13 Agent jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx is provisioned from template Kubernetes Pod Template 15:20:13 Agent specification [Kubernetes Pod Template] (jenkins-slave-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx): 15:20:14 (script execution performed properly) (...) Finished: SUCCESS
However, usually it hangs at:
15:31:50 Running in /home/jenkins/workspace/xxxxx-xxxxx/playbooks
[Pipeline] {
[Pipeline] git
[Pipeline] }
[Pipeline] // dir
Resuming build at Wed Oct 01 13:32:39 GMT 2018 after Jenkins restart
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Ready to run at Wed Oct 01 13:33:07 GMT 2018
15:33:07 Agent jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx is provisioned from template Kubernetes Pod Template
15:33:07 Agent specification [Kubernetes Pod Template] (jenkins-slave-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx):
15:33:07
and it needs to be aborted.
From the Jenkins logs of the failed resume I see:
Oct 01, 2018 1:32:13 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch Agent has already been launched, activating: {} Oct 01, 2018 1:32:13 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch Agent has already been launched, activating: {} Oct 01, 2018 1:33:03 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer Computer KubernetesComputer name: jenkins-slave-ansible-update-xxxx-xxxx slave: KubernetesSlave name: jenkins-slave-ansible-update-xxxxx-xxxxx taskAccepted Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerStepExecution onResume Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud Building connection to Kubernetes kubernetes URL namespace default Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesFactoryAdapter Autoconfiguring Kubernetes client Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesFactoryAdapter Creating Kubernetes client: KubernetesFactoryAdapter [serviceAddress=, namespace=default, caCertData=null, credentials=null, skipTlsVerify=true, connectTimeout=0, readTimeout=0] Oct 01, 2018 1:33:08 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud Connected to Kubernetes kubernetes URL
the next lines in the successful resume (there are not present on failure) are:
Oct 01, 2018 1:20:15 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Launch proc with environment: [AGENT_WORKDIR=/home/jenkins/agent, ...]
Oct 01, 2018 1:20:16 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Executing shell script inside container [myContainer] of pod [jenkins-slave-ansible-update-adfs-mgmt-lfm45-02g27]
Oct 01, 2018 1:20:16 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
onOpen : java.util.concurrent.CountDownLatch@7b82ed2d[Count = 0]
Oct 01, 2018 1:20:17 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Launching with env vars: [...]
Oct 01, 2018 1:20:19 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Executing command: "... my shell command ..."
Oct 01, 2018 1:20:19 PM INFO org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1 doLaunch
Created process inside pod: [jenkins-slave-ansible-update-xxxxx-xxxxx], container: [myContainer] with pid:[-1]
Oct 01, 2018 1:20:44 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
onClose : java.util.concurrent.CountDownLatch@7b82ed2d[Count = 0]
Oct 01, 2018 1:20:45 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer
Computer KubernetesComputer name: jenkins-slave-ansible-update-xxxxx-xxxxx slave: KubernetesSlave name: jenkins-slave-ansible-update-xxxxx-xxxxx taskCompleted
...
Looking at the plugin source code didn't help. We don't know the job hangs and Jenkins is not able to carry on with the executor.
Do you have any suggestion how it could be fixed or at least investigated deeply?
Have you by any chance made any progress on this?