Jenkins / JENKINS-47476

Provisioned slaves cannot reconnect following a Jenkins restart

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: kubernetes-plugin
    • Environment: kubernetes-plugin 1.1, jenkins 2.73.2

      If a Jenkins master is restarted while a build is running on a provisioned slave, the connection to the existing slave cannot be re-established once the master is back online.

      I used the file src/test/resources/org/csanchez/jenkins/plugins/kubernetes/pipeline/runInPodWithRestart.groovy for my pipeline.

      Attached is the pod log from the jnlp container and a snippet from the master's log.


          Scott Hebert added a comment -

          0x89 have you seen this before?


          Martin Sander added a comment -

          scoheb: no, I personally haven't seen this before.

          If I remember correctly, there used to be a disclaimer in README.md that restarting of pipelines after restart still has problems regarding container executions, but maybe I am wrong about that.


          Chance Zibolski added a comment -

          I'm getting this currently. As far as I can tell, when Jenkins restarts, the agents are no longer "computers/nodes" in Jenkins's eyes, so it starts rejecting their JNLP connections, and the container eventually errors.

          I can see errors like "WARNING: safe-exit thread for pod-test-x123b-rknm5 terminated"

          Chance Zibolski added a comment - edited

          Also, I found that the pod templates still exist after this, and the plugin tries to reprovision these pods but gets errors that they already exist. Perhaps upon restart, the plugin should look at all its pod templates, check whether those pods still exist, and re-create them as Jenkins nodes so the pods can connect.

          Here's a full example of log messages containing a particular pod's name, before and after Jenkins restarts:

          https://gist.github.com/chancez/0a3e0e4798d4bb70280d136b7c12f8ec
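The reconciliation proposed in that comment amounts to a set difference: pods that survived the restart in the cluster, minus the agents Jenkins still knows about, are the ones to re-register rather than re-create. A minimal, self-contained sketch of that idea — `PodReconciler`, `podsToAdopt`, and the pod names are hypothetical illustrations, not the plugin's real API:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: on startup, compare pods still present in the cluster
// against the agent nodes Jenkins has registered, and "adopt" the leftovers
// instead of POSTing them again (which fails with 409 AlreadyExists).
public class PodReconciler {

    /** Pods that exist in the cluster but have no matching Jenkins node. */
    public static Set<String> podsToAdopt(Set<String> clusterPods, Set<String> jenkinsNodes) {
        Set<String> adopt = new TreeSet<>(clusterPods);
        adopt.removeAll(jenkinsNodes); // already-registered pods need no action
        return adopt;
    }

    public static void main(String[] args) {
        // After a restart, build-839q3-93sd9 survives in the cluster but is
        // unknown to Jenkins; it should be re-registered, not re-created.
        Set<String> clusterPods = Set.of("build-839q3-93sd9", "build-839q3-rb70v");
        Set<String> jenkinsNodes = Set.of("build-839q3-rb70v");
        System.out.println(podsToAdopt(clusterPods, jenkinsNodes)); // prints [build-839q3-93sd9]
    }
}
```

In the real plugin the "cluster pods" side would come from a labeled pod list and the "adopt" step would re-register a `KubernetesSlave`; this sketch only shows the selection logic.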

          Scott Hebert added a comment -

          PR created: https://github.com/jenkinsci/kubernetes-plugin/pull/244

          Chance Zibolski added a comment - edited

          That PR didn't fix anything for me. I still get errors because Jenkins doesn't recognize the pod's client name after restarting.

          INFO: [JNLP4-connect connection to jenkins-shared-jnlp.jenkins.svc.cluster.local/10.3.195.36:50000] Local headers refused by remote: Unknown client name: kube-chargeback-build-839q3-rb70v
          

          And earlier in the startup logs (before jenkins is even fully finished initializing and running my plugins):

          Error in provisioning; slave=KubernetesSlave name: kube-chargeback-build-839q3-93sd9, template=org.csanchez.jenkins.plugins.kubernetes.PodTemplate@330bbd56
          io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/jenkins/pods. Message: pods "kube-chargeback-build-839q3-93sd9" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=pods, name=kube-chargeback-build-839q3-93sd9, retryAfterSeconds=null, additionalProperties={}), kind=Status, message=pods "kube-chargeback-build-839q3-93sd9" already exists, metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
          	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:470)
          	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:409)
          	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379)
          	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
          	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:226)
          	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:769)
          	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:356)
          	at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:133)
          	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:285)
          	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:745)
          

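One way to tolerate the 409 in the trace above is to treat `AlreadyExists` as a signal that the pod survived the restart and should be reconnected to, rather than failing the launch. A minimal sketch under that assumption — `launch` and the `createPod` callback are stand-ins for the real Kubernetes API call, not the plugin's actual fix:

```java
import java.util.function.ToIntFunction;

// Hypothetical sketch: branch on the HTTP status of the pod-create call
// instead of propagating a KubernetesClientException on conflict.
public class CreateOrAdopt {

    public static String launch(String podName, ToIntFunction<String> createPod) {
        int status = createPod.applyAsInt(podName);
        if (status == 201) return "created " + podName;
        if (status == 409) return "adopted " + podName; // AlreadyExists: pod survived the restart
        throw new IllegalStateException("unexpected HTTP status " + status + " for " + podName);
    }

    public static void main(String[] args) {
        // Simulate the restart scenario: the API server answers 409 Conflict.
        System.out.println(launch("kube-chargeback-build-839q3-93sd9", name -> 409));
        // prints: adopted kube-chargeback-build-839q3-93sd9
    }
}
```

Adoption alone is not enough, though: as the "Unknown client name" line shows, the master must also still have (or re-create) the corresponding node entry so the JNLP handshake is accepted.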

          Chance Zibolski added a comment -

          It turns out my plugin upgrade got reverted by someone while I was testing, sorry about that. It seems like pods are in fact re-connecting!

            Assignee: scoheb Scott Hebert
            Reporter: scoheb Scott Hebert
            Votes: 1
            Watchers: 5

              Created:
              Updated:
              Resolved: