Pods are terminated after ~110s and ignore PodTemplate.connectionTimeout when containers start

      The issue started right after upgrading from 1.30.4 to any later version (1.30.5 up to the latest, 1.30.10).

      The log shows:

       

      2021-11-20 13:42:17.978+0000 [id=108]   INFO    o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes jenkins/meta-mbdn9-9j8d3
      2021-11-20 13:44:08.228+0000 [id=108]   WARNING o.c.j.p.k.KubernetesLauncher#launch: Error in provisioning; agent=KubernetesSlave name: meta-mbdn9-9j8d3, template=PodTemplate{id='a54377cc-f22f-4767-9571-f9abf713d15f', name='meta-mbdn9', namespace='jenkins', idleMinutes=5, label='meta', serviceAccount='jenkins', nodeSelector='node_pool=build-pool', containers=[ContainerTemplate{name='gcloud', image='gcr.io/google.com/cloudsdktool/cloud-sdk:debian_component_based', command='cat', ttyEnabled=true}], annotations=[PodAnnotation{key='buildUrl', value='http://jenkins-ui:8080/job/k8s/job/reclaim-volumes/2675/'}, PodAnnotation{key='runUrl', value='job/k8s/job/reclaim-volumes/2675/'}]}
      io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [1000000] milliseconds for [Pod] with name:[meta-mbdn9-9j8d3] in namespace [jenkins].
              at org.csanchez.jenkins.plugins.kubernetes.AllContainersRunningPodWatcher.await(AllContainersRunningPodWatcher.java:96)
              at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:169)
              at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:293)
              at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
              at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:829)
      2021-11-20 13:44:08.230+0000 [id=108]   INFO    o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent meta-mbdn9-9j8d3
      Terminated Kubernetes instance for agent jenkins/meta-mbdn9-9j8d3
      
      
      

       

      This happens for every pod whose containers do not start within ~110 seconds: the logs show the plugin terminating the pod about 110 seconds after creating it. Even the error message is wrong. It says "Timed out waiting for [1000000] milliseconds", but it did not wait the 1000 seconds it should have.
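As a quick check of the timing claim, the "Created Pod" and "Error in provisioning" timestamps from the log above can be subtracted directly and compared with the 1000000 ms in the exception message. This is only a small sketch; the class and method names are illustrative and not part of the plugin.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

public class ProvisioningDelay {
    // Matches the timestamp format of the Jenkins log lines,
    // e.g. "2021-11-20 13:42:17.978+0000".
    static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSZ");

    // Milliseconds elapsed between two log timestamps.
    static long elapsedMillis(String from, String to) {
        OffsetDateTime start = OffsetDateTime.parse(from, LOG_TS);
        OffsetDateTime end = OffsetDateTime.parse(to, LOG_TS);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // "Created Pod" line vs. "Error in provisioning" line.
        long waited = elapsedMillis("2021-11-20 13:42:17.978+0000",
                                    "2021-11-20 13:44:08.228+0000");
        long configured = 1_000_000L; // value reported in the exception message
        System.out.printf("waited %d ms (%.1f s), configured %d ms (%.0f s)%n",
                waited, waited / 1000.0, configured, configured / 1000.0);
    }
}
```

The gap works out to 110,250 ms, roughly a ninth of the configured 1,000,000 ms.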

       

      I believe the issue comes from this commit: https://github.com/jenkinsci/kubernetes-plugin/commit/f95a604462fd7723ba8246b748c83dc90d65a9e3

      I think these changed lines don't do the right thing:

      - return periodicAwait(10, System.currentTimeMillis(), Math.max(remaining / 10, 1000L), remaining); 
      + // Retry with 10% of the remaining time, with a min of 1s and a max of 10s 
      + return periodicAwait(10, System.currentTimeMillis(), Math.min(10000L, Math.max(remaining / 10, 1000L)), remaining);
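To illustrate why the cap could shorten the wait, the sketch below evaluates both interval formulas from the diff for the 1,000,000 ms timeout in the log. It ASSUMES that the first argument of periodicAwait (10) is a fixed number of attempts, each waiting up to one interval, so the total budget is attempts × interval; the real method may count attempts slightly differently (for example, one extra attempt would give the ~110 s observed rather than 100 s). The attemptsToCover helper is purely hypothetical, showing one way the 10 s cap could be kept without shrinking the overall timeout; it is not the plugin's code.

```java
public class RetryBudget {
    // First argument passed to periodicAwait in both versions of the line.
    static final int ATTEMPTS = 10;

    // Interval before the commit: 10% of remaining, at least 1 s.
    static long oldInterval(long remaining) {
        return Math.max(remaining / 10, 1000L);
    }

    // Interval after the commit: 10% of remaining, clamped to [1 s, 10 s].
    static long newInterval(long remaining) {
        return Math.min(10000L, Math.max(remaining / 10, 1000L));
    }

    // ASSUMPTION: a fixed number of attempts, each waiting up to `interval` ms.
    static long budget(int attempts, long interval) {
        return attempts * interval;
    }

    // HYPOTHETICAL: derive the attempt count from the remaining time
    // (ceiling division) so the capped interval still covers the full timeout.
    static int attemptsToCover(long remaining, long interval) {
        return (int) Math.max(1L, (remaining + interval - 1) / interval);
    }

    public static void main(String[] args) {
        long remaining = 1_000_000L; // connection timeout from the log, in ms
        long oldI = oldInterval(remaining);
        long newI = newInterval(remaining);
        System.out.println("old: " + ATTEMPTS + " x " + oldI + " ms = " + budget(ATTEMPTS, oldI) + " ms");
        System.out.println("new: " + ATTEMPTS + " x " + newI + " ms = " + budget(ATTEMPTS, newI) + " ms");
        int n = attemptsToCover(remaining, newI);
        System.out.println("scaled: " + n + " x " + newI + " ms = " + budget(n, newI) + " ms");
    }
}
```

Under that assumption, the old formula spreads 10 attempts over the full 1,000,000 ms, while the new one caps each interval at 10,000 ms and so gives up after about 100,000 ms, in line with the ~110 s seen in the log.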

       

      I've reverted that line, and now my pods behave correctly, as in 1.30.4, and agents are provisioned correctly.

            Assignee:
            Vincent Latombe
            Reporter:
            Stefan
            Archiver:
            Jenkins Service Account