JENKINS-47501

2 pods are started instead of 1 after update to version 1.1

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component/s: kubernetes-plugin
    • Labels: None
    • Environment: Jenkins 2.85, Kubernetes plugin 1.1 / 1.1.2

      We updated the Kubernetes plugin to 1.1 and Jenkins to 2.85 yesterday.

      When you create a pod with a node block inside a podTemplate, it now creates 2 pods instead of 1. The second one seems to be created about 10 seconds after the first one.


      debian-lfxb0-8s18k 2/2 Running 0 5s
      debian-lfxb0-d53mq 2/2 Running 0 15s

      Simplified test case

      podTemplate(
          name: "debian",
          label: "debian",
          instanceCap: 1,   // at most one pod should be created for this template

          containers: [
              containerTemplate(
                  name: 'debian',
                  image: "debian",
                  ttyEnabled: true,
                  command: "cat"   // keep the container running so the agent can use it
              )
          ]
      ) {
          node("debian") {
              sleep 99999   // hold the single executor so no further pods should be needed
          }
      }


      Edit: This seems to be caused by a change in 1.1; Kubernetes plugin 1.0 appears to start only one pod.

      Jenkins log

      Dec 12, 2017 1:20:11 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud provision
      Excess workload after pending Spot instances: 1
      Dec 12, 2017 1:20:11 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud provision
      Template: Kubernetes Pod Template
      Dec 12, 2017 1:20:11 PM INFO okhttp3.internal.platform.Platform log
      ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?
      Dec 12, 2017 1:20:11 PM INFO hudson.slaves.NodeProvisioner$StandardStrategyImpl apply
      Started provisioning Kubernetes Pod Template from kubernetes with 1 executors. Remaining excess workload: 0
      Dec 12, 2017 1:20:21 PM INFO hudson.slaves.NodeProvisioner$2 run
      Kubernetes Pod Template provisioning successfully completed. We have now 5 computer(s)
      Dec 12, 2017 1:20:21 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud provision
      Excess workload after pending Spot instances: 1
      Dec 12, 2017 1:20:21 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud provision
      Template: Kubernetes Pod Template
      Dec 12, 2017 1:20:21 PM INFO okhttp3.internal.platform.Platform log
      ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?
      Dec 12, 2017 1:20:21 PM INFO hudson.slaves.NodeProvisioner$StandardStrategyImpl apply
      Started provisioning Kubernetes Pod Template from kubernetes with 1 executors. Remaining excess workload: -0.81
      Dec 12, 2017 1:20:21 PM INFO okhttp3.internal.platform.Platform log
      ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?
      Dec 12, 2017 1:20:21 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
      Created Pod: debian-9z8rf-fvl9q in namespace default
      Dec 12, 2017 1:20:21 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
      Waiting for Pod to be scheduled (0/100): debian-9z8rf-fvl9q
      Dec 12, 2017 1:20:22 PM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run
      Accepted JNLP4-connect connection #8 from xxx.xx.x.xx/xxx.xx.x.xx:34406
      Dec 12, 2017 1:20:31 PM INFO hudson.slaves.NodeProvisioner$2 run
      Kubernetes Pod Template provisioning successfully completed. We have now 6 computer(s)
      Dec 12, 2017 1:20:31 PM INFO okhttp3.internal.platform.Platform log
      ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?
      Dec 12, 2017 1:20:31 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
      Created Pod: debian-9z8rf-wlwht in namespace default
      Dec 12, 2017 1:20:31 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
      Waiting for Pod to be scheduled (0/100): debian-9z8rf-wlwht
      Dec 12, 2017 1:20:32 PM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run
      Accepted JNLP4-connect connection #9 from xxx.xx.x.xxx/xxx.xx.x.xxx:38066
      Dec 12, 2017 1:22:42 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
      Terminating Kubernetes instance for agent debian-9z8rf-wlwht
      Dec 12, 2017 1:22:42 PM WARNING jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed
      Computer.threadPoolForRemoting [#52] for debian-9z8rf-wlwht terminated
      java.nio.channels.ClosedChannelException
      	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
      	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
      	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
      	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
      	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
      	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
      	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:311)
      	at hudson.remoting.Channel.close(Channel.java:1405)
      	at hudson.remoting.Channel.close(Channel.java:1358)
      	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:737)
      	at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:96)
      	at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:655)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      


          [JENKINS-47501] 2 pods are started instead of 1 after update to version 1.1

          Carlos Sanchez added a comment -

          Fixed in 1.3.1

          Carlos Rojas added a comment -

          I have installed the latest version, 1.5.2, and I am experiencing the same issue: I set the container cap to only 3, yet more than 5 containers run at the same time.

          Aiman Alsari added a comment -

          I can confirm this is still happening on 1.5.2. The most reliable way of reproducing this issue is to run the test case provided by the OP: have a single job running, queue up 2 or 3 behind it using the same pod template, then abort the first one.

          What happens is that 2 or more pods are provisioned simultaneously, disregarding the limit. This isn't that big of a deal (for me at least), but then it gets into a crazy crash loop.

          Something is calling KubernetesSlave._terminate() on one of the newly provisioned pods; it can then fail in one of two ways:

          • The pod is terminated and provisioning halts, resulting in "java.lang.IllegalStateException: Pod no longer exists:".
          • The master has removed the node from its list of nodes, so JNLP fails. On the master it logs "Refusing headers from remote: Unknown client name: foo-bar". On the actual JNLP pod it logs "The server rejected the connection: None of the protocols were accepted".

          There seems to be a race condition somewhere; something is calling the _terminate method for this pod right when it starts up. My first guess was that when the first aborted job is terminating, provisioning of the two new pods happens simultaneously, and the termination code of the first job then kills the new ones as well as the first one. This doesn't seem to be the case, though, as I cannot replicate the issue reliably with just one active + one queued job; it has to be 2 or 3.

          Once in this state, the only way to fix it is to delete all the queued jobs and clean up all errored pods.

          jenkins_log


          Aiman Alsari added a comment -

          I found that someone else has also seen this issue; see Yuan Yao's comment on JENKINS-44042.


          Aiman Alsari added a comment -

          Just to update: I have found a suitable workaround. The premature call to terminate seems to come from the OnceRetentionStrategy, so if you set "idleMinutes: 1" in your podTemplate, it will use a different retention strategy.

          Unfortunately this means your containers get re-used, so you can't guarantee their clean state, but it's better than nothing.
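
          For reference, a minimal sketch of that workaround applied to the test case from the description; the only functional change is the added idleMinutes line, and the exact retention behaviour should be confirmed against your plugin version:

          podTemplate(
              name: "debian",
              label: "debian",
              instanceCap: 1,
              idleMinutes: 1,   // keep the agent alive after the build, so the plugin picks a
                                // retention strategy other than OnceRetentionStrategy
              containers: [
                  containerTemplate(
                      name: 'debian',
                      image: "debian",
                      ttyEnabled: true,
                      command: "cat"
                  )
              ]
          ) {
              node("debian") {
                  sleep 99999
              }
          }

          The trade-off is the one noted above: the pod can be reused by later builds, so a clean workspace is not guaranteed.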


          Nicolas De Loof added a comment -

          Sounds to me like this is a very common issue, and this mechanism should be implemented by NodeProvisioner rather than re-implemented with a different approach by each and every cloud plugin.

          Erik Kristensen added a comment -

          I can confirm this is still happening in 1.12.4.

          I'm seeing that if you have multiple jobs pending and your jnlp container or additional containers are crashing, the plugin exponentially tries to create more agents.

          It basically DDoSes my k8s cluster.

          I had 64 workloads in the process of being added/removed due to a typo in the Jenkins hostname the jnlp agents connect to. I had 20 jobs pending, and the maximums for containers and instances are both set to 1.

          Dax Games added a comment -

          I can confirm this is still happening with Kubernetes Plugin 1.15.1 and 1.15.2 on Jenkins 2.150.3.

          In my case it spins up a new pod every 10 seconds until the first one connects. I just watched a single job spin up 10 pods, and the moment the first pod connected, the plugin stopped spinning up new pods.

          This is a HUGE problem.


          Bernhard Kaszt added a comment -

          I think the problem was already fixed; now, with the latest versions, it's happening again...

          Vincent Latombe added a comment -

          Closing as stale. Feel free to re-open if you can reproduce the issue with a recent version of the plugin.

            Assignee: Vincent Latombe (vlatombe)
            Reporter: Bernhard Kaszt (berni_)
            Votes: 5
            Watchers: 17
