JENKINS-53427

Agent creation failure because of concurrent attempts to schedule a pod


    • Type: Task
    • Resolution: Unresolved
    • Priority: Critical
    • kubernetes-plugin
    • None
    • Jenkins ver. 2.107.3
      kubernetes-plugin 1.8.4
      Kubernetes 1.8

      The vast majority of the pods have been created properly, but for some of them it looks like there are several concurrent attempts to create a single pod.

      I grepped the logs on the master for one such pod, which the plugin tried to create from several threads (Pod and PodTemplate details omitted):

      ./custom/k8s.log:2018-09-03 12:08:12.876+0000 [id=629145] FINE o.c.j.p.k.PodTemplateBuilder#build: Pod built: Pod(apiVersion=v1, kind=Pod, ...)
      ./custom/k8s.log:2018-09-03 12:08:12.876+0000 [id=629145] FINE o.c.j.p.k.KubernetesLauncher#launch: Creating Pod: test-0ghpx in namespace dev
      ./custom/k8s.log:2018-09-03 12:08:12.969+0000 [id=629145] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: test-0ghpx in namespace dev
      ./custom/k8s.log:2018-09-03 12:08:12.970+0000 [id=629145] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (0/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:14.143+0000 [id=640057] FINE o.c.j.p.k.PodTemplateBuilder#build: Pod built: Pod(...)
      ./custom/k8s.log:2018-09-03 12:08:14.144+0000 [id=640057] FINE o.c.j.p.k.KubernetesLauncher#launch: Creating Pod: test-0ghpx in namespace dev
      ./custom/k8s.log:2018-09-03 12:08:14.214+0000 [id=640057] WARNING o.c.j.p.k.KubernetesLauncher#launch: Error in provisioning; agent=KubernetesSlave name: test-0ghpx, template=PodTemplate{...}
      ./custom/k8s.log:io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://k8s.local:6443/api/v1/namespaces/dev/pods. Message: pods "test-0ghpx" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=pods, name=test-0ghpx, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "test-0ghpx" already exists, metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
      ./custom/k8s.log:2018-09-03 12:08:14.214+0000 [id=640057] FINER o.c.j.p.k.KubernetesLauncher#launch: Removing Jenkins node: test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:14.215+0000 [id=640057] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:14.289+0000 [id=640057] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminated Kubernetes instance for agent dev/test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:14.290+0000 [id=640057] INFO o.c.j.p.k.KubernetesSlave#_terminate: Disconnected computer test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:14.542+0000 [id=640056] FINE o.c.j.p.k.PodTemplateBuilder#build: Pod built: Pod(...)
      ./custom/k8s.log:2018-09-03 12:08:14.543+0000 [id=640056] FINE o.c.j.p.k.KubernetesLauncher#launch: Creating Pod: test-0ghpx in namespace dev
      ./custom/k8s.log:2018-09-03 12:08:14.615+0000 [id=640056] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: test-0ghpx in namespace dev
      ./custom/k8s.log:2018-09-03 12:08:14.616+0000 [id=640056] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (0/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:18.976+0000 [id=629145] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (1/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:20.620+0000 [id=640056] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (1/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:24.983+0000 [id=629145] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (2/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:26.625+0000 [id=640056] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (2/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:30.988+0000 [id=629145] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (3/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:32.629+0000 [id=640056] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (3/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:36.993+0000 [id=629145] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (4/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:38.634+0000 [id=640056] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (4/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:42.998+0000 [id=629145] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (5/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:08:44.639+0000 [id=640056] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (5/100): test-0ghpx
      ......
      ./custom/k8s.log:2018-09-03 12:18:01.997+0000 [id=629145] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (98/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:18:03.559+0000 [id=640056] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (98/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:18:08.002+0000 [id=629145] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (99/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:18:09.563+0000 [id=640056] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for Pod to be scheduled (99/100): test-0ghpx
      ./custom/k8s.log:2018-09-03 12:18:14.008+0000 [id=629145] WARNING o.c.j.p.k.KubernetesLauncher#launch: Error in provisioning; agent=KubernetesSlave name: test-0ghpx, template=PodTemplate{...}
      ./custom/k8s.log:2018-09-03 12:18:14.008+0000 [id=629145] FINER o.c.j.p.k.KubernetesLauncher#launch: Removing Jenkins node: test-0ghpx
      ./custom/k8s.log:2018-09-03 12:18:14.008+0000 [id=629145] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent test-0ghpx
      ./custom/k8s.log:2018-09-03 12:18:14.009+0000 [id=629145] SEVERE o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: test-0ghpx
      ./custom/k8s.log:2018-09-03 12:18:15.568+0000 [id=640056] WARNING o.c.j.p.k.KubernetesLauncher#launch: Error in provisioning; agent=KubernetesSlave name: test-0ghpx, template=PodTemplate{...}
      ./custom/k8s.log:2018-09-03 12:18:15.569+0000 [id=640056] FINER o.c.j.p.k.KubernetesLauncher#launch: Removing Jenkins node: test-0ghpx
      ./custom/k8s.log:2018-09-03 12:18:15.569+0000 [id=640056] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent test-0ghpx
      ./custom/k8s.log:2018-09-03 12:18:15.570+0000 [id=640056] SEVERE o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: test-0ghpx
      ./slaves/test-0ghpx/slave.log:io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://k8s.local:6443/api/v1/namespaces/dev/pods. Message: pods "test-0ghpx" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=pods, name=test-0ghpx, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "test-0ghpx" already exists, metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
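The same check can be automated instead of grepping by hand. Below is a minimal sketch, not part of the plugin: it scans log lines in the format shown above and groups the thread ids that logged "Creating Pod" by pod name, so any pod with more than one thread id stands out. The class name PodCreationAudit and the method threadsPerPod are hypothetical names chosen for illustration, and the regular expression assumes the exact log layout from this report.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of kubernetes-plugin.
class PodCreationAudit {
    // Captures the thread id and the pod name from lines such as:
    // "[id=629145] FINE o.c.j.p.k.KubernetesLauncher#launch: Creating Pod: test-0ghpx in namespace dev"
    private static final Pattern CREATING = Pattern.compile(
        "\\[id=(\\d+)\\] \\w+ \\S*KubernetesLauncher#launch: Creating Pod: (\\S+) in namespace");

    // Returns, for each pod name, the sorted set of thread ids that attempted to create it.
    static Map<String, Set<String>> threadsPerPod(List<String> lines) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String line : lines) {
            Matcher m = CREATING.matcher(line);
            if (m.find()) {
                result.computeIfAbsent(m.group(2), k -> new TreeSet<>()).add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Read the log file given as the first argument, or fall back to two sample lines.
        List<String> lines = args.length > 0
            ? Files.readAllLines(Paths.get(args[0]))
            : Arrays.asList(
                "2018-09-03 12:08:12.876+0000 [id=629145] FINE o.c.j.p.k.KubernetesLauncher#launch: Creating Pod: test-0ghpx in namespace dev",
                "2018-09-03 12:08:14.144+0000 [id=640057] FINE o.c.j.p.k.KubernetesLauncher#launch: Creating Pod: test-0ghpx in namespace dev");
        threadsPerPod(lines).forEach((pod, threads) -> {
            if (threads.size() > 1) {
                System.out.println(pod + " was created from several threads: " + threads);
            }
        });
    }
}
```

Run against the log above, this would flag test-0ghpx with thread ids 629145, 640056 and 640057, while pods created successfully should show a single thread id.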
      

       

       

      So the failure arises like this:

      1. The first thread schedules a pod.
      2. The second thread tries to schedule the pod without checking that it is already scheduled, fails on the attempt to create a pod with the same name, and deletes the corresponding Jenkins node.
      3. The third thread tries to schedule the pod (the node is already terminated).
      4. The first and third threads wait for the pod to be scheduled until the timeout is reached, because scheduling is impossible once the node has been killed.

      The jnlp container of this pod produces logs like this:
      Aug 31, 2018 11:17:15 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
      Aug 31, 2018 11:17:15 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
      INFO: Restarting agent via jenkins.slaves.restarter.UnixSlaveRestarter@2eb6111a
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up agent: test-0ghpx
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Aug 31, 2018 11:17:17 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using /home/jenkins/agent/remoting as a remoting work directory
      Both error and output logs will be printed to /home/jenkins/agent/remoting
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among [http://jenkins.local]
      Aug 31, 2018 11:17:17 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
        Agent address: jenkins-jnlp.local
        Agent port:    30150
        Identity:      XXXX
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins-jnlp.local:30150
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Remote identity confirmed: XXXXXX
      Aug 31, 2018 11:17:17 AM org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer onRecv
      INFO: [JNLP4-connect connection to jenkins-jnlp.local/2.2.2.2:30150] Local headers refused by remote: Unknown client name: test-0ghpx
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Protocol JNLP4-connect encountered an unexpected exception
      java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: test-0ghpx
          at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
          at hudson.remoting.Engine.innerRun(Engine.java:609)
          at hudson.remoting.Engine.run(Engine.java:469)
      Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: test-0ghpx
          at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:378)
          at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:433)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
          at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
          at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
          at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:48)
          at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:247)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at hudson.remoting.Engine$1$1.run(Engine.java:94)
          at java.lang.Thread.run(Thread.java:748)
          Suppressed: java.nio.channels.ClosedChannelException
              ... 7 more
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins-jnlp.local:30150
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Server reports protocol JNLP4-plaintext not supported, skipping
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Protocol JNLP3-connect is not enabled, skipping
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Server reports protocol JNLP2-connect not supported, skipping
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Server reports protocol JNLP-connect not supported, skipping
      Aug 31, 2018 11:17:17 AM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: The server rejected the connection: None of the protocols were accepted
      java.lang.Exception: The server rejected the connection: None of the protocols were accepted
          at hudson.remoting.Engine.onConnectionRejected(Engine.java:670)
          at hudson.remoting.Engine.innerRun(Engine.java:634)
          at hudson.remoting.Engine.run(Engine.java:469)
      

      For successful creations I see only one thread doing the job.

      I've just created a POC of how we can mitigate the impact, but it looks much more like a workaround (moreover, a non-thread-safe workaround) than a proper fix.
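
      For comparison, here is a minimal sketch of what a thread-safe guard against duplicate launches could look like. It is not the POC mentioned above and not the plugin's actual code: the class LaunchGuard and its methods tryClaim and release are hypothetical names, and the pod creation itself is replaced by a counter. The idea is only that an atomic "claim" on the pod name lets exactly one thread proceed, so a second thread never reaches the create call and never tears down the node after a 409.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical guard for illustration; not part of kubernetes-plugin.
class LaunchGuard {
    // Pod names that some thread is currently launching.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    // Atomically claims the pod name; returns true only for the first caller.
    boolean tryClaim(String podName) {
        return inFlight.add(podName);
    }

    // Releases the claim once the launch has finished or failed.
    void release(String podName) {
        inFlight.remove(podName);
    }

    public static void main(String[] args) throws InterruptedException {
        LaunchGuard guard = new LaunchGuard();
        AtomicInteger created = new AtomicInteger();

        // Three threads race to launch the same agent, as in the log above.
        Runnable launch = () -> {
            if (guard.tryClaim("test-0ghpx")) {
                created.incrementAndGet(); // stands in for the real pod creation
            }
            // Losing threads simply skip the launch instead of deleting the node.
        };

        Thread[] threads = new Thread[3];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(launch);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("pods created: " + created.get());
    }
}
```

      Because the claim is never released in this demo, only one of the three threads performs the creation. A real fix would also need to decide when to release the claim and why several launch attempts are started for one agent in the first place, which this sketch does not address.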

       

      It also looks like a similar problem was described here: https://issues.jenkins-ci.org/browse/JENKINS-44042?focusedCommentId=311231&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-311231
      So this issue can probably also be treated as a reproduction of that problem: JENKINS-44042

            Assignee: Unassigned
            Reporter: Alex Medvedev (fduch)
            Votes: 6
            Watchers: 16