JENKINS-68649

Azure virtual nodes: "Agent is not connected after 0 seconds, status: Pending"

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: kubernetes-plugin
    • Environment:
      Kubernetes: 1.23.5 (Azure AKS)
      Jenkins: 2.332.3-lts-alpine
      Kubernetes Plugin: 3600.v144b_cd192ca_a_
      Agent node: virtual kubelet with Azure ACI connector

      I have Jenkins set up on my Azure AKS Kubernetes cluster, and have configured the default agent pod template to use the ACI virtual node.

      When the kubernetes plugin launches a new agent pod, it frequently fails with the log message:

      Agent is not connected after 0 seconds, status: Pending

      After this, it immediately destroys the agent and pod, then attempts to create a new one. Most agents fail to launch in this way, but every so often (roughly 1 in 15 attempts) one succeeds and the job runs.

      What's the cause?

      After some investigation it looks like this issue is caused by an interaction between a quirk in the lifecycle of pods running on the ACI virtual node, and the way that the Jenkins kubernetes plugin waits for agents to start.

      Pod lifecycle on ACI virtual node

      It looks like there's a brief point during the launch of a pod on the ACI virtual node where the pod is in an unexpected state:

      $ kubectl get pods -w
      NAME     READY   STATUS     RESTARTS   AGE
      test22   0/1     Creating   0          8s
      test22   0/1     Waiting    0          27s
      test22   0/1     Waiting    0          53s
      test22   1/1     Pending    0          73s **
      test22   1/1     Running    0          78s
      test22   0/1     Terminated   0          89s
      

      Just after the ACI instance is successfully launched, all containers enter the Running state, but the overall pod status remains Pending, as can be seen on the line marked ** above. This happens for most pods launched on the virtual node, but occasionally a pod will go straight to the Running status at the same time as its containers – Kubernetes agents only successfully launch when this happens.

      I have tested with other non-Jenkins pods and containers, and can confirm that it is a general behaviour with pods on the virtual node, and not something specific to the Jenkins agent pods. I expect it is caused by the virtual kubelet operator polling the container and pod statuses asynchronously, resulting in a brief period when the statuses are inconsistent. However, I haven't dug into the ACI operator code to confirm this.
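
      To illustrate this outside of kubectl, here's a rough sketch of the same observation using the fabric8 kubernetes-client (the client library the kubernetes plugin builds on). The namespace and pod name are placeholders, and the program is only meant to show the inconsistency window, not to be part of any fix:

      import io.fabric8.kubernetes.api.model.ContainerStatus;
      import io.fabric8.kubernetes.api.model.Pod;
      import io.fabric8.kubernetes.client.KubernetesClient;
      import io.fabric8.kubernetes.client.KubernetesClientBuilder;

      import java.util.List;

      public class WatchPodPhase {
          public static void main(String[] args) throws InterruptedException {
              try (KubernetesClient client = new KubernetesClientBuilder().build()) {
                  for (int i = 0; i < 120; i++) {
                      Pod pod = client.pods().inNamespace("jenkins").withName("test22").get();
                      if (pod != null && pod.getStatus() != null) {
                          List<ContainerStatus> statuses = pod.getStatus().getContainerStatuses();
                          boolean allContainersRunning = !statuses.isEmpty() && statuses.stream()
                                  .allMatch(cs -> cs.getState() != null && cs.getState().getRunning() != null);
                          // On the ACI virtual node this prints "phase=Pending allContainersRunning=true"
                          // for several seconds before the pod phase catches up to Running.
                          System.out.printf("phase=%s allContainersRunning=%s%n",
                                  pod.getStatus().getPhase(), allContainersRunning);
                      }
                      Thread.sleep(1000);
                  }
              }
          }
      }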

      Jenkins Kubernetes Plugin Behaviour

      When launching a new agent pod, the code in the KubernetesLauncher class waits for the pod to start and for the agent to connect. This wait is split into 2 phases:

      1. Waiting for all containers in the pod to reach the Running state
      2. Waiting for the agent to connect, while checking the pod hasn't since died

      During the second phase, the code periodically checks that the pod is in the Running state. If it isn't, it breaks out of the wait and marks the agent as failed. When the pod is in the state marked ** above, phase 1 is complete, because all containers are Running, but the pod's status is still Pending. This is what causes the kubernetes plugin to think that the agent has failed, at which point it deletes the agent and pod and tries again. If it instead waited a few more seconds, it would see the pod enter the Running state, and everything would be fine.
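
      To make that concrete, the problematic check behaves roughly like the sketch below. This is a simplified illustration of the logic described above, not the actual KubernetesLauncher code:

      import io.fabric8.kubernetes.api.model.Pod;

      class PodPhaseCheck {
          // Simplified illustration of the phase-2 check: while waiting for the
          // agent to connect, the pod is required to be in the Running phase.
          static void assertPodStillRunning(Pod pod) {
              String phase = pod.getStatus().getPhase();
              // On the ACI virtual node the pod can still report "Pending" here,
              // even though phase 1 has already seen every container Running, so
              // this check fails and the launcher deletes the agent and pod and
              // starts over.
              if (!"Running".equals(phase)) {
                  throw new IllegalStateException("Pod is no longer running, status: " + phase);
              }
          }
      }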

      What to do about it?

      It could be argued that this is a bug in the aci-connector virtual kubelet operator. However, I haven't found any spec that states that a pod must change to the Running state as soon as its containers are Running. There could be other scenarios, particularly with different types of virtual kubelet, where this race condition arises.

      I can see 2 ways of fixing this in the Jenkins kubernetes plugin:

      1. Make phase 1 (as described above) wait on the overall pod status, rather than container statuses
      2. During phase 2, relax the checks on the pod status: rather than insist that the pod be Running, instead check that the pod is not in a finished state (i.e. Succeeded or Failed). If the pod is truly stuck in the Pending or Unknown state, this will be picked up when the slave connect timeout expires.
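
      As a sketch of what option 2 could look like (using a hypothetical helper, and assuming the pod object is already being re-fetched during the phase-2 wait), the wait would only be abandoned once the pod has actually finished:

      import io.fabric8.kubernetes.api.model.Pod;

      class RelaxedPodCheck {
          // Option 2: only treat the pod as dead once it reaches a terminal phase.
          // A pod stuck in Pending or Unknown keeps waiting and is eventually
          // cleaned up when the slave connect timeout expires.
          static boolean podHasFinished(Pod pod) {
              String phase = pod.getStatus().getPhase();
              return "Succeeded".equals(phase) || "Failed".equals(phase);
          }
      }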

            Assignee: Unassigned
            Reporter: hworblehat (Rowan Lonsdale)
            Votes: 0
            Watchers: 1
