[JENKINS-67390] Kubernetes agents can stay in suspended state if Jenkins restarts while provisioned

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component/s: kubernetes-plugin
    • Labels: None

      KubernetesLauncher works as follows:

      • Sets the agent to the suspended state. A Kubernetes agent is tied to a pod with multiple containers, so the container running the agent can be up while other containers are not. The agent is kept suspended until every container is ready, so that no job is assigned to it before it can actually run one.
      • Creates the pod
      • Waits for all containers to be up
      • In the happy case, all containers come up and the launcher clears the suspended state.

      If an exception or a timeout occurs during these steps, the node and its matching pod are terminated.
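The launch sequence described above can be sketched roughly as follows. This is an illustration only, not the plugin's code: `SketchAgent`, `LaunchSketch`, and the `createPod` and `allContainersUp` parameters are hypothetical stand-ins, and polling is assumed here purely for simplicity.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for an agent; not a kubernetes-plugin type.
class SketchAgent {
    boolean suspended;
    boolean terminated;
}

class LaunchSketch {
    /**
     * @param createPod       hypothetical action that creates the pod
     * @param allContainersUp hypothetical probe, true once every container is ready
     */
    static void launch(SketchAgent agent, Runnable createPod,
                       BooleanSupplier allContainersUp,
                       Duration timeout, Duration pollInterval) {
        agent.suspended = true;                 // step 1: keep builds away from the agent
        try {
            createPod.run();                    // step 2: create the pod
            long deadline = System.nanoTime() + timeout.toNanos();
            while (!allContainersUp.getAsBoolean()) {   // step 3: wait for all containers
                if (System.nanoTime() - deadline >= 0) {
                    throw new IllegalStateException("pod not ready within " + timeout);
                }
                Thread.sleep(pollInterval.toMillis());
            }
            agent.suspended = false;            // step 4: happy case, agent becomes usable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            agent.terminated = true;            // treated like any other failure
        } catch (RuntimeException e) {
            agent.terminated = true;            // exception or timeout: node and pod go away
        }
    }
}
```

Note that everything between setting and clearing the flag lives on the stack of one method call. If the process dies inside the loop, nothing remains that would ever clear `suspended`.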

      The issue arises if Jenkins restarts while the launcher is waiting for all containers. In that case the agent has already been persisted in the suspended state, but on restart no event handler resumes the watch, so nothing ever clears the suspension.
      The agent is eventually removed once the configured retention timeout expires. Until then it cannot be used, and if a provisioning limit is configured, it can block further provisioning.
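A small sketch of why a stuck agent can block provisioning. It assumes, for illustration, that the limit is a simple cap on the number of persisted agents; `CapSketch` and `AgentRecord` are hypothetical names, not plugin types.

```java
import java.util.List;

class CapSketch {
    // Hypothetical persisted agent record.
    static final class AgentRecord {
        final String name;
        final boolean suspended;

        AgentRecord(String name, boolean suspended) {
            this.name = name;
            this.suspended = suspended;
        }
    }

    /** Assumption: every persisted agent counts toward the cap, suspended or not. */
    static boolean canProvision(List<AgentRecord> persisted, int cap) {
        return persisted.size() < cap;
    }

    /** Agents that can actually accept work. */
    static long usable(List<AgentRecord> persisted) {
        return persisted.stream().filter(a -> !a.suspended).count();
    }
}
```

With a cap of 1 and a single agent left suspended by a restart, `canProvision` returns false while `usable` returns 0: no capacity is available and none can be added until the retention timeout removes the agent.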

      To fix this, the launcher could complete right after creating the pod, leaving the agent suspended.
      A global event handler should then be registered on startup to watch pod status. It would unsuspend each agent as soon as the corresponding pod is ready to be used, or terminate it if it exceeds the provisioning timeout. Because the handler is registered on every startup, it would behave the same even if Jenkins is restarted in between.
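One possible shape for such a handler, as a minimal sketch. All names here (`PendingAgent`, `PodReadinessReconciler`, `track`, `onPodStatus`, `expire`) are hypothetical, and the sketch assumes the agent's creation time is persisted so that the timeout can still be evaluated after a restart.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical persisted agent state; not a kubernetes-plugin type.
class PendingAgent {
    final String podName;
    final Instant createdAt;            // assumed to survive a restart
    volatile boolean suspended = true;
    volatile boolean terminated = false;

    PendingAgent(String podName, Instant createdAt) {
        this.podName = podName;
        this.createdAt = createdAt;
    }
}

class PodReadinessReconciler {
    private final Map<String, PendingAgent> byPod = new ConcurrentHashMap<>();
    private final Duration provisioningTimeout;

    PodReadinessReconciler(Duration provisioningTimeout) {
        this.provisioningTimeout = provisioningTimeout;
    }

    /** Called after each pod creation, and on startup for every persisted agent. */
    void track(PendingAgent agent) {
        if (agent.suspended && !agent.terminated) {
            byPod.put(agent.podName, agent);
        }
    }

    /** Hypothetical callback for a pod status event from the cluster watch. */
    void onPodStatus(String podName, boolean allContainersReady) {
        if (!allContainersReady) {
            return;
        }
        PendingAgent agent = byPod.remove(podName);
        if (agent != null) {
            agent.suspended = false;    // pod is ready: the agent may take work
        }
    }

    /** Periodic sweep: terminates agents that exceeded the provisioning timeout. */
    void expire(Instant now) {
        byPod.values().removeIf(agent -> {
            boolean late = Duration.between(agent.createdAt, now)
                    .compareTo(provisioningTimeout) > 0;
            if (late) {
                agent.terminated = true;
            }
            return late;
        });
    }
}
```

The design point is that the reconciler holds no state that cannot be rebuilt: after a restart, a fresh instance only needs the persisted agents passed to `track` again, and subsequent pod events or sweeps resolve them either way.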

          There are no comments yet on this issue.

            Assignee: Unassigned
            Reporter: Vincent Latombe (vlatombe)
            Votes: 1
            Watchers: 3