[JENKINS-47561] Pipelines wait indefinitely for kubernetes slaves to come back online

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • kubernetes-plugin
    • None

      Our situation

      We run Jenkins Master and ephemeral slaves in Kubernetes clusters and love it. However, we quite regularly run into stalled pipelines that wait indefinitely for an ephemeral Kubernetes slave to come back online, which will never happen. I have no 100% reproducible steps because this does not happen every time; on average it affects about one in 15 slaves. A slave will inexplicably disconnect/crash/be decommissioned, and the Jenkins pipeline using it thinks, "Ok, durable task means I wait for the slave that just went offline to come back online... forever." And so it does. :/

      Someone else has opened a ticket with the Jenkins Master team requesting an option to flag a job/task as not durable; however, they make clear that this behavior needs to be addressed by the plugin as well.

      The Story "Requirement"

      This may be a bug or a feature/improvement request, depending on how your team has intended the plugin to function. I have flagged this as a bug because I do not believe this behavior is intended.

      A pipeline step/task running on a Kubernetes slave should fail whenever that slave goes offline.
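
The requested behavior amounts to bounding the wait with a grace period instead of waiting forever. Below is a minimal sketch in Python of that idea, not the plugin's actual implementation: `wait_for_agent`, `AgentOfflineError`, and the `is_online` callback are hypothetical names used only for illustration.

```python
import time
from typing import Callable


class AgentOfflineError(RuntimeError):
    """Raised when an agent stays offline past the grace period (hypothetical)."""


def wait_for_agent(
    is_online: Callable[[], bool],
    grace_seconds: float,
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Return once the agent is online; fail if it stays offline too long."""
    deadline = clock() + grace_seconds
    while not is_online():
        if clock() >= deadline:
            # Fail the step instead of waiting forever.
            raise AgentOfflineError(
                f"agent still offline after {grace_seconds} seconds"
            )
        sleep(poll_seconds)
```

With a bounded wait like this, a step whose slave never returns would end in an error that the pipeline can report, rather than stalling.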


          Sam Beckwith III added a comment -

          This may be a duplicate. I searched many different ways through all the Jira issues on this plugin (regardless of status) and did not find this particular issue.

          We very much like the kubernetes plugin, so thank you very much for making it publicly available.

          Carlos Sanchez added a comment -

          Possibly the cause is a duplicate of JENKINS-47476.

          Carlos Sanchez added a comment -

          I don't see this problem with the latest versions of the plugin. Reopen if that is not the case.

          Rishi Thakkar added a comment -

          I still see this issue in the newest version of the plugin.

          The pod runs out of ephemeral storage and the JNLP agent dies. Then, the build step gets stuck indefinitely.


          Arun Kaushik added a comment -

          We are using version 1.14.9 of kubernetes-plugin and still face this issue. We pass CPU/memory limits in the container template, and when those limits are reached the container is killed, leaving the build stalled forever. It seems the Jenkins master never learns that the slave was decommissioned intentionally and that it has to move on and fail the build.
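
The failure modes reported in this thread (eviction on ephemeral storage, containers killed at their limits) leave traces in the pod status. As a rough sketch, assuming the pod is available as a dict shaped like the output of `kubectl get pod <name> -o json`, a check for "this agent is not coming back" might look as follows. The function name `unrecoverable_reason` is hypothetical, and this is not how the plugin itself detects dead agents.

```python
from typing import Optional


def unrecoverable_reason(pod: dict) -> Optional[str]:
    """Return why a pod will not come back, or None if no such sign is found."""
    status = pod.get("status", {})
    # Evicted pods (for example, under ephemeral-storage pressure) end up in
    # phase "Failed", typically with reason "Evicted".
    if status.get("phase") == "Failed":
        return status.get("reason") or "Failed"
    for cs in status.get("containerStatuses", []):
        # Check both the current and the previous container state.
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                return f"container {cs.get('name')} OOMKilled"
    return None
```

A watcher could run a check like this against the agent pod and abort the build with the returned reason instead of leaving the step waiting.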


          David Schott added a comment - edited

          I encountered this same issue while testing the EC2 Spot integration in the CloudBees CI on AWS Quick Start.

          jglick tweeted about the fix here.


            Assignee: Unassigned
            Reporter: Sam Beckwith III (sbeckwithiii)