  1. Jenkins
  2. JENKINS-50306

"Still waiting to schedule task" indicates a flaw in the Jenkins pipelining design in my opinion


      Description

      In my Jenkins server, I sometimes have a little bit of queueing and you'll see messages like this in a pipeline which has recently been submitted:

      Still waiting to schedule task Waiting for next available executor on ip-172-31-141-11.us-west-2.compute.internal

      I use node labels in order to select which agents may be used. However, the above message suggests to me that the Jenkins pipeline runner makes a selection as to which agent will receive the job at the moment it encounters the node() command in my Jenkinsfile.

      I believe this logic is flawed because the particular node in question (ip-172-31-141-11.us-west-2.compute.internal) might get killed off while its current job is still running. My queued job would then be stuck waiting forever, because there is almost no chance that AWS will relaunch a node with the same hostname.

      A better strategy would be one in which I request a node via something like node("mac") and Jenkins then tells me that it's waiting to schedule an executor on the next node labeled "mac", as opposed to selecting an individual machine which might go away.
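
The difference between the two strategies can be sketched with a toy Python model. This is only an illustration of the argument above, not Jenkins' actual scheduler: `Agent`, `pick_pinned`, `pick_by_label`, and the agent names `agent-old` and `agent-new` are all hypothetical. A queue item pinned to one hostname can never start once that agent is gone, whereas an item that only carries a label can be matched to any usable agent with that label.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    labels: set = field(default_factory=set)
    online: bool = True
    free_executors: int = 1


def usable(agent):
    """An agent can take work only if it is online and has a free executor."""
    return agent.online and agent.free_executors > 0


def pick_pinned(agents, hostname):
    """Return the agent with exactly this hostname, or None if it is unusable or gone."""
    for agent in agents:
        if agent.name == hostname and usable(agent):
            return agent
    return None


def pick_by_label(agents, label):
    """Return the first usable agent that carries the requested label, or None."""
    for agent in agents:
        if label in agent.labels and usable(agent):
            return agent
    return None


# Hypothetical situation: "agent-old" was terminated and a replacement
# with the same label came up under a different name.
agents = [Agent("agent-new", {"mac"})]

pinned = pick_pinned(agents, "agent-old")
labeled = pick_by_label(agents, "mac")

print(pinned)        # the pinned request finds nothing and keeps waiting
print(labeled.name)  # the label-based request is served by the replacement
```

Under these assumptions the pinned request stays queued for as long as no agent named `agent-old` exists, while the label-based request proceeds as soon as any "mac" agent has a free executor.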

          Activity

          abayer Andrew Bayer added a comment -

          Yes, when the node step is executed, it goes onto the agent. If you're seeing the build start, get to the node step, and immediately get that message, something's wrong somewhere. There is, sadly, a common case for that message: a master restart with running Pipelines resuming. Often, dynamically provisioned agents no longer exist, but Jenkins still tries to resume Pipelines onto the agents they were on at the time the master stopped, so things get gummed up. If that's not the situation where you're seeing this, is there any chance you could include your Jenkinsfile, or even better a minimal reproduction case? Oh, and are you using the throttle step by any chance?

          piratejohnny Jon B added a comment - - edited

          Indeed, I am using the "Throttle Concurrent Builds Plug-in" because my monolith's main pipeline is very busy, and if I don't throttle, the agent capacity would get slathered across too many concurrent job runs, making them all take forever.

          My pipeline logic is quite complex, but I'm pretty sure the cause must be a flaw in the concurrent builds plugin. If you think it is essential for triage, I will create a watered-down pipeline that repros this behavior, but I'll hold off on that until you confirm it's worth my time to produce such a Jenkinsfile.

          abayer Andrew Bayer added a comment -

          Yup, it's throttle-concurrent-builds. It'd be worth knowing how you have the throttling category set up, though (are you saying to run only X jobs in this category across all nodes, or on a single node, etc.).

          lifeofguenter Günter Grodotzki added a comment -

          I am seeing this issue as well, especially on cron-triggered builds. It does not happen to all triggered builds, though:

          [Pipeline] node
          Still waiting to schedule task
          ‘Jenkins Prebuilt Slave (sir-2c3r7s3n)’ is offline; ‘Jenkins Prebuilt Slave (sir-6z9r4yhm)’ is offline; ‘Jenkins Prebuilt Slave (sir-d4mg59im)’ is offline; ‘Jenkins Prebuilt Slave (sir-vt4g7crq)’ is offline

          Nodes are EC2 (spot) instances that are launched on demand. They are not failing to launch in general, as other builds work fine.

           

          abou Abdelkader Boumediene added a comment -

          Any news about this issue?

          I have this issue also.

          Note that if I "Replay" a build that froze with this error, it works fine.

          So the issue seems to occur only when the Pipeline is called from another

          larkoie Larkoie added a comment -

          We have the same issue:

          We have 4 permanent nodes that are managed by the Jenkins master, which is responsible for starting them when needed and shutting them down when idle. We often have 1 or 2 nodes online and 2 or 3 nodes offline while we have lots of builds in the queue.

          We are using throttle-concurrent-builds like this:

          label = example
          Maximum Total Concurrent Builds = 0
          Maximum Concurrent Builds Per Node = 1

          In the build logs we have:

          Still waiting to schedule task
          Already running 1 builds on node
          'node2' is offline
          'node3' is offline
          'node4' is offline

          When it's stuck like this, the job stays in the queue as long as the online node is busy, then runs on that node once it is free again, instead of Jenkins simply starting an offline node and assigning the build to it.

          It looks like a node is assigned to a job BEFORE the "throttle-concurrent-builds" rule is checked. So once a job is assigned to a node, if there is a throttle rule, Jenkins will not try to reassign it to another "offline" node.

          Our guess is that the throttle-concurrent-builds plugin should take precedence over node assignment.
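
The situation in this comment can be reproduced with a small Python toy model. It is a sketch under stated assumptions, not the plugin's real logic: `Node`, `why_blocked`, and `offline_nodes_to_start` are hypothetical names, and `node1` is assumed to be the single online node (the log above does not name it). The first function rebuilds the reasons shown in the build log; the second computes which offline nodes would have to be started for the queue to drain, which is the behavior the comment expects.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    online: bool
    running: int = 0  # throttled builds currently running on this node


def why_blocked(nodes, max_per_node):
    """Collect one reason per node that cannot accept another throttled build."""
    reasons = []
    for node in nodes:
        if not node.online:
            reasons.append(f"'{node.name}' is offline")
        elif max_per_node > 0 and node.running >= max_per_node:
            reasons.append(f"Already running {node.running} builds on node")
    return reasons


def offline_nodes_to_start(nodes, queued, max_per_node):
    """Names of offline nodes needed to run `queued` builds (requires max_per_node > 0)."""
    free_slots = sum(max(0, max_per_node - n.running) for n in nodes if n.online)
    missing = max(0, queued - free_slots)
    needed = -(-missing // max_per_node)  # ceiling division
    offline = [n.name for n in nodes if not n.online]
    return offline[:needed]


# Assumed state: one busy online node, three offline nodes, per-node limit of 1.
nodes = [
    Node("node1", online=True, running=1),
    Node("node2", online=False),
    Node("node3", online=False),
    Node("node4", online=False),
]

reasons = why_blocked(nodes, max_per_node=1)
to_start = offline_nodes_to_start(nodes, queued=2, max_per_node=1)

print(reasons)
print(to_start)
```

In this model every node reports a reason, so a queued build has no candidate until `node1` frees up or an offline node is started; with two builds queued, the second function names two offline nodes to start.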

          piratejohnny Jon B added a comment - - edited

          Since I originally posted this, we have become a CloudBees customer, and our CI now runs in Kubernetes pods.

          That being said, to answer Andrew Bayer's question about how the plugin is configured, I'm pasting a screenshot of how we tended to have it set up in our community Jenkins, which was using plain EC2 nodes as workers:


            People

            Assignee: Unassigned
            Reporter: piratejohnny Jon B
            Votes: 2
            Watchers: 6
