
Gearman won't schedule new jobs even though there are slots available on master

      We have a setup with one Jenkins master and Zuul triggers the job through the Jenkins Gearman plugin.

      Sometimes no new jobs will be scheduled even though all slots are available.

      A workaround for the slaves is to disconnect and reconnect the slave, after which new jobs are scheduled again.
      For the master, the only way to get jobs scheduled again is to restart the Jenkins service.

      When this happens on one node, jobs are still scheduled on other nodes.

      Attaching a server thread log for the gearman threads, taken when no jobs were running and jobs were waiting in the queue.
      Also attaching a truncated jenkins.log (produced with grep -C 2 10.33.14.26_manager).

      Let me know if you need more logs or other info, I would be happy to help

          [JENKINS-25867] Gearman won't schedule new jobs even though there are slots available on master

          Christian Bremer added a comment - Placed a bounty of 200$ for this issue on freedomsponsors. https://freedomsponsors.org/issue/595/gearman-wont-schedule-new-jobs-even-though-there-are-slots-available-on-master?alert=SPONSOR#

          Khai Do added a comment -

          It would help if you can test without zuul to verify that this is a gearman-plugin issue. I would try 2 things:
          1. Make sure the jenkins job is set to 'Execute concurrent builds if necessary', more info here: https://wiki.jenkins-ci.org/display/JENKINS/Gearman+Plugin?focusedCommentId=74875603#comment-74875603
          2. Instead of zuul, use the simple gearman client to schedule the jobs, https://github.com/zaro0508/gearman-plugin-client


          Christian Bremer added a comment - - edited

          1. We already have concurrent builds enabled on all jobs.
          2. We have the same issue when scheduling through the simple gearman client.
          The job will start when assigning it to another label than master.
          The job will not start when assigning it to master label, the label has 6 free executors.

          The command I executed was: python gear_client.py -s localhost -p 4730 --function build:~Update_Scripts --wait

          Thanks for the link to the simple gearman client, it was very neat to use to try things out.

          Please let me know if I can help in any other way.
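One way to narrow down a report like the one above is to ask the Gearman server which functions still have workers registered. The sketch below uses the plain-text "status" admin command of the Gearman protocol, whose reply lists each function as name, total jobs, running jobs and available workers, separated by tabs and terminated by a line containing a single ".". The host and port defaults come from the command quoted above; the helper names (parse_status, waiting_without_workers, fetch_status) are illustrative and are not part of the gearman plugin or of gear_client.py.

```python
import socket


def parse_status(text):
    """Parse the reply to the Gearman admin "status" command.

    Each line holds: function <TAB> total <TAB> running <TAB> workers.
    The reply ends with a line that contains only ".".
    """
    functions = {}
    for line in text.splitlines():
        if line == ".":
            break
        if not line.strip():
            continue
        # Split from the right so the function name is kept whole.
        name, total, running, workers = line.rsplit("\t", 3)
        functions[name] = {
            "total": int(total),
            "running": int(running),
            "workers": int(workers),
        }
    return functions


def waiting_without_workers(functions):
    """Return functions that have waiting jobs but no registered worker."""
    return sorted(
        name
        for name, stats in functions.items()
        if stats["total"] - stats["running"] > 0 and stats["workers"] == 0
    )


def fetch_status(host="localhost", port=4730, timeout=5.0):
    """Send "status" to a Gearman server and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        # Read until the terminating "." line arrives or the peer closes.
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("utf-8", errors="replace")
```

Calling waiting_without_workers(parse_status(fetch_status())) against the Gearman server would list functions such as build:~Update_Scripts if they have queued jobs and no worker. If workers are still listed for the function, the problem is more likely on the worker side (registered but not grabbing jobs) than in function registration.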


          Khai Do added a comment -

          Let's work on this in https://storyboard.openstack.org/#!/story/2000030

          Khai Do added a comment -

          Sorry to switch again, but storyboard isn't really working for us. Let's switch back to using the Jenkins issue tracker.


          Christian Bremer added a comment - - edited

          +1 for that decision
          Any update on the issue?


          Khai Do added a comment -

          I've noticed that there is a bug when using the "OFFLINE_NODE_WHEN_COMPLETE=true" parameter with multiple executors. In that case the node goes offline and the jobs get unregistered, but they are never re-registered, so no new job requests get executed. It looks like you have set up your Jenkins with multiple executors, so I was wondering whether you are using this parameter?


          Christian Bremer added a comment -

          No, we are not using "OFFLINE_NODE_WHEN_COMPLETE=true" in our instances.
          You are correct in your assumption that we use multiple executor slots on almost all nodes, although I have seen this on nodes with 1 slot as well.

          Khai Do added a comment - - edited

          A few deadlock issues[1][2] have been fixed but not released yet. [1] has been merged to master but [2] has not. Was wondering if you could checkout [2], build and test?

          Please use the updated gearman test client[3] to test.

          [1] https://review.openstack.org/#/c/179988
          [2] https://review.openstack.org/#/c/192429
          [3] https://github.com/zaro0508/gearman-plugin-client.git


          Christian Bremer added a comment -

          [3] does not seem to have been updated for the last 10 months, is it the correct link?

          Khai Do added a comment -

          Not sure what you mean. The project says the last update was on 06/16/2015: https://github.com/zaro0508/gearman-plugin-client/commit/befdbe1a143b117637c6f2c33c92f990c1b78848

          Maybe you need to clear your browser cache?


          Christian Bremer added a comment -

          Odd, I see that it is updated now, I will try it out and get back to you with feedback.

          Antoine Musso added a comment - - edited

          The original storyboard: https://storyboard.openstack.org/#!/story/2000030
          Our downstream bug: https://phabricator.wikimedia.org/T72597

          Our env is: Jenkins 1.596 , gearman-plugin 0.1.1-8-gf2024bd (from source).

          The only blocks we have are on a specific slave that happens to run matrix jobs which are triggered by the Jenkins internal scheduler. We do not use OFFLINE_NODE_WHEN_COMPLETE yet. Most importantly, I can not find a way to reproduce the issue reliably.

          I noticed the executor threads are held in a lock though (details and thread dump at https://phabricator.wikimedia.org/T72597#748059 ). And the computer is sometimes a NULL value:

          > The Jenkins logger for hudson.plugins.gearman.logger shows a spam of:
          >
          > Nov 26, 2014 10:24:21 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
          > AvailabilityMonitor canTake request for >SOME VALUE<
          >
          > Where >SOME VALUE< is null or one of the executor threads.

          An example:

          Jul 01, 2015 10:11:59 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
          AvailabilityMonitor canTake request for null
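To see how often canTake is asked about null compared with a real executor, the FINE records can be tallied from a saved copy of the logger output. This is only a sketch: it assumes the two-line record layout quoted above, and count_can_take_targets is a hypothetical helper name, not something the plugin provides.

```python
import re
from collections import Counter

# Matches the second line of each record, e.g.
# "AvailabilityMonitor canTake request for null".
CAN_TAKE = re.compile(r"AvailabilityMonitor canTake request for (.+)$")


def count_can_take_targets(log_text):
    """Count canTake requests per target ("null" or an executor name)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAN_TAKE.search(line.strip())
        if match:
            counts[match.group(1).strip()] += 1
    return counts
```

A count dominated by "null" would match the log spam described in this comment.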
          

          I have deployed the gearman-plugin with https://review.openstack.org/#/c/192429/ , but since I have no way to reproduce the issue I would not be able to confirm whether the issue is solved :-\


          Khai Do added a comment -

          Ahh makes sense now. As explicitly stated in the gearman plugin wiki (https://wiki.jenkins-ci.org/display/JENKINS/Gearman+Plugin), matrix jobs are not supported (known issues section). We don't use matrix projects therefore didn't put the effort into supporting it.


          Khai Do added a comment -

          @Christian, any updates on your end on this issue? Antoine reported that the related issue has been resolved.


          Khai Do added a comment -

          I believe this is fixed in version 0.1.2


          Antoine Musso added a comment - - edited

          It happened again with the gearman-plugin v0.1.2:

          Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
          AvailabilityMonitor canTake request for null
          Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
          AvailabilityMonitor canTake request for null
          Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
          AvailabilityMonitor canTake request for null
          

          With jobs tied to that instance being stuck waiting for an available executor on deployment-bastion.

          Marking the node offline and online doesn't remove the lock :-/

          The executor threads have:

          "Gearman worker deployment-bastion.eqiad_exec-1" prio=5 WAITING
          	java.lang.Object.wait(Native Method)
          	java.lang.Object.wait(Object.java:503)
          	hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
          	hudson.plugins.gearman.StartJobWorker.safeExecuteFunction(StartJobWorker.java:196)
          	hudson.plugins.gearman.StartJobWorker.executeFunction(StartJobWorker.java:114)
          	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:125)
          	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:22)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.submitFunction(MyGearmanWorkerImpl.java:593)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:328)
          	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
          	java.lang.Thread.run(Thread.java:745)
          
          "Gearman worker deployment-bastion.eqiad_exec-2" prio=5 TIMED_WAITING
          	java.lang.Object.wait(Native Method)
          	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
          	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
          	java.lang.Thread.run(Thread.java:745)
          
          "Gearman worker deployment-bastion.eqiad_exec-3" prio=5 TIMED_WAITING
          	java.lang.Object.wait(Native Method)
          	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
          	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
          	java.lang.Thread.run(Thread.java:745)
          
          "Gearman worker deployment-bastion.eqiad_exec-4" prio=5 TIMED_WAITING
          	java.lang.Object.wait(Native Method)
          	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
          	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
          	java.lang.Thread.run(Thread.java:745)
          
          "Gearman worker deployment-bastion.eqiad_exec-5" prio=5 TIMED_WAITING
          	java.lang.Object.wait(Native Method)
          	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
          	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
          	java.lang.Thread.run(Thread.java:745)
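The dump above can be condensed to one line per worker thread to make the pattern easier to see: exec-1 is WAITING in StartJobWorker.safeExecuteFunction, while exec-2 to exec-5 are TIMED_WAITING in NodeAvailabilityMonitor.lock. The following is a minimal sketch that assumes the dump layout shown here (a quoted thread name followed by prio and state, then one frame per line); summarize_threads is a hypothetical helper, not part of Jenkins or the plugin.

```python
import re

# Header line, e.g.: "Gearman worker host_exec-1" prio=5 WAITING
HEADER = re.compile(r'^"(?P<name>[^"]+)"\s+prio=\d+\s+(?P<state>[A-Z_]+)')
PLUGIN_PREFIX = "hudson.plugins.gearman."


def summarize_threads(dump_text):
    """Return (thread name, state, first gearman-plugin frame) per thread."""
    threads = []
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = {
                "name": header.group("name"),
                "state": header.group("state"),
                "frame": None,
            }
            threads.append(current)
        elif (
            current is not None
            and current["frame"] is None
            and line.startswith(PLUGIN_PREFIX)
        ):
            # Keep only the topmost frame that belongs to the plugin.
            current["frame"] = line
    return [(t["name"], t["state"], t["frame"]) for t in threads]
```

Run over a full Jenkins thread dump, this shows at a glance how many Gearman worker threads sit in NodeAvailabilityMonitor.lock and which one, if any, is blocked elsewhere.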
          


          Antoine Musso added a comment -

          The node is named deployment-bastion-eqiad, with a label deployment-bastion-eqiad. Jobs are tied to deployment-bastion-eqiad.

          The workaround I found was to remove the label from the node. Once done, the jobs show up in the queue with 'no node having label deployment-bastion-eqiad'.

          I then applied the label again on the host and the job managed to run.

          So maybe it is an issue in Jenkins itself :-}


          Antoine Musso added a comment -

          The deadlock still happens from time to time with Jenkins 1.625.3 LTS and Gearman plugin 1.3.3 with https://review.openstack.org/#/c/252768/


            zaro Khai Do
            ki82 Christian Bremer