
Gearman won't schedule new jobs even though there are slots available on master

      We have a setup with one Jenkins master and Zuul triggers the job through the Jenkins Gearman plugin.

      Sometimes no new jobs will be scheduled even though all slots are available.

      A workaround for the slaves is to disconnect and reconnect the slave, after which new jobs are scheduled again.
      For the master, the only way to get jobs scheduled again is to restart the Jenkins service.

      When this happens on one node, jobs are still scheduled on the other nodes.

      Attaching server thread log for gearman threads when no jobs are currently running and jobs are scheduled in the queue.
      Also attaching a truncated jenkins.log (produced with grep -C 2 10.33.14.26_manager).

      Let me know if you need more logs or other info; I would be happy to help.

          [JENKINS-25867] Gearman won't schedule new jobs even though there are slots available on master

          Christian Bremer added a comment -

          [3] does not seem to have been updated for the last 10 months, is it the correct link?

          Khai Do added a comment -

          Not sure what you mean. The project says the last update was on 06/16/2015: https://github.com/zaro0508/gearman-plugin-client/commit/befdbe1a143b117637c6f2c33c92f990c1b78848

          Maybe you need to clear your browser cache?


          Christian Bremer added a comment -

          Odd, I see that it is updated now, I will try it out and get back to you with feedback.

          Antoine Musso added a comment - - edited

          The original storyboard: https://storyboard.openstack.org/#!/story/2000030
          Our downstream bug: https://phabricator.wikimedia.org/T72597

          Our env is: Jenkins 1.596 , gearman-plugin 0.1.1-8-gf2024bd (from source).

          The only blocks we have are on a specific slave that happens to run matrix jobs, which are triggered by the Jenkins internal scheduler. We do not use OFFLINE_NODE_WHEN_COMPLETE yet. Most importantly, I cannot find a way to reproduce the issue reliably.

          I noticed that the executor threads are held in a lock, though (details and thread dump at https://phabricator.wikimedia.org/T72597#748059 ). The computer is also sometimes a null value:

          > The Jenkins logger for hudson.plugins.gearman.logger shows a spam of:
          >
          > Nov 26, 2014 10:24:21 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
          > AvailabilityMonitor canTake request for >SOME VALUE<
          >
          > Where >SOME VALUE< is null or one of the executor threads.

          An example:

          Jul 01, 2015 10:11:59 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
          AvailabilityMonitor canTake request for null
          

          I have deployed the gearman-plugin with https://review.openstack.org/#/c/192429/ , but since I have no way to reproduce the issue, I cannot confirm whether it is solved :-\
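Since the issue cannot be reproduced on demand, one way to notice a recurrence is to tally the `canTake request for null` entries quoted above in a saved log. The following is only a minimal sketch: the class and method names are hypothetical, the marker string is copied from the log excerpt in this thread, and it assumes the message sits on its own line as in that excerpt.

```java
import java.util.Arrays;
import java.util.List;

public class CanTakeLogScan {
    // Marker text copied from the log excerpt in this thread.
    static final String NULL_REQUEST = "AvailabilityMonitor canTake request for null";

    /** Counts log lines that report a canTake request for a null computer. */
    static long countNullRequests(List<String> lines) {
        return lines.stream()
                .filter(line -> line.trim().equals(NULL_REQUEST))
                .count();
    }

    public static void main(String[] args) {
        // Sample input shaped like the FINE entries quoted in this thread.
        List<String> sample = Arrays.asList(
                "Jul 01, 2015 10:11:59 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake",
                "AvailabilityMonitor canTake request for null",
                "Jul 01, 2015 10:12:00 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake",
                "AvailabilityMonitor canTake request for null");
        System.out.println(countNullRequests(sample));
    }
}
```

A count that keeps growing while jobs sit in the queue would match the symptom described here, but it is not proof of the lock on its own; a thread dump is still needed to confirm it.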


          Khai Do added a comment -

          Ahh, that makes sense now. As explicitly stated in the known issues section of the gearman plugin wiki (https://wiki.jenkins-ci.org/display/JENKINS/Gearman+Plugin), matrix jobs are not supported. We don't use matrix projects, so we didn't put the effort into supporting them.


          Khai Do added a comment -

          @Christian, any updates on your end on this issue? Antoine reported that the related issue has been resolved.


          Khai Do added a comment -

          I believe this is fixed in version 0.1.2


          Antoine Musso added a comment - - edited

          It happened again with gearman-plugin v0.1.2:

          Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
          AvailabilityMonitor canTake request for null
          Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
          AvailabilityMonitor canTake request for null
          Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
          AvailabilityMonitor canTake request for null
          

          Jobs tied to that instance are stuck waiting for an available executor on deployment-bastion.

          Marking the node offline and online doesn't remove the lock :-/

          The executor threads have:

          "Gearman worker deployment-bastion.eqiad_exec-1" prio=5 WAITING
          	java.lang.Object.wait(Native Method)
          	java.lang.Object.wait(Object.java:503)
          	hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
          	hudson.plugins.gearman.StartJobWorker.safeExecuteFunction(StartJobWorker.java:196)
          	hudson.plugins.gearman.StartJobWorker.executeFunction(StartJobWorker.java:114)
          	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:125)
          	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:22)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.submitFunction(MyGearmanWorkerImpl.java:593)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:328)
          	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
          	java.lang.Thread.run(Thread.java:745)
          
          "Gearman worker deployment-bastion.eqiad_exec-2" prio=5 TIMED_WAITING
          	java.lang.Object.wait(Native Method)
          	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
          	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
          	java.lang.Thread.run(Thread.java:745)
          
          "Gearman worker deployment-bastion.eqiad_exec-3" prio=5 TIMED_WAITING
          	java.lang.Object.wait(Native Method)
          	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
          	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
          	java.lang.Thread.run(Thread.java:745)
          
          "Gearman worker deployment-bastion.eqiad_exec-4" prio=5 TIMED_WAITING
          	java.lang.Object.wait(Native Method)
          	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
          	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
          	java.lang.Thread.run(Thread.java:745)
          
          "Gearman worker deployment-bastion.eqiad_exec-5" prio=5 TIMED_WAITING
          	java.lang.Object.wait(Native Method)
          	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
          	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
          	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
          	java.lang.Thread.run(Thread.java:745)
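
To make a dump like the one above easier to read, the sketch below reduces each thread to the first frame outside `java.lang`, which is the call the thread is actually parked in. It is a hypothetical helper, not part of the plugin: the class and method names are invented for illustration, and it assumes the dump format shown here (a quoted thread name followed by one frame per line).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GearmanDumpSummary {
    /**
     * Maps each quoted thread name to its first frame outside java.lang,
     * preserving the order in which threads appear in the dump.
     */
    static Map<String, String> blockingFrames(List<String> dumpLines) {
        Map<String, String> result = new LinkedHashMap<>();
        String current = null;
        for (String raw : dumpLines) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // Header line: the thread name is the text between the first two quotes.
                int end = line.indexOf('"', 1);
                current = end > 0 ? line.substring(1, end) : null;
            } else if (current != null && !line.isEmpty()
                    && !line.startsWith("java.lang.")
                    && !result.containsKey(current)) {
                result.put(current, line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two threads abbreviated from the dump in this thread.
        List<String> dump = Arrays.asList(
                "\"Gearman worker deployment-bastion.eqiad_exec-1\" prio=5 WAITING",
                "\tjava.lang.Object.wait(Native Method)",
                "\tjava.lang.Object.wait(Object.java:503)",
                "\thudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)",
                "",
                "\"Gearman worker deployment-bastion.eqiad_exec-2\" prio=5 TIMED_WAITING",
                "\tjava.lang.Object.wait(Native Method)",
                "\thudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)");
        blockingFrames(dump).forEach((name, frame) -> System.out.println(name + " -> " + frame));
    }
}
```

Applied to the full dump, this would show exec-1 waiting in `AsyncFutureImpl.get` while exec-2 through exec-5 wait in `NodeAvailabilityMonitor.lock`. Whether exec-1 is the thread holding that lock is a guess the dump alone does not confirm.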
          


          Antoine Musso added a comment -

          The node is named deployment-bastion-eqiad, with a label deployment-bastion-eqiad. Jobs are tied to deployment-bastion-eqiad.

          The workaround I found was to remove the label from the node. Once done, the jobs show up in the queue with 'no node having label deployment-bastion-eqiad'.

          I then applied the label again on the host and the job managed to run.

          So maybe it is an issue in Jenkins itself :-}


          Antoine Musso added a comment -

          The deadlock still happens from time to time with Jenkins 1.625.3 LTS and Gearman plugin 1.3.3 with https://review.openstack.org/#/c/252768/


            zaro Khai Do
            ki82 Christian Bremer