Jenkins / JENKINS-25867

Gearman won't schedule new jobs even though there are slots available on master


Details

    Description

      We have a setup with one Jenkins master, and Zuul triggers jobs through the Jenkins Gearman plugin.

      Sometimes no new jobs will be scheduled even though all slots are available.

      A workaround for the slaves is to disconnect and reconnect the slave, after which new jobs are scheduled again.
      For the master, the only way to get jobs scheduled again is to restart the Jenkins service.

      When this happens on one node, jobs are still scheduled on other nodes.

      Attaching the server thread log for the Gearman threads, captured while no jobs are running and jobs are waiting in the queue.
      Also attaching a truncated jenkins.log (produced with grep -C 2 10.33.14.26_manager).

      Let me know if you need more logs or other info; I would be happy to help.

      Attachments

        Issue Links

          Activity

            ki82 Christian Bremer created issue -
            ki82 Christian Bremer added a comment - Placed a bounty of $200 for this issue on FreedomSponsors. https://freedomsponsors.org/issue/595/gearman-wont-schedule-new-jobs-even-though-there-are-slots-available-on-master?alert=SPONSOR#
            ki82 Christian Bremer made changes -
            Field Original Value New Value
            Assignee Khai Do [ zaro ]
            zaro Khai Do added a comment -

            It would help if you could test without Zuul to verify that this is a gearman-plugin issue. I would try 2 things:
            1. Make sure the Jenkins job is set to 'Execute concurrent builds if necessary', more info here: https://wiki.jenkins-ci.org/display/JENKINS/Gearman+Plugin?focusedCommentId=74875603#comment-74875603
            2. Instead of Zuul, use the simple gearman client to schedule the jobs: https://github.com/zaro0508/gearman-plugin-client

            ki82 Christian Bremer added a comment - - edited

            1. We already have concurrent builds enabled on all jobs.
            2. We have the same issue when scheduling through the simple gearman client.
            The job will start when assigned to a label other than master.
            The job will not start when assigned to the master label, even though the label has 6 free executors.

            The command I executed was: python gear_client.py -s localhost -p 4730 --function build:~Update_Scripts --wait

            Thanks for the link to the simple gearman client; it was very handy for trying things out.

            Please let me know if I can help in any other way.

            zaro Khai Do added a comment - Let's work on this in https://storyboard.openstack.org/#!/story/2000030
            zaro Khai Do added a comment -

            Sorry to switch again, but Storyboard isn't really working for us. Let's switch back to using the Jenkins issue tracker.

            ki82 Christian Bremer added a comment - - edited

            +1 for that decision
            Any update on the issue?

            zaro Khai Do added a comment -

            I've noticed that there is a bug when using the "OFFLINE_NODE_WHEN_COMPLETE=true" parameter with multiple executors: the node goes offline and the jobs get unregistered, but they are never re-registered, so no new job requests get executed. It looks like you have set up your Jenkins with multiple executors, so I was wondering whether you are using this parameter?
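            The unregister-without-re-register lifecycle described above can be illustrated with a minimal Python sketch. All names here (Worker, on_node_offline, on_node_online) are hypothetical, not the gearman-plugin's real API; the point is only the shape of the bug.

```python
# Illustrative sketch only: hypothetical names, not the plugin's real code.

class Worker:
    def __init__(self):
        # Functions this worker has registered with the Gearman server.
        self.functions = {"build:~Update_Scripts"}

    def on_node_offline(self):
        # Observed behaviour: going offline unregisters all functions.
        self.functions.clear()

    def on_node_online(self):
        # Suspected bug: nothing re-registers the functions here, so the
        # server never hands this worker another job request.
        pass


w = Worker()
w.on_node_offline()
w.on_node_online()
print(w.functions)  # set() - still empty after coming back online
```

            With the re-registration step missing, the node looks healthy in Jenkins but the Gearman server has no functions registered for it, which matches "no new job requests get executed".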

            zaro Khai Do made changes -
            Link This issue is related to JENKINS-28891 [ JENKINS-28891 ]

            ki82 Christian Bremer added a comment -

            No, we are not using "OFFLINE_NODE_WHEN_COMPLETE=true" in our instances.
            You are correct that we use multiple executor slots on almost all nodes, although I have seen this on nodes with 1 slot as well.
            zaro Khai Do added a comment - - edited

            A few deadlock issues [1][2] have been fixed but not released yet. [1] has been merged to master but [2] has not. Was wondering if you could check out [2], build, and test?

            Please use the updated gearman test client[3] to test.

            [1] https://review.openstack.org/#/c/179988
            [2] https://review.openstack.org/#/c/192429
            [3] https://github.com/zaro0508/gearman-plugin-client.git


            ki82 Christian Bremer added a comment -

            [3] does not seem to have been updated in the last 10 months; is it the correct link?
            zaro Khai Do added a comment -

            Not sure what you mean. The project says the last update was on 06/16/2015: https://github.com/zaro0508/gearman-plugin-client/commit/befdbe1a143b117637c6f2c33c92f990c1b78848

            Maybe you need to clear your browser cache?


            ki82 Christian Bremer added a comment -

            Odd, I see that it is updated now. I will try it out and get back to you with feedback.
            hashar Antoine Musso added a comment - - edited

            The original storyboard: https://storyboard.openstack.org/#!/story/2000030
            Our downstream bug: https://phabricator.wikimedia.org/T72597

            Our env is: Jenkins 1.596 , gearman-plugin 0.1.1-8-gf2024bd (from source).

            The only blocks we have are on a specific slave that happens to run matrix jobs, which are triggered by the Jenkins internal scheduler. We do not use OFFLINE_NODE_WHEN_COMPLETE yet. Most importantly, I cannot find a way to reproduce the issue reliably.

            I noticed the executor threads are held in a lock though (details and thread dumps at https://phabricator.wikimedia.org/T72597#748059 ). And the computer is sometimes a NULL value:

            > The Jenkins logger for hudson.plugins.gearman.logger shows a spam of:
            >
            > Nov 26, 2014 10:24:21 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
            > AvailabilityMonitor canTake request for >SOME VALUE<
            >
            > Where >SOME VALUE< is null or one of the executor threads.

            An example:

            Jul 01, 2015 10:11:59 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
            AvailabilityMonitor canTake request for null
            

            I have deployed the gearman-plugin with https://review.openstack.org/#/c/192429/ , but since I have no way to reproduce the issue I cannot confirm whether it is solved :-\
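            The threads parked in NodeAvailabilityMonitor.lock suggest a lost-wakeup pattern: every executor worker waits for a holder that is never released. A minimal Python sketch of that pattern follows; it is illustrative only, written against the stdlib, and mirrors the plugin's names without being its actual implementation.

```python
import threading

class AvailabilityMonitor:
    """Illustrative sketch (not the plugin's real code) of a monitor that
    lets one worker at a time proceed to send GRAB_JOB."""

    def __init__(self):
        self._cond = threading.Condition()
        self._holder = None  # worker currently allowed to grab a job

    def lock(self, worker, timeout=5.0):
        # Workers block here before grabbing a job. If the holder is never
        # released (e.g. a completed-build callback is lost), every other
        # worker keeps timing out and retrying, which looks exactly like
        # the TIMED_WAITING frames at NodeAvailabilityMonitor.lock.
        with self._cond:
            while self._holder is not None:
                if not self._cond.wait(timeout=timeout):
                    return False  # timed out; caller retries
            self._holder = worker
            return True

    def release(self, worker):
        with self._cond:
            if self._holder == worker:
                self._holder = None
                self._cond.notify_all()


m = AvailabilityMonitor()
assert m.lock("exec-1")                    # first worker proceeds
assert not m.lock("exec-2", timeout=0.1)   # blocked while exec-1 holds it
m.release("exec-1")
assert m.lock("exec-2")                    # released, so exec-2 proceeds
```

            If release() is never called for the holding worker, every other executor thread loops in wait() indefinitely, consistent with the thread dump linked above.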

            zaro Khai Do added a comment -

            Ahh, that makes sense now. As explicitly stated in the gearman plugin wiki (https://wiki.jenkins-ci.org/display/JENKINS/Gearman+Plugin, known issues section), matrix jobs are not supported. We don't use matrix projects, so we didn't put the effort into supporting them.

            zaro Khai Do added a comment -

            @Christian, any updates on your end on this issue? Antoine reported that the related issue has been resolved.

            zaro Khai Do added a comment -

            I believe this is fixed in version 0.1.2

            zaro Khai Do made changes -
            Assignee Khai Do [ zaro ] Christian Bremer [ ki82 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            hashar Antoine Musso added a comment - - edited

            It happened again with gearman-plugin v0.1.2:

            Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
            AvailabilityMonitor canTake request for null
            Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
            AvailabilityMonitor canTake request for null
            Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
            AvailabilityMonitor canTake request for null
            

            Jobs tied to that instance are stuck waiting for an available executor on deployment-bastion.

            Marking the node offline and online doesn't remove the lock :-/

            The executor threads have:

            "Gearman worker deployment-bastion.eqiad_exec-1" prio=5 WAITING
            	java.lang.Object.wait(Native Method)
            	java.lang.Object.wait(Object.java:503)
            	hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
            	hudson.plugins.gearman.StartJobWorker.safeExecuteFunction(StartJobWorker.java:196)
            	hudson.plugins.gearman.StartJobWorker.executeFunction(StartJobWorker.java:114)
            	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:125)
            	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:22)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.submitFunction(MyGearmanWorkerImpl.java:593)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:328)
            	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
            	java.lang.Thread.run(Thread.java:745)
            
            "Gearman worker deployment-bastion.eqiad_exec-2" prio=5 TIMED_WAITING
            	java.lang.Object.wait(Native Method)
            	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
            	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
            	java.lang.Thread.run(Thread.java:745)
            
            "Gearman worker deployment-bastion.eqiad_exec-3" prio=5 TIMED_WAITING
            	java.lang.Object.wait(Native Method)
            	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
            	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
            	java.lang.Thread.run(Thread.java:745)
            
            "Gearman worker deployment-bastion.eqiad_exec-4" prio=5 TIMED_WAITING
            	java.lang.Object.wait(Native Method)
            	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
            	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
            	java.lang.Thread.run(Thread.java:745)
            
            "Gearman worker deployment-bastion.eqiad_exec-5" prio=5 TIMED_WAITING
            	java.lang.Object.wait(Native Method)
            	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
            	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
            	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
            	java.lang.Thread.run(Thread.java:745)
            
            hashar Antoine Musso made changes -
            Assignee Christian Bremer [ ki82 ] Khai Do [ zaro ]
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            hashar Antoine Musso added a comment -

            The node is named deployment-bastion-eqiad, with a label deployment-bastion-eqiad. Jobs are tied to deployment-bastion-eqiad.

            The workaround I found was to remove the label from the node. Once done, the jobs show in the queue with 'no node having label deployment-bastion-eqiad'.

            I then applied the label again on the host and the job managed to run.

            So maybe it is an issue in Jenkins itself :-}

            hashar Antoine Musso made changes -
            Environment Original Value: Jenkins 1.580.1 LTS; Gearman plugin 0.1.1
            New Value: Jenkins 1.580.1 LTS and Gearman plugin 0.1.1; Jenkins 1.625.3 LTS and Gearman plugin 1.3.3 with https://review.openstack.org/#/c/252768/
            hashar Antoine Musso added a comment -

            The deadlock still happens from time to time with Jenkins 1.625.3 LTS and Gearman plugin 1.3.3 with https://review.openstack.org/#/c/252768/

            hashar Antoine Musso added a comment - The deadlock still happens from time to time with Jenkins 1.625.3 LTS and Gearman plugin 1.3.3 with https://review.openstack.org/#/c/252768/
            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 159848 ] JNJira + In-Review [ 186251 ]

People

  zaro Khai Do
  ki82 Christian Bremer
  Votes: 0
  Watchers: 5