Type: Bug
Resolution: Unresolved
Priority: Critical
Labels: None
Environment:
Jenkins 1.580.1 LTS and Gearman plugin 0.1.1
Jenkins 1.625.3 LTS and Gearman plugin 1.3.3 with https://review.openstack.org/#/c/252768/
We have a setup with one Jenkins master, and Zuul triggers jobs through the Jenkins Gearman plugin.
Sometimes no new jobs will be scheduled even though all executor slots are available.
A workaround for the slaves is to disconnect/reconnect the slave, after which new jobs are scheduled again.
For the master, the only way to get jobs scheduled again is to restart the Jenkins service.
When this happens on one node, jobs are still scheduled on the other nodes.
Attaching the server thread log for the gearman threads, taken while no jobs are running and jobs are waiting in the queue.
Also attaching a truncated jenkins.log (filtered with grep -C 2 10.33.14.26_manager).
Let me know if you need more logs or other info; I would be happy to help.
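For reference, one quick way to check whether the plugin's build functions are still registered on the Gearman server, and whether any workers are advertised as available, is to query the server's text admin interface. A minimal sketch, assuming the standard Gearman admin protocol on the default port 4730 (host and port are placeholders, adjust for your setup):

import socket

# Minimal diagnostic sketch: ask the Gearman server for its 'status' listing,
# which shows each registered function with job and worker counts.
# Assumes the standard Gearman text admin protocol on the default port 4730.
def gearman_status(host='localhost', port=4730):
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(b'status\n')
        data = b''
        # The listing is terminated by a line containing a single '.'
        while not data.endswith(b'.\n'):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    # Each line: <function>\t<total jobs>\t<running>\t<available workers>
    for line in data.decode().splitlines():
        if line != '.':
            print(line)

if __name__ == '__main__':
    gearman_status()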
Is related to: JENKINS-28891 Bringing slaves online after running a build does not re-register gearman jobs (Resolved)

[JENKINS-25867] Gearman won't schedule new jobs even though there are slots available on master
It would help if you could test without Zuul to verify that this is a gearman-plugin issue. I would try two things:
1. Make sure the Jenkins job is set to 'Execute concurrent builds if necessary'; more info here: https://wiki.jenkins-ci.org/display/JENKINS/Gearman+Plugin?focusedCommentId=74875603#comment-74875603
2. Instead of Zuul, use the simple gearman client to schedule the jobs (see the sketch after this list): https://github.com/zaro0508/gearman-plugin-client
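If the gear_client.py script is not handy, roughly the same submission can be done directly with the OpenStack gear client library. A minimal sketch, assuming gear is installed (pip install gear); the function name build:my_job and the empty JSON parameter payload are placeholders, not taken from this issue:

import time

import gear  # OpenStack 'gear' client library (pip install gear)

# Minimal sketch: submit a Gearman job straight to the server the Jenkins
# Gearman plugin is connected to, bypassing Zuul entirely.
# 'build:my_job' is a placeholder; the plugin registers a 'build:<jobname>'
# function for each Jenkins job.
client = gear.Client()
client.addServer('localhost', 4730)
client.waitForServer()

# Empty JSON dict = no build parameters (an assumption about the payload format).
job = gear.Job(b'build:my_job', b'{}')
client.submitJob(job)

# Poll until the build finishes, roughly mirroring gear_client.py's --wait flag.
while not job.complete:
    time.sleep(1)
print('FAILED' if job.failure else 'COMPLETED')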
1. We already have concurrent builds enabled on all jobs.
2. We have the same issue when scheduling through the simple gearman client.
The job will start when assigned to a label other than master.
The job will not start when assigned to the master label, even though that label has 6 free executors.
The command I executed was: python gear_client.py -s localhost -p 4730 --function build:~Update_Scripts --wait
Thanks for the link to the simple gearman client, it was very handy for trying things out.
Please let me know if I can help in any other way.
Sorry to switch again, but Storyboard isn't really working for us. Let's switch back to using the Jenkins issue tracker.
I've noticed that there is a bug when using the "OFFLINE_NODE_WHEN_COMPLETE=true" parameter with multiple executors. When used with multiple executors, the node goes offline and the jobs get unregistered, but they never get re-registered, so no new job requests get executed. It looks like you have set up your Jenkins with multiple executors, so I was wondering whether you are using this parameter?
No, we are not using "OFFLINE_NODE_WHEN_COMPLETE=true" in our instances.
You are correct that we use multiple executor slots on almost all nodes, although I have seen this on nodes with a single slot as well.
A few deadlock issues [1][2] have been fixed but not released yet. [1] has been merged to master, but [2] has not. I was wondering if you could check out [2], build, and test?
Please use the updated gearman test client[3] to test.
[1] https://review.openstack.org/#/c/179988
[2] https://review.openstack.org/#/c/192429
[3] https://github.com/zaro0508/gearman-plugin-client.git
[3] does not seem to have been updated in the last 10 months; is it the correct link?
Not sure what you mean. The project says the last update was on 06/16/2015: https://github.com/zaro0508/gearman-plugin-client/commit/befdbe1a143b117637c6f2c33c92f990c1b78848
Maybe you need to clear your browser cache?
Odd, I see that it is updated now; I will try it out and get back to you with feedback.
The original storyboard: https://storyboard.openstack.org/#!/story/2000030
Our downstream bug: https://phabricator.wikimedia.org/T72597
Our env is: Jenkins 1.596, gearman-plugin 0.1.1-8-gf2024bd (built from source).
The only blocks we have are on a specific slave that happens to run matrix jobs, which are triggered by the Jenkins internal scheduler. We do not use OFFLINE_NODE_WHEN_COMPLETE yet. Most importantly, I cannot find a way to reproduce the issue reliably.
I noticed the executor threads are held in a lock, though (details and thread dump at https://phabricator.wikimedia.org/T72597#748059). And the computer is sometimes a NULL value:
> The Jenkins logger for hudson.plugins.gearman.logger shows a spam of:
>
> Nov 26, 2014 10:24:21 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
> AvailabilityMonitor canTake request for >SOME VALUE<
>
> Where >SOME VALUE< is null or one of the executor threads.
An example:
Jul 01, 2015 10:11:59 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake AvailabilityMonitor canTake request for null
I have deployed the gearman-plugin with https://review.openstack.org/#/c/192429/, but since I have no way to reproduce the issue, I will not be able to confirm whether it is solved :-\
Ah, that makes sense now. As explicitly stated in the gearman plugin wiki (https://wiki.jenkins-ci.org/display/JENKINS/Gearman+Plugin), matrix jobs are not supported (see the known issues section). We don't use matrix projects, so we haven't put the effort into supporting them.
@Christian, any updates on your end on this issue? Antoine reported that the related issue has been resolved.
It happened again with the gearman-plugin v0.1.2:
Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake AvailabilityMonitor canTake request for null
Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake AvailabilityMonitor canTake request for null
Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake AvailabilityMonitor canTake request for null
Jobs tied to that instance are stuck waiting for an available executor on deployment-bastion.
Marking the node offline and online doesn't remove the lock :-/
The executor threads have:
"Gearman worker deployment-bastion.eqiad_exec-1" prio=5 WAITING java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:503) hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73) hudson.plugins.gearman.StartJobWorker.safeExecuteFunction(StartJobWorker.java:196) hudson.plugins.gearman.StartJobWorker.executeFunction(StartJobWorker.java:114) org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:125) org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:22) hudson.plugins.gearman.MyGearmanWorkerImpl.submitFunction(MyGearmanWorkerImpl.java:593) hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:328) hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166) java.lang.Thread.run(Thread.java:745) "Gearman worker deployment-bastion.eqiad_exec-2" prio=5 TIMED_WAITING java.lang.Object.wait(Native Method) hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83) hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380) hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421) hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320) hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166) java.lang.Thread.run(Thread.java:745) "Gearman worker deployment-bastion.eqiad_exec-3" prio=5 TIMED_WAITING java.lang.Object.wait(Native Method) hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83) hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380) hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421) hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320) hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166) java.lang.Thread.run(Thread.java:745) "Gearman worker deployment-bastion.eqiad_exec-4" prio=5 TIMED_WAITING java.lang.Object.wait(Native Method) hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83) hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380) hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421) hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320) hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166) java.lang.Thread.run(Thread.java:745) "Gearman worker deployment-bastion.eqiad_exec-5" prio=5 TIMED_WAITING java.lang.Object.wait(Native Method) hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83) hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380) hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421) hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320) hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166) java.lang.Thread.run(Thread.java:745)
The node is named deployment-bastion-eqiad, with a label deployment-bastion-eqiad. Jobs are tied to deployment-bastion-eqiad.
The workaround I found was to remove the label from the node. Once done, the jobs show in the queue with 'no node having label deployment-bastion-eqiad'.
I then applied the label to the host again, and the job managed to run.
So maybe it is an issue in Jenkins itself :-}
The deadlock still happens from time to time with Jenkins 1.625.3 LTS and Gearman plugin 1.3.3 with https://review.openstack.org/#/c/252768/
Placed a bounty of $200 for this issue on FreedomSponsors:
https://freedomsponsors.org/issue/595/gearman-wont-schedule-new-jobs-even-though-there-are-slots-available-on-master?alert=SPONSOR#