
Deadlock/Lockup when "trigger/call builds on other projects" with "Block until the triggered projects finish their builds" is used with only 1 executor

On a Jenkins node that has 1 executor, a job that triggers an upstream job via the "trigger/call builds on other projects" build step hangs indefinitely, or until the user-supplied timeout is reached.

      In the console for the main job, it shows:

      Waiting for the completion of <name-of-upstream-project>

Meanwhile, the Jenkins dashboard shows the upstream project trying to build, with the text:
      (pending - Waiting for next available executor)

I would expect the plugin to somehow be able to indicate to the Jenkins node that only 1 executor is required.

[JENKINS-12290] Deadlock/Lockup when "trigger/call builds on other projects" with "Block until the triggered projects finish their builds" is used with only 1 executor

          cjo9900 added a comment -

The case that you mention here is one of many possible cases; I can think of the following:
          Case A
          Parent (trigger+block) -> child(label=qwerty)
          A1: No nodes online that have label==qwerty
          A2: No nodes that have label defined.

          Case B
          Parent(label=qwerty)(trigger+block) -> child(label=qwerty)
          B1: Label qwerty has only one executor which is in use by parent

          Case C
          Parent(trigger+block) -> child
          C1: Master only has 1 executor

This covers the simple cases; however, if the parent is a Matrix project, we end up with an even more difficult problem to solve.

          Case D
          Parent(x*y configurations) -> x*y matrix builds -> x*y child builds

D1: x*y executors or fewer - Child builds cannot run

Resolving this within the plugin, at either configuration time or runtime, is therefore very difficult: we cannot just check whether master has a single executor, as other factors come into play regarding Cloud services and the job properties (label, resources, etc.) that the child projects require.

Problems:
Configuration time:
Can only check the current situation of the child projects + Nodes.
The project list might be a parameter, so the project list cannot be determined.
Passing a label parameter to a child project might affect the check.
Cannot account for any cases where a Cloud can allocate nodes.

Runtime:
Cannot always guarantee that a Node could be created by a Cloud instance.
Cannot control busy Executors that are used by other projects but needed by the started build.

Implementation Ideas

Get the Node we are being built on (own node)
(or the list of nodes containing all matrix siblings, see below)
Get the projects to start
Get all Nodes
Get all Clouds

foreach project + parameter set:
    # check 1
    can it be started on master with parameters?
    Yes:
        if master == ownNode:
            if master.numExecutors > 1 — can run on master
        else:
            if master.numExecutors > 0 — can run on master
    No:
        # cannot start on master, try the other Nodes
        foreach Node:
            can it be started on this node with parameters?
            Yes:
                if node == ownNode:
                    if node.numExecutors > 1 — can run on node
                else:
                    if node.numExecutors > 0 — can run on node
        # cannot start on master or an existing Node, try the Cloud services
        foreach Cloud:
            can the Cloud start a required node?
            Yes — can create a node to run on
            No — no possible way we can continue, as we will block

This should handle most cases, if it can be implemented;
however, it assumes that any other builds that are ongoing on any executor
will be able to finish and allow a proposed job to start.
This may fail if the parent build is a Matrix build, as this behaviour would not
take into account the job's siblings, which have similar blocking behaviour.
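A minimal Java sketch of the per-project availability check described above, using Jenkins core APIs that exist today (Jenkins.get(), Label.contains(Node), Cloud.canProvision(Label)); the class and method names are hypothetical, and real plugin code would also need to account for queue state and parameters:

    import hudson.model.AbstractProject;
    import hudson.model.Label;
    import hudson.model.Node;
    import hudson.slaves.Cloud;
    import jenkins.model.Jenkins;

    import java.util.ArrayList;
    import java.util.List;

    class TriggerDeadlockCheck {
        /** Hypothetical helper: can the child get an executor anywhere while we block? */
        static boolean canChildRunSomewhere(AbstractProject<?, ?> child, Node ownNode) {
            Jenkins jenkins = Jenkins.get();
            Label label = child.getAssignedLabel(); // null means "run anywhere"

            List<Node> candidates = new ArrayList<>();
            candidates.add(jenkins);                // master is itself a Node
            candidates.addAll(jenkins.getNodes());  // plus all configured agents

            for (Node node : candidates) {
                if (label != null && !label.contains(node)) {
                    continue; // node does not satisfy the child's label expression
                }
                // One executor on our own node is held by the blocking parent build.
                int required = (node == ownNode) ? 2 : 1;
                if (node.getNumExecutors() >= required) {
                    return true; // can run on this node
                }
            }
            // No static node qualifies; as a last resort, ask the Clouds.
            for (Cloud cloud : jenkins.clouds) {
                if (cloud.canProvision(label)) {
                    return true; // a suitable node could be provisioned
                }
            }
            return false; // triggering with "block" enabled would deadlock here
        }
    }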

This could be resolved by:
getting the current build and its root build;
if these are the same, there is no issue;
if they are different and the root build is a matrix build, we need to find out
where all of the child jobs are running or are going to run.
Overall this could be done, but is there a need for it?

I feel that it would be better to just add a warning when enabling the
"Block until the triggered projects finish their builds" option that informs
the user that this might occur.


Roland Schulz added a comment - edited

I think it is important. I have a multi-configuration project for both the build and the test step. Each has more configurations than I have executors/cores. Thus the job will always hang, because no executors are available when it tries to trigger the test. Ideally the triggering job (in the example, the "build" job) would not use any executor while it is waiting, similar to how the multi-configuration master doesn't use an executor (fixed in 936).

          This would let me use the block feature as a work-around for 11409.


Marcin Hawraniak added a comment -

Really looking forward to seeing this resolved. Because of it we receive job results later than expected (executors are blocked by idle upstream projects), and we can't run other jobs at the same time because of the executor limit.

I also wonder if there is any workaround for this at the moment.


          Dominic Cleal added a comment -

          I think what Roland's describing is the "flyweight" job which is used for matrix jobs - a job that doesn't use an executor slot.

It'd be interesting if, when blocking for a triggered job, the triggering process could be changed to a flyweight so the triggered job inherits the slot and then swaps back on completion; or the builder could get a temporary additional slot, as the blocked process won't be using much in the way of resources.
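For context, a minimal sketch of the mechanism being referred to: Jenkins runs tasks that implement the marker interface hudson.model.Queue.FlyweightTask (as matrix parent builds do) on a transient one-off executor, so they never occupy a configured executor slot. The helper class below is illustrative only:

    import hudson.model.Queue;

    class FlyweightCheck {
        // Flyweight tasks are scheduled on a OneOffExecutor instead of one of the
        // node's configured executors, which is why a matrix parent "runs for free".
        static boolean runsWithoutExecutorSlot(Queue.Task task) {
            return task instanceof Queue.FlyweightTask;
        }
    }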


Heiko Böttger added a comment -

Hi, is anyone working on this?
For me this problem makes the parameterized jobs plugin unusable. I cannot risk that Jenkins suddenly gets stuck and blocks all builds.

My scenario is as follows:

I have a build script which needs to be executed (on a specific slave) to create a property file for each dependency I need to build. I use the parameterized job plugin to run a job for each dependency and wait until it is finished. Here the problem starts: as soon as I need to block until the end, the executor used by the main job is never released. This issue can easily be reproduced using a slave with exactly one executor. I thought a Matrix Job used as the upstream job type would solve the issue, since it is executed as a flyweight job; however, I need to execute the script that creates the dependencies on a specific slave, which forces me to define a slave axis. The slave axis again is executed as a subtask which is not flyweight, causing the same issue. One solution which would work is releasing the executor before blocking and acquiring it again when the subtask is done.

My current idea for working around this is to avoid the need for an axis in the matrix job and use a separate downstream job to create the property files, which I would transfer to the matrix job using the copy artifacts plugin. That's rather complicated and I am still not sure whether it will work.

          Any better idea?


Michal Wesolowski added a comment -

I'm also looking forward to seeing this resolved.


Vasily L added a comment - edited

I think the waiting parent job should share its executor with the children.
Alternatively, it could just release the executor while blocked and pick it up again on unblock.


          Tiger Cheng added a comment -

Also following, and in agreement with Vasily, that it is expected that if the child job is asking for the node that the parent is running on, it should be allowed access to that node. Without this working, our complex pipeline is unable to properly archive artifacts in the ideal clean manner it was designed for. The alternative is to look for a solution to archive from the nodes.


Petar Tahchiev added a comment - edited

I've just hit this issue. My scenario is the following. I have several projects:

           - bom

           - platform

           - archetype

           - console

 - release-all

The release-all job is a pipeline build which calls release on each of them in the following order: bom -> platform -> archetype -> console. However, because I have just one executor, running release-all blocks the executor, and the bom release never starts because it is waiting for the next available executor.

          Tamas Hegedus added a comment -

I have a single Jenkins server with two executors. I cannot run two pipelines in parallel, because they immediately get deadlocked as they cannot start their child jobs. I wonder why this issue doesn't have critical priority; it renders pipeline jobs useless in most cases. JENKINS-26959 was set critical and was closed as a duplicate of this issue.


          Tobias Gierke added a comment -

I just hit this bug. IMHO this is a core feature (being able to trigger another project and wait for its completion without blocking an executor).


Alexander Borsuk added a comment -

This is a critical issue that makes parallel/matrix builds unusable. In my pipeline I need to build for Mac and Linux; obviously, this can be done in parallel. There are two build nodes available (Linux and Mac). The bug reproduces when one build node waits until another build node finishes, while the other waits until the first one finishes. This is a deadlock.


NhatKhai Nguyen added a comment -

What if the queue scheduler reserved x number of executors per agent, so that parent jobs can't block all of the executors away from their children (a temporary safeguard)?


NhatKhai Nguyen added a comment - edited

And/or while the parent job waits for the child jobs, it should release its executor. Then, when coming back to the parent job, it just has to acquire a new executor again on the same agent machine.

(Or hand over the executor to the child jobs, and take it back when they are done.)
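There is no core API for a running build to give up its executor mid-build, but the release/reacquire pattern being suggested can be modelled with standard Java concurrency primitives. This is a conceptual sketch only, not Jenkins code:

    import java.util.concurrent.Callable;
    import java.util.concurrent.Semaphore;

    class ExecutorPoolModel {
        private final Semaphore slots;

        ExecutorPoolModel(int executors) {
            this.slots = new Semaphore(executors);
        }

        // The parent releases its slot while it waits for the children, then
        // reacquires one before resuming, so the wait cannot starve the children.
        <T> T awaitWithoutSlot(Callable<T> waitForChildren) throws Exception {
            slots.release();          // hand the executor back to the pool
            try {
                return waitForChildren.call();
            } finally {
                slots.acquire();      // take an executor again before continuing
            }
        }
    }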


NhatKhai Nguyen added a comment - edited

I think a similar release-and-acquire could be applied to all the do-nothing operations, like: sleep x seconds, wait for another job to finish, etc.


Jose added a comment - edited

          I see some issues with this ticket:

• I'm not very familiar with the project/component terms used in Jenkins; however, I believe the issue is not about the "parameterized-trigger-plugin" component, but rather about the "build step" used to launch other jobs from within another job.
          • The issue is not related to having only 1 executor, but rather to having no available executors when launching too many master/orchestrator jobs that run sub-jobs, causing a deadlock.

          I propose updating the Jira ticket to reflect the issue as: "Deadlock/Lockup when using "launching builds on other jobs" and "Block until the triggered jobs finish their builds" when no available executors"

          From what I understand, this issue occurs in the following scenario:

          • Node/slave with only 2 (or 1) executors available
          • 1 upstream job finishes successfully
• 2 downstream (orchestrator) jobs are automatically triggered immediately by the upstream job when it finishes successfully
          • Each downstream job launches other sub-jobs (dependent jobs) that each require their own executor

          Since the last bullet point requires extra executors, and there are none available, the master/orchestrator jobs will enter a deadlock.

          There are several ways to resolve this issue:

          1. (my preference): The launched dependent jobs do not occupy extra executors if run on the same node/slave.
          2. Distribute the execution of the downstream jobs so that they do not collide and there are available executors for their sub-jobs.

          A high-level implementation of the above could be:

          1. For the first option, this could be automated by the Jenkins orchestrator. Alternatively, a new flag in the build step could also help.
2. For the second option, this could be achieved by adding a flag to the trigger-upstream configuration in the downstream job with a parameter similar to the cron H flag. In other words, instead of executing the downstream jobs immediately, they could be distributed over a set period of time (e.g. within an hour); a sketch of this idea follows below.
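As a rough illustration of the second option, a per-job delay could be derived from a hash of the job name, similar to how the cron H token spreads schedules. spreadDelayMillis is a hypothetical helper, not an existing plugin or core API:

    import java.util.concurrent.TimeUnit;

    class TriggerSpread {
        // Deterministically map a job name to a start offset inside the window,
        // so simultaneously triggered downstream jobs do not all start at once.
        static long spreadDelayMillis(String jobName, long windowMinutes) {
            long slot = Math.floorMod(jobName.hashCode(), windowMinutes); // non-negative
            return TimeUnit.MINUTES.toMillis(slot);
        }
    }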


          Magnus Reftel added a comment -

Another way of seeing the issue is that Jenkins needlessly keeps an executor occupied while the job it is running is just waiting for another job to finish. If the job could somehow yield its executor while waiting, there would be no deadlocks.


Assignee: huybrechts
Reporter: Garen Parham
Votes: 38
Watchers: 41