Quite a number of different manifestations of this have been observed by a number of our customers using different cloud providers. What they have in common is the use of a "single-shot" style retention strategy, though the root cause is observable, with great care, when using any retention strategy other than Always.
The basic issue is that you cannot determine whether a node is idle unless you hold the Queue lock, as that is the only way to ensure that the Queue is not in the process of assigning work to the node you are removing.
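The race can be illustrated with a minimal, self-contained sketch. This is not the Jenkins code: QueueLockSketch, Node, assign, finish and removeIfIdle are hypothetical names invented for the illustration, and a plain monitor object stands in for the Queue lock. The only point it makes is that the idle check and the removal have to happen under the same lock that guards work assignment, otherwise work can be handed to a node between the check and the removal.

```java
// Hypothetical, simplified model of the problem. None of these types are Jenkins classes.
final class QueueLockSketch {

    /** A stand-in for a cloud-provisioned node; all fields are guarded by queueLock. */
    static final class Node {
        private int busyExecutors;
        private boolean removed;
    }

    /** Stand-in for the Queue lock: guards both assignment and removal. */
    private final Object queueLock = new Object();

    /** Scheduler side: hand work to the node unless it has already been removed. */
    boolean assign(Node node) {
        synchronized (queueLock) {
            if (node.removed) {
                return false;
            }
            node.busyExecutors++;
            return true;
        }
    }

    /** Called when a piece of work completes on the node. */
    void finish(Node node) {
        synchronized (queueLock) {
            node.busyExecutors--;
        }
    }

    /**
     * Retention side: the idle check and the removal form one atomic step
     * because both happen while holding the lock that assign() also takes.
     */
    boolean removeIfIdle(Node node) {
        synchronized (queueLock) {
            if (node.removed || node.busyExecutors > 0) {
                return false;
            }
            node.removed = true;
            return true;
        }
    }

    /** Races one assignment against one removal; exactly one of them should win each round. */
    static int countViolations(int rounds) {
        int violations = 0;
        for (int i = 0; i < rounds; i++) {
            QueueLockSketch queue = new QueueLockSketch();
            Node node = new Node();
            boolean[] assigned = new boolean[1];
            boolean[] removed = new boolean[1];
            Thread scheduler = new Thread(() -> assigned[0] = queue.assign(node));
            Thread retention = new Thread(() -> removed[0] = queue.removeIfIdle(node));
            scheduler.start();
            retention.start();
            try {
                scheduler.join();
                retention.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            // A violation is work assigned to a removed node, or neither side making progress.
            if (assigned[0] == removed[0]) {
                violations++;
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println("violations: " + countViolations(1000));
    }
}
```

If removeIfIdle read busyExecutors without taking queueLock, the scheduler thread could assign work after the check but before the removal, which corresponds to the "no longer a configured node" failure shown in the symptoms.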
Symptoms include:
- Build logs that claim the job was executed on "master" even though the job is tied to a specific label that master does not have. The build log will have been "unable to be determined"
- Build logs where the node is gone just as soon as the job starts:
2015-03-05 13:27:55.101 Started by upstream project "____" build number ___
2015-03-05 13:27:55.102 originally caused by:
2015-03-05 13:27:55.103 Started by user ____
2015-03-05 13:27:55.437 FATAL: no longer a configured node for ____
2015-03-05 13:27:55.440 java.lang.IllegalStateException: no longer a configured node for ____
2015-03-05 13:27:55.440 at hudson.model.AbstractBuild$AbstractBuildExecution.getCurrentNode(AbstractBuild.java:452)
2015-03-05 13:27:55.440 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:484)
2015-03-05 13:27:55.441 at hudson.model.Run.execute(Run.java:1745)
2015-03-05 13:27:55.441 at hudson.model.Build.run(Build.java:113)
2015-03-05 13:27:55.441 at hudson.model.ResourceController.execute(ResourceController.java:89)
2015-03-05 13:27:55.441 at hudson.model.Executor.run(Executor.java:240)
Code changed in jenkins
User: Stephen Connolly
Path:
core/src/main/java/hudson/Functions.java
core/src/main/java/hudson/model/AbstractCIBase.java
core/src/main/java/hudson/model/Computer.java
core/src/main/java/hudson/model/Executor.java
core/src/main/java/hudson/model/Hudson.java
core/src/main/java/hudson/model/Node.java
core/src/main/java/hudson/model/Queue.java
core/src/main/java/hudson/model/ResourceController.java
core/src/main/java/hudson/slaves/AbstractCloudSlave.java
core/src/main/java/hudson/slaves/ComputerRetentionWork.java
core/src/main/java/hudson/slaves/NodeProvisioner.java
core/src/main/java/hudson/slaves/RetentionStrategy.java
core/src/main/java/hudson/slaves/SlaveComputer.java
core/src/main/java/jenkins/model/Jenkins.java
core/src/main/java/jenkins/model/Nodes.java
core/src/main/java/jenkins/util/AtmostOneTaskExecutor.java
core/src/main/resources/hudson/model/Messages.properties
core/src/main/resources/lib/hudson/executors.jelly
core/src/main/resources/lib/layout/layout.jelly
http://jenkins-ci.org/commit/jenkins/92147c3597308bc05e6448ccc41409fcc7c05fd7
Log:
[FIXED JENKINS-27565] Refactor the Queue and Nodes to use a consistent locking strategy
The test system I set up to verify resolution of the customer issues driving this change required
additional changes in order to fully resolve the issues at hand. As a result, I am bundling these
changes: