I believe this is linked to JENKINS-6901, JENKINS-8223, and JENKINS-11031.
In our environment (Jenkins ver. 1.437, Hudson Locks and Latches 0.6) I can consistently reproduce the state where
- one job hangs indefinitely,
- the hang is visible only on the job's own page, not on the master page or the slave page (cf. romberts' comment of 19/May/10),
- restarting the master is necessary to resolve the situation (this seems to be common to all of 6901, 8223, and 11031),
- the same job cannot run again, because it appears to be running already.
How to reproduce:
1. Have 3 jobs that share a lock, and have 2 executors.
This works even with only 2 different jobs that do nothing more than "sleep 30" (no SCM required).
2. Start 1 job, then start another 2 jobs.
--> The first running job has acquired the lock, the other two are waiting in the queue.
3. Once job #1 finishes, the other 2 are simultaneously allocated an executor each.
4. One of them waits (no output at all), because the other one holds the lock.
--> This is the safety net as described e.g. in JENKINS-11031, so far everything's fine.
5. Kill the job waiting-in-executor while the other one is still running.
Now we're in the strange state where different pages disagree on whether the job is actually running, and resolving the situation requires a master restart.
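Steps 1 to 2 can be scripted so the timing is repeatable. The following is only a rough sketch, not something taken from our setup: the server URL and the job names are made up, and it assumes the jobs already exist, share a lock, and can be triggered anonymously through the standard `/job/<name>/build` remote API URL (a secured instance would additionally need credentials). Step 5, killing the build that waits in its executor, still has to be done by hand.

```python
import time
import urllib.parse
import urllib.request

JENKINS_URL = "http://localhost:8080"  # assumption: adjust to your master
JOBS = ["lock-test-a", "lock-test-b", "lock-test-c"]  # hypothetical job names


def build_url(base, job):
    """Return the remote-API URL that schedules a build of `job`."""
    return "%s/job/%s/build" % (base.rstrip("/"), urllib.parse.quote(job))


def trigger(base, job):
    """Schedule one build with an empty POST; assumes anonymous build rights."""
    req = urllib.request.Request(build_url(base, job), data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status


def reproduce(base=JENKINS_URL, jobs=JOBS, delay=5, trigger_fn=trigger):
    """Start the first job, wait until it holds the lock, then queue the rest."""
    order = [jobs[0]]
    trigger_fn(base, jobs[0])
    time.sleep(delay)  # give job #1 time to leave the queue and take the lock
    for job in jobs[1:]:
        trigger_fn(base, job)
        order.append(job)
    return order
```

Calling `reproduce()` against a test master should leave job #1 running and the other two queued (step 2). Once job #1 finishes, continue manually from step 3.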
Most of the time it is possible to avoid step 5 above and simply wait for one job to finish, after which the other job actually starts running and eventually terminates correctly. From time to time, however, an imprudent user action causes a job to become blocked, and we have to restart the master.
This issue seems to be hard to describe, but not that rare. Is there any hope of having it fixed?
The build queue on the front page does not show the phantom build; only the one on the project page does.