Type: Bug
Resolution: Fixed
Priority: Major
Environment: Red Hat Linux, Jenkins version 1.505
When I kill a running job in Jenkins at the very early stage of its build, it doesn't get killed and goes to a null state. I can't see that in the main console, but when I go to the job page, I see the job running. I am unable to kick off another build for the same job as it says that the job is already running. Please see the screenshot.
The only workaround I have for now is to restart Jenkins. It has been happening quite frequently in the past 2-3 days. Please send me a resolution for this at the earliest; it is not easy to restart Jenkins very often on a busy day.
Thanks,
Aswini
[JENKINS-17667] Unable to kill a job which is running
Why am I not getting any updates? I know that there will be a lot of issues daily and it's hard to resolve everything at the earliest. But at least a small comment saying that we are working on this would do. This is a blocker for me daily. I end up restarting Jenkins to resolve this issue.
Aswini, I think you may have misunderstood the nature of open source projects. Open source projects are addressed by a community of interested users who implement enhancements and work on problems based on their needs and their interests.
I think the response you're expecting is much closer to commercial support, rather than an open source community. You might consider contacting CloudBees about their commercially supported offering based on Jenkins.
There is also a concept of a "bug bounty" that I've seen offered elsewhere in the Jenkins project, though I'm not sure if that generally has the result you are seeking, since you seem to be seeking response times more typical of commercial products than open source projects.
Hi Mark, I thought this forum would be watched by the Jenkins developers and that they would post a solution for my question. If you have any answer for my query, can you let me know? Thanks.
I don't have an answer to your question. I've observed that sometimes a Jenkins job is harder to interrupt than others. My usual technique has been to click the "x" to stop the job, then if the job has not stopped shortly, I'll click somewhere else in the UI (causing the page to refresh), then click the "x" to stop the job a second time.
My issue is that there is no log at all for that job, and it keeps running on the machine with a "null" duration. The only way for me to kick off the job again is to restart Jenkins.
This happens when I kill the job as soon as it starts.
I think I have the same problem. What I see is this:
- Job timed out (using the build timeout plugin)
- There is no system process on the machine any more for the job (ps ax)
- There is a thread running inside Jenkins having this stack (taken from http://jenkins:8080/threadDump):
Executor #12 for master : executing MyJob #1381
"Executor #12 for master : executing MyJob #1381" Id=4432 Group=main RUNNABLE
    at java.util.WeakHashMap.get(WeakHashMap.java:471)
    at hudson.tools.InstallerTranslator.getToolHome(InstallerTranslator.java:55)
    at hudson.tools.ToolLocationNodeProperty.getToolHome(ToolLocationNodeProperty.java:107)
    at hudson.tools.ToolInstallation.translateFor(ToolInstallation.java:204)
    at hudson.tasks.Maven$MavenInstallation.forNode(Maven.java:610)
    at hudson.maven.MavenModuleSetBuild.getEnvironment(MavenModuleSetBuild.java:182)
    at hudson.scm.SubversionSCM.getModuleRoot(SubversionSCM.java:1554)
    at hudson.model.AbstractBuild.getModuleRoot(AbstractBuild.java:372)
    at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:698)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:585)
    at hudson.model.Run.execute(Run.java:1676)
    at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:519)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:231)
    Number of locked synchronizers = 1
    - java.util.concurrent.locks.ReentrantLock$NonfairSync@67ae5a69
- Jenkins UI shows the job still running and denies another execution
I've seen this from time to time. Any attempt at killing the job is useless (from the gui, from a curl command). I have seen this more frequently in the past and now it's back. I'm suspecting it may be related to using locks and latches... I wonder if there's a timeout that may have been exceeded and it's left in a state of limbo.
One thing I can suggest... it's rather drastic. But as I'm a pretty heavy user of the Jenkins CLI: if you pull down the job configuration, you can delete the job (removing this ghost running build) and recreate the job. You lose build history etc., but it is an option if you really need to get rid of the build and a restart of Jenkins is simply out of the question.
jenkins get-job abc > config.xml
jenkins delete-job abc
jenkins create-job abc < config.xml
hth, steven
and of course 'jenkins' in the above example is a shell script of
#!/bin/bash
java -jar ~/bin/jenkins-cli.jar -s https://jenkins_url -i ~/.ssh/id_rsa "$@"
And by the way... having used commercial software for years.. I've never seen the level of response that's being suggested.
I am encountering this issue as well. There are many threads stuck on the line
    at java.util.WeakHashMap.get(WeakHashMap.java:380)
    at hudson.tools.InstallerTranslator.getToolHome(InstallerTranslator.java:55)
The version of java I'm using:
java -version
java version "1.6.0_30"
OpenJDK Runtime Environment (IcedTea6 1.13.1) (6b30-1.13.1-1ubuntu2~0.12.04.1)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)
Jenkins version:
Jenkins ver. 1.557
It seems as if the same WeakHashMap instance is being used by multiple threads, and since in the documentation for WeakHashMap it says
http://docs.oracle.com/javase/7/docs/api/java/util/WeakHashMap.html
Like most collection classes, this class is not synchronized. A synchronized WeakHashMap may be constructed using the Collections.synchronizedMap method.
It seems like you should be using Collections.synchronizedMap on this, or you should probably use
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilder.html
which allows for weak keys and is thread safe.
I took a look at
https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/tools/InstallerTranslator.java
and see a possible issue.
None of the Map.put or Map.get calls are wrapped in a synchronized block.
According to the Java memory model, each thread can have its own local view of memory that is inconsistent with another thread's.
This enables multiple CPUs to have their own caches without forcing all reads/writes to be consistent with each other, which would slow things down.
So if you have
static class A { static int val = 0; }
Thread1: A.val = 1;
Thread2: System.out.println(A.val); // Can print out either 0 or 1.
In order to have consistency you can use volatiles, so
static class A { static volatile int val = 0; }
Thread1: A.val = 1;
Thread2: System.out.println(A.val); // Prints 1 if it runs after Thread1's write.
A volatile in effect forces a read from main memory instead of a per-thread cache.
Another way to achieve this effect is to use a synchronized block.
Thread1: synchronized(lock) { A.val = 1; }
Thread2: synchronized(lock) { System.out.println(A.val); } // Prints 1 if it takes the lock after Thread1 releases it.
This is from my understanding of
http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html
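To make the visibility point concrete, here is a minimal runnable sketch. The class `VisibilityDemo` and its fields are made up for illustration and have nothing to do with the Jenkins code base. The reader thread spins on a volatile flag; once it sees the flag set, the Java memory model also guarantees that it sees the plain write made just before the flag was set.

```java
// Hypothetical demo class, not Jenkins code.
class VisibilityDemo {
    // volatile: a write by one thread is guaranteed to become visible to others.
    static volatile boolean ready = false;
    // Plain field: published safely only because it is written before 'ready'.
    static int value = 0;

    // Publishes 'value' from the calling thread and returns what a reader thread observed.
    static int publishAndRead() {
        ready = false;
        value = 0;
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!ready) {
                // Busy-wait until the volatile write becomes visible.
            }
            seen[0] = value; // Happens-after the volatile read of 'ready'.
        });
        reader.start();
        value = 42;   // Ordinary write...
        ready = true; // ...made visible by the following volatile write.
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead());
    }
}
```

If `ready` were not volatile, the reader would be allowed to spin forever, because nothing would force it to re-read the flag.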
Now in your case, your gets and puts are not synchronized, so you can end up with strange behavior. I suggest wrapping your new WeakHashMap() invocations with Collections.synchronizedMap(new WeakHashMap())
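A minimal sketch of that suggestion, assuming a simplified cache keyed by arbitrary objects. `ToolHomeCache` and its methods are hypothetical names for illustration, not the actual `InstallerTranslator` code.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical illustration only; not the real hudson.tools.InstallerTranslator.
class ToolHomeCache {
    // Every call on the wrapper locks on the wrapper itself, so two threads
    // never touch the underlying WeakHashMap at the same time.
    private final Map<Object, String> homes =
            Collections.synchronizedMap(new WeakHashMap<Object, String>());

    void put(Object node, String home) {
        homes.put(node, home);
    }

    String get(Object node) {
        return homes.get(node);
    }

    // Fills one cache from several threads at once and returns how many
    // entries can be read back correctly afterwards.
    static int fillConcurrently(int threads, int keysPerThread) {
        final ToolHomeCache cache = new ToolHomeCache();
        // Strong references keep the weak keys alive while we check them.
        final Object[][] keys = new Object[threads][keysPerThread];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int k = 0; k < keysPerThread; k++) {
                    keys[id][k] = new Object();
                    cache.put(keys[id][k], "home-" + id + "-" + k);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        int found = 0;
        for (int t = 0; t < threads; t++) {
            for (int k = 0; k < keysPerThread; k++) {
                if (("home-" + t + "-" + k).equals(cache.get(keys[t][k]))) {
                    found++;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(fillConcurrently(8, 1000));
    }
}
```

Note that the wrapper only makes individual calls atomic. A check-then-put sequence (get, see null, then put) still needs its own synchronized block around both calls.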
Code changed in jenkins
User: Joshua Kolash
Path:
core/src/main/java/hudson/tools/InstallerTranslator.java
http://jenkins-ci.org/commit/jenkins/e3952f41c5649c326ace3cc263420a8c287e1e7c
Log:
[FIXED JENKINS-17667] - Syncronization of InstallerTranslator::getToolHome()
There appears to be some unthreadsafe initialization going on here.
Initialize/get inside a synchronized block for threadsaftey.
Code changed in jenkins
User: Daniel Beck
Path:
core/src/main/java/hudson/tools/InstallerTranslator.java
http://jenkins-ci.org/commit/jenkins/5ef293d7a0ac0a2f7a443a8460abda196f8056e0
Log:
Merge pull request #1176 from jkolash/master
[FIXED JENKINS-17667] - Syncronization of InstallerTranslator::getToolHome()
Compare: https://github.com/jenkinsci/jenkins/compare/7b70bd96d7ad...5ef293d7a0ac
Code changed in jenkins
User: Daniel Beck
Path:
changelog.html
http://jenkins-ci.org/commit/jenkins/fdc0b5c5650b3ee849f02e3c5d94d23c12886adc
Log:
Noting #1314, #1316, #1308, JENKINS-17667, JENKINS-22395, JENKINS-18065
Integrated in jenkins_main_trunk #3515
[FIXED JENKINS-17667] - Syncronization of InstallerTranslator::getToolHome() (Revision e3952f41c5649c326ace3cc263420a8c287e1e7c)
Result = SUCCESS
joshua.kolash : e3952f41c5649c326ace3cc263420a8c287e1e7c
Files :
- core/src/main/java/hudson/tools/InstallerTranslator.java
Code changed in jenkins
User: Jesse Glick
Path:
changelog.html
core/src/main/java/hudson/tools/InstallerTranslator.java
http://jenkins-ci.org/commit/jenkins/7c253c1cef6a40bf504e313e68a85e4fc065aa0f
Log:
JENKINS-17667 Reverting commit e3952f41c5649c326ace3cc263420a8c287e1e7c.
Code changed in jenkins
User: Jesse Glick
Path:
changelog.html
core/src/main/java/hudson/tools/InstallerTranslator.java
test/src/test/java/hudson/tools/InstallerTranslatorTest.java
http://jenkins-ci.org/commit/jenkins/17d90931655e6c67651ec371344552d7c23bdcda
Log:
[FIXED JENKINS-17667] Fixed race condition when running tool installers on many slaves at once.
Correcting change made in #1176, which introduced an NPE, to restore original logic merely wrapped in a synchronized block.
Reproduced NPE in new functional test (original bug probably very hard to reproduce).
Compare: https://github.com/jenkinsci/jenkins/compare/84d49ceef2d6...17d90931655e
Integrated in jenkins_main_trunk #3525
[FIXED JENKINS-17667] Fixed race condition when running tool installers on many slaves at once. (Revision 17d90931655e6c67651ec371344552d7c23bdcda)
Result = SUCCESS
Jesse Glick : 17d90931655e6c67651ec371344552d7c23bdcda
Files :
- test/src/test/java/hudson/tools/InstallerTranslatorTest.java
- changelog.html
- core/src/main/java/hudson/tools/InstallerTranslator.java
Integrated in jenkins_main_trunk #3532
JENKINS-17667 Reverting commit e3952f41c5649c326ace3cc263420a8c287e1e7c. (Revision 7c253c1cef6a40bf504e313e68a85e4fc065aa0f)
Result = SUCCESS
Jesse Glick : 7c253c1cef6a40bf504e313e68a85e4fc065aa0f
Files :
- changelog.html
- core/src/main/java/hudson/tools/InstallerTranslator.java
Code changed in jenkins
User: Jesse Glick
Path:
core/src/main/java/hudson/tools/InstallerTranslator.java
test/src/test/java/hudson/tools/InstallerTranslatorTest.java
http://jenkins-ci.org/commit/jenkins/65d34a5076d8c4ec15601cecba1257d0cbfe867a
Log:
[FIXED JENKINS-17667] Fixed race condition when running tool installers on many slaves at once.
Correcting change made in #1176, which introduced an NPE, to restore original logic merely wrapped in a synchronized block.
Reproduced NPE in new functional test (original bug probably very hard to reproduce).
(cherry picked from commit 17d90931655e6c67651ec371344552d7c23bdcda)
Conflicts:
changelog.html
core/src/main/java/hudson/tools/InstallerTranslator.java
For future reference, since I found this bug by googling: it seems we're currently seeing some form of recurrence of this issue. Running 1.593.
Currently crawling the thread dump, I don't see anything obvious, yet.
Reopening, as this is exactly the same behaviour described above:
"null on master" and so on.
Btw, this is very weird because this job is "restricted" to slaves who have a label which is not set on the master.
I've got a thread dump
After Jenkins restart, the timing has been adjusted and the node on which it seems Jenkins actually wanted to send the build has been fixed:
"took 0 ms on rhel6-3" (instead now of "master" as it was displayed while it was stuck).
So, that also matches the issue described, and the guy who did this confirmed that he tried to kill the build very early during its launch.
@Daniel I reopened before seeing your comment, because the symptoms are exactly the same at first sight, and I didn't want to disseminate data onto different JIRA issues when it seemed to be the same one.
But if needed I can still file a new one, and link to here.
Disable the job and then enable the job. You will see that all its builds have been killed and are not rerunning.
I didn't want to disseminate data onto different JIRA issues
A good idea as long as it never misleads or contradicts. As soon as that happens it's a mess and you need to figure out what's going on. Since you can mark jobs as being related, this shouldn't be an issue.
Not sure why this got reassigned away from me. I committed the fix to the known issue. (If there are other issues with similar symptoms, they should be filed separately and linked.)
I observed the same issue in Jenkins version 1.625.3. The issue occurred in a Multi-configuration project job. In my case the issue appeared after changing the job weight of the Multi-configuration project job from 1 to 2. The next builds of the job were unresponsive. The unresponsive build in the build history displayed the on-hover message "Started Null ago, Estimated remaining time: null." I could trigger the next build after reverting the job weight to 1 and enabling the job configuration option "Execute concurrent builds if necessary".
Any updates on this issue?