[JENKINS-5413] SCM polling getting hung

rsteele added a comment - 2010-02-01 11:30

I believe I'm seeing this problem as well and have been for a couple (maybe more?) weeks. One difference: I'm using the ClearCase plugin as my SCM provider, but otherwise the symptoms seem to be the same: one of my slaves, though still "online", seems to get stuck while polling for changes (though I don't see any ClearCase processes running in Process Explorer). Furthermore, killing the slave doesn't seem to do any good and the master doesn't even notice the slave has died.

rsteele added a comment - 2010-02-01 11:30 I believe I'm seeing this problem as well and have been for a couple (maybe more?) weeks. One difference: I'm using the ClearCase plugin as my SCM provider, but otherwise the symptoms seem to be the same: one of my slaves, though still "online", seems to get stuck while polling for changes (though I don't see any ClearCase processes running in Process Explorer). Furthermore, killing the slave doesn't seem to do any good and the master doesn't even notice the slave has died.

Carl Quinn added a comment - 2010-02-25 22:46

And I'm seeing it as well with a Perforce SCM.

Carl Quinn added a comment - 2010-02-25 22:46 And I'm seeing it as well with a Perforce SCM.

mdillon added a comment - 2010-02-26 14:38

Here is a stack dump from a Hudson master we were running after all 10 asychronous polling threads were hung. The job names, executor names, and internal classes names have been munged just in case. This thread dump appears to be missing the stack for the main thread for some reason, but I don't think that is a big deal.

Once our server got itself into this state, we were not able to unstick polling without a restart. Disconnecting the affected executor did not cause these threads to go away and reconnecting the executor did not cause polling to resume.

Our workaround has been to add a setting to revert the subversion plugin back to master-only polling on the affected installations. FWIW, we seem to see this on high-load Hudson installations.

mdillon added a comment - 2010-02-26 14:38 Here is a stack dump from a Hudson master we were running after all 10 asychronous polling threads were hung. The job names, executor names, and internal classes names have been munged just in case. This thread dump appears to be missing the stack for the main thread for some reason, but I don't think that is a big deal. Once our server got itself into this state, we were not able to unstick polling without a restart. Disconnecting the affected executor did not cause these threads to go away and reconnecting the executor did not cause polling to resume. Our workaround has been to add a setting to revert the subversion plugin back to master-only polling on the affected installations. FWIW, we seem to see this on high-load Hudson installations.

mdillon added a comment - 2010-02-26 14:39

BTW, that thread dump was from a Hudson master running the equivalent of Hudson 1.322. I don't know if anyone else in the company has a thread dump from a more recent Hudson version.

mdillon added a comment - 2010-02-26 14:39 BTW, that thread dump was from a Hudson master running the equivalent of Hudson 1.322. I don't know if anyone else in the company has a thread dump from a more recent Hudson version.

dshields777 added a comment - 2010-03-09 11:22

I'm seeing the same behavior.

mdillon, you mention reverting to master-only polling as a workaround. How did you do that? Is there a config setting that I'm missing? Or do you mean you went back to an earlier version of the SVN plugin?

dshields777 added a comment - 2010-03-09 11:22 I'm seeing the same behavior. mdillon, you mention reverting to master-only polling as a workaround. How did you do that? Is there a config setting that I'm missing? Or do you mean you went back to an earlier version of the SVN plugin?

Dean Yu added a comment - 2010-03-09 22:42

@dshields777: We build our Hudson installation from source with some modifications. We added a switch to the Subversion plugin to poll from the master. We can certainly contribute this patch upstream so other people can use the same workaround.

Dean Yu added a comment - 2010-03-09 22:42 @dshields777: We build our Hudson installation from source with some modifications. We added a switch to the Subversion plugin to poll from the master. We can certainly contribute this patch upstream so other people can use the same workaround.

wgracelee added a comment - 2010-03-29 08:31

Hi, we have the same problem on Hudson 1.352 on using subversion. Is the patch available for download now?

wgracelee added a comment - 2010-03-29 08:31 Hi, we have the same problem on Hudson 1.352 on using subversion. Is the patch available for download now?

daniel_franzen added a comment - 2010-04-06 23:03

Hi, I've been seeing this issue too (since late January - I wish I could give you an exact version number). I haven't found any way to reliably reproduce it (it happens at random every other day). It occurred as late as yesterday, running Hudson 1.353 with Subversion Plugin 1.16.

The typical scenario in our case is as follows
1) A job hangs.*
2) The node becomes unusable. A job starting on the node stops at "Building remotely on MySlave".
3) SVN polling gets stuck.
4) The node can be made usable by disconnecting and reconnecting in Hudson's node management.
5) Polling only resumes after a Hudson restart.

*) Some background on how these job hang-ups manifest themselves:
There's one particular job that hangs often, but I can't determine what's special about it. It typically hangs at "Recording plot data". In other words, it's not during actual job execution, but just at the end. This occurs on any of our slave nodes - nodes that are running other Hudson jobs without a hitch. When I've removed plotting it hangs at archiving/fingerprinting instead. I suspect one of the code analysis plugins we run at the end (e.g. Findbugs) might be responsible. If you believe this is of interest to the SVN polling issue I'll be happy to provide more detailed information.

daniel_franzen added a comment - 2010-04-06 23:03 Hi, I've been seeing this issue too (since late January - I wish I could give you an exact version number). I haven't found any way to reliably reproduce it (it happens at random every other day). It occurred as late as yesterday, running Hudson 1.353 with Subversion Plugin 1.16. The typical scenario in our case is as follows 1) A job hangs.* 2) The node becomes unusable. A job starting on the node stops at "Building remotely on MySlave". 3) SVN polling gets stuck. 4) The node can be made usable by disconnecting and reconnecting in Hudson's node management. 5) Polling only resumes after a Hudson restart. *) Some background on how these job hang-ups manifest themselves: There's one particular job that hangs often, but I can't determine what's special about it. It typically hangs at "Recording plot data". In other words, it's not during actual job execution, but just at the end. This occurs on any of our slave nodes - nodes that are running other Hudson jobs without a hitch. When I've removed plotting it hangs at archiving/fingerprinting instead. I suspect one of the code analysis plugins we run at the end (e.g. Findbugs) might be responsible. If you believe this is of interest to the SVN polling issue I'll be happy to provide more detailed information.

Tony Sweeney added a comment - 2010-04-29 11:35 - edited

We get this multiple times per day. The only solution is a server master restart, losing any builds currently in progress. This is making Hudson damn near unusable. Would the guy who made the master-only polling fix be prepared to make it available here?

Tony Sweeney added a comment - 2010-04-29 11:35 - edited We get this multiple times per day. The only solution is a server master restart, losing any builds currently in progress. This is making Hudson damn near unusable. Would the guy who made the master-only polling fix be prepared to make it available here?

lkishalmi added a comment - 2010-04-29 13:15

We are also fighting with this issue however we are using Perforce as SCM, so I think this issue goes down to the core of Hudson master-slave communication. Unfortunately it seems there is only one reliable slave configuration for Hudson and it is SSH slave with Solaris OS. It was the only configuration which has never failed so far, however we have a Linux master with several Windows one Linux one Mac and one Solaris slaves.

In order to improve the stability of your system you might configure your slaves on-demand. With this we are running without this issue for a few days now.

lkishalmi added a comment - 2010-04-29 13:15 We are also fighting with this issue however we are using Perforce as SCM, so I think this issue goes down to the core of Hudson master-slave communication. Unfortunately it seems there is only one reliable slave configuration for Hudson and it is SSH slave with Solaris OS. It was the only configuration which has never failed so far, however we have a Linux master with several Windows one Linux one Mac and one Solaris slaves. In order to improve the stability of your system you might configure your slaves on-demand. With this we are running without this issue for a few days now.

lkishalmi added a comment - 2010-04-29 13:17

Changed the component and the title as it seems this issue is not just related to Subversion.

lkishalmi added a comment - 2010-04-29 13:17 Changed the component and the title as it seems this issue is not just related to Subversion.

jpadmana added a comment - 2010-05-13 15:47

We are seeing the same issue with Hudson 1.353 and Subversion.

Current SCM Polling Activities
There are more SCM polling activities scheduled than handled, so the threads are not keeping up with the demands. Check if your polling is hanging, and/or increase the number of threads if necessary.

The following polling activities are currently in progress:

jpadmana added a comment - 2010-05-13 15:47 We are seeing the same issue with Hudson 1.353 and Subversion. Current SCM Polling Activities There are more SCM polling activities scheduled than handled, so the threads are not keeping up with the demands. Check if your polling is hanging, and/or increase the number of threads if necessary. The following polling activities are currently in progress:

ericguz added a comment - 2010-06-09 05:36

We are currently seeing this on an almost nightly basis. Both with Hudson 1.355 and 1.360 using Perforce 2008.2 for SCM. Master and Slave are all running on Solaris 10 i86.

ericguz added a comment - 2010-06-09 05:36 We are currently seeing this on an almost nightly basis. Both with Hudson 1.355 and 1.360 using Perforce 2008.2 for SCM. Master and Slave are all running on Solaris 10 i86.

Christian Bremer added a comment - 2010-06-11 07:42

We are also seeing this problem a couple times a day. We are using Hudson 1.355 and ClearCase Plugin 1.2.
Any workaround or fix would be highly appreciated!!

Christian Bremer added a comment - 2010-06-11 07:42 We are also seeing this problem a couple times a day. We are using Hudson 1.355 and ClearCase Plugin 1.2. Any workaround or fix would be highly appreciated!!

jpshackelford added a comment - 2010-06-28 09:03 - edited

We are seeing this or a related problem as well. Linux / Perforce / Hudson ver. 1.361. See thread_dump_02.txt and hung_scm_pollers_02.png. It would be great if we could at least have an option of killing the threads without bouncing Hudson.

jpshackelford added a comment - 2010-06-28 09:03 - edited We are seeing this or a related problem as well. Linux / Perforce / Hudson ver. 1.361. See thread_dump_02.txt and hung_scm_pollers_02.png. It would be great if we could at least have an option of killing the threads without bouncing Hudson.

jpshackelford added a comment - 2010-06-28 10:08 - edited

I have just noticed after bouncing the server and watching 180 jobs run through while monitoring http://<my server>/descriptor/hudson.triggers.SCMTrigger/ is that the polling threads associated with the Thread Dump message "SCM polling for hudson.maven.MavenModuleSet..." seem to show up as running for the whole time it takes for the build to run, but I don't see this on other job types. Same thing when I look at the page http://<my server>/job/<my job>/scmPollLog/?. The first line reads "Started on Jun 28, 2010 11:51:03 AM" and then I see nothing else until the build completes 30 mins later. And then the rest of the log shows:

Looking for changes...
Using remote perforce client: icdudrelgapp_releng_ease_producer--1152796861
[EASE - Producer] $ /hosting/bin/p4 workspace -o icdudrelgapp_releng_ease_producer--1152796861
Saving modified client icdudrelgapp_releng_ease_producer--1152796861
[EASE - Producer] $ /hosting/bin/p4 -s client -i
Last sync'd change was 420598
[EASE - Producer] $ /hosting/bin/p4 changes -m 2 //icdudrelgapp_releng_ease_producer--1152796861/...
Latest submitted change selected by workspace is 420598
Assuming that the workspace definition has not changed.
Done. Took 0.57 sec
No changes

Perforce wasn't hung--I was also monitoring the commands processed by Perforce and they were flying through. I haven't dug into the source yet, but I don't understand why it looks like the poller is tied up for 30 mins while the build runs though Perforce would have easily processed the request in milliseconds.

jpshackelford added a comment - 2010-06-28 10:08 - edited I have just noticed after bouncing the server and watching 180 jobs run through while monitoring http://<my server>/descriptor/hudson.triggers.SCMTrigger/ is that the polling threads associated with the Thread Dump message "SCM polling for hudson.maven.MavenModuleSet..." seem to show up as running for the whole time it takes for the build to run, but I don't see this on other job types. Same thing when I look at the page http://<my server>/job/<my job>/scmPollLog/?. The first line reads "Started on Jun 28, 2010 11:51:03 AM" and then I see nothing else until the build completes 30 mins later. And then the rest of the log shows: Looking for changes... Using remote perforce client: icdudrelgapp_releng_ease_producer--1152796861 [EASE - Producer] $ /hosting/bin/p4 workspace -o icdudrelgapp_releng_ease_producer--1152796861 Saving modified client icdudrelgapp_releng_ease_producer--1152796861 [EASE - Producer] $ /hosting/bin/p4 -s client -i Last sync'd change was 420598 [EASE - Producer] $ /hosting/bin/p4 changes -m 2 //icdudrelgapp_releng_ease_producer--1152796861/... Latest submitted change selected by workspace is 420598 Assuming that the workspace definition has not changed. Done. Took 0.57 sec No changes Perforce wasn't hung--I was also monitoring the commands processed by Perforce and they were flying through. I haven't dug into the source yet, but I don't understand why it looks like the poller is tied up for 30 mins while the build runs though Perforce would have easily processed the request in milliseconds.

carlspring added a comment - 2010-07-06 08:51

I had the same issue when having around 150 modules.
The moment I added SVN hook based build triggering, this went away. Some other version controls support this as well.

carlspring added a comment - 2010-07-06 08:51 I had the same issue when having around 150 modules. The moment I added SVN hook based build triggering, this went away. Some other version controls support this as well.

carlspring added a comment - 2010-07-06 09:04

I stand corrected.
Even THIS didn't solve the problem as it just reoccurred!

Hudson 1.364.

carlspring added a comment - 2010-07-06 09:04 I stand corrected. Even THIS didn't solve the problem as it just reoccurred! Hudson 1.364.

vjuranek added a comment - 2010-07-08 02:39

Reading the discussion it seems to me that at least in some cases (when slave is stuck and seems to be online even if disconnected) this is exposure of ~~JENKINS-5977~~.
Btw. stuck SCM polling threads can be easily killed via Groovy console (at least it works for me)

vjuranek added a comment - 2010-07-08 02:39 Reading the discussion it seems to me that at least in some cases (when slave is stuck and seems to be online even if disconnected) this is exposure of JENKINS-5977 . Btw. stuck SCM polling threads can be easily killed via Groovy console (at least it works for me)

carlspring added a comment - 2010-07-08 03:13

Reducing the polling to once every half hour and setting up SVN hooks, helped solved the issue, although I am not fully sure it has gone away.

If it's possible to kill the hung up pollers threads, I would personally recommend adding a poll response timeout mechanism to Hudson.

Apparently, a lot of people are experiencing this, if they have a large number of projects and I think it should be addressed with a high priority.

carlspring added a comment - 2010-07-08 03:13 Reducing the polling to once every half hour and setting up SVN hooks, helped solved the issue, although I am not fully sure it has gone away. If it's possible to kill the hung up pollers threads, I would personally recommend adding a poll response timeout mechanism to Hudson. Apparently, a lot of people are experiencing this, if they have a large number of projects and I think it should be addressed with a high priority.

Hans-Juergen Hafner added a comment - 2010-08-09 06:18 - edited

@vjuranek

Could you please give me an example how to kill stuck SCM polling threads via Groovy console?
( I´m just a rookie in Groovy)
BR,
Hans-Jürgen

Hans-Juergen Hafner added a comment - 2010-08-09 06:18 - edited @vjuranek Could you please give me an example how to kill stuck SCM polling threads via Groovy console? ( I´m just a rookie in Groovy) BR, Hans-Jürgen

vjuranek added a comment - 2010-08-10 13:41

@hjhafner
very primitive (I'm too lazy to develop something better as this is not very important issue) Groovy script is bellow. It could happened that it will kill also SCM polling which are not stuck, but we run this script automatically only once a day so it doesn't cause any troubles for us. You can improve it e.g. by saving ids and names of SCM polling threads, check once again after some time and kill only threads which ids are on the list from previous check.

Thread.getAllStackTraces().keySet().each(){ item ->
if(item.getName().contains("SCM polling") && item.getName().contains("waiting for hudson.remoting"))

{ println "Interrupting thread " + item.getId() item.interrupt() }

}

vjuranek added a comment - 2010-08-10 13:41 @hjhafner very primitive (I'm too lazy to develop something better as this is not very important issue) Groovy script is bellow. It could happened that it will kill also SCM polling which are not stuck, but we run this script automatically only once a day so it doesn't cause any troubles for us. You can improve it e.g. by saving ids and names of SCM polling threads, check once again after some time and kill only threads which ids are on the list from previous check. Thread.getAllStackTraces().keySet().each(){ item -> if(item.getName().contains("SCM polling") && item.getName().contains("waiting for hudson.remoting")) { println "Interrupting thread " + item.getId() item.interrupt() } }

Hans-Juergen Hafner added a comment - 2010-08-13 04:49

@vjuranek
Thanks a lot!
The script worked very well . (With on small change: A ";" was missed before item.interrupt() )

Hans-Juergen Hafner added a comment - 2010-08-13 04:49 @vjuranek Thanks a lot! The script worked very well . (With on small change: A ";" was missed before item.interrupt() )

erwan_q added a comment - 2010-09-13 15:23

I have the same issue every week when I restart the master. The slave is unusable (scm polling stuck, fail to join the slave....). Workaround is to disconnect, kill slave manually, restart slaves. It could be great to have a batch script to implement this restart action at reboot time

erwan_q added a comment - 2010-09-13 15:23 I have the same issue every week when I restart the master. The slave is unusable (scm polling stuck, fail to join the slave....). Workaround is to disconnect, kill slave manually, restart slaves. It could be great to have a batch script to implement this restart action at reboot time

lkishalmi added a comment - 2010-11-04 00:37

~~JENKINS-5977~~ seems to be the root cause of this issue.
Try to upgrade to 1.380+

lkishalmi added a comment - 2010-11-04 00:37 JENKINS-5977 seems to be the root cause of this issue. Try to upgrade to 1.380+

Dean Yu added a comment - 2010-11-18 10:53

I've just applied the patch to perform polling on the master for the Subversion plugin. Apologies to everyone who was waiting for this patch. Updating the plugin wiki with instructions.

Dean Yu added a comment - 2010-11-18 10:53 I've just applied the patch to perform polling on the master for the Subversion plugin. Apologies to everyone who was waiting for this patch. Updating the plugin wiki with instructions.

eguess74 added a comment - 2011-03-09 19:19 - edited

I just started to experience this problem.
we have two instances of Jenkins running. One of them started to have this polling error. It is not only that the polling gets stuck but also the CPU is overloaded for no reason. I have restarted the server, but it came back to this error state in no time.
I have changed the concurrent poll amount from 5 to 10 and restarted again to kill hangup polls. Watching...

We are using jenkins 1.399 + git plugin 1.1.5 + no slaves involved
We have about 450 jobs with polling interval set to 10 min

eguess74 added a comment - 2011-03-09 19:19 - edited I just started to experience this problem. we have two instances of Jenkins running. One of them started to have this polling error. It is not only that the polling gets stuck but also the CPU is overloaded for no reason. I have restarted the server, but it came back to this error state in no time. I have changed the concurrent poll amount from 5 to 10 and restarted again to kill hangup polls. Watching... We are using jenkins 1.399 + git plugin 1.1.5 + no slaves involved We have about 450 jobs with polling interval set to 10 min

eguess74 added a comment - 2011-03-09 20:33 - edited

BTW the script provided by vjuranek didn't work for me throwing MissingMethodException
But i have found and ability to kill threads from GUI using monitoring plugin! The thread details section has this nice little kill button

eguess74 added a comment - 2011-03-09 20:33 - edited BTW the script provided by vjuranek didn't work for me throwing MissingMethodException But i have found and ability to kill threads from GUI using monitoring plugin! The thread details section has this nice little kill button

eguess74 added a comment - 2011-03-10 15:09

for the record:
I was able to narrow it down to three jobs that were consistently getting stuck on the polling/fetching step.
I was trying different approaches but the only thing that actually resolved the problem was to recreate thos jobs from scratch. I.e. i blew away all related folders and workspace and recreated the job. This brought the CPU usage down and there is no stuck threads anymore already for a full day...

eguess74 added a comment - 2011-03-10 15:09 for the record: I was able to narrow it down to three jobs that were consistently getting stuck on the polling/fetching step. I was trying different approaches but the only thing that actually resolved the problem was to recreate thos jobs from scratch. I.e. i blew away all related folders and workspace and recreated the job. This brought the CPU usage down and there is no stuck threads anymore already for a full day...

Stephen Connolly added a comment - 2011-06-07 08:38

CloudBees have raised http://issues.tmatesoft.com/issue/SVNKIT-15

Stephen Connolly added a comment - 2011-06-07 08:38 CloudBees have raised http://issues.tmatesoft.com/issue/SVNKIT-15

Thomas Mielke added a comment - 2011-07-19 16:23 - edited

@vjuranek, @Hans-Juergen Hafner

My team is using Git, and we started experiencing the problem after adding several extra builds to our jenkins server. We resolved the issue by running the script vjuranek provided as a cronjob:

cron
0 * * * *       /var/lib/hudson/killscm.sh

# cat /var/lib/hudson/killscm.sh 
java -jar /var/lib/hudson/hudson-cli.jar -s http://myserver:8090/ groovy /var/lib/hudson/threadkill.groovy

# cat /var/lib/hudson/threadkill.groovy 
Thread.getAllStackTraces().keySet().each() { 
	item ->
	if (item.getName().contains("SCM polling") && item.getName().contains("waiting for hudson.remoting")) { 
		println "Interrupting thread " + item.getId(); 
		item.interrupt() 
	}
}

Since running these scripts, our nightly builds haven't hung for the last 5 consecutive days.

Thomas Mielke added a comment - 2011-07-19 16:23 - edited @vjuranek, @Hans-Juergen Hafner My team is using Git, and we started experiencing the problem after adding several extra builds to our jenkins server. We resolved the issue by running the script vjuranek provided as a cronjob: cron 0 * * * * / var /lib/hudson/killscm.sh # cat / var /lib/hudson/killscm.sh java -jar / var /lib/hudson/hudson-cli.jar -s http: //myserver:8090/ groovy / var /lib/hudson/threadkill.groovy # cat / var /lib/hudson/threadkill.groovy Thread .getAllStackTraces().keySet().each() { item -> if (item.getName().contains( "SCM polling" ) && item.getName().contains( "waiting for hudson.remoting" )) { println "Interrupting thread " + item.getId(); item.interrupt() } } Since running these scripts, our nightly builds haven't hung for the last 5 consecutive days.

Frederic Pesquet added a comment - 2011-09-08 07:54

We are still experiencing this (with 1.420 and 1.414).
The workarround to kill scm threads periodically does not work for me, not sure why (blocked threads are not killed by the scripts, and can not be killed either by the monitoring plugin).
We have a large number of jobs (>1000). The Hudson is blocking every day, and the only way to unlock it is to restart it.
The issue does not seem to be specific to one SCM: we are using SVN and GIT. When I tried to implement the workarround to poll from master with SVN (hudson.scm.SubversionSCM.pollFromMaster), the blocking occured on the GIT polling.
There are 45 voters on this issue, I guess I'm not alone here... Can we raise the priority of this? Seems a real core issue.

Frederic Pesquet added a comment - 2011-09-08 07:54 We are still experiencing this (with 1.420 and 1.414). The workarround to kill scm threads periodically does not work for me, not sure why (blocked threads are not killed by the scripts, and can not be killed either by the monitoring plugin). We have a large number of jobs (>1000). The Hudson is blocking every day, and the only way to unlock it is to restart it. The issue does not seem to be specific to one SCM: we are using SVN and GIT. When I tried to implement the workarround to poll from master with SVN (hudson.scm.SubversionSCM.pollFromMaster), the blocking occured on the GIT polling. There are 45 voters on this issue, I guess I'm not alone here... Can we raise the priority of this? Seems a real core issue.

carlspring added a comment - 2011-09-08 08:13

We used to hit this while we were still using Hudson (ca. 1.3xx). We also have a large number of jobs 300-400. We haven't run into this since we moved to Jenkins.

(Just a suggestion).

carlspring added a comment - 2011-09-08 08:13 We used to hit this while we were still using Hudson (ca. 1.3xx). We also have a large number of jobs 300-400. We haven't run into this since we moved to Jenkins. (Just a suggestion).

Frederic Pesquet added a comment - 2011-09-08 14:05

We are already on Jenkins...
More infos:
We have also a medium number of slaves (>25). It is not uncommon that a slave cease to respond temporarily, or reboot, etc.
As described in this bug initial description, "Furthermore, if the slave dies, the locked channel object still exists in the master JVM.".
I guess we are probably experiencing something like this. The SCM polling getting hung is just the most obvious symptom here.

Frederic Pesquet added a comment - 2011-09-08 14:05 We are already on Jenkins... More infos: We have also a medium number of slaves (>25). It is not uncommon that a slave cease to respond temporarily, or reboot, etc. As described in this bug initial description, "Furthermore, if the slave dies, the locked channel object still exists in the master JVM.". I guess we are probably experiencing something like this. The SCM polling getting hung is just the most obvious symptom here.

Lars Kruse added a comment - 2011-10-19 09:59

Have a look here for at step-by-step description of the work-around http://howto.praqma.net/hudson/jenkins-5413-workaround

Lars Kruse added a comment - 2011-10-19 09:59 Have a look here for at step-by-step description of the work-around http://howto.praqma.net/hudson/jenkins-5413-workaround

Jes Struck added a comment - 2011-10-19 10:06

I just started to do a studie of why this happens. If there are any, there have bullet prof scenario to force this behaviore please do tell. Because i cannot reproduce this consistent

Jes Struck added a comment - 2011-10-19 10:06 I just started to do a studie of why this happens. If there are any, there have bullet prof scenario to force this behaviore please do tell. Because i cannot reproduce this consistent

Jes Struck added a comment - 2011-12-05 15:17

we had some issues in our scm plugin, that resulted in the polling thread hanging.
we had three things
1)Our poll-log that was send between our slaves and the master, gave us multiple threads wrting to the same appender.
2)We also saw that some uncaught exception resulted in hanging thread.
3)We have also seen threads hang becaus of a field that where declared transient, which could not be declared transient.

after we solved thees issue in our plugin we have not seen threads hang anymore
and therfore we cant reproduce this scenarioe

Jes Struck added a comment - 2011-12-05 15:17 we had some issues in our scm plugin, that resulted in the polling thread hanging. we had three things 1)Our poll-log that was send between our slaves and the master, gave us multiple threads wrting to the same appender. 2)We also saw that some uncaught exception resulted in hanging thread. 3)We have also seen threads hang becaus of a field that where declared transient, which could not be declared transient. after we solved thees issue in our plugin we have not seen threads hang anymore and therfore we cant reproduce this scenarioe

Vitalii Grygoruk added a comment - 2011-12-05 15:57

Can you share fixed version of plugin for test?

Vitalii Grygoruk added a comment - 2011-12-05 15:57 Can you share fixed version of plugin for test?

Jes Struck added a comment - 2011-12-05 19:12

vi are doing stress tests on the fixed version, and exspect to release the new version of our plugin wednesday or thursday

Jes Struck added a comment - 2011-12-05 19:12 vi are doing stress tests on the fixed version, and exspect to release the new version of our plugin wednesday or thursday

Christian Wolfgang added a comment - 2011-12-16 21:24

Hello.

We have released our plugin, the source can be found at https://github.com/jenkinsci/clearcase-ucm-plugin.

Some of the things, that made the polling hang on the slaves were mainly uncaught exception. If they weren't caught, they sometimes resulted in slaves hanging.
And, as Jes wrote, we experienced a transient SimpleDateFormat causing the slaves to hang and this only happened in the polling phase.

Our main problem, which does not only concern the polling, is cleartool, which, from time to time, stops working due to many reasons.
It exits with the error message "albd_contact call failed: RPC: Unable to receive; errno = [WINSOCK] Connection reset by peer",
but it never returns the control to the master, which results in slaves hanging.

Our plugin is currently only available for the windows platform and we've had a lot of issues with the desktop heap size.
ClearCase needs a larger heap size than the default setting(512kb), so setting it to a larger value decreases the number of simultaneous desktops,
which sometimes caused the slave OS to freeze. If the value is too low, cleartool sometimes fails with the winsock error as mentioned before.

The conclusion is, make sure the thrown exceptions are caught and the serializable classes have proper transient fields.

I am not saying this is bullet proof, but as far as our tests goes, we haven't experienced the issue yet.

Our test setup is 15 jobs polling every minute using one slave with two executors. ClearCase crashes before anything else happens.

Christian Wolfgang added a comment - 2011-12-16 21:24 Hello. We have released our plugin, the source can be found at https://github.com/jenkinsci/clearcase-ucm-plugin . Some of the things, that made the polling hang on the slaves were mainly uncaught exception. If they weren't caught, they sometimes resulted in slaves hanging. And, as Jes wrote, we experienced a transient SimpleDateFormat causing the slaves to hang and this only happened in the polling phase. Our main problem, which does not only concern the polling, is cleartool, which, from time to time, stops working due to many reasons. It exits with the error message "albd_contact call failed: RPC: Unable to receive; errno = [WINSOCK] Connection reset by peer", but it never returns the control to the master, which results in slaves hanging. Our plugin is currently only available for the windows platform and we've had a lot of issues with the desktop heap size. ClearCase needs a larger heap size than the default setting(512kb), so setting it to a larger value decreases the number of simultaneous desktops, which sometimes caused the slave OS to freeze. If the value is too low, cleartool sometimes fails with the winsock error as mentioned before. The conclusion is, make sure the thrown exceptions are caught and the serializable classes have proper transient fields. I am not saying this is bullet proof, but as far as our tests goes, we haven't experienced the issue yet. Our test setup is 15 jobs polling every minute using one slave with two executors. ClearCase crashes before anything else happens.

Markus added a comment - 2012-01-12 09:18

We've been using the above suggested Groovy script for some time to avoid this problem, but as of 1.446 the script fails with "Remote call on CLI channel from /[ip] failed" (~~JENKINS-12302~~). Anyone else have the that problem?

Markus added a comment - 2012-01-12 09:18 We've been using the above suggested Groovy script for some time to avoid this problem, but as of 1.446 the script fails with "Remote call on CLI channel from / [ip] failed" ( JENKINS-12302 ). Anyone else have the that problem?

Franck Gilliers added a comment - 2012-02-08 10:14

Hello,

I experience the same issue.
If it can help, i describe my configuration:

master : linux - jenkins 1.448
slaves : windows XP and seven via a service - msysgit 1.7.4
plugin git : 1.1.15
projects ~ 100

I trigger a build via the push notification as describe in git plugin 1.1.14. But, as it does not always trigger the build (i do not know why), i keep a SCM polling every two hours.
To avoid the hanging, every night I restart the server and reboot slaves. During daytime, i have to kill the children git processes to free the slave

Franck Gilliers added a comment - 2012-02-08 10:14 Hello, I experience the same issue. If it can help, i describe my configuration: master : linux - jenkins 1.448 slaves : windows XP and seven via a service - msysgit 1.7.4 plugin git : 1.1.15 projects ~ 100 I trigger a build via the push notification as describe in git plugin 1.1.14. But, as it does not always trigger the build (i do not know why), i keep a SCM polling every two hours. To avoid the hanging, every night I restart the server and reboot slaves. During daytime, i have to kill the children git processes to free the slave

Christian Wolfgang added a comment - 2012-02-08 11:21

I see two branches of this issue:

1) The fact that slaves gets hung(or threads winds up waiting for lengthy polling) and how the master Jenkins instance should handle this, and
2) How to prevent slaves from getting hung.

The initial issue suggests 1), but some of the replies suggests 2).
I guess both are valid issues. Should they be treated as one or should this issue be split up in two?

Christian Wolfgang added a comment - 2012-02-08 11:21 I see two branches of this issue: 1) The fact that slaves gets hung(or threads winds up waiting for lengthy polling) and how the master Jenkins instance should handle this, and 2) How to prevent slaves from getting hung. The initial issue suggests 1), but some of the replies suggests 2). I guess both are valid issues. Should they be treated as one or should this issue be split up in two?

Guido Serra added a comment - 2012-03-23 11:46

Hi, I got the same issue:

Jenkins GIT 1.1.16
Slave: Windows 7, msysgit (Git-1.7.9-preview20120201.exe)

After I moved the SCM from SVN to the Git solution, the poll/build stopped working

This is how I got the windows machine being able to checkout git with ssh/publickey:

http://guidoserra.it/archivi/2012/03/22/jenkins-msysgit-publickey/

p.s. I was even thinking of using Fisheye to trigger the build on code change detection

Guido Serra added a comment - 2012-03-23 11:46 Hi, I got the same issue: Jenkins GIT 1.1.16 Slave: Windows 7, msysgit (Git-1.7.9-preview20120201.exe) After I moved the SCM from SVN to the Git solution, the poll/build stopped working This is how I got the windows machine being able to checkout git with ssh/publickey: http://guidoserra.it/archivi/2012/03/22/jenkins-msysgit-publickey/ p.s. I was even thinking of using Fisheye to trigger the build on code change detection

Christian Wolfgang added a comment - 2012-04-16 12:36

We have solved our problems now.

It turned out, that the underlying framework for the plugin threw RuntimeExceptions which were not catched all the time. After we handled those exceptions the slaves stopped hanging.

Christian Wolfgang added a comment - 2012-04-16 12:36 We have solved our problems now. It turned out, that the underlying framework for the plugin threw RuntimeExceptions which were not catched all the time. After we handled those exceptions the slaves stopped hanging.

Joe Hansche added a comment - 2012-06-08 19:05

wolfgang: by "plugin" you're referring to the clearcase plugin, right? So that was the issue with your plugin, but not necessarily the issue with the slaves hanging in general? Although potentially related, I guess?

So if the SCM polling plugin raises a RuntimeException, the slave thread will die off without notifying the master, and therefore the master continues waiting for it to finish, even though it never will?

Joe Hansche added a comment - 2012-06-08 19:05 wolfgang : by "plugin" you're referring to the clearcase plugin, right? So that was the issue with your plugin, but not necessarily the issue with the slaves hanging in general? Although potentially related, I guess? So if the SCM polling plugin raises a RuntimeException, the slave thread will die off without notifying the master, and therefore the master continues waiting for it to finish, even though it never will?

Alex Lorenz added a comment - 2012-06-27 11:34

This does not only happen on slaves, but also on single machine Jenkins systems.
With us here at TomTom, it happens regularly and makes us lose valuable builds.

Escalate -> Critical

Alex Lorenz added a comment - 2012-06-27 11:34 This does not only happen on slaves, but also on single machine Jenkins systems. With us here at TomTom, it happens regularly and makes us lose valuable builds. Escalate -> Critical

Christian Wolfgang added a comment - 2012-06-27 14:07

Joe: Yes, the ClearCase UCM plugin. We experienced the slaves to hang when having uncaught runtime exceptions and thus the master's polling thread will never be joined.

Christian Wolfgang added a comment - 2012-06-27 14:07 Joe: Yes, the ClearCase UCM plugin. We experienced the slaves to hang when having uncaught runtime exceptions and thus the master's polling thread will never be joined.

Mandeep Rai added a comment - 2012-09-04 17:04 - edited

I modified the script a little bit:

Jenkins.instance.getTrigger("SCMTrigger").getRunners().each()
{
  item ->
  println(item.getTarget().name)
  println(item.getDuration())
  println(item.getStartTime())
  long millis = Calendar.instance.time.time - item.getStartTime()

  if(millis > (1000 * 60 * 3)) // 1000 millis in a second * 60 seconds in a minute * 3 minutes
  {
    Thread.getAllStackTraces().keySet().each()
    { 
      tItem ->
      if (tItem.getName().contains("SCM polling") && tItem.getName().contains(item.getTarget().name))
      { 
        println "Interrupting thread " + tItem.getName(); 
        tItem.interrupt()
      }
    }
  }
}

Mandeep Rai added a comment - 2012-09-04 17:04 - edited I modified the script a little bit: Jenkins.instance.getTrigger( "SCMTrigger" ).getRunners().each() { item -> println(item.getTarget().name) println(item.getDuration()) println(item.getStartTime()) long millis = Calendar.instance.time.time - item.getStartTime() if (millis > (1000 * 60 * 3)) // 1000 millis in a second * 60 seconds in a minute * 3 minutes { Thread .getAllStackTraces().keySet().each() { tItem -> if (tItem.getName().contains( "SCM polling" ) && tItem.getName().contains(item.getTarget().name)) { println "Interrupting thread " + tItem.getName(); tItem.interrupt() } } } }

lacostej added a comment - 2013-01-14 20:15

I encountered a very similar issue, yet I have a slightly different setup:

1 master 1 slave
yet the polling was stuck on the master only
SCM polling hanging (warning displayed in Jenkins configure screen). Oldest hanging thread is more than 2 days old.
it seems it all started with a Unix process that in some way never returned:

 ps -aef| grep jenkins 
  300 12707     1   0 26Nov12 ??       1336:33.09 /usr/bin/java -Xmx1024M -XX:MaxPermSize=128M -jar /Applications/Jenkins/jenkins.war
  300 98690 12707   0 Sat03PM ??         0:00.00 git fetch -t https://github.com/jenkinsci/testflight-plugin.git +refs/heads/*:refs/remotes/origin/*
  300 98692 98690   0 Sat03PM ??         4:39.72 git-remote-https https://github.com/jenkinsci/testflight-plugin.git https://github.com/jenkinsci/testflight-plugin.git
    0  3371  3360   0  8:20PM ttys000    0:00.02 su jenkins
  300  4017  3372   0  8:52PM ttys000    0:00.00 grep jenkins
    0 10920 10896   0 19Nov12 ttys001    0:00.03 login -pfl jenkins /bin/bash -c exec -la bash /bin/bash

Running Jenkins 1.479

I killed the processes and associated threads, and it started being better.

Polling doesn't enfore timeouts ?

lacostej added a comment - 2013-01-14 20:15 I encountered a very similar issue, yet I have a slightly different setup: 1 master 1 slave yet the polling was stuck on the master only SCM polling hanging (warning displayed in Jenkins configure screen). Oldest hanging thread is more than 2 days old. it seems it all started with a Unix process that in some way never returned: ps -aef| grep jenkins 300 12707 1 0 26Nov12 ?? 1336:33.09 /usr/bin/java -Xmx1024M -XX:MaxPermSize=128M -jar /Applications/Jenkins/jenkins.war 300 98690 12707 0 Sat03PM ?? 0:00.00 git fetch -t https: //github.com/jenkinsci/testflight-plugin.git +refs/heads/*:refs/remotes/origin/* 300 98692 98690 0 Sat03PM ?? 4:39.72 git-remote-https https: //github.com/jenkinsci/testflight-plugin.git https://github.com/jenkinsci/testflight-plugin.git 0 3371 3360 0 8:20PM ttys000 0:00.02 su jenkins 300 4017 3372 0 8:52PM ttys000 0:00.00 grep jenkins 0 10920 10896 0 19Nov12 ttys001 0:00.03 login -pfl jenkins /bin/bash -c exec -la bash /bin/bash Running Jenkins 1.479 I killed the processes and associated threads, and it started being better. Polling doesn't enfore timeouts ?

Raja Aluri added a comment - 2013-02-08 18:31

For people who are on windows and want to setup a scheduled task. Here is a oneliner in powershell.

PS C:\Users\jenkins>
PS C:\Users\jenkins>
PS C:\Users\jenkins>
PS C:\Users\jenkins> tasklist /FI "IMAGENAME eq ssh.exe" /FI "Status eq Unknown" /NH | %{ $_.Split(' *',[StringSplitOptions]"RemoveEmptyEntries")[1]}  |ForEach-Object {taskkill /F /PID $_}
PS C:\Users\jenkins>
PS C:\Users\jenkins>
PS C:\Users\jenkins>
PS C:\Users\jenkins>

Raja Aluri added a comment - 2013-02-08 18:31 For people who are on windows and want to setup a scheduled task. Here is a oneliner in powershell. PS C:\Users\jenkins> PS C:\Users\jenkins> PS C:\Users\jenkins> PS C:\Users\jenkins> tasklist /FI "IMAGENAME eq ssh.exe" /FI "Status eq Unknown" /NH | %{ $_.Split( ' *' ,[StringSplitOptions] "RemoveEmptyEntries" )[1]} |ForEach- Object {taskkill /F /PID $_} PS C:\Users\jenkins> PS C:\Users\jenkins> PS C:\Users\jenkins> PS C:\Users\jenkins>

Linards L added a comment - 2013-03-08 09:17 - edited

Have not noticed that for long time now. But .. in the v1.48x version series I had similar problems, but then they dissapeared. Remember we also did some scheduled rebooting temperings ... now using v1.494.

Linards L added a comment - 2013-03-08 09:17 - edited Have not noticed that for long time now. But .. in the v1.48x version series I had similar problems, but then they dissapeared. Remember we also did some scheduled rebooting temperings ... now using v1.494.

Derek Seibert added a comment - 2013-03-25 13:37 - edited

Our team encounters this issue almost daily using the Dimensions SCM plugin. We run a single instance Jenkins server which polls a Stream ever 30 minutes. I wanted to comment just to explain how we first noticed the issue was occurring in case anybody searching this issue starts in the same place.

We select the Dimensions Polling Log for our job and see the following at maybe 9AM or 10AM. The polling has hung at this point and we need to restart our application server.

Started on Mar 25, 2013 8:30:35 AM

We expect to see something like.

Started on Mar 25, 2013 8:30:35 AM
Done. Took 19 sec
No changes

This is why this issue is so troubling. There is no notification trigger when "Started on..." has just been sitting there hung for a while, and no further polling can be done by that job without a restart of the application server.

Derek Seibert added a comment - 2013-03-25 13:37 - edited Our team encounters this issue almost daily using the Dimensions SCM plugin. We run a single instance Jenkins server which polls a Stream ever 30 minutes. I wanted to comment just to explain how we first noticed the issue was occurring in case anybody searching this issue starts in the same place. We select the Dimensions Polling Log for our job and see the following at maybe 9AM or 10AM. The polling has hung at this point and we need to restart our application server. Started on Mar 25, 2013 8:30:35 AM We expect to see something like. Started on Mar 25, 2013 8:30:35 AM Done. Took 19 sec No changes This is why this issue is so troubling. There is no notification trigger when "Started on..." has just been sitting there hung for a while, and no further polling can be done by that job without a restart of the application server.

lacostej added a comment - 2013-03-25 14:41

Derek,

Not sure if Dimensions plugin is using a native call or not hunder the hood.

Could you make a threaddump and/or a list of processes ?

J

lacostej added a comment - 2013-03-25 14:41 Derek, Not sure if Dimensions plugin is using a native call or not hunder the hood. Could you make a threaddump and/or a list of processes ? J

Manuel de la Peña added a comment - 2013-04-16 06:24

We are using "Github Pull Request Builder" plugin and we encounter this issue daily :S

Manuel de la Peña added a comment - 2013-04-16 06:24 We are using "Github Pull Request Builder" plugin and we encounter this issue daily :S

Jenkins

Details

Description

Attachments

Attachments

Issue Links

Activity

Collapse comment: rsteele added a comment - 2010-02-01 11:30

Expand comment: rsteele added a comment - 2010-02-01 11:30

Collapse comment: Carl Quinn added a comment - 2010-02-25 22:46

Expand comment: Carl Quinn added a comment - 2010-02-25 22:46

Collapse comment: mdillon added a comment - 2010-02-26 14:38

Expand comment: mdillon added a comment - 2010-02-26 14:38

Collapse comment: mdillon added a comment - 2010-02-26 14:39

Expand comment: mdillon added a comment - 2010-02-26 14:39

Collapse comment: dshields777 added a comment - 2010-03-09 11:22

Expand comment: dshields777 added a comment - 2010-03-09 11:22

Collapse comment: Dean Yu added a comment - 2010-03-09 22:42

Expand comment: Dean Yu added a comment - 2010-03-09 22:42

Collapse comment: wgracelee added a comment - 2010-03-29 08:31

Expand comment: wgracelee added a comment - 2010-03-29 08:31

Collapse comment: daniel_franzen added a comment - 2010-04-06 23:03

Expand comment: daniel_franzen added a comment - 2010-04-06 23:03

Collapse comment: Tony Sweeney added a comment - 2010-04-29 11:35, Edited by Tony Sweeney - 2010-04-29 11:38

Expand comment: Tony Sweeney added a comment - 2010-04-29 11:35, Edited by Tony Sweeney - 2010-04-29 11:38

Collapse comment: lkishalmi added a comment - 2010-04-29 13:15

Expand comment: lkishalmi added a comment - 2010-04-29 13:15

Collapse comment: lkishalmi added a comment - 2010-04-29 13:17

Expand comment: lkishalmi added a comment - 2010-04-29 13:17

Collapse comment: jpadmana added a comment - 2010-05-13 15:47

Expand comment: jpadmana added a comment - 2010-05-13 15:47

Collapse comment: ericguz added a comment - 2010-06-09 05:36

Expand comment: ericguz added a comment - 2010-06-09 05:36

Collapse comment: Christian Bremer added a comment - 2010-06-11 07:42

Expand comment: Christian Bremer added a comment - 2010-06-11 07:42

Collapse comment: jpshackelford added a comment - 2010-06-28 09:03, Edited by jpshackelford - 2010-06-28 09:07

Expand comment: jpshackelford added a comment - 2010-06-28 09:03, Edited by jpshackelford - 2010-06-28 09:07

Collapse comment: jpshackelford added a comment - 2010-06-28 10:08, Edited by jpshackelford - 2010-06-28 10:10

Expand comment: jpshackelford added a comment - 2010-06-28 10:08, Edited by jpshackelford - 2010-06-28 10:10

Collapse comment: carlspring added a comment - 2010-07-06 08:51

Expand comment: carlspring added a comment - 2010-07-06 08:51

Collapse comment: carlspring added a comment - 2010-07-06 09:04

Expand comment: carlspring added a comment - 2010-07-06 09:04

Collapse comment: vjuranek added a comment - 2010-07-08 02:39

Expand comment: vjuranek added a comment - 2010-07-08 02:39

Collapse comment: carlspring added a comment - 2010-07-08 03:13

Expand comment: carlspring added a comment - 2010-07-08 03:13

Collapse comment: Hans-Juergen Hafner added a comment - 2010-08-09 06:18, Edited by Hans-Juergen Hafner - 2010-08-09 06:19

Expand comment: Hans-Juergen Hafner added a comment - 2010-08-09 06:18, Edited by Hans-Juergen Hafner - 2010-08-09 06:19

Collapse comment: vjuranek added a comment - 2010-08-10 13:41

Expand comment: vjuranek added a comment - 2010-08-10 13:41

Collapse comment: Hans-Juergen Hafner added a comment - 2010-08-13 04:49

Expand comment: Hans-Juergen Hafner added a comment - 2010-08-13 04:49

Collapse comment: erwan_q added a comment - 2010-09-13 15:23

Expand comment: erwan_q added a comment - 2010-09-13 15:23

Collapse comment: lkishalmi added a comment - 2010-11-04 00:37

Expand comment: lkishalmi added a comment - 2010-11-04 00:37

Collapse comment: Dean Yu added a comment - 2010-11-18 10:53

Expand comment: Dean Yu added a comment - 2010-11-18 10:53

Collapse comment: eguess74 added a comment - 2011-03-09 19:19, Edited by eguess74 - 2011-03-09 19:20

Expand comment: eguess74 added a comment - 2011-03-09 19:19, Edited by eguess74 - 2011-03-09 19:20

Collapse comment: eguess74 added a comment - 2011-03-09 20:33, Edited by eguess74 - 2011-03-09 22:51

Expand comment: eguess74 added a comment - 2011-03-09 20:33, Edited by eguess74 - 2011-03-09 22:51

Collapse comment: eguess74 added a comment - 2011-03-10 15:09

Expand comment: eguess74 added a comment - 2011-03-10 15:09

Collapse comment: Stephen Connolly added a comment - 2011-06-07 08:38

Expand comment: Stephen Connolly added a comment - 2011-06-07 08:38

Collapse comment: Thomas Mielke added a comment - 2011-07-19 16:23, Edited by Thomas Mielke - 2011-07-19 16:32

Expand comment: Thomas Mielke added a comment - 2011-07-19 16:23, Edited by Thomas Mielke - 2011-07-19 16:32

Collapse comment: Frederic Pesquet added a comment - 2011-09-08 07:54

Expand comment: Frederic Pesquet added a comment - 2011-09-08 07:54

Collapse comment: carlspring added a comment - 2011-09-08 08:13

Expand comment: carlspring added a comment - 2011-09-08 08:13

Collapse comment: Frederic Pesquet added a comment - 2011-09-08 14:05

Expand comment: Frederic Pesquet added a comment - 2011-09-08 14:05

Collapse comment: Lars Kruse added a comment - 2011-10-19 09:59

Expand comment: Lars Kruse added a comment - 2011-10-19 09:59

Collapse comment: Jes Struck added a comment - 2011-10-19 10:06

Expand comment: Jes Struck added a comment - 2011-10-19 10:06