-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Windows 7 SP1 x64 master
Windows 7 SP1 x64 slave
connection over JNLP agent
-
Projects on slave builders with Perforce SCM hang after several hours. The Perforce workspaces are configured as permanent, with only one checkbox enabled:
"Don't update client workspace".
Main log:
Sep 26, 2012 1:16:28 PM hudson.plugins.perforce.PerforceSCM getEffectiveClientName
WARNING: Could not get hostname for slave <SlaveName>
Polling log:
Started on Sep 26, 2012 1:19:27 PM
Looking for changes...
Using node: Builder
Using remote perforce client: <ws_name>
(the log terminates here)
[JENKINS-15315] Slave polling hungup
Thread dump is attached.
Perforce Polling Log
Started on Sep 27, 2012 9:31:26 PM
Looking for changes...
Using node: Builder
Using remote perforce client: Alexey.Larsky_NM_v01_builder
...nothing here...
It looks like your p4 client executable is hanging for some reason. Double check all your settings, and make sure you are using a client version that matches your server version. Also make sure that you aren't using any special characters in your workspace name.
Thanks Rob.
I have updated p4 from 2010.1 to P4/NTX64/2012.1/490371 (2012/07/02) on both server and client. The workspace names use only dots and underscores.
I will check whether the issue still occurs on the updated version.
I have also experienced this problem a few times. The log only contains:
Perforce Polling Log
Started on Oct 4, 2012 12:18:23 PM
Looking for changes...
Using node: cphwrk0249
Using remote perforce client: jenkins_bsp-trunk--379981060
I can't find any p4.exe processes running on cphwrk0249.
Two of the times I experienced this were on days when I know that the network connection on cphwrk0249 had been disconnected for several minutes during the day.
perforce client is version P4/NTX64/2012.1/442152 (2012/04/06)
server is P4D/LINUX26X86/2012.1/518826 (2012/08/30)
I got the hang on the slave again. P4 versions: latest 2012.2.
Attaching a new thread dump.
Uploading "Thread dump [Jenkins] (2012-10-12 14-00-59).htm"
Thread dump for the p4 polling hang on slaves.
It's hanging in IO, which to me suggests a network problem. I'm not sure what else I can do aside from adding some kind of timeout. :/
But a network problem is an ordinary situation between two computers (master and slave(s)),
and it mustn't affect future builds. In other words, the task should be able to recover after a network error.
Thank you
Yeah, I agree, but the remoting API is supposed to take care of those details. If there is a connection issue, it should be failing outright instead of hanging.
Actually, can you post your java, jenkins, and perforce plugin versions?
Hello Robert,
This issue is quite painful for our users, so I would like to fix it. It has been reproduced on the latest plugin version (Jenkins version 1.480.3, java version "1.7.0_19", OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)) and on several previous versions as well.
According to the stack traces, P4 hangs in BufferedReader::readLine(), which waits indefinitely for a new line or EOF. I suppose that the P4 command-line client either finishes before the call to readLine() or somehow enters interactive mode and waits for user input. However, I can't reproduce the issue on my test stand with slaves under a debugger. Restarting the slave fixes the problem.
There are 84 BufferedReader::readLine() calls in the p4 plugin, so we can't just fix the getPerforceResponse() function.
Possible solutions:
• Add something like timeouts to checkout() and other top-level overrides. BTW, that is just a workaround that gives prompt notification, because only a restart fixes the origin of the issue.
• Add a timeout to getPerforceResponse() only and wait for other errors (yeah, just fix the known issue).
• Replace the BufferedReaders with a wrapper that knows how to handle the issue.
I'm going to implement the second approach. It will be possible to configure the timeout via the Perforce global configuration.
Best regards,
Oleg Nenashev
R&D Engineer, Synopsys Inc.
www.synopsys.com
The timeout needs to be reset every time a line comes through. As I've mentioned, some users have a very large amount of data that can take hours to sync, so it shouldn't time out an operation if it's clearly still doing something.
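A minimal sketch of the per-line reset described above, not the plugin's actual code. The class name IdleDeadline and the injectable clock are hypothetical, introduced only for illustration. A read loop would call touch() for every line it receives and give up only once isExpired() returns true, so a sync that keeps producing output for hours never times out.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical helper: tracks how long it has been since the last line of p4 output. */
final class IdleDeadline {
    private final long timeoutNanos;
    private final LongSupplier nanoClock;      // injectable so the logic can be tested without sleeping
    private volatile long lastActivityNanos;

    IdleDeadline(long timeout, TimeUnit unit, LongSupplier nanoClock) {
        this.timeoutNanos = unit.toNanos(timeout);
        this.nanoClock = nanoClock;
        this.lastActivityNanos = nanoClock.getAsLong();
    }

    IdleDeadline(long timeout, TimeUnit unit) {
        this(timeout, unit, System::nanoTime);
    }

    /** Call once per received line; this pushes the deadline forward. */
    void touch() {
        lastActivityNanos = nanoClock.getAsLong();
    }

    /** True only if no line has arrived for longer than the whole timeout window. */
    boolean isExpired() {
        return nanoClock.getAsLong() - lastActivityNanos > timeoutNanos;
    }
}
```

The point of the design is that the timeout measures idle time, not total run time of the command.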
Yes, I agree with you.
Hope to finish testing and create pull request today.
Code changed in jenkins
User: Oleg Nenashev
Path:
src/main/java/com/tek42/perforce/parse/AbstractPerforceTemplate.java
src/main/java/hudson/plugins/perforce/PerforceSCM.java
http://jenkins-ci.org/commit/perforce-plugin/b8b0115c5b630566ea2473ad6ced2f0769cc0c7b
Log:
Added optional timeout to com.tek42.perforce.parse.AbstractPerforceTemplate::getPerforceResponse()
Should prevent hanging of p4 checkout in case of https://issues.jenkins-ci.org/browse/JENKINS-15315
Signed-off-by: Oleg Nenashev <nenashev@synopsys.com>
Code changed in jenkins
User: Rob Petti
Path:
src/main/java/com/tek42/perforce/parse/AbstractPerforceTemplate.java
src/main/java/com/tek42/perforce/process/CmdLineExecutor.java
src/main/java/com/tek42/perforce/process/Executor.java
src/main/java/hudson/plugins/perforce/HudsonP4DefaultExecutor.java
src/main/java/hudson/plugins/perforce/HudsonP4RemoteExecutor.java
src/main/java/hudson/plugins/perforce/PerforceSCM.java
src/main/resources/hudson/plugins/perforce/PerforceSCM/global.jelly
src/main/webapp/help/p4ReadLineTimeout.html
http://jenkins-ci.org/commit/perforce-plugin/043a336b1afc14c1e3c1ce6e29e570d3ae09f592
Log:
Merge pull request #32 from synopsys-arc-oss/p4-hangs-issue-workaround
"Slave polling hangup" issue workaround (JENKINS-15315)
Compare: https://github.com/jenkinsci/perforce-plugin/compare/1fc190959170...043a336b1afc
This change is in 1.3.25; should the JIRA ticket be resolved, or are you still planning some further fixes?
I've had a colleague report intermittent issues with functionality related to this, so I wouldn't call it resolved just yet. It also seems like the timeout code can still deadlock, since ready() does not guarantee that the next readLine() won't block.
Yes, ready() guarantees only that the next read() will not block.
I haven't experienced unterminated hangs since the PR, but I agree with Robert. The issue has not been fixed.
What do you mean by "intermittent issues", Rob? Could you update the issue?
Not sure if they are related, but retrieving the perforce response sometimes results in no data being received since 1.3.25:
Caught exception communicating with perforce. Problem getting user information for <USER>
com.tek42.perforce.PerforceException: Problem getting user information for <USER>
    at hudson.plugins.perforce.PerforceSCM.retrieveUserInformation(PerforceSCM.java:711)
    at hudson.plugins.perforce.PerforceSCM.checkout(PerforceSCM.java:994)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1369)
    at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:676)
    at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:581)
    at hudson.model.Run.execute(Run.java:1576)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:241)
Caused by: com.tek42.perforce.PerforceException: No output for: /usr/local/bin/p4 user -o <USER>
    at com.tek42.perforce.parse.AbstractPerforceTemplate.getPerforceResponse(AbstractPerforceTemplate.java:434)
    at com.tek42.perforce.parse.AbstractPerforceTemplate.getPerforceResponse(AbstractPerforceTemplate.java:298)
    at com.tek42.perforce.parse.Users.getUser(Users.java:56)
    at hudson.plugins.perforce.PerforceSCM.retrieveUserInformation(PerforceSCM.java:709)
    ... 9 more
ERROR: Unable to communicate with perforce. Problem getting user information for <USER>
Ah, yeah, the read loop starts with:
while (reader.ready() || p4.isAlive())
It's entirely possible that when the loop starts, no data has been sent back by the remote slave yet, so reader.ready() returns false, no data is read, and the plugin throws that error.
FYI, a "No output for" issue is filed as JENKINS-15904; unsure if there is any relation.
Which executor do you use in that case?
With HudsonP4RemoteExecutor, isAlive() runs after exec(), so currentProcess should be available and p4.isAlive() should return true from the start.
Therefore, only one case is possible, in which both of the following hold:
- The process has already completed
- ... but no data has been received yet
I've tried remote hosts with ~1 second delays, but I have not managed to reproduce your case.
Anyway, I should start with automated tests for Perforce checkout operations that can accept various global configurations.
HudsonP4RemoteExecutor:
    @Override
    public boolean isAlive() throws IOException, InterruptedException {
        return currentProcess != null ? currentProcess.isAlive() : false;
    }
RemoteProc:
    @Override
    public boolean isAlive() throws IOException, InterruptedException {
        return !process.isDone();
    }
@Jesse That one is different. It's about the remote executor not passing OS-level exceptions back to Jenkins, and instead just closing the pipe as if nothing is wrong. The plugin sees that no data comes back, and throws that error instead of the actual exception (Cannot run program).
@Oleg all remote operations are using the remote executor. I'm not sure if this happens on the master, but it definitely occurs on the slaves. The only thing I can think of is that it may be possible for a remote process to register as being terminated before the data actually becomes available on the pipe, assuming the buffer is large enough for all the data being returned by the command. That would explain why it's failing on relatively small operations, such as p4 user and p4 users, since they terminate quite quickly compared to things such as syncs, and return only a small amount of data.
It may be necessary to remove the loop condition, and just break once we know that the pipe is closed or the timeout has been reached.
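To illustrate the suggestion above, here is a rough sketch (an assumption, not the plugin's code) of a read loop that has no ready()/isAlive() condition and simply reads until the pipe reports EOF. DrainUntilEof and readAllLines are hypothetical names. Because the loop ends only on EOF, output that arrives after the process has already been reported as finished is still collected. readLine() still blocks here, so any timeout would have to come from somewhere else, for example from closing the stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: collect command output until the stream is closed. */
final class DrainUntilEof {
    static List<String> readAllLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        try {
            String line;
            // readLine() returns null only at EOF, i.e. once the writing side has closed the pipe.
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Also reached if another thread closes the stream to enforce a timeout.
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```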
It seems like the problem is with the reader.ready() call. Apparently this never becomes true, even when there is data on the pipe. I tried to just check for this before reading, but it hangs indefinitely. Apart from reading from the raw InputStream, I can't see any other way of handling this. :/
Another approach: we could add a wrapper around the launcher's IO streams and monitor their activity from an external thread, which can interrupt the launcher and close the stream. An external thread is a significant overhead, but it could be a general approach for all external calls in the P4 plugin.
BTW, reader.ready() works for me (local and remote Windows slave).
I'm testing on a Linux master, and all small operations fail to return ready as true. I already tried using a watchdog thread, but read cannot be interrupted at all. We would need to write our own reader so we can manipulate the stream directly.
It seems like InputStream.available() is always returning 0 as well... I'm at a total loss now. We might not have any choice but to back out the changes.
InputStream.available() returns 0 by default. Several child classes like BufferedInputStream override this method.
What about using a Future wrapper? StackOverflow has several samples: http://stackoverflow.com/questions/804951/is-it-possible-to-read-from-a-inputstream-with-a-timeout
P.S.: I suppose that using the newest P4Java versions could be the best solution for this issue (rather than a workaround), but that almost means rewriting from scratch.
Look at the comments for the answer that suggests Futures. If we used this, we'd leak threads every time Perforce hangs until Jenkins is restarted, since there's absolutely no way to interrupt a read operation in Java apart from killing the JVM entirely... I don't think this is an option here.
I rewrote the timeout functionality to spawn a thread that waits, then closes the underlying InputStream if no lines have been received for a while. This seems to work fine, at least on my system.
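The approach described above could look roughly like the following. This is only a sketch under stated assumptions, not the code that went into the plugin: StreamWatchdog, lineReceived(), finish() and checkOnce() are hypothetical names, and the clock is injectable purely so the idle logic can be exercised without real waiting. Whether closing a stream from another thread actually unblocks a pending read depends on the stream implementation.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.function.LongSupplier;

/** Hypothetical watchdog: closes a stream when no line has arrived within the idle window. */
final class StreamWatchdog implements Runnable {
    private final Closeable stream;
    private final long idleMillis;
    private final LongSupplier clockMillis;
    private volatile long lastLineAt;
    private volatile boolean finished;

    StreamWatchdog(Closeable stream, long idleMillis, LongSupplier clockMillis) {
        this.stream = stream;
        this.idleMillis = idleMillis;
        this.clockMillis = clockMillis;
        this.lastLineAt = clockMillis.getAsLong();
    }

    /** The reading thread calls this for every line it receives. */
    void lineReceived() {
        lastLineAt = clockMillis.getAsLong();
    }

    /** The reading thread calls this when the command has completed normally. */
    void finish() {
        finished = true;
    }

    /** One poll: closes the stream and returns true if the idle window has elapsed. */
    boolean checkOnce() {
        if (finished || clockMillis.getAsLong() - lastLineAt < idleMillis) {
            return false;
        }
        finished = true;
        try {
            stream.close();   // intended to make the blocked reader fail with an IOException
        } catch (IOException ignored) {
            // nothing more can be done from the watchdog thread
        }
        return true;
    }

    @Override
    public void run() {
        try {
            while (!finished) {
                checkOnce();
                Thread.sleep(Math.max(1L, idleMillis / 10));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In use, the watchdog would be started on a daemon thread with System::currentTimeMillis as the clock before the read loop begins, and the read loop would call finish() once it is done.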
Also, there's still no timeout on several of the perforce response methods being used by the plugin. Only one of them currently has a timeout.
I'm still having problems.
Slave polling p4 command hangs (on OSX slave), perforce plugin timeout feature isn't killing it.
Eventually every fork on the slave failed with EAGAIN.
dtruss shows nothing happening, netstat shows p4's sockets in SYN_SENT
Jenkins 1.569, plugin 1.3.27, p4 2013.3
Will update to 2014.1 and see if anything changes.
The issue has not been solved yet; the timeouts are not reliable.
BTW, an update to the new client version may work around the issue.
Running the same commands manually that are hanging, with the same credentials, shows no issues.
In my case, after leaving it polling for a week, netstat shows several thousand entries sitting in FIN_WAIT_2.
Version 1.31 is the most stable. 1.33 and 1.34 hang on slaves periodically.
Can you provide a threaddump when the hang occurs?
http://jenkinsurl/threadDump