Type: Bug
Resolution: Fixed
Priority: Major
Jenkins Version : 2.48
OS on Master : RHEL 5.4
OS on Slave : RHEL 6.6
Java version on slave : jdk1.7.0_80
We intermittently see the connection errors below while running jobs on node123.
The error in the build log is: Cannot contact node123: java.lang.InterruptedException
I don't see any error in the thread dump or in any other logs related to this node.
I also see that there was no connection drop between the master and the node.
The slave has been running for more than 24 hours now.
Attachments:
- jenkins.log.gz (12 kB)
- pipeline_log.txt (16 kB)
Issue links:
- is related to: JENKINS-45023 Channel#call() should reject requests if the channel is being closed (Resolved)
- relates to: JENKINS-61697 Cannot contact node: java.lang.InterruptedException (Closed)
[JENKINS-43038] Intermittent error "Cannot contact node123: java.lang.InterruptedException " in jenkins
The problem also persists on Ubuntu 16.04, Jenkins 2.32.3.
Unfortunately, I cannot find any exception stack trace.
While troubleshooting this, I used an SSH Jenkins slave running on the same server as the master. I managed to work around the problem by switching my jobs to run on the master instead of the slave.
OK. If you see no exception, please provide the full Jenkins system logs at least. Without such information I cannot triangulate the issue.
I've just uploaded my Jenkins log.
Please note that most exceptions in the log refer to the slave disconnecting and reconnecting.
The problem could be related to another problem that I reported recently:
https://issues.jenkins-ci.org/browse/JENKINS-43106
In that issue's description you can find more logs, including a thread dump, that may shed some light on the root cause.
I've managed to catch the exception which may shed some light on the problem.
Please review the attached pipeline_log.txt.
At the same time, the build log printed:
[Pipeline] stage
[Pipeline] { (Create GIT TAG)
[Pipeline] sh
10:45:26 [CISystem_generic@2] Running shell script
10:45:37 Cannot contact ##############: java.lang.InterruptedException
10:45:47 Cannot contact ##############: java.lang.InterruptedException
10:45:57 Cannot contact ##############: java.lang.InterruptedException
10:46:07 Cannot contact ##############: java.lang.InterruptedException
10:46:18 Cannot contact ##############: java.lang.InterruptedException
10:46:28 Cannot contact ##############: java.lang.InterruptedException
10:46:38 Cannot contact ##############: java.lang.InterruptedException
10:46:48 Cannot contact ##############: java.lang.InterruptedException
10:46:59 Cannot contact ##############: java.lang.InterruptedException
10:47:09 Cannot contact ##############: java.lang.InterruptedException
10:47:19 Cannot contact ##############: java.lang.InterruptedException
10:47:29 Cannot contact ##############: java.lang.InterruptedException
10:47:40 Cannot contact ##############: java.lang.InterruptedException
10:47:50 Cannot contact ##############: java.lang.InterruptedException
10:48:00 Cannot contact ##############: java.lang.InterruptedException
10:48:10 Cannot contact ##############: java.lang.InterruptedException
10:48:21 Cannot contact ##############: java.lang.InterruptedException
10:48:31 Cannot contact ##############: java.lang.InterruptedException
10:48:41 Cannot contact ##############: java.lang.InterruptedException
10:48:51 Cannot contact ##############: java.lang.InterruptedException
10:49:02 Cannot contact ##############: java.lang.InterruptedException
10:49:12 Cannot contact ##############: java.lang.InterruptedException
10:49:22 Cannot contact ##############: java.lang.InterruptedException
10:49:32 Cannot contact ##############: java.lang.InterruptedException
10:49:43 Cannot contact ##############: java.lang.InterruptedException
10:49:53 Cannot contact ##############: java.lang.InterruptedException
10:50:03 Cannot contact ##############: java.lang.InterruptedException
10:50:13 Cannot contact ##############: java.lang.InterruptedException
10:50:24 Cannot contact ##############: java.lang.InterruptedException
10:50:34 Cannot contact ##############: java.lang.InterruptedException
10:50:44 Cannot contact ##############: java.lang.InterruptedException
10:50:54 Cannot contact ##############: java.lang.InterruptedException
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
10:51:00 + git tag -a ############## -m Created by Jenkins
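As a rough way to quantify the outage in the console output above, the retry messages can be counted and timed. The following is a minimal sketch and not part of the original report: the function name `summarize_retries` and the regular expression are assumptions made for illustration, and it assumes all timestamps fall on the same day.

```python
import re
from datetime import datetime

# Matches console lines such as:
#   10:45:37 Cannot contact <node>: java.lang.InterruptedException
RETRY_LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) Cannot contact .*: java\.lang\.InterruptedException$"
)


def summarize_retries(console_text):
    """Return (count, seconds from first to last retry, gaps between retries)."""
    stamps = []
    for line in console_text.splitlines():
        match = RETRY_LINE.match(line.strip())
        if match:
            # Time-of-day only; a run that crosses midnight is not handled.
            stamps.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    gaps = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
    span = int((stamps[-1] - stamps[0]).total_seconds()) if stamps else 0
    return len(stamps), span, gaps


# A shortened copy of the excerpt above, used as sample input.
sample = """\
10:45:26 [CISystem_generic@2] Running shell script
10:45:37 Cannot contact ##############: java.lang.InterruptedException
10:45:47 Cannot contact ##############: java.lang.InterruptedException
10:45:57 Cannot contact ##############: java.lang.InterruptedException
10:51:00 + git tag -a ############## -m Created by Jenkins
"""

count, span, gaps = summarize_retries(sample)
print(count, span, gaps)
```

Applied to the full excerpt, this would report 32 retry messages between 10:45:37 and 10:50:54 (317 seconds), spaced 10 to 11 seconds apart.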
I am also seeing intermittent failures. I see the following message in the pipeline job's console output:
[BuildLocal] Cannot contact sjc-baas-paw-012.cisco.com: java.lang.InterruptedException
and the following message in the Jenkins logs:
Jun 30, 2017 12:51:43 AM WARNING org.jenkinsci.plugins.durabletask.ProcessLiveness isAlive
hudson.Launcher$RemoteLauncher@40518b87 on hudson.remoting.Channel@1f35abae:sjc-baas-paw-012.cisco.com does not seem able to determine whether processes are alive or not.
When I look at the slave node's log output, it reports no issues:
...
[06/29/17 22:32:23] [SSH] Checking java version of java
[06/29/17 22:32:23] [SSH] java -version returned 1.8.0_72.
[06/29/17 22:32:23] [SSH] Starting sftp client.
[06/29/17 22:32:23] [SSH] Copying latest slave.jar...
[06/29/17 22:32:23] [SSH] Copied 719,269 bytes.
Expanded the channel window size to 4MB
[06/29/17 22:32:23] [SSH] Starting slave process: /ws/sjap/baas/sw/packages/astro/master/0.7/20170630_052623/paw/bin/env_run.py /bin/sh -c 'cd "/nobackup/baas/jenkins/stage.astro.cisco.com" && java -jar slave.jar'
<===[JENKINS REMOTING CAPACITY]===>channel started
Slave.jar version: 3.7
This is a Unix agent
Evacuated stdout
[StartupTrigger] - Scanning jobs for node sjc-baas-paw-012.cisco.com
Agent successfully connected and online
I could not find any traces of errors anywhere else. I am using Jenkins LTS 2.60.1 and Java 1.8. Please let me know if I can provide any other info to debug this issue.
jglick, is this related to the recent change for built-in timeouts in Workflow?
Probably. Most likely, before that change this step would have just hung without seeming to progress.
We are having a big issue with our master: every build is slowed down. Builds take four times as long to go through, and this exception keeps appearing for steps that used to take a second. Those steps now take minutes to finish.
- All our jobs are pipeline scripted jobs
- No memory heap issues in master
- The situation is the same with any number of slaves. All slaves behave the same, and we think it is something in the master.
We are effectively blocked and teams are getting frustrated, unfortunately.
I am working on some remoting fixes which may address it. E.g. JENKINS-45023.
But generally there is no information that would allow diagnosing a root cause on the remoting side, if there is one. I need logs and stack traces from both the master and the agent for the time when the issue happens.
We are seeing this too. It usually happens when the slave is under high load. There is nothing in the log except "Cannot contact node123: java.lang.InterruptedException".
I took a look at the Jenkins system log right before the particular job failed and I saw a bunch of these log entries:
Process hudson.Launcher$RemoteLauncher$ProcImpl@1cfb4158 has not really finished after the join() method completion
Is there any way to enable extra logging on slaves?
To get diagnostic information, you need to run the test agent with the hudson.remoting.Command, hudson.remoting.Request, and hudson.remoting.Channel loggers set to the "FINER" logging level. You can enable this level globally, but that may generate huge logs, since the issue cannot be reproduced in a short test.
Remoting 3.11 should include some extra diagnostics for this case as part of JENKINS-45233.
JENKINS-45023 may also be a root cause of this issue if the agent goes offline.
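One possible way to apply the logger settings described above is a java.util.logging properties file. The following is a minimal sketch that generates such a fragment; only the three logger names and the FINER level come from this thread. The helper name `jul_properties`, the FileHandler choice, the log file pattern, and the size limits are assumptions made for illustration.

```python
# Logger names and level are taken from the comment above; the handler
# settings and file name pattern are assumptions for illustration only.
REMOTING_LOGGERS = [
    "hudson.remoting.Command",
    "hudson.remoting.Request",
    "hudson.remoting.Channel",
]


def jul_properties(loggers, level="FINER", pattern="remoting-%g.log"):
    """Build a java.util.logging properties fragment for the given loggers."""
    lines = [
        "handlers = java.util.logging.FileHandler",
        # The handler has its own threshold and must also let FINER records through.
        "java.util.logging.FileHandler.level = " + level,
        "java.util.logging.FileHandler.pattern = " + pattern,
        "java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter",
        # Rotate files to keep the output bounded, since FINER logs can grow large.
        "java.util.logging.FileHandler.limit = 10000000",
        "java.util.logging.FileHandler.count = 5",
    ]
    # Raise only the named loggers instead of changing the global level.
    lines += ["{}.level = {}".format(name, level) for name in loggers]
    return "\n".join(lines) + "\n"


print(jul_properties(REMOTING_LOGGERS))
```

A file with this content could be passed to a JVM through the standard `-Djava.util.logging.config.file=<path>` system property; whether that fits a particular agent launch method is not covered in this thread.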
Saw this same message in the middle of a build from an elastic provisioned Openstack node. Using a pipeline job with 7 parallel stages on 7 elastic nodes. Running Jenkins 2.89.1, Openstack plugin 2.29 and the latest pipeline plugins.
I got the same error, and I have the impression that Jenkins is unable to recover from it, because in more than 30 minutes I didn't get any progress message on the console. The slave.jar process is still running on that machine and there are no networking issues between the master and the slave. Does anyone know how to debug this further? Maybe we can narrow down the bug.
It is true that my Jenkins is 2.60.3, which seems a little bit old.
I see this issue as well during testing, where a single shell script can run for about 10-20 minutes.
I suppose it happens when the agent gets disconnected for a split second. Is there any way to create a workaround that protects the shell script from this? At the moment I have to manually abort the running test.
Thanks,
Tsvi
Unfortunately I have no capacity to work on Remoting in the medium term, so I will unassign this and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.
msavlani1 shahmishal tsvi If you update to the latest Pipeline plugins, and especially the support-core plugin, and use the suggested GC settings (https://jenkins.io/blog/2016/11/21/gc-tuning/), you should find that the InterruptedExceptions are pretty much gone. They are generally the result of timeouts in remoting-related operations. The only cases where they should still happen, I believe, are actual hardware/system/network issues.
In the last quarter of 2017 we did a big change to the way Pipeline's durable tasks interact with remoting that should avoid many of these issues.
Edit: An additional issue around support-core that caused problems was recently fixed. Specifically, support-core plugin version 2.42 added heap histogram analysis for diagnostics, but this had the unexpected side effect of introducing periodic, catastrophically long GC pauses that made the Jenkins master unresponsive for long periods and triggered timeouts (and thus the InterruptedException here when timeouts kick in).
Please see https://issues.jenkins-ci.org/browse/JENKINS-49931 for more details of that.
For now I'm going to transition this to "closed" because when working with several users showing this among other symptoms, the suggestions above successfully resolved the issues – but I'm happy to re-open this if you all still experience problems after applying the above (please reply to note the same).
Hi, I have recently started seeing the same "Cannot contact node123: java.lang.InterruptedException" error, but only during parallel stages in a pipeline job.
I have created a brand new Jenkins environment (Jenkins version 2.121.1) with all plugins updated and have set the GC settings according to the gc-tuning page from the above comment.
This issue is intermittent (about 1 in every 8 builds or so).
Support-Core version 2.48
Pipeline version 2.5
Any other advice?
Thanks,
joebarber What you describe sounds a lot like https://issues.jenkins-ci.org/browse/JENKINS-46507 but we have not had a consistent way to reproduce the issue, so it's very hard to debug. If you can provide a simple, self-contained sample Pipeline in the comments of that ticket that will reproduce the issue, that would be very helpful. Thanks!
We're experiencing the same issue when our Java agent gets killed by OOM or the machine on which the agent is running is rebooted. Is there any way to reduce the amount of time Jenkins waits before the build is marked as failed?
Please provide a full exception stack trace at least.