[JENKINS-5055] server rejected connection: already connected to master

Type: Bug
Resolution: Fixed
Priority: Critical
Component/s: remoting
Labels:
None
Environment:

I have a Hudson setup with a single master PC (running Windows XP) and 12 slave PCs (running Windows XP, a few of them Windows 2000), currently running Hudson version 1.334.

The Hudson master runs as a Windows service. The Hudson slaves run as Java clients, started through 'javaws http://master:8080/computer/slavename/slave-agent.jnlp'. I also tried to run the slaves as a service, but I never managed to get the right permissions that way.
The jobs are written as shell commands and executed on the slaves using cygwin sh.

The Hudson slave park is partitioned this way:
* 3 slave PCs contain TriMedia-based hardware boards. Jobs compile
  test suites on the PC and run the executables on those boards.
* The other 9 are used for PC jobs. Jobs compile test suites on the
  PC and run the executables on a TriMedia simulator on the PC.

I labeled the group of 9 'windows'. To run a test suite with different configurations (optimisation level, library set, etc.) I use matrix jobs on my 'windows' queue. Using matrix jobs, I ran into bug 936, so I applied the workaround of adding -Dhudson.model.Hudson.flyweightSupport=true to C:\hudson\hundson.xml on the master.



After some idle time (no jobs running, master and slaves idle), the master showed a slave as offline.

On the slave, I see an error pop-up window saying:
...
java.lang.Exception: The server rejected the connection: nlvhtcnxp1dt361 is already connected to this master. Rejecting this connection.
    at hudson.remoting.engine.Run(Engine.java:191)
...

After clicking OK in the pop-up window, the Hudson slave app terminates.
Restarting the Hudson slave app manually seems to work fine.

duplicates

JENKINS-5973 Slaves reconnecting after restarting are rejected because Hudson thinks the slave already connected

Closed

is duplicated by

JENKINS-5355 Disconnected slaves cannot reconnect again

Closed

is related to

JENKINS-28492 The server rejected the connection: *** is already connected to this master. Rejecting this connection.

Resolved

tomdevries created issue - 2009-12-09 08:26

Kohsuke Kawaguchi added a comment - 2010-01-15 16:36

The root cause of the issue appears to be that the socket connection between the slave and the master is lost in such a way that the master doesn't notice. So when the slave connects back, the master thinks it's a bogus attempt, since the slave is already connected.

Do you have a NAT/firewall between the master and the slave?

One fix could be to have the master check whether the slave is alive before rejecting the new incoming connection, but this may take tens of seconds, as it can involve packet retransmission. Another possibility might be to let the slave send some token so that the master can verify that the reconnection comes from the same slave it believes is currently connected.

Still thinking about how to fix this.
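
The token idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Hudson's actual code: the class `ReconnectRegistry`, its method names, and the assumption that each slave picks a secret token at startup and presents it on every connection attempt are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not Hudson's real classes): the master remembers the
 * token each connected slave presented, and lets a connection through when
 * the same token is presented again for an apparently connected slave.
 */
public class ReconnectRegistry {
    // slave name -> token presented when the current connection was accepted
    private final Map<String, String> tokens = new HashMap<>();

    /**
     * Returns true if the connection is accepted. A slave that is not known
     * to be connected is always accepted. A slave that appears connected is
     * accepted only if it presents the token of the existing connection,
     * in which case the stale connection is considered replaced.
     */
    public synchronized boolean tryConnect(String slaveName, String token) {
        String current = tokens.get(slaveName);
        if (current == null || current.equals(token)) {
            tokens.put(slaveName, token);
            return true;
        }
        return false; // a different party claims a name that is already in use
    }

    /** Called when the master notices that the connection is gone. */
    public synchronized void disconnect(String slaveName) {
        tokens.remove(slaveName);
    }

    public static void main(String[] args) {
        ReconnectRegistry registry = new ReconnectRegistry();
        System.out.println(registry.tryConnect("slave1", "tokenA")); // first connection
        System.out.println(registry.tryConnect("slave1", "tokenB")); // impostor or misconfigured duplicate
        System.out.println(registry.tryConnect("slave1", "tokenA")); // same slave reconnecting
    }
}
```

With this approach the master would not need to probe the old socket before deciding, which avoids the retransmission delay mentioned above; the trade-off is that the token has to be kept secret by the slave.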


vkodocha added a comment - 2010-03-17 23:16

I have exactly the same issue with our setup. We have a master node running on Mac OS X and a Windows XP slave running in VMware on the same machine. The Hudson version is 1.351, but the problem has been appearing ever since we first installed this system. It occurs at least once a day.

The network communication between the XP guest and the Mac goes through NAT.

One workaround would be an option to disable the dialog, so that I could write a small test app on the Windows slave that checks whether the slave is still running and restarts it.
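
A watchdog of the kind described here could look roughly like the sketch below. It is a minimal illustration, not an existing Hudson tool: the class `SlaveWatchdog`, the method `superviseCommand`, and the idea of passing the agent launch command on the command line are all assumptions. It only helps if the agent process actually exits on a rejected connection, which is why the error dialog would have to be disabled first.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical watchdog sketch: relaunches a command every time it exits. */
public class SlaveWatchdog {

    /**
     * Starts the command, waits for it to exit, and starts it again until
     * maxLaunches launches have been made. Returns the number of launches.
     */
    public static int superviseCommand(List<String> command, int maxLaunches, long delayMillis) {
        int launches = 0;
        try {
            while (launches < maxLaunches) {
                Process process = new ProcessBuilder(command)
                        .inheritIO() // show the agent's output in this console
                        .start();
                launches++;
                int exitCode = process.waitFor(); // blocks while the agent is running
                System.err.println("Agent exited with code " + exitCode
                        + " (launch " + launches + ")");
                if (launches < maxLaunches) {
                    Thread.sleep(delayMillis); // back off before reconnecting
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Could not start " + command, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop supervising when interrupted
        }
        return launches;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java SlaveWatchdog <agent launch command...>");
            return;
        }
        // Relaunch indefinitely, waiting 10 seconds between attempts.
        superviseCommand(Arrays.asList(args), Integer.MAX_VALUE, 10_000L);
    }
}
```

The delay between launches is there so that a slave which is rejected immediately does not reconnect in a tight loop while the master still believes the old connection is alive.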


vkodocha added a comment - 2010-03-18 01:46

These two issues seem to be the same.


vkodocha made changes - 2010-03-18 01:46

Link

New: This issue duplicates JENKINS-5973 [ JENKINS-5973 ]

tapiomtr added a comment - 2010-03-18 22:10 - edited

We have the same kind of problem.

What I noticed is that our Linux-based Hudson slave started an svn checkout, but it hung for some reason. At the same time the main Hudson reported "There are more SCM polling activities scheduled than handled, so the threads are not keeping up with the demands".
I then tried to restart that Linux-based Hudson slave, but it cannot start up because the main Hudson still thinks that the slave is connected. I have now waited about 30 minutes for the main Hudson to notice that the slave is gone; it still shows the slave as idle, although the slave is not running at all.

The only way to solve this problem is to restart the main Hudson server first.

Hudson master running on:
Redhat Linux running on VMware VM

Hudson slave running on:
Redhat Linux running on VMware VM


Alan Harder made changes - 2010-07-05 10:37

Component/s

New: master-slave [ 15489 ]

SCM/JIRA link daemon added a comment - 2011-01-24 14:03

Code changed in hudson
User: Kohsuke Kawaguchi
Path:
changelog.html
core/src/main/java/hudson/TcpSlaveAgentListener.java
core/src/main/java/hudson/slaves/SlaveComputer.java
remoting/src/main/java/hudson/remoting/Engine.java
http://hudson-labs.org/commit/core/68ed742227891a3f716e4e479388c36876bb935a
Log:
[FIXED JENKINS-5055] allow the same JNLP slave to reconnect without getting rejected.


SCM/JIRA link daemon made changes - 2011-01-24 14:03

Resolution		New: Fixed [ 1 ]
Status	Original: Open [ 1 ]	New: Resolved [ 5 ]

Kohsuke Kawaguchi made changes - 2011-01-24 14:03

Link

New: This issue is duplicated by JENKINS-5355 [ JENKINS-5355 ]

Assignee: Unassigned

Reporter: tomdevries

Votes: 9

Watchers: 21

Created: 2009-12-09 08:26

Updated: 2016-11-22 15:55

Resolved: 2011-01-24 14:03
