Type: Bug
Resolution: Not A Defect
Priority: Minor
Environment: Jenkins 1.561, Xvfb plugin 1.0.10
Once in a while it happens that a Jenkins build fails because the Xvfb it wants to create already exists:
Xvfb starting$ Xvfb :1 -screen 0 1024x768x8 -fbdir /u10/app/gcadmin/jenkins/stable/jenkins_data/2014-07-24_11-54-567539530762307179860xvfb
_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
Fatal server error:
Cannot establish any listening sockets - Make sure an X server isn't already running
Settings are pretty much default: I use the executor number with an offset of 1 (so no other job on this server can be running with this display number). The workaround at the moment is to delete the lock files manually from time to time.
[JENKINS-23980] Xvfb doesn't remove /tmp/.X-* locks after a build has finished
From looking at the X11 source code[1] and the FAQ[2], it seems that there must be an existing X server listening on :1.
[1] http://cgit.freedesktop.org/xorg/xserver/tree/os/utils.c#n260 and http://cgit.freedesktop.org/xorg/xserver/tree/os/connection.c#n385
[2] http://www.x.org/wiki/FAQErrorMessages/#index6h2
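For orientation, display :n corresponds to a fixed set of resources that these messages refer to. A small illustrative sketch of the conventional mapping (the class and method names here are hypothetical; the conventions themselves come from the X11 sources above):

    // Sketch: the resources an X server claims for display :n.
    // Names are illustrative, not taken from the Xvfb plugin.
    public final class X11Display {

        /** TCP listeners for display :n live on port 6000 + n. */
        static int tcpPort(int displayNumber) {
            return 6000 + displayNumber;
        }

        /** Unix domain socket for display :n. */
        static String unixSocketPath(int displayNumber) {
            return "/tmp/.X11-unix/X" + displayNumber;
        }

        /** Lock file holding the PID of the server owning display :n. */
        static String lockFilePath(int displayNumber) {
            return "/tmp/.X" + displayNumber + "-lock";
        }
    }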
Thanks for your response and the references!
I configured the Jenkins jobs which use the plugin to use the executor number + offset 1 as the display variable (this is the default anyway). So it can't be that other jobs are using the same display number, since the executor number is unique and two jobs can't use the same executor. We are using Xvfb only for this purpose as well (I checked, and Xvfb is deactivated in existing jobs by default after the installation). I checked the .X*-lock files and the PID inside is no longer valid/used.
Could it be that the lock is being deleted too early (process still running)? What happens if the display is started but Jenkins fails? Is there another configuration that matters? ('Shutdown Xvfb with whole job, not just with the main build action' is not set for my jobs as there is only one Gradle call inside.)
Thank you for your help and your work here, it is very much appreciated!
The setup you have should generate unique display numbers. The only way I can see you getting this error is if Xvfb was not terminated when the job finished, i.e. if Jenkins itself crashed, so the display number would still be occupied by the 'zombie' Xvfb process started by the Xvfb plugin that did not terminate due to the crash.
For that case, since version 1.0.9 (JENKINS-20758) and further in version 1.0.10 (https://github.com/jenkinsci/xvfb-plugin/commit/cd3b4b0280c2754782a2d23248665830192441fb), the Xvfb plugin tries to shut down previously started but not terminated Xvfb processes when the master starts or a slave reconnects.
The management of the lock files is done by the Xvfb process itself; I doubt that a running process would remove its own lock file and keep running.
The 'Shutdown Xvfb with whole job, not just with the main build action' option lets you keep Xvfb running until the job completely finishes, i.e. if you need it running in post-build actions, not just for the main build actions, so it is probably not very helpful in your case.
To quickly fix your problem you could use the 'Let Xvfb choose display name' option; when it is used, Xvfb picks its own display name by looking for an unused port.
But if you want to troubleshoot this further, you need to check whether there are dangling Xvfb processes left when the job finishes and try to narrow down the circumstances under which Xvfb fails to start with the "SocketCreateListener() failed" message.
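As a rough illustration of what 'looking for an unused port' amounts to, here is a minimal sketch that probes TCP port 6000 + n until a bind succeeds (hypothetical code, not the plugin's actual implementation, which may work differently):

    import java.io.IOException;
    import java.net.InetAddress;
    import java.net.ServerSocket;

    public final class FreeDisplayFinder {

        // Probe display numbers from..to and return the first whose X11
        // TCP port (6000 + n) can be bound, i.e. is not in use.
        static int findFreeDisplay(int from, int to) throws IOException {
            for (int n = from; n <= to; n++) {
                try (ServerSocket probe = new ServerSocket(6000 + n, 1,
                        InetAddress.getLoopbackAddress())) {
                    return n; // bind succeeded, so nothing is listening here
                } catch (IOException inUse) {
                    // port taken, try the next display number
                }
            }
            throw new IOException("no free display between :" + from + " and :" + to);
        }
    }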
Hi,
I just found out that the plugin's default does not use EXECUTOR_NUMBER as the display number (as the tooltip suggests: "Offset for display names, default is 1. Display names are taken from build executor's number, i.e. if the build is performed by executor 4, and offset is 100, display name will be 104."). The default is random (my bad, the tooltip confused me a bit).
I tried to set the specific display name to ${EXECUTOR_NUMBER} instead, which is not a number (yet). So it can happen that two jobs get the same display number, which explains the occasional errors. How can I use the executor number instead of a random one?
Thanks,
Stefan
Stefan,
if you specify 'Xvfb specific display number', that display number will always be used. If you specify 'Xvfb display name offset', that number will be added to the executor number and the result will be used as the display name.
There is no support for variable use in 'Xvfb specific display number'; the value placed there must be a number.
The display number is never randomly generated, it can be:
1. always the same (if you specify 'Xvfb specific display number')
2. some offset from executor number (with the offset specified in 'Xvfb display name offset')
3. or, chosen by Xvfb (if you check 'Let Xvfb choose display name')
You would choose the first option if you need to use the same display number for every job, whereas the second and third options guarantee the uniqueness of the display number.
Do note that if you have more than one slave per physical machine you need to account for potential overlap when using the second option: there will be an overlap if, for instance, two jobs are run by the first executor on slave A and the first executor on slave B, and slaves A and B run on the same machine. This cannot be avoided, as no data is shared between slaves. If this is your situation you can separate the jobs with 'Restrict where this project can be run' and 'Xvfb display name offset' (in the previous example you would tie job1 to slave A with the offset set to 100, and job2 to slave B with the offset set to 200), use the third option, or limit the number of slaves per physical machine to one.
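The three options can also be summarized in code form. A simplified sketch with hypothetical names, not the plugin's actual implementation:

    // Sketch of how a display number could be resolved from the three
    // options described above. Names are hypothetical.
    public final class DisplayNameResolver {

        // Returns the display number to pass to Xvfb, or null when Xvfb
        // should pick an unused display itself (option 3).
        static Integer resolve(Integer specificDisplayNumber, int executorNumber,
                               int offset, boolean letXvfbChoose) {
            if (letXvfbChoose) {
                return null;                  // option 3: Xvfb chooses
            }
            if (specificDisplayNumber != null) {
                return specificDisplayNumber; // option 1: always the same number
            }
            return executorNumber + offset;   // option 2: executor number plus offset
        }
    }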
Ok, I will check the configuration again.
Is it correct that EXECUTOR_NUMBER != the number shown in the Build Executor Status table? I got -DDISPLAY:4 for executors #8 and #9 (not at the same time though, I ran them one after the other). Maybe that's why I thought it was random...
Thanks a lot!
Stefan,
executor numbers are the numbers that appear to the left of the job name; in the example screenshot that would be the number 1. That executor is running a job called test that was built three times before, hence it got build number 4. The build number has no effect on the Xvfb display number.
That is what I assumed at first, but as I mentioned before, when the job gets executed by executor (slot) #10 it results in display number 13, although I only set an offset of 1. Another time I got display :7 on slot #10. So the same slot gets me different executor numbers. I saw your code and you just read the executor number, so I'm guessing this happens in the Executor code (if it is a problem at all). I'm attaching 2 more screenshots.
Hi Stefan,
so the executor numbers are assigned on demand from a pool of numbers; the logic behind that is pretty clear here:
https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Computer.java#L715
And from that the number can indeed differ from the one shown in the table, as you have stipulated. I have reproduced this by increasing the number of executors and running consecutive builds: in my case the executor number shown in the UI was 5, while the display number was based on the executor numbered 2, so the display number was 3.
But that still generates unique display numbers, under the assumption that the whole range of display numbers, from the minimum to the maximum executor number plus the offset, is free for Xvfb to use. That is, no other X servers are using those numbers, no other slaves run on that physical machine, and there are no leftover Xvfb processes.
This is guaranteed by the way Jenkins adds executors to the executor list: no executor numbers are reused, as can be seen in the linked source.
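To make the on-demand assignment concrete, here is a simplified sketch of the idea (illustrative only; the authoritative logic is in the linked Computer.java):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: executors draw their numbers on demand, taking the lowest
    // number not currently in use. Two live executors therefore never
    // share a number, but the number need not match the row shown in the
    // Build Executor Status table.
    public final class ExecutorNumberPool {

        private final Set<Integer> inUse = new HashSet<>();

        /** Assign the lowest number that no live executor currently holds. */
        synchronized int acquire() {
            int n = 0;
            while (inUse.contains(n)) {
                n++;
            }
            inUse.add(n);
            return n;
        }

        /** Called when an executor goes away. */
        synchronized void release(int number) {
            inUse.remove(number);
        }
    }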
Got it! So I guess this is an Xvfb problem.
Would it be possible to add a check whether the PID in the lock file is still valid? I'm not sure if this has any bad side effects; maybe you could add an opt-in for that? That would be great! Or is this covered by the so-called 'zombie process handling' you talked about?
Thanks,
Stefan
Hi,
no, actually the error you're getting:
_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
is due to another X server already using the same display number, as per the source and FAQ mentioned in comment-206473.
An existing lock file is a pretty good signal that another X server is using the same display number; the only way I can see a lock file remaining is if the X server process crashed or was killed with SIGKILL. And as can be seen in the X server source code, in that case the X server tries really hard to remove the stale lock file and to launch at the specified display number.
I think that the handling of the X server lock file is best left to the X server. I could add functionality to the Xvfb plugin to kill an already running X server detected via the lock file, but as you can imagine this would result in erratic build behavior: concurrent builds would terminate each other, or users' X sessions would unexpectedly terminate.
The zombie process handling is the automatic termination of Xvfb processes left behind by a slave disconnect or a Jenkins master crash. If your slave disconnects, for example due to network issues, the build fails and no Xvfb process termination can be performed, as the slave cannot be contacted to execute it. For these cases the Xvfb plugin keeps a list of the Xvfb processes that it started; on slave reconnect it goes through the list, terminates the leftover Xvfb processes, and removes the temporary frame buffer directory used by each Xvfb. This is done automatically on Jenkins master startup and on slave reconnect.
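In outline, that cleanup could look like the following sketch (hypothetical class and method names, not the plugin's real ones):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Comparator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Stream;

    // Sketch of the cleanup idea: remember every Xvfb process the plugin
    // starts, and on master startup or slave reconnect terminate any that
    // are still alive and delete their frame buffer directories.
    public final class XvfbZombieReaper {

        // Xvfb processes started by the plugin, keyed by frame buffer directory.
        private final Map<Path, Process> started = new ConcurrentHashMap<>();

        void register(Path frameBufferDir, Process xvfb) {
            started.put(frameBufferDir, xvfb);
        }

        // Invoked on master startup or slave reconnect.
        void reapLeftovers() throws IOException {
            for (Map.Entry<Path, Process> entry : started.entrySet()) {
                if (entry.getValue().isAlive()) {
                    entry.getValue().destroy(); // terminate the leftover Xvfb
                }
                deleteRecursively(entry.getKey()); // remove the temp frame buffer dir
                started.remove(entry.getKey());
            }
        }

        private static void deleteRecursively(Path dir) throws IOException {
            if (!Files.exists(dir)) {
                return;
            }
            try (Stream<Path> paths = Files.walk(dir)) {
                paths.sorted(Comparator.reverseOrder())
                     .forEach(p -> p.toFile().delete());
            }
        }
    }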
I think that in your case you need to make sure that there is no overlap between display numbers, which could be caused by using the same offset for more than one job running on the same slave, by using multiple slaves per physical machine, or by running X servers on the display numbers used by Jenkins. The easiest way to get non-overlapping display numbers is to use the 'Let Xvfb choose display name' option and have the X server do that automatically.
Hi,
My server admin told me there was indeed an Xvfb process blocking port 6003. After killing it there have been no failing builds yet. So, sorry for the confusion...
Now I have another problem:
I'm calling a Gradle build (the Xvfb is started successfully) and the Gradle build calls three executables one after another which need a display to connect to. The first two calls are successful; the third fails with:
java.lang.InternalError: Can't connect to X11 window server using ':103' as the value of the DISPLAY variable.
I checked the processes and there is no longer an Xvfb instance running for this display (I use netstat -tupln | grep tcp to check for a process listening on port 6103). It looks like the instance is shut down after the second connection. Do you happen to know why Xvfb is doing this, and is there an option for the plugin to keep Xvfb alive no matter how many processes connect?
Thanks,
Stefan
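As an aside, the netstat check can also be expressed as a small connectivity probe. A sketch assuming the X server listens on TCP (it will not if started with -nolisten tcp):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Sketch: check whether something is listening for display :n by
    // trying a TCP connection to port 6000 + n on localhost, similar in
    // spirit to the netstat check above.
    public final class DisplayProbe {

        static boolean isDisplayUp(int displayNumber) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress("127.0.0.1", 6000 + displayNumber), 500);
                return true;  // something accepted the connection
            } catch (IOException notListening) {
                return false; // nothing listening on that display's port
            }
        }
    }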
Hi Stefan,
the Xvfb plugin keeps Xvfb running for the duration of the build steps; there is an option to keep it running for the post-build actions as well ('Shutdown Xvfb with whole job, not just with the main build action'). The started Xvfb process makes the display available until it is terminated; it should not matter how many processes connect to it.
Not sure what your build is doing, but if you can reproduce this I suggest you open another issue and attach the job console output with the Log Xvfb option turned on. A job configuration, or detailed steps to reproduce this, would be very beneficial.
Hi Stefan,
thanks for reporting. Are you absolutely sure that no other instance of some other X server is running (on display :1 in your example)?
If you are, I could check whether the process with the PID in /tmp/.X#-lock is running and, if not, remove the lock file before starting Xvfb.
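Such a check could, for example, read the PID from the lock file and look it up under /proc. A minimal, Linux-specific sketch (a hypothetical helper, not committed plugin behavior):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch: remove /tmp/.X<n>-lock if the PID recorded inside no
    // longer refers to a live process. Relies on /proc, so Linux only.
    public final class StaleLockCleaner {

        static void removeIfStale(int displayNumber) throws IOException {
            Path lock = Paths.get("/tmp/.X" + displayNumber + "-lock");
            if (!Files.exists(lock)) {
                return; // no lock, nothing to clean
            }
            long pid = Long.parseLong(Files.readAllLines(lock).get(0).trim());
            if (!Files.exists(Paths.get("/proc/" + pid))) {
                Files.delete(lock); // stale lock: the owning X server is gone
            }
        }
    }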