
Xvfb doesn't remove /tmp/.X-* locks after a build has finished

    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Minor
    • Component: xvfb-plugin
    • Environment: Jenkins 1.561, Xvfb plugin 1.0.10

      Once in a while a Jenkins build fails because the Xvfb display it wants to create already exists:

      Xvfb starting$ Xvfb :1 -screen 0 1024x768x8 -fbdir /u10/app/gcadmin/jenkins/stable/jenkins_data/2014-07-24_11-54-567539530762307179860xvfb
      _XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
      _XSERVTransMakeAllCOTSServerListeners: server already running

      Fatal server error:
      Cannot establish any listening sockets - Make sure an X server isn't already running

      Settings are pretty much default: I use the executor number with an offset of 1 (so it should not be possible for another build on this server to run with the same display number). The workaround at the moment is to delete the lock files manually from time to time.

          [JENKINS-23980] Xvfb doesn't remove /tmp/.X-* locks after a build has finished

          Stefan Schultz created issue -

          zregvart added a comment -

          Hi Stefan,
          thanks for reporting. Are you absolutely sure that no other instance of some other X server is running (on display :1 in your example)?
          If you are, I could check whether the process with the PID in /tmp/.X#-lock is running and, if not, remove the lock file before starting Xvfb.
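
          The stale-lock check described here can be sketched in shell. This is a hypothetical diagnostic, not the plugin's code; it assumes the usual X lock file format, where /tmp/.X<display>-lock holds the server's PID, space-padded.

```shell
#!/bin/sh
# Sketch of the stale-lock check described above (illustrative, not the plugin's code).
# An X lock file /tmp/.X<display>-lock holds the server's PID, space-padded.
remove_stale_lock() {
    lock="$1"
    [ -f "$lock" ] || return 0          # no lock file, nothing to do
    pid=$(tr -d ' ' < "$lock")          # strip the padding to get the bare PID
    if kill -0 "$pid" 2>/dev/null; then # signal 0 only tests process existence
        echo "X server (PID $pid) is still running; keeping $lock"
    else
        echo "PID $pid no longer exists; removing stale $lock"
        rm -f "$lock"
    fi
}

remove_stale_lock "/tmp/.X1-lock"       # display :1, as in the build log above
```

          A lock whose PID still belongs to a live process is left alone, which matches the concern below about not removing a running server's lock.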


          zregvart added a comment -

          From looking at the X11 source code [1] and the FAQ [2], it seems that there must be an existing X server listening on :1.

          [1] http://cgit.freedesktop.org/xorg/xserver/tree/os/utils.c#n260 and http://cgit.freedesktop.org/xorg/xserver/tree/os/connection.c#n385
          [2] http://www.x.org/wiki/FAQErrorMessages/#index6h2


          Stefan Schultz added a comment -

          Thanks for your response and the references!

          I configured the Jenkins jobs that use the plugin to take the executor number plus an offset of 1 as the display variable (this is the default anyway). So it can't be that other jobs use the same display number, since the executor number is unique and two jobs can't use the same executor. We use Xvfb only for this purpose as well (I checked, and Xvfb is deactivated by default in existing jobs after the installation). I checked the .X*-lock files, and the PID inside is no longer valid/in use.

          Could it be that the lock is deleted too early (while the process is still running)? What happens if the display is started but the Jenkins build fails? Is there another configuration that matters? ("Shutdown Xvfb with whole job, not just with the main build action" is not set for my jobs, as there is only one Gradle call inside.)

          Thank you for your help and your work here, it is very much appreciated!


          zregvart added a comment -

          The setup you have should generate unique display numbers; the only way I can see you getting this error is if Xvfb was not terminated when the job finished, i.e. if Jenkins itself crashed. The display number would then still be occupied by the 'zombie' Xvfb process started by the Xvfb plugin that did not terminate because of the Jenkins crash.

          In that case, since version 1.0.9 (JENKINS-20758) and further in version 1.0.10 (https://github.com/jenkinsci/xvfb-plugin/commit/cd3b4b0280c2754782a2d23248665830192441fb), the Xvfb plugin supports trying to shut down previously started but not terminated Xvfb processes when the master starts or a slave reconnects.

          The management of the lock files is done by the Xvfb process itself; I doubt that a running process would remove its own lock file and keep running.

          The 'Shutdown Xvfb with whole job, not just with the main build action' option lets you keep Xvfb running until the job completely finishes, i.e. if you need it running in post-build actions, not just for the main build actions, so it probably is not very helpful in your case.

          To quickly fix your problem you could use the 'Let Xvfb choose display name' option; when it is used, Xvfb picks its own display name by looking for an unused port.

          But if you want to troubleshoot this further, you need to check whether there are dangling Xvfb processes left when the job finishes, and try to narrow down the circumstances under which Xvfb fails to start with the "SocketCreateListener() failed" message.
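
          A couple of quick commands for that check. These are assumed diagnostics, not plugin functionality; adjust the path if your X locks live somewhere other than /tmp.

```shell
# Look for dangling Xvfb processes and leftover lock files after a job finishes
# (assumed diagnostics; the bracket in the grep pattern keeps grep itself out of the match).
ps -ef | grep '[X]vfb' || echo "no Xvfb processes running"
ls /tmp/.X*-lock 2>/dev/null || echo "no X lock files in /tmp"
```

          If the first command shows an Xvfb process whose display matches the lock file from the second, the job's Xvfb was never terminated; if only the lock file remains, the process died without cleaning up.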


          Stefan Schultz added a comment -

          Hi,

          I just found out that the plugin does not use EXECUTOR_NUMBER as the default display number, as the tooltip suggests ("Offset for display names, default is 1. Display names are taken from build executor's number, i.e. if the build is performed by executor 4, and offset is 100, display name will be 104."). The default appears to be random (my bad, but the tooltip confused me a bit).

          I tried to set the specific display name to ${EXECUTOR_NUMBER} instead, which is not a number (at that point). So it can happen that two jobs get the same display number, which would explain the occasional errors. How can I get it to use the executor number instead of a random one?

          Thanks,
          Stefan


          Stefan Schultz added a comment - edited

          Here is a screenshot of what I mean. "Keine Zahl" means "not a number" (NaN).

          Stefan Schultz made changes -
          Attachment New: configuration_jenkins_xvfb.jpg [ 26446 ]

          zregvart added a comment -

          Stefan,
          if you specify 'Xvfb specific display number', that display number will always be used. If you specify 'Xvfb display name offset', that number will be added to the executor number and the result will be used as the display name.

          There is no support for variables in 'Xvfb specific display number'; the value placed there must be a number.

          The display number is never randomly generated, it can be:
          1. always the same (if you specify 'Xvfb specific display number')
          2. some offset from executor number (with the offset specified in 'Xvfb display name offset')
          3. or, chosen by Xvfb (if you check 'Let Xvfb choose display name')

          You would choose the first option if you need to use the same display number for every job, whereas the second and third options guarantee the uniqueness of the display number.

          Do note that if you have more than one slave per physical machine, you need to account for potential overlap when using the second option: for instance, two jobs run by the first executor on slave A and the first executor on slave B will collide if slaves A and B run on the same machine. This cannot be avoided, as no data is shared between slaves. If this is your situation you can: separate the jobs with 'Restrict where this project can be run' and 'Xvfb display name offset' (in the previous example you would tie job1 to slave A with the offset set to 100, and job2 to slave B with the offset set to 200); use the third option; or limit the number of slaves per physical machine to one.
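
          The arithmetic behind the offset option, and the per-slave offsets suggested above, can be sketched as follows. This is illustrative only; see the XvfbBuildWrapper.java source linked in this thread for the real implementation.

```shell
#!/bin/sh
# Offset option: display number = executor number + offset (per the tooltip quoted
# earlier in this thread; illustrative, not the plugin's code).
display_number() {              # $1 = executor number, $2 = offset
    echo $(( $1 + $2 ))
}

display_number 4 100            # executor 4, offset 100 -> prints 104

# Overlap on a shared machine: executor 1 on slave A and executor 1 on slave B
# would both get 101 with a shared offset of 100. Distinct per-job offsets avoid it:
display_number 1 100            # job1, tied to slave A -> prints 101
display_number 1 200            # job2, tied to slave B -> prints 201
```

          With offsets 100 and 200 apart, the two slaves' executor ranges cannot collide as long as neither slave has more than 99 executors.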

          see https://github.com/jenkinsci/xvfb-plugin/blob/master/src/main/java/org/jenkinsci/plugins/xvfb/XvfbBuildWrapper.java#L467


          Stefan Schultz added a comment -

          Ok, I will check the configuration again.

          Is it correct that EXECUTOR_NUMBER != the number shown in the Build Executor Status table? I got -DDISPLAY:4 for executor #8 and #9 (not at the same time; I ran them one after the other). Maybe that's why I thought it was random...

          Thanks a lot!


            Assignee: zregvart
            Reporter: Stefan Schultz (pedro_cucaracha)