
Xvfb doesn't remove /tmp/.X-* locks after a build has finished

    • Key: JENKINS-23980
    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Minor
    • Component: xvfb-plugin
    • Environment: Jenkins 1.561, Xvfb plugin 1.0.10

      Once in a while a Jenkins build fails because the display Xvfb wants to create already exists:

      Xvfb starting$ Xvfb :1 -screen 0 1024x768x8 -fbdir /u10/app/gcadmin/jenkins/stable/jenkins_data/2014-07-24_11-54-567539530762307179860xvfb
      _XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
      _XSERVTransMakeAllCOTSServerListeners: server already running

      Fatal server error:
      Cannot establish any listening sockets - Make sure an X server isn't already running

      Settings are pretty much default: I use the executor number with an offset of 1 (so no other server should be running with this number). The workaround at the moment is to delete the stale lock files manually from time to time.
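
      For context, the offset-based configuration described here means the display number is derived from the executor number plus the configured offset, along these lines (a simplified sketch of the arithmetic only, not the plugin's actual source; the names are illustrative):

      // Simplified sketch: offset-based display number selection.
      // (Illustrative only, not the Xvfb plugin's actual code.)
      public class DisplayNumberSketch {
          public static void main(String[] args) {
              int executorNumber = 0; // assigned by Jenkins for the build's slot
              int offset = 1;         // the configured display name offset
              int display = executorNumber + offset;
              // The plugin then launches: Xvfb :<display> -screen 0 ...
              System.out.println("DISPLAY=:" + display);
          }
      }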

          [JENKINS-23980] Xvfb doesn't remove /tmp/.X-* locks after a build has finished

          Stefan Schultz added a comment -

          That is what I assumed at first, but as I mentioned before, when the job gets executed by executor (slot) #10 it results in display number 13, even though I only set an offset of 1. Another time I got display :7 on slot #10, so the same slot gets me different executor numbers. I saw your code and you just read the executor number, so I'm guessing this happens in the Executor code (if it is a problem at all). I'm attaching 2 more screenshots.

          Stefan Schultz added a comment -

          This is my job running on executor #10

          Stefan Schultz added a comment -

          This is the display name I get

          zregvart added a comment -

          Hi Stefan,
          so the executor numbers are assigned on demand from a pool of numbers, the logic behind that is pretty clear here:
          https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Computer.java#L715

          And indeed the number can differ from the one shown in the table, as you observed. I reproduced this by increasing the number of executors and running consecutive builds: in my case the executor number shown in the UI was 5, while the display number was based on the executor numbered 2, so the display number was 3.

          But that still generates unique display numbers, under the assumption that the range of display numbers from the minimum to the maximum executor number plus the offset is free for Xvfb to use. That is, no other X servers are using those numbers, no other slaves run on the same physical machine, and there are no leftover Xvfb processes.

          This is guaranteed by the way Jenkins adds executors to the executor list: no two concurrently live executors share a number, as can be seen in the linked source.
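
          To illustrate the kind of assignment the linked code performs (a paraphrased sketch, an assumption based on reading Computer.java, not a copy of it): each new executor takes the lowest number not currently in use, which is how a build shown in UI slot 5 can end up running on executor number 2.

          import java.util.Set;
          import java.util.TreeSet;

          // Paraphrased sketch of on-demand executor numbering
          // (an assumption based on the linked Computer.java, not a copy of it).
          class ExecutorNumberPool {
              private final Set<Integer> inUse = new TreeSet<>();

              // Hand out the lowest number not currently in use.
              synchronized int acquire() {
                  int n = 0;
                  while (inUse.contains(n)) {
                      n++;
                  }
                  inUse.add(n);
                  return n;
              }

              // Return the number to the pool when the executor goes away.
              synchronized void release(int n) {
                  inUse.remove(n);
              }
          }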

          Stefan Schultz added a comment -

          Got it! So I guess this is an Xvfb problem.

          Would it be possible to add a check whether the PID in the lock file is still valid? I'm not sure if this has any bad side effects; maybe you could add it as an opt-in? That would be great! Or is this covered by the so-called "zombie process handling" you talked about?

          Thanks,
          Stefan
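
          For reference, the kind of check suggested here might look like this (a hypothetical sketch, not an existing plugin feature; it assumes the X server's /tmp/.X<display>-lock convention, where the file contains the owning server's PID as text):

          import java.nio.file.Files;
          import java.nio.file.Path;

          // Hypothetical sketch of the suggested check, not plugin functionality:
          // read the PID from the X lock file and test whether it is still alive.
          class XLockCheck {
              static boolean lockIsStale(int displayNumber) throws Exception {
                  Path lock = Path.of("/tmp/.X" + displayNumber + "-lock");
                  if (!Files.exists(lock)) {
                      return false; // no lock file, nothing to check
                  }
                  // The lock file holds the owning server's PID as decimal text.
                  long pid = Long.parseLong(Files.readString(lock).trim());
                  // If no live process has that PID, the lock is stale.
                  return ProcessHandle.of(pid).map(h -> !h.isAlive()).orElse(true);
              }
          }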

          zregvart added a comment -

          Hi,
          no, actually the error you're getting:

          _XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
          _XSERVTransMakeAllCOTSServerListeners: server already running

          is due to another X server already using the same display number, as per the source and FAQ mentioned in comment-206473.

          An existing lock file is a pretty good signal that another X server is using the same display number; the only way I can see a lock file remaining is if the X server process crashed or was killed with SIGKILL. And as can be seen in the X server source code, even in that case the X server tries really hard to remove the stale lock file and to launch at the specified display number.

          I think that handling of the X server lock file is best left to the X server. I could add functionality to the Xvfb plugin to kill an already running X server detected via the lock file, but as you can imagine this could result in erratic build behavior: concurrent builds would terminate each other, or users' X sessions would be terminated unexpectedly.

          The zombie process handling is the automatic termination of Xvfb processes left over by a slave disconnect or a Jenkins master crash. If your slave disconnects, for example due to a network issue, the build fails and no Xvfb process termination can be performed, as the slave cannot be contacted to execute it. For these cases the Xvfb plugin keeps a list of the Xvfb processes it started; on slave reconnect it goes through the list, terminates the leftover Xvfb processes, and removes the temporary frame buffer directory used by Xvfb. This is done automatically on Jenkins master startup and on slave reconnect.

          I think that in your case you need to make sure that there is no overlap between display numbers. Overlap can be caused by using the same offset for more than one job running on the same slave, by running multiple slaves per physical machine, or by running X servers on the display numbers used by Jenkins. The easiest way to get non-overlapping display numbers is to use the 'Let Xvfb choose display name' option and have the X server pick a free number automatically.
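
          A rough sketch of the bookkeeping described above (the class and method names here are illustrative guesses at the shape of the mechanism, not the plugin's actual API):

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          // Illustrative sketch of the described zombie handling, not the
          // plugin's real classes: remember each started Xvfb process per
          // node, and on reconnect terminate whatever is left over.
          class XvfbZombieReaper {
              // node name -> PID of the Xvfb process started for it
              private final Map<String, Long> started = new ConcurrentHashMap<>();

              void remember(String node, long pid) {
                  started.put(node, pid);
              }

              // Called on Jenkins master startup and on slave reconnect.
              void reapLeftovers(String node) {
                  Long pid = started.remove(node);
                  if (pid != null) {
                      // In the real plugin the kill runs on the slave; shown
                      // locally here for brevity.
                      ProcessHandle.of(pid).ifPresent(ProcessHandle::destroy);
                      // ...then the temporary frame buffer directory is removed too.
                  }
              }
          }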

          Stefan Schultz added a comment -

          Hi,

          My server admin told me there was indeed an Xvfb process blocking port 6003. After killing it there have been no failing builds so far. So, sorry for the confusion...

          Now I have another problem:

          I'm calling a Gradle build (Xvfb starts successfully), and the Gradle build calls three executables one after another which need a display to connect to. The first two calls are successful; the third fails with:

          java.lang.InternalError: Can't connect to X11 window server using ':103' as the value of the DISPLAY variable.

          I checked the processes, and there is no longer an Xvfb instance running for this display (I use netstat -tupln | grep tcp to check for a process listening on port 6103). It looks like the instance is shut down after the second connection. Do you happen to know why Xvfb is doing this, and is there an option for the plugin to keep Xvfb alive no matter how many processes connect?

          Thanks,
          Stefan
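
          For the record, X display :N listens on TCP port 6000 + N, which is why display :103 maps to port 6103. A programmatic version of that netstat check might look like the following (a hypothetical helper, not part of the plugin; it assumes Xvfb was started with TCP listening enabled):

          import java.io.IOException;
          import java.net.InetSocketAddress;
          import java.net.Socket;

          // Hypothetical helper mirroring the netstat check: X display :N
          // listens on TCP port 6000 + N (so display :103 -> port 6103).
          class DisplayPortCheck {
              static boolean displayIsListening(int displayNumber) {
                  try (Socket s = new Socket()) {
                      s.connect(new InetSocketAddress("localhost", 6000 + displayNumber), 500);
                      return true;  // something accepted the connection
                  } catch (IOException e) {
                      return false; // nothing listening on that display's port
                  }
              }
          }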

          zregvart added a comment -

          Hi Stefan,
          the Xvfb plugin keeps Xvfb running for the duration of the build steps; there is also an option to keep it running for the post-build actions ('Shutdown Xvfb with whole job, not just with the main build action'). The started Xvfb process keeps the display available until it is terminated; it should not matter how many processes connect to it.

          I'm not sure what your build is doing, but if you can reproduce this I suggest you open another issue and attach the job console output with the Log Xvfb option turned on. A job configuration or detailed steps to reproduce would be very helpful.

          zregvart added a comment -

          Treating this as not a defect

          zregvart added a comment -

          Closing

            Assignee: zregvart
            Reporter: Stefan Schultz (pedro_cucaracha)
            Votes: 0
            Watchers: 2
