Jenkins / JENKINS-10719

memory leak with remotely-run command output


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component: accurev-plugin
    • Labels: None
    • Environment: Distributed build - Jenkins on one machine and multiple slaves on other machines, all running Windows (Jenkins on XP; slaves on Windows 7, Windows Server 2008 R2, and XP).

    Description

      My Jenkins main server process keeps throwing OutOfMemoryError (Java heap space) and falling over.
      I've given the Java process 1.5 GB of RAM (the XP machine has 3.5 GB), and it still happens - uptime has dropped to hours (sometimes minutes) rather than weeks.

      Looking at a heap dump, it's packed full of 8 MB byte[] arrays, each containing the output of the "accurev show ... streams" command (which, on my AccuRev server, returns 22300 elements - mostly snapshots, but 8 MB worth).

      It would appear that when the plugin runs the "accurev show streams" command, the output is retained somewhere even after the plugin has finished with it, so it never becomes eligible for garbage collection.
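The retention pattern described above can be sketched in miniature. This is a hypothetical illustration of the suspected mechanism, not the actual plugin or Jenkins remoting code - the names (`perConnectionBuffers`, `runRemoteCommand`) are invented for the example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the suspected leak: each remote command's full stdout is held
// in a byte[] that stays reachable from connection-scoped state, so the GC
// cannot reclaim it even after the caller has "forgotten" the output.
public class OutputLeakSketch {
    // stands in for per-slave-connection state that lives as long as the channel
    static final Map<String, List<byte[]>> perConnectionBuffers = new HashMap<>();

    static byte[] runRemoteCommand(String connection, int outputSize) {
        byte[] output = new byte[outputSize]; // e.g. ~8 MB of "accurev show streams" output
        perConnectionBuffers
            .computeIfAbsent(connection, k -> new ArrayList<>())
            .add(output); // the hidden extra reference that defeats garbage collection
        return output;
    }

    static void closeConnection(String connection) {
        perConnectionBuffers.remove(connection); // only now do the buffers become collectible
    }

    public static void main(String[] args) {
        for (int poll = 0; poll < 5; poll++) {
            runRemoteCommand("slave-1", 8 * 1024 * 1024);
        }
        // all five 8 MB buffers are still strongly reachable
        System.out.println(perConnectionBuffers.get("slave-1").size()); // prints 5
        closeConnection("slave-1");
        System.out.println(perConnectionBuffers.containsKey("slave-1")); // prints false
    }
}
```

In this model, every poll adds another buffer, and nothing is freed until the connection itself goes away - consistent with the heap-dump observations in the comments below.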

    Attachments

    Issue Links

    Activity

            pjdarton pjdarton added a comment -

            Re-assigning to "automatic" - I don't think that this is an accurev-plugin bug - I think it's a jenkins-core bug that's exploited by the accurev-plugin.

            robsimon robsimon added a comment -

            We have seen the same issue. Since then we run our slaves in "online on demand - offline when idle" mode, and we run a job every 10 minutes which triggers a GC via the Monitoring plugin. This drops those tables.

            This 'solved' the issue in our case, and we can now run our Jenkins for months with the following startup settings: "-XX:MaxPermSize=512m -Xmx4096m".
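For reference, those settings correspond to a launch command along these lines (the war path and port are hypothetical; note that -XX:MaxPermSize only exists on Java 7 and earlier - Java 8+ replaced the permanent generation with Metaspace):

```shell
# Hypothetical launch command illustrating the quoted heap settings.
# -XX:MaxPermSize is a Java 7-era flag; on Java 8+ use -XX:MaxMetaspaceSize instead.
java -XX:MaxPermSize=512m -Xmx4096m -jar jenkins.war --httpPort=8080
```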

            pjdarton pjdarton added a comment -

            In my case, I was running with -Xmx1500m, but after a period of normal running (during which memory leaked), CPU was hitting 100% with the GC running all the time, and eventually threads were being killed because the GC was running constantly without much success.
            In the java-heap post-mortem there were loads of 8 MB byte buffers - about a gigabyte's worth. If the garbage collector had been able to collect them, I'd have expected it to reduce this down to one or two at most (at the very most, one per project, if all projects happened to be polling when the heap dump was taken).

            My guess is that these buffers would only have been disposed of once the slave connection had died, which is why your setup didn't suffer from this issue (you take your slaves offline when they're not needed). On my setup, I keep the slaves online all the time (except when they reboot), so the buffers build up over time until they occupy enough memory to flatline the GC.

            As mentioned above (22/Aug/11 10:49am), I've found that polling on the master (only) cures the problem (as well as preventing spurious builds caused by slaves being offline), which is why I think this is a Jenkins-core problem and not an accurev-plugin bug (the plugin isn't following the best practice of processing the command output on the slave, but that should only cost efficiency, not cause a memory leak).
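The "best practice" referred to here - processing the command output on the slave so that only a small digested result crosses the remoting channel - can be sketched as follows. This is a simplified, self-contained illustration, not the real hudson.remoting API; `RemoteTask`, `executeOnSlave`, and `runShowStreams` are invented stand-ins:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of processing command output remotely: parse on the slave and
// return only the small result, instead of shipping a full ~8 MB byte[]
// of raw output back to the master.
public class PollOnSlaveSketch {
    // simplified stand-in for a remoting callable executed on the slave
    interface RemoteTask<T> { T call(); }

    // pretend this ran "accurev show streams" on the slave and captured its output
    static String runShowStreams() {
        return "stream1\nstream2\nstream3"; // placeholder for the real, much larger output
    }

    static <T> T executeOnSlave(RemoteTask<T> task) {
        return task.call(); // real remoting would serialize the task and its result over the channel
    }

    public static void main(String[] args) {
        // Only the small parsed list crosses the channel, not the raw output.
        List<String> streams = executeOnSlave(() ->
            Arrays.asList(runShowStreams().split("\n")));
        System.out.println(streams.size()); // prints 3
    }
}
```

The point of the pattern: the large raw buffer lives and dies on the slave, so even if something on the channel holds a reference to transferred data, only the small result is at risk of being retained.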

            robsimon robsimon added a comment -

            It appears to me that the "show streams" output is tied to the slave connection, rather than to the slave or to the master. That would explain why the GC won't collect the data even when it's obsolete: when a slave connection is terminated, even for just one second, the GC is able to remove those items. However, due to polling issues against offline slave workspaces, we, like pjdarton, changed our setup to poll on the master.

            But the newly introduced "show one stream" feature has two NPEs (JENKINS-10906, JENKINS-10937) in our environment. We switched this feature off even though we'd have loved to use it.


            casz Joseph Petersen (old) added a comment -

            Sometime after upgrading to Jenkins 2, this memory leak issue seems to be gone.

            People

              Assignee: jetersen Joseph Petersen
              Reporter: pjdarton pjdarton
              Votes: 1
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: