• Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component/s: core, maven-plugin
    • Labels: None
    • Environment: core 1.564-SNAPSHOT, remoting 2.41

      On a number of the slaves at builds.apache.org, both Linux and Windows, we're seeing slaves hang after a while. The common thread seems to be Maven jobs being run on them and eventually hanging, causing everything else on the slave to hang (including, in some cases, attempts to get the thread dump from within Jenkins). The original Maven build hangs indefinitely, and any subsequent builds trying to run on the same slave get to the point of starting the git clone/svn checkout/etc. and then just hang. The Linux slaves are running Java 1.8.0_05, and the Windows slaves are running some Java 7 version - not sure which.

      Threaddump for Linux is at https://gist.github.com/abayer/3d567b56776e1ce78ad7 (one job hanging for over a day, another that started an hour or so ago but is now hanging), threaddump for Windows is at https://gist.github.com/abayer/c99f72ca1232e4d8acfa (only one job running at all on there, hanging for 17 hours or so).

          [JENKINS-23098] Slaves hanging with Maven jobs

          Andrew Bayer added a comment -

          kohsuke, jglick - any ideas? I don't know where to start. For what it's worth, the forked off Maven process for the hung job is still running in these cases, but not doing anything...


          Jesse Glick added a comment -

          Do not see any clues there. The master thread dump might also be relevant. Better to install the Support Core plugin and attach a diagnostic bundle that would have everything.


          Jesse Glick added a comment -

          (And consider using freestyle projects, which are much less trouble-prone.)


          Andrew Bayer added a comment -

          Yeah, I'd love to get off the Maven projects, but, well, there's 600 or so of them (out of 1150 or so jobs) and they're pretty well entrenched. If we can't resolve this, I'll try to start the ball rolling on a complete rebuild of the Apache Jenkins setup with the Maven plugin explicitly removed, but that'll be a giant pain in the ass given the fact that we're talking about a massive number of separate ASF projects each with their own teams, etc, etc...yeah.

          Installing Support Core now, and full thread dump up at https://gist.github.com/abayer/7ff4de807c6373eec40d.

          Might be worth mentioning that we see absolutely no hangs like this on the hadoopX slaves, which only run freestyle jobs, so far as I can tell, so it definitely looks like a problem in the Maven plugin...


          Andrew Bayer added a comment -

          ...and fwiw, in the new version of my Jenkins best practices talk, I harp quite a bit on how you should never use the Maven plugin because it's a morass of pain. =)


          Andrew Bayer added a comment -

          And also fwiw, the support core plugin doesn't actually seem to give me a real bundle. I'm guessing because the whole master is so borked. =)


          Jesse Glick added a comment -

          Handling GET /job/Mahout-Quality/ws/trunk/examples/target/site/apidocs/index.html sounds bad. Is someone seriously trying to load a generated site from the workspace? Avoid (remote) workspace browsing whenever possible.


          Jesse Glick added a comment -

          And Handling GET /job/river-qa-refactor-j9/ws/trunk/qa/result/*zip*/result.zip is even worse. Teach people to archive artifacts, then start disabling workspace browse permission. You are getting DoS’d I think.


          Andrew Bayer added a comment -

          Yeah, quite aware of that from another JIRA I opened. I've turned off anonymous workspace read access and am trying to get people to stop linking to workspaces in general, but again, at ASF it's hard to get everyone to even notice the emails I send them about what they should stop doing, let alone actually stop doing it. Fun!


          Andrew Bayer added a comment -

          Just as an experiment, I'm disabling workspace read for everyone but admins, so we'll see how that goes.


          Andrew Bayer added a comment -

          Ok, got the support bundle to generate properly using the CLI. I'm going to give it a day or so post-restart with workspace read off, see if we have hangs, and if so, get a bundle here.


          Andrew Bayer added a comment -

          So we've downgraded from 1.564-SNAPSHOT to 1.554.1 and that seems to have solved the problem - makes me guess that the problem is somewhere in the remoting changes between 1.554 and 1.564.


          Jesse Glick added a comment -

          Did you pick up the JENKINS-22734 fix in 1.563? Running a snapshot build is not wise unless you are really prepared to review ongoing commits.


          Andrew Bayer added a comment -

          Don't think we had - I want to get us off SNAPSHOTs, period, so yeah. That said, the symptoms described in that JIRA don't seem to match the ones we were seeing - the slaves were still "connected", just hung.


          Andrew Bayer added a comment -

          Got another hang now on 1.554.1 - the Maven interceptor running on the slave is hung eating 99% of CPU for hours. Its thread dump is at https://gist.github.com/abayer/bc554112335fe229ddfe.


          Jesse Glick added a comment -

          That thread dump looks idle to me. Not sure what you are hitting.


          Andrew Bayer added a comment -

          Very weird. It was idling at 99% CPU for 3 hours after the log said Maven was done, so...weird.


          Tony Bridges added a comment -

          This looks very similar to what I am seeing on Windows master/slave running 1.554.3 with maven plugin 2.4. I'm also seeing a particular maven job (not all) consistently hanging up after metadata collection.


          Tony Bridges added a comment -

          That latter hang, by the way, is not present with the maven plugin 2.1 after a downgrade. That might be a useful data point.


          Wilm Schomburg added a comment -

          We had the same issue with the maven plugin 2.3 and different Jenkins versions (1.554.2, 1.554.1 and older non-LTS versions). We had to downgrade to 2.1 to solve the issue and get our Jenkins stable again.

          Jesse Glick added a comment -

          @tbridges @wilm if you can reproduce the problem easily in newer plugin versions but not older, we really need you to git bisect until you find the plugin commit introducing the problem, since I at least have no other leads.


          Jesse Glick added a comment -

          Looks like the fix of JENKINS-22354, in 2.2, may have introduced this bug.


          Kohsuke Kawaguchi added a comment -

          The thread dump from abayer shows that something weird is happening with SplittableBuildListener.

          Below is my analysis of the issue from one of our customers (ZD-19531), which turns out to be the same problem:

          3 threads appear to be blocked on SplittableBuildListener.synchronizeOnMark of the same object, which is odd, as the execution of this is supposed to be sequential.

          • Computer.threadPoolForRemoting [#1099] is waiting to enter SplittableBuildListener.synchronizeOnMark.
          • Computer.threadPoolForRemoting [#1108] is inside synchronizeOnMark and on markCountLock.wait.
          • Computer.threadPoolForRemoting [#1113] has found the mark and is trying to report that, but is blocked trying to get in.
          • Computer.threadPoolForRemoting [#1104] is inside synchronizeOnMark waiting for Future.get().

          I think there's incorrect use of synchronization here. When wait() happens, the lock is released, which allows another thread to enter synchronizeOnMark. We need to use another lock to ensure synchronizeOnMark is not concurrently invoked.
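The point about wait() releasing the lock can be reproduced outside Jenkins. The following is a minimal, self-contained sketch and not the plugin's code: the class name, fields and helper method are hypothetical and exist only for illustration. It shows that several threads can be inside the same synchronized block at once when each of them calls wait() on the monitor that guards the block.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of Jenkins or the Maven plugin.
public class WaitReleasesMonitorDemo {
    private final Object lock = new Object();
    private boolean released = false;
    private int inside = 0;     // threads currently inside the synchronized block
    private int maxInside = 0;  // highest value of 'inside' ever observed

    private void enterAndWait(CountDownLatch entered) throws InterruptedException {
        synchronized (lock) {
            inside++;
            maxInside = Math.max(maxInside, inside);
            entered.countDown();
            while (!released) {
                // wait() gives up the monitor on 'lock' until notified,
                // so another thread is free to enter this block meanwhile.
                lock.wait();
            }
            inside--;
        }
    }

    // Starts threadCount threads and reports how many were inside the block at once.
    public static int maxThreadsInside(int threadCount) throws InterruptedException {
        WaitReleasesMonitorDemo demo = new WaitReleasesMonitorDemo();
        CountDownLatch entered = new CountDownLatch(threadCount);
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> {
                try {
                    demo.enterAndWait(entered);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        // Every thread counts down only after it has entered the block.
        if (!entered.await(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("threads did not all enter the block");
        }
        synchronized (demo.lock) {
            demo.released = true;
            demo.lock.notifyAll();
        }
        for (Thread t : threads) {
            t.join();
        }
        return demo.maxInside;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("max threads inside synchronized block: " + maxThreadsInside(3));
    }
}
```

Because no thread leaves the block until all of them have entered it, the reported maximum equals the number of threads started, which is the kind of concurrent entry described above.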

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          src/main/java/hudson/maven/SplittableBuildListener.java
          http://jenkins-ci.org/commit/maven-plugin/b145d5925ddeae2d697743920da204e6991375ac
          Log:
          [FIXED JENKINS-23098]

          Reference: ZD-19531

          Looking at [4], one notices that three threads are in an effective deadlock state around synchronizeOnMark. I extracted the relevant part into [5].

          Thread #1661 is trying to report a discovered mark, but is blocked [1]. Thread #1665 is inside synchronizeOnMark, on markCountLock.wait() [2]. Thread #1667 is stuck on Future.get() and hasn't returned [3], which holds the lock that blocks [1] from unblocking [2].

          The root problem is that the synchronizeOnMark method is never meant to be executed concurrently. But given the way the lock is used, if one thread gets to wait(), it's possible that another thread would come along and go into this function.

          In this change, I'm preventing that by introducing another lock to serialize the execution of the entire synchronizeOnMark() call. I'm not using the "this" object for locking because it's already used for another purpose (see the lock() method).

          I'm not yet clear on why the synchronizeOnMark() method is called concurrently to begin with. The interaction with the -T option of Maven is suspected.

          [1] https://gist.github.com/kohsuke/374c22e737a77c9b0421#file-gistfile1-txt-L2
          [2] https://gist.github.com/kohsuke/374c22e737a77c9b0421#file-gistfile1-txt-L34
          [3] https://gist.github.com/kohsuke/374c22e737a77c9b0421#file-gistfile1-txt-L71
          [4] https://gist.github.com/abayer/7ff4de807c6373eec40d
          [5] https://gist.github.com/kohsuke/374c22e737a77c9b0421
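The approach described in the commit message, a second lock that serializes the whole call, can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the class, the field names (callLock, markLock) and the way a mark is reported are all hypothetical. The sketch relies on the fact that wait() releases only the monitor of the object it is called on, so the outer lock stays held while the thread waits.

```java
// Hypothetical sketch; names and structure are assumptions, not SplittableBuildListener.
public class SerializedWaitDemo {
    private final Object callLock = new Object(); // serializes the entire call
    private final Object markLock = new Object(); // used only for wait/notify
    private int marksSeen = 0;
    private int inside = 0;     // callers currently inside the serialized section
    private int maxInside = 0;  // highest value of 'inside' ever observed

    void waitForNextMark() throws InterruptedException {
        synchronized (callLock) {
            synchronized (markLock) {
                inside++;
                maxInside = Math.max(maxInside, inside);
                int target = marksSeen + 1;
                // Stand-in for "some other thread will eventually report the mark".
                new Thread(this::reportMark).start();
                while (marksSeen < target) {
                    // Releases markLock only; callLock remains held, so no
                    // second caller can get past the outer synchronized block.
                    markLock.wait();
                }
                inside--;
            }
        }
    }

    // The reporter needs only markLock, so it is never blocked by callLock.
    void reportMark() {
        synchronized (markLock) {
            marksSeen++;
            markLock.notifyAll();
        }
    }

    // Runs 'callers' concurrent callers and reports how many were inside at once.
    public static int maxThreadsInside(int callers) throws InterruptedException {
        SerializedWaitDemo demo = new SerializedWaitDemo();
        Thread[] threads = new Thread[callers];
        for (int i = 0; i < callers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    demo.waitForNextMark();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return demo.maxInside;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("max callers inside at once: " + maxThreadsInside(4));
    }
}
```

Using a dedicated lock object rather than "this" mirrors the reasoning in the commit message: a monitor that is already used for another purpose should not also be used to serialize this call.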

          Kohsuke Kawaguchi added a comment -

          If you see this problem, can you please try out this build and report back if that fixes the problem?

          Kohsuke Kawaguchi added a comment -

          Released Maven plugin 2.5 with this fix.

            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Andrew Bayer (abayer)
            Votes: 1
            Watchers: 8
