
Deadlock of AsyncFutureImpl.get() during massive submission of distributed jobs

      1) I trigger jobs from the Parameterized Trigger Plugin. The job submits about 64 parallel jobs with "Hello, world!" output and waits for their completion.
      2) At some point the monitoring of the jobs hangs. Even after all slave jobs finish, the master job is still waiting.
      3) According to the logs, hudson.remoting.AsyncFutureImpl.get() hangs: "completed" was initially false, and the wait() loop never returns. It seems that AsyncFutureImpl.set() was never called for one of the jobs (see the sketch below).
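
      For illustration, a minimal sketch of the wait()/notifyAll() future pattern described in item 3 (names and structure simplified; this is not the actual hudson.remoting.AsyncFutureImpl source). It shows why a missed set() call leaves get() blocked forever:

      // Simplified sketch of a wait()-based future (illustrative only).
      public class SimpleFuture<V> {
          private boolean completed; // guarded by "this"
          private V value;

          public synchronized V get() throws InterruptedException {
              // If set() is never called, this loop waits forever --
              // the hang described in item 3 above.
              while (!completed) {
                  wait();
              }
              return value;
          }

          public synchronized void set(V v) {
              value = v;
              completed = true;
              notifyAll(); // wakes up any thread blocked in get()
          }
      }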

      Additional analysis:

      • Submission works well on the local host without an additional remote node
      • The log contains only log rotation errors (see below)
      • All executor threads have finished their jobs

      Call stack of the hung executor (0x00000007866656f0 is not used by any other thread):

      "Executor #7 for master : executing Test_MassiveSubmission #8" prio=6 tid=0x000000001148e000 nid=0x16bfc in Object.wait() [0x000000000d12e000]
      java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000007866656f0> (a hudson.model.queue.FutureImpl)
        at java.lang.Object.wait(Object.java:503)
        at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
        - locked <0x00000007866656f0> (a hudson.model.queue.FutureImpl)
        at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:135)
        at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
        at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:802)
        at hudson.model.Build$BuildExecution.build(Build.java:199)
        at hudson.model.Build$BuildExecution.doRun(Build.java:160)
        at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:584)
        at hudson.model.Run.execute(Run.java:1592)
        at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
        at hudson.model.ResourceController.execute(ResourceController.java:88)
        at hudson.model.Executor.run(Executor.java:237)
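
      For context, a hedged sketch of the waiting pattern visible in the frames above: the triggering build step collects a Future per scheduled downstream build and blocks on each get() in turn, so a single Future whose set() never fires keeps the whole upstream build waiting. This is illustrative only, not the plugin's actual code; the class and method names below are invented:

      import java.util.List;
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.Future;

      // Hypothetical helper mirroring the blocking loop seen in the stack trace.
      public class WaitForDownstreamBuilds {
          public static void waitForAll(List<Future<?>> downstreamBuilds)
                  throws InterruptedException, ExecutionException {
              for (Future<?> build : downstreamBuilds) {
                  build.get(); // same blocking call as AsyncFutureImpl.get() above
              }
          }
      }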

      The error log contains only the following errors:

      SEVERE: Failed to rotate log
      java.io.IOException: C:\Users\nenashev\Documents\Work\Jenkins\contrib\parameterized-trigger-plugin\.\work\jobs\Test_MassiveSubmissionSlave\builds\2013-09-26_15-36-16 is in use
      at hudson.model.Run.delete(Run.java:1380)
      at hudson.tasks.LogRotator.perform(LogRotator.java:133)
      at hudson.model.Job.logRotate(Job.java:404)
      at hudson.model.Run.execute(Run.java:1655)
      at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      at hudson.model.ResourceController.execute(ResourceController.java:88)
      at hudson.model.Executor.run(Executor.java:237)

          [JENKINS-19776] Deadlock of AsyncFutureImpl.get() during massive submission of distributed jobs

          Jesse Glick added a comment -

          The "Failed to rotate log" error is a known (and fixed) bug for external jobs; it has been observed for other job types, but the cause there is unknown. Better diagnostics are available in dev versions. Probably unrelated to the hang.


          Oleg Nenashev added a comment -

          I'm going to try a custom core with remoting-2.32 tomorrow.
          Perhaps the issue has already been fixed by other changes. Perhaps...


          Oleg Nenashev added a comment - edited

          BTW, I have not managed to reproduce the issue on 1.530. Could it have been fixed by https://github.com/jenkinsci/jenkins/commit/bf444887ac16cc802695827da0a0f30949aa0f1f ?


          Jesse Glick added a comment -

          JENKINS-19377 was about a problem with the external job plugin, so its fix should not have had any effect on other job types.

          If you are able to reproduce the issue in some earlier version of Jenkins, but not now, then bisection can be used to pinpoint the fix, which might be useful to know (for example for backporting).


          Oleg Nenashev added a comment -

          The issue has gone away after migration to a custom core with remoting-2.32 (and several other patches from 1.509.4). Due to its randomness and low probability, there's no absolute guarantee.

          I'll try 1.509.4-RC on my installation. If I fail to reproduce the issue within 1-2 weeks, I'll just close it.

          P.S.: I still experience hangs when jobs are triggered from parallel builds, but that seems to be a plugin issue (https://issues.jenkins-ci.org/browse/JENKINS-16679).


          hiteswar kumar added a comment -

          Please share if anyone is still seeing this issue, and in which Jenkins LTS it is fixed.

          I am getting this issue on Jenkins LTS 1.480.3 and Parameterized Trigger 2.19.

          Regards,
          Hiteswar


          Oleg Nenashev added a comment -

          AFAIK, there's no direct fix for the issue.
          However, I have not seen it since the migration to 1.509.4.


          Oleg Nenashev added a comment -

          I have not been able to reproduce the issue since updating to 1.509.4.
          If anybody still experiences it, please reopen the issue.


            Assignee: Oleg Nenashev (oleg_nenashev)
            Reporter: Oleg Nenashev (oleg_nenashev)
            Votes: 0
            Watchers: 6
