Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-39454

Race condition in WorkUnitContext copying actions vs. Executor picking off a pending item

      I found that under some timing conditions it could happen that a BuildableItem in the pendings list would be considered a duplication target by scheduleInternal. In other words,

      • makePending is called—an executor is ready
      • scheduleInternal is called on the same Task, and seeing this entry in pendings, considers it a duplicate (ScheduleResult.Existing)
      • onStartExecuting is called

      Normally this would probably not matter. However the WorkUnitContext constructor is called right before makePending, and at this time it makes a (shallow) copy of the actions in the item. If scheduleInternal comes along later and finds some FoldableAction, foldIntoExisting will be called on an item whose action list has already been copied. So when the Executor copies the actions to the Executable, it will be copying the list as it existed prior to the foldIntoExisting call.

      If the foldIntoExisting implementation mutates an Action in place, as for example CauseAction does, this is generally OK because Executor will see the mutated object (let us hope the Action is thread-safe). But if the implementation adds a new action, the update will be silently ignored.

      The symptom that I saw was that sometimes if two threads of an upstream Pipeline build running the build step (or I suppose different builds or even jobs) happened to schedule the same downstream job around the same time, occasionally the resulting downstream Run would get just one BuildTriggerAction when it should have had two, as a result of which the stranded build step would show a status of unsure what happened to downstream build and would just hang even after the downstream build completed.

      The fix seems to simple: do not consider pendings from scheduleInternal. Just create a fresh item if this one is already "out the door". allowNewBuildableTask should anyway ensure that we are not trying to run too many builds in parallel. Note that the time interval between moving into the pending list and actually starting to execute should generally be very short—all Jenkins is doing is starting a new thread via Executor.start—so we are not reducing the system's inherent concurrency much.

          [JENKINS-39454] Race condition in WorkUnitContext copying actions vs. Executor picking off a pending item

          Jesse Glick added a comment -

          pipeline-build-step PR 9 has a test for this.

          Jesse Glick added a comment - pipeline-build-step PR 9 has a test for this.

          Code changed in jenkins
          User: Jesse Glick
          Path:
          core/src/main/java/hudson/model/Executor.java
          core/src/main/java/hudson/model/Queue.java
          http://jenkins-ci.org/commit/jenkins/a57b52ec7e16d9b9985d6303e918aa6fdfa0a141
          Log:
          [FIXED JENKINS-39454] Do not consider pendings when deciding whether a schedule result should be new or existing, as we have already taken a snapshot of actions. (#2609)

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: core/src/main/java/hudson/model/Executor.java core/src/main/java/hudson/model/Queue.java http://jenkins-ci.org/commit/jenkins/a57b52ec7e16d9b9985d6303e918aa6fdfa0a141 Log: [FIXED JENKINS-39454] Do not consider pendings when deciding whether a schedule result should be new or existing, as we have already taken a snapshot of actions. (#2609)

          Oleg Nenashev added a comment -

          Integrated towards 2.28

          Oleg Nenashev added a comment - Integrated towards 2.28

          Oleg Nenashev added a comment - - edited

          Marked as LTS candidate. Maybe too dangerous to backport + AFAIK there is a workaround applied in the plugins. CC jglick

          Oleg Nenashev added a comment - - edited Marked as LTS candidate. Maybe too dangerous to backport + AFAIK there is a workaround applied in the plugins. CC jglick

          Jesse Glick added a comment -

          Yes I developed a workaround in the plugin yesterday, so I think this does not need to be backported.

          Jesse Glick added a comment - Yes I developed a workaround in the plugin yesterday, so I think this does not need to be backported.

          Absorbed by 2.32 LTS line.

          Oliver Gondža added a comment - Absorbed by 2.32 LTS line.

            jglick Jesse Glick
            jglick Jesse Glick
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: