JENKINS-54643

A connection interruption causes the pipeline to fail when USE_WATCHING=true

      Run Jenkins with -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING=true. Add an agent launched via SSH (the launch method may not be important; this is just what I've observed the issue with).
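A minimal sketch of how the controller could be launched with that flag. The system property name is taken from this report. The `jenkins.war` path is an assumption, and the script only prints the command rather than starting Jenkins:

```shell
#!/bin/sh
# Sketch: assemble the Jenkins launch command with watch mode enabled.
# Assumption: Jenkins is started directly from a jenkins.war in the
# current directory (hypothetical path; adjust for your installation).
PROP=org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING
JENKINS_WAR=jenkins.war
USE_WATCHING=true   # set to false to compare against polling mode

# -D options must come before -jar so the JVM treats them as system properties.
CMD="java -D${PROP}=${USE_WATCHING} -jar ${JENKINS_WAR}"

# Print the command instead of running it; run it by hand to start Jenkins.
echo "$CMD"
```

If Jenkins runs as a packaged service, the same `-D` option would presumably go wherever that installation keeps its JVM arguments.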

      Add a pipeline job with this script:

      node('mynode') {
          sh '''#!/bin/sh -e
              for n in $(seq 100); do
                  echo "$n"
                  sleep 1
              done
          '''
          sh 'echo OK'
      }
      

      Run the pipeline. When it starts printing numbers to the log, disconnect the master from the network. After 30 seconds, reconnect it.
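One possible way to script the 30-second outage instead of unplugging a cable. This is only a sketch: it assumes a Linux controller with iproute2, root privileges, and an interface named `eth0` (hypothetical). By default it just prints the commands it would run:

```shell
#!/bin/sh
# Sketch: take a network interface down for a fixed period, then restore it.
# Assumptions: Linux with iproute2, root privileges, and an interface named
# eth0 (hypothetical; substitute the controller's real interface).
IFACE=eth0
OUTAGE_SECONDS=30
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands; 0 = execute them

# Either echo the command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Interface down, wait, interface up.
outage() {
    run ip link set dev "$IFACE" down
    run sleep "$OUTAGE_SECONDS"
    run ip link set dev "$IFACE" up
}

PLAN=$(outage)
echo "$PLAN"
```

Running it with `DRY_RUN=0` from a session that does not depend on that interface would perform the actual outage. Bringing an interface down may not behave exactly like a physical disconnection, so treat it as an approximation of the steps above.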

      For a while (I haven't measured, but it feels like a couple of minutes) nothing new appears in the log. After that, the job completes instantly, but:

      • Some of the output is missing from the log.
      • The "echo OK" step doesn't run.
      • The pipeline fails with an EOFException.

      I'm attaching a full example log.

      By contrast, with USE_WATCHING=false the log resumes a few seconds after the reconnection, no output is skipped and the job succeeds.

          Sam Van Oort added a comment -

          jglick Have you seen this one?

          Jesse Glick added a comment -

          We have a functional test for a similar scenario which does not display this issue, but it is probably too simple.

          Jesse Glick added a comment -

          The failure of the second sh step sounds like JENKINS-41854. Why watch mode would trigger that, I am not sure. The channel is getting closed, which is unsurprising if the network is unplugged (for example, a ping thread would be expected to fail); the more interesting question is why it does not get closed when in polling mode.

          Loss of some output in the face of network outages is hard to avoid in watch mode; this is simply a tradeoff for far more efficient network and master CPU utilization. PR 86 discussed possible alternative approaches that would adjust the tradeoffs.

          Jesse Glick added a comment -

          Filed JENKINS-56851 for loss of output.

          Jesse Glick added a comment -

          Considering this a duplicate of JENKINS-41854, since that was the primary reported problem.

            jglick Jesse Glick
            rdonchen_intel Roman Donchenko