Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-34448

Slave's offlineCause is rewritten during a rare condition

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: Major Major
    • core
    • None

      SlaveComputer._connect() can rewrite given offlineCause during a rare use-case - race condition.

      Connect and Disconnect of a slave is executed in a separate threads (due to non-block an UI operation). It is possible to connect slave and during this operation something asks to disconnect it. SlaveComputer._connect() expects if channel is null it means something wrong happened while launching although an exception hadn't thrown. It writes OfflineCause.LaunchFailed cause as a reason for that even it is already given for other reason (e.g. disconnect by CLI). The problematic part of the code is here. If channel is null, there is always stored LaunchFailed cause although the situation is the result of other action (disconnect, temporary offline etc.).

      I'm convinced this part of code should be executed only when we have empty offlineCause here (so nothing recognized a launch issue yet). Also all listeners in the case an offlineCause is given are notified already (closed channel, terminated channel...).

      I've investigated this issue during test coverage of disconnect-node CLI command when I've had a simple scenario:

      DumbSlave slave = j.createSlave("aNode", "", null);
      slave.toComputer().waitUntilOnline();
      assertThat(slave.toComputer().isOnline(), equalTo(true));
      assertThat(slave.toComputer().getOfflineCause(), equalTo(null));
      
      CLICommandInvoker.Result result = command
          authorizedTo(Computer.DISCONNECT, Jenkins.READ)
          .invokeWithArgs("aNode");
      assertThat(slave.toComputer().isOffline(), equalTo(true));
      assertThat(slave.toComputer().getOfflineCause() instanceof OfflineCause.ByCLI, equalTo(true));
      

      where the last assert failed randomly (from 10 to 40 percent of executions) for the reason that offlineCause wasn't a ByCLI but LaunchFailed although slave had been correctly started before and was able to execute a job.

          [JENKINS-34448] Slave's offlineCause is rewritten during a rare condition

          PR sent.

          Pavel Janoušek added a comment - PR sent.

          Code changed in jenkins
          User: Ing. Pavel Janousek
          Path:
          core/src/main/java/hudson/slaves/SlaveComputer.java
          http://jenkins-ci.org/commit/jenkins/822cec62dba4f31aac03afed26b575091f774342
          Log:
          JENKINS-34448 Fixed offlineCause race condition

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Ing. Pavel Janousek Path: core/src/main/java/hudson/slaves/SlaveComputer.java http://jenkins-ci.org/commit/jenkins/822cec62dba4f31aac03afed26b575091f774342 Log: JENKINS-34448 Fixed offlineCause race condition

          Code changed in jenkins
          User: Daniel Beck
          Path:
          core/src/main/java/hudson/slaves/SlaveComputer.java
          http://jenkins-ci.org/commit/jenkins/b8f983bceb598c10263580ffc331e6ba52d0c853
          Log:
          Merge pull request #2290 from pjanouse/JENKINS-34448

          JENKINS-34448 Fixed offlineCause race condition

          Compare: https://github.com/jenkinsci/jenkins/compare/67597a7c91ea...b8f983bceb59

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Daniel Beck Path: core/src/main/java/hudson/slaves/SlaveComputer.java http://jenkins-ci.org/commit/jenkins/b8f983bceb598c10263580ffc331e6ba52d0c853 Log: Merge pull request #2290 from pjanouse/ JENKINS-34448 JENKINS-34448 Fixed offlineCause race condition Compare: https://github.com/jenkinsci/jenkins/compare/67597a7c91ea...b8f983bceb59

          Merged.

          Pavel Janoušek added a comment - Merged.

            pajasoft Pavel Janoušek
            pajasoft Pavel Janoušek
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: