[JENKINS-34448] Slave's offlineCause is rewritten during a rare condition - Jenkins Jira

Type: Bug
Resolution: Fixed
Priority: Major
Component/s: core
Labels: None

SlaveComputer._connect() can overwrite an offlineCause that has already been set, due to a rare race condition.

Connecting and disconnecting a slave are executed in separate threads (so that the UI is not blocked). It is therefore possible that, while a slave is being connected, something requests that it be disconnected. SlaveComputer._connect() assumes that if the channel is null after launching, something went wrong during the launch even though no exception was thrown. It then records OfflineCause.LaunchFailed as the cause, even if a cause has already been set for another reason (e.g. a disconnect via the CLI). The problematic part of the code is here. If the channel is null, a LaunchFailed cause is always stored, even when the null channel is the result of another action (disconnect, marking the node temporarily offline, etc.).
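The interleaving described above can be replayed deterministically with a small model. This is a hypothetical sketch, not Jenkins code: `Computer`, `Cause`, `ByCli`, `LaunchFailed`, `disconnect()` and `finishConnect()` are illustrative stand-ins for SlaveComputer, OfflineCause and the end of _connect(), and the racy order of operations is executed on a single thread instead of two real ones.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the race (NOT the Jenkins API). Computer,
// Cause, ByCli, LaunchFailed, disconnect() and finishConnect() only mimic the
// relevant logic of SlaveComputer, OfflineCause and the end of _connect().
final class UnguardedConnectDemo {

    interface Cause {}

    // Stand-in for OfflineCause.ByCLI.
    static final class ByCli implements Cause {
        @Override public String toString() { return "ByCLI"; }
    }

    // Stand-in for OfflineCause.LaunchFailed.
    static final class LaunchFailed implements Cause {
        @Override public String toString() { return "LaunchFailed"; }
    }

    static final class Computer {
        final AtomicReference<Object> channel = new AtomicReference<>();
        final AtomicReference<Cause> offlineCause = new AtomicReference<>();

        // Disconnect path: record why the node went offline, then drop the channel.
        void disconnect(Cause cause) {
            offlineCause.set(cause);
            channel.set(null);
        }

        // End of the connect path as described in the issue: a null channel is
        // always read as a failed launch, so any existing cause is overwritten.
        void finishConnect() {
            if (channel.get() == null) {
                offlineCause.set(new LaunchFailed());
            }
        }
    }

    public static void main(String[] args) {
        Computer c = new Computer();
        c.channel.set(new Object());  // the launcher has set up a channel
        c.disconnect(new ByCli());    // a CLI disconnect arrives while connecting
        c.finishConnect();            // the connect thread resumes afterwards
        System.out.println(c.offlineCause.get()); // prints LaunchFailed: ByCLI is lost
    }
}
```

The node really was disconnected by the CLI, yet the recorded cause claims the launch failed, which matches the symptom seen in the test below.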

I'm convinced this part of the code should run only when offlineCause is still empty (i.e. nothing has recognized a launch issue yet). Also, when an offlineCause has already been set, all listeners have already been notified (closed channel, terminated channel, ...).
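A minimal sketch of the proposed guard, again using hypothetical stand-ins rather than the Jenkins API (and not the fix that was actually merged): LaunchFailed is recorded only when offlineCause is still empty. Using AtomicReference.compareAndSet for the check-and-write is an assumption of this sketch; the issue only asks that the code run when no cause has been set.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model (NOT the Jenkins API) of the guard proposed
// above: a null channel is treated as a launch failure only when no offline
// cause has been recorded yet.
final class GuardedConnectDemo {

    interface Cause {}

    // Stand-in for OfflineCause.ByCLI.
    static final class ByCli implements Cause {
        @Override public String toString() { return "ByCLI"; }
    }

    // Stand-in for OfflineCause.LaunchFailed.
    static final class LaunchFailed implements Cause {
        @Override public String toString() { return "LaunchFailed"; }
    }

    static final class Computer {
        final AtomicReference<Object> channel = new AtomicReference<>();
        final AtomicReference<Cause> offlineCause = new AtomicReference<>();

        // Disconnect path: record why the node went offline, then drop the channel.
        void disconnect(Cause cause) {
            offlineCause.set(cause);
            channel.set(null);
        }

        void finishConnect() {
            if (channel.get() == null) {
                // compareAndSet writes LaunchFailed only if the cause is still
                // null at that instant, so a cause already recorded by
                // disconnect() is never overwritten.
                offlineCause.compareAndSet(null, new LaunchFailed());
            }
        }
    }

    public static void main(String[] args) {
        // Scenario 1: a CLI disconnect races with connect; its cause survives.
        Computer raced = new Computer();
        raced.channel.set(new Object());
        raced.disconnect(new ByCli());
        raced.finishConnect();
        System.out.println(raced.offlineCause.get()); // prints ByCLI

        // Scenario 2: the launch really failed (no channel, no cause yet).
        Computer failed = new Computer();
        failed.finishConnect();
        System.out.println(failed.offlineCause.get()); // prints LaunchFailed
    }
}
```

The guard keeps the original behavior for a genuine launch failure (scenario 2) while preserving a cause that another action has already recorded (scenario 1).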

I found this issue while writing test coverage for the disconnect-node CLI command, using a simple scenario:

// j is the JenkinsRule; command is assumed to be new CLICommandInvoker(j, "disconnect-node")
DumbSlave slave = j.createSlave("aNode", "", null);
slave.toComputer().waitUntilOnline();
assertThat(slave.toComputer().isOnline(), equalTo(true));
assertThat(slave.toComputer().getOfflineCause(), equalTo(null));

CLICommandInvoker.Result result = command
    .authorizedTo(Computer.DISCONNECT, Jenkins.READ)
    .invokeWithArgs("aNode");
assertThat(slave.toComputer().isOffline(), equalTo(true));
// instanceOf reports the actual cause (e.g. LaunchFailed) when the assertion fails
assertThat(slave.toComputer().getOfflineCause(), instanceOf(OfflineCause.ByCLI.class));

The last assertion failed intermittently (in 10 to 40 percent of runs) because offlineCause was LaunchFailed rather than ByCLI, even though the slave had started correctly and was able to execute a job.

is blocking: JENKINS-34328 Extract disconnect-node CLI from Core (Resolved)

Assignee: Pavel Janoušek
Reporter: Pavel Janoušek
Votes: 0
Watchers: 2
Created: 2016-04-26 09:27
Updated: 2016-04-28 12:23
Resolved: 2016-04-28 12:23
