Type: Bug
Resolution: Fixed
Priority: Minor
Component/s: core, remoting
Labels:
Environment:

Jenkins version: 2.263.3
java.runtime.name OpenJDK Runtime Environment
java.runtime.version 11.0.9.1+1-post-Debian-1deb10u2
java.specification.name Java Platform API Specification
java.specification.vendor Oracle Corporation
java.specification.version 11
os.arch amd64
os.name Linux
os.version 4.19.0-14-amd64
Server version: Apache Tomcat/9.0.31 (Debian)


Released As:
2.297, 2.289.2

Summary

During a shell step that runs some test processes, these child processes are sometimes killed by the agent.jar process, which maintains the Jenkins connection to the agent.

This causes the test to fail, which is a false negative result.

Exact situation

A build is started on an executor (through the agent.jar process), and a shell step is executed (shell script process), which starts the test itself (test process).
It is not the shell script process, the direct child of the agent.jar process, that is killed (which I guess would be normal in case of an abort); instead the shell script's child, the test process, is killed.
We found this by extending our test binary to print context information about the kill event (initiator PID).
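
For readers who want to reproduce this kind of instrumentation, here is a minimal sketch of reporting the sender of a signal. It is not the code from our test binary. It assumes Python 3 on a platform that provides `signal.sigwaitinfo` (Linux does; macOS does not, so on the affected agents a native `sigaction` handler with `SA_SIGINFO` would be needed instead). The function name `wait_for_signal_and_report` is made up for this illustration.

```python
import os
import signal


def wait_for_signal_and_report(signums):
    """Block until one of `signums` arrives and describe who sent it.

    Hypothetical helper for illustration only. Relies on
    signal.sigwaitinfo, which is not available on every platform.
    """
    if not hasattr(signal, "sigwaitinfo"):
        raise NotImplementedError("signal.sigwaitinfo is not available here")

    # The signals must be blocked so they stay pending until we wait for them.
    signal.pthread_sigmask(signal.SIG_BLOCK, signums)
    info = signal.sigwaitinfo(signums)

    return {
        "signal": signal.Signals(info.si_signo).name,
        "sender_pid": info.si_pid,   # PID of the process that sent the signal
        "sender_uid": info.si_uid,   # real user ID of the sender
        "own_pid": os.getpid(),
        "parent_pid": os.getppid(),
    }
```

A test process could call this from a dedicated watcher thread for `signal.SIGTERM` and log the returned dictionary before exiting, which is enough to tell whether the sender is the expected agent.jar process or a different one.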

A pretty strange aspect of the issue is that we start 3 agent processes on a single machine (due to some workspace usage optimization); let's call these agent.jar process 1, 2 and 3. The agent.jar process 2 kills the test subprocess that belongs to (was started by) agent.jar process 1.

Current findings

The issue occurs only on our Apple agents running the latest macOS version (currently Big Sur 11.2.3).
It has not occurred on Linux or Windows, nor on previous macOS versions, on which we actively test too.

The attached picture shows test builds executed on a problematic agent. Sometimes these events occur on several executors at exactly the same time, sometimes only one test process is killed (both cases appear in this picture). Most of the time they are killed around the same time as another executor finishes a test build.

My best guess is: Some process leak detection algorithm causes the unnecessary kills.

(Self-generated view: the x-axis is time, and all red blocks failed due to the same termination issue.)
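
To make the leak-detection guess above more concrete, here is a small model of two ways a cleanup routine could pick processes to kill. This is a hypothetical sketch, not Jenkins code: the `Proc` class, the `MARKER` variable name, and the PIDs are all invented. It only shows that selecting by an environment marker is not confined to one agent's process subtree, so a colliding or misread marker could select a process started by a different agent.

```python
from dataclasses import dataclass, field


@dataclass
class Proc:
    """Simplified process record (hypothetical model, not a real API)."""
    pid: int
    ppid: int
    env: dict = field(default_factory=dict)


def descendants(procs, root_pid):
    """Return PIDs of all processes below root_pid in the parent/child tree."""
    children = {}
    for p in procs:
        children.setdefault(p.ppid, []).append(p.pid)
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return sorted(found)


def matching_marker(procs, key, value):
    """Return PIDs of all processes whose environment carries key=value."""
    return sorted(p.pid for p in procs if p.env.get(key) == value)


def example_processes():
    """Two agents on one machine, each with a shell step and a test process."""
    return [
        Proc(100, 1),                                 # agent.jar process 1
        Proc(110, 100, {"MARKER": "build-A"}),        # its shell script
        Proc(111, 110, {"MARKER": "build-A"}),        # its test process
        Proc(200, 1),                                 # agent.jar process 2
        Proc(210, 200, {"MARKER": "build-B"}),        # its shell script
        Proc(211, 210, {"MARKER": "build-B"}),        # its test process
    ]
```

With correct data, `descendants(procs, 200)` and `matching_marker(procs, "MARKER", "build-B")` agree on `[210, 211]`. If the environment of process 111 were wrongly seen as `build-B`, the marker-based selection would also return 111, which belongs to agent 1, while the tree-based selection would not.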

The agent.jar logs show the following relevant lines:

./remoting.finer.log.0:Mar 11, 2021 10:36:48 AM hudson.util.ProcessTree$Darwin <init>
./remoting.finer.log.0:Mar 11, 2021 10:36:48 AM hudson.remoting.Channel send
./remoting.finer.log.0:Mar 11, 2021 10:36:48 AM hudson.util.ProcessTree$UnixProcess killRecursively
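
To correlate such lines with the times at which test processes died, they can be parsed into timestamps. The following is a rough sketch that assumes the grep-style `file:date logger method` layout shown above and an English locale for month names and AM/PM. The function names are invented for this example.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above:
#   <file>:<Mon d, yyyy h:mm:ss AM|PM> <logger class> <method>
LINE_RE = re.compile(
    r"^(?P<file>[^:]+):"
    r"(?P<ts>\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) "
    r"(?P<logger>\S+) (?P<method>\S+)$"
)


def parse_line(line):
    """Parse one log header line; return a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "file": m.group("file"),
        "time": datetime.strptime(m.group("ts"), "%b %d, %Y %I:%M:%S %p"),
        "logger": m.group("logger"),
        "method": m.group("method"),
    }


def kill_events(lines):
    """Return the timestamps of all killRecursively entries, in input order."""
    parsed = (parse_line(line) for line in lines)
    return [p["time"] for p in parsed if p and p["method"] == "killRecursively"]
```

Comparing the output of `kill_events` for each agent's log against the build end times of the other executors would show whether the kills line up with another executor finishing, as the picture suggests.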

It started to occur around the end of 2020.
Our Jenkins version update dates:

Until 2020.10.20 --> 2.252
Until 2020.12.02 --> 2.263
Until 2021.01.25 --> 2.262.1
Now it is on 2.263.3

So it could have started with Jenkins version 2.262.1, or with the new macOS versions, which were updated/introduced around that time too.

It does not depend on the agent.jar version; it occurred with both:

Remoting version: 3.17
Remoting version: 4.5

It does not depend on the CPU architecture of the agent; it occurred on both x64 and arm64.
It does not depend on the Java version of the agent; it occurred with:

openjdk version "15.0.1" 2020-10-20, OpenJDK Runtime Environment (build 15.0.1+9-18)
openjdk version "11.0.1" 2018-10-16, OpenJDK Runtime Environment 18.9 (build 11.0.1+13)

In my initial research I found this possibly relevant PR, but I am not sure how closely it is related:
JENKINS-59152 - Reduce the default process soft-kill timeout from 2 minutes to 5 seconds #4225

The test jobs do not have any timeout-detection logic, plugin, or configuration.

We tried to gather more relevant logs to see what exactly happens, but found only the events mentioned above. Maybe you can recommend a better log configuration (ours is attached: agent.logging.properties) to catch more events related to the issue.


Attachments:
agent.logging.properties (0.5 kB, 2021-03-22 14:11)
killedTogether.PNG (24 kB, 2021-03-22 14:27)

is related to

JENKINS-65911 Jenkins shuts down instead of restarting on Mac M1

Closed

links to

PR 5548
