Status: Closed
Environment:
Jenkins version: 2.263.3
java.runtime.name OpenJDK Runtime Environment
java.runtime.version 22.214.171.124+1-post-Debian-1deb10u2
java.specification.name Java Platform API Specification
java.specification.vendor Oracle Corporation
java.specification.version 11
os.arch amd64
os.name Linux
os.version 4.19.0-14-amd64
Server version: Apache Tomcat/9.0.31 (Debian)
Released As: 2.297, 2.289.2
During a shell step that calls some test processes, these child processes are sometimes killed by the agent.jar process, which establishes the Jenkins connection to the agent.
This causes the test to fail, which is a false negative result.
A build is started on an executor (through the agent.jar process), and a shell step is executed (shell script process), which starts the test itself (test process).
It is not the shell script process (the direct child of the agent.jar process) that is killed, which would be normal in case of an abort I guess. Instead, the shell script's child, the test process, is killed.
We found this by extending our test binary to print context information about the kill event (the initiator PID).
A pretty strange aspect of the issue is that we start 3 agent processes on a single machine (due to some workspace usage optimization). Let's call these agent.jar process 1, 2 and 3. It is agent.jar process 2 that kills the test subprocess belonging to (started by) agent.jar process 1.
The issue occurs only on our Apple agents running the latest macOS version (currently Big Sur 11.2.3).
It has not occurred on Linux or Windows, nor on previous macOS versions, all of which we actively test on too.
The attached picture shows test builds executed on a problematic agent. Sometimes these events occur on several executors at exactly the same time, sometimes only one test process is killed (both cases appear in this picture). Most of the time the processes are killed around the same time as another executor has just finished a test build.
My best guess is that some process leak detection algorithm causes the unnecessary kills.
(Self-generated view, x-axis is time, all red blocks failed due to the same termination issue)
The agent.jar logs show the following relevant lines:
It started to occur at the end of 2020.
Our Jenkins version update dates:
- Until 2020.10.20 --> 2.252
- Until 2020.12.02 --> 2.263
- Until 2021.01.25 --> 2.262.1
- Now it is on 2.263.3
So it could have started with Jenkins version 2.262.1, or with the new macOS versions, which were updated/introduced around that time too.
It is not dependent on the agent.jar version, it occurred for both:
- Remoting version: 3.17
- Remoting version: 4.5
It is not dependent on the CPU architecture of the agent, it occurred for both x64 and arm64.
It is not dependent on the Java version of the agent, it occurred for:
- openjdk version "15.0.1" 2020-10-20, OpenJDK Runtime Environment (build 15.0.1+9-18)
- openjdk version "11.0.1" 2018-10-16, OpenJDK Runtime Environment 18.9 (build 11.0.1+13)
During my initial research I found this relevant PR, but I am not sure how closely it is related:
JENKINS-59152 - Reduce the default process soft-kill timeout from 2 minutes to 5 seconds #4225
The test jobs do not have any time-out detecting logic/plugin/configuration.
We tried to gather more relevant logs to see what exactly happens, but found only the mentioned events. Maybe you can recommend a better log configuration (ours is attached: agent.logging.properties) to catch more events related to the issue.