JENKINS-65195

Unexpected process termination by agent.jar on Apple Big Sur


    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • core, remoting
    • 2.297, 2.289.2

      Summary

      During a shell step calling some test processes, these child processes are sometimes killed by the agent.jar process, which establishes the Jenkins connection to the agent.

      This causes the test to fail, which is a false negative result.

      Exact situation

      A build is started on an executor (through the agent.jar process) and a shell step is executed (the shell script process), which starts the test itself (the test process).
      It is not the shell script process (the direct child of the agent.jar process) that is killed, which would be normal in case of an abort I guess. Instead, the shell script's child, the test process, is killed.
      We found this by extending our test binary to print context info about the kill event (the initiator PID).
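For readers who want to reproduce this kind of instrumentation, the following is a minimal sketch of how a test binary could report the PID of the process that signalled it. It is not our actual test code. It assumes the kill arrives as SIGTERM (SIGKILL cannot be caught), and the names `install_kill_reporter`, `on_terminate`, `last_signal` and `last_sender_pid` are hypothetical. It relies only on POSIX `sigaction` with `SA_SIGINFO`, which exposes the sender in `si_pid`.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping, written only from the signal handler. */
static volatile sig_atomic_t last_signal = 0;
static volatile sig_atomic_t last_sender_pid = 0;

/* write() wrapper that ignores the result; write() is async-signal-safe. */
static void safe_write(const char *text, size_t len)
{
    ssize_t ignored = write(STDERR_FILENO, text, len);
    (void)ignored;
}

/* Format a non-negative number without stdio, so it is usable in a handler. */
static size_t format_pid(long value, char *buf, size_t cap)
{
    char reversed[24];
    size_t n = 0, out = 0;
    unsigned long v = value < 0 ? 0UL : (unsigned long)value;

    do {
        reversed[n++] = (char)('0' + (v % 10));
        v /= 10;
    } while (v != 0 && n < sizeof reversed);
    while (n > 0 && out < cap) {
        buf[out++] = reversed[--n];
    }
    return out;
}

/* Handler: record and print which PID sent the termination signal. */
static void on_terminate(int signo, siginfo_t *info, void *context)
{
    static const char prefix[] = "termination signal sent by PID ";
    char digits[24];
    size_t len;

    (void)context;
    last_signal = signo;
    last_sender_pid = (sig_atomic_t)info->si_pid;

    len = format_pid((long)info->si_pid, digits, sizeof digits);
    safe_write(prefix, sizeof prefix - 1);
    safe_write(digits, len);
    safe_write("\n", 1);
}

/* Install the reporter for SIGTERM; returns 0 on success, -1 on failure. */
int install_kill_reporter(void)
{
    struct sigaction action;

    memset(&action, 0, sizeof action);
    action.sa_sigaction = on_terminate;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    return sigaction(SIGTERM, &action, NULL);
}
```

Calling `install_kill_reporter()` early in `main` is enough to get the report. In this sketch the handler only records and prints the sender, so a real test binary would still have to exit (or restore the default disposition and re-raise the signal) afterwards to keep its normal termination behavior.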

      A pretty strange aspect of the issue is that we start 3 agent processes on a single machine (for workspace usage optimization). Let's call these agent.jar process 1, 2 and 3. The agent.jar process 2 kills the test subprocess that belongs to (was started by) agent.jar process 1.

      Current findings

      The issue occurs only on our Apple agents running the latest macOS version (currently Big Sur 11.2.3).
      It has not occurred on Linux or Windows, nor on previous macOS versions, all of which we actively test on too.

      The attached picture shows test builds executed on a problematic agent. Sometimes these events occur on several executors at the exact same time, sometimes only one test process is killed (both cases occurred in this picture). Most of the time the processes are killed around the same time as another executor finishes a test build.

      My best guess is: Some process leak detection algorithm causes the unnecessary kills.

      (Self-generated view: the x-axis is time, and all red blocks failed due to the same termination issue.)

      The agent.jar logs show the following relevant lines:

      ./remoting.finer.log.0:Mar 11, 2021 10:36:48 AM hudson.util.ProcessTree$Darwin <init>
      ./remoting.finer.log.0:Mar 11, 2021 10:36:48 AM hudson.remoting.Channel send
      ./remoting.finer.log.0:Mar 11, 2021 10:36:48 AM hudson.util.ProcessTree$UnixProcess killRecursively

      It started to occur at the end of 2020.
      Our Jenkins version update dates:

      • Until 2020.10.20 --> 2.252
      • Until 2020.12.02 --> 2.263
      • Until 2021.01.25 --> 2.262.1
      • Now it is on 2.262.3

      So it could have started with Jenkins version 2.262.1, or with the new macOS versions, which were updated/introduced around that time too.

      It is not dependent on the agent.jar version; it occurred with both:

      • Remoting version: 3.17
      • Remoting version: 4.5

      It is not dependent on the CPU architecture of the agent; it occurred on both x64 and arm64.
      It is not dependent on the Java version of the agent; it occurred with:

      • openjdk version "15.0.1" 2020-10-20, OpenJDK Runtime Environment (build 15.0.1+9-18)
      • openjdk version "11.0.1" 2018-10-16, OpenJDK Runtime Environment 18.9 (build 11.0.1+13)

      In my initial research I found this possibly relevant PR, but I am not sure how closely it is related:
      JENKINS-59152 - Reduce the default process soft-kill timeout from 2 minutes to 5 seconds #4225

      The test jobs do not have any time-out detecting logic/plugin/configuration.

      We tried to gather more relevant logs to see what exactly happens, but found only the events mentioned above. Maybe you can recommend a better log configuration (ours is attached: agent.logging.properties) to catch more events related to the issue.

       

       

            ngg1 NGG
            rudolf Rudolf Horvath