  1. Jenkins
  2. JENKINS-68626

BUILD_ID=dontKillMe does not preserve processes when job is aborted


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • core
    • None
    • Jenkins 2.303.1

      Setting the BUILD_ID environment variable of a process to dontKillMe does not prevent that process from being killed by ProcessTreeKiller, contrary to what is described in the link.

      Given a freestyle project with the following build script:

      #!/bin/bash
      
      set -x
      
      exec 2>/tmp/abort-test.debug
      exec 1>&2
      
      trap 'echo "Got INTR"; exit 1' SIGINT
      trap 'echo "Got TERM"; exit 1' SIGTERM
      trap 'echo "Got HUP"; exit 1' SIGHUP
      
      BUILD_ID=dontKillMe /usr/bin/sleep 300 &
      wait
      sleep 5
      

      In this script, setting BUILD_ID=dontKillMe does not make the /usr/bin/sleep process immune to the ProcessTreeKiller when the job is aborted.
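
      Before looking at the killer itself, it can help to rule out the possibility that the variable never reached the child. The following is a minimal sketch, assuming a Linux host with /proc and permission to read the child's environ file; show_build_id is a hypothetical helper name, not something Jenkins provides. It only checks that BUILD_ID is present in the child's environment and says nothing about how ProcessTreeKiller evaluates it.

```shell
#!/bin/bash
# show_build_id is a hypothetical helper, not part of Jenkins.
# It prints the BUILD_ID entry from a process's environment (Linux /proc only).
show_build_id() {
    local pid=$1
    # /proc/PID/environ is NUL-separated; convert it to one variable per line.
    tr '\0' '\n' < "/proc/${pid}/environ" | grep '^BUILD_ID='
}

# Start a child the same way the build script does.
BUILD_ID=dontKillMe sleep 30 &
child=$!

# The child's environ only reflects the new variable once it has exec'ed
# sleep, so poll briefly instead of reading it immediately after the fork.
value=""
for _ in 1 2 3 4 5 6 7 8 9 10; do
    value=$(show_build_id "$child")
    [ "$value" = "BUILD_ID=dontKillMe" ] && break
    sleep 0.1
done
echo "$value"

# Clean up the test child.
kill "$child"
wait "$child" 2>/dev/null || true
```

      In the scenario from this report, the same helper could be pointed at the PID of the /usr/bin/sleep 300 process while the job is running.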

      Here's my proof...

      Start the job, then identify which processes belong to it by finding those that have the /tmp/abort-test.debug file open:

      # fuser /tmp/abort-test.debug 
      /tmp/abort-test.debug: 37173 37177
      # ps -p37173,37177 fw
        PID TTY      STAT   TIME COMMAND
      37173 ?        S      0:00 /bin/bash /tmp/jenkins759647078472167760.sh
      37177 ?        S      0:00  \_ /usr/bin/sleep 300
      

      Now look up the parent of the script process to identify the Jenkins executor process:

      # ps -ef | grep 37173
      lcl_bui+ 37173 12777  0 19:18 ?        00:00:00 /bin/bash /tmp/jenkins759647078472167760.sh
      lcl_bui+ 37177 37173  0 19:18 ?        00:00:00 /usr/bin/sleep 300
      # ps -p 12777
        PID TTY          TIME CMD
      12777 ?        08:57:24 java
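
      As an alternative to grepping the ps -ef listing, the parent PID can be read directly. This is a small sketch, assuming Linux /proc; parent_of is a hypothetical helper name introduced only for illustration.

```shell
#!/bin/bash
# parent_of is a hypothetical helper: print the parent PID of a process,
# read from the "PPid:" line of /proc/PID/status (Linux only).
parent_of() {
    local pid=$1
    awk '/^PPid:/ { print $2 }' "/proc/${pid}/status"
}

# Example: print the parent of the current shell.
parent_of "$$"
```

      Called with the PID of the job script (37173 in the listing above), it would print the PID of the process that launched it.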
      

      Now attach strace to the executor:

      # strace -o /tmp/java.strace -f -p 12777 -e trace=\!futex,sched_yield
      

      Then abort the job from Jenkins. After a few seconds, the file written by strace shows that the executor did indeed send SIGTERM to the /usr/bin/sleep process (PID 37177):

      ...
      37152 kill(37177, SIGTERM)              = 0
      37152 stat("/proc/37177/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      37152 stat("/proc/37177/status", 0x7f3af3bc8090) = -1 ENOENT (No such file or directory)
      37152 prctl(PR_SET_NAME, "pool-1-thread-4"...) = 0
      37152 write(24, "\v+", 2)               = 2
      37152 write(24, "\254\355\0\5sr\0\33hudson.remoting.UserRequ"..., 2859) = 2859
      12949 <... read resumed>"\7\363", 8192) = 2
      12949 read(0, "\254\355\0\5sr\0\30hudson.remoting.Response"..., 8192) = 2035
      12949 read(0,  <unfinished ...>
      37152 prctl(PR_SET_NAME, "pool-1-thread-4"...) = 0
      

      It then sends SIGTERM to the main job script (PID 37173):

      37152 kill(37173, SIGTERM)              = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      12838 mprotect(0x7f3d4a8f7000, 4096, PROT_READ) = 0
      12838 mprotect(0x7f3d4a8f7000, 4096, PROT_READ|PROT_WRITE) = 0
      12838 mprotect(0x7f3d4a8f8000, 4096, PROT_NONE) = 0
      12838 mprotect(0x7f3d4a8f8000, 4096, PROT_READ) = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      12949 <... read resumed>"\6\352", 8192) = 2
      12949 read(0, "\254\355\0\5sr\0\33hudson.remoting.UserRequ"..., 8192) = 1770
      12949 read(0,  <unfinished ...>
      37234 prctl(PR_SET_NAME, "pool-1-thread-4"...) = 0
      37234 write(24, "\6+", 2)               = 2
      37234 write(24, "\254\355\0\5sr\0\30hudson.remoting.Response"..., 1579) = 1579
      37234 prctl(PR_SET_NAME, "pool-1-thread-4"...) = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      37152 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=37173, si_uid=1101, si_status=1, si_utime=0, si_stime=0} ---
      37152 restart_syscall(<... resuming interrupted stat ...> <unfinished ...>
      37174 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 37173
      37152 <... restart_syscall resumed>)    = -1 ETIMEDOUT (Connection timed out)
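
      The strace output is noisy, so it may be easier to filter it down to the signals that were actually sent. The sketch below assumes the file was produced with -f and -o as in the command above, so that each line starts with the PID of the calling thread; list_kills is a hypothetical helper name.

```shell
#!/bin/bash
# list_kills is a hypothetical helper: print every kill() call recorded
# in an strace -f -o output file, one per line.
list_kills() {
    local trace_file=$1
    # With -f and -o, each line begins with the PID of the calling thread.
    grep -E '^[0-9]+ +kill\(' "$trace_file"
}

# Usage against the file written by the strace command in this report:
#   list_kills /tmp/java.strace
```

      Applied to the excerpts shown here, this would leave only the two kill(..., SIGTERM) lines, for PIDs 37177 and 37173.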
      

      The xtrace output from the job confirms that the script received SIGTERM:

      # cat /tmp/abort-test.debug 
      + exec
      + trap 'echo "Got INTR"; exit 1' SIGINT
      + trap 'echo "Got TERM"; exit 1' SIGTERM
      + trap 'echo "Got HUP"; exit 1' SIGHUP
      + wait
      + BUILD_ID=dontKillMe
      + /usr/bin/sleep 300
      + sleep 5
      ++ echo 'Got TERM'
      Got TERM
      ++ exit 1
      

      The important point here is that the /usr/bin/sleep process was killed by the executor even though it had its BUILD_ID set to dontKillMe.

            Assignee: Unassigned
            Reporter: Brian J Murrell (brianjmurrell)
            Votes: 0
            Watchers: 2
