JENKINS-69061

Pipelines do not resume properly after Jenkins restart


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • Environment: Jenkins 2.360, Durable Task Plugin 496.va67c6f9eefa7, Pipeline Version 590.v6a_d052e5a_a_b_5

       Problem: 

      Since recent upgrades (2.332 and now 2.360), if the Jenkins service is restarted while a pipeline project is running, the pipeline does not resume properly once the Jenkins process is back up. The job attempts to resume for several minutes and then fails.

      This happens whether the service is restarted through a normal Linux service restart or through jenkins-cli. The behavior is the same either way.

      Steps to reproduce the issue: 

      1. Start a pipeline job on a single-node Jenkins host. A simple example Jenkinsfile is included below.
      2. While the job is running, restart the Jenkins service with "service jenkins restart" (or use jenkins-cli.jar to restart).
      3. After Jenkins starts, the job attempts to resume but eventually fails (log below). This used to work fine.
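
One way to narrow this down during step 2 is to check whether the process started by the pipeline's sh step is still alive after the service restart. The following is a minimal diagnostic sketch, not something taken from Jenkins or the Durable Task plugin: the helper name any_survivor is hypothetical, and it assumes a Linux host with /proc mounted.

```shell
# Diagnostic sketch (Linux only, relies on /proc). The function name is hypothetical.

# Succeeds if at least one PID in the space-separated list is still running.
# Usage: any_survivor "PID PID ..."
any_survivor() {
    for pid in $1; do
        [ -d "/proc/$pid" ] && return 0
    done
    return 1
}

# Intended use on the Jenkins host while the example job is in its sleep step:
#   before=$(pgrep -f 'sleep 60')
#   sudo service jenkins restart
#   if any_survivor "$before"; then
#       echo "step process survived the restart"
#   else
#       echo "step process was gone after the restart"
#   fi
```

If the sleep finishes on its own before the check runs, the result says nothing about the restart, so a longer sleep in the test job makes the observation more reliable.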

      More details and notes: 

      This worked perfectly on an older version of Jenkins (2.2x). We recently upgraded the Jenkins hosts to v2.332 (through apt on Ubuntu 20.04) and upgraded the plugins at the same time, and the issue started happening. The pipelines have not changed. I have since upgraded to 2.360 and it still does not work. All plugins are up to date as well.

      The script output is similar to that of many other open and closed issues related to the Durable Task plugin, but this scenario does not match those issues.

      The log below is what appears once Jenkins comes back online after the restart and the job attempts to resume.

      Resuming build at Tue Jul 19 23:26:56 UTC 2022 after Jenkins restart
      Waiting to resume part of test-job #5: Waiting for next available executor
      Ready to run at Tue Jul 19 23:27:01 UTC 2022
      wrapper script does not seem to be touching the log file in /data/jenkins_home/workspace/test-job@tmp/durable-b0167617
      (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400) 
      • The log file mentioned in this failure message does exist during job execution.
      • Manually touching/writing to the log file does not resolve the problem.
      • After the above message is logged, the job goes into a "failed" state, but this takes a while.
      • The cause is neither the filesystem nor available memory, which solutions in related tickets and posts have pointed to.
      • There are no available plugin updates (fully up to date).
      • This appears to have started with version 2.332, which also included the migration to systemd. Is it possible that restarting the service through systemd (rather than the old init system used before 2.332) is breaking the durable task?
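
For readers unfamiliar with the failure message, the sketch below models the kind of check it implies: the controller looks at when the log file was last modified and gives up if that was too long ago. This is a simplified illustration, not the plugin's actual code. The function name log_is_stale and the 300-second threshold in the usage comment are assumptions, and it relies on GNU stat as found on Ubuntu.

```shell
# Simplified model of a log-file heartbeat check (not the plugin's actual code).
# Returns 0 (stale) if FILE is missing or was last modified more than
# MAX_AGE_SECONDS ago, and non-zero otherwise.
# Usage: log_is_stale FILE MAX_AGE_SECONDS
log_is_stale() {
    file=$1
    max_age=$2
    [ -e "$file" ] || return 0
    now=$(date +%s)
    # GNU stat (Linux); BSD/macOS would need `stat -f %m` instead.
    mtime=$(stat -c %Y "$file") || return 0
    [ $((now - mtime)) -gt "$max_age" ]
}

# Example (threshold value is an assumption):
#   log_is_stale "$LOG_FILE" 300 && echo "no heartbeat seen recently"
```

Under this model a wrapper process that was terminated together with the service would stop updating the file, which is one situation in which such a message could appear. The report notes that touching the file by hand did not help, so the real check evidently involves more than this sketch captures.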

       

      Example Pipeline file:

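      pipeline {
        // Run on any available executor (a single-node host in this report)
        agent any

        stages {
          stage("Sleep for 60 seconds") {
            steps {
              echo "Go restart jenkins service now and see that this job won't succeed"

              // Long-running shell step; restart the Jenkins service while it runs
              sh "sleep 60"

              // Only reached if the build resumes correctly after the restart
              echo "The job will never get this far"
            }
          }
        }
      }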

      Use Case / Impact:

      Major impact, because a primary component is not working:

      • This is a regression.
      • Resuming after a fault or unexpected crash of the Jenkins process is something durable tasks should be expected to do, and it is the primary reason durable pipelines exist at all. This is especially true for deployment automation.
      • Restarting the Jenkins service is a normal part of workflows that use the Jenkins init.groovy.d hook scripts to configure Jenkins itself (aka Jenkins configures Jenkins). "service jenkins restart" has been part of our CI/CD workflow for a long time and only recently stopped working properly.

            Assignee: Unassigned
            Reporter: Matt Dee (mdebord1)
            Votes: 1
            Watchers: 8