JENKINS-51568

Pipeline jobs hanging in Build Executor even if it is finished

Details

    Description

      We have a huge Jenkins instance that runs about 20k builds a day.

      At some point, after a couple of days without restarting the Jenkins master, Pipeline jobs start to hang executors after the build finishes. Freestyle and Maven jobs work fine.
      A busy executor looks like this (screenshot in Attachments):

      But the build status is "finished" (screenshot in Attachments):

      jenkins.log contains records for both the start and the finish of the build, but the executor was not released when the build finished at May 28, 2018 4:35:14 PM:

      May 28, 2018 4:28:07 PM org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader$4$1 load
      WARNING: took 5,770ms to load/not load groovy.lang.GroovyObject$groovy$util$script15275137998231310805619$SSH_LOGIN from classLoader hudson.PluginManager$UberClassLoader
      2018/05/28 05:07:798 - job/KKA/job/TRIGGER_JOB_NEW_BUILD_IN_NEXUS_FLAG/ #4533 Started by timer
      May 28, 2018 4:28:07 PM org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader$4$1 load
      WARNING: took 6,783ms to load/not load groovy.lang.GroovyObject$groovy$util$script15275138109721310805619$SSH_LOGIN from classLoader hudson.PluginManager$UberClassLoader
      
      ...
      
      WARNING: Owner[PPRBTEAM/archive/Deimos/deimos-module-list-wf/23333:PPRBTEAM/archive/Deimos/deimos-module-list-wf #23333] was not in the list to begin with: [Owner[GringoTesting/Try/1:GringoTesting/Try #1], Owner[PPRB_DevOps/Install_EIP/Install_EIP_2_cli/211:PPRB_DevOps/Install_EIP/Install_EIP_2_cli #211], Owner[AutoTransaction/AutoTransaction_release_major-2018-04-30_deprecated/275:AutoTransaction/AutoTransaction_release_major-2018-04-30_deprecated #275], Owner[DataFactory/AHD/Pipeline_dev/49:DataFactory/AHD/Pipeline_dev #49], Owner[CSUO/PipelineBadCode/19:CSUO/PipelineBadCode #19], Owner[PPRB_DevOps/Install_EIP/Install_EIP_2_cli/616:PPRB_DevOps/Install_EIP/Install_EIP_2_cli #616], Owner[KKMB/bundle-barriers/barrier-sbof-commondealinf-r1.28.0/7:KKMB/bundle-barriers/barrier-sbof-commondealinf-r1.28.0 #7], Owner[ESB/FS/FS_CI_RFC_PR_8/3060:ESB/FS/FS_CI_RFC_PR_8 #3060], Owner[TFS/TestJobs/integrationOnly/809:TFS/TestJobs/integrationOnly #809], Owner[PPRB_DevOps/KBT/Install_KBT(DEV)_ear/265:PPRB_DevOps/KBT/Install_KBT(DEV)_ear #265], Owner[MBP/mbp-ci/123:MBP/mbp-ci #123], Owner[PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend1_dev/338:PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend1_dev #338], Owner[Kalita/tmp/test_pr/46:Kalita/tmp/test_pr #46], Owner[Kalita/tmp/test_pr/47:Kalita/tmp/test_pr #47], Owner[ASKOO/koo-release-build/feature%2F513/8:ASKOO/koo-release-build/feature%2F513 #8], Owner[ASKOO/koo-release-build/develop/117:ASKOO/koo-release-build/develop #117], Owner[ASKOO/koo-release-build/feature%2F513/9:ASKOO/koo-release-build/feature%2F513 #9], Owner[ASKOO/koo-release-build/develop/118:ASKOO/koo-release-build/develop #118], Owner[GBK/QG_check_minor250518/205:GBK/QG_check_minor250518 #205], Owner[GBK/QG_Check_dev/345:GBK/QG_Check_dev #345], Owner[PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend1_dev/339:PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend1_dev 
#339], Owner[GateWayDP/Gateways/Gateway_ESBGW_CI_PIPELINE/363:GateWayDP/Gateways/Gateway_ESBGW_CI_PIPELINE #363], Owner[DataFactory/stork/Build_PullRequest/524:DataFactory/stork/Build_PullRequest #524], Owner[KKA/KKA_PIPE_CI/352:KKA/KKA_PIPE_CI #352], Owner[KKA/KKA_PIPE_DEPLOY/154:KKA/KKA_PIPE_DEPLOY #154], Owner[GBK/QG_check_minor250518/206:GBK/QG_check_minor250518 #206], Owner[PPRB_DepositCashOperations/card/BuildDistr_Develop_Nexus_Prod_QG/57:PPRB_DepositCashOperations/card/BuildDistr_Develop_Nexus_Prod_QG #57], Owner[PRCRED/CBIR/AUTODEPLOY_PIPE_KENNY/33:PRCRED/CBIR/AUTODEPLOY_PIPE_KENNY #33], Owner[Kalita/Regular_clt_dev_builds/295:Kalita/Regular_clt_dev_builds #295], Owner[DataFactory/stork/Build_PullRequest/525:DataFactory/stork/Build_PullRequest #525], Owner[Kalita/tmp/test_pr/48:Kalita/tmp/test_pr #48], Owner[AEP/AEP_QG/806:AEP/AEP_QG #806], Owner[SNUiL/DevOps/CI-Builds/CI-Build-PIR-29/549:SNUiL/DevOps/CI-Builds/CI-Build-PIR-29 #549], Owner[GBK/QG_Check_dev/346:GBK/QG_Check_dev #346], Owner[GBK/QG_check_minor250518/207:GBK/QG_check_minor250518 #207], Owner[ECOD/DEVELOP/NexusArtifactFlag/1:ECOD/DEVELOP/NexusArtifactFlag #1], Owner[Kalita/Parallel_Pipeline_2/1003:Kalita/Parallel_Pipeline_2 #1003], Owner[HPSM/HPSM_pipeline/403:HPSM/HPSM_pipeline #403], Owner[ESB_CF/IIB9_PIPELINE/3751:ESB_CF/IIB9_PIPELINE #3751], Owner[DataFactory/stork/Build_PullRequest/526:DataFactory/stork/Build_PullRequest #526], Owner[PPRB_DepositCashOperations/card/BuildDistr_Develop_Nexus_Prod_QG/58:PPRB_DepositCashOperations/card/BuildDistr_Develop_Nexus_Prod_QG #58], Owner[DevOps/AHD/sleep-test/2:DevOps/AHD/sleep-test #2], Owner[ESB_SMP/FullBuild/484:ESB_SMP/FullBuild #484], Owner[DataFactory/stork/Build_PullRequest/527:DataFactory/stork/Build_PullRequest #527], Owner[SBK/SMARTREGRESS_TEST_IFT1/4:SBK/SMARTREGRESS_TEST_IFT1 #4], Owner[impprb/checkQG/249:impprb/checkQG #249], Owner[DevOps/AHD/sleep-test/4:DevOps/AHD/sleep-test #4], 
Owner[TFS/PRBuilders/BuilderPRForCore/903:TFS/PRBuilders/BuilderPRForCore #903], Owner[HPSM/HPSM_pipeline/404:HPSM/HPSM_pipeline #404], Owner[FCCM8/regular_release/41:FCCM8/regular_release #41], Owner[ESB/DevOps/Other/AUTOTESTS/DO_PR_INIT/380:ESB/DevOps/Other/AUTOTESTS/DO_PR_INIT #380], Owner[PPRB_DevOps/KBT/Install_KBT(DEV)_ear/268:PPRB_DevOps/KBT/Install_KBT(DEV)_ear #268], Owner[DataFactory/stork/Build_PullRequest/528:DataFactory/stork/Build_PullRequest #528], Owner[SNUiL/DevOps/CI-Deploy/CI-Deploy-to-testing-from-git/389:SNUiL/DevOps/CI-Deploy/CI-Deploy-to-testing-from-git #389], Owner[ASCC/ascc_full_RELEASE/02.013.00_STG-19937_jenkins_release_job/41:ASCC/ascc_full_RELEASE/02.013.00_STG-19937_jenkins_release_job #41], Owner[ASKOO/koo-release-build/feature%2F253/29:ASKOO/koo-release-build/feature%2F253 #29], Owner[GateWayDP/Gateways/Gateway_EDOGO_CI_PIPELINE/334:GateWayDP/Gateways/Gateway_EDOGO_CI_PIPELINE #334], Owner[DEPOZITORY/PB/69:DEPOZITORY/PB #69], Owner[DataFactory/stork/Auto_Test_DEV/2115:DataFactory/stork/Auto_Test_DEV #2115], Owner[GBK/QG_check_minor250518/208:GBK/QG_check_minor250518 #208], Owner[TDS/GREEN_AN_GREEN/87:TDS/GREEN_AN_GREEN #87], Owner[PPRB_DepositCashOperations/common/Publish_to_IFT_universal/320:PPRB_DepositCashOperations/common/Publish_to_IFT_universal #320], Owner[PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe/9156:PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe #9156], Owner[ASCC/server1/02.013.00/1496:ASCC/server1/02.013.00 #1496], Owner[ASCC/ascc_server_branch_build/02.013.00_STG-19937_jenkins_release_job/37:ASCC/ascc_server_branch_build/02.013.00_STG-19937_jenkins_release_job #37], Owner[ESB/FS/Meshkov/FS_CI_RFC_tst/763:ESB/FS/Meshkov/FS_CI_RFC_tst #763], Owner[PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe/9157:PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe #9157], Owner[mmt/DEV/1051:mmt/DEV #1051], 
Owner[ASKOO/koo-release-build/support%2F02.021/18:ASKOO/koo-release-build/support%2F02.021 #18], Owner[Kalita/Parallel_Pipeline_2/1005:Kalita/Parallel_Pipeline_2 #1005], Owner[DataFactory/stork/Build_PullRequest/529:DataFactory/stork/Build_PullRequest #529], Owner[SNUiL/DevOps/CI-Builds/CI-Build-SNUILDEV-3794-COURIER/249:SNUiL/DevOps/CI-Builds/CI-Build-SNUILDEV-3794-COURIER #249], Owner[ASBS/buildByPipeline/532:ASBS/buildByPipeline #532], Owner[ASCC/ascc_server_branch_build/02.014.00_STG-18499_CompositeOutCashOrders/14:ASCC/ascc_server_branch_build/02.014.00_STG-18499_CompositeOutCashOrders #14], Owner[adpSWIFT/adpSWIFT_PIPELINE/3102:adpSWIFT/adpSWIFT_PIPELINE #3102], Owner[ASCC/server1/02.014.00/407:ASCC/server1/02.014.00 #407], Owner[ESB/DevOps/Dev/ESB_KF_CI00223537/PartialESBInstall/33:ESB/DevOps/Dev/ESB_KF_CI00223537/PartialESBInstall #33], Owner[PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe/9160:PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe #9160], Owner[ESB/FS/FS_PR_INIT/24747:ESB/FS/FS_PR_INIT #24747], Owner[PPRB.OIP/kbt-scripts/ucp-corp/359:PPRB.OIP/kbt-scripts/ucp-corp #359], Owner[PPRB_DevOps/Install_EIP/Install_EIP_2_cli/733:PPRB_DevOps/Install_EIP/Install_EIP_2_cli #733], Owner[edosgo/elgo-mvd-clientverify/elgo-mvd-clientverify2/38:edosgo/elgo-mvd-clientverify/elgo-mvd-clientverify2 #38], Owner[PPRB_DevOps/Install_EIP/Install_EIP_2_cli/734:PPRB_DevOps/Install_EIP/Install_EIP_2_cli #734], Owner[MBP/mbp-ci/140:MBP/mbp-ci #140], Owner[PPRBDOC/upload_to_pipe/deprecated/QualityGateOnOurPipe/QualityGate-Order/50:PPRBDOC/upload_to_pipe/deprecated/QualityGateOnOurPipe/QualityGate-Order #50], Owner[ESB/FS/FS_CI_RFC/3478:ESB/FS/FS_CI_RFC #3478], Owner[DataFactory/stork/Build_required_distrib/122:DataFactory/stork/Build_required_distrib #122], Owner[TDS/Update_stand_by_url/920:TDS/Update_stand_by_url #920], Owner[CBDBO/Pipeline/2871:CBDBO/Pipeline #2871], 
Owner[PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend2_major-2-2018-05-27/315:PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend2_major-2-2018-05-27 #315], Owner[edosgo/elgo-fns-clientverify/elgo-fns-clientverify-release/193:edosgo/elgo-fns-clientverify/elgo-fns-clientverify-release #193], Owner[DepositPfETL/deposit-client-validation-pipeline-parameters/19:DepositPfETL/deposit-client-validation-pipeline-parameters #19], Owner[mmt/TEST/1:mmt/TEST #1], Owner[edosgo/elgo-remote-starter/323:edosgo/elgo-remote-starter #323], Owner[ESB/DevOps/Other/AUTOTESTS/PartialESBRestoreExGroup/302:ESB/DevOps/Other/AUTOTESTS/PartialESBRestoreExGroup #302], Owner[PPRB_DevOps/Install_EIP/Install_EIP_2_cli/735:PPRB_DevOps/Install_EIP/Install_EIP_2_cli #735], Owner[PPRBTEAM/Gera/gera-autodeploy-wf/116:PPRBTEAM/Gera/gera-autodeploy-wf #116], Owner[EKPiT/deploy-db-dev1/100:EKPiT/deploy-db-dev1 #100], Owner[TDS/PR_pipeline/903:TDS/PR_pipeline #903], Owner[edosgo/elgo-msh-reestrcontract/elgo-msh-reestrcontract-dev-barrier-pipeline/14:edosgo/elgo-msh-reestrcontract/elgo-msh-reestrcontract-dev-barrier-pipeline #14], Owner[ESB/FS/FS_CI_INIT/3648:ESB/FS/FS_CI_INIT #3648], Owner[KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS/4907:KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS #4907], Owner[edosgo/elgo-msh-reestrcontract/elgo-msh-reestrcontract-dev-barrier-deploy/21:edosgo/elgo-msh-reestrcontract/elgo-msh-reestrcontract-dev-barrier-deploy #21], Owner[Tengri/HDPLocalCiInsallation/151:Tengri/HDPLocalCiInsallation #151], Owner[mgr/API/PUBLISH_ALL/68435:mgr/API/PUBLISH_ALL #68435], Owner[EPS/Main_Pre_Build_Distr_DevOps2018_Pipeline/6405:EPS/Main_Pre_Build_Distr_DevOps2018_Pipeline #6405], Owner[KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS_FLAG/4533:KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS_FLAG #4533], Owner[PPRBCPRB/Pipeline_Server_Build_For_Dev_Server/10969:PPRBCPRB/Pipeline_Server_Build_For_Dev_Server #10969], Owner[PPRB_CEP/PSI_TEST_Pipe/7763:PPRB_CEP/PSI_TEST_Pipe #7763]]
      ...
      
      May 28, 2018 4:35:14 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
      INFO: KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS_FLAG #4533 completed: SUCCESS
      
      ....

      As a result, the build queue grows, since all available executors are busy.

      A workaround to defer the restart is to periodically run a script that frees the stuck executors:

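      import hudson.model.Executor
      import jenkins.model.Jenkins

      // `manager` is assumed to be the binding provided by the Groovy Postbuild plugin.
      // Work on a copy of the node list with null entries filtered out.
      def nodes = Jenkins.instance.nodes.findAll { it != null }

      nodes.each { node ->
          manager.listener.logger.println("------- PROCESSING NODE: ${node.displayName} -------")
          def computer = node.toComputer()
          if (computer == null) {
              // The node has no computer attached: remove it and move on to the next node.
              manager.listener.logger.println("------- WARNING: ${node.displayName}: NULL. Removing! -------")
              Jenkins.instance.removeNode(node)
              return
          }

          computer.executors.each { executor ->
              // A busy executor that reports progress -1 is treated as stuck.
              if (executor.busy && executor.progress == -1) {
                  manager.listener.logger.println("JOB ${executor.name} LOOKS STUCK. KILLING.")
                  // Detach the executor from its owning computer to free the slot.
                  executor.owner.removeExecutor((Executor) executor)
              }
          }
      }

      return null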
      
       

      Attachments

        1. build-pipeline-steps.PNG (44 kB)
        2. BusyTimers.png (37 kB)
        3. finished-build.PNG (66 kB)
        4. image-2018-11-27-11-31-06-881.png (462 kB)
        5. image-2018-11-27-11-32-58-457.png (230 kB)
        6. NormalCase.PNG (28 kB)
        7. sleep-test-script.PNG (12 kB)
        8. stuck-executor.png (4 kB)
        9. thread_dump_avaneesh.txt (543 kB)
        10. thread_dump.html (490 kB)

          Activity

            morlajb1 mor lajb added a comment -

            We have the same problem on LTS 2.346.1: after a Pipeline completes, the executor hangs for 30-60 minutes.
            We use the ec2-fleet and Kubernetes plugins, and it happens on both pods and EC2 instances.
            Is there any workaround besides deleting the instances and starting again?

            jglick Jesse Glick added a comment -

            Whatever the root cause may be in particular cases, https://github.com/jenkinsci/workflow-durable-task-step-plugin/releases/tag/1146.v1a_d2e603f929 should clean up automatically.

            allan_burdajewicz Allan BURDAJEWICZ added a comment - edited

            I was able to reproduce this (leaked queue items even though the Pipeline has completed) while troubleshooting a user's scenario. The simplest reproduction I could find is the following, which involves both the timeout and node steps:

                timeout(time: 20, unit: 'SECONDS') {    
                    try {
                        // Label that does not exist
                        node('doesnotexists') {
                            sh "sleep 999999"
                        }
                    } finally {  
                        // Label that does not exist
                        node('doesnotexists') {
                            sh "sleep 999999"
                        }
                    }
                }
            

            This would leak a queue item every time.

            And indeed https://github.com/jenkinsci/workflow-durable-task-step-plugin/releases/tag/1146.v1a_d2e603f929 fixes the problem in that particular case.

            jglick Jesse Glick added a comment -

            allan_burdajewicz in that case I think the issue is that timeout only delivers one interruption and waits for a grace period before escalating to a hard kill

            Body did not finish within grace period; terminating with extreme prejudice
            

            which bypasses cleanup code. If you put something liable to hang in a finally block you are triggering this scenario. Maybe timeout could escalate more smoothly but it is hard for it to tell whether its body “paid attention” to the interrupt and is actually going to process it soon or not.
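
            A minimal sketch of how the reproduction above could be restructured so that nothing liable to hang sits in a finally block inside timeout. It assumes the cleanup does not need an agent; the label and the 20-second limit are carried over from the reproduction, while the one-minute bound and the echo are placeholders chosen for illustration:

```groovy
try {
    timeout(time: 20, unit: 'SECONDS') {
        // Label that does not exist, as in the reproduction
        node('doesnotexists') {
            sh 'sleep 999999'
        }
    }
} finally {
    // Cleanup runs outside the first timeout and has its own bound
    // (1 minute is an arbitrary, illustrative value).
    timeout(time: 1, unit: 'MINUTES') {
        echo 'cleanup placeholder'
    }
}
```

            With this shape the single interruption delivered by the outer timeout only has to unwind the node step, so the grace period should not be spent waiting on cleanup code.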


            allan_burdajewicz Allan BURDAJEWICZ added a comment -

            The fix in workflow-durable-task-step 1146.v1a_d2e603f929 definitely seems to solve the problem for that scenario. I can't reproduce it anymore.

            People

              Assignee: Unassigned
              Reporter: brainsam Alexander Moiseenko
              Votes: 2
              Watchers: 22
