JENKINS-51568

Pipeline jobs hanging in Build Executor even if it is finished


      Description

      We have a huge Jenkins instance, which runs about 20k builds a day.

      At some point, after a couple of days without a restart of the Jenkins master, pipeline jobs start to hang executors after the build finishes. Freestyle and Maven jobs work fine.
      A busy executor looks like:

      But the build status is "finished":

      There are records of the build start and finish in jenkins.log, but the executor wasn't released at May 28, 2018 4:35:14 PM:

      May 28, 2018 4:28:07 PM org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader$4$1 load
      WARNING: took 5,770ms to load/not load groovy.lang.GroovyObject$groovy$util$script15275137998231310805619$SSH_LOGIN from classLoader hudson.PluginManager$UberClassLoader
      2018/05/28 05:07:798 - job/KKA/job/TRIGGER_JOB_NEW_BUILD_IN_NEXUS_FLAG/ #4533 Started by timer
      May 28, 2018 4:28:07 PM org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader$4$1 load
      WARNING: took 6,783ms to load/not load groovy.lang.GroovyObject$groovy$util$script15275138109721310805619$SSH_LOGIN from classLoader hudson.PluginManager$UberClassLoader
      
      ...
      
      WARNING: Owner[PPRBTEAM/archive/Deimos/deimos-module-list-wf/23333:PPRBTEAM/archive/Deimos/deimos-module-list-wf #23333] was not in the list to begin with: [Owner[GringoTesting/Try/1:GringoTesting/Try #1], Owner[PPRB_DevOps/Install_EIP/Install_EIP_2_cli/211:PPRB_DevOps/Install_EIP/Install_EIP_2_cli #211], Owner[AutoTransaction/AutoTransaction_release_major-2018-04-30_deprecated/275:AutoTransaction/AutoTransaction_release_major-2018-04-30_deprecated #275], Owner[DataFactory/AHD/Pipeline_dev/49:DataFactory/AHD/Pipeline_dev #49], Owner[CSUO/PipelineBadCode/19:CSUO/PipelineBadCode #19], Owner[PPRB_DevOps/Install_EIP/Install_EIP_2_cli/616:PPRB_DevOps/Install_EIP/Install_EIP_2_cli #616], Owner[KKMB/bundle-barriers/barrier-sbof-commondealinf-r1.28.0/7:KKMB/bundle-barriers/barrier-sbof-commondealinf-r1.28.0 #7], Owner[ESB/FS/FS_CI_RFC_PR_8/3060:ESB/FS/FS_CI_RFC_PR_8 #3060], Owner[TFS/TestJobs/integrationOnly/809:TFS/TestJobs/integrationOnly #809], Owner[PPRB_DevOps/KBT/Install_KBT(DEV)_ear/265:PPRB_DevOps/KBT/Install_KBT(DEV)_ear #265], Owner[MBP/mbp-ci/123:MBP/mbp-ci #123], Owner[PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend1_dev/338:PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend1_dev #338], Owner[Kalita/tmp/test_pr/46:Kalita/tmp/test_pr #46], Owner[Kalita/tmp/test_pr/47:Kalita/tmp/test_pr #47], Owner[ASKOO/koo-release-build/feature%2F513/8:ASKOO/koo-release-build/feature%2F513 #8], Owner[ASKOO/koo-release-build/develop/117:ASKOO/koo-release-build/develop #117], Owner[ASKOO/koo-release-build/feature%2F513/9:ASKOO/koo-release-build/feature%2F513 #9], Owner[ASKOO/koo-release-build/develop/118:ASKOO/koo-release-build/develop #118], Owner[GBK/QG_check_minor250518/205:GBK/QG_check_minor250518 #205], Owner[GBK/QG_Check_dev/345:GBK/QG_Check_dev #345], Owner[PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend1_dev/339:PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend1_dev 
#339], Owner[GateWayDP/Gateways/Gateway_ESBGW_CI_PIPELINE/363:GateWayDP/Gateways/Gateway_ESBGW_CI_PIPELINE #363], Owner[DataFactory/stork/Build_PullRequest/524:DataFactory/stork/Build_PullRequest #524], Owner[KKA/KKA_PIPE_CI/352:KKA/KKA_PIPE_CI #352], Owner[KKA/KKA_PIPE_DEPLOY/154:KKA/KKA_PIPE_DEPLOY #154], Owner[GBK/QG_check_minor250518/206:GBK/QG_check_minor250518 #206], Owner[PPRB_DepositCashOperations/card/BuildDistr_Develop_Nexus_Prod_QG/57:PPRB_DepositCashOperations/card/BuildDistr_Develop_Nexus_Prod_QG #57], Owner[PRCRED/CBIR/AUTODEPLOY_PIPE_KENNY/33:PRCRED/CBIR/AUTODEPLOY_PIPE_KENNY #33], Owner[Kalita/Regular_clt_dev_builds/295:Kalita/Regular_clt_dev_builds #295], Owner[DataFactory/stork/Build_PullRequest/525:DataFactory/stork/Build_PullRequest #525], Owner[Kalita/tmp/test_pr/48:Kalita/tmp/test_pr #48], Owner[AEP/AEP_QG/806:AEP/AEP_QG #806], Owner[SNUiL/DevOps/CI-Builds/CI-Build-PIR-29/549:SNUiL/DevOps/CI-Builds/CI-Build-PIR-29 #549], Owner[GBK/QG_Check_dev/346:GBK/QG_Check_dev #346], Owner[GBK/QG_check_minor250518/207:GBK/QG_check_minor250518 #207], Owner[ECOD/DEVELOP/NexusArtifactFlag/1:ECOD/DEVELOP/NexusArtifactFlag #1], Owner[Kalita/Parallel_Pipeline_2/1003:Kalita/Parallel_Pipeline_2 #1003], Owner[HPSM/HPSM_pipeline/403:HPSM/HPSM_pipeline #403], Owner[ESB_CF/IIB9_PIPELINE/3751:ESB_CF/IIB9_PIPELINE #3751], Owner[DataFactory/stork/Build_PullRequest/526:DataFactory/stork/Build_PullRequest #526], Owner[PPRB_DepositCashOperations/card/BuildDistr_Develop_Nexus_Prod_QG/58:PPRB_DepositCashOperations/card/BuildDistr_Develop_Nexus_Prod_QG #58], Owner[DevOps/AHD/sleep-test/2:DevOps/AHD/sleep-test #2], Owner[ESB_SMP/FullBuild/484:ESB_SMP/FullBuild #484], Owner[DataFactory/stork/Build_PullRequest/527:DataFactory/stork/Build_PullRequest #527], Owner[SBK/SMARTREGRESS_TEST_IFT1/4:SBK/SMARTREGRESS_TEST_IFT1 #4], Owner[impprb/checkQG/249:impprb/checkQG #249], Owner[DevOps/AHD/sleep-test/4:DevOps/AHD/sleep-test #4], 
Owner[TFS/PRBuilders/BuilderPRForCore/903:TFS/PRBuilders/BuilderPRForCore #903], Owner[HPSM/HPSM_pipeline/404:HPSM/HPSM_pipeline #404], Owner[FCCM8/regular_release/41:FCCM8/regular_release #41], Owner[ESB/DevOps/Other/AUTOTESTS/DO_PR_INIT/380:ESB/DevOps/Other/AUTOTESTS/DO_PR_INIT #380], Owner[PPRB_DevOps/KBT/Install_KBT(DEV)_ear/268:PPRB_DevOps/KBT/Install_KBT(DEV)_ear #268], Owner[DataFactory/stork/Build_PullRequest/528:DataFactory/stork/Build_PullRequest #528], Owner[SNUiL/DevOps/CI-Deploy/CI-Deploy-to-testing-from-git/389:SNUiL/DevOps/CI-Deploy/CI-Deploy-to-testing-from-git #389], Owner[ASCC/ascc_full_RELEASE/02.013.00_STG-19937_jenkins_release_job/41:ASCC/ascc_full_RELEASE/02.013.00_STG-19937_jenkins_release_job #41], Owner[ASKOO/koo-release-build/feature%2F253/29:ASKOO/koo-release-build/feature%2F253 #29], Owner[GateWayDP/Gateways/Gateway_EDOGO_CI_PIPELINE/334:GateWayDP/Gateways/Gateway_EDOGO_CI_PIPELINE #334], Owner[DEPOZITORY/PB/69:DEPOZITORY/PB #69], Owner[DataFactory/stork/Auto_Test_DEV/2115:DataFactory/stork/Auto_Test_DEV #2115], Owner[GBK/QG_check_minor250518/208:GBK/QG_check_minor250518 #208], Owner[TDS/GREEN_AN_GREEN/87:TDS/GREEN_AN_GREEN #87], Owner[PPRB_DepositCashOperations/common/Publish_to_IFT_universal/320:PPRB_DepositCashOperations/common/Publish_to_IFT_universal #320], Owner[PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe/9156:PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe #9156], Owner[ASCC/server1/02.013.00/1496:ASCC/server1/02.013.00 #1496], Owner[ASCC/ascc_server_branch_build/02.013.00_STG-19937_jenkins_release_job/37:ASCC/ascc_server_branch_build/02.013.00_STG-19937_jenkins_release_job #37], Owner[ESB/FS/Meshkov/FS_CI_RFC_tst/763:ESB/FS/Meshkov/FS_CI_RFC_tst #763], Owner[PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe/9157:PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe #9157], Owner[mmt/DEV/1051:mmt/DEV #1051], 
Owner[ASKOO/koo-release-build/support%2F02.021/18:ASKOO/koo-release-build/support%2F02.021 #18], Owner[Kalita/Parallel_Pipeline_2/1005:Kalita/Parallel_Pipeline_2 #1005], Owner[DataFactory/stork/Build_PullRequest/529:DataFactory/stork/Build_PullRequest #529], Owner[SNUiL/DevOps/CI-Builds/CI-Build-SNUILDEV-3794-COURIER/249:SNUiL/DevOps/CI-Builds/CI-Build-SNUILDEV-3794-COURIER #249], Owner[ASBS/buildByPipeline/532:ASBS/buildByPipeline #532], Owner[ASCC/ascc_server_branch_build/02.014.00_STG-18499_CompositeOutCashOrders/14:ASCC/ascc_server_branch_build/02.014.00_STG-18499_CompositeOutCashOrders #14], Owner[adpSWIFT/adpSWIFT_PIPELINE/3102:adpSWIFT/adpSWIFT_PIPELINE #3102], Owner[ASCC/server1/02.014.00/407:ASCC/server1/02.014.00 #407], Owner[ESB/DevOps/Dev/ESB_KF_CI00223537/PartialESBInstall/33:ESB/DevOps/Dev/ESB_KF_CI00223537/PartialESBInstall #33], Owner[PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe/9160:PPRB_DevOps/Quality_Gate_pipes/Universal_Quality_Gate_pipe #9160], Owner[ESB/FS/FS_PR_INIT/24747:ESB/FS/FS_PR_INIT #24747], Owner[PPRB.OIP/kbt-scripts/ucp-corp/359:PPRB.OIP/kbt-scripts/ucp-corp #359], Owner[PPRB_DevOps/Install_EIP/Install_EIP_2_cli/733:PPRB_DevOps/Install_EIP/Install_EIP_2_cli #733], Owner[edosgo/elgo-mvd-clientverify/elgo-mvd-clientverify2/38:edosgo/elgo-mvd-clientverify/elgo-mvd-clientverify2 #38], Owner[PPRB_DevOps/Install_EIP/Install_EIP_2_cli/734:PPRB_DevOps/Install_EIP/Install_EIP_2_cli #734], Owner[MBP/mbp-ci/140:MBP/mbp-ci #140], Owner[PPRBDOC/upload_to_pipe/deprecated/QualityGateOnOurPipe/QualityGate-Order/50:PPRBDOC/upload_to_pipe/deprecated/QualityGateOnOurPipe/QualityGate-Order #50], Owner[ESB/FS/FS_CI_RFC/3478:ESB/FS/FS_CI_RFC #3478], Owner[DataFactory/stork/Build_required_distrib/122:DataFactory/stork/Build_required_distrib #122], Owner[TDS/Update_stand_by_url/920:TDS/Update_stand_by_url #920], Owner[CBDBO/Pipeline/2871:CBDBO/Pipeline #2871], 
Owner[PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend2_major-2-2018-05-27/315:PRCRED/EKB3/loans-for-persons-ekb3-pipeline___force-WF-and-envelope_Stend2_major-2-2018-05-27 #315], Owner[edosgo/elgo-fns-clientverify/elgo-fns-clientverify-release/193:edosgo/elgo-fns-clientverify/elgo-fns-clientverify-release #193], Owner[DepositPfETL/deposit-client-validation-pipeline-parameters/19:DepositPfETL/deposit-client-validation-pipeline-parameters #19], Owner[mmt/TEST/1:mmt/TEST #1], Owner[edosgo/elgo-remote-starter/323:edosgo/elgo-remote-starter #323], Owner[ESB/DevOps/Other/AUTOTESTS/PartialESBRestoreExGroup/302:ESB/DevOps/Other/AUTOTESTS/PartialESBRestoreExGroup #302], Owner[PPRB_DevOps/Install_EIP/Install_EIP_2_cli/735:PPRB_DevOps/Install_EIP/Install_EIP_2_cli #735], Owner[PPRBTEAM/Gera/gera-autodeploy-wf/116:PPRBTEAM/Gera/gera-autodeploy-wf #116], Owner[EKPiT/deploy-db-dev1/100:EKPiT/deploy-db-dev1 #100], Owner[TDS/PR_pipeline/903:TDS/PR_pipeline #903], Owner[edosgo/elgo-msh-reestrcontract/elgo-msh-reestrcontract-dev-barrier-pipeline/14:edosgo/elgo-msh-reestrcontract/elgo-msh-reestrcontract-dev-barrier-pipeline #14], Owner[ESB/FS/FS_CI_INIT/3648:ESB/FS/FS_CI_INIT #3648], Owner[KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS/4907:KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS #4907], Owner[edosgo/elgo-msh-reestrcontract/elgo-msh-reestrcontract-dev-barrier-deploy/21:edosgo/elgo-msh-reestrcontract/elgo-msh-reestrcontract-dev-barrier-deploy #21], Owner[Tengri/HDPLocalCiInsallation/151:Tengri/HDPLocalCiInsallation #151], Owner[mgr/API/PUBLISH_ALL/68435:mgr/API/PUBLISH_ALL #68435], Owner[EPS/Main_Pre_Build_Distr_DevOps2018_Pipeline/6405:EPS/Main_Pre_Build_Distr_DevOps2018_Pipeline #6405], Owner[KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS_FLAG/4533:KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS_FLAG #4533], Owner[PPRBCPRB/Pipeline_Server_Build_For_Dev_Server/10969:PPRBCPRB/Pipeline_Server_Build_For_Dev_Server #10969], Owner[PPRB_CEP/PSI_TEST_Pipe/7763:PPRB_CEP/PSI_TEST_Pipe #7763]]
      ...
      
      May 28, 2018 4:35:14 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
      INFO: KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS_FLAG #4533 completed: SUCCESS
      
      ....

      As a result, the build queue grows, since all available executors are busy.

      A workaround to defer the restart: periodically run a script to free the executors:

      import hudson.model.Executor
      import jenkins.model.Jenkins

      // Work on a copy of the node list and drop any null entries.
      def nodes = new ArrayList(Jenkins.instance.nodes)
      nodes.removeAll(Collections.singleton(null))

      nodes.each { node ->
          manager.listener.logger.println("------- PROCESSING NODE: ${node.displayName} -------")
          def computer = node.toComputer()
          if (computer == null) {
              // A node without a Computer cannot run builds, so remove it.
              manager.listener.logger.println("------- WARNING: ${node.displayName}: NULL. Removing! -------")
              Jenkins.instance.removeNode(node)
              return
          }

          computer.executors.each { executor ->
              // An executor that is busy with unknown progress (-1) is treated as stuck.
              if (executor.busy && executor.progress == -1) {
                  manager.listener.logger.println("EXECUTOR ${executor.name} LOOKS STUCK. KILLING.")
                  // Detach the executor from its owning Computer.
                  executor.owner.removeExecutor((Executor) executor)
              }
          }
      }

      return null
      
       

        Attachments

        1. build-pipeline-steps.PNG (44 kB)
        2. BusyTimers.png (37 kB)
        3. finished-build.PNG (66 kB)
        4. image-2018-11-27-11-31-06-881.png (462 kB)
        5. image-2018-11-27-11-32-58-457.png (230 kB)
        6. NormalCase.PNG (28 kB)
        7. sleep-test-script.PNG (12 kB)
        8. stuck-executor.png (4 kB)
        9. thread_dump.html (490 kB)


            Activity

            brainsam Alexander Moiseenko added a comment - - edited

            At the same time, I see that the sleep step is not working properly either.

            A simple job:

            node('Linux_Default') {
                sleep time: 5, unit: 'SECONDS'
                echo "Well done!"
            }
            

            It runs for hours, and starts to work properly again after a restart.

            brainsam Alexander Moiseenko added a comment -

            Possibly linked to JENKINS-46283.

            svanoort Sam Van Oort added a comment -

            Devin Nusbaum Would you be able to take a peek at this please?

            dnusbaum Devin Nusbaum added a comment - - edited

            Alexander Moiseenko I thought this might be a dupe of JENKINS-45571, but in your case it looks like the stuck executors are full Executors running on build agents rather than the flyweight executors that run on the master, so it seems like it might be something else.

            What versions of the Pipeline Groovy Plugin, Pipeline Job Plugin, Durable Task Plugin, and the Pipeline Nodes and Processes Plugin are you running?

            EDIT: Also, how is the Linux_Default agent configured inside of Jenkins (e.g. SSH, EC2, JNLP)?

            brainsam Alexander Moiseenko added a comment -

            Hello.

            Same problem again. Our new environment:

            Jenkins 2.121.1
            Pipeline Groovy 2.55
            Pipeline: Job 2.26
            Durable Task Plugin 1.26
            Pipeline Nodes and Processes 2.22

            The workaround script doesn't help anymore, probably since the Durable Task Plugin update.

            We've created a simple pipeline job with a `sleep` step to check whether the problem has reappeared:

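            pipeline {
                // Declarative Pipeline requires an agent section; 'any' is an
                // assumption here, since the posted snippet omitted it.
                agent any

                options {
                    // Fail the build if it has not finished within 60 seconds.
                    timeout(time: 60, unit: 'SECONDS')
                }

                stages {
                    stage('sleep') {
                        steps {
                            sleep 5
                        }
                    }
                }
            }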

            It finishes successfully in the normal case and fails when executors get stuck.

            svanoort Sam Van Oort added a comment - - edited

            Jesse Glick Could you please take a look? It appears that the latest comment may reflect a regression due to the controller.watch API.

            Edit: though it's not entirely clear from context

            jglick Jesse Glick added a comment -

            I am confused by the relationship between your screenshots and the thread dump. build-pipeline-steps.PNG and finished-build.PNG display build #4533. stuck-executor.PNG displays build #4908 running on a one-executor agent jenkins-agent-linux-008, and thread-dump.txt indicates that this agent is currently processing a (Git) checkout. They are not even builds of the same job: one is of KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS_FLAG, the other KKA/TRIGGER_JOB_NEW_BUILD_IN_NEXUS. Nothing about this seems improper—you have some running builds, and some completed builds. Maybe I am missing something, or maybe you chose the wrong attachments.

            I do see one anomalous thing in the thread dump: most of the pool threads for DurableTaskStep retain a Thread.name set in this block even after the block has completed and the thread is parked; WithThreadName ought to be resetting the name reliably, even before the re-schedule call. The code which adds the waiting for JNLP4-connect connection from … suffix to the thread name is here, which also looks like it should be cleaning up properly. I have no explanation for this bug, though I also see no reason to think it is related to your problem.

            davidvanlaatum David van Laatum added a comment -

            I seem to have the same problem, so I created a job with the simple sleep 5 pipeline above and set up a cron job that checks whether the job has recently failed to finish and, if so, grabs a thread dump and restarts Jenkins. I also noticed that the test job takes longer and longer in the sleep step.
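
            For readers who want to reproduce this kind of watchdog, the decision step could look roughly like the sketch below. It is not the actual cron job from this comment: `CanaryWatchdog`, `looksStuck`, and the two-minute limit are hypothetical, and fetching the inputs (for example the `building` and `timestamp` fields of the canary job's last build from the Jenkins JSON API), grabbing the thread dump, and restarting are deliberately left out.

```java
import java.time.Duration;
import java.time.Instant;

public class CanaryWatchdog {

    /**
     * Returns true when the canary build is still running although more than
     * `limit` has elapsed since it started. All names here are illustrative.
     */
    public static boolean looksStuck(boolean stillBuilding, Instant startedAt,
                                     Instant now, Duration limit) {
        if (!stillBuilding) {
            return false; // a finished build is never considered stuck
        }
        return Duration.between(startedAt, now).compareTo(limit) > 0;
    }

    public static void main(String[] args) {
        Instant started = Instant.parse("2018-05-28T16:28:07Z");
        Duration limit = Duration.ofMinutes(2); // assumed limit for a "sleep 5" job

        // 30 seconds in: still within the limit.
        System.out.println(looksStuck(true, started, started.plusSeconds(30), limit));
        // 10 minutes in and still building: treat as stuck.
        System.out.println(looksStuck(true, started, started.plusSeconds(600), limit));
    }
}
```

            Run as written, this prints false and then true. Keeping the check as a pure function makes it easy to test without a live Jenkins.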

             

            brainsam Alexander Moiseenko added a comment - - edited

            I think we've found the root cause of the problem. The broken `sleep` and `timeout` steps and the hanging executors all relate to the jenkins.util.Timer threads.

            During the hang state, the threads look like:

            And in the normal case:

            In the attached thread dump (thread-dump.txt) we can see the jenkins.util.Timer stack traces. In our case the root cause was the logfilesizechecker plugin, https://github.com/jenkinsci/logfilesizechecker-plugin/blob/master/src/main/java/hudson/plugins/logfilesizechecker/LogfilesizecheckerWrapper.java#L78

            which uses the timer every second to check the log size; this operation creates additional CPU load:

            "jenkins.util.Timer [#9]" - Thread t@267
               java.lang.Thread.State: RUNNABLE
                    at java.util.concurrent.ConcurrentSkipListMap.cpr(ConcurrentSkipListMap.java:655)
                    at java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:682)
                    at java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:781)
                    at java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1546)
                    at jenkins.model.Nodes.getNode(Nodes.java:295)
                    at jenkins.model.Jenkins.getNode(Jenkins.java:2058)
                    at hudson.model.Computer.getNode(Computer.java:590)
                    at hudson.slaves.SlaveComputer.getNode(SlaveComputer.java:199)
                    at hudson.slaves.SlaveComputer.getNode(SlaveComputer.java:96)
                    at jenkins.model.Jenkins$7.compare(Jenkins.java:1927)
                    at jenkins.model.Jenkins$7.compare(Jenkins.java:1925)
                    at java.util.TimSort.countRunAndMakeAscending(TimSort.java:360)
                    at java.util.TimSort.sort(TimSort.java:234)
                    at java.util.Arrays.sort(Arrays.java:1438)
                    at jenkins.model.Jenkins.getComputers(Jenkins.java:1925)
                    at hudson.model.Executor.of(Executor.java:941)
                    at hudson.model.Run.getExecutor(Run.java:530)
                    at hudson.plugins.logfilesizechecker.LogfilesizecheckerWrapper$LogSizeTimerTask.doRun(LogfilesizecheckerWrapper.java:107)
                    at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:51)
                    at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
                    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
                    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                    at java.lang.Thread.run(Thread.java:748)

               Locked ownable synchronizers:
                    - locked <4dd03582> (a java.util.concurrent.ThreadPoolExecutor$Worker)

             

            Since we changed the DELAY value from 1 second to 10 in https://github.com/jenkinsci/logfilesizechecker-plugin/blob/master/src/main/java/hudson/plugins/logfilesizechecker/LogfilesizecheckerWrapper.java#L46, our jenkins.util.Timer threads are mostly in the parked state and sleep works fine.
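
            The effect described here, periodic tasks crowding out other work on a shared scheduled pool, can be reproduced outside Jenkins. The sketch below is not Jenkins or plugin code: `TimerStarvationDemo`, `probeLatenessMillis`, the pool size, and all durations are made-up values, and it assumes the periodic tasks are scheduled at a fixed rate. It saturates a small `ScheduledThreadPoolExecutor` with periodic "hog" tasks and measures how late a one-shot task (standing in for `sleep`) actually fires.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class TimerStarvationDemo {

    /**
     * Schedules hogCount fixed-rate tasks that each occupy a pool thread for
     * hogWorkMillis, then schedules a one-shot probe and returns how many
     * milliseconds later than requested the probe actually ran.
     */
    public static long probeLatenessMillis(int poolSize, int hogCount, long hogWorkMillis,
                                           long hogPeriodMillis, long probeDelayMillis)
            throws InterruptedException {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(poolSize);
        try {
            for (int i = 0; i < hogCount; i++) {
                pool.scheduleAtFixedRate(() -> {
                    try {
                        Thread.sleep(hogWorkMillis); // stand-in for the per-tick work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }, 0, hogPeriodMillis, TimeUnit.MILLISECONDS);
            }

            CountDownLatch fired = new CountDownLatch(1);
            AtomicLong firedAt = new AtomicLong();
            long scheduledAt = System.nanoTime();
            pool.schedule(() -> {
                firedAt.set(System.nanoTime());
                fired.countDown();
            }, probeDelayMillis, TimeUnit.MILLISECONDS);

            if (!fired.await(10, TimeUnit.SECONDS)) {
                return Long.MAX_VALUE; // probe never ran within the safety timeout
            }
            long elapsed = TimeUnit.NANOSECONDS.toMillis(firedAt.get() - scheduledAt);
            return elapsed - probeDelayMillis;
        } finally {
            pool.shutdownNow(); // cancel the hogs and release the threads
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Idle pool: the probe should fire close to its requested 50 ms delay.
        long idle = probeLatenessMillis(2, 0, 0, 10, 50);
        // Four hogs needing 30 ms of work every 10 ms on two threads: saturated.
        long busy = probeLatenessMillis(2, 4, 30, 10, 50);
        System.out.println("lateness with idle pool:      " + idle + " ms");
        System.out.println("lateness with saturated pool: " + busy + " ms");
    }
}
```

            Because overdue fixed-rate tasks keep trigger times in the past, they stay ahead of newly scheduled work in the queue, so the probe only runs once the hogs have caught up. Lengthening the period, as with the DELAY change from 1 second to 10, reduces the load on the pool in the same way.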

             

             

            jglick Jesse Glick added a comment -

            Reassigning according to the diagnosis by the original reporter. Other comments may well describe completely unrelated issues.

            davidvanlaatum David van Laatum added a comment -

            We also have a heap of jobs with the log file size check turned on. I have disabled it to see if the problem stops happening.

            davidvanlaatum David van Laatum added a comment -

            That seems to have fixed it for us too. Jenkins has been stable since I disabled the log file size check on all builds.

            rag1 Raphael Greger added a comment -

            Hi David

            I have exactly the same problem. What was your workaround? Where did you set the log size check?

            davidvanlaatum David van Laatum added a comment -

            In the job config there is an option "Abort the build if its log file size is too big". From memory, I used the Configuration Slicer plugin to remove it from all jobs.

            rag1 Raphael Greger added a comment -

            Thank you, David. I didn't find this configuration, but I put some restrictions on the log of Job Config History. Now the problem seems to have vanished.

            mengfeil li added a comment -

            It looks like the issue still exists. Is there any update? (2.289.1)


              People

              Assignee: Unassigned
              Reporter: brainsam Alexander Moiseenko
              Votes: 2
              Watchers: 20