Jenkins / JENKINS-60667

Jobs hanging indefinitely on ec2 slaves


      Our test jobs hang indefinitely on EC2 slaves when they try to clone the workspace from master using the clone-workspace-scm plugin, or to publish HTML reports to master using the htmlpublisher plugin. I don't think the issue is specific to these two plugins, though.

       

      We also notice that this mostly happens after the EC2 slave has been up for a few days. If we take down the current slave and create a new instance, the jobs run successfully at the beginning but start hanging after a few days, so I suspect that something gets clogged over time.

       

      Our test jobs all follow the same pipeline: clone workspace -> run Maven Surefire tests -> publish test results using the HTML publisher

       

      Some issues that may be related:

      https://issues.jenkins-ci.org/browse/JENKINS-5977

      https://issues.jenkins-ci.org/browse/JENKINS-57119

       

      Any help would be much appreciated

       

      Update:

      We have since shortened the idle termination time so that new instances are created more often, and we have found a pattern in the hanging behaviour.

      It appears to happen every day around 11:00-11:30 AM UTC. We originally had two test jobs scheduled around 11:00 AM UTC; for testing purposes, we moved some other test jobs to 11:00 AM as well. Our conclusion is that ANY job that runs at that time in the EC2 cloud times out (job timeout set to 2 hours) after the Maven Surefire tests and then hangs on the HTML publish step. Once those jobs time out, any test job scheduled afterwards hangs on cloning the workspace. If we move the jobs scheduled around 11:00 AM UTC to master (also in AWS), none of them hang. Likewise, if we run these test jobs at a different time in the EC2 cloud, they finish successfully.

       

      Logs from a hanging job (timestamps are in EST, 5 hours behind UTC):

      06:48:29 Please refer to /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports for the individual test results.
      08:06:19 Build timed out (after 120 minutes). Marking the build as failed.
      08:06:19 Build was aborted
      08:06:19 [htmlpublisher] Archiving HTML reports...
      08:06:19 [htmlpublisher] Archiving at PROJECT level /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports/html to /var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/htmlreports/HTML_20Report
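
      To locate where a build goes silent, the timestamps in the console log can be compared pairwise. The following is a minimal Python sketch, not part of Jenkins or any plugin. It assumes every log line starts with a local HH:MM:SS stamp, that local time is EST (UTC-5), and that the log does not cross midnight. The function names are hypothetical, and the sample messages are abbreviated from the hanging log above.

```python
from datetime import datetime, timedelta

# Assumption: console timestamps are local EST, i.e. 5 hours behind UTC.
EST_TO_UTC = timedelta(hours=5)


def parse_stamps(lines):
    """Return (utc_time, message) pairs for lines that start with HH:MM:SS."""
    entries = []
    for line in lines:
        stamp, _, message = line.strip().partition(" ")
        try:
            local = datetime.strptime(stamp, "%H:%M:%S")
        except ValueError:
            continue  # skip lines without a timestamp prefix
        entries.append((local + EST_TO_UTC, message))
    return entries


def largest_gap(entries):
    """Return (gap, message_before, message_after) for the longest silence."""
    best = None
    for (t0, m0), (t1, m1) in zip(entries, entries[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, m0, m1)
    return best


# Messages abbreviated from the hanging job's console log.
hanging_log = [
    "06:48:29 Please refer to (surefire-reports) for the individual test results.",
    "08:06:19 Build timed out (after 120 minutes). Marking the build as failed.",
    "08:06:19 Build was aborted",
    "08:06:19 [htmlpublisher] Archiving HTML reports...",
]

entries = parse_stamps(hanging_log)
gap, before, after = largest_gap(entries)
print("last output before the stall (UTC):", entries[0][0].strftime("%H:%M:%S"))
print("longest silence:", gap)
print("stalled after:", before)
```

      On these four lines the longest silence falls between the last Surefire message and the build timeout, which matches the description above. A real console log would need the full set of lines, and the date as well if the build runs past midnight.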

      Logs from a working job:

      06:08:46 Please refer to /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports for the individual test results.
      06:08:46 [JENKINS] Recording test results
      06:08:50 [WARNING] Attempt to (de-)serialize anonymous class org.jfrog.hudson.maven2.MavenDependenciesRecorder$1; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
      06:08:50 [INFO] ------------------------------------------------------------------------
      06:08:50 [INFO] BUILD SUCCESS
      06:08:50 [INFO] ------------------------------------------------------------------------
      06:08:50 [INFO] Total time: 35:32 min
      06:08:50 [INFO] Finished at: 2020-01-16T11:08:50+00:00
      06:08:50 [INFO] Final Memory: 30M/746M
      06:08:50 [INFO] ------------------------------------------------------------------------
      06:08:50 Waiting for Jenkins to finish collecting data
      06:08:53 [JENKINS] Archiving /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/pom.xml to com.pointclickcare.automation/pcc_quality_automation/4.1.0-SNAPSHOT/pcc_quality_automation-4.1.0-SNAPSHOT.pom
      06:08:53 channel stopped
      06:08:53 [htmlpublisher] Archiving HTML reports...
      06:08:53 [htmlpublisher] Archiving at PROJECT level /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports/html to /var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/htmlreports/HTML_20Report
      06:08:53 TestNG Reports Processing: START
      06:08:53 Looking for TestNG results report in workspace using pattern: **/testng-results.xml
      06:08:54 Saving reports...
      06:08:54 Processing '/var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/builds/2/testng/testng-results.xml'
      06:08:54 100.000000% of tests were skipped, which exceeded threshold of 0%. Marking build as FAILURE
      06:08:54 TestNG Reports Processing: FINISH
      06:08:54 Build step 'Publish TestNG Results' changed build result to FAILURE
      06:08:58 [WS-CLEANUP] Deleting project workspace...
      06:08:58 [WS-CLEANUP] Deferred wipeout is used...
      06:08:58 [WS-CLEANUP] done
      

      Attachments:
        1. ec2_slave_dump.txt (24 kB, Handi Gao)
        2. master_dump.txt (143 kB, Handi Gao)

      People: thoulen (FABRIZIO MANFREDI), h35gao (Handi Gao)