JENKINS-60667

Jobs hanging indefinitely on ec2 slaves

      Description

      We have test jobs hanging indefinitely on ec2 slaves when the jobs try to clone the workspace from master using the clone-workspace-scm plugin or publish HTML reports to master using the htmlpublisher plugin. I don't think the issue is related to these two plugins, though.

       

      We also noticed that this issue mostly happens after the ec2 slave has been up for a few days, i.e. if we take down the current slave and create a new instance, the jobs run successfully at the beginning but start hanging after a few days. So I suspect that something gets clogged over time.

       

      Our test jobs have the same pipeline: clone workspace -> run maven surefire for testing -> publish test results using html publisher

       

      Some issues that may be related:

      https://issues.jenkins-ci.org/browse/JENKINS-5977

      https://issues.jenkins-ci.org/browse/JENKINS-57119

       

      Any help would be much appreciated

       

      Update:

      We have now shortened the Idle termination time so that we get new instances more often, and we found a pattern in this hanging behaviour.

      It appears to happen every day around 11:00-11:30 AM UTC. We originally had two test jobs scheduled around 11:00 AM UTC. For testing purposes, we changed the schedule of some other test jobs to 11:00 AM as well. Our conclusion is that ANY job that runs at that time in the ec2 cloud will time out (job timeout is set to 2 hours) after the maven surefire tests and hang on html publish. Once the jobs time out, any new test jobs scheduled after that will hang on cloning the workspace. If we move the jobs scheduled around 11:00 AM UTC to master (also in AWS), none of the jobs hang. Also, if we run these test jobs at a different time in the ec2 cloud, they finish successfully as well.

       

      Logs from a hanging job (we are in EST, so 5 hours behind UTC):

      06:48:29 Please refer to /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports for the individual test results.
      08:06:19 Build timed out (after 120 minutes). Marking the build as failed.
      08:06:19 Build was aborted
      08:06:19 [htmlpublisher] Archiving HTML reports...
      08:06:19 [htmlpublisher] Archiving at PROJECT level /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports/html to /var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/htmlreports/HTML_20Report
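
      The stall in the log above can be located by measuring the gap between consecutive console timestamps. The following is only a minimal sketch: the helper names (`parse_time`, `largest_gap`, `estimated_start`) are hypothetical, the long log messages are shortened, and the start-time estimate assumes the 120-minute timeout is counted from the start of the build.

```python
from datetime import datetime, timedelta

# Timestamps copied from the hanging job's console log (local time, EST).
# The message text is shortened here; only the leading timestamps matter.
LOG_LINES = [
    "06:48:29 Please refer to (surefire-reports path) for the individual test results.",
    "08:06:19 Build timed out (after 120 minutes). Marking the build as failed.",
    "08:06:19 Build was aborted",
    "08:06:19 [htmlpublisher] Archiving HTML reports...",
]


def parse_time(line):
    """Parse the leading HH:MM:SS timestamp of a console log line."""
    return datetime.strptime(line.split(" ", 1)[0], "%H:%M:%S")


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest silent period."""
    times = [parse_time(line) for line in lines]
    gaps = [
        (later - earlier, index)
        for index, (earlier, later) in enumerate(zip(times, times[1:]))
    ]
    gap, index = max(gaps)
    return gap, lines[index], lines[index + 1]


def estimated_start(timeout_line, timeout_minutes):
    """Estimate the build start, assuming the timeout counts from build start."""
    return parse_time(timeout_line) - timedelta(minutes=timeout_minutes)


if __name__ == "__main__":
    gap, before, after = largest_gap(LOG_LINES)
    print("Longest silence:", gap)
    print("  after :", before)
    print("  before:", after)
    start = estimated_start(LOG_LINES[1], 120)
    print("Estimated build start (EST):", start.strftime("%H:%M:%S"))
```

      Under these assumptions the job was silent for 1:17:50 after the surefire summary, and the estimated start of 06:06:19 EST corresponds to 11:06:19 UTC, which falls inside the reported 11:00-11:30 AM UTC window.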

      Logs from a working job:

      06:08:46 Please refer to /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports for the individual test results.
      06:08:46 [JENKINS] Recording test results
      06:08:50 [WARNING] Attempt to (de-)serialize anonymous class org.jfrog.hudson.maven2.MavenDependenciesRecorder$1; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
      06:08:50 [INFO] ------------------------------------------------------------------------
      06:08:50 [INFO] BUILD SUCCESS
      06:08:50 [INFO] ------------------------------------------------------------------------
      06:08:50 [INFO] Total time: 35:32 min
      06:08:50 [INFO] Finished at: 2020-01-16T11:08:50+00:00
      06:08:50 [INFO] Final Memory: 30M/746M
      06:08:50 [INFO] ------------------------------------------------------------------------
      06:08:50 Waiting for Jenkins to finish collecting data
      06:08:53 [JENKINS] Archiving /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/pom.xml to com.pointclickcare.automation/pcc_quality_automation/4.1.0-SNAPSHOT/pcc_quality_automation-4.1.0-SNAPSHOT.pom
      06:08:53 channel stopped
      06:08:53 [htmlpublisher] Archiving HTML reports...
      06:08:53 [htmlpublisher] Archiving at PROJECT level /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports/html to /var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/htmlreports/HTML_20Report
      06:08:53 TestNG Reports Processing: START
      06:08:53 Looking for TestNG results report in workspace using pattern: **/testng-results.xml
      06:08:54 Saving reports...
      06:08:54 Processing '/var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/builds/2/testng/testng-results.xml'
      06:08:54 100.000000% of tests were skipped, which exceeded threshold of 0%. Marking build as FAILURE
      06:08:54 TestNG Reports Processing: FINISH
      06:08:54 Build step 'Publish TestNG Results' changed build result to FAILURE
      06:08:58 [WS-CLEANUP] Deleting project workspace...
      06:08:58 [WS-CLEANUP] Deferred wipeout is used...
      06:08:58 [WS-CLEANUP] done
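
      Because the console timestamps are in local time while the schedule is discussed in UTC, it helps to convert between the two when comparing logs. The sketch below assumes a fixed UTC-5 offset (EST, as stated above) and uses a hypothetical helper name, `console_time_to_utc`. It can be checked against the working job's log, where the console line stamped 06:08:50 carries Maven's own UTC timestamp `2020-01-16T11:08:50+00:00`.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset, matching "we are in EST so 5 hours behind UTC".
EST = timezone(timedelta(hours=-5), "EST")


def console_time_to_utc(date_str, time_str):
    """Convert a local (EST) console timestamp to an aware UTC datetime."""
    local = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    return local.replace(tzinfo=EST).astimezone(timezone.utc)


if __name__ == "__main__":
    # Console line "06:08:50 [INFO] Finished at: 2020-01-16T11:08:50+00:00"
    utc = console_time_to_utc("2020-01-16", "06:08:50")
    print(utc.isoformat())
```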
      

          Activity

          h35gao Handi Gao created issue -
          h35gao Handi Gao made changes -
          Component/s maven-plugin [ 16033 ]
          h35gao Handi Gao made changes -
          Environment Jenkins: 2.204.1 (base image: jenkinsci/blueocean:1.21.0)
          ec2 plugin: 1.47
          maven plugin: 3.4
          clone-workspace-scm plugin: 0.6
          htmlpublisher plugin: 1.21
          h35gao Handi Gao made changes -
          Environment (original value):
          Jenkins: 2.204.1 (base image: jenkinsci/blueocean:1.21.0)
          ec2 plugin: 1.47
          maven plugin: 3.4
          clone-workspace-scm plugin: 0.6
          htmlpublisher plugin: 1.21
          Environment (new value):
          Jenkins: 2.204.1 (base image: jenkinsci/blueocean:1.21.0)
          ec2 plugin: 1.47
          maven plugin: 3.4
          clone-workspace-scm plugin: 0.6
          htmlpublisher plugin: 1.21
          surefire: 2.17

            People

            Assignee:
            thoulen FABRIZIO MANFREDI
            Reporter:
            h35gao Handi Gao
            Votes:
            0
            Watchers:
            2
