[JENKINS-60667] Jobs hanging indefinitely on ec2 slaves

Type: Bug
Resolution: Unresolved
Priority: Major
Component/s: clone-workspace-scm-plugin, core, ec2-plugin, htmlpublisher-plugin, maven-plugin
Labels:
None
Environment:
Jenkins: 2.204.1 (base image: jenkinsci/blueocean:1.21.0)
ec2 plugin: 1.47
maven plugin: 3.4
clone-workspace-scm plugin: 0.6
htmlpublisher plugin: 1.21
surefire: 2.17

We have test jobs hanging indefinitely on EC2 slaves when the jobs try to clone the workspace from master using the clone-workspace-scm plugin, or publish HTML reports to master using the htmlpublisher plugin. I don't think the issue is caused by these two plugins, though.

We also noticed that the issue mostly happens after an EC2 slave has been up for a few days: if we take down the current slave and create a new instance, the jobs run successfully at the beginning but start hanging after a few days. So I suspect that something gets clogged over time.

Our test jobs have the same pipeline: clone workspace -> run maven surefire for testing -> publish test results using html publisher

Some issues that may be related:

https://issues.jenkins-ci.org/browse/JENKINS-5977

https://issues.jenkins-ci.org/browse/JENKINS-57119

Any help would be much appreciated

Update:

We have now shortened the idle termination time so that we get new instances more often, and we found a pattern in the hanging behaviour.

It appears to happen every day around 11:00-11:30 AM UTC. We originally had two test jobs scheduled around 11:00 AM UTC. For testing purposes, we changed the schedule of some other test jobs to 11:00 AM as well. Our conclusion is that ANY job that runs at that time in the EC2 cloud will time out (the job timeout is set to 2 hours) after the Maven Surefire tests and then hang on the HTML publish step. Once those jobs time out, any new test jobs scheduled after that hang on cloning the workspace. If we move the jobs scheduled around 11:00 AM UTC to master (also in AWS), none of them hang. Also, if we run these test jobs at a different time in the EC2 cloud, they finish successfully.

Logs from a hanging job (we are in EST, so timestamps are 5 hours behind UTC):

06:48:29 Please refer to /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports for the individual test results.
08:06:19 Build timed out (after 120 minutes). Marking the build as failed.
08:06:19 Build was aborted
08:06:19 [htmlpublisher] Archiving HTML reports...
08:06:19 [htmlpublisher] Archiving at PROJECT level /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports/html to /var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/htmlreports/HTML_20Report
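
For anyone triaging this, here is a minimal Python sketch (not part of the original report) that works out how long the console stayed silent between the last Surefire line and the timeout message, and what the last output time is in UTC. The helper names are made up for illustration, and the fixed 5-hour offset is an assumption taken from the EST note above.

```python
from datetime import datetime, timedelta

# Timestamps copied from the hanging job's console log (local EST wall-clock time).
LAST_SUREFIRE_LINE = "06:48:29"
TIMEOUT_LINE = "08:06:19"

# Assumption: a fixed 5-hour offset between EST and UTC, as stated in the report.
EST_TO_UTC = timedelta(hours=5)


def parse(ts: str) -> datetime:
    """Parse an HH:MM:SS console timestamp (strptime attaches a placeholder date)."""
    return datetime.strptime(ts, "%H:%M:%S")


def silent_gap(before: str, after: str) -> timedelta:
    """Time elapsed between two same-day console timestamps."""
    return parse(after) - parse(before)


def to_utc(ts: str) -> str:
    """Shift an EST console timestamp to UTC and format it as HH:MM:SS."""
    return (parse(ts) + EST_TO_UTC).strftime("%H:%M:%S")


if __name__ == "__main__":
    print("console silent for:", silent_gap(LAST_SUREFIRE_LINE, TIMEOUT_LINE))
    print("last output at:", to_utc(LAST_SUREFIRE_LINE), "UTC")
```

With the timestamps above, the job produced no console output for 1 hour 17 minutes 50 seconds before the timeout fired, and the last Surefire line was written at 11:48:29 UTC.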

Logs from a working job:

06:08:46 Please refer to /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports for the individual test results.
06:08:46 [JENKINS] Recording test results
06:08:50 [WARNING] Attempt to (de-)serialize anonymous class org.jfrog.hudson.maven2.MavenDependenciesRecorder$1; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
06:08:50 [INFO] ------------------------------------------------------------------------
06:08:50 [INFO] BUILD SUCCESS
06:08:50 [INFO] ------------------------------------------------------------------------
06:08:50 [INFO] Total time: 35:32 min
06:08:50 [INFO] Finished at: 2020-01-16T11:08:50+00:00
06:08:50 [INFO] Final Memory: 30M/746M
06:08:50 [INFO] ------------------------------------------------------------------------
06:08:50 Waiting for Jenkins to finish collecting data
06:08:53 [JENKINS] Archiving /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/pom.xml to com.pointclickcare.automation/pcc_quality_automation/4.1.0-SNAPSHOT/pcc_quality_automation-4.1.0-SNAPSHOT.pom
06:08:53 channel stopped
06:08:53 [htmlpublisher] Archiving HTML reports...
06:08:53 [htmlpublisher] Archiving at PROJECT level /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports/html to /var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/htmlreports/HTML_20Report
06:08:53 TestNG Reports Processing: START
06:08:53 Looking for TestNG results report in workspace using pattern: **/testng-results.xml
06:08:54 Saving reports...
06:08:54 Processing '/var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/builds/2/testng/testng-results.xml'
06:08:54 100.000000% of tests were skipped, which exceeded threshold of 0%. Marking build as FAILURE
06:08:54 TestNG Reports Processing: FINISH
06:08:54 Build step 'Publish TestNG Results' changed build result to FAILURE
06:08:58 [WS-CLEANUP] Deleting project workspace...
06:08:58 [WS-CLEANUP] Deferred wipeout is used...
06:08:58 [WS-CLEANUP] done
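
As a sanity check on the time zone, the working job's log carries both a console timestamp (06:08:50) and Maven's own "Finished at" value in UTC. The following Python sketch, which is only an illustration and assumes a fixed UTC-5 offset with no daylight saving (the run is in January), confirms that the two agree. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: console timestamps are EST with a fixed UTC-5 offset (no DST in January).
EST = timezone(timedelta(hours=-5), "EST")


def console_to_utc(date: str, console_time: str) -> datetime:
    """Combine a calendar date with an EST console timestamp and convert to UTC."""
    local = datetime.strptime(f"{date} {console_time}", "%Y-%m-%d %H:%M:%S")
    return local.replace(tzinfo=EST).astimezone(timezone.utc)


if __name__ == "__main__":
    # Date and UTC time taken from the "[INFO] Finished at:" line of the working job.
    maven_finished = datetime.fromisoformat("2020-01-16T11:08:50+00:00")
    converted = console_to_utc("2020-01-16", "06:08:50")
    print("console time in UTC:", converted.isoformat())
    print("matches Maven timestamp:", converted == maven_finished)
```

The same conversion can be applied to the console timestamps of the hanging job when comparing them against the UTC schedule.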

Attachments:
ec2_slave_dump.txt (24 kB, 2020-01-06 23:36)
master_dump.txt (143 kB, 2020-01-06 23:36)

Handi Gao created issue - 2020-01-06 23:49

Handi Gao made changes - 2020-01-07 00:01

Description


Handi Gao made changes - 2020-01-07 19:20

Component/s

New: maven-plugin [ 16033 ]

Handi Gao made changes - 2020-01-07 19:25

Environment

New: Jenkins: 2.204.1 (base image: jenkinsci/blueocean:1.21.0)
ec2 plugin: 1.47
maven plugin: 3.4
clone-workspace-scm plugin: 0.6
htmlpublisher plugin: 1.21

Handi Gao made changes - 2020-01-17 19:17

Description


Handi Gao made changes - 2020-01-17 19:17

Description


Handi Gao made changes - 2020-01-17 19:19

Description


Handi Gao made changes - 2020-01-17 19:48

Environment

Original: Jenkins: 2.204.1 (base image: jenkinsci/blueocean:1.21.0)
ec2 plugin: 1.47
maven plugin: 3.4
clone-workspace-scm plugin: 0.6
htmlpublisher plugin: 1.21

New: Jenkins: 2.204.1 (base image: jenkinsci/blueocean:1.21.0)
ec2 plugin: 1.47
maven plugin: 3.4
clone-workspace-scm plugin: 0.6
htmlpublisher plugin: 1.21
surefire: 2.17

Assignee: FABRIZIO MANFREDI
Reporter: Handi Gao
Votes: 0
Watchers: 2
Created: 2020-01-06 23:49
Updated: 2020-01-17 19:48
