[JENKINS-49707] Auto retry for elastic agents after channel closure - Jenkins Jira

Type: New Feature
Resolution: Fixed
Priority: Critical
Component/s: workflow-durable-task-step-plugin
Labels:
None

Similar Issues:
Powered by SuggestiMate

Show
Epic Link:
Pipeline Durability
Released As:
https://github.com/jenkinsci/workflow-durable-task-step-plugin/releases/tag/1206.v8a_d5f86e336b https://github.com/jenkinsci/workflow-basic-steps-plugin/releases/tag/969.vc4ec3e4854b_f https://github.com/jenkinsci/kubernetes-plugin/releases/tag/3670.v6ca

While my pipeline was running, the node that was executing logic terminated. I see this at the bottom of my console output:

Cannot contact ip-172-31-242-8.us-west-2.compute.internal: java.io.IOException: remote file operation failed: /ebs/jenkins/workspace/common-pipelines-nodeploy at hudson.remoting.Channel@48503f20:ip-172-31-242-8.us-west-2.compute.internal: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ip-172-31-242-8.us-west-2.compute.internal failed. The channel is closing down or has closed down

There's a spinning arrow below it.

I have a cron script that uses the Jenkins master CLI to remove nodes which have stopped responding. When I examine this node's page in my Jenkins website, it looks like the node is still running that job and i see an orange label that says "Feb 22, 2018 5:16:02 PM Node is being removed".

I'm wondering what would be a better way to say "If the channel closes down, retry the work on another node with the same label?

Things seem stuck. Please advise.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

grub.remoting.logs.zip
3 kB
2018-07-04 08:19
grubSystemInformation.html
67 kB
2018-07-04 08:19
image-2018-02-22-17-27-31-541.png
56 kB
2018-02-23 01:27
image-2018-02-22-17-28-03-053.png
30 kB
2018-02-23 01:28
JavaMelodyGrubHeapDump_4_07_18.pdf
220 kB
2018-07-04 08:19
JavaMelodyNodeGrubThreads_4_07_18.pdf
9 kB
2018-07-04 08:19
jenkins_agent_devbuild9_remoting_logs.zip
4 kB
2018-06-29 01:31
jenkins_Agent_devbuild9_System_Information.html
66 kB
2018-06-29 01:31
jenkins_agents_Thread_dump.html
172 kB
2018-06-29 01:31
jenkins_support_2018-06-29_01.14.18.zip
1.26 MB
2018-06-29 01:31
jenkins.log
984 kB
2018-07-04 08:19
jobConsoleOutput.txt
12 kB
2018-07-04 08:19
jobConsoleOutput.txt
12 kB
2018-07-04 08:18
MonitoringJavaelodyOnNodes.html
44 kB
2018-07-04 08:19
NetworkAndMachineStats.png
224 kB
2018-07-04 08:19
slaveLogInMaster.grub.zip
8 kB
2018-07-04 08:19
support_2018-07-04_07.35.22.zip
956 kB
2018-07-04 08:19
threadDump.txt
98 kB
2018-11-17 22:17
Thread dump [Jenkins].html
219 kB
2018-07-04 08:19

causes

JENKINS-73618 ws step re-provisions already provisioned workspace if controller restarted in midway during build

Open

JENKINS-69936 PWD returning wrong path

Resolved

JENKINS-70528 node / dir / node on same agent sets PWD to that of dir rather than @2 workspace

Resolved

depends on

JENKINS-30383 SynchronousNonBlockingStepExecution should allow restart of idempotent steps

Resolved

is duplicated by

JENKINS-49241 pipeline hangs if slave node momentarily disconnects

Open

JENKINS-47868 Pipeline durability hang when slave node disconnected

Reopened

JENKINS-43781 Quickly detecting and restarting a job if the job's slave disconnects

Resolved

JENKINS-57675 Pipeline steps running forever when executor fails

Resolved

JENKINS-47561 Pipelines wait indefinitely for kubernetes slaves to come back online

Closed

JENKINS-43607 Jenkins pipeline not aborted when the machine running docker container goes offline

Resolved

JENKINS-56673 Better handling of ChannelClosedException in Declarative pipeline

Resolved

is related to

JENKINS-41854 Contextualize a fresh FilePath after an agent reconnection

Resolved

relates to

JENKINS-36013 Automatically abort ExecutorPickle rehydration from an ephemeral node

Closed

JENKINS-61387 SlaveComputer not cleaned up after the channel is closed

Open

JENKINS-67285 if jenkins-agent pod has removed fail fast jobs that use this jenkins-agent pod

Open

JENKINS-71113 AgentErrorCondition should handle "missing workspace" error

Open

JENKINS-60507 Pipeline stuck when allocating machine | node block appears to be neither running nor scheduled

Reopened

JENKINS-59340 Pipeline hangs when Agent pod is Terminated

Resolved

JENKINS-35246 Kubernetes agents not getting deleted in Jenkins after pods are deleted

Resolved

JENKINS-70333 Default for Declarative agent retries

Open

JENKINS-68963 build logs should contain if a spot agent is terminated

Open

links to

CloudBees-internal issue

jenkins-infra/docker-jenkins-weekly #512

jenkins-infra/pipeline-library #405

kubernetes-plugin #1083

pipeline-model-definition-plugin #533

workflow-api-plugin #217

workflow-basic-steps-plugin #195

workflow-durable-task-step-plugin #180

workflow-durable-task-step-plugin #254

(6 is duplicated by, 1 is related to, 9 relates to, 9 links to)

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates