Details
- Type: New Feature
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Labels: None
Description
While my pipeline was running, the node that was executing logic terminated. I see this at the bottom of my console output:
Cannot contact ip-172-31-242-8.us-west-2.compute.internal: java.io.IOException: remote file operation failed: /ebs/jenkins/workspace/common-pipelines-nodeploy at hudson.remoting.Channel@48503f20:ip-172-31-242-8.us-west-2.compute.internal: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ip-172-31-242-8.us-west-2.compute.internal failed. The channel is closing down or has closed down
There's a spinning arrow below it.
I have a cron script that uses the Jenkins master CLI to remove nodes that have stopped responding. When I examine this node's page on my Jenkins site, it looks like the node is still running that job, and I see an orange label that says "Feb 22, 2018 5:16:02 PM Node is being removed".
I'm wondering whether there is a way to say, "If the channel closes down, retry the work on another node with the same label."
Things seem stuck. Please advise.
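For what it's worth, a rough sketch of what that retry could look like in a scripted pipeline, assuming a Pipeline: Basic Steps plugin recent enough to support error conditions on the retry step; the agent() condition, the 'universal' label, and the run-tests.sh step are illustrative, not taken from this build:

retry(count: 2, conditions: [agent()]) {
    node('universal') {
        // If the agent is lost mid-build, the whole body is retried on another
        // node with the same label, so the steps inside must be idempotent.
        sh './run-tests.sh'   // illustrative build step
    }
}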
Issue Links
- causes
  - JENKINS-69936 PWD returning wrong path (Resolved)
  - JENKINS-70528 node / dir / node on same agent sets PWD to that of dir rather than @2 workspace (Resolved)
- depends on
  - JENKINS-30383 SynchronousNonBlockingStepExecution should allow restart of idempotent steps (Open)
- is duplicated by
  - JENKINS-49241 pipeline hangs if slave node momentarily disconnects (Open)
  - JENKINS-47868 Pipeline durability hang when slave node disconnected (Reopened)
  - JENKINS-43781 Quickly detecting and restarting a job if the job's slave disconnects (Resolved)
  - JENKINS-57675 Pipeline steps running forever when executor fails (Resolved)
  - JENKINS-47561 Pipelines wait indefinitely for kubernetes slaves to come back online (Closed)
  - JENKINS-43607 Jenkins pipeline not aborted when the machine running docker container goes offline (Resolved)
  - JENKINS-56673 Better handling of ChannelClosedException in Declarative pipeline (Resolved)
- is related to
  - JENKINS-41854 Contextualize a fresh FilePath after an agent reconnection (Resolved)
- relates to
  - JENKINS-36013 Automatically abort ExecutorPickle rehydration from an ephemeral node (Closed)
  - JENKINS-61387 SlaveComputer not cleaned up after the channel is closed (Open)
  - JENKINS-67285 if jenkins-agent pod has removed fail fast jobs that use this jenkins-agent pod (Open)
  - JENKINS-71113 AgentErrorCondition should handle "missing workspace" error (Open)
  - JENKINS-59340 Pipeline hangs when Agent pod is Terminated (Resolved)
  - JENKINS-60507 Pipeline stuck when allocating machine | node block appears to be neither running nor scheduled (Resolved)
  - JENKINS-35246 Kubernetes agents not getting deleted in Jenkins after pods are deleted (Resolved)
  - JENKINS-70333 Default for Declarative agent retries (Open)
  - JENKINS-68963 build logs should contain if a spot agent is terminated (Open)
jglick, just wanted to thank you and everybody else who's been working on Jenkins, and to confirm that the work over on https://issues.jenkins-ci.org/browse/JENKINS-36013 appears to have handled this case in a much better way. I consider the current behavior to be a major step in the right direction for Jenkins. Here's what I noticed:
Last night, our Jenkins worker pool did its normal scheduled nightly scale down and one of the pipelines got disrupted. The message I see in my affected pipeline's console log is:
Agent ip-172-31-235-152.us-west-2.compute.internal was deleted; cancelling node body
The above-mentioned hostname is the one that Jenkins selected at the top of my declarative pipeline as a result of my call for a 'universal' machine (universal is how we label all of our workers):
pipeline {
    agent { label 'universal' }
    ...
This particular declarative pipeline runs an "sh" step inside a post{} section at the end to clean up after itself, but since the node was lost, the next error that also appears in the Jenkins console log is:
org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
This error was the result of the following code:
post {
    always {
        sh """|#!/bin/bash
            ...
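One hedged workaround sketch, until the lost node can actually be replaced: guard the cleanup step so a missing agent does not blow up the whole post section. The scripts/cleanup.sh path is hypothetical, and the catch is deliberately broad since the concrete exception class may not be usable from a sandboxed catch clause:

post {
    always {
        script {
            try {
                sh 'scripts/cleanup.sh'   // hypothetical cleanup command
            } catch (Exception e) {
                // When the agent (and its FilePath context) is gone, the sh step
                // fails with MissingContextVariableException; log and move on
                // rather than failing the post section.
                echo "Skipping cleanup, agent appears to be gone: ${e}"
            }
        }
    }
}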
Let me just point out that the recent Jenkins advancements are fantastic. Before JENKINS-36013, this pipeline would have just been stuck with no error messages; I'm so happy with this progress, you have no idea. Now, if there's any way to get this to actually retry the step it was on, such that the pipeline can tolerate losing the node, we would have the best of all worlds. At my company, the fact that a node is deleted during a scaledown is a confusing, irrelevant problem for one of my developers to grapple with. The job of my developers (the folks writing Jenkins pipelines) is to write idempotent pipeline steps, and my job is to make sure all of those steps run and the pipeline concludes with a high degree of durability.
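For reference, newer Declarative Pipeline releases add an agent-level retries option aimed at exactly this situation (the linked JENKINS-70333 discusses its default). A minimal sketch, assuming a Declarative Pipeline plugin version that supports the option; the stage name and the run-tests.sh step are illustrative:

pipeline {
    agent {
        node {
            label 'universal'
            // Provision another 'universal' node and re-run the body under this
            // agent if it is lost; the steps must be idempotent.
            retries 2
        }
    }
    stages {
        stage('Build') {
            steps {
                sh './run-tests.sh'   // illustrative build step
            }
        }
    }
}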
Keep up the great work you are all doing. This is great.