JENKINS-67285

If a jenkins-agent pod has been removed, fail fast the jobs that use that jenkins-agent pod

      Description

      When I use preemptible (spot) instances to host my jenkins-agent pods, the nodes are sometimes removed unexpectedly, which means the jenkins-agent pods are removed as well, even if jobs are still running on them.

      It seems that the jobs then hang until the timeout (if one is configured) is reached, even though the jenkins-agent pod has already been deleted and will never come back, because the k8s plugin generates a different pod name every time.

       

      14:27:43  jenkins-agent-***** was marked offline: Connection was broken: java.nio.channels.ClosedChannelException
      14:27:43  	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:142)
      14:27:43  	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
      14:27:43  	at com.sun.proxy.$Proxy101.onWebSocketClose(Unknown Source)
      14:27:43  	at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:149)
      14:27:43  	at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:394)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:225)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$Flusher.onCompleteFailure(AbstractWebSocketConnection.java:100)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.failed(IteratingCallback.java:402)
      14:27:43  	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:302)
      14:27:43  	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
      14:27:43  	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
      14:27:43  	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
      14:27:43  	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
      14:27:43  	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
      14:27:43  	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
      14:27:43  	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
      14:27:43  	at java.base/java.lang.Thread.run(Thread.java:829)
      

      In this case the job should fail fast instead of waiting for the timeout, because it is a waste of time to wait for a jenkins-agent pod that will never come back to life.

      Suggested solution:
      When cleaning up dead jenkins-agent pods, it could also cancel all jobs that were running on those pods.
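
      For reference, a rough sketch of the kind of timeout mentioned above, in Declarative Pipeline syntax. The label, stage name and the 30-minute value are only illustrative; this merely bounds how long a build can hang, it does not make it fail fast:

      // Illustrative sketch only: caps how long the build can hang if its agent pod disappears.
      pipeline {
          agent { label 'jenkins-agent' }            // assumed label of the Kubernetes pod template
          options {
              timeout(time: 30, unit: 'MINUTES')     // aborts the whole build after 30 minutes
          }
          stages {
              stage('build') {
                  steps {
                      sh 'make build'                // placeholder step
                  }
              }
          }
      }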

      Somewhat related:
      https://issues.jenkins.io/browse/JENKINS-23171
      https://issues.jenkins.io/browse/JENKINS-43781
      https://issues.jenkins.io/browse/JENKINS-35246

      https://stackoverflow.com/questions/46521492/jenkins-stop-trying-to-reconnect-to-the-slave-if-it-is-offline

            Activity

            dordor dor s added a comment -

            Hi Vincent Latombe, sorry for bothering you; it would be great if you could take a look at my issue.
            timja Tim Jacomb added a comment -

            Is this related to what you are working on, Jesse Glick?
            jglick Jesse Glick added a comment -

            Currently if a pod is deleted, the agent will be promptly removed; Jenkins will then fail the build after 5m. JENKINS-49707 would make the build automatically retry the node block on a new pod with the same configuration.
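
            For anyone reading this later: once the JENKINS-49707 work is available in the installed plugin versions, that retry can be requested explicitly around the node block. A sketch of what this could look like; the kubernetesAgent() and nonresumable() retry conditions are an assumption about that feature, not something confirmed in this thread:

            // Sketch under the assumption that the JENKINS-49707 retry conditions exist
            // in the installed Kubernetes and Pipeline plugin versions.
            podTemplate {
                retry(count: 2, conditions: [kubernetesAgent(), nonresumable()]) {
                    node(POD_LABEL) {
                        // the whole node block is re-run on a fresh pod if the agent is lost,
                        // so everything inside must be safe to repeat
                        sh 'make test'   // placeholder step
                    }
                }
            }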

            dordor dor s added a comment -

            Thank you for your response Tim Jacomb, Jesse Glick, but it seems that the issue is still there.

            My jenkins agent pod jenkins-agent-gx1q7 failed at 12:39:05 and was stuck until I manually aborted my job at 13:31:26.

             

            12:39:05  Cannot contact jenkins-agent-gx1q7: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@646e3dc9:jenkins-agent-gx1q7": Remote call on jenkins-agent-gx1q7 failed. The channel is closing down or has closed down
            13:31:26  Could not connect to jenkins-agent-gx1q7 to send interrupt signal to process
            Aborted by Foo Bar
            [Pipeline] }
            [Pipeline] // withDockerRegistry
            [Pipeline] }
            [Pipeline] // withEnv
            [Pipeline] }
            [Pipeline] // retry
            [Pipeline] }
            [Pipeline] // withEnv
            [Pipeline] }
            13:31:27  Failed in branch helm-push
            [Pipeline] // parallel
            [Pipeline] }
            [Pipeline] // stage
            [Pipeline] echo
            13:31:27  Build status: ABORTED
            [Pipeline] echo
            13:31:27  Build Error:
            [Pipeline] echo
            13:31:27  org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
            

            Are you able to reproduce this?

             

            env:

             

            Jenkins 2.319.1
            jenkins/inbound-agent:4.11-1-jdk11
            jenkins Kubernetes plugin 1.31.2

             

            jglick Jesse Glick added a comment -

            From your build log, something is not working as expected but it is hard to say what. Was the Pod removed? Was the agent removed from the list of agents in Jenkins? We have test coverage for various scenarios (run in Kind) but these do not necessarily match the behavior of a real cluster facing a particular kind of outage.

            dordor dor s added a comment - edited

            Hi Jesse Glick, here are my answers:

            > Was the Pod removed?

            Yes (assuming you are asking whether the pod has been removed from the K8S cluster)

             

            > Was the agent removed from the list of agents in Jenkins?

            No

             

            I have reproduced this behavior with the following steps:

            1. Start my declarative pipeline
            2. Follow the logs of the agent pod that runs the declarative pipeline from step 1
            3. Follow the jenkins controller logs
            4. Terminate the ec2 spot instance that hosts the agent pod running the declarative pipeline from step 1

             

            It seems that there are no logs in the jenkins agent pod saying that the jnlp process is about to go down:

             

            kubectl -n jenkins logs -f jenkins-agent-x4ncg -c jnlp
            
            Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main createEngine
            INFO: Setting up agent: jenkins-agent-x4ncg
            Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main$CuiListener <init>
            INFO: Jenkins agent is running in headless mode.
            Jan 15, 2022 10:27:12 AM hudson.remoting.Engine startEngine
            INFO: Using Remoting version: 4.11
            Jan 15, 2022 10:27:12 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
            INFO: Using /home/jenkins/agent/remoting as a remoting work directory
            Jan 15, 2022 10:27:12 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
            INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
            Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main$CuiListener status
            INFO: WebSocket connection open
            Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connected
            

             

            It seems that it took the k8s controller a few minutes to detect that the jenkins agent pod had been killed:

             

            kubectl -n jenkins get pod -o wide
            NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE                            NOMINATED NODE   READINESS GATES
            jenkins-agent-x4ncg               2/2     Running   0          47m   192.168.1.161   ip-10-0-1-56.ec2.internal   <none>           <none>
            

             

            After about 3 minutes the pod no longer appeared, as expected:

             

            sudo kubectl -n jenkins get pod -o wide
            NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
            

             

             

            Meanwhile, the job console says that the jenkins agent pod has been disconnected:

             

            11:10:14  Cannot contact jenkins-agent-x4ncg: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException

             

             

            30 minutes have passed since I terminated the ec2 spot instance that hosted the agent pod running my declarative pipeline, and the job is still stuck on the message above.

             

            Here are the logs from the jenkins master

            2022-01-15 11:06:45.430+0000 [id=596206] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8027, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:06:45.430+0000 [id=596206] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:06:45.430+0000 [id=596206] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:06:45.430+0000 [id=596206] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
            2022-01-15 11:10:14.367+0000 [id=594808] WARNING j.agents.WebSocketAgents$Session#error
            java.io.IOException: Broken pipe
                at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
                at java.base/sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
                at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:182)
                at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:130)
                at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:493)
                at java.base/java.nio.channels.SocketChannel.write(SocketChannel.java:507)
                at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:273)
            Caused: org.eclipse.jetty.io.EofException
                at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)
                at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)
                at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:277)
                at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
                at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
                at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
                at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
                at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
                at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
                at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
                at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
                at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
                at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
                at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
                at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
                at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
                at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
                at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
                at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
                at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:386)
                at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
                at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
                at java.base/java.lang.Thread.run(Thread.java:829)
            2022-01-15 11:10:14.369+0000 [id=594808] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Jetty (winstone)-594808 for jenkins-agent-x4ncg terminated: java.nio.channels.ClosedChannelException
            2022-01-15 11:11:45.430+0000 [id=596332] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
            2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
            2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8028, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:11:45.431+0000 [id=596332] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
            2022-01-15 11:16:45.430+0000 [id=596375] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
            2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
            2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8029, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:16:45.430+0000 [id=596375] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
            2022-01-15 11:21:45.430+0000 [id=596418] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
            2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
            2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8030, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:21:45.430+0000 [id=596418] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
            2022-01-15 11:26:45.430+0000 [id=596460] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
            2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
            2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8031, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:26:45.430+0000 [id=596460] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
            2022-01-15 11:31:45.430+0000 [id=596505] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
            2022-01-15 11:31:45.430+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
            2022-01-15 11:31:45.430+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8032, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:31:45.430+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:31:45.431+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:31:45.431+0000 [id=596505] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
            

             

            Something to consider: I have aws-node-termination-handler, which sends SIGTERM to all of the pods located on the host that is about to be terminated.
            But I expected to see some logs from the jnlp process indicating that it is about to go down (graceful shutdown).
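
            As an illustration only (not part of the original report): one way to give the jnlp container a chance to shut down visibly during a node drain is to set a termination grace period and a short preStop delay on the agent pod. The values below are arbitrary, the image tag is simply the one from the environment listed above, and this is a sketch, not a fix for the hang itself:

            // Sketch only: gives the agent pod some time to react to SIGTERM during node drain.
            // Grace period and sleep duration are illustrative values.
            podTemplate(yaml: '''
            apiVersion: v1
            kind: Pod
            spec:
              terminationGracePeriodSeconds: 60
              containers:
              - name: jnlp
                image: jenkins/inbound-agent:4.11-1-jdk11
                lifecycle:
                  preStop:
                    exec:
                      command: ["sh", "-c", "sleep 10"]
            ''') {
                node(POD_LABEL) {
                    sh 'make build'   // placeholder step
                }
            }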

             

            To sum things up:

            • The agent pod no longer exists in the k8s cluster
            • The agent still appears as disconnected in the jenkins agents list
            • The job is still hanging

             

             


              People

              Assignee: jthompson Jeff Thompson
              Reporter: dordor dor s
              Votes: 0
              Watchers: 3