Type: Bug
Resolution: Unresolved
Priority: Minor
Labels: None
When I use preemptible (spot) instances to host my jenkins-agent pods, the nodes are sometimes removed unexpectedly. This deletes the jenkins-agent pods as well, even while jobs are still running on them.
The affected jobs then hang until the build timeout expires (if one is configured), even though the jenkins-agent pod has already been deleted and will never reconnect, because the Kubernetes plugin generates a different pod name each time.
14:27:43 jenkins-agent-***** was marked offline: Connection was broken: java.nio.channels.ClosedChannelException
	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:142)
	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
	at com.sun.proxy.$Proxy101.onWebSocketClose(Unknown Source)
	at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:149)
	at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:394)
	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:225)
	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$Flusher.onCompleteFailure(AbstractWebSocketConnection.java:100)
	at org.eclipse.jetty.util.IteratingCallback.failed(IteratingCallback.java:402)
	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:302)
	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
	at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
	at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
	at java.base/java.lang.Thread.run(Thread.java:829)
In this case the job should fail fast instead of waiting for the timeout, because waiting for a jenkins-agent pod that will never come back is a waste of time.
Suggested solution:
When dead jenkins-agent pods are cleaned up, it might be possible to also cancel all jobs that were running on those pods.
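As a possible workaround until such a fix exists, the body of the build can be wrapped in a `retry` step so it is rerun on a fresh pod when the agent is lost. This is only a sketch: it assumes a recent Kubernetes plugin that provides the `kubernetesAgent()` retry condition (the outcome of JENKINS-49707) together with the `nonresumable()` condition, and the image and build command are placeholders.

```groovy
// Sketch of a scripted pipeline that retries on a new pod when the
// agent channel is lost (e.g. the spot node backing it was preempted).
// Assumes kubernetesAgent() and nonresumable() retry conditions are
// available; container image and build command are illustrative only.
podTemplate(containers: [
    containerTemplate(name: 'maven',
                      image: 'maven:3.9-eclipse-temurin-17',
                      command: 'sleep', args: 'infinity')
]) {
    retry(count: 2, conditions: [kubernetesAgent(), nonresumable()]) {
        node(POD_LABEL) {
            container('maven') {
                sh 'mvn -B verify'
            }
        }
    }
}
```

With these conditions, losing the pod triggers an immediate retry on a new agent instead of the build hanging until its timeout.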
Possibly related:
https://issues.jenkins.io/browse/JENKINS-23171
https://issues.jenkins.io/browse/JENKINS-43781
https://issues.jenkins.io/browse/JENKINS-35246
Relates to:
JENKINS-49707 Auto retry for elastic agents after channel closure (Resolved)