JENKINS-67285

Fail fast jobs that are running on a jenkins-agent pod that has been removed


      When I use preemptible (spot) instances to host my jenkins-agent pods, the nodes are sometimes removed unexpectedly, which means that the jenkins-agent pods are removed too, even if jobs are still running on them.

      It seems that the affected jobs then hang until the timeout (if one is configured) expires, since that is how a disconnected agent is currently handled. However, the jenkins-agent pod has already been deleted and will never come back, because the k8s plugin generates a different pod name every time.

       

      14:27:43  jenkins-agent-***** was marked offline: Connection was broken: java.nio.channels.ClosedChannelException
      14:27:43  	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:142)
      14:27:43  	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
      14:27:43  	at com.sun.proxy.$Proxy101.onWebSocketClose(Unknown Source)
      14:27:43  	at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:149)
      14:27:43  	at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:394)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:225)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$Flusher.onCompleteFailure(AbstractWebSocketConnection.java:100)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.failed(IteratingCallback.java:402)
      14:27:43  	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:302)
      14:27:43  	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
      14:27:43  	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
      14:27:43  	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
      14:27:43  	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
      14:27:43  	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
      14:27:43  	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
      14:27:43  	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
      14:27:43  	at java.base/java.lang.Thread.run(Thread.java:829)
      

      In this case the job should fail fast instead of waiting for the timeout, because it is a waste of time to wait for a jenkins-agent pod that will never come back to life.

      Suggested solution:
      When dead jenkins-agent pods are cleaned up, it may be possible to cancel all jobs that were running on those pods.
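
To make the suggestion concrete, here is a minimal sketch of the intended behavior as a toy model in Python. It is not Jenkins or kubernetes-plugin code: `AgentRegistry`, `Build`, `on_pod_deleted`, and the pod and job names are all hypothetical and exist only for illustration. The assumption is that the controller can learn that a pod was deleted and knows which builds were running on it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class BuildState(Enum):
    RUNNING = "running"
    ABORTED = "aborted"


@dataclass
class Build:
    name: str
    state: BuildState = BuildState.RUNNING
    reason: str = ""


@dataclass
class AgentRegistry:
    """Toy stand-in for the controller's view of agent pods and their builds."""

    builds_by_pod: dict[str, list[Build]] = field(default_factory=dict)

    def start_build(self, pod_name: str, build: Build) -> None:
        # Record that a build is running on the given agent pod.
        self.builds_by_pod.setdefault(pod_name, []).append(build)

    def on_pod_deleted(self, pod_name: str) -> list[Build]:
        """Abort every running build on a deleted pod right away.

        Pod names are generated fresh each time, so a deleted pod never
        reconnects under the same name; waiting can only end in a timeout.
        """
        aborted = []
        for build in self.builds_by_pod.pop(pod_name, []):
            if build.state is BuildState.RUNNING:
                build.state = BuildState.ABORTED
                build.reason = f"agent pod {pod_name} was deleted"
                aborted.append(build)
        return aborted


# Hypothetical pod and job names, for illustration only.
registry = AgentRegistry()
registry.start_build("jenkins-agent-abc12", Build("job-a #1"))
registry.start_build("jenkins-agent-abc12", Build("job-b #7"))
registry.start_build("jenkins-agent-xyz99", Build("job-c #3"))

for build in registry.on_pod_deleted("jenkins-agent-abc12"):
    print(f"{build.name}: {build.state.value} ({build.reason})")
```

Only the builds on the deleted pod are aborted; builds on other pods keep running. How this would map onto the real plugin's pod cleanup is left open here.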

      Possibly related:
      https://issues.jenkins.io/browse/JENKINS-23171
      https://issues.jenkins.io/browse/JENKINS-43781
      https://issues.jenkins.io/browse/JENKINS-35246

      https://stackoverflow.com/questions/46521492/jenkins-stop-trying-to-reconnect-to-the-slave-if-it-is-offline

            Unassigned Unassigned
            dordor dor s
            Votes: 2
            Watchers: 5
