
If a jenkins-agent pod has been removed, fail fast the jobs that use that jenkins-agent pod

      When I use preemptible (spot) instances to host my jenkins-agent pods, the nodes are sometimes removed unexpectedly, which means that the jenkins-agent pods are removed as well, even if jobs are still running on them.

      It seems that, for some reason, the jobs get stuck (hang) until the timeout expires (if one is configured), because this is how it currently works, even though the jenkins-agent pod has already been deleted and will never come back, since the k8s plugin generates a different pod name every time.

       

      14:27:43  jenkins-agent-***** was marked offline: Connection was broken: java.nio.channels.ClosedChannelException
      14:27:43  	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:142)
      14:27:43  	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
      14:27:43  	at com.sun.proxy.$Proxy101.onWebSocketClose(Unknown Source)
      14:27:43  	at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:149)
      14:27:43  	at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:394)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:225)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$Flusher.onCompleteFailure(AbstractWebSocketConnection.java:100)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.failed(IteratingCallback.java:402)
      14:27:43  	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:302)
      14:27:43  	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
      14:27:43  	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
      14:27:43  	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
      14:27:43  	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
      14:27:43  	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
      14:27:43  	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
      14:27:43  	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
      14:27:43  	at java.base/java.lang.Thread.run(Thread.java:829)
      

      In this case the job should fail fast rather than wait for a timeout, because it is a waste of time to wait for a jenkins-agent pod that will never come back to life.
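      Until the plugin itself can fail such builds fast, an activity-based stage timeout can at least bound the wasted time. A minimal Declarative Pipeline sketch (the label, durations and build steps here are illustrative placeholders, not taken from this issue):

      pipeline {
          // 'jenkins-agent' is a placeholder label for the kubernetes pod template.
          agent { label 'jenkins-agent' }
          stages {
              stage('build') {
                  options {
                      // With activity: true the timer resets on new log output, so a stage
                      // that goes silent because its pod was deleted is aborted after 10 idle
                      // minutes instead of hanging until a long job-level timeout.
                      timeout(time: 10, unit: 'MINUTES', activity: true)
                  }
                  steps {
                      sh 'make build'
                  }
              }
          }
      }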

      Suggested solution:
      Maybe when dead jenkins-agent pods are cleaned up, it would be possible to cancel all jobs that were running on those jenkins-agent pods.
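      For illustration only (not from this thread), a rough script-console sketch of that idea, untested and not part of the kubernetes plugin; KubernetesSlave comes from the plugin, the rest is core Jenkins API. It interrupts whatever is still running on kubernetes agents whose computer has gone offline:

      import jenkins.model.Jenkins
      import org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave

      // Rough sketch: abort builds still occupying a kubernetes agent whose
      // computer is offline (for example because its pod no longer exists).
      Jenkins.get().nodes.findAll { it instanceof KubernetesSlave }.each { node ->
          def computer = node.toComputer()
          if (computer != null && computer.isOffline()) {
              computer.executors.findAll { it.isBusy() }.each { executor ->
                  // Interrupting the executor aborts the build step that is stuck on it.
                  executor.interrupt()
              }
          }
      }

      The real fix would presumably live inside the plugin's pod cleanup rather than in a manual script like this.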

      somehow related:
      https://issues.jenkins.io/browse/JENKINS-23171
      https://issues.jenkins.io/browse/JENKINS-43781
      https://issues.jenkins.io/browse/JENKINS-35246

      https://stackoverflow.com/questions/46521492/jenkins-stop-trying-to-reconnect-to-the-slave-if-it-is-offline

          [JENKINS-67285] If a jenkins-agent pod has been removed, fail fast the jobs that use that jenkins-agent pod

          dor s added a comment -

          Hi vlatombe, sorry for bothering you; it would be awesome if you could have a look at my issue.

          Tim Jacomb added a comment -

          Is this related to what you are working on, jglick?

          Jesse Glick added a comment -

          Currently if a pod is deleted, the agent will be promptly removed; Jenkins will then fail the build after 5m. JENKINS-49707 would make the build automatically retry the node block on a new pod with the same configuration.
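          A minimal Scripted Pipeline sketch of that retry idea, for anyone needing a stopgap today (illustrative only, not from this thread; the 'jenkins-agent' label and build steps are placeholders):

          // Retry the node block so that, once Jenkins gives up on the dead agent
          // (currently after roughly 5 minutes), the work is re-run on a freshly
          // provisioned pod instead of the build simply failing.
          retry(2) {
              node('jenkins-agent') {
                  checkout scm
                  sh './build.sh'
              }
          }

          Note that this does not shorten the initial wait for the channel to be declared dead; it only keeps the build from being lost once that happens.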

          dor s added a comment -

          Thank you for your response timja, jglick. But it seems that the issue is still here.

          My jenkins agent pod jenkins-agent-gx1q7 failed at 12:39:05, and the job was stuck until I manually aborted it at 13:31:26:

           

          12:39:05  Cannot contact jenkins-agent-gx1q7: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@646e3dc9:jenkins-agent-gx1q7": Remote call on jenkins-agent-gx1q7 failed. The channel is closing down or has closed down
          13:31:26  Could not connect to jenkins-agent-gx1q7 to send interrupt signal to process
          Aborted by Foo Bar
          [Pipeline] }
          [Pipeline] // withDockerRegistry
          [Pipeline] }
          [Pipeline] // withEnv
          [Pipeline] }
          [Pipeline] // retry
          [Pipeline] }
          [Pipeline] // withEnv
          [Pipeline] }
          13:31:27  Failed in branch helm-push
          [Pipeline] // parallel
          [Pipeline] }
          [Pipeline] // stage
          [Pipeline] echo
          13:31:27  Build status: ABORTED
          [Pipeline] echo
          13:31:27  Build Error:
          [Pipeline] echo
          13:31:27  org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
          

          Can't you reproduce this?

           

          env:

           

          Jenkins 2.319.1
          jenkins/inbound-agent:4.11-1-jdk11
          jenkins Kubernetes plugin 1.31.2

           

          Jesse Glick added a comment -

          From your build log, something is not working as expected but it is hard to say what. Was the Pod removed? Was the agent removed from the list of agents in Jenkins? We have test coverage for various scenarios (run in Kind) but these do not necessarily match the behavior of a real cluster facing a particular kind of outage.

          dor s added a comment - edited

          Hi jglick, here are my answers:

          > Was the Pod removed?

          Yes (assuming you are asking whether the pod has been removed from the K8S cluster)

           

          > Was the agent removed from the list of agents in Jenkins?

          No

           

          I have reproduced this behavior by doing the following steps:

          1. Start my declarative pipeline
          2. Follow the logs from the agent pod that runs my declarative pipeline from step 1
          3. Follow the jenkins controller logs
          4. Terminate the ec2 spot instance that hosts the agent pod that runs my declarative pipeline from step 1

           

          It seems that there are no logs in the jenkins agent pod saying that the jnlp process is about to go down:

           

          kubectl -n jenkins logs -f jenkins-agent-x4ncg -c jnlp
          
          Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main createEngine
          INFO: Setting up agent: jenkins-agent-x4ncg
          Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main$CuiListener <init>
          INFO: Jenkins agent is running in headless mode.
          Jan 15, 2022 10:27:12 AM hudson.remoting.Engine startEngine
          INFO: Using Remoting version: 4.11
          Jan 15, 2022 10:27:12 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
          INFO: Using /home/jenkins/agent/remoting as a remoting work directory
          Jan 15, 2022 10:27:12 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
          INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
          Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: WebSocket connection open
          Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connected
          

           

          It seems that it took the k8s controller a few minutes to detect that the jenkins agent pod had been killed:

           

          kubectl -n jenkins get pod -o wide
          NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE                            NOMINATED NODE   READINESS GATES
          jenkins-agent-x4ncg               2/2     Running   0          47m   192.168.1.161   ip-10-0-1-56.ec2.internal   <none>           <none>
          

           

          and after about 3 minutes the pod no longer appeared, as expected:

           

          sudo kubectl -n jenkins get pod -o wide
          NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
          

           

           

          Meanwhile the job console says that the jenkins agent pod has been disconnected

           

          11:10:14  Cannot contact jenkins-agent-x4ncg: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException

           

           

          30 minutes have passed since I terminated the ec2 spot instance that hosted the agent pod running my declarative pipeline, and the job is still stuck on the message above.

           

          Here are the logs from the jenkins master

          2022-01-15 11:06:45.430+0000 [id=596206] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8027, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
          2022-01-15 11:06:45.430+0000 [id=596206] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8027, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
          2022-01-15 11:06:45.430+0000 [id=596206] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
          2022-01-15 11:06:45.430+0000 [id=596206] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
          2022-01-15 11:06:45.430+0000 [id=596206] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
          2022-01-15 11:10:14.367+0000 [id=594808] WARNING j.agents.WebSocketAgents$Session#error
          java.io.IOException: Broken pipe
              at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
              at java.base/sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
              at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:182)
              at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:130)
              at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:493)
              at java.base/java.nio.channels.SocketChannel.write(SocketChannel.java:507)
              at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:273)
          Caused: org.eclipse.jetty.io.EofException
              at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)
              at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)
              at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:277)
              at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
              at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
              at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
              at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
              at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
              at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
              at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
              at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
              at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
              at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
              at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
              at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
              at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
              at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
              at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
              at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
              at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:386)
              at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
              at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
              at java.base/java.lang.Thread.run(Thread.java:829)
          2022-01-15 11:10:14.369+0000 [id=594808] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Jetty (winstone)-594808 for jenkins-agent-x4ncg terminated: java.nio.channels.ClosedChannelException
          2022-01-15 11:11:45.430+0000 [id=596332] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
          2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
          2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8028, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
          2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
          2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
          2022-01-15 11:11:45.431+0000 [id=596332] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
          2022-01-15 11:16:45.430+0000 [id=596375] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
          2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
          2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8029, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
          2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
          2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
          2022-01-15 11:16:45.430+0000 [id=596375] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
          2022-01-15 11:21:45.430+0000 [id=596418] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
          2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
          2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8030, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
          2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
          2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
          2022-01-15 11:21:45.430+0000 [id=596418] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
          2022-01-15 11:26:45.430+0000 [id=596460] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
          2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
          2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8031, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
          2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
          2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
          2022-01-15 11:26:45.430+0000 [id=596460] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
          2022-01-15 11:31:45.430+0000 [id=596505] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
          2022-01-15 11:31:45.430+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
          2022-01-15 11:31:45.430+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8032, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
          2022-01-15 11:31:45.430+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
          2022-01-15 11:31:45.431+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
          2022-01-15 11:31:45.431+0000 [id=596505] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
          

           

          Something to consider: I have aws-node-termination-handler, which sends SIGTERM to all of the pods located on the host that is about to be terminated.
          But I expected to see some logs from the jnlp process indicating that it is about to go down (graceful shutdown).

           

          To sum things up:

          • The agent pod no longer exists in the k8s cluster
          • The agent pod still appears as disconnected in the jenkins agents list
          • The job is still hanging

           

           


          dor s added a comment -

          Hi jglick, if you need any additional information, just let me know.

          Jesse Glick added a comment -

          I have no immediate plans to work on this.

            Assignee: Unassigned
            Reporter: dor s
            Votes: 2
            Watchers: 5