JENKINS-67285

If a jenkins-agent pod has been removed, fail fast the jobs that use that jenkins-agent pod

      Description

      When I use preemptible (spot) instances to host my jenkins-agent pods, the nodes are sometimes removed unexpectedly, which means the jenkins-agent pods are removed as well, even if jobs are still running on them.

      It seems that the jobs then hang until the timeout (if one is configured) is reached, even though the jenkins-agent pod has already been deleted and will never come back, because the k8s plugin generates a different pod name every time.

       

      14:27:43  jenkins-agent-***** was marked offline: Connection was broken: java.nio.channels.ClosedChannelException
      14:27:43  	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:142)
      14:27:43  	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
      14:27:43  	at com.sun.proxy.$Proxy101.onWebSocketClose(Unknown Source)
      14:27:43  	at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:149)
      14:27:43  	at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:394)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:225)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$Flusher.onCompleteFailure(AbstractWebSocketConnection.java:100)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.failed(IteratingCallback.java:402)
      14:27:43  	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:302)
      14:27:43  	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
      14:27:43  	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
      14:27:43  	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
      14:27:43  	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
      14:27:43  	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
      14:27:43  	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
      14:27:43  	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
      14:27:43  	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
      14:27:43  	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
      14:27:43  	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
      14:27:43  	at java.base/java.lang.Thread.run(Thread.java:829)
      

      In this case the job should fail fast instead of waiting for the timeout, because it is a waste of time to wait for a jenkins-agent pod that will never come back to life.

      Suggested solution:
      When cleaning up dead jenkins-agent pods, it could also cancel all jobs that were running on those pods.
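
      For reference, a rough sketch of the kind of timeout mentioned above, in Declarative Pipeline syntax. The label, stage name and the 30-minute value are only illustrative; this merely bounds how long a build can hang, it does not make it fail fast:

      // Illustrative sketch only: caps how long the build can hang if its agent pod disappears.
      pipeline {
          agent { label 'jenkins-agent' }            // assumed label of the Kubernetes pod template
          options {
              timeout(time: 30, unit: 'MINUTES')     // aborts the whole build after 30 minutes
          }
          stages {
              stage('build') {
                  steps {
                      sh 'make build'                // placeholder step
                  }
              }
          }
      }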

      Somewhat related:
      https://issues.jenkins.io/browse/JENKINS-23171
      https://issues.jenkins.io/browse/JENKINS-43781
      https://issues.jenkins.io/browse/JENKINS-35246

      https://stackoverflow.com/questions/46521492/jenkins-stop-trying-to-reconnect-to-the-slave-if-it-is-offline

            Activity

            dordor dor s added a comment -

            Hi Vincent Latombe, sorry for bothering you; it would be great if you could take a look at my issue.
            timja Tim Jacomb added a comment -

            Is this related to what you are working on, Jesse Glick?
            jglick Jesse Glick added a comment -

            Currently if a pod is deleted, the agent will be promptly removed; Jenkins will then fail the build after 5m. JENKINS-49707 would make the build automatically retry the node block on a new pod with the same configuration.
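
            For anyone reading this later: once the JENKINS-49707 work is available in the installed plugin versions, that retry can be requested explicitly around the node block. A sketch of what this could look like; the kubernetesAgent() and nonresumable() retry conditions are an assumption about that feature, not something confirmed in this thread:

            // Sketch under the assumption that the JENKINS-49707 retry conditions exist
            // in the installed Kubernetes and Pipeline plugin versions.
            podTemplate {
                retry(count: 2, conditions: [kubernetesAgent(), nonresumable()]) {
                    node(POD_LABEL) {
                        // the whole node block is re-run on a fresh pod if the agent is lost,
                        // so everything inside must be safe to repeat
                        sh 'make test'   // placeholder step
                    }
                }
            }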

            dordor dor s added a comment -

            Thank you for your response Tim Jacomb, Jesse Glick, but it seems that the issue is still there.

            My jenkins agent pod jenkins-agent-gx1q7 failed at 12:39:05 and was stuck until I manually aborted my job at 13:31:26.

             

            12:39:05  Cannot contact jenkins-agent-gx1q7: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@646e3dc9:jenkins-agent-gx1q7": Remote call on jenkins-agent-gx1q7 failed. The channel is closing down or has closed down
            13:31:26  Could not connect to jenkins-agent-gx1q7 to send interrupt signal to process
            Aborted by Foo Bar
            [Pipeline] }
            [Pipeline] // withDockerRegistry
            [Pipeline] }
            [Pipeline] // withEnv
            [Pipeline] }
            [Pipeline] // retry
            [Pipeline] }
            [Pipeline] // withEnv
            [Pipeline] }
            13:31:27  Failed in branch helm-push
            [Pipeline] // parallel
            [Pipeline] }
            [Pipeline] // stage
            [Pipeline] echo
            13:31:27  Build status: ABORTED
            [Pipeline] echo
            13:31:27  Build Error:
            [Pipeline] echo
            13:31:27  org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
            

            Are you able to reproduce this?

             

            env:

             

            Jenkins 2.319.1
            jenkins/inbound-agent:4.11-1-jdk11
            jenkins Kubernetes plugin 1.31.2

             

            jglick Jesse Glick added a comment -

            From your build log, something is not working as expected but it is hard to say what. Was the Pod removed? Was the agent removed from the list of agents in Jenkins? We have test coverage for various scenarios (run in Kind) but these do not necessarily match the behavior of a real cluster facing a particular kind of outage.

            dordor dor s added a comment - edited

            Hi Jesse Glick, here are my answers:

            > Was the Pod removed?

            Yes (assuming you are asking whether the pod has been removed from the K8S cluster)

             

            > Was the agent removed from the list of agents in Jenkins?

            No

             

            I have reproduced this behavior with the following steps:

            1. Start my declarative pipeline
            2. Follow the logs of the agent pod that runs the declarative pipeline from step 1
            3. Follow the jenkins controller logs
            4. Terminate the ec2 spot instance that hosts the agent pod running the declarative pipeline from step 1

             

            It seems that there are no logs in the jenkins agent pod saying that the jnlp process is about to go down:

             

            kubectl -n jenkins logs -f jenkins-agent-x4ncg -c jnlp
            
            Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main createEngine
            INFO: Setting up agent: jenkins-agent-x4ncg
            Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main$CuiListener <init>
            INFO: Jenkins agent is running in headless mode.
            Jan 15, 2022 10:27:12 AM hudson.remoting.Engine startEngine
            INFO: Using Remoting version: 4.11
            Jan 15, 2022 10:27:12 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
            INFO: Using /home/jenkins/agent/remoting as a remoting work directory
            Jan 15, 2022 10:27:12 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
            INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
            Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main$CuiListener status
            INFO: WebSocket connection open
            Jan 15, 2022 10:27:12 AM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connected
            

             

            It seems that it took the k8s controller a few minutes to detect that the jenkins agent pod had been killed:

             

            kubectl -n jenkins get pod -o wide
            NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE                            NOMINATED NODE   READINESS GATES
            jenkins-agent-x4ncg               2/2     Running   0          47m   192.168.1.161   ip-10-0-1-56.ec2.internal   <none>           <none>
            

             

            After about 3 minutes the pod no longer appeared, as expected:

             

            sudo kubectl -n jenkins get pod -o wide
            NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
            

             

             

            Meanwhile, the job console says that the jenkins agent pod has been disconnected:

             

            11:10:14  Cannot contact jenkins-agent-x4ncg: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException

             

             

            30 minutes have passed since I terminated the ec2 spot instance that hosted the agent pod running my declarative pipeline, and the job is still stuck on the message above.

             

            Here are the logs from the jenkins master

            2022-01-15 11:06:45.430+0000 [id=596206] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8027, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:06:45.430+0000 [id=596206] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:06:45.430+0000 [id=596206] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:06:45.430+0000 [id=596206] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
            2022-01-15 11:10:14.367+0000 [id=594808] WARNING j.agents.WebSocketAgents$Session#error
            java.io.IOException: Broken pipe
                at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
                at java.base/sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
                at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:182)
                at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:130)
                at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:493)
                at java.base/java.nio.channels.SocketChannel.write(SocketChannel.java:507)
                at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:273)
            Caused: org.eclipse.jetty.io.EofException
                at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)
                at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)
                at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:277)
                at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
                at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
                at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
                at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
                at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
                at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
                at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
                at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
                at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
                at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
                at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
                at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
                at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
                at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
                at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
                at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
                at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:386)
                at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
                at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
                at java.base/java.lang.Thread.run(Thread.java:829)
            2022-01-15 11:10:14.369+0000 [id=594808] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Jetty (winstone)-594808 for jenkins-agent-x4ncg terminated: java.nio.channels.ClosedChannelException
            2022-01-15 11:11:45.430+0000 [id=596332] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
            2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
            2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8028, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:11:45.430+0000 [id=596332] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:11:45.431+0000 [id=596332] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
            2022-01-15 11:16:45.430+0000 [id=596375] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
            2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
            2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8029, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:16:45.430+0000 [id=596375] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:16:45.430+0000 [id=596375] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
            2022-01-15 11:21:45.430+0000 [id=596418] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
            2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
            2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8030, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:21:45.430+0000 [id=596418] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:21:45.430+0000 [id=596418] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
            2022-01-15 11:26:45.430+0000 [id=596460] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
            2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
            2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8031, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:26:45.430+0000 [id=596460] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:26:45.430+0000 [id=596460] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
            2022-01-15 11:31:45.430+0000 [id=596505] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
            2022-01-15 11:31:45.430+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
            2022-01-15 11:31:45.430+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 8032, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
            2022-01-15 11:31:45.430+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 5 nodes assigned to this Jenkins instance, which we will check
            2022-01-15 11:31:45.431+0000 [id=596505] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
            2022-01-15 11:31:45.431+0000 [id=596505] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
            

             

            Something to consider: I have aws-node-termination-handler, which sends SIGTERM to all of the pods located on the host that is about to be terminated.
            But I expected to see some logs from the jnlp process indicating that it is about to go down (graceful shutdown).
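
            As an illustration only (not part of the original report): one way to give the jnlp container a chance to shut down visibly during a node drain is to set a termination grace period and a short preStop delay on the agent pod. The values below are arbitrary, the image tag is simply the one from the environment listed above, and this is a sketch, not a fix for the hang itself:

            // Sketch only: gives the agent pod some time to react to SIGTERM during node drain.
            // Grace period and sleep duration are illustrative values.
            podTemplate(yaml: '''
            apiVersion: v1
            kind: Pod
            spec:
              terminationGracePeriodSeconds: 60
              containers:
              - name: jnlp
                image: jenkins/inbound-agent:4.11-1-jdk11
                lifecycle:
                  preStop:
                    exec:
                      command: ["sh", "-c", "sleep 10"]
            ''') {
                node(POD_LABEL) {
                    sh 'make build'   // placeholder step
                }
            }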

             

            To sum things up:

            • The agent pod no longer exists in the k8s cluster
            • The agent still appears as disconnected in the jenkins agents list
            • The job is still hanging

             

             


              People

              Assignee: jthompson Jeff Thompson
              Reporter: dordor dor s
              Votes: 0
              Watchers: 3