JENKINS-70048

Worker node going offline abruptly with "hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip" error


    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Major
    • Component: kubernetes-plugin
    • Labels: None

      Hi team,

      We are facing an issue where the Jenkins agent disconnects (goes offline) from the master while a job is still running on the agent/worker. We get the error below and have tried the steps listed further down, but the issue is still not fully resolved. Jenkins is deployed on EKS.

      Error:

      -------------------------------------------------------------------------------------------------------------------------

      ❯ kd logs jenkins-0 -c jenkins | grep -a10 -b10 worker-7j4x4
      5332191-2022-11-02 13:59:19.734+0000 [id=140214]     INFO              c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2606, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
      5332747-2022-11-02 13:59:19.734+0000 [id=140214]     INFO              c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 0 nodes assigned to this Jenkins instance, which we will check
      5332921-2022-11-02 13:59:19.734+0000 [id=140214]     INFO              c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
      5333061-2022-11-02 13:59:19.734+0000 [id=140214]     INFO              hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
      5333220-2022-11-02 14:04:19.733+0000 [id=140239]     INFO              hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
      5333372-2022-11-02 14:04:19.734+0000 [id=140239]     INFO              c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
      5333506-2022-11-02 14:04:19.734+0000 [id=140239]     INFO              c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2607, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
      5334062-2022-11-02 14:04:19.734+0000 [id=140239]     INFO              c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 0 nodes assigned to this Jenkins instance, which we will check
      5334236-2022-11-02 14:04:19.734+0000 [id=140239]     INFO              c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
      5334376-2022-11-02 14:04:19.734+0000 [id=140239]     INFO              hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms
      5334535:2022-11-02 14:07:54.573+0000 [id=140290]     INFO              hudson.slaves.NodeProvisioner#update: worker-7j4x4 provisioning successfully completed. We have now 2 computer(s)
      5334695:2022-11-02 14:07:54.675+0000 [id=140291]     INFO              o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes done-jenkins/worker-7j4x4
      5334828:2022-11-02 14:07:56.619+0000 [id=140291]     INFO              o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes done-jenkins/worker-7j4x4
      5334964-2022-11-02 14:07:58.650+0000 [id=140309]     INFO              h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #97 from /100.122.254.111:42648
      5335123-2022-11-02 14:09:19.733+0000 [id=140536]     INFO              hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
      5335275-2022-11-02 14:09:19.733+0000 [id=140536]     INFO              c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
      5335409-2022-11-02 14:09:19.734+0000 [id=140536]     INFO              c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2608, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
      5335965-2022-11-02 14:09:19.734+0000 [id=140536]     INFO              c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 1 nodes assigned to this Jenkins instance, which we will check
      5336139-2022-11-02 14:09:19.734+0000 [id=140536]     INFO              c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
      5336279-2022-11-02 14:09:19.734+0000 [id=140536]     INFO              hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
      5336438-groovy.lang.MissingPropertyException: No such property: envVar for class: groovy.lang.Binding
      5336532-            at groovy.lang.Binding.getVariable(Binding.java:63)
      5336585-            at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onGetProperty(SandboxInterceptor.java:271)

      5394279-2022-11-02 15:09:19.733+0000 [id=141899]     INFO              hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
      5394431-2022-11-02 15:09:19.734+0000 [id=141899]     INFO              c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
      5394565-2022-11-02 15:09:19.734+0000 [id=141899]     INFO              c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2620, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
      5395121-2022-11-02 15:09:19.734+0000 [id=141899]     INFO              c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 3 nodes assigned to this Jenkins instance, which we will check
      5395295-2022-11-02 15:09:19.734+0000 [id=141899]     INFO              c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
      5395435-2022-11-02 15:09:19.734+0000 [id=141899]     INFO              hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
      5395594-2022-11-02 15:11:59.502+0000 [id=140320]     INFO              hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-254-111.eu-central-1.compute.internal/100.122.254.111:42648.
      5395817-java.util.concurrent.TimeoutException: Ping started at 1667401679501 hasn't completed by 1667401919502
      5395920-            at hudson.remoting.PingThread.ping(PingThread.java:134)
      5395977-            at hudson.remoting.PingThread.run(PingThread.java:90)
      5396032:2022-11-02 15:11:59.503+0000 [id=141914]     INFO              j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5049 for worker-7j4x4 terminated: java.nio.channels.ClosedChannelException
      5396231-2022-11-02 15:12:35.579+0000 [id=141933]     INFO              hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started Periodic background build discarder
      5396368-2022-11-02 15:12:36.257+0000 [id=141933]     INFO              hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished Periodic background build discarder. 678 ms
      5396514-2022-11-02 15:14:15.582+0000 [id=141422]     INFO              hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-237-38.eu-central-1.compute.internal/100.122.237.38:55038.
      5396735-java.util.concurrent.TimeoutException: Ping started at 1667401815582 hasn't completed by 1667402055582
      5396838-            at hudson.remoting.PingThread.ping(PingThread.java:134)
      5396895-            at hudson.remoting.PingThread.run(PingThread.java:90)
      5396950-2022-11-02 15:14:15.584+0000 [id=141915]     INFO              j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5050 for worker-fjf1p terminated: java.nio.channels.ClosedChannelException
      5397149-2022-11-02 15:14:19.733+0000 [id=141950]     INFO              hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
      5397301-2022-11-02 15:14:19.733+0000 [id=141950]     INFO              c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
      5397435-2022-11-02 15:14:19.734+0000 [id=141950]     INFO              c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2621, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms

      -------------------------------------------------------------------------------------------------------------
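
      For reference, the "Ping failed" entries above come from the controller-side ping thread: the two timestamps in each TimeoutException are roughly 240 000 ms apart, which matches the default 240-second ping timeout (the ping interval defaults to 300 seconds). Both values can be overridden via JVM system properties on the controller. A minimal Script Console sketch to check whether they are overridden (null means the built-in defaults still apply):

      // Manage Jenkins -> Script Console on the controller.
      // null output means the property is not set and the default applies
      // (300 s ping interval, 240 s ping timeout).
      println "pingIntervalSeconds = " + System.getProperty("hudson.slaves.ChannelPinger.pingIntervalSeconds")
      println "pingTimeoutSeconds  = " + System.getProperty("hudson.slaves.ChannelPinger.pingTimeoutSeconds")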

       

      Tried the following (a consolidated sketch of where these settings live follows the list):

      1.  Increased idleMinutes to 180 from the default
      2.  Verified that resources are sufficient per the Grafana dashboard
      3.  Changed podRetention from Never to onFailure
      4.  Changed podRetention from Never to Always
      5.  Increased readTimeout
      6.  Increased connectTimeout
      7.  Increased slaveConnectTimeoutStr
      8.  Disabled the ping thread from the UI by unchecking the "Response Time" checkbox under preventive node monitoring
      9.  Increased activeDeadlineSeconds
      10. Verified the same Java version on master and agent
      11. Updated the Kubernetes and Kubernetes Client API plugins
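
      For context, a minimal scripted-pipeline sketch of where the pod-template settings from the list above live; the cloud name, label, container image and numeric values are placeholders rather than the actual job's values, and readTimeout/connectTimeout (items 5 and 6) belong on the Kubernetes cloud definition itself rather than on the pod template:

      // Hypothetical podTemplate pulling together the settings mentioned above.
      podTemplate(
          cloud: 'kubernetes',                 // name of the configured Kubernetes cloud
          label: 'worker',
          idleMinutes: 180,                    // item 1
          podRetention: onFailure(),           // items 3/4 (alternatives: always(), never())
          slaveConnectTimeout: 600,            // item 7
          activeDeadlineSeconds: 7200,         // item 9
          containers: [
              containerTemplate(name: 'jnlp', image: 'jenkins/inbound-agent:latest')
          ]
      ) {
          node('worker') {
              // build steps run inside the agent pod while it stays connected to the controller
              sh 'echo hello from the agent pod'
          }
      }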

       

      Any suggestions or possible resolutions, please?

       

            Assignee: Unassigned
            Reporter: Mitesh (mitesh1793)
            Votes: 0
            Watchers: 2
