JENKINS-68122

Agent connection broken (randomly) with error java.util.concurrent.TimeoutException (regression in 2.325)

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Environment: Jenkins 2.332.1 on Ubuntu 18.04 with OpenJDK 11.0.14,
      Amazon EC2 plugin 1.68
    • Released As: 2.343, 2.332.3

      After upgrading Jenkins from 2.319.2 to 2.332.1, we started experiencing broken EC2 agent connections with a ping thread timeout error:

      java.util.concurrent.TimeoutException: Ping started at 1648107727099 hasn't completed by 1648107967100
          at hudson.remoting.PingThread.ping(PingThread.java:132)
          at hudson.remoting.PingThread.run(PingThread.java:88)

      This happens randomly, and the build job hangs at the pipeline git checkout stage. When the agent connection breaks, we can re-launch the agent and it reconnects, but the build job no longer seems able to reach the agent and just stalls until it is cancelled. While this is happening, other EC2 agents keep running, and an OS-level ping from the master to the agent in question still gets a response. We tried disabling "Response Time" under Preventive Node Monitoring (Manage Nodes and Clouds). That only delays the broken connection from 2 missed pings to 5 or 6, as the master continues to monitor disk space, swap, and so on. Killing the job and rebuilding succeeds most of the time (sometimes it gets stuck on the same broken connection).

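      For context, the two timestamps in the exception are 240,001 ms apart, which matches the stock remoting ping timeout (the ping runs every 300 seconds and gives up after 240 seconds). If you only need to buy time until a fixed release, a minimal stopgap sketch is to raise those limits on the controller, assuming the standard ChannelPinger system properties apply to your launch method (the values below are examples, not taken from this ticket):

      # raise the remoting ping interval/timeout on the controller
      # (property names as documented for Jenkins core; verify against your version)
      java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=600 \
           -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=480 \
           -jar jenkins.war

      Note that this only delays the failure, much like disabling the "Response Time" monitor did; the underlying regression is in core.
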
          [JENKINS-68122] Agent connection broken (randomly) with error java.util.concurrent.TimeoutException (regression in 2.325)

          Jesse Glick added a comment -

          Well I hope the fix works this time and it can go into 2.332.3. Since this is pretty severe for those who encounter it, I took a stab at an agent-side workaround. Completely untested, just based on code inspection, but if it works then it might allow you to use a stock core for the next few weeks: https://github.com/jenkinsci/remoting/pull/527

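          The workaround above is a remoting pull request, so it only helps if the agent can be started with a locally built remoting jar. As a hedged sketch, for launch methods where you control the agent command line yourself (the EC2 plugin's default SSH launch copies remoting.jar from the controller, so it would need a different hook), an inbound agent could be pointed at the patched jar like this; the URL, secret, and paths are placeholders:

          # start an inbound agent with a custom-built remoting jar
          java -jar ./remoting-patched.jar \
               -jnlpUrl https://jenkins.example.com/computer/my-agent/jenkins-agent.jnlp \
               -secret <agent-secret> \
               -workDir /home/jenkins/agent
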
          Jesse Glick added a comment -

          every time a new instance of RingBufferLogHandler is created

          Well, that would be once in the controller JVM plus any occasional plugin usages, and once in the agent JVM. Not really important. But sure, either way works (or ought to work).

          Basil Crow added a comment -

          Fixed in jenkinsci/jenkins#6449 toward 2.343.

          Mark Chester added a comment -

          I read through the comments here and didn't see any workaround to use until the new LTS is released.  We are running on 2.332.3 and have this problem BAD.  We are failing jobs several times per day.  Is there any workaround that does not involve leaving the LTS line?

          Mark Waite added a comment -

          koyaanisqatsi, the earlier comments indicate that reducing the logging level of the agent seems to reduce the issue. Have you tried that?

          Mark Chester added a comment -

          markewaite, I'm not sure how to do that on our installation.  We are not launching agents from a CLI.  They are launched automatically in AWS by "Clouds and Nodes" configs.  I'm trying to identify the plugin that provides this, which I believe is this guy (https://plugins.jenkins.io/ssh-slaves/).  I have not found any way to inject a logging config into our agent startup process.

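          If the launcher exposes a JVM options field (the SSH launcher from the ssh-slaves plugin linked above does, in its advanced settings), the logging suggestion can be applied by pointing java.util.logging at a quieter configuration file. A minimal sketch; the path and levels are examples, not from this ticket:

          # /home/jenkins/agent-logging.properties (example path on the agent)
          # raise the threshold so only warnings and above are recorded
          .level=WARNING
          handlers=java.util.logging.ConsoleHandler
          java.util.logging.ConsoleHandler.level=WARNING

          # then add this to the agent JVM options:
          #   -Djava.util.logging.config.file=/home/jenkins/agent-logging.properties
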
          Mark Waite added a comment -

          If you're not increasing the logging level, then there is likely no benefit to reducing the logging level.

          You could assist with testing the Jenkins 2.346.1 release candidate and confirm that it is resolved in your environment with that release.

          Basil Crow added a comment -

          markewaite, you appear to be confused about which LTS release this fix was backported to. The fix was backported to 2.332.3, so testing 2.346.1 will not help. koyaanisqatsi, I suggest filing a new ticket, as your problem may be unrelated.

          Kapa Wo added a comment -

          We no longer have this issue after upgrading to 2.332.3. Mark Chester, your issue may be different. However, you can try our workaround (earlier comment), which we used before the 2.332.3 LTS was released.

          Mark Chester added a comment -

          Well, the issue has cleared up after I reverted my agent configs back to an EC2 instance type that did not use NVMe storage (to m6i.2xlarge from c5d.4xlarge) and the AMI back to one older than current (to ami-04dd4500af104442f, from ami-0c1bc246476a5572b in eu-west-1).  I also disabled "Stop/Disconnect on Idle Timeout", which I had enabled to save the initialization time of a new instance.  We had updated to NVMe storage due to needing the low-latency storage for Docker-based builds, but that seems to have affected Jenkins in a negative way.  It didn't help the builds anyway, so reverting wasn't a big deal.  Normally I would have made only one change at a time, so that I could see the impact.  But this was blocking a lot of developers and a hotfix we needed to get out.

          I agree my case sounds like something different than this issue.  I don't have the liberty of upgrading outside the LTS versions.  If we can find some time and approval to test things further, I'll open a new issue and include diagnostic info.

            Assignee: Basil Crow
            Reporter: Kapa Wo
            Votes: 5
            Watchers: 11
