JENKINS-75423

Computer.threadPoolForRemoting is taking all resources from Jenkins


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: swarm-plugin
    • Labels: None

      We are having major issues with our Jenkins instances and are trying to understand the cause. Uptime is currently less than two days before the Jenkins Docker container we are running needs to be restarted. We have ~10 build nodes connected to Jenkins through the Swarm plugin.

      When this happens, one thread seems to take all the CPU power of a single core. Profiling shows that a Computer.threadPoolForRemoting thread works continuously during this period. The web GUI becomes incredibly slow, with page load times of several minutes. Scheduled and manually triggered jobs keep queuing up on the Built-in node and cannot be handed over to the build agents fast enough. The connections to the agents are also very shaky, and we get reconnection messages in the build logs.
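      For anyone who wants to look at the same thing without attaching a profiler, a minimal Java sketch along these lines should show what those threads are busy with, assuming it is run inside the Jenkins JVM (for example adapted for the script console). The class name RemotingPoolDump and the name-based filter are just for illustration, relying on the pool threads being named "Computer.threadPoolForRemoting [#N]":

      // Illustrative sketch: print the stack trace of every thread belonging to
      // Computer.threadPoolForRemoting. Must run inside the Jenkins JVM to be useful.
      import java.util.Map;

      public class RemotingPoolDump {
          public static void main(String[] args) {
              for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                  Thread t = e.getKey();
                  if (!t.getName().contains("Computer.threadPoolForRemoting")) {
                      continue; // only the remoting pool threads are of interest here
                  }
                  System.out.println(t.getName() + " state=" + t.getState());
                  for (StackTraceElement frame : e.getValue()) {
                      System.out.println("    at " + frame);
                  }
              }
          }
      }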

      I can see old issues (from 2015 to 2017) about the SSH agents leaking threads in Computer.threadPoolForRemoting, but those seem to have been resolved a long time ago. Are similar issues known to exist in the Swarm plugin?

       

      More about the environment:

      Jenkins is running in a Docker container based on jenkins/jenkins:jdk21, with the swarm-agent and the YourKit profiler agent added to it.

      The host machine is a virtual machine hosted in vCenter with 16 CPUs and 32 GB RAM. The swarm agents are a mix of physical and virtual nodes; the specs of the 5 physical nodes we use are 64 CPUs, 1.5 TB RAM, and 4.0 TB disk. The other nodes are virtual hosts with 8 CPUs and 16 GB RAM.

      The total number of jobs we start on these nodes is probably around 500 per day.

      We have been following new Jenkins releases for 3+ years, and during this time we have never had a Jenkins controller that has "survived" for 3 weeks. During the last year we have had a scheduled Docker restart every week, and that was sufficient until recently, when the issues described above started appearing after just a couple of days.

       

      What we have tried:

      Restarting the Jenkins Docker container seems to "fix" the issue, but at the moment it only holds for ~2 days. We have also tried disconnecting the agents: we saw no difference in the Jenkins CPU usage until the last agent had been disconnected, at which point it went back to normal.
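      To make the agent-disconnection experiment more conclusive, a per-thread CPU measurement of the kind sketched below could confirm that it really is the remoting pool thread that goes quiet once the last agent disconnects. This is only a sketch built on the standard java.lang.management APIs, again assuming it runs inside the Jenkins JVM; the 10-second sampling window and the 1-second reporting threshold are arbitrary:

      // Illustrative sketch: sample per-thread CPU time twice, 10 seconds apart,
      // and report the threads that consumed the most CPU in between.
      import java.lang.management.ManagementFactory;
      import java.lang.management.ThreadInfo;
      import java.lang.management.ThreadMXBean;
      import java.util.HashMap;
      import java.util.Map;

      public class TopCpuThreads {
          public static void main(String[] args) throws InterruptedException {
              ThreadMXBean mx = ManagementFactory.getThreadMXBean();
              if (!mx.isThreadCpuTimeSupported()) {
                  return; // JVM does not expose per-thread CPU times
              }
              Map<Long, Long> before = new HashMap<>();
              for (long id : mx.getAllThreadIds()) {
                  before.put(id, mx.getThreadCpuTime(id)); // -1 if measurement is unavailable
              }
              Thread.sleep(10_000); // arbitrary 10-second sampling window
              for (long id : mx.getAllThreadIds()) {
                  Long start = before.get(id);
                  long end = mx.getThreadCpuTime(id);
                  if (start == null || start < 0 || end < 0) {
                      continue; // thread appeared or died during the window
                  }
                  long usedMs = (end - start) / 1_000_000;
                  if (usedMs < 1_000) {
                      continue; // arbitrary threshold: report only threads with >1 s of CPU
                  }
                  ThreadInfo info = mx.getThreadInfo(id);
                  String name = (info != null) ? info.getThreadName() : "<terminated>";
                  System.out.println(usedMs + " ms CPU: " + name);
              }
          }
      }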

            Assignee: Unassigned
            Reporter: Emil (emilsgn)
            Votes: 0
            Watchers: 2
