    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • kubernetes-plugin
    • None

      We run jobs that put 600+ builds in the queue. Jenkins starts some of them, then seems to get stuck and stops updating anything, even though the remaining builds are ready to run (not blocked).

      This happens with the k8s plugin: one update loop for a label can take a very long time.

      Apr 29, 2022 12:02:38 PM FINER hudson.slaves.NodeProvisioner
      ran update on k8s-beaker in 3378791ms

      While this slow update is running, a lock is held and any suggested review of the label is ignored. The behaviour we see is a system with many jobs that could run, but Jenkins does not spin up new k8s pods to fulfill the requests until the update eventually completes, and then a batch of them starts. In some of our scenarios it can stay stuck like this for about an hour (the update logged above took 3378791 ms, roughly 56 minutes).
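The effect described above can be illustrated with a small sketch. This is not the Jenkins NodeProvisioner source; the class `ReviewGateSketch` and its methods are hypothetical names, and the only assumption is the behaviour reported in this ticket: a review suggested while an update is still in flight is dropped rather than queued.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; this is not the Jenkins NodeProvisioner code. */
public class ReviewGateSketch {
    private final AtomicBoolean updateRunning = new AtomicBoolean(false);
    private final AtomicInteger ran = new AtomicInteger();
    private final AtomicInteger ignored = new AtomicInteger();

    /** Runs the update unless one is already in flight; in that case the suggestion is dropped. */
    public boolean suggestReview(Runnable update) {
        if (!updateRunning.compareAndSet(false, true)) {
            ignored.incrementAndGet(); // comparable to "ignoring suggested review for <label>"
            return false;
        }
        try {
            update.run(); // a slow update keeps the gate closed for its whole duration
            ran.incrementAndGet();
            return true;
        } finally {
            updateRunning.set(false);
        }
    }

    public int ranCount() { return ran.get(); }

    public int ignoredCount() { return ignored.get(); }

    public static void main(String[] args) {
        ReviewGateSketch gate = new ReviewGateSketch();
        // Simulate a review suggestion arriving while the first update is still running.
        gate.suggestReview(() -> gate.suggestReview(() -> { }));
        System.out.println("ran=" + gate.ranCount() + " ignored=" + gate.ignoredCount());
        // prints "ran=1 ignored=1"
    }
}
```

With a gate like this, the longer a single update takes, the more suggestions are discarded, which would match a queue full of buildable jobs with no new pods being requested.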

          [JENKINS-68371] NodeProvisioner stuck / slow under load

          Samuel Beaulieu added a comment - edited

          It turns out that the k8s node provisioner calls StandardPlannedNodeBuilder.java in a single-threaded manner. In tests with load increasing up to 600+ jobs in the queue, each .build() call took anywhere from 0 to 30 or 40 seconds, which makes provisioning hundreds of agents very slow. With a thread pool added, the agents are at least created concurrently:

          https://github.com/jonathannewman/kubernetes-plugin/commit/33c2022bfa8c539726cbda021dc007072efee92a
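As a rough illustration of the thread-pool idea (not the code in the linked commit, which is not reproduced here), the sketch below submits a batch of slow build tasks to a fixed-size pool. The class name `ConcurrentBuildSketch`, the method `buildAll`, the pool size and the 200 ms sleep are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: run slow build steps on a pool instead of one after another. */
public class ConcurrentBuildSketch {

    /** Runs every build task on a fixed-size pool and returns the results in submission order. */
    public static <T> List<T> buildAll(List<Callable<T>> builds, int poolSize)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // invokeAll blocks until every task has completed (or failed).
            List<Future<T>> futures = pool.invokeAll(builds);
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<String>> builds = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int n = i;
            builds.add(() -> {
                Thread.sleep(200); // stand-in for a slow .build() call
                return "agent-" + n;
            });
        }
        long start = System.nanoTime();
        List<String> agents = buildAll(builds, 4);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(agents.size() + " agents built in " + elapsedMs + " ms");
    }
}
```

A pool only hides the per-build latency; it does not explain why an individual build takes tens of seconds under load.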

          Samuel Beaulieu added a comment

          Running this code is a big improvement, but the logging we added shows that there is still a problem. I don't know whether it is expected to take this long:

          May 02, 2022 3:21:22 PM INFO org.csanchez.jenkins.plugins.kubernetes.StandardPlannedNodeBuilder lambda$build$0
          Created slave in 114531 milliseconds
          May 02, 2022 3:21:22 PM INFO org.csanchez.jenkins.plugins.kubernetes.StandardPlannedNodeBuilder lambda$build$0
          Created slave in 114532 milliseconds
          May 02, 2022 3:21:22 PM INFO org.csanchez.jenkins.plugins.kubernetes.StandardPlannedNodeBuilder lambda$build$0
          Created slave in 113530 milliseconds
          May 02, 2022 3:21:22 PM INFO org.csanchez.jenkins.plugins.kubernetes.StandardPlannedNodeBuilder lambda$build$0
          Created slave in 113530 milliseconds
          May 02, 2022 3:21:22 PM INFO org.csanchez.jenkins.plugins.kubernetes.StandardPlannedNodeBuilder lambda$build$0
          Created slave in 94531 milliseconds
          May 02, 2022 3:21:22 PM INFO org.csanchez.jenkins.plugins.kubernetes.StandardPlannedNodeBuilder lambda$build$0
          Created slave in 94531 milliseconds
          May 02, 2022 3:21:22 PM INFO org.csanchez.jenkins.plugins.kubernetes.StandardPlannedNodeBuilder lambda$build$0
          Created slave in 112529 milliseconds
          May 02, 2022 3:21:22 PM INFO org.csanchez.jenkins.plugins.kubernetes.StandardPlannedNodeBuilder lambda$build$0
          Created slave in 112530 milliseconds
          May 02, 2022 3:21:22 PM INFO org.csanchez.jenkins.plugins.kubernetes.StandardPlannedNodeBuilder lambda$build$0
          Created slave in 111528 milliseconds
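To compare runs, log lines like the ones above could be summarized with a small helper. This is only a sketch: the class `CreationTimeSummary` is a hypothetical name, and the pattern assumes the exact message text "Created slave in N milliseconds" shown in this comment.

```java
import java.util.LongSummaryStatistics;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that summarizes "Created slave in N milliseconds" log entries. */
public class CreationTimeSummary {
    private static final Pattern CREATED =
            Pattern.compile("Created slave in (\\d+) milliseconds");

    /** Collects count, min, max and average of every creation time found in the log text. */
    public static LongSummaryStatistics summarize(String log) {
        LongSummaryStatistics stats = new LongSummaryStatistics();
        Matcher matcher = CREATED.matcher(log);
        while (matcher.find()) {
            stats.accept(Long.parseLong(matcher.group(1)));
        }
        return stats;
    }

    public static void main(String[] args) {
        // Three of the durations reported in the comment above.
        String log = String.join("\n",
                "Created slave in 114531 milliseconds",
                "Created slave in 94531 milliseconds",
                "Created slave in 111528 milliseconds");
        LongSummaryStatistics stats = summarize(log);
        System.out.println("count=" + stats.getCount()
                + " min=" + stats.getMin()
                + " max=" + stats.getMax());
        // prints "count=3 min=94531 max=114531"
    }
}
```

Across all nine entries in the comment, the creation times range from 94531 ms to 114532 ms, so every build in that batch took more than a minute and a half.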

          Vincent Latombe added a comment

          Building a single node should be much quicker than that; there is definitely something wrong here.

          Samuel Beaulieu added a comment

          In most instances it prints 'Created slave in 0 milliseconds', but as load increases we get messages where it takes several seconds or more. We have 600+ jobs that run on a single label, and when this happens there might be 40 agents running and 560 jobs in the queue, most of them in a buildable state. Since agent building is single-threaded, this locks up the provisioning loop for a long time, and during that time the log shows that suggested reviews are ignored:

          Apr 29, 2022 7:13:29 AM FINE hudson.slaves.NodeProvisioner
          ignoring suggested review for k8s-beaker
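A back-of-the-envelope bound shows why single-threaded building hurts at this scale. The sketch below is only arithmetic on the figures quoted in this thread (560 queued jobs, up to 40 seconds per build); the class `ProvisioningEstimate` and the pool size of 16 are hypothetical, and real builds often finish far faster than the worst case.

```java
/** Hypothetical worst-case estimate; assumes every build takes the maximum reported time. */
public class ProvisioningEstimate {

    /** Upper bound on wall-clock seconds when builds run in batches of {@code threads}. */
    public static long worstCaseSeconds(int queued, long secondsPerBuild, int threads) {
        long batches = (queued + threads - 1) / threads; // ceiling division
        return batches * secondsPerBuild;
    }

    public static void main(String[] args) {
        // 560 queued jobs, 40 s per build, one thread: 560 * 40 s
        System.out.println(worstCaseSeconds(560, 40, 1));  // prints 22400
        // Same load on a hypothetical pool of 16 threads: 35 batches * 40 s
        System.out.println(worstCaseSeconds(560, 40, 16)); // prints 1400
    }
}
```

The single-threaded bound of 22400 seconds is a little over six hours, so the 56-minute update reported in the description is well within what this worst case allows.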

            Assignee: Unassigned
            Reporter: Samuel Beaulieu (sbeaulie)