
Sudden SCM checkout break on all agents with java.lang.NoSuchMethodError for gitclient

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Component: git-client-plugin
    • Environment:
      LTS Version: 2.426.3
      OpenJDK Version: 17.0.6+10-0ubuntu1~20.04.1
      Git plugin Version: 5.2.1
      Controller OS: Debian Bookworm
      Agents (ephemeral docker based): Ubuntu Focal

      Greetings. This issue looks similar to https://issues.jenkins.io/browse/JENKINS-38072 in presentation. It has happened to us twice in the past six weeks, and both times it presented suddenly, with builds unable to get past the `checkout scm` stage across all agents.

      00:00:02.569  [GitHub Checks] GitHub check (name: Jenkins, status: in_progress) has been published.
      00:00:03.252  [Pipeline] Start of Pipeline
      00:00:05.278  [Pipeline] node
      00:00:05.609  Running on docker-nodename-f4a88e28-2870-11ef-8344-da94012290d4 in /workspace/jenkins-agent/workspace/ring_reponame-stack_PR-2300
      00:00:05.622  [Pipeline] {
      00:00:06.350  [Pipeline] stage
      00:00:06.372  [Pipeline] { (Declarative: Checkout SCM)
      00:00:07.247  [Pipeline] checkout
      00:00:07.285  The recommended git tool is: git
      00:00:07.898  [Pipeline] }
      00:00:08.011  [Pipeline] // stage
      00:00:08.121  [Pipeline] }
      00:00:08.511  [Pipeline] // node
      00:00:08.777  [Pipeline] End of Pipeline
      00:00:09.026  Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from hostname.of.agent.pool/10.xxx.xxx.xxx:44604
      00:00:09.026          at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1787)
      00:00:09.026          at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
      00:00:09.026          at hudson.remoting.Channel.call(Channel.java:1003)
      00:00:09.026          at hudson.FilePath.act(FilePath.java:1230)
      00:00:09.026          at hudson.FilePath.act(FilePath.java:1219)
      00:00:09.026          at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:138)
      00:00:09.026          at hudson.plugins.git.GitSCM.createClient(GitSCM.java:916)
      00:00:09.026          at hudson.plugins.git.GitSCM.createClient(GitSCM.java:847)
      00:00:09.026          at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1294)
      00:00:09.026          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
      00:00:09.026          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:97)
      00:00:09.026          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:84)
      00:00:09.026          at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
      00:00:09.026          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
      00:00:09.026          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      00:00:09.026          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      00:00:09.026          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      00:00:09.026          at java.base/java.lang.Thread.run(Thread.java:840)
      00:00:09.026  java.lang.NoSuchMethodError: 'void hudson.plugins.git.GitAPI.setHostKeyFactory(org.jenkinsci.plugins.gitclient.verifier.HostKeyVerifierFactory)'
      00:00:09.026      at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:208)
      00:00:09.026      at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:176)
      00:00:09.026      at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3616)
      00:00:09.026      at hudson.remoting.UserRequest.perform(UserRequest.java:211)
      00:00:09.026      at hudson.remoting.UserRequest.perform(UserRequest.java:54)
      00:00:09.026      at hudson.remoting.Request$2.run(Request.java:377)
      00:00:09.026      at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
      00:00:09.026      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      00:00:09.026      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      00:00:09.026      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      00:00:09.026      at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:125)
      00:00:09.026      at java.base/java.lang.Thread.run(Thread.java:833)
      00:00:09.026  Also:   org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: ed8f4997-9fb2-44fa-9777-32a06a529b03
      00:00:09.026  Caused: java.io.IOException: Remote call on JNLP4-connect connection from hostname.of.agent.pool/10.xxx.xxx.xxx:44604 failed
      00:00:09.026      at hudson.remoting.Channel.call(Channel.java:1007)
      00:00:09.026      at hudson.FilePath.act(FilePath.java:1230)
      00:00:09.026      at hudson.FilePath.act(FilePath.java:1219)
      00:00:09.026      at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:138)
      00:00:09.026      at hudson.plugins.git.GitSCM.createClient(GitSCM.java:916)
      00:00:09.026      at hudson.plugins.git.GitSCM.createClient(GitSCM.java:847)
      00:00:09.026      at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1294)
      00:00:09.026      at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
      00:00:09.026      at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:97)
      00:00:09.026      at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:84)
      00:00:09.026      at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
      00:00:09.026      at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
      00:00:09.026      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      00:00:09.026      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      00:00:09.026      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      00:00:09.026      at java.base/java.lang.Thread.run(Thread.java:840)
      00:00:09.604  [GitHub Checks] GitHub check (name: Jenkins, status: completed) has been published.
      00:00:10.408
      00:00:10.408  GitHub has been notified of this commit’s build result
      00:00:10.408
      00:00:10.408  Finished: FAILURE
       

      No recent changes have been deployed to our controller images, agent images, or CasC to explain it.

      It was resolved by restarting the controller, but since this is the second occurrence we would like to get to the bottom of it. Ideally we want a preventative action; failing that, a corrective action that is quicker than a controller restart, which for us can take an hour before the UI is fully operational again.

      The issue also does not trigger a healthcheck failure on either the controller or the agent instances, so we can accumulate a few hours of service interruption by the time we are notified of build failures, investigate and identify the error, and restart the controller.

      Our agents are all ephemeral, Docker-based agents connecting over JNLP. After a build finishes on an agent, it is automatically removed as a node from the controller, which causes the agent instance to restart. When an instance starts up again, it starts from an identical image with no filesystem persistence, so it is a clean slate.
      It connects to the controller to register a node under a new name derived from the task id, and then launches the agent Java process to connect to the controller. Custom code, in the form of a Groovy plugin and an agent-side Jenkins API client, is responsible for this process, including removing the old agents.
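      For context, the node registration step is roughly equivalent to what the following Script Console sketch does. This is an illustrative sketch only (the node name, remote FS path and executor count are placeholders), not our actual plugin code:

          import hudson.slaves.DumbSlave
          import hudson.slaves.JNLPLauncher
          import hudson.slaves.RetentionStrategy
          import jenkins.model.Jenkins

          // Illustrative name; our plugin derives it from the task id.
          def nodeName = "docker-nodename-" + UUID.randomUUID().toString()

          // Register an inbound (JNLP) agent node with a fixed remote workspace root.
          def agent = new DumbSlave(nodeName, "/workspace/jenkins-agent", new JNLPLauncher())
          agent.setNumExecutors(1)
          agent.setRetentionStrategy(new RetentionStrategy.Always())
          Jenkins.get().addNode(agent)

          // After the build, the previous node is removed so the instance can recycle:
          // Jenkins.get().removeNode(Jenkins.get().getNode(previousNodeName))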

      So, with our ephemeral setup in mind, are there any ideas as to what could be causing this, and whether there are steps that could be taken to prevent a recurrence, or to correct the occasional recurrence without a full controller restart?
      Controller uptime would have been between 12 and 36 hours, agent lifetime is only between a couple of minutes and a couple of hours at most, and no changes to either the controller or agent images had been deployed recently, nor had the CasC been modified.

      Let me know if there is any more information I should provide.

       

       

          [JENKINS-73304] Sudden SCM checkout break on all agents with java.lang.NoSuchMethodError for gitclient

          Mark Waite added a comment -

          I believe that stack trace means that the Jenkins controller sent a request to the Jenkins agent to perform a checkout of a repository using ssh. The checkout step first tried to check that the known hosts configuration was correct, but a Java class that was needed for that check did not exist on the agent. The mentioned Java class is part of the git client plugin, so it should have been available on the Jenkins agent, since remoting copies classes from the controller to the agent before calling them.

          One workaround is to switch from checkout with ssh to checkout with https.
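          For illustration, a pipeline checkout pinned to an https remote might look like this (the repository URL and credentials ID here are placeholders):

              checkout([$class: 'GitSCM',
                        branches: [[name: '*/main']],
                        userRemoteConfigs: [[url: 'https://github.com/example-org/example-repo.git',
                                             credentialsId: 'example-https-token']]])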

          Another workaround might be to monitor the agent logs and the job logs for that failure message and raise an alert in your monitoring system.
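          If you go that route, something along these lines could be run periodically from the Script Console or a monitoring job to flag affected builds (the signature string is taken from the log above; the line limit is arbitrary):

              import hudson.model.Job
              import jenkins.model.Jenkins

              def signature = "java.lang.NoSuchMethodError: 'void hudson.plugins.git.GitAPI.setHostKeyFactory"
              Jenkins.get().getAllItems(Job).each { job ->
                  def build = job.getLastBuild()
                  // Scan only the tail of the log of each job's most recent build.
                  if (build != null && build.getLog(500).any { line -> line.contains(signature) }) {
                      println "Failure signature found in ${job.fullName} #${build.number}"
                  }
              }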

          I've not seen that failure, so I don't have any great ideas of other ways to diagnose it.


          Christopher added a comment -

          Thanks so much for the explanation. Can you help me understand a few things please?

          > The mentioned Java class is part of the git client plugin, so it should have been available on the Jenkins agent, since remoting copies classes from the controller to the agent before calling them.

          This is where I'm a bit puzzled. As mentioned, the agents are brought up and down after each build, which means a new node is registered via the API and the agent JAR is launched at that point.

          It's interesting that the plugin mentions something about SetSID and the Swarm plugin being useful together, and I'd love to know why in that case particularly. I'd never heard of the Swarm plugin, but it sounds very similar to the wheels we reinvented to scale a containerised/virtualised solution for ephemeral agents.

          So, with 40 fresh nodes available all suddenly failing to find the method for a CLI-git-based interaction, after each build had already fetched a shared library from Git via HTTPS, would I be right in concluding that the source of truth for the classes remoting copies to the agents is corrupted?

          The part that is throwing the exception is a method in a different package distributed with the plugin, i.e. the legacy Git API: https://github.com/jenkinsci/git-client-plugin/blob/master/src/main/java/org/jenkinsci/plugins/gitclient/Git.java#L208

          Can you think of any potential events or theoretical culprits for such a state on the controller? Something to do with CasC, an administrative process, or filesystem clobbering?

          A suspect here could be some plugin dependency hell owing to caching delays in Artifactory. Another factor that could corrupt the state is storage trouble: there were Ceph issues in the days leading up to both incidents, and that volume-backed home directory is the only part capable of holding such state.

          It's interesting that breaking builds by corrupting a library is something we effectively do during a failover activity, to prevent one controller from running already-completed builds on the same file system as another controller instance. If we can break and unbreak all builds via `GlobalLibraries.get().setLibraries(libs);`, then surely we can take a similar approach here around the plugins?
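          For reference, that break/unbreak toggle follows roughly this pattern (a sketch of the idea, not our actual failover code):

              import org.jenkinsci.plugins.workflow.libs.GlobalLibraries

              def globalLibs = GlobalLibraries.get()
              def saved = new ArrayList(globalLibs.getLibraries())  // remember the current configuration

              globalLibs.setLibraries([])     // "break" builds that rely on the shared libraries
              // ... failover / maintenance window ...
              globalLibs.setLibraries(saved)  // "unbreak" by restoring the original configuration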

          If a restart fixed it, I know that our entrypoint would have wiped the plugins folder under JENKINS_HOME in order to ensure the plugins baked into the image get copied out by the upstream entrypoint scripting. Maybe wiping them and a CasC hot reload would save a restart?
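          If it comes to that, the hot reload itself can also be triggered from the Script Console (a minimal sketch, assuming the configuration-as-code plugin is installed):

              import io.jenkins.plugins.casc.ConfigurationAsCode

              // Re-reads the configured JCasC YAML sources and re-applies them,
              // the same action as "Reload existing configuration" in the UI.
              ConfigurationAsCode.get().configure()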


          Mark Waite added a comment -

          > A suspect here could be some plugin dependency hell owing to caching delays in Artifactory. Another factor that could corrupt the state is storage trouble: there were Ceph issues in the days leading up to both incidents, and that volume-backed home directory is the only part capable of holding such state.

          I have no experience with distributed storage systems for the Jenkins home directory. The Ceph project uses Jenkins for their CI processes, so you might be able to get further hints from them. CloudBees documentation includes an NFS guide that might offer some helpful things to consider.

          > A suspect here could be some plugin dependency hell owing to caching delays in Artifactory.

          As far as I know, the code that copies Java class files from the controller to the agent does not involve any artifact repository. I don't think that Artifactory is involved in that.

          > Another factor that could corrupt the state is storage trouble: there were Ceph issues in the days leading up to both incidents, and that volume-backed home directory is the only part capable of holding such state.

          That seems a more likely culprit to me, though that's just a guess.


          Mark Waite added a comment -

          Closing as "Cannot reproduce" since there have been no further reports of the issue and there was some indication that it might be an issue related to the storage system.


            Assignee: Unassigned
            Reporter: Christopher (krzwrd)