
Sudden SCM checkout break on all agents with java.lang.NoSuchMethodError for gitclient

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Component: git-client-plugin
    • Environment:
      LTS Version: 2.426.3
      OpenJDK Version: 17.0.6+10-0ubuntu1~20.04.1
      Git plugin Version: 5.2.1
      Controller OS: Debian Bookworm
      Agents (ephemeral docker based): Ubuntu Focal

      Greetings. This issue looks similar to https://issues.jenkins.io/browse/JENKINS-38072 in presentation. It has happened to us twice in the past six weeks, and both times it presented suddenly, with builds unable to get past the `checkout scm` stage across all agents.

      00:00:02.569  [GitHub Checks] GitHub check (name: Jenkins, status: in_progress) has been published.
      00:00:03.252  [Pipeline] Start of Pipeline
      00:00:05.278  [Pipeline] node
      00:00:05.609  Running on docker-nodename-f4a88e28-2870-11ef-8344-da94012290d4 in /workspace/jenkins-agent/workspace/ring_reponame-stack_PR-2300
      00:00:05.622  [Pipeline] {
      00:00:06.350  [Pipeline] stage
      00:00:06.372  [Pipeline] { (Declarative: Checkout SCM)
      00:00:07.247  [Pipeline] checkout
      00:00:07.285  The recommended git tool is: git
      00:00:07.898  [Pipeline] }
      00:00:08.011  [Pipeline] // stage
      00:00:08.121  [Pipeline] }
      00:00:08.511  [Pipeline] // node
      00:00:08.777  [Pipeline] End of Pipeline
      00:00:09.026  Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from hostname.of.agent.pool/10.xxx.xxx.xxx:44604
      00:00:09.026          at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1787)
      00:00:09.026          at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
      00:00:09.026          at hudson.remoting.Channel.call(Channel.java:1003)
      00:00:09.026          at hudson.FilePath.act(FilePath.java:1230)
      00:00:09.026          at hudson.FilePath.act(FilePath.java:1219)
      00:00:09.026          at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:138)
      00:00:09.026          at hudson.plugins.git.GitSCM.createClient(GitSCM.java:916)
      00:00:09.026          at hudson.plugins.git.GitSCM.createClient(GitSCM.java:847)
      00:00:09.026          at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1294)
      00:00:09.026          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
      00:00:09.026          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:97)
      00:00:09.026          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:84)
      00:00:09.026          at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
      00:00:09.026          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
      00:00:09.026          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      00:00:09.026          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      00:00:09.026          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      00:00:09.026          at java.base/java.lang.Thread.run(Thread.java:840)
      00:00:09.026  java.lang.NoSuchMethodError: 'void hudson.plugins.git.GitAPI.setHostKeyFactory(org.jenkinsci.plugins.gitclient.verifier.HostKeyVerifierFactory)'
      00:00:09.026      at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:208)
      00:00:09.026      at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:176)
      00:00:09.026      at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3616)
      00:00:09.026      at hudson.remoting.UserRequest.perform(UserRequest.java:211)
      00:00:09.026      at hudson.remoting.UserRequest.perform(UserRequest.java:54)
      00:00:09.026      at hudson.remoting.Request$2.run(Request.java:377)
      00:00:09.026      at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
      00:00:09.026      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      00:00:09.026      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      00:00:09.026      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      00:00:09.026      at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:125)
      00:00:09.026      at java.base/java.lang.Thread.run(Thread.java:833)
      00:00:09.026  Also:   org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: ed8f4997-9fb2-44fa-9777-32a06a529b03
      00:00:09.026  Caused: java.io.IOException: Remote call on JNLP4-connect connection from hostname.of.agent.pool/10.xxx.xxx.xxx:44604 failed
      00:00:09.026      at hudson.remoting.Channel.call(Channel.java:1007)
      00:00:09.026      at hudson.FilePath.act(FilePath.java:1230)
      00:00:09.026      at hudson.FilePath.act(FilePath.java:1219)
      00:00:09.026      at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:138)
      00:00:09.026      at hudson.plugins.git.GitSCM.createClient(GitSCM.java:916)
      00:00:09.026      at hudson.plugins.git.GitSCM.createClient(GitSCM.java:847)
      00:00:09.026      at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1294)
      00:00:09.026      at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
      00:00:09.026      at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:97)
      00:00:09.026      at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:84)
      00:00:09.026      at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
      00:00:09.026      at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
      00:00:09.026      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      00:00:09.026      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      00:00:09.026      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      00:00:09.026      at java.base/java.lang.Thread.run(Thread.java:840)
      00:00:09.604  [GitHub Checks] GitHub check (name: Jenkins, status: completed) has been published.
      00:00:10.408
      00:00:10.408  GitHub has been notified of this commit’s build result
      00:00:10.408
      00:00:10.408  Finished: FAILURE
       

      No recent changes have been deployed to our controller images, agent images, or CasC to explain it.

      It was resolved by restarting the controller, but since this is the second occurrence we would like to get to the bottom of it. Ideally we want a preventative action; failing that, a corrective action that is quicker than a controller restart, which for us can take an hour before the UI is fully operational again.

      The issue also does not trigger a healthcheck failure on either the controller or the agent instances, so we can accumulate a few hours of service interruption by the time we are notified of build failures, investigate and identify the error, and restart the controller.

      Our agents are all ephemeral, Docker-based agents connecting over JNLP. After a build finishes on an agent, it is automatically removed as a node from the controller, which causes the agent instance to restart. When an instance starts up again, it starts from an identical image with no filesystem persistence, so it is a clean slate.
      It connects to the controller to register a node under a new name derived from the task id, and then launches the agent Java process to connect to the controller. Custom code, in the form of a Groovy plugin and an agent-side Jenkins API client, is responsible for this process, including removing the old agents.
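      For context, the node registration step is roughly equivalent to what the following Script Console sketch does. This is an illustrative sketch only (the node name, remote FS path and executor count are placeholders), not our actual plugin code:

          import hudson.slaves.DumbSlave
          import hudson.slaves.JNLPLauncher
          import hudson.slaves.RetentionStrategy
          import jenkins.model.Jenkins

          // Illustrative name; our plugin derives it from the task id.
          def nodeName = "docker-nodename-" + UUID.randomUUID().toString()

          // Register an inbound (JNLP) agent node with a fixed remote workspace root.
          def agent = new DumbSlave(nodeName, "/workspace/jenkins-agent", new JNLPLauncher())
          agent.setNumExecutors(1)
          agent.setRetentionStrategy(new RetentionStrategy.Always())
          Jenkins.get().addNode(agent)

          // After the build, the previous node is removed so the instance can recycle:
          // Jenkins.get().removeNode(Jenkins.get().getNode(previousNodeName))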

      So, with our ephemeral setup in mind, are there any ideas as to what could be causing this, and whether there are steps that could be taken to prevent a recurrence, or to correct the occasional recurrence without a full controller restart?
      Controller uptime would have been between 12 and 36 hours, agent lifetime is only between a couple of minutes and a couple of hours at most, and no changes to either the controller or agent images had been deployed recently, nor had the CasC been modified.

      Let me know if there is any more information I should provide.

       

       

          [JENKINS-73304] Sudden SCM checkout break on all agents with java.lang.NoSuchMethodError for gitclient

          Mark Waite added a comment -

          I believe that stack trace means that the Jenkins controller sent a request to the Jenkins agent to perform a checkout of a repository using ssh. The checkout step first tried to check that the known hosts configuration was correct, but a Java class that was needed for that check did not exist on the agent. The mentioned Java class is part of the git client plugin, so it should have been available on the Jenkins agent, since remoting copies classes from the controller to the agent before calling them.

          One workaround is to switch from checkout with ssh to checkout with https.
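          For illustration, a pipeline checkout pinned to an https remote might look like this (the repository URL and credentials ID here are placeholders):

              checkout([$class: 'GitSCM',
                        branches: [[name: '*/main']],
                        userRemoteConfigs: [[url: 'https://github.com/example-org/example-repo.git',
                                             credentialsId: 'example-https-token']]])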

          Another workaround might be to monitor the agent logs and the job logs for that failure message and raise an alert in your monitoring system.
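          If you go that route, something along these lines could be run periodically from the Script Console or a monitoring job to flag affected builds (the signature string is taken from the log above; the line limit is arbitrary):

              import hudson.model.Job
              import jenkins.model.Jenkins

              def signature = "java.lang.NoSuchMethodError: 'void hudson.plugins.git.GitAPI.setHostKeyFactory"
              Jenkins.get().getAllItems(Job).each { job ->
                  def build = job.getLastBuild()
                  // Scan only the tail of the log of each job's most recent build.
                  if (build != null && build.getLog(500).any { line -> line.contains(signature) }) {
                      println "Failure signature found in ${job.fullName} #${build.number}"
                  }
              }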

          I've not seen that failure, so I don't have any great ideas of other ways to diagnose it.


          Christopher added a comment -

          Thanks so much for the explanation. Can you help me understand a few things please?

          > The mentioned Java class is part of the git client plugin, so it should have been available on the Jenkins agent, since remoting copies classes from the controller to the agent before calling them.

          This is where I'm a bit puzzled. As mentioned, the agents are brought up and down after each build, which means a new node is registered via the API and the agent JAR is launched at that point.

          It's interesting that the plugin mentions something about SetSID and the Swarm plugin being useful together, and I'd love to know why in that case particularly. I'd never heard of the Swarm plugin, but it sounds very similar to the wheels we reinvented to scale a containerised/virtualised solution for ephemeral agents.

          So, with 40 fresh nodes available all suddenly failing to find the method for a CLI-git-based interaction, after each build had already fetched a shared library from Git via HTTPS, would I be right in concluding that the source of truth for the classes remoting copies to the agents is corrupted?

          The part that is throwing the exception is a method in a different package distributed with the plugin, i.e. the legacy Git API: https://github.com/jenkinsci/git-client-plugin/blob/master/src/main/java/org/jenkinsci/plugins/gitclient/Git.java#L208

          Can you think of any potential events or theoretical culprits for such a state on the controller? Something to do with CasC, an administrative process, or filesystem clobbering?

          A suspect here could be some plugin dependency hell owing to caching delays in Artifactory. Another factor that could corrupt the state is storage trouble: there were Ceph issues in the days leading up to both incidents, and that volume-backed home directory is the only part capable of holding such state.

          It's interesting that breaking builds by corrupting a library is something we effectively do during a failover activity, to prevent one controller from running already-completed builds on the same file system as another controller instance. If we can break and unbreak all builds via `GlobalLibraries.get().setLibraries(libs);`, then surely we can take a similar approach here around the plugins?
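          For reference, that break/unbreak toggle follows roughly this pattern (a sketch of the idea, not our actual failover code):

              import org.jenkinsci.plugins.workflow.libs.GlobalLibraries

              def globalLibs = GlobalLibraries.get()
              def saved = new ArrayList(globalLibs.getLibraries())  // remember the current configuration

              globalLibs.setLibraries([])     // "break" builds that rely on the shared libraries
              // ... failover / maintenance window ...
              globalLibs.setLibraries(saved)  // "unbreak" by restoring the original configuration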

          If a restart fixed it, I know that our entrypoint would have wiped the plugins folder under JENKINS_HOME in order to ensure the plugins baked into the image get copied out by the upstream entrypoint scripting. Maybe wiping them and a CasC hot reload would save a restart?
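          If it comes to that, the hot reload itself can also be triggered from the Script Console (a minimal sketch, assuming the configuration-as-code plugin is installed):

              import io.jenkins.plugins.casc.ConfigurationAsCode

              // Re-reads the configured JCasC YAML sources and re-applies them,
              // the same action as "Reload existing configuration" in the UI.
              ConfigurationAsCode.get().configure()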


          Mark Waite added a comment -

          > A suspect here could be some plugin dependency hell owing to caching delays in Artifactory. Another factor that could corrupt the state is storage trouble: there were Ceph issues in the days leading up to both incidents, and that volume-backed home directory is the only part capable of holding such state.

          I have no experience with distributed storage systems for the Jenkins home directory. The Ceph project uses Jenkins for their CI processes, so you might be able to get further hints from them. CloudBees documentation includes an NFS guide that might offer some helpful things to consider.

          > A suspect here could be some plugin dependency hell owing to caching delays in Artifactory.

          As far as I know, the code that copies Java class files from the controller to the agent does not involve any artifact repository. I don't think that Artifactory is involved in that.

          > Another factor that could corrupt the state is storage trouble: there were Ceph issues in the days leading up to both incidents, and that volume-backed home directory is the only part capable of holding such state.

          That seems a more likely culprit to me, though that's just a guess.


          Mark Waite added a comment -

          Closing as "Cannot reproduce" since there have been no further reports of the issue and there was some indication that it might be an issue related to the storage system.


            Assignee: Unassigned
            Reporter: Christopher (krzwrd)