Type: Bug
Resolution: Unresolved
Priority: Minor
LTS Version: 2.426.3
OpenJDK Version: 17.0.6+10-0ubuntu1~20.04.1
Git plugin Version: 5.2.1
Controller OS: Debian Bookworm
Agents (ephemeral docker based): Ubuntu Focal
Greetings. This issue looks similar in presentation to https://issues.jenkins.io/browse/JENKINS-38072. It has happened to us twice in the past 6 weeks, and both times it appeared suddenly, with builds on all agents unable to get past the checkout scm stage.
00:00:02.569 [GitHub Checks] GitHub check (name: Jenkins, status: in_progress) has been published.
00:00:03.252 [Pipeline] Start of Pipeline
00:00:05.278 [Pipeline] node
00:00:05.609 Running on docker-nodename-f4a88e28-2870-11ef-8344-da94012290d4 in /workspace/jenkins-agent/workspace/ring_reponame-stack_PR-2300
00:00:05.622 [Pipeline] {
00:00:06.350 [Pipeline] stage
00:00:06.372 [Pipeline] { (Declarative: Checkout SCM)
00:00:07.247 [Pipeline] checkout
00:00:07.285 The recommended git tool is: git
00:00:07.898 [Pipeline] }
00:00:08.011 [Pipeline] // stage
00:00:08.121 [Pipeline] }
00:00:08.511 [Pipeline] // node
00:00:08.777 [Pipeline] End of Pipeline
00:00:09.026 Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from hostname.of.agent.pool/10.xxx.xxx.xxx:44604
00:00:09.026     at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1787)
00:00:09.026     at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
00:00:09.026     at hudson.remoting.Channel.call(Channel.java:1003)
00:00:09.026     at hudson.FilePath.act(FilePath.java:1230)
00:00:09.026     at hudson.FilePath.act(FilePath.java:1219)
00:00:09.026     at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:138)
00:00:09.026     at hudson.plugins.git.GitSCM.createClient(GitSCM.java:916)
00:00:09.026     at hudson.plugins.git.GitSCM.createClient(GitSCM.java:847)
00:00:09.026     at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1294)
00:00:09.026     at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
00:00:09.026     at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:97)
00:00:09.026     at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:84)
00:00:09.026     at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
00:00:09.026     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
00:00:09.026     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
00:00:09.026     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
00:00:09.026     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
00:00:09.026     at java.base/java.lang.Thread.run(Thread.java:840)
00:00:09.026 java.lang.NoSuchMethodError: 'void hudson.plugins.git.GitAPI.setHostKeyFactory(org.jenkinsci.plugins.gitclient.verifier.HostKeyVerifierFactory)'
00:00:09.026     at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:208)
00:00:09.026     at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:176)
00:00:09.026     at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3616)
00:00:09.026     at hudson.remoting.UserRequest.perform(UserRequest.java:211)
00:00:09.026     at hudson.remoting.UserRequest.perform(UserRequest.java:54)
00:00:09.026     at hudson.remoting.Request$2.run(Request.java:377)
00:00:09.026     at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
00:00:09.026     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
00:00:09.026     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
00:00:09.026     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
00:00:09.026     at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:125)
00:00:09.026     at java.base/java.lang.Thread.run(Thread.java:833)
00:00:09.026 Also: org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: ed8f4997-9fb2-44fa-9777-32a06a529b03
00:00:09.026 Caused: java.io.IOException: Remote call on JNLP4-connect connection from hostname.of.agent.pool/10.xxx.xxx.xxx:44604 failed
00:00:09.026     at hudson.remoting.Channel.call(Channel.java:1007)
00:00:09.026     at hudson.FilePath.act(FilePath.java:1230)
00:00:09.026     at hudson.FilePath.act(FilePath.java:1219)
00:00:09.026     at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:138)
00:00:09.026     at hudson.plugins.git.GitSCM.createClient(GitSCM.java:916)
00:00:09.026     at hudson.plugins.git.GitSCM.createClient(GitSCM.java:847)
00:00:09.026     at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1294)
00:00:09.026     at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
00:00:09.026     at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:97)
00:00:09.026     at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:84)
00:00:09.026     at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
00:00:09.026     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
00:00:09.026     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
00:00:09.026     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
00:00:09.026     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
00:00:09.026     at java.base/java.lang.Thread.run(Thread.java:840)
00:00:09.604 [GitHub Checks] GitHub check (name: Jenkins, status: completed) has been published.
00:00:10.408
00:00:10.408 GitHub has been notified of this commit’s build result
00:00:10.408
00:00:10.408 Finished: FAILURE
No recent changes have been deployed to our controller images, agent images, or CasC to explain it.
Restarting the controller resolved it, but since this is the second occurrence we would like to get to the bottom of it. Ideally we would find a preventative action; failing that, a corrective action that is quicker than a controller restart, which can take an hour before the UI is fully operational again.
The issue also does not trigger a healthcheck failure on either the controller or the agent instances, so we can clock up a few hours of service interruption between being notified of build failures, investigating and identifying the errors, and restarting the controller.
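Since the built-in healthchecks stay green, one stopgap we are considering is an external probe that scans recent build console logs for this kind of linkage error and alerts when it shows up in more than one build. The following is only a minimal sketch of the detection part: the function names, the list of error classes, and the threshold are our own assumptions, and how the console logs are fetched is deliberately left out.

```python
import re

# Linkage errors like the NoSuchMethodError in the console log above.
# Which error classes to watch for is an assumption, not a Jenkins convention.
LINKAGE_ERROR = re.compile(
    r"java\.lang\.(NoSuchMethodError|NoClassDefFoundError|AbstractMethodError): (.*)"
)


def find_linkage_errors(console_log: str) -> list[tuple[str, str]]:
    """Return (error class, message) pairs found in one build's console log."""
    found = []
    for line in console_log.splitlines():
        match = LINKAGE_ERROR.search(line)
        if match:
            found.append((match.group(1), match.group(2).strip()))
    return found


def should_alert(console_logs: list[str], threshold: int = 2) -> bool:
    """Alert once at least `threshold` builds show a linkage error.

    The default threshold of 2 is a hypothetical value, chosen so that a
    single odd build does not page anyone.
    """
    failing = sum(1 for log in console_logs if find_linkage_errors(log))
    return failing >= threshold
```

Fed the console text of the last few failed builds, `should_alert` would have fired in both of our incidents, since every build failed with the same error.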
Our agents are all ephemeral, Docker-based agents connecting over JNLP. After a build finishes on an agent, the agent is automatically removed as a node from the controller, which causes the agent instance to restart. When an instance starts up again, it does so from an identical image with no filesystem persistence, so it's a clean slate.
It connects to the controller to register a node under a new name derived from the task id, and then launches the agent Java process to connect to the controller. Custom code in the form of a Groovy plugin and an agent-side Jenkins API client is responsible for this process, including removing the old agents.
So, with our ephemeral setup in mind, are there any ideas as to what could be causing this, and whether there are steps we could take to prevent a recurrence or to correct an occasional recurrence without a full controller restart?
Controller uptime would have been between 12 and 36 hours, agent lifetime is only between a couple of minutes and a couple of hours at most, and no changes to either the controller or agent images had been deployed recently, nor had the CasC been modified.
Let me know if there is any more information I should provide.