Bug
Resolution: Cannot Reproduce
Blocker
Windows Server 2008R2; Jenkins 1.54.3; Git Plugin 2.2.1
We are seeing random failures of the Git SCM: it hangs in the fetch step for a long time and then times out. When it times out it reports:
02:56:20 Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress ssh://bmcdiags@.../ghts/ta +refs/heads/*:refs/remotes/origin/*" returned status code -1:
02:56:20 stdout:
02:56:20 stderr: Could not create directory 'c/Users/Administrator/.ssh'.
[JENKINS-24454] Windows GIT SCM fetch code hung
Not a core issue, and likely not a Git issue either. Have you tried creating the directory Jenkins fails to create? Or tried Google?
You might review this Stack Overflow posting for ideas of things you might try. It is not exactly the same case, but it was one of the first items when I searched Google with that error message.
Since you report it is a random failure, and it seems to be Windows specific, you might also try to accelerate the frequency at which you encounter the problem by defining multiple jobs which use the same ssh authenticated repository, with a local reference repository (to reduce the cloning data transfer), then run the jobs concurrently.
If the problem is a file locking problem with the C:\Users\Administrator\.ssh directory, or with a file in that directory, then running many jobs in parallel should make it happen much more often, and may give you a chance to see other hints which may suggest what is causing the locking problem.
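A minimal sketch of that stress idea outside Jenkins, assuming Python is available on the slave. `run_concurrently` is a hypothetical helper, not part of Jenkins or the git plugin; it starts several copies of one command at the same time and collects their exit codes, so a failing or hanging copy stands out. The command shown is only a harmless stand-in.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(command, copies, timeout_seconds):
    """Start `copies` instances of `command` at once; return one exit code each.

    A copy that exceeds `timeout_seconds` is killed and reported as None.
    """
    def run_once(_index):
        try:
            completed = subprocess.run(
                command,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                timeout=timeout_seconds,
            )
            return completed.returncode
        except subprocess.TimeoutExpired:
            return None

    # One worker thread per copy so that all copies really overlap in time.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(run_once, range(copies)))


# Stand-in command: a Python child that exits immediately.
print(run_concurrently([sys.executable, "-c", "pass"], copies=3, timeout_seconds=60))
```

To stress the real case, the stand-in would be replaced by the fetch command copied from the failing build's console output, run from inside a workspace on the affected slave.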
You are right, Waite! We are running 3 jobs concurrently on the same Windows 7 slave, all pulling from the same source repository. Two of them succeed and one fails; which one fails is random. As a workaround I am now running the jobs serially to see whether that gets rid of the issue. Is there any other log I can provide?
The message is coming from the stderr output of the "git fetch" command as far as I can tell. That would usually mean that a change to fix the issue would be needed inside the "git" program, external to Jenkins.
If the C:\users\Administrator\.ssh directory does not exist, you can create it by logging in as Administrator, and entering the command "ssh-keygen" from a "Git Bash" shell.
Of course this directory exists; otherwise the other two jobs could not succeed and we would not be able to clone at all. It is the folder that holds the ssh public key and the known hosts file.
The strange thing is that we did not see this issue before July. The changes we made since then: upgraded the git plugin from 1.1.6 to 2.2.1, upgraded Jenkins from 1.532.3 to 1.554.3, and changed the git repository URL (I don't think the URL change is related, as the other two jobs clone successfully).
Changing the git plugin from 1.1.6 to 2.2.1 also changed from relying on per client credential configuration to using the Jenkins credentials plugin for credential management. I don't know why the git fetch command thinks it needs to create (or lock) the %HOME%\.ssh directory, but that is the challenge you're trying to resolve.
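One way to see which `.ssh` directory is in play is to run a small check from inside a Jenkins job, so that it sees the same environment as the failing fetch. The sketch below is only an illustration: `describe_ssh_dir` is a hypothetical helper, and the lookup order (HOME first, then USERPROFILE) is an assumption about how the home directory is chosen, not something confirmed for msysgit.

```python
import os
from pathlib import Path


def describe_ssh_dir(environment):
    """Report where a home-based .ssh directory would be and whether it is usable."""
    # Assumption: HOME wins when set, otherwise USERPROFILE is used.
    home = environment.get("HOME") or environment.get("USERPROFILE")
    if not home:
        return {"path": None, "exists": False, "writable": False}
    ssh_dir = Path(home) / ".ssh"
    exists = ssh_dir.is_dir()
    writable = exists and os.access(ssh_dir, os.W_OK)
    return {"path": str(ssh_dir), "exists": exists, "writable": writable}


# Report for the environment this script is running in.
print(describe_ssh_dir(os.environ))
```

If the path printed inside a job differs from the one seen in an interactive login, that difference would be worth reporting here.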
You could check if the JGit implementation inside the git client plugin is better at handling this case. You enable JGit from the "Manage Jenkins" page, where you add a git implementation named "jgit" from the pick list.
I can reliably reproduce this issue on other Windows slaves as well. I believe this is a bug in the git plugin; we did not see this issue before. Polling more than one job from the same node concurrently triggers the issue at random. The node uses ssh for cloning.
I am reasonably confident that it is a bug in the git program (or a bug in the Windows file system and its locking design), not a bug in the git plugin.
The command which is failing with a timeout is a call to the "git" program as a separate process. The git plugin calls the git program and waits for the git program to either complete or for the timeout to expire. In this case, the timeout expired, probably because of Windows file system locking semantics.
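The wait-or-time-out pattern described above can be sketched as follows. This is not the git plugin's code: `run_with_timeout` is a hypothetical helper, and returning -1 on timeout is chosen only to mirror the "returned status code -1" line in the reported log.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run `command`; return (status, output). Status is -1 if the timeout expired."""
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    try:
        output, _ = process.communicate(timeout=timeout_seconds)
        return process.returncode, output
    except subprocess.TimeoutExpired:
        # Only the direct child is killed here; any process that the child
        # started itself is not touched by this sketch.
        process.kill()
        output, _ = process.communicate()
        return -1, output


# Stand-in command: a Python child that prints one line and exits.
status, output = run_with_timeout([sys.executable, "-c", "print('done')"], 30)
print(status, output)
```

The caller cannot tell from the -1 alone why the child never finished, which matches what the console log shows: a timeout plus whatever the child wrote to stderr before it stalled.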
When you say that you did not see the issue before, were you polling and/or building from multiple concurrent jobs on Windows machines previously?
What version of the git program are you running on your Windows slaves?
Hi,
We hit this issue again on another slave node:
We see a number of git processes on the slave node when this issue happens.
It often happens when a user cancels a task during the git fetch step. The git process is not killed properly, and after a while there are a bunch of leftover git processes on the slave.
Here is the output. We have to restart the Jenkins service on the slave node to get it working again:
Started by user XXX
[EnvInject] - Loading node environment variables.
Building remotely on GPS-NODE (x86-windows-6.1 6.1 x86-windows windows-6.1 windows x86) in workspace d:\hudson-slave\workspace\Andy_Dev_Branch
> git rev-parse --is-inside-work-tree
Fetching changes from the remote Git repository
> git config remote.origin.url ssh://git@hardware.corp.emc.com:7999/bf/uefi_bios_moons.git
Cleaning workspace
> git rev-parse --verify HEAD
Resetting working tree
> git reset --hard
> git clean -fdx
Fetching upstream changes from ssh://git@****:7999/bf/uefi_bios_moons.git
> git --version
> git fetch --tags --progress ssh://git@***:7999/bf/uefi_bios_moons.git +refs/heads/*:refs/remotes/origin/*
FATAL: Failed to fetch from ssh://git@****:7999/bf/uefi_bios_moons.git
hudson.plugins.git.GitException: Failed to fetch from ssh://git@****:7999/bf/uefi_bios_moons.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:623)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:855)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:880)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1252)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:615)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:524)
at hudson.model.Run.execute(Run.java:1706)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:232)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress ssh://git@***:7999/bf/uefi_bios_moons.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout:
stderr: Could not create directory 'c/Users/buildfarmadmin/.ssh'.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1325)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1186)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$200(CliGitAPIImpl.java:87)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:257)
at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:118)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:326)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at hudson.remoting.Engine$1$1.run(Engine.java:58)
at java.lang.Thread.run(Unknown Source)
There seems to be a git process left running when the job is aborted or killed, which leaves the job hung the next time it starts a new build.
It has two symptoms:
1. When multiple jobs fetch from git at the same time, e.g. jobs configured with the same Git SCM repository, we hit this issue at random. However, the next time the build is kicked off, the issue does not occur.
2. The build always fails with the same message. You have to restart the Jenkins slave to resolve it. In this case we observed a lot of git processes in the process management console.
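To put a number on the leftover processes in the second symptom, the output of the Windows command `tasklist /FO CSV /NH` can be counted per image name. The sketch below is a hypothetical helper, not something Jenkins provides, and the sample rows are made up to match that command's column layout (image name, PID, session name, session number, memory usage).

```python
import csv
import io
from collections import Counter


def count_processes(tasklist_csv, image_names=("git.exe", "ssh.exe")):
    """Count rows per image name in `tasklist /FO CSV /NH` output."""
    wanted = {name.lower() for name in image_names}
    counts = Counter()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if row and row[0].lower() in wanted:
            counts[row[0].lower()] += 1
    return dict(counts)


# Made-up sample rows; real input would be the captured tasklist output.
sample = (
    '"java.exe","2100","Services","0","412,004 K"\n'
    '"git.exe","3312","Services","0","6,120 K"\n'
    '"git.exe","3388","Services","0","5,984 K"\n'
    '"ssh.exe","3420","Services","0","4,776 K"\n'
)
print(count_processes(sample))  # {'git.exe': 2, 'ssh.exe': 1}
```

Recording this count before and after an aborted build would show whether the abort is what leaves the processes behind.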
Getting the same error frequently.
Jenkins - 1.574
Git Plugin - 2.2.7
git version 1.9.5.msysgit.1
While the fetch is stuck, Process Explorer shows that ssh.exe is stuck on the command ssh git@github.faked.com "git-upload-pack 'XYZ/Faked.git'"
Below is from Process Explorer.
jenkins.exe java.exe git.exe git.exe ssh.exe // this one is stuck
While the process is stuck, running the command ssh git@github.faked.com "git-upload-pack 'XYZ/Faked.git'" from the command line gives a response that ends with:
......... 005467bd6f492ad36325aea516dfc2f423b1bc5e8dfe refs/tags/branch1 0057747b9750f2389c6ca630480674a85e1decad2387 refs/tags/branch1^{} 0000 Connection to github.faked.com closed by remote host.
From the dump generated while the process was hung:
STACK_TEXT:
0028d53c 74ee15f7 00000002 0028d58c 00000001 ntdll!NtWaitForMultipleObjects+0x15
0028d5d8 76741a0c 0028d58c 0028d600 00000000 KERNELBASE!WaitForMultipleObjectsEx+0x100
0028d620 767441f0 00000002 7efde000 00000000 kernel32!WaitForMultipleObjectsExImplementation+0xe0
0028d63c 68015424 00000002 0028d694 00000000 kernel32!WaitForMultipleObjects+0x18
The last call on the stack is ntdll!NtWaitForMultipleObjects. Judging by the function name, the process is waiting on some resource, but which one is not known at this point.
Any ideas on how to fix this, or any workarounds that are known to work?
maximin I'm afraid that I have no ideas to offer. You're running a recent version of msysgit (1.9.5), which contains (as far as I know) a recent version of ssh.
You could try switching to JGit instead of using command line git. There are some use cases which the JGit implementation in the plugin does not support (submodules, pushing tags, and several others), but for simple use cases the JGit implementation is sufficient. You may find that the age of your Jenkins installation (Jenkins 1.574 is now about 2 years old) and the version of the git plugin (2.2.x has been replaced by the 2.3.x series) may be too old to have the most recent JGit implementation fixes, but you could try JGit to see if it resolves your issue.
We're seeing this same problem with:
Jenkins - 1.617
Git Plugin - 2.3.5
git 1.9.0.msysgit.0
We started getting this problem after upgrading from a quite old version of the Git Plugin - 1.4.0, which we were using with the same version of git on the Windows slave (1.9.0.msysgit.0).
We see mostly the same behavior maximin described. We do not get the error about the .ssh directory mentioned in the original description here.
Running the git command spawned by the Jenkins slave manually in a git bash shell in the workspace works every time without delay, regardless of whether or not other jobs are hung on it in the same slave. I did this by copying the command line from process explorer on the hung git command and just pasting it in, so it's exactly the same.
Running just the ssh command gives a response but then hangs; the remote end does not close the connection:
003c29363ef2df43efb9d3e517e6f78fc7bda2f46f7e refs/tags/help 0000
However, this behavior should be fine under the Git protocol: the 0000 indicates the end of the message.
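For reference, here is a minimal sketch of how that framing (Git's pkt-line format) reads: each packet starts with four hex digits giving its total length including those four digits, and 0000 is the flush packet that ends the message. `parse_pkt_lines` is a hypothetical helper written for this comment. The sample assumes the space before 0000 in the paste above was originally a newline, which the length agrees with: 0x3c = 60 = 4 + 40 (hash) + 1 (space) + 14 (ref name) + 1 (newline).

```python
def parse_pkt_lines(data):
    """Split bytes into pkt-line payloads; a flush packet ("0000") becomes None."""
    packets = []
    offset = 0
    while offset < len(data):
        length = int(data[offset:offset + 4], 16)
        if length == 0:
            packets.append(None)  # flush packet: end of this message
            offset += 4
            continue
        if length < 4 or offset + length > len(data):
            # Lengths 1 to 3 and truncated input are not handled in this sketch.
            raise ValueError("unsupported or truncated pkt-line at offset %d" % offset)
        packets.append(data[offset + 4:offset + length])
        offset += length
    return packets


advertisement = (
    b"003c29363ef2df43efb9d3e517e6f78fc7bda2f46f7e refs/tags/help\n"
    b"0000"
)
print(parse_pkt_lines(advertisement))
```

A client that has read the flush packet has everything the server advertised, so the server keeping the connection open at that point is expected: it is waiting for the client's next request.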
I wonder if there could be a change in input/output buffering when git is run by another process and this is causing some communication deadlock. We can't reproduce this behavior with git alone (version unchanged) and never saw it with the old Git Plugin version.
jakecobb I doubt there is a change of input/output buffering when git is run by another process, but you'd need to investigate the git source code to decide that for sure.
If you're running your Windows slave as a Windows service, then you'll have real difficulty interactively duplicating the environment where the git process runs. You could try running that process from inside a Jenkins job (using a Windows Batch job step, for instance) to see if the same good behavior exists when the git program is run inside a Jenkins job.
You might also consider updating from msysgit 1.9.0 to the most recent 1.9.5 version. I don't know that it will fix your problem, but there are several useful fixes in the intervening releases between what you're running and the latest version. Among other things, the version of OpenSSH was upgraded between those two versions so that "git clone" using an ssh protocol URL is no longer limited to 1 MB / second download.
02:36:16 Started by upstream project "echidna-patch-quality" build number 335
02:36:16 originally caused by:
02:36:16 Started by command line by xxx
02:36:16 [EnvInject] - Loading node environment variables.
02:36:17 Building remotely on ECHIDNA-QUALITY (6.1 windows-6.1 windows amd64-windows amd64-windows-6.1 amd64) in workspace c:\buildfarm-slave\workspace\echidna-patch-compile
02:36:18 > git rev-parse --is-inside-work-tree
02:36:19 Fetching changes from the remote Git repository
02:36:19 > git config remote.origin.url ssh://@...:/ghts/ta
02:36:20 Fetching upstream changes from ssh://@...:/ghts/ta
02:36:20 > git --version
02:36:20 > git fetch --tags --progress ssh://@...:/ghts/ta +refs/heads/*:refs/remotes/origin/*
02:56:20 ERROR: Timeout after 20 minutes
02:56:20 FATAL: Failed to fetch from ssh://@...:/ghts/ta
02:56:20 hudson.plugins.git.GitException: Failed to fetch from ssh://bmcdiags@10.110.61.117:30000/ghts/ta
02:56:20 at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:623)
02:56:20 at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:855)
02:56:20 at hudson.plugins.git.GitSCM.checkout(GitSCM.java:880)
02:56:20 at hudson.model.AbstractProject.checkout(AbstractProject.java:1414)
02:56:20 at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:671)
02:56:20 at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
02:56:20 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:580)
02:56:20 at hudson.model.Run.execute(Run.java:1684)
02:56:20 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
02:56:20 at hudson.model.ResourceController.execute(ResourceController.java:88)
02:56:20 at hudson.model.Executor.run(Executor.java:231)
02:56:20 Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress ssh://@...:/ghts/ta +refs/heads/*:refs/remotes/origin/*" returned status code -1:
02:56:20 stdout:
02:56:20 stderr: Could not create directory 'c/Users/Administrator/.ssh'.
02:56:20
02:56:20 at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1325)
02:56:20 at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1186)
02:56:20 at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$200(CliGitAPIImpl.java:87)
02:56:20 at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:257)
02:56:20 at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
02:56:20 at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
02:56:20 at hudson.remoting.UserRequest.perform(UserRequest.java:118)
02:56:20 at hudson.remoting.UserRequest.perform(UserRequest.java:48)
02:56:20 at hudson.remoting.Request$2.run(Request.java:326)
02:56:20 at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
02:56:20 at java.util.concurrent.FutureTask.run(Unknown Source)
02:56:20 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
02:56:20 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
02:56:20 at hudson.remoting.Engine$1$1.run(Engine.java:63)
02:56:20 at java.lang.Thread.run(Unknown Source)