The issue here appears to be caused by a Java bug, https://bugs.openjdk.java.net/browse/JDK-8068370. The issue is possible when multiple threads open a file for writing, close it, and then execute them (each thread using its own file). Even if all files are closed "properly", due to how file handles work around fork/exec, a child process in one thread may inherit the handle to anther thread's open file, and thus break that thread's later subprocess call.
The two usual ways to avoid this issue is to either open files with O_CLOEXEC so they get auto-closed in child processes, or by closing open file descriptors on the child side of fork(). Java appears to do neither of those things, and a simple test program like the ones in the linked openjdk bug show that the problem is still there, on at least OpenJDK Java 8 and Oracle Java 9.
I had a go at possible workarounds, and the options don't look thrilling. GIT_SSH_COMMAND does the same thing as GIT_SSH, failing when someone has the file open for writing. Patching the JDK is not feasible, and it looks like all files are opened in the unsafe manner so there is no "advanced" api to try. One thing that does work is creating the file outside Java, in a child process, or making a copy of the file and using that – with a subprocess cp, Java's Files.copy triggers the same race too. I'll a testcase that shows how threads+files+subprocesses don't work properly, and shows these two workarounds.
The nature of this problem is such that it will randomly trigger for people that have a pipeline that creates a large number of sub-builds at the exact same time, and there doesn't appear to be a good user workaround right now other than not using SSHUserPrivateKey credentials at all. I think there should be some workaround in the git plugin, even if it's ugly.
markewaite do you think createUnixGitSSH could be made to do a cp internally to avoid this race? Something like
$ diff -u a/CliGitAPIImpl.java b/CliGitAPIImpl.java
--- a/CliGitAPIImpl.java 2018-01-08 16:01:40.711864222 +0100
+++ b/CliGitAPIImpl.java 2018-01-08 16:04:27.638664845 +0100
@@ -1936,6 +1936,7 @@
private File createUnixGitSSH(File key, String user) throws IOException {
File ssh = createTempFile("ssh", ".sh");
+ File ssh_copy = new File(ssh.toString() + "-copy");
try (PrintWriter w = new PrintWriter(ssh, Charset.defaultCharset().toString())) {
w.println("#!/bin/sh");
@@ -1945,8 +1946,12 @@
w.println("fi");
w.println("ssh -i \"" + key.getAbsolutePath() + "\" -l \"" + user + "\" -o StrictHostKeyChecking=no \"$@\"");
}
- ssh.setExecutable(true, true);
- return ssh;
+ + + new ProcessBuilder("cp", ssh.toString(), ssh_copy.toString()).start().waitFor();
+ ssh.delete();
+ ssh_copy.setExecutable(true, true);
+ return ssh_copy;
}
(draft, untested)
The "try with resources" technique used in CliGitAPIImpl has been more reliable (for me at least) than explicit calls to close files. It was used to resolve one or more file handle leaks a few years ago.
The command that is being called by the plugin is git ls-remote -h -t git.example.org:/var/git/foo.git. The plugin does not execute that shell script directly, but relies on git to execute the shell script. The environment variables set prior to that call refer to the temporary file which was written (and closed) prior to the call of the git command.
If you find a repeatable way to see the problem, please update the bug report with that information.