-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
Powered by SuggestiMate -
2.383
jenkins.security.Security218Test#jnlpSlave fails not infrequently on Windows as in e.g. this run. I suspect this is due to the split of instance-identity. Based on the logs of that run and other similar runs, Jenkins#getAgentProtocols must be returning the empty set, and I suspect the reason why is that the initialization of JnlpSlaveAgentProtocol4 (which depends on instance-identity having been initialized) is racing with the initialization of instance-identity. Our Windows test agents are very slow, so the problem can probably be reproduced by adding a sleep statement somewhere.
[JENKINS-70206] jenkins.security.Security218Test#jnlpSlave is flaky
Well after reading through the code, the only plausible alternative to a problem while invoking the constructor of JnlpSlaveAgentProtocol4 that I can see would be a failure to even attempt to invoke the constructor of JnlpSlaveAgentProtocol4. We discover components via SezPoz but instantiate them via Guice, so I suppose it is possible that SezPoz is not discovering JnlpSlaveAgentProtocol4 (and would be consistent with the lack of an exception from JnlpSlaveAgentProtocol4's constructor in the logs). On the other hand at a high level it seems unlikely to me that something so basic could be broken for so long without us noticing in some other context, whereas at a high level it seems far more likely to me that changing the linkage and order of operations around initializing JnlpSlaveAgentProtocol4 and its dependencies through the instance-identity split would create a new race condition (or expose an existing one): perhaps not via the JnlpSlaveAgentProtocol4 constructor failing outright, but perhaps via some other disturbance in timing that causes Guice to not be able to find the instance when asked for it.
I added the following debugging code:
diff --git a/core/src/main/java/hudson/ExtensionFinder.java b/core/src/main/java/hudson/ExtensionFinder.java index f4c0266753..4b15bfb601 100644 --- a/core/src/main/java/hudson/ExtensionFinder.java +++ b/core/src/main/java/hudson/ExtensionFinder.java @@ -251,6 +251,7 @@ public abstract class ExtensionFinder implements ExtensionPoint { private final Map<Key, Annotation> annotations = new HashMap<>(); private final Sezpoz moduleFinder = new Sezpoz(); + private final List<IndexItem<?, Object>> sezpozIndices; /** * Map from {@link GuiceExtensionAnnotation#annotationType} to {@link GuiceExtensionAnnotation} @@ -260,7 +261,8 @@ public abstract class ExtensionFinder implements ExtensionPoint { public GuiceFinder() { refreshExtensionAnnotations(); - SezpozModule extensions = new SezpozModule(loadSezpozIndices(Jenkins.get().getPluginManager().uberClassLoader)); + sezpozIndices = loadSezpozIndices(Jenkins.get().getPluginManager().uberClassLoader); + SezpozModule extensions = new SezpozModule(sezpozIndices); List<Module> modules = new ArrayList<>(); modules.add(new AbstractModule() { @@ -317,6 +319,10 @@ public abstract class ExtensionFinder implements ExtensionPoint { return container; } + public List<IndexItem<?, Object>> getSezpozIndices() { + return sezpozIndices; + } + /** * The basic idea is: * diff --git a/test/src/test/java/jenkins/security/Security218Test.java b/test/src/test/java/jenkins/security/Security218Test.java index f278b3f507..63f97fe388 100644 --- a/test/src/test/java/jenkins/security/Security218Test.java +++ b/test/src/test/java/jenkins/security/Security218Test.java @@ -1,13 +1,26 @@ package jenkins.security; import static org.hamcrest.MatcherAssert.assertThat; +import static org.hamcrest.Matchers.containsInAnyOrder; import static org.hamcrest.Matchers.containsString; +import static org.hamcrest.Matchers.hasSize; +import static org.hamcrest.Matchers.in; +import static org.hamcrest.Matchers.instanceOf; +import static org.junit.Assert.assertNotNull; import static org.junit.Assert.assertThrows; +import com.google.inject.Key; +import hudson.ExtensionFinder; +import hudson.ExtensionList; +import hudson.cli.declarative.CLIRegisterer; import hudson.slaves.DumbSlave; import java.io.IOException; import java.io.Serializable; +import java.util.List; +import java.util.Set; import java.util.logging.Level; +import java.util.stream.Collectors; +import net.java.sezpoz.IndexItem; import org.codehaus.groovy.runtime.MethodClosure; import org.junit.Rule; import org.junit.Test; @@ -47,6 +60,22 @@ public class Security218Test implements Serializable { */ @Test public void jnlpSlave() throws Exception { + ExtensionList<ExtensionFinder> extensionFinders = j.jenkins.getExtensionList(ExtensionFinder.class); + assertThat(extensionFinders, hasSize(2)); + assertThat(extensionFinders, containsInAnyOrder(instanceOf(ExtensionFinder.GuiceFinder.class), instanceOf(CLIRegisterer.class))); + for (ExtensionFinder f : extensionFinders) { + if (f instanceof ExtensionFinder.GuiceFinder) { + ExtensionFinder.GuiceFinder finder = (ExtensionFinder.GuiceFinder) f; + List<String> sezpozNames = finder.getSezpozIndices().stream().map(IndexItem::className).collect(Collectors.toList()); + assertThat("jenkins.slaves.JnlpSlaveAgentProtocol4", in(sezpozNames)); + List<String> bindingTypes = finder.getContainer().getBindings().keySet().stream().map(Key::getTypeLiteral).map(Object::toString).collect(Collectors.toList()); + assertThat("jenkins.slaves.JnlpSlaveAgentProtocol4", in(bindingTypes)); + Object o = finder.getContainer().getBindings().entrySet().stream().filter(e -> e.getKey().getTypeLiteral().toString().equals("jenkins.slaves.JnlpSlaveAgentProtocol4")).map(e -> e.getValue().getProvider().get()); + assertNotNull(o); + } + } + Set<String> agentProtocols = j.jenkins.getAgentProtocols(); + assertThat(agentProtocols, containsInAnyOrder("JNLP4-connect", "Ping")); DumbSlave a = (DumbSlave) inboundAgents.createAgent(j, InboundAgentRule.Options.newBuilder().secret().build()); j.waitOnline(a); try {
With that I got the following exception:
Expected: iterable with items ["JNLP4-connect", "Ping"] in any order but: no item matches: "JNLP4-connect", "Ping" in [] at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at jenkins.security.Security218Test.jnlpSlave(Security218Test.java:78) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54) at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:605) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.lang.Thread.run(Thread.java:833)
So SezPoz discovered JnlpSlaveAgentProtocol4 correctly, and Guice can instantiate it correctly, but yet ExtensionList.lookup(AgentProtocol.class) is not returning it? This does not make any sense to me.
https://ci.jenkins.io/job/Core/job/jenkins/job/PR-7479/6/testReport/junit/jenkins.security/Security218Test/Windows_jdk11___Windows_Build___Test___jnlpSlave/ shows the failure pathology with the following instrumentation:
diff --git a/core/src/main/java/hudson/ExtensionFinder.java b/core/src/main/java/hudson/ExtensionFinder.java index f4c0266753..4b15bfb601 100644 --- a/core/src/main/java/hudson/ExtensionFinder.java +++ b/core/src/main/java/hudson/ExtensionFinder.java @@ -251,6 +251,7 @@ public abstract class ExtensionFinder implements ExtensionPoint { private final Map<Key, Annotation> annotations = new HashMap<>(); private final Sezpoz moduleFinder = new Sezpoz(); + private final List<IndexItem<?, Object>> sezpozIndices; /** * Map from {@link GuiceExtensionAnnotation#annotationType} to {@link GuiceExtensionAnnotation} @@ -260,7 +261,8 @@ public abstract class ExtensionFinder implements ExtensionPoint { public GuiceFinder() { refreshExtensionAnnotations(); - SezpozModule extensions = new SezpozModule(loadSezpozIndices(Jenkins.get().getPluginManager().uberClassLoader)); + sezpozIndices = loadSezpozIndices(Jenkins.get().getPluginManager().uberClassLoader); + SezpozModule extensions = new SezpozModule(sezpozIndices); List<Module> modules = new ArrayList<>(); modules.add(new AbstractModule() { @@ -317,6 +319,10 @@ public abstract class ExtensionFinder implements ExtensionPoint { return container; } + public List<IndexItem<?, Object>> getSezpozIndices() { + return sezpozIndices; + } + /** * The basic idea is: * diff --git a/core/src/main/java/jenkins/model/Jenkins.java b/core/src/main/java/jenkins/model/Jenkins.java index d065df8065..dd42ab2ad0 100644 --- a/core/src/main/java/jenkins/model/Jenkins.java +++ b/core/src/main/java/jenkins/model/Jenkins.java @@ -639,6 +639,11 @@ public class Jenkins extends AbstractCIBase implements DirectlyModifiableTopLeve */ private static final boolean SLAVE_AGENT_PORT_ENFORCE = SystemProperties.getBoolean(Jenkins.class.getName() + ".slaveAgentPortEnforce", false); + @CheckForNull + public synchronized List<String> getDisabledAgentProtocols() { + return disabledAgentProtocols; + } + /** * The TCP agent protocols that are explicitly disabled (we store the disabled ones so that newer protocols * are enabled by default). Will be {@code null} instead of empty to simplify XML format. @@ -654,6 +659,11 @@ public class Jenkins extends AbstractCIBase implements DirectlyModifiableTopLeve @Deprecated private transient String[] _disabledAgentProtocols; + @CheckForNull + public synchronized List<String> getEnabledAgentProtocols() { + return enabledAgentProtocols; + } + /** * The TCP agent protocols that are {@link AgentProtocol#isOptIn()} and explicitly enabled. * Will be {@code null} instead of empty to simplify XML format. diff --git a/test/src/test/java/jenkins/security/Security218Test.java b/test/src/test/java/jenkins/security/Security218Test.java index f278b3f507..fa132084ca 100644 --- a/test/src/test/java/jenkins/security/Security218Test.java +++ b/test/src/test/java/jenkins/security/Security218Test.java @@ -1,13 +1,32 @@ package jenkins.security; import static org.hamcrest.MatcherAssert.assertThat; +import static org.hamcrest.Matchers.containsInAnyOrder; import static org.hamcrest.Matchers.containsString; +import static org.hamcrest.Matchers.hasSize; +import static org.hamcrest.Matchers.in; +import static org.hamcrest.Matchers.instanceOf; +import static org.hamcrest.Matchers.nullValue; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertNotNull; import static org.junit.Assert.assertThrows; +import static org.junit.Assert.assertTrue; +import com.google.inject.Key; +import hudson.ExtensionFinder; +import hudson.ExtensionList; +import hudson.TcpSlaveAgentListener; +import hudson.cli.declarative.CLIRegisterer; import hudson.slaves.DumbSlave; import java.io.IOException; import java.io.Serializable; +import java.util.List; +import java.util.Set; import java.util.logging.Level; +import java.util.stream.Collectors; +import jenkins.AgentProtocol; +import jenkins.slaves.JnlpSlaveAgentProtocol4; +import net.java.sezpoz.IndexItem; import org.codehaus.groovy.runtime.MethodClosure; import org.junit.Rule; import org.junit.Test; @@ -47,6 +66,36 @@ public class Security218Test implements Serializable { */ @Test public void jnlpSlave() throws Exception { + ExtensionList<ExtensionFinder> extensionFinders = j.jenkins.getExtensionList(ExtensionFinder.class); + assertThat(extensionFinders, hasSize(2)); + assertThat(extensionFinders, containsInAnyOrder(instanceOf(ExtensionFinder.GuiceFinder.class), instanceOf(CLIRegisterer.class))); + for (ExtensionFinder f : extensionFinders) { + if (f instanceof ExtensionFinder.GuiceFinder) { + ExtensionFinder.GuiceFinder finder = (ExtensionFinder.GuiceFinder) f; + List<String> sezpozNames = finder.getSezpozIndices().stream().map(IndexItem::className).collect(Collectors.toList()); + assertThat("jenkins.slaves.JnlpSlaveAgentProtocol4", in(sezpozNames)); + List<String> bindingTypes = finder.getContainer().getBindings().keySet().stream().map(Key::getTypeLiteral).map(Object::toString).collect(Collectors.toList()); + assertThat("jenkins.slaves.JnlpSlaveAgentProtocol4", in(bindingTypes)); + Object o = finder.getContainer().getBindings().entrySet().stream().filter(e -> e.getKey().getTypeLiteral().toString().equals("jenkins.slaves.JnlpSlaveAgentProtocol4")) + .map(e -> e.getValue().getProvider().get()).findFirst().orElse(null); + assertNotNull(o); + } + } + ExtensionList<AgentProtocol> agentProtocolExtensions = ExtensionList.lookup(AgentProtocol.class); + assertThat(agentProtocolExtensions, containsInAnyOrder(instanceOf(JnlpSlaveAgentProtocol4.class), instanceOf(TcpSlaveAgentListener.PingAgentProtocol.class))); + for (AgentProtocol p : agentProtocolExtensions) { + assertNotNull(p.getClass().getName(), p.getName()); + if (p instanceof TcpSlaveAgentListener.PingAgentProtocol) { + assertTrue(p.isRequired()); + } else if (p instanceof JnlpSlaveAgentProtocol4) { + assertFalse(p.isRequired()); + } + assertFalse(p.getClass().getName(), p.isOptIn()); + } + assertThat(j.jenkins.getDisabledAgentProtocols(), nullValue()); + assertThat(j.jenkins.getEnabledAgentProtocols(), nullValue()); + Set<String> agentProtocols = j.jenkins.getAgentProtocols(); + assertThat(agentProtocols, containsInAnyOrder("JNLP4-connect", "Ping")); DumbSlave a = (DumbSlave) inboundAgents.createAgent(j, InboundAgentRule.Options.newBuilder().secret().build()); j.waitOnline(a); try {
The stack trace is:
Expected: iterable with items ["JNLP4-connect", "Ping"] in any order but: no item matches: "JNLP4-connect", "Ping" in [] at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at jenkins.security.Security218Test.jnlpSlave(Security218Test.java:98) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54) at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:605) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.lang.Thread.run(Thread.java:829)
Since we just ran all of the same logic just a few lines above in the debug code and got a different result, the only plausible explanation is that Jenkins#getAgentProtocols is caching a bad value from earlier on, even though the values are good at the time my debug code runs. In other words, I am increasingly suspecting that a change in timing is responsible for this flakiness, which is now impacting the vast majority of Windows test runs.
OTOH lack of instance-identity would mean a KeyStoreException is thrown from the extension constructor, which should be recorded as a stack trace in the logs.
Proof in https://ci.jenkins.io/job/Core/job/jenkins/job/PR-7479/6/testReport/junit/jenkins.security/Security218Test/Windows_jdk11___Windows_Build___Test___jnlpSlave/ that this change in timing is a result of the instance-identity split: a KeyStoreException thrown from the extension constructor, recorded as a stack trace in the logs:
0.122 [id=101] WARNING h.ExtensionFinder$GuiceFinder$FaultTolerantScope$1#error: Failed to instantiate Key[type=jenkins.slaves.JnlpSlaveAgentProtocol4, annotation=[none]]; skipping this component java.security.KeyStoreException: JENKINS-41987: no RSAPrivateKey found; perhaps instance-identity plugin is not installed at jenkins.slaves.JnlpSlaveAgentProtocol4.<init>(JnlpSlaveAgentProtocol4.java:110) at jenkins.slaves.JnlpSlaveAgentProtocol4$$FastClassByGuice$$338020067.GUICE$TRAMPOLINE(<generated>) at jenkins.slaves.JnlpSlaveAgentProtocol4$$FastClassByGuice$$338020067.apply(<generated>) at com.google.inject.internal.DefaultConstructionProxyFactory$FastClassProxy.newInstance(DefaultConstructionProxyFactory.java:82) at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:114) at com.google.inject.internal.ConstructorInjector.access$000(ConstructorInjector.java:33) at com.google.inject.internal.ConstructorInjector$1.call(ConstructorInjector.java:98) at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:109) at hudson.ExtensionFinder$GuiceFinder$SezpozModule.onProvision(ExtensionFinder.java:574) at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:117) at com.google.inject.internal.ProvisionListenerStackCallback.provision(ProvisionListenerStackCallback.java:66) at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:93) at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:296) at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40) Caused: com.google.inject.ProvisionException: Unable to provision, see the following errors: 1) [Guice/ErrorInjectingConstructor]: KeyStoreException: JENKINS-41987: no RSAPrivateKey found; perhaps instance-identity plugin is not installed at JnlpSlaveAgentProtocol4.<init>(JnlpSlaveAgentProtocol4.java:102) Learn more: https://github.com/google/guice/wiki/ERROR_INJECTING_CONSTRUCTOR 1 error ====================== Full classname legend: ====================== JnlpSlaveAgentProtocol4: "jenkins.slaves.JnlpSlaveAgentProtocol4" KeyStoreException: "java.security.KeyStoreException" ======================== End of classname legend: ======================== at com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:251) at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43) at com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:169) at hudson.ExtensionFinder$GuiceFinder$FaultTolerantScope$1.get(ExtensionFinder.java:450) at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45) at com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1100) at hudson.ExtensionFinder$GuiceFinder._find(ExtensionFinder.java:408) at hudson.ExtensionFinder$GuiceFinder.find(ExtensionFinder.java:399) at hudson.ClassicPluginStrategy.findComponents(ClassicPluginStrategy.java:359) at hudson.ExtensionList.load(ExtensionList.java:384) at hudson.ExtensionList.ensureLoaded(ExtensionList.java:320) at hudson.ExtensionList.iterator(ExtensionList.java:172) at jenkins.AgentProtocol.of(AgentProtocol.java:111) at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:278) 0.342 [id=102] INFO o.jvnet.hudson.test.JenkinsRule#createWebServer: Running on http://localhost:52992/jenkins/
Hey i would like to work on this issue as it would help me build the understanding of the jenkins core. from where should i start to work on this ?
Thanks for detailed investigation basil! That seems to confirm that my proposed fix
delay the initialization of all fields unless and until getName is called
could work. Should be a pretty simple change. vandit1604 would you like to try?
Assignee | New: Vandit Singh [ vandit1604 ] |
jglick by the fields to be initialized first you are referring isOptIn and getDisplayName right?
but im not able to understand how here
RSAPrivateKey privateKey = InstanceIdentityProvider.RSA.getPrivateKey(); if (privateKey == null) { throw new KeyStoreException("JENKINS-41987: no RSAPrivateKey found; perhaps instance-identity plugin is not installed"); }
privateKey must be null that's why we got this in stacktrace
0.122 [id=101] WARNING h.ExtensionFinder$GuiceFinder$FaultTolerantScope$1#error: Failed to instantiate Key[type=jenkins.slaves.JnlpSlaveAgentProtocol4, annotation=[none]]; skipping this component java.security.KeyStoreException: JENKINS-41987: no RSAPrivateKey found; perhaps instance-identity plugin is not installed
but if privateKey is null then here
public PRIV getPrivateKey() { InstanceIdentityProvider<PUB, PRIV> provider = get(this); try { return provider == null ? null : provider.getPrivateKey(); } catch (RuntimeException e) { LOGGER.log(Level.WARNING, "Instance identity provider " + provider + " propagated a runtime exception", e); return null; } catch (Error e) { LOGGER.log(Level.INFO, "Encountered an error while consulting instance identity provider " + provider, e); throw e; } catch (Throwable e) { LOGGER.log(Level.SEVERE, "Instance identity provider " + provider + " propagated an uncaught exception", e); return null; } }
if privateKey is null here then shouldn't the method getPrivateKey call itself until provider is not null anymore and pass that to privateKey
RSAPrivateKey privateKey = InstanceIdentityProvider.RSA.getPrivateKey();
Well, the logs do say JNLP4-connect is disabled. OTOH lack of instance-identity would mean a KeyStoreException is thrown from the extension constructor, which should be recorded as a stack trace in the logs.
If it helps, we could delay the initialization of all fields unless and until getName is called, at which point we could return null (after logging a warning!) if instance-identity is missing. This would also allow startup to run a bit faster, though you would need to do the same work at some point unless you were using solely outbound and/or WebSocket agents.