We are currently seeing "HTTP/504 Gateway Timeout" errors when logging in to the infra.ci.jenkins.io instance.
What we have been able to verify so far:
- The issue only affects the infra.ci.jenkins.io instance (which is not publicly available: it requires a VPN connection). We don’t see this login issue on ci.jenkins.io or on release.jenkins.io.
- The issue happens when a user tries to log in. Jenkins tries to contact the LDAP server to verify the identity, and after 1 min it answers with an HTTP/504. The logs report a read timeout while contacting the LDAP server.
- All jobs continue to build without any error (incoming webhooks included).
- When the error appears, all login attempts fail for the next few minutes. After ~15 min everything goes back to normal and we can log in again.
- During these 15 min of “errors”:
  - LDAP connections work fine from inside the Jenkins container using either the curl or the ldapsearch CLI. For example, a curl request executed from the Jenkins container while the HTTP/504 errors were happening succeeded.
  - No other service (either in the same cluster or elsewhere) is affected by LDAP timeouts.
  - The LDAP server does not log any error, and with tcpdump the requests from Jenkins are not seen (while my curl requests are).
- We tried to fine-tune the LDAP JNDI pooling (https://github.com/jenkins-infra/charts/pull/888) by decreasing the read/write timeout and disabling connection pooling. The timeouts now happen earlier, and the “15 min error period” is now closer to 5 min, which makes it easier to ignore, but it is still problematic.
- release.jenkins.io is the instance closest in terms of setup: same Kubernetes cluster, same LDAP settings, same network route. It does not have the issue.
- BUT release.jenkins.io uses the Jenkins LTS (2.263.4), which has the LDAP plugin in version 1.26 (https://plugins.jenkins.io/ldap/#releases), a version WITHOUT the new Spring Security effort. The failing “infra.ci” instance uses the weekly core (2.281 as of now), which has the LDAP plugin in version 2.3. We tried the latest 2.4 version of the plugin and the behavior is still the same.
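Since curl and ldapsearch succeed while Jenkins itself times out, one way to narrow this down could be to go through the same JNDI code path from inside the container. The sketch below is only an illustration and is not taken from the chart change linked above: the class name `LdapJndiProbe`, the timeout values and the URL passed on the command line are all assumptions. It relies on the standard JNDI LDAP provider properties `com.sun.jndi.ldap.connect.timeout`, `com.sun.jndi.ldap.read.timeout` and `com.sun.jndi.ldap.connect.pool`.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical helper, not part of Jenkins or of the LDAP plugin.
final class LdapJndiProbe {

    /** Builds a JNDI environment with explicit timeouts, in milliseconds. */
    static Hashtable<String, String> buildEnv(String url, int connectTimeoutMs,
                                              int readTimeoutMs, boolean pooled) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        // Fail fast instead of waiting for the reverse proxy to give up.
        env.put("com.sun.jndi.ldap.connect.timeout", Integer.toString(connectTimeoutMs));
        env.put("com.sun.jndi.ldap.read.timeout", Integer.toString(readTimeoutMs));
        // "false" forces a fresh connection for every context.
        env.put("com.sun.jndi.ldap.connect.pool", Boolean.toString(pooled));
        return env;
    }

    /** Opens an anonymous context, reads one entry and returns the elapsed milliseconds. */
    static long probe(Hashtable<String, String> env) throws NamingException {
        long start = System.nanoTime();
        DirContext ctx = new InitialDirContext(env);
        try {
            // Reads the entry named by the provider URL (the root DSE if the URL has no base DN).
            ctx.getAttributes("");
        } finally {
            ctx.close();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java LdapJndiProbe.java <ldap-url>");
            return;
        }
        // Assumed values: 5 s to connect, 10 s to read, pooling disabled.
        Hashtable<String, String> env = buildEnv(args[0], 5000, 10000, false);
        try {
            System.out.println("LDAP answered in " + probe(env) + " ms");
        } catch (NamingException e) {
            // A read timeout or a refused anonymous read both end up here.
            System.out.println("LDAP probe failed: " + e);
        }
    }
}
```

Running it in a loop during an error window, with pooling on and off, might show whether the JNDI client alone reproduces the read timeout or whether the problem only appears inside the Jenkins process.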
Next steps: check the JVM metrics (GC, connection pools, threads, etc.) to see if we can find a correlation.
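As a rough idea of what such a check could look like, here is a minimal sketch that takes a one-shot snapshot of thread, GC and heap figures through the standard `java.lang.management` MXBeans. The class name `JvmSnapshot` and the output keys are made up for illustration. The snapshot describes the JVM it runs in, so to be useful it would have to be executed inside the Jenkins JVM or adapted to a remote JMX connection, which is not shown here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: prints a few JVM figures to correlate with login failures.
final class JvmSnapshot {

    /** Returns "key=value" lines describing threads, GC and heap of the current JVM. */
    static String snapshot() {
        StringBuilder sb = new StringBuilder();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        sb.append("threads.live=").append(threads.getThreadCount()).append('\n');
        sb.append("threads.peak=").append(threads.getPeakThreadCount()).append('\n');

        // Count threads stuck waiting for a monitor; entries are null for threads that just died.
        long blocked = 0;
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        sb.append("threads.blocked=").append(blocked).append('\n');

        // Cumulative collection counts and times since JVM start, one pair per collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append("gc.").append(gc.getName()).append(".count=")
              .append(gc.getCollectionCount()).append('\n');
            sb.append("gc.").append(gc.getName()).append(".timeMs=")
              .append(gc.getCollectionTime()).append('\n');
        }

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        sb.append("heap.used=").append(heap.getUsed()).append('\n');
        sb.append("heap.committed=").append(heap.getCommitted()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(snapshot());
    }
}
```

Taking one snapshot during normal operation and one during an error window, then comparing blocked threads and GC time between the two, is one possible way to look for the correlation.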