Tested with versions 1235.v2db_ddd9f797b and 1257.v89e586d3c58c; 3.11.2
Performance of downstreamPipelineTriggerRunListener degrades as the number of dependencies grows, and the listener often stops working because of DB connection pool timeouts.
Here is the output of one of our jobs before the fix:
[withMaven] downstreamPipelineTriggerRunListener - completed in 62394803 ms
62,394,803 ms ≈ 1,039 min ≈ 17.33 h
We have a very large setup with thousands of builds. downstreamPipelineTriggerRunListener can take several hours and fails with DB connection pool timeouts.
Analysis showed that the time is lost calculating transitive upstreams: the same information is read several thousand times by listTransitiveUpstreamJobs().
In thread dumps we see that all threads are hanging here:
at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:118)
at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52)
at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java)
at org.jenkinsci.plugins.pipeline.maven.dao.AbstractPipelineMavenPluginDao.listUpstreamPipelinesBasedOnMavenDependencies(AbstractPipelineMavenPluginDao.java:912)
at org.jenkinsci.plugins.pipeline.maven.dao.AbstractPipelineMavenPluginDao.listUpstreamJobs(AbstractPipelineMavenPluginDao.
In one example thread dump we see 62 threads executing listUpstreamPipelinesBasedOnMavenDependencies.
All threads executing HikariProxyPreparedStatement.executeQuery are called from listUpstreamPipelinesBasedOnMavenDependencies.
After fixing this problem by reducing the number of SQL statements and optimizing the SQL, we saw another problem:
triggering the downstream jobs led to a bottleneck at the lock in hudson.model.Queue.schedule2(); we saw up to 60 threads waiting for the lock at one moment.
We did several improvements:
1) Reduce the very high number of SQL statements caused by the recursion in listUpstreamJobs
listTransitiveUpstreamJobs uses a recursion that calls listUpstreamJobs several times, so even a single call of listTransitiveUpstreamJobs may read the same information several times.
On top of that, DownstreamPipelineTriggerRunListener calls listTransitiveUpstreamJobs for all downstream jobs, and their upstreams normally overlap.
We use an UpstreamMemory object, created at the beginning of downstreamPipelineTriggerRunListener, that remembers the already calculated upstreams.
We have observed several ten thousand cache hits in UpstreamMemory for the calculation of a single job. With this fix the calculation works again (before, we had DB connection pool timeouts), and about 90% of the executions of this complex SQL statement are saved.
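The idea can be sketched roughly as follows (illustrative Python, not the plugin's Java; `load_upstream_jobs` stands in for the real DAO query):

```python
class UpstreamMemory:
    """Remembers already calculated upstreams for the duration of one
    listener run, so the expensive DB query runs only once per job."""

    def __init__(self, load_upstream_jobs):
        self._load = load_upstream_jobs  # the expensive DB lookup
        self._cache = {}                 # job -> list of direct upstream jobs
        self.hits = 0                    # saved SQL round trips

    def list_upstream_jobs(self, job):
        if job in self._cache:
            self.hits += 1
        else:
            self._cache[job] = self._load(job)
        return self._cache[job]

    def list_transitive_upstream_jobs(self, job):
        """Walk the upstream graph; repeated visits are served from the cache."""
        result, stack = set(), [job]
        while stack:
            for upstream in self.list_upstream_jobs(stack.pop()):
                if upstream not in result:
                    result.add(upstream)
                    stack.append(upstream)
        return result
```

One UpstreamMemory instance is shared across all listTransitiveUpstreamJobs calls of a listener run, which is where the repeated reads are eliminated.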
2) Tuned the SQL Query for listUpstreamPipelinesBasedOnMavenDependencies
On PostgreSQL with very large data, the query performs a complex join and does not use the existing index on the jenkins_job table.
An EXPLAIN ANALYZE of the query shows this sequential scan:
Seq Scan on jenkins_job upstream_job  (cost=0.00..81.58 rows=3086 width=41)
  Filter: (jenkins_master_id = 1)
This is the reason for the bad performance: the query takes several minutes!
In our pull request you can find a rewrite of the SQL and some documentation. We divide the problem into two parts: first we read jenkins_job, which returns a single result; then we execute the join without the problematic table.
The two queries now take only a few milliseconds.
https://learnsql.com/blog/sql-join-only-first-row/ shows another solution that works. We tried out Solution 4 and used a 'LIMIT 1' hint for the jenkins_job table. It worked on PostgreSQL, but it is not clear whether it works for all supported databases, so we decided to use the two-queries approach.
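The shape of the two-queries approach can be illustrated with sqlite3 and a deliberately simplified, hypothetical schema (the plugin's real tables, columns, and join are more complex):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jenkins_job (
        id INTEGER PRIMARY KEY, jenkins_master_id INT, full_name TEXT);
    CREATE TABLE jenkins_build (
        id INTEGER PRIMARY KEY, job_id INT, number INT);
    INSERT INTO jenkins_job VALUES (1, 1, 'team/app');
    INSERT INTO jenkins_build VALUES (10, 1, 41), (11, 1, 42);
""")

# Before (sketch): one big query that joined jenkins_job into the complex
# join, which on PostgreSQL degraded into a sequential scan of jenkins_job.

# Step 1: resolve the single jenkins_job row on its own --
# a cheap indexed point lookup that returns exactly one result.
(job_id,) = conn.execute(
    "SELECT id FROM jenkins_job WHERE jenkins_master_id = ? AND full_name = ?",
    (1, "team/app"),
).fetchone()

# Step 2: run the remaining join without the problematic table,
# filtering by the id obtained in step 1.
builds = conn.execute(
    "SELECT number FROM jenkins_build WHERE job_id = ? ORDER BY number",
    (job_id,),
).fetchall()
```

Splitting the lookup out of the join lets the planner use the primary-key/index access path for jenkins_job instead of scanning it inside the join.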
3) Added a memory for getGeneratedArtifacts and listDownstreamJobsByArtifact during execution of DownstreamPipelineTriggerRunListener#onCompleted to reduce SQL statements
This avoids redundant SQL calls and makes it easier to analyze the SQL logs.
4) Reduced the lock contention in hudson.model.Queue.schedule2()
hudson.model.Queue.schedule2() uses a lock object and does not scale. We have seen over 50 threads waiting for the lock in one thread dump.
Before we trigger a downstream job, we therefore check whether it is already in the queue. Here we found a bug in the existing implementation: WorkflowJob.isInQueue() always returns 'false'. You have to use Jenkins.get().getQueue().contains((Task) job) to get the correct answer.
We also fixed this for the existing check on upstream jobs, which did not work before.
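The pre-check can be sketched like this (illustrative Python; `queue_contains` stands in for Jenkins.get().getQueue().contains((Task) job), and `schedule` for the lock-protected scheduling call):

```python
def trigger_downstreams(downstreams, queue_contains, schedule):
    """Only call the lock-protected schedule() for jobs that are not
    already queued, keeping threads out of Queue.schedule2()."""
    triggered = []
    for job in downstreams:
        if queue_contains(job):   # cheap read; no scheduling lock taken
            continue
        schedule(job)             # expensive: takes the queue lock
        triggered.append(job)
    return triggered
```

With most already-queued jobs filtered out up front, far fewer threads contend for the schedule2() lock.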
5) We have seen that too many downstream jobs were started.
Downstream jobs with matching transitive upstreams were often triggered anyway, especially for long-running jobs. The reason was that listTransitiveUpstreamJobs returned an empty list for jobs that were running at that moment.
Fix: if we get an empty list, we check listTransitiveUpstreamJobs for the last successful build.
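The fallback can be sketched as follows (illustrative Python; the function parameters are placeholders for the plugin's DAO and Jenkins calls):

```python
def transitive_upstreams_with_fallback(job, build, list_transitive, last_successful_build):
    """listTransitiveUpstreamJobs may return an empty list while the job
    is still running; fall back to the last successful build so the
    skip-check still sees the real upstream graph."""
    upstreams = list_transitive(job, build)
    if not upstreams:
        previous = last_successful_build(job)
        if previous is not None:
            upstreams = list_transitive(job, previous)
    return upstreams
```

The downstream is then only triggered when this (possibly fallback-based) upstream set does not already cover it.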
BEFORE Pull Request: [withMaven] downstreamPipelineTriggerRunListener - completed in 62394803 ms
AFTER Pull Request: [withMaven] downstreamPipelineTriggerRunListener - completed in 534 ms