Details
Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Labels: None
Environment: Tested with Version 1235.v2db_ddd9f797b
Fix Version/s: 1257.v89e586d3c58c ; 3.11.2
Description
PROBLEM
The performance of downstreamPipelineTriggerRunListener degrades as the number of dependencies grows, and it often stops working entirely because the DB connection pool times out.
Here is the output of one of our jobs before the fix:
[withMaven] downstreamPipelineTriggerRunListener - completed in 62394803 ms
62,394,803 ms ≈ 1,040 min ≈ 17.33 h
We have a very large setup with thousands of builds. downstreamPipelineTriggerRunListener takes up to several hours and fails because of DB connection pool timeouts.
ANALYSIS
The analysis showed that the time is lost calculating transitive upstreams: the same information is read several thousand times to compute listTransitiveUpstreamJobs().
In thread dumps we see that all threads are hanging here:
at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:118)
at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52)
at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java)
at org.jenkinsci.plugins.pipeline.maven.dao.AbstractPipelineMavenPluginDao.listUpstreamPipelinesBasedOnMavenDependencies(AbstractPipelineMavenPluginDao.java:912)
at org.jenkinsci.plugins.pipeline.maven.dao.AbstractPipelineMavenPluginDao.listUpstreamJobs(AbstractPipelineMavenPluginDao.
In one example thread dump we saw 62 threads executing listUpstreamPipelinesBasedOnMavenDependencies.
All threads that execute HikariProxyPreparedStatement.executeQuery are called from listUpstreamPipelinesBasedOnMavenDependencies.
After fixing this problem by reducing the number of SQL statements and optimizing the SQL, we saw another problem:
The triggering of downstreams led to a bottleneck caused by locking in hudson.model.Queue.schedule2(); we saw up to 60 threads waiting for the lock at the same moment.
SOLUTION
We made several improvements:
1) Reduced the very high number of SQL statements caused by the recursion in listUpstreamJobs
listTransitiveUpstreamJobs recurses and calls listUpstreamJobs several times, so even a single call of listTransitiveUpstreamJobs may read the same information several times.
Especially in DownstreamPipelineTriggerRunListener, listTransitiveUpstreamJobs is called for all downstreams, and their upstreams usually overlap.
We use an UpstreamMemory object that is created at the beginning of downstreamPipelineTriggerRunListener and remembers the upstreams that have already been calculated, as sketched below.
We have observed several tens of thousands of UpstreamMemory hits during the calculation for a single job. With this fix the calculation works again, whereas before we ran into DB connection pool timeouts; it saves about 90% of the executions of this complex SQL statement.
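A minimal sketch of the idea (the actual class in the pull request may differ; the DAO interface shown here is an assumption based on the method names in this ticket):

import java.util.HashMap;
import java.util.Map;

// Sketch: remembers already calculated upstreams for the lifetime of one
// DownstreamPipelineTriggerRunListener run, so the expensive SQL behind
// listUpstreamJobs executes at most once per (job, build) pair.
public class UpstreamMemory {

    // assumed DAO surface; the method name is taken from this ticket
    public interface UpstreamDao {
        Map<String, Integer> listUpstreamJobs(String jobFullName, int buildNumber);
    }

    private final Map<String, Map<String, Integer>> cache = new HashMap<>();

    public Map<String, Integer> listUpstreamJobs(UpstreamDao dao, String jobFullName, int buildNumber) {
        String key = jobFullName + "#" + buildNumber;
        // compute once, serve the thousands of repeated lookups from memory
        return cache.computeIfAbsent(key, k -> dao.listUpstreamJobs(jobFullName, buildNumber));
    }
}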
2) Tuned the SQL query for listUpstreamPipelinesBasedOnMavenDependencies
The query does a complex join and, for very large data sets on PostgreSQL, does not use the existing index on the jenkins_job table.
An EXPLAIN ANALYZE of the query shows this sequential scan:
Seq Scan on jenkins_job upstream_job (cost=0.00..81.58 rows=3086 width=41) Filter: (jenkins_master_id = 1)
This is the reason for the bad performance; the query takes several minutes.
In our pull request you can find a rewrite of the SQL and some documentation. We split the problem into two parts: first we read jenkins_job, as this returns a single row; then we run the join without the problematic table.
The two queries now take only a few milliseconds.
https://learnsql.com/blog/sql-join-only-first-row/ shows another solution that works: we tried out solution 4 and used a 'LIMIT 1' hint for the jenkins_job table. It worked for PostgreSQL, but it is not clear whether it works on all supported databases, so we decided to use the 'two queries' approach, sketched below.
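A sketch of the 'two queries' split; the SQL is reconstructed from the EXPLAIN ANALYZE example at the end of this ticket, and the exact statements live in the pull request:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;

public class TwoQueriesSketch {

    // Step 1: point lookup on JENKINS_JOB that returns a single row, so the
    // planner never has to join this table into the big query below.
    static Long findDownstreamJobId(DataSource ds, String fullName, long masterId) throws SQLException {
        String sql = "SELECT ID FROM JENKINS_JOB WHERE FULL_NAME = ? AND JENKINS_MASTER_ID = ?";
        try (Connection cnn = ds.getConnection(); PreparedStatement stmt = cnn.prepareStatement(sql)) {
            stmt.setString(1, fullName);
            stmt.setLong(2, masterId);
            try (ResultSet rst = stmt.executeQuery()) {
                return rst.next() ? rst.getLong(1) : null;
            }
        }
    }

    // Step 2: the big join, anchored on the job id instead of joining
    // JENKINS_JOB a second time (compare the 500 ms query quoted below).
    static Map<String, Integer> listUpstreams(DataSource ds, long downstreamJobId, int buildNumber, long masterId) throws SQLException {
        String sql = "SELECT DISTINCT UPSTREAM_JOB.FULL_NAME, UPSTREAM_BUILD.NUMBER"
                + " FROM JENKINS_JOB AS UPSTREAM_JOB"
                + " INNER JOIN JENKINS_BUILD AS UPSTREAM_BUILD ON (UPSTREAM_JOB.ID = UPSTREAM_BUILD.JOB_ID"
                + "   AND UPSTREAM_JOB.LAST_SUCCESSFUL_BUILD_NUMBER = UPSTREAM_BUILD.NUMBER)"
                + " INNER JOIN GENERATED_MAVEN_ARTIFACT ON (UPSTREAM_BUILD.ID = GENERATED_MAVEN_ARTIFACT.BUILD_ID"
                + "   AND GENERATED_MAVEN_ARTIFACT.SKIP_DOWNSTREAM_TRIGGERS = FALSE)"
                + " INNER JOIN MAVEN_ARTIFACT ON GENERATED_MAVEN_ARTIFACT.ARTIFACT_ID = MAVEN_ARTIFACT.ID"
                + " INNER JOIN MAVEN_DEPENDENCY ON (MAVEN_DEPENDENCY.ARTIFACT_ID = MAVEN_ARTIFACT.ID"
                + "   AND MAVEN_DEPENDENCY.IGNORE_UPSTREAM_TRIGGERS = FALSE)"
                + " INNER JOIN JENKINS_BUILD AS DOWNSTREAM_BUILD ON MAVEN_DEPENDENCY.BUILD_ID = DOWNSTREAM_BUILD.ID"
                + " WHERE DOWNSTREAM_BUILD.JOB_ID = ? AND DOWNSTREAM_BUILD.NUMBER = ? AND UPSTREAM_JOB.JENKINS_MASTER_ID = ?";
        Map<String, Integer> upstreams = new HashMap<>();
        try (Connection cnn = ds.getConnection(); PreparedStatement stmt = cnn.prepareStatement(sql)) {
            stmt.setLong(1, downstreamJobId);
            stmt.setInt(2, buildNumber);
            stmt.setLong(3, masterId);
            try (ResultSet rst = stmt.executeQuery()) {
                while (rst.next()) {
                    upstreams.put(rst.getString(1), rst.getInt(2));
                }
            }
        }
        return upstreams;
    }
}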
3) Added a memory for getGeneratedArtifacts and listDownstreamJobsByArtifact during the execution of DownstreamPipelineTriggerRunListener#onCompleted to reduce SQL statements
This saves redundant SQL calls and makes the SQL logs easier to analyze; a sketch follows below.
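The same memoization pattern as in 1), sketched here for listDownstreamJobsByArtifact; the key and value types are assumptions for illustration:

import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;

// Sketch: per-listener-run cache, same idea as UpstreamMemory above.
public class DownstreamMemory {

    // assumed DAO surface; key/value types are illustrative
    public interface DownstreamDao {
        Map<String, SortedSet<String>> listDownstreamJobsByArtifact(String jobFullName, int buildNumber);
    }

    private final Map<String, Map<String, SortedSet<String>>> cache = new HashMap<>();

    public Map<String, SortedSet<String>> listDownstreamJobsByArtifact(DownstreamDao dao, String jobFullName, int buildNumber) {
        return cache.computeIfAbsent(jobFullName + "#" + buildNumber,
                k -> dao.listDownstreamJobsByArtifact(jobFullName, buildNumber));
    }
}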
4) Reduced the locking during hudson.model.Queue.schedule2()
hudson.model.Queue.schedule2() uses a lock object and does not scale; we have seen over 50 threads waiting for the lock in a single thread dump.
Before we trigger a downstream job, we therefore check whether it is already in the queue. Here we found a bug in the existing implementation: WorkflowJob.isInQueue() always returns 'false'. You have to use Jenkins.get().getQueue().contains((Task) job) to get the correct answer, as shown below.
We also fixed this for the existing check on upstream jobs, which did not work before.
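A minimal sketch of the check, using only the call named above (WorkflowJob implements Queue.Task, so the cast is legal):

import hudson.model.Queue;
import jenkins.model.Jenkins;
import org.jenkinsci.plugins.workflow.job.WorkflowJob;

public class QueueCheckSketch {

    // WorkflowJob.isInQueue() always returns false, so ask the queue
    // directly before scheduling another downstream build
    static boolean isAlreadyQueued(WorkflowJob job) {
        return Jenkins.get().getQueue().contains((Queue.Task) job);
    }
}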
5) We have seen that too many downstreams were started.
Downstreams whose transitive upstreams matched were often triggered anyway, especially for long-running jobs. The reason was that listTransitiveUpstreamJobs returned an empty list for jobs that were running at that moment.
Fix: if we get an empty list, we check listTransitiveUpstreamJobs for the last successful build, as sketched below.
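A sketch of the fallback, assuming listTransitiveUpstreamJobs returns a map of job names to build numbers; the lastSuccessfulBuildNumber parameter stands in for however the caller obtains that number:

import java.util.Map;

public class TransitiveUpstreamFallback {

    // assumed DAO surface; the method name is taken from this ticket
    public interface Dao {
        Map<String, Integer> listTransitiveUpstreamJobs(String jobFullName, int buildNumber);
    }

    // A build that is still running has no recorded upstreams yet, so an
    // empty result falls back to the job's last successful build before we
    // decide whether the downstream really needs to be triggered.
    static Map<String, Integer> transitiveUpstreams(Dao dao, String jobFullName,
                                                    int buildNumber, Integer lastSuccessfulBuildNumber) {
        Map<String, Integer> upstreams = dao.listTransitiveUpstreamJobs(jobFullName, buildNumber);
        if (upstreams.isEmpty() && lastSuccessfulBuildNumber != null) {
            upstreams = dao.listTransitiveUpstreamJobs(jobFullName, lastSuccessfulBuildNumber);
        }
        return upstreams;
    }
}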
BEFORE Pull Request: [withMaven] downstreamPipelineTriggerRunListener - completed in 62394803 ms
AFTER Pull Request: [withMaven] downstreamPipelineTriggerRunListener - completed in 534 ms
Pull Request:
https://github.com/jenkinsci/pipeline-maven-plugin/pull/567
The query plan of the listUpstreamPipelinesBasedOnMavenDependencies query is attached. It contains this seq scan:
Seq Scan on jenkins_job upstream_job (cost=0.00..81.58 rows=3086 width=41)
PipelineMavenPluginPostgreSqlDao - PostgreSQL 11.17
JDBC URL: jdbc:postgresql://xxx
Table JENKINS_MASTER: 1 rows
Table MAVEN_ARTIFACT: 1203669 rows
Table JENKINS_JOB: 3099 rows
Table JENKINS_BUILD: 19795 rows
Table MAVEN_DEPENDENCY: 6178556 rows
Table GENERATED_MAVEN_ARTIFACT: 915590 rows
Table MAVEN_PARENT_PROJECT: 188514 rows
Table JENKINS_BUILD_UPSTREAM_CAUSE: 5837 rows
Performances:
find: totalDurationInMs=12525274, count=64311
write: totalDurationInMs=114369864, count=720134
All indexes are installed.
This is really strange: the last join seems to be the problem, it does not use an index, and the statistics are up to date.
I tried bypassing the last join by using the job_id directly in the WHERE clause; the query then takes only 500 ms:
EXPLAIN ANALYZE select distinct upstream_job.full_name, upstream_build.number
from JENKINS_JOB as upstream_job
inner join JENKINS_BUILD as upstream_build on (upstream_job.id = upstream_build.job_id and upstream_job.last_successful_build_number = upstream_build.number)
inner join GENERATED_MAVEN_ARTIFACT on (upstream_build.id = GENERATED_MAVEN_ARTIFACT.build_id and GENERATED_MAVEN_ARTIFACT.skip_downstream_triggers = false)
inner join MAVEN_ARTIFACT on GENERATED_MAVEN_ARTIFACT.artifact_id = MAVEN_ARTIFACT.id
inner join MAVEN_DEPENDENCY on (MAVEN_DEPENDENCY.artifact_id = MAVEN_ARTIFACT.id and MAVEN_DEPENDENCY.ignore_upstream_triggers = false)
inner join JENKINS_BUILD as downstream_build on MAVEN_DEPENDENCY.build_id = downstream_build.id
inner join JENKINS_JOB as downstream_job on downstream_build.job_id = downstream_job.id
where downstream_build.job_id = 8642 and downstream_build.number = 248 and upstream_job.jenkins_master_id = 1