JENKINS-70208: downstreamPipelineTriggerRunListener Performance


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Component: pipeline-maven-plugin
    • Labels: None
    • Environment: Tested with Version 1235.v2db_ddd9f797b
    • Fix Version/s: 1257.v89e586d3c58c ; 3.11.2

    Description

      PROBLEM

      The performance of downstreamPipelineTriggerRunListener degrades as the number of dependencies grows, and the listener often stops working because of DB connection pool timeouts.

      Here is the output of one of our jobs before the fix:
      [withMaven] downstreamPipelineTriggerRunListener - completed in 62394803 ms
      62,394,803 ms ≈ 1,039 min ≈ 17.33 h

      We have a very large setup with thousands of builds. downstreamPipelineTriggerRunListener takes up to several hours and fails because of DB connection pool timeouts.

      ANALYSIS

      Analysis showed that the time is spent calculating transitive upstreams. The same information is read several thousand times to compute listTransitiveUpstreamJobs().

      In thread dumps we see that all threads are hanging here:

      at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:118)
      at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52)
      at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java)
      at org.jenkinsci.plugins.pipeline.maven.dao.AbstractPipelineMavenPluginDao.listUpstreamPipelinesBasedOnMavenDependencies(AbstractPipelineMavenPluginDao.java:912)
      at org.jenkinsci.plugins.pipeline.maven.dao.AbstractPipelineMavenPluginDao.listUpstreamJobs(AbstractPipelineMavenPluginDao.

      In one example thread dump we see 62 threads executing listUpstreamPipelinesBasedOnMavenDependencies.

      All threads executing HikariProxyPreparedStatement.executeQuery are called from listUpstreamPipelinesBasedOnMavenDependencies.

      After fixing this problem by reducing the number of SQL statements and optimizing the SQL, we saw another problem:

      Triggering the downstreams led to a bottleneck caused by locks in hudson.model.Queue.schedule2(); we saw up to 60 threads waiting for a lock at one moment.

       

      SOLUTION

      We made several improvements:

      1) Reduce the very high number of SQL statements caused by the recursive calls to listUpstreamJobs

      listTransitiveUpstreamJobs uses a recursion that calls listUpstreamJobs many times. Even within a single call of listTransitiveUpstreamJobs, the same information may be read several times.
      Moreover, DownstreamPipelineTriggerRunListener calls listTransitiveUpstreamJobs for all downstreams, and the upstreams of those downstreams usually overlap.
      We use an UpstreamMemory object that is created at the beginning of downstreamPipelineTriggerRunListener and remembers the upstreams that have already been calculated.
      We have observed tens of thousands of hits in UpstreamMemory for the calculation of a single job. With this fix the calculation works again; before, we ran into DB connection pool timeouts. This avoids about 90% of the executions of this complex SQL statement.
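
      A minimal sketch of such a per-invocation cache, assuming a DAO lookup of the form listUpstreamJobs(jobFullName, buildNumber) that returns a map of upstream job names to build numbers (the real UpstreamMemory class in the pull request may differ in detail):

      import java.util.HashMap;
      import java.util.Map;

      /**
       * Sketch: cache for listUpstreamJobs results, created once per
       * DownstreamPipelineTriggerRunListener#onCompleted run and dropped afterwards,
       * so no invalidation logic is needed.
       */
      class UpstreamMemorySketch {

          interface UpstreamResolver {
              Map<String, Integer> listUpstreamJobs(String jobFullName, int buildNumber);
          }

          // key: "jobFullName#buildNumber", value: upstream jobs already resolved for that build
          private final Map<String, Map<String, Integer>> cache = new HashMap<>();

          Map<String, Integer> listUpstreamJobs(UpstreamResolver dao, String jobFullName, int buildNumber) {
              return cache.computeIfAbsent(jobFullName + "#" + buildNumber,
                      key -> dao.listUpstreamJobs(jobFullName, buildNumber));
          }
      }

      Because the cache only lives for one run of the listener, stale entries are not a concern.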

       

      2) Tuned the SQL Query for listUpstreamPipelinesBasedOnMavenDependencies

      The query performs a complex join and, for very large data, does not use the existing index on the jenkins_job table on PostgreSQL.

      An EXPLAIN ANALYZE of the query shows this sequential scan:

      Seq Scan on jenkins_job upstream_job (cost=0.00..81.58 rows=3086 width=41) Filter: (jenkins_master_id = 1)

      This is the reason for the bad performance; the query takes several minutes!

      In our pull request you can find a rewrite of the SQL and some documentation. We split the problem into two parts: first we read the jenkins_job row, as this returns a single result; then we run the join without the problematic table.

      The two queries now take only a few milliseconds.

      https://learnsql.com/blog/sql-join-only-first-row/ shows another solution that works: we tried its Solution 4 and used a 'LIMIT 1' hint for the jenkins_job table. It worked on PostgreSQL, but it is not clear whether it works on all supported databases, so we decided to use the 'two queries' approach.
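
      A sketch of the 'two queries' approach over JDBC. The SQL is based on the reduced query quoted in the comments below; the class and method names are illustrative and not the exact code of the pull request:

      import java.sql.Connection;
      import java.sql.PreparedStatement;
      import java.sql.ResultSet;
      import java.sql.SQLException;
      import java.util.HashMap;
      import java.util.Map;

      class TwoQueryUpstreamLookupSketch {

          /** Step 1: resolve the primary key of the downstream job with a cheap, single-row query. */
          Long findJobId(Connection cnn, String jobFullName, long masterId) throws SQLException {
              String sql = "SELECT ID FROM JENKINS_JOB WHERE FULL_NAME = ? AND JENKINS_MASTER_ID = ?";
              try (PreparedStatement stmt = cnn.prepareStatement(sql)) {
                  stmt.setString(1, jobFullName);
                  stmt.setLong(2, masterId);
                  try (ResultSet rs = stmt.executeQuery()) {
                      return rs.next() ? rs.getLong(1) : null;
                  }
              }
          }

          /** Step 2: run the expensive join with the job id bound directly, without the extra join on JENKINS_JOB by name. */
          Map<String, Integer> listUpstreamPipelines(Connection cnn, long downstreamJobId,
                                                     int downstreamBuildNumber, long masterId) throws SQLException {
              String sql = "SELECT DISTINCT UPSTREAM_JOB.FULL_NAME, UPSTREAM_BUILD.NUMBER"
                      + " FROM JENKINS_JOB AS UPSTREAM_JOB"
                      + " INNER JOIN JENKINS_BUILD AS UPSTREAM_BUILD ON (UPSTREAM_JOB.ID = UPSTREAM_BUILD.JOB_ID"
                      + "   AND UPSTREAM_JOB.LAST_SUCCESSFUL_BUILD_NUMBER = UPSTREAM_BUILD.NUMBER)"
                      + " INNER JOIN GENERATED_MAVEN_ARTIFACT ON (UPSTREAM_BUILD.ID = GENERATED_MAVEN_ARTIFACT.BUILD_ID"
                      + "   AND GENERATED_MAVEN_ARTIFACT.SKIP_DOWNSTREAM_TRIGGERS = FALSE)"
                      + " INNER JOIN MAVEN_ARTIFACT ON GENERATED_MAVEN_ARTIFACT.ARTIFACT_ID = MAVEN_ARTIFACT.ID"
                      + " INNER JOIN MAVEN_DEPENDENCY ON (MAVEN_DEPENDENCY.ARTIFACT_ID = MAVEN_ARTIFACT.ID"
                      + "   AND MAVEN_DEPENDENCY.IGNORE_UPSTREAM_TRIGGERS = FALSE)"
                      + " INNER JOIN JENKINS_BUILD AS DOWNSTREAM_BUILD ON MAVEN_DEPENDENCY.BUILD_ID = DOWNSTREAM_BUILD.ID"
                      + " WHERE DOWNSTREAM_BUILD.JOB_ID = ? AND DOWNSTREAM_BUILD.NUMBER = ? AND UPSTREAM_JOB.JENKINS_MASTER_ID = ?";
              Map<String, Integer> upstreams = new HashMap<>();
              try (PreparedStatement stmt = cnn.prepareStatement(sql)) {
                  stmt.setLong(1, downstreamJobId);
                  stmt.setInt(2, downstreamBuildNumber);
                  stmt.setLong(3, masterId);
                  try (ResultSet rs = stmt.executeQuery()) {
                      while (rs.next()) {
                          upstreams.put(rs.getString(1), rs.getInt(2));
                      }
                  }
              }
              return upstreams;
          }
      }

      The first query touches only JENKINS_JOB and returns a single row, so the planner no longer has to estimate the join against JENKINS_JOB by name when planning the second, expensive query.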

       

      3) Added a memory for getGeneratedArtifacts and listDownstreamJobsByArtifact during the execution of DownstreamPipelineTriggerRunListener#onCompleted to reduce SQL statements

      This saves redundant SQL calls and makes it easier to analyze the SQL logs.
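
      The same memoization pattern as in (1) applies here. The sketch below additionally counts hits and misses, which is the kind of bookkeeping behind the "hits=…, misses=…, efficency=…" cache statistics quoted in the comments below (names are illustrative, not the plugin's exact API):

      import java.util.HashMap;
      import java.util.Map;
      import java.util.function.Function;

      /** Sketch: per-listener-invocation cache with hit/miss counters. */
      class CountingCacheSketch<K, V> {
          private final Map<K, V> cache = new HashMap<>();
          private long hits;
          private long misses;

          V get(K key, Function<K, V> loader) {
              if (cache.containsKey(key)) {
                  hits++;
                  return cache.get(key);
              }
              misses++;
              V value = loader.apply(key);
              cache.put(key, value);
              return value;
          }

          /** Renders the hit/miss statistics line; the plugin's exact wording and rounding may differ. */
          String statistics() {
              long total = hits + misses;
              long efficiency = total == 0 ? 0 : (100 * hits) / total;
              return "hits=" + hits + ", misses=" + misses + ", efficiency=" + efficiency + "%";
          }
      }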

      4) Reduced lock contention in hudson.model.Queue.schedule2()

      hudson.model.Queue.schedule2() uses a lock object and does not scale. In one thread dump we saw over 50 threads waiting for the lock.

      Before we trigger a downstream job, we therefore check whether it is already in the queue. Here we found a bug in the existing implementation: WorkflowJob.isInQueue() always returns 'false'. You have to use Jenkins.get().getQueue().contains((Task) job) to get the correct answer.

      We also fixed this for the existing check on upstream jobs, which did not work before.
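
      A sketch of that check (the helper class is hypothetical; the Jenkins API calls are the ones named above):

      import hudson.model.Queue;
      import jenkins.model.Jenkins;
      import org.jenkinsci.plugins.workflow.job.WorkflowJob;

      class QueueCheckSketch {
          /**
           * WorkflowJob.isInQueue() always returns false (the bug described above),
           * so ask the Jenkins queue directly before scheduling a downstream build.
           */
          static boolean isAlreadyQueued(WorkflowJob downstreamJob) {
              return Jenkins.get().getQueue().contains((Queue.Task) downstreamJob);
          }
      }

      If this returns true, the listener can skip scheduling the downstream and avoid contending for the lock in Queue.schedule2().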

       

      5) We saw that too many downstreams were started.

      Downstreams that had matching transitive upstreams were triggered too often, especially for long-running jobs. The reason was that listTransitiveUpstreamJobs returned an empty list for jobs that were running at that moment.

      Fix: if we get an empty list, we check listTransitiveUpstreamJobs for the last successful build instead.
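
      A sketch of that fallback, with an illustrative DAO interface around the listTransitiveUpstreamJobs call described above:

      import java.util.Map;

      class TransitiveUpstreamFallbackSketch {

          interface Dao {
              Map<String, Integer> listTransitiveUpstreamJobs(String jobFullName, int buildNumber);
          }

          /**
           * For a build that is still running, the upstream information may not be recorded yet,
           * so the DAO returns an empty map. In that case fall back to the last successful build
           * to avoid triggering downstreams redundantly.
           */
          static Map<String, Integer> transitiveUpstreamsWithFallback(Dao dao, String jobFullName,
                                                                      int buildNumber, int lastSuccessfulBuildNumber) {
              Map<String, Integer> upstreams = dao.listTransitiveUpstreamJobs(jobFullName, buildNumber);
              if (upstreams.isEmpty() && lastSuccessfulBuildNumber != buildNumber) {
                  upstreams = dao.listTransitiveUpstreamJobs(jobFullName, lastSuccessfulBuildNumber);
              }
              return upstreams;
          }
      }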

       

       

      BEFORE Pull Request: [withMaven] downstreamPipelineTriggerRunListener - completed in 62394803 ms

      AFTER Pull Request: [withMaven] downstreamPipelineTriggerRunListener - completed in 534 ms

      Pull Request:

      https://github.com/jenkinsci/pipeline-maven-plugin/pull/567

       

       

       

      Attachments

        Activity

          aubele Martin Aubele added a comment - edited

          The query plan of the listUpstreamPipelinesBasedOnMavenDependencies query is attached. It contains this seq scan:

          Seq Scan on jenkins_job upstream_job  (cost=0.00..81.58 rows=3086 width=41) 

           

           

          PipelineMavenPluginPostgreSqlDao - PostgreSQL 11.17
          JDBC URL: jdbc:postgresql://xxx

          Table JENKINS_MASTER: 1 rows
          Table MAVEN_ARTIFACT: 1203669 rows
          Table JENKINS_JOB: 3099 rows
          Table JENKINS_BUILD: 19795 rows
          Table MAVEN_DEPENDENCY: 6178556 rows
          Table GENERATED_MAVEN_ARTIFACT: 915590 rows
          Table MAVEN_PARENT_PROJECT: 188514 rows

          Table JENKINS_BUILD_UPSTREAM_CAUSE: 5837 rows
          Performances:
          find: totalDurationInMs=12525274, count=64311
          write: totalDurationInMs=114369864, count=720134

           

          All indexes are installed.

           

          This is really strange. The last join seems to be the problem; it does not use an index. Statistics are up to date.

           

          I tried out reducing the query by the last join and using the job_id directly: it only takes 500 ms:

          EXPLAIN ANALYZE select distinct upstream_job.full_name, upstream_build.number
          from JENKINS_JOB as upstream_job
          inner join JENKINS_BUILD as upstream_build on (upstream_job.id = upstream_build.job_id and upstream_job.last_successful_build_number = upstream_build.number)
          inner join GENERATED_MAVEN_ARTIFACT on (upstream_build.id = GENERATED_MAVEN_ARTIFACT.build_id  and GENERATED_MAVEN_ARTIFACT.skip_downstream_triggers = false)
          inner join MAVEN_ARTIFACT on GENERATED_MAVEN_ARTIFACT.artifact_id = MAVEN_ARTIFACT.id
          inner join MAVEN_DEPENDENCY on (MAVEN_DEPENDENCY.artifact_id = MAVEN_ARTIFACT.id and MAVEN_DEPENDENCY.ignore_upstream_triggers = false)
          inner join JENKINS_BUILD as downstream_build on MAVEN_DEPENDENCY.build_id = downstream_build.id
          inner join JENKINS_JOB as downstream_job on downstream_build.job_id = downstream_job.id
          where downstream_build.job_id = 8642 and  downstream_build.number = 248 and upstream_job.jenkins_master_id = 1

           

          aubele Martin Aubele added a comment - edited

          Rewriting the query to:

          select distinct upstream_job.full_name, upstream_build.number
          from JENKINS_JOB as upstream_job
          inner join JENKINS_BUILD as upstream_build on (upstream_job.id = upstream_build.job_id and upstream_job.last_successful_build_number = upstream_build.number)
          inner join GENERATED_MAVEN_ARTIFACT on (upstream_build.id = GENERATED_MAVEN_ARTIFACT.build_id  and GENERATED_MAVEN_ARTIFACT.skip_downstream_triggers = false)
          inner join MAVEN_ARTIFACT on GENERATED_MAVEN_ARTIFACT.artifact_id = MAVEN_ARTIFACT.id
          inner join MAVEN_DEPENDENCY on (MAVEN_DEPENDENCY.artifact_id = MAVEN_ARTIFACT.id and MAVEN_DEPENDENCY.ignore_upstream_triggers = false)
          inner join JENKINS_BUILD as downstream_build on (MAVEN_DEPENDENCY.build_id = downstream_build.id and downstream_build.job_id = (
          SELECT downstream_job.id
                     FROM JENKINS_JOB as downstream_job
                     WHERE downstream_job.full_name = 'xxx' and downstream_job.jenkins_master_id = 1
                     LIMIT 1
                )
          )
          where downstream_build.number = 248 and upstream_job.jenkins_master_id = 1

           

          makes it fast. The 'LIMIT 1' hint tells the planner that only one row is expected here.

          But I do not know whether this is supported by the H2 database.

           

          bguerin Benoit added a comment -

          Here are the statistics of my production Jenkins, after a whole day of intensive work:

          PipelineMavenPluginPostgreSqlDao - PostgreSQL 13.9 (Debian 13.9-1.pgdg110+1)
          	JDBC URL: jdbc:postgresql://db/jenkins
          	Table JENKINS_MASTER: 1 rows
          	Table MAVEN_ARTIFACT: 1311 rows
          	Table JENKINS_JOB: 225 rows
          	Table JENKINS_BUILD: 1572 rows
          	Table MAVEN_DEPENDENCY: 2554 rows
          	Table GENERATED_MAVEN_ARTIFACT: 6329 rows
          	Table MAVEN_PARENT_PROJECT: 2000 rows
          	Table JENKINS_BUILD_UPSTREAM_CAUSE: 1055 rows
           Performances: 
          	 find: totalDurationInMs=7211, count=886
          	 write: totalDurationInMs=48061, count=1374
          
          aubele Martin Aubele added a comment -
          PipelineMavenPluginPostgreSqlDao - PostgreSQL 11.17
          	JDBC URL: jdbc:postgresql://... 
                  Table JENKINS_MASTER: 1 rows
          	Table MAVEN_ARTIFACT: 1263920 rows
          	Table JENKINS_JOB: 3883 rows
          	Table JENKINS_BUILD: 20005 rows
          	Table MAVEN_DEPENDENCY: 6196127 rows
          	Table GENERATED_MAVEN_ARTIFACT: 1047728 rows
          	Table MAVEN_PARENT_PROJECT: 190749 rows
          	Table JENKINS_BUILD_UPSTREAM_CAUSE: 4683 rows
           Performances: 
          	 find: totalDurationInMs=6600278, count=306268
          	 write: totalDurationInMs=26663309, count=3815043 

          This is from our production system, but with our optimizations, at 08:00 in the morning. Yesterday there was only moderate work because holidays were starting.

          bguerin Benoit added a comment -

          Here are my new statistics after more or less the same number of builds, using your version of the plugin:

          PipelineMavenPluginPostgreSqlDao - PostgreSQL 13.9 (Debian 13.9-1.pgdg110+1)
          	JDBC URL: jdbc:postgresql://db/jenkins
          	Table JENKINS_MASTER: 1 rows
          	Table MAVEN_ARTIFACT: 1312 rows
          	Table JENKINS_JOB: 228 rows
          	Table JENKINS_BUILD: 1772 rows
          	Table MAVEN_DEPENDENCY: 3115 rows
          	Table GENERATED_MAVEN_ARTIFACT: 7262 rows
          	Table MAVEN_PARENT_PROJECT: 2361 rows
          	Table JENKINS_BUILD_UPSTREAM_CAUSE: 1238 rows
           Performances: 
          	 find: totalDurationInMs=2658, count=926
          	 write: totalDurationInMs=47217, count=1441
           Caches: 
          	 listUpstreamJobs: hits=1045.0, misses=1153.0, efficency=48%
          	 getGeneratedArtifacts: hits=23.0, misses=391.0, efficency=6%
          	 listDownstreamJobsByArtifact: hits=23.0, misses=391.0, efficency=6%
          
          aubele Martin Aubele added a comment - edited

          Thank you! Could you please share your version of the plugin, or tell me the Git branch? Then I will use it in our production system to compare the statistics. The hits depend very much on the structure of the POMs. If you use a lot of indirection via pom-type POMs, the hits should be much higher.

          bguerin Benoit added a comment -

          Hello aubele 

          I pushed directly to your working branch, so you can just pull and rebuild the plugin.

          By pom-type POMs, you mean BOMs? That is what I use in my company.


          People

            Assignee: bguerin Benoit
            Reporter: aubele Martin Aubele