
[JEP-210] Optimize log handling in Pipeline and Durable Task

    • Sprint: Pipeline - October, Pipeline - April 2018

      Extracted from JENKINS-30896:

      The current design of LogActionImpl, using $id.log, was considered the minimum necessary for a working 1.0 release, not a serious implementation. It has a major problem: when there is a large amount of output, WorkflowRun.copyLogs must duplicate it all into the build's main log, doubling disk space requirements per build.

      It would be better to keep a single log file for the build. LogActionImpl should be deprecated in favor of an implementation that simply stores a rangeset of offsets into that file. When parallel blocks are producing concurrent output, the single log file will be a bit jumbled (probably still human-readable in most cases), but the rangesets will keep track of what output came from where. The final output produced by WorkflowRun will still be processed to split at line boundaries, add in thread labels, etc. (TBD how and whether JENKINS-30777 could be supported in this mode.)
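      For illustration only (class and method names here are hypothetical, not part of any released API), such a rangeset could be as simple as a list of byte offsets into the build's single log file, one range per contiguous chunk of a step's output:

      import java.util.ArrayList;
      import java.util.List;

      /** Illustrative only: records which byte ranges of the single build log belong to one step. */
      class StepLogRanges {
          /** Half-open [start, end) byte offsets into the build's log file. */
          private final List<long[]> ranges = new ArrayList<>();

          /** Called after a chunk of this step's output has been appended to the build log. */
          void append(long start, long end) {
              // Merge with the previous range when the new chunk is contiguous with it.
              if (!ranges.isEmpty() && ranges.get(ranges.size() - 1)[1] == start) {
                  ranges.get(ranges.size() - 1)[1] = end;
              } else {
                  ranges.add(new long[] {start, end});
              }
          }

          /** Total number of log bytes attributed to this step. */
          long length() {
              long total = 0;
              for (long[] r : ranges) {
                  total += r[1] - r[0];
              }
              return total;
          }
      }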

          [JENKINS-38381] [JEP-210] Optimize log handling in Pipeline and Durable Task

          Jesse Glick added a comment -

          Expanding scope a bit since oleg_nenashev's work requires org.jenkinsci.plugins.durabletask.Controller.writeLog to be reconsidered as well: we do not want to force step logs to be streamed back to the master. So we need an extension point (which for ease of development could start off inside a plugin) that would allow a Run to obtain a BuildListener whose serial form would allow this alternate mode.

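          As a rough sketch only (the interface and method names below are assumptions, not the eventual API), such an extension point might look like this:

          import hudson.ExtensionPoint;
          import hudson.model.BuildListener;
          import hudson.model.Run;
          import java.io.IOException;
          import javax.annotation.CheckForNull;

          /**
           * Hypothetical extension point: supplies the BuildListener used for a given build.
           * Because BuildListener is Serializable, the returned instance can be sent to an
           * agent, so step output need not pass through the master at all.
           */
          public interface BuildLogListenerFactory extends ExtensionPoint {

              /**
               * @return a listener routing this run's log output to an alternate sink,
               *         or null to fall back to the default log file on the master
               */
              @CheckForNull
              BuildListener forRun(Run<?, ?> run) throws IOException, InterruptedException;
          }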

          Jesse Glick added a comment -

          Can roll in a fix of JENKINS-31096 by just using UTF-8 everywhere.

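          The gist of "UTF-8 everywhere" is never relying on the platform default charset when encoding or decoding log bytes; a minimal illustration (helper names are made up):

          import java.io.OutputStream;
          import java.io.PrintStream;
          import java.io.UnsupportedEncodingException;
          import java.nio.charset.StandardCharsets;

          class Utf8Logs {
              /** Wrap a raw log sink so text is always encoded as UTF-8, not the platform default. */
              static PrintStream utf8PrintStream(OutputStream sink) throws UnsupportedEncodingException {
                  return new PrintStream(sink, true, StandardCharsets.UTF_8.name());
              }

              /** Decode bytes read back from the log with the same fixed charset. */
              static String decode(byte[] logBytes) {
                  return new String(logBytes, StandardCharsets.UTF_8);
              }
          }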

          Jesse Glick added a comment -

          Have an early draft available. Not yet trying to avoid streaming DurableTaskStep output back to the master.


          Jesse Glick added a comment -

          Basic implementation of a unified log is working. Now I am looking into having an externalizable log storage, and specifically allowing DurableTaskStep.Execution to stream log text directly there from an agent. But when the master is not involved in collecting log output from a sh/bat step, then it is pointless to have it polling the agent over Remoting every 15s or less just to get an exit status and save a log location; this just adds traffic to the channel, and load to the master. So now I am investigating whether the controller can autonomously stream output and send a callback with exit status. This is essentially a rewrite of durable task handling. The goal is that if you run, say,

          node {
            sh '''#!/bin/bash
          for x in {0000..9999}
          do
            echo $x
            sleep 1
          done
          '''
          }
          

          then for the next three hours there should be zero activity on the remoting channel (other than ping threads) and zero load on the master related to this build (assuming no human is watching the log).


          Jesse Glick added a comment -

          Ran into an obstacle: BourneShellScript as currently implemented relies on ProcessLiveness to tell if a process has exited improperly, so sh can fail with a -1 exit code. Now if this is using a LocalLauncher or a RemoteLauncher, the normal case, it will directly call getpgid in libc. But if it is dealing with a decorated launcher, it delegates to that in order to tell if the process is alive. When you are using docker.image(…).inside, that is what allows the agent to determine whether a process inside the container is alive; getpgid will not work, since Docker remaps PIDs. The problem is that this hook is done by passing a special Launcher as a contextual object in the Pipeline step's block, but this is not Serializable and not safe for transfer to the agent. So without new APIs, which docker-workflow would need to implement, there seems to be no way for an autonomous agent to check process liveness when using .inside without asking for help from the master, which would defeat some of the performance benefits here when using Docker.

          Possibly need to rework this system so that process liveness checks are not done constantly when checking exit status, but rather periodically at the initiation of the master (say every hour) while there is no reported exit status. That might also help work around lost notifications etc.
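          To illustrate the idea (a sketch only, not the actual implementation), the master would schedule an infrequent probe that only touches the agent while no exit status has been reported:

          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.ScheduledFuture;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.atomic.AtomicBoolean;

          /** Illustrative only: master-side periodic liveness probe while awaiting an exit callback. */
          class LivenessWatchdog {
              private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
              private final AtomicBoolean exited = new AtomicBoolean();
              private ScheduledFuture<?> probe;

              /** Start probing, say hourly, until an exit status is reported. */
              void start(Runnable checkProcessStillAlive) {
                  probe = timer.scheduleWithFixedDelay(() -> {
                      if (!exited.get()) {
                          checkProcessStillAlive.run(); // only now touch the agent / decorated launcher
                      }
                  }, 1, 1, TimeUnit.HOURS);
              }

              /** Called from the asynchronous exit-status callback. */
              void onExit() {
                  exited.set(true);
                  if (probe != null) {
                      probe.cancel(false);
                  }
                  timer.shutdown();
              }
          }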


          Jesse Glick added a comment -

          I think I have a design that will work, but still fleshing out a prototype.


          Jesse Glick added a comment -

          Basic stuff now seems to be working.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          pom.xml
          src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java
          src/main/java/org/jenkinsci/plugins/durabletask/Controller.java
          src/main/java/org/jenkinsci/plugins/durabletask/DurableTask.java
          src/main/java/org/jenkinsci/plugins/durabletask/FileMonitoringTask.java
          src/main/java/org/jenkinsci/plugins/durabletask/Handler.java
          src/main/java/org/jenkinsci/plugins/durabletask/executors/OnceRetentionStrategy.java
          src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java
          http://jenkins-ci.org/commit/durable-task-plugin/eaa6b02e8a2aec369f727c29a15a7fc44ebcd12a
          Log:
          JENKINS-38381 Prototype API to receive asynchronous notifications of process output or exit code.
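          The shape of such a callback API might be roughly as follows (an illustrative sketch; names and signatures are assumptions, not the committed Handler code):

          import java.io.IOException;
          import java.io.InputStream;
          import java.io.Serializable;

          /**
           * Illustrative sketch of asynchronous notifications for durable tasks: an agent-side
           * watcher calls these methods, instead of the master polling for status every few seconds.
           */
          public abstract class TaskEventHandler implements Serializable {

              /** Called with new process output produced since the last notification. */
              public abstract void output(InputStream stream) throws IOException, InterruptedException;

              /** Called exactly once when the process exits; output(...) will not be called again. */
              public abstract void exited(int code) throws IOException, InterruptedException;

              private static final long serialVersionUID = 1L;
          }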


          Jesse Glick added a comment -

          Integrated demo is in the logstash PR.


          Ryan Campbell added a comment - edited

          Benefits of this change: it will fix Unicode support. It will also send all log information to a single sink instead of separate logfiles per step, removing a bottleneck in build logging. Also, instead of the master polling for output from the Pipeline, logs are pushed from the agent to the master.

          We agree that we should implement this change, but leave the extension point `PipelineLogFile` marked as beta.

          Our plan is for Jesse to complete testing of this change and integrate it into the master branch when ready.


          Jesse Glick added a comment -

          Baseline (pre-patch) performance for a build with one (echo) step producing a million lines of output on my laptop:

          • Took 1791ms to run the build
          • Took 437ms to write HTML of whole build
          • Took 0ms to compute length of whole build
          • Took 19ms to write truncated HTML of whole build
          • Took 11ms to write plain text of whole build
          • Took 340ms to write HTML of one node
          • Took 0ms to compute length of one node
          • Took 6ms to write truncated HTML of one node
          • Took 10ms to write plain text of one node

          With the patch (but not using remote log storage):

          • Took 3687ms to run the build
          • Took 866ms to write HTML of whole build
          • Took 127ms to compute length of whole build
          • Took 129ms to write truncated HTML of whole build
          • Took 162ms to write plain text of whole build
          • Took 630ms to write HTML of one node
          • Took 115ms to compute length of one node
          • Took 148ms to write truncated HTML of one node
          • Took 75ms to write plain text of one node

          The build itself seems to take longer, suggesting the need for optimization in AnnotatedLogAction.DecoratedTaskListener. (Though this is not a very realistic case, since it represents a step producing megabytes of output all at once, making things easy on copyLogs; in practice output would come incrementally, which the baseline version would probably spend more time copying.)

          As to the rendering of the log (if and when someone asks for it), the second set of numbers represents measurable load on the server, but a log this big seems like it would be rare in the field, and anyway the cost of rendering such log output in Chrome on the client side is much higher. In other words, the subjective overhead of viewing the build is mostly due to the browser, not any processing on the server. time curl -I … produces responses rather quickly. So I am less concerned about that.


          Jesse Glick added a comment -

          Did some optimizations; down to:

          • Took 2525ms to run the build
          • Took 761ms to write HTML of whole build
          • Took 133ms to compute length of whole build
          • Took 86ms to write truncated HTML of whole build
          • Took 111ms to write plain text of whole build
          • Took 379ms to write HTML of one node
          • Took 57ms to compute length of one node
          • Took 64ms to write truncated HTML of one node
          • Took 62ms to write plain text of one node


          Sorin Sbarnea added a comment -

          jglick Any updates on this? It seems that we hit this one, and TBH I do not consider some performance degradation a real problem if it is fixing a real bug.

          We have now had production downtime due to https://issues.jenkins-ci.org/browse/JENKINS-37575 (which depends on this). That would count as more serious than some performance degradation.


          Jesse Glick added a comment -

          recampbell asked for this to be put on hold pending some performance testing. (More to prove that it provides some benefit rather than proving it does not cause some regression, though you never know until you try.) Indeed there are a handful of open bugs in areas of code which are rewritten by this set of PRs. I would like to revisit it.


          Michael Neale added a comment -

          Discussed today - svanoort has a scalability test suite WIP, which is a precondition to this (i.e. we need to be able to measure/prove performance before and after this set of changes). My main concern with the implementation that I have seen is the speed of fetching a log for a specific step (be that in a single file or an external store). I expect svanoort or abayer may pick this up.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          pom.xml
          src/test/java/org/jenkinsci/plugins/workflow/support/actions/LogActionImplTest.java
          http://jenkins-ci.org/commit/workflow-support-plugin/f7777331bfda5ad7694451f748a3d9e80859c2c2
          Log:
          Seems to also allow logging from steps which run a block first, though the logging does come out of logical order (which JENKINS-38381 would address).


          Sam Van Oort added a comment - edited

          Consider it picked up as part of the second batch of storage work – I plan to complete the storage and durability modes first (along with some polish on the scalability testing work). Edit: the reason for that is that the APIs are needed for downstream work.

          Additional comment: profiling found that copyLogs seems to introduce a large share of the garbage generation and GC load when running trivial Pipelines.

          • Specifically due to getLogText from LogActionImpl and WorkflowRun.writeRawLogTo
          • Generally due to creating fresh byte arrays and buffers for I/O rather than reusing single ones, but also due to the large number of File and Stream objects created. That could be fairly trivially resolved by using the same buffer: since copyLogs is guarded by synchronization on the WorkflowRun.completed object, single instances could be reused safely (given suitable implementations of the LargeText/AnnotatedLargeText logic). This would add a few kB to Pipeline's memory overhead but greatly reduce GC load, probably meaning less heap needed overall. See the sketch below.
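          A minimal sketch of the buffer-reuse idea (illustrative only, not the actual WorkflowRun code):

          import java.io.IOException;
          import java.io.InputStream;
          import java.io.OutputStream;

          /** Illustrative only: reuse one buffer across copyLogs calls instead of allocating per call. */
          class LogCopier {
              private final Object completed = new Object(); // stand-in for WorkflowRun.completed
              private final byte[] buffer = new byte[64 * 1024]; // single reusable buffer

              /** Safe because all calls are serialized on the same monitor that guards copyLogs. */
              void copy(InputStream stepLog, OutputStream buildLog) throws IOException {
                  synchronized (completed) {
                      int read;
                      while ((read = stepLog.read(buffer)) != -1) {
                          buildLog.write(buffer, 0, read);
                      }
                  }
              }
          }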


          Ulli Hafner added a comment -

          I wrote another test case that reproduces the problem of JENKINS-32191. It makes parsing of the console log with the warnings plug-in in Pipelines quite a lottery.


          Jesse Glick added a comment -

          workflow-durable-task-step #65 covers everything I have updated recently: there is now a set of independently mergeable PRs for:

          • push rather than pull of durable task output
          • proper handling of Unicode (JENKINS-31096)
          • both of those together (so they could be merged in either order)

          workflow-job #27 is now downstream of those changes, and continues to encapsulate the broader rewrite:

          • single log sink for all Pipeline steps
          • SPI to allow that sink to be an external service
          • redesign of step console annotations for a more usable display in the classic UI (this requires the first item; see the sketch below)
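          One possible way (purely illustrative; not necessarily the implemented annotation format) to attribute interleaved lines in the single build log to their originating steps is to prefix each line with the step's flow node id as it is written, then strip or filter on that prefix when rendering:

          import java.io.FilterOutputStream;
          import java.io.IOException;
          import java.io.OutputStream;
          import java.nio.charset.StandardCharsets;

          /**
           * Illustrative only: prefixes every line a step writes with its flow node id so that
           * a single interleaved build log can later be split back into per-step views.
           */
          class NodePrefixingStream extends FilterOutputStream {
              private final byte[] prefix;
              private boolean atLineStart = true;

              NodePrefixingStream(OutputStream buildLog, String nodeId) {
                  super(buildLog);
                  this.prefix = (nodeId + "|").getBytes(StandardCharsets.UTF_8);
              }

              @Override
              public void write(int b) throws IOException {
                  if (atLineStart) {
                      out.write(prefix);
                      atLineStart = false;
                  }
                  out.write(b);
                  if (b == '\n') {
                      atLineStart = true;
                  }
              }
          }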


          mishal shah added a comment -

          jglick Do you know when this will be released? We are also seeing the infinite log issue.


          Victor Bjelkholm added a comment -

          Any updates on this issue? Currently, this is slowing down builds, and it would be great to have it fixed, or at least marked as "stalled" or "blocked" if that is the situation.


          Jesse Glick added a comment -

          PRs refreshed.


          Steven Foster added a comment -

          Nice, would this make any use of the work on artifact-manager-s3-plugin for the external storage part?


          Jesse Glick added a comment -

          No, external artifact management is unrelated.


          Jesse Glick added a comment -

          Yesterday I redesigned the API to more clearly separate layers of responsibility and allow cloud-based implementations to perform better. I have abandoned the logstash PoC implementation.


          Andreas Sieferlinger added a comment -

          Any update on this? It is currently marked as in progress.


          Jesse Glick added a comment -

          Hoping for some review activity soon.


          Jesse Glick added a comment -

          I filed a JEP submission for this.


          Jesse Glick added a comment -

          Released in beta.


            Assignee: Jesse Glick (jglick)
            Reporter: Jesse Glick (jglick)
            Votes: 35
            Watchers: 83