• Pipeline - October, Pipeline - December

      The current design of SimpleXStreamFlowNodeStorage and LogActionImpl, using workflow/$id.xml and $id.log, was considered the minimum necessary for a working 1.0 release, not a serious implementation. It has two major problems:

      • When there are a lot of steps, as in JENKINS-30055, many small files are created, which is bad for I/O performance.
      • When there is a large amount of output, WorkflowRun.copyLogs must duplicate it all to log, doubling disk space requirements per build.

      It would be better to keep all flow node information in one file. (Perhaps build.xml itself. In principle we could avoid loading non-head nodes with a historical build record, though I believe CpsFlowExecution currently winds up loading them all anyway. Need to check.)

      More importantly, there should be a single log file for the build. LogActionImpl should be deprecated in favor of an implementation that simply stores a rangeset of offsets into that file. When parallel blocks are producing concurrent output, the single log file will be a bit jumbled (probably still human-readable in most cases), but the rangesets will keep track of what output came from where. The final output produced by WorkflowRun will still be processed to split at line boundaries, add in thread labels, etc. (TBD how and whether JENKINS-30777 could be supported in this mode.)
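A minimal sketch of the rangeset idea, under stated assumptions: the class and method names (OffsetLogAction, Range) are hypothetical, not the actual plugin API, and the shared log is read via a plain RandomAccessFile.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of a LogActionImpl replacement that records byte-offset
 * ranges into the single shared build log instead of copying output into a
 * per-node file.
 */
public class OffsetLogAction {
    /** Half-open byte range [start, end) within the shared log file. */
    static final class Range {
        final long start, end;
        Range(long start, long end) { this.start = start; this.end = end; }
    }

    private final List<Range> ranges = new ArrayList<>();

    /** Called as output attributed to this node is appended to the shared log. */
    void addRange(long start, long end) { ranges.add(new Range(start, end)); }

    /** Reassemble this node's output by seeking into the shared log. */
    byte[] readLog(RandomAccessFile sharedLog) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Range r : ranges) {
            byte[] buf = new byte[(int) (r.end - r.start)];
            sharedLog.seek(r.start);
            sharedLog.readFully(buf);
            out.write(buf);
        }
        return out.toByteArray();
    }
}
```

The point is that the node's log occupies no extra disk space; only the small offset list is persisted with the node.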

          [JENKINS-30896] Unoptimized node storage

          Anshu Arya added a comment - edited

          My use case has a job that may run for days looping some workflow steps. This causes the job directory to grow very large; e.g. my job ended with 300,000+ small XML files that amount to 600 MB+.

          This also causes navigating to jenkinsURL/job/JobName/builds/Job_Number/flowGraphTable to crash most browsers.

          Is there any way to disable collecting the flowGraphTable XMLs for long jobs before this bug is fixed?

          Jesse Glick added a comment -

          300k steps is definitely a scalability test! No there is not currently any way to disable the recording of steps which have been run. There could be some sanity cutoff, so that after 1000 steps (say) nothing more is recorded, though this would break (for example) the stage view (cf. JENKINS-31154 if you are not a CloudBees customer) if you ran 100k sh steps, then stage, then another 100k sh steps, then stage, then a final 100k sh steps. Perhaps steps without LabelAction or some other marker of importance could be dropped from the record.

          Anshu Arya added a comment -

          Oh man, just ran out of inodes on my Jenkins server because of too many small files! Hopefully this issue can be pushed up the priority list! Need scalability!

          Sam Van Oort added a comment - edited

          A couple thoughts after spending time in the innards of this structure for pipeline stage view work (the easier ones are things that multiple people could hack away on):

          Indexing:
          Some operations on a FlowGraph require linear iteration over all nodes to generate summary statistics, for example to generate data for visualization or UI. Where the complete FlowNode set isn't needed, a FlowGraphIndex Action could be created that collects summary information about meaningful branches/milestones (parallel steps, stages, etc). Things that would be useful to know: pause/run duration, final status of steps within that milestone, if any of those steps include NotExecutedNodeActions (or similar), etc. It would also be helpful to know the node ids of steps that mark the start/end of those sections.

          This could be persisted on the WorkflowRun (since it implements Actionable).
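An illustrative shape for that proposed FlowGraphIndex action, with one summary record per meaningful milestone (stage or parallel branch). All class names and fields here are hypothetical, not an existing API.

```java
/**
 * Sketch of a FlowGraphIndex action persisted on the WorkflowRun, so that
 * visualizations can read summary data without iterating every FlowNode.
 */
public class FlowGraphIndex {
    static final class Milestone {
        String name;              // e.g. stage name or parallel branch label
        String startNodeId;       // FlowNode id opening the section
        String endNodeId;         // FlowNode id closing it; null while running
        long runDurationMillis;   // wall-clock time spent executing
        long pauseDurationMillis; // time spent paused (input steps, etc.)
        String status;            // worst step result within the section
        boolean notExecuted;      // any NotExecutedNodeAction-style marker
    }

    final java.util.List<Milestone> milestones = new java.util.ArrayList<>();
}
```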

          Aggregate storage:
          A large amount of information about each persisted flownode is identical. This could potentially be stored in very compact serialization, while allowing the APIs to transparently retrieve this as if it was stored individually with each item. For example, XStream supports storing common actions as attributes on a Node element where a SingleValueConverter can handle them. Aliasing allows for shorter names for classes (ex, common actions such as log encoding). Node trees for non-head nodes might be stored as an XML tree structure (either with the head at the top, or starting from the initial execution). It would make sense to store the current heads in individual files, until the step completes, and pack into the more compact XML objects periodically (to avoid serializing a large object often). Alternately, we append FlowNode data to a run log file as execution occurs, then periodically regenerate the packed file and empty the log.

          It could be possible to store only unique values for actions, and provide just a mapping table from these actions to FlowNodes. Many actions are used as markers, and across nodes their values are usually identical (log encoding, for example). This requires more complexity to handle, however.
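A minimal sketch of that mapping-table idea. Action values are stood in for by String, and all names here (DedupActionTable, etc.) are illustrative rather than proposed API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Stores each distinct action value once, and per node only a list of
 * indices into that table, instead of serializing identical copies.
 */
public class DedupActionTable {
    private final List<String> uniqueActions = new ArrayList<>();
    private final Map<String, Integer> actionIndex = new HashMap<>();
    // FlowNode id -> indices into uniqueActions
    private final Map<String, int[]> nodeActions = new HashMap<>();

    void put(String nodeId, List<String> actions) {
        int[] idx = new int[actions.size()];
        for (int i = 0; i < actions.size(); i++) {
            // Reuse the existing entry if this action value was seen before.
            idx[i] = actionIndex.computeIfAbsent(actions.get(i), k -> {
                uniqueActions.add(k);
                return uniqueActions.size() - 1;
            });
        }
        nodeActions.put(nodeId, idx);
    }

    /** Transparently reassemble the per-node action list. */
    List<String> get(String nodeId) {
        List<String> out = new ArrayList<>();
        for (int i : nodeActions.get(nodeId)) out.add(uniqueActions.get(i));
        return out;
    }

    int uniqueCount() { return uniqueActions.size(); }
}
```

With thousands of nodes sharing a handful of marker actions, the table stays tiny while retrieval still looks per-node to callers.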

          The XStream reference documentation assists with some of this.

          For implementation, the FlowActionStorage interface provides storage/retrieval of Actions for each FlowNode, and implementations of this have some freedom with how they do storage/recall. FlowNodeStorage extends this to provide storage/recall for individual nodes. The actions need not read/write to the same files as the node hierarchy, however. SimpleXStreamFlowNodeStorage is the only existing implementation.

          Compression:
          Due to the duplication mentioned above, FlowGraphs are highly compressible by nature, even moreso than general XML. LZ4/QuickLZ/LZF/Snappy offer compression options that can compress and decompress as a passthrough filter on an InputStream/OutputStream. They compress less than GZIP, but can compress at hundreds of MB/s on modern CPUs, and can be memory-bandwidth bound on decompression (sometimes GB/s). Might be worth considering to reduce disk use by large FlowGraphs in single files. They also can be used for in-memory compression, and might reduce the memory footprint considerably.

          For Jesse, this may all be trivial and self-evident, but for someone else, it may offer a useful point to contribute on this.

          Notes on previous optimization:
          https://github.com/jenkinsci/workflow-plugin/pull/213
          https://github.com/jenkinsci/workflow-plugin/pull/301

          Note on read of consolidated logs: AnnotatedLargeText in Jenkins Impl

          Guts are all in the workflow-support-plugin.

          Sam Van Oort added a comment -

          Capturing notes from discussion (and scattered across various JIRAs/PR comments/etc):

          • It would be advantageous to store an inline index of block start/stop nodes, allowing for resolution of node to block via range queries
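A sketch of such an inline index, assuming numeric, increasing node ids (as iota-assigned ids are) and non-overlapping top-level blocks; nested blocks would need walking outward from the floor entry. Names are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Inline index of block start/end node ids, supporting resolution of a
 * node to its enclosing block via a floor (range) query instead of
 * walking the graph.
 */
public class BlockIndex {
    // block start node id -> block end node id
    private final TreeMap<Integer, Integer> blocks = new TreeMap<>();

    void addBlock(int startId, int endId) { blocks.put(startId, endId); }

    /** Returns the start id of the block containing nodeId, or -1 if none. */
    int enclosingBlock(int nodeId) {
        Map.Entry<Integer, Integer> e = blocks.floorEntry(nodeId);
        if (e != null && nodeId <= e.getValue()) return e.getKey();
        return -1;
    }
}
```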

          Sam Van Oort added a comment -

          To give one order-of-magnitude estimate for scaling, for the CloudBees internal CI, the mean and median iota (roughly, the number of flow nodes) across the last successful run of each pipeline job are 86 and 52 respectively.

          Runs with several hundred nodes are not uncommon.

          As long as we take some measures to simplify serialization/deserialization in storage to reduce the CPU hit, batching nodes in chunks of 100 would be reasonable (1-2 kB per node, an entire flow can be loaded/saved in one 50-200 kB I/O).

          Sam Van Oort added a comment -

          Note for later implementation of log consolidation:

          WorkflowRun.copyLogs + logNodeMessage (writing to a WorkflowConsoleLogger)

          IIUC there's a StreamBuildListener attached to the WorkflowRun which gets log outputs; this is sliced and copied into a new log file for each FlowNode periodically (at the least upon completion, via copyLogs), which becomes part of a LogActionImpl.

          If we wanted we could write directly to an output file, and then use an alternate LogAction version that gives offsets within this file. Interleaved streams from parallels are an issue though (probably we need to keep separate log streams for each, numbered by parallel # or iota and branch order, i.e. 1-1.log, 1-2.log, 2-1.log, 2-2.log). Would need to create new StreamBuildListeners in this case.
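A minimal sketch of that per-branch demultiplexing. Streams are in-memory stand-ins here; a real implementation would open files in the build directory and attach a StreamBuildListener to each. All names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

/**
 * Routes output to one stream per parallel branch, named by parallel
 * number and branch order (1-1.log, 1-2.log, ...), so concurrent
 * branches never interleave within a single file.
 */
public class BranchLogDemux {
    private final Map<String, ByteArrayOutputStream> streams = new HashMap<>();

    private ByteArrayOutputStream streamFor(int parallelNum, int branchOrder) {
        String name = parallelNum + "-" + branchOrder + ".log";
        return streams.computeIfAbsent(name, k -> new ByteArrayOutputStream());
    }

    /** Append a line of output attributed to one branch. */
    void write(int parallelNum, int branchOrder, String line) {
        streamFor(parallelNum, branchOrder).writeBytes(line.getBytes());
    }

    String contents(int parallelNum, int branchOrder) {
        return streamFor(parallelNum, branchOrder).toString();
    }
}
```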

          There may also be a case for something like NIO with channels, since we're basically redirecting inputs from one channel into a multiplexer (reactor pattern?).

          Jesse Glick added a comment -

          Splitting log portion into JENKINS-38381 since I think it is independent.

          Sam Van Oort added a comment -

          Most of this is covered by JENKINS-47173 with the bulk storage plus some other enhancements done since then. 

          At least for now, I'm closing this until and unless we feel a need for more complex storage that will incrementally persist FlowNodes – not convinced that's worthwhile though, and the bulk persistence allows a whole wad of other optimizations.