    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Minor
    • Component: workflow-api-plugin

      When using the stash build step on a Tegra X1 build slave, it is orders of magnitude slower than it should be: a single 20 MB file takes 29 seconds to archive, and larger files take proportionally longer. The device is connected by 1 Gb/s Ethernet and the file is stored on an SSD, so the transfer should take well under 1 s. While this is happening, the Java process on the slave sits at 100% CPU usage.

      Sample pipeline script:

      stage 'prepare'
      node('tegra-cuda') {
          deleteDir()
          sh 'dd if=/dev/zero of=dummy bs=1M count=20'
          sh 'date'
          stash name: 'source', includes: 'dummy'
          sh 'date'
      }
      

      Output:

      [Pipeline] stage (prepare)
      Entering stage prepare
      Proceeding
      [Pipeline] node
      Running on e5f011df5aa1-e069b02f in /var/lib/jenkins/workspace/ARM stash test
      [Pipeline] {
      [Pipeline] deleteDir
      [Pipeline] sh
      [ARM stash test] Running shell script
      + dd if=/dev/zero of=dummy bs=1M count=20
      20+0 records in
      20+0 records out
      20971520 bytes (21 MB) copied, 0.0732756 s, 286 MB/s
      [Pipeline] sh
      [ARM stash test] Running shell script
      + date
      Mon Jul 25 11:02:05 UTC 2016
      [Pipeline] stash
      Stashed 1 file(s)
      [Pipeline] sh
      [ARM stash test] Running shell script
      + date
      Mon Jul 25 11:02:34 UTC 2016
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] End of Pipeline
      Finished: SUCCESS
      

      Thread dump for the relevant slave:

      "Channel reader thread: channel" Id=11 Group=main RUNNABLE (in native)
      	at java.net.SocketInputStream.socketRead0(Native Method)
      	at java.net.SocketInputStream.read(SocketInputStream.java:152)
      	at java.net.SocketInputStream.read(SocketInputStream.java:122)
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
      	-  locked java.io.BufferedInputStream@b24124
      	at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
      	at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
      	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
      	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
      	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      
      "main" Id=1 Group=main WAITING on hudson.remoting.Engine@13108c4
      	at java.lang.Object.wait(Native Method)
      	-  waiting on hudson.remoting.Engine@13108c4
      	at java.lang.Thread.join(Thread.java:1281)
      	at java.lang.Thread.join(Thread.java:1355)
      	at hudson.remoting.jnlp.Main.main(Main.java:137)
      	at hudson.remoting.jnlp.Main._main(Main.java:130)
      	at hudson.remoting.jnlp.Main.main(Main.java:96)
      	at hudson.plugins.swarm.SwarmClient.connect(SwarmClient.java:239)
      	at hudson.plugins.swarm.Client.run(Client.java:107)
      	at hudson.plugins.swarm.Client.main(Client.java:68)
      
      "Ping thread for channel hudson.remoting.Channel@195804b:channel" Id=16 Group=main TIMED_WAITING
      	at java.lang.Thread.sleep(Native Method)
      	at hudson.remoting.PingThread.run(PingThread.java:90)
      
      "pool-1-thread-274 for channel" Id=1122 Group=main RUNNABLE
      	at com.jcraft.jzlib.Deflate.fill_window(Deflate.java:966)
      	at com.jcraft.jzlib.Deflate.deflate_slow(Deflate.java:1125)
      	at com.jcraft.jzlib.Deflate.deflate(Deflate.java:1587)
      	at com.jcraft.jzlib.Deflater.deflate(Deflater.java:140)
      	at com.jcraft.jzlib.DeflaterOutputStream.deflate(DeflaterOutputStream.java:129)
      	at com.jcraft.jzlib.DeflaterOutputStream.write(DeflaterOutputStream.java:102)
      	at org.apache.commons.compress.utils.CountingOutputStream.write(CountingOutputStream.java:48)
      	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.writeRecord(TarArchiveOutputStream.java:571)
      	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.write(TarArchiveOutputStream.java:435)
      	at hudson.util.io.TarArchiver.visit(TarArchiver.java:100)
      	at hudson.util.DirScanner.scanSingle(DirScanner.java:49)
      	at hudson.util.DirScanner$Glob.scan(DirScanner.java:131)
      	at hudson.FilePath$1.invoke(FilePath.java:463)
      	at hudson.FilePath$1.invoke(FilePath.java:459)
      	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2772)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:120)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:48)
      	at hudson.remoting.Request$2.run(Request.java:326)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at hudson.remoting.Engine$1$1.run(Engine.java:62)
      	at java.lang.Thread.run(Thread.java:745)
      
      	Number of locked synchronizers = 1
      	- java.util.concurrent.ThreadPoolExecutor$Worker@17d329e
      
      "pool-1-thread-275 for channel" Id=1126 Group=main RUNNABLE
      	at sun.management.ThreadImpl.dumpThreads0(Native Method)
      	at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:446)
      	at hudson.Functions.getThreadInfos(Functions.java:1196)
      	at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:98)
      	at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:95)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:120)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:48)
      	at hudson.remoting.Request$2.run(Request.java:326)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at hudson.remoting.Engine$1$1.run(Engine.java:62)
      	at java.lang.Thread.run(Thread.java:745)
      
      	Number of locked synchronizers = 1
      	- java.util.concurrent.ThreadPoolExecutor$Worker@a0b835
      
      "RemoteInvocationHandler [#1]" Id=10 Group=main TIMED_WAITING on java.lang.ref.ReferenceQueue$Lock@d2df5d
      	at java.lang.Object.wait(Native Method)
      	-  waiting on java.lang.ref.ReferenceQueue$Lock@d2df5d
      	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
      	at hudson.remoting.RemoteInvocationHandler$Unexporter.run(RemoteInvocationHandler.java:415)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110)
      	at java.lang.Thread.run(Thread.java:745)
      
      "Thread-1" Id=9 Group=main TIMED_WAITING on hudson.remoting.Channel@195804b
      	at java.lang.Object.wait(Native Method)
      	-  waiting on hudson.remoting.Channel@195804b
      	at hudson.remoting.Channel.join(Channel.java:948)
      	at hudson.remoting.Engine.run(Engine.java:267)
      
      "Finalizer" Id=3 Group=system WAITING on java.lang.ref.ReferenceQueue$Lock@466bb5
      	at java.lang.Object.wait(Native Method)
      	-  waiting on java.lang.ref.ReferenceQueue$Lock@466bb5
      	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
      	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
      	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
      
      "process reaper" Id=1123 Group=system TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack@6e9009
      	at sun.misc.Unsafe.park(Native Method)
      	-  waiting on java.util.concurrent.SynchronousQueue$TransferStack@6e9009
      	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
      	at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
      	at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
      	at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:942)
      	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      
      "Reference Handler" Id=2 Group=system WAITING on java.lang.ref.Reference$Lock@1496b81
      	at java.lang.Object.wait(Native Method)
      	-  waiting on java.lang.ref.Reference$Lock@1496b81
      	at java.lang.Object.wait(Object.java:503)
      	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
      
      "Signal Dispatcher" Id=4 Group=system RUNNABLE
      

      I've taken a guess at the appropriate component, but I'm not sure which component corresponds to "workflow-basic-steps" (https://jenkins.io/doc/pipeline/steps/workflow-basic-steps/).

          [JENKINS-36914] stash step is excessively slow on ARM

          Jesse Glick added a comment -

          Possibly not a duplicate, unclear.

          Jesse Glick added a comment -

          stash and unstash are not intended for large files. Use the External Workspace Manager plugin, or an external artifact manager like Nexus or Artifactory.

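          For illustration, a minimal sketch of that second suggestion: upload the large file straight from the agent to a repository manager instead of stashing it, so no tar/gzip work runs on the ARM CPU. The repository URL and the nexus-creds credentials ID below are placeholders, not taken from this issue.

          // Sketch only: push the big file to an external artifact repository
          // instead of using stash. URL and credentialsId are hypothetical.
          node('tegra-cuda') {
              sh 'dd if=/dev/zero of=dummy bs=1M count=20'
              withCredentials([usernamePassword(credentialsId: 'nexus-creds',
                                                usernameVariable: 'NEXUS_USER',
                                                passwordVariable: 'NEXUS_PASS')]) {
                  // curl streams the file as-is; compression (if any) is the server's problem
                  sh 'curl -sSf -u "$NEXUS_USER:$NEXUS_PASS" -T dummy https://nexus.example.com/repository/raw-builds/stash-test/dummy'
              }
          }
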
          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/resources/org/jenkinsci/plugins/workflow/support/steps/stash/StashStep/help.html
          http://jenkins-ci.org/commit/workflow-basic-steps-plugin/413df48bdcb832261e8fb110150eeb8069e77c33
          Log:
          JENKINS-38640 JENKINS-36914 Warn users to avoid stash/unstash of large files

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/resources/org/jenkinsci/plugins/workflow/support/steps/stash/StashStep/help.html
          http://jenkins-ci.org/commit/workflow-basic-steps-plugin/f541cd2cda5f316042d38af556f4160f7e470ccf
          Log:
          Merge pull request #23 from jenkinsci/jglick-stash-docs

          JENKINS-38640 JENKINS-36914 Warn users to avoid stash/unstash of large files

          Compare: https://github.com/jenkinsci/workflow-basic-steps-plugin/compare/95e202bec553...f541cd2cda5f

          Sam Van Oort added a comment -

          bmerry Unfortunately I'm quite confident this is a result of the GZIP compression applied to stashes – the hardware itself probably does not have high performance for that algorithm. This operation thus becomes CPU-bound rather than network or I/O bound.

          I don't think we can really resolve this one since it's tied to the algorithm itself – using the latest compatible JDKs may help some. Jesse has suggested some workarounds as well. Finally, we may be able to switch out some of the GZIP implementations to higher-performance versions within the Jenkins core, but even so I doubt it'll yield great performance on ARM hardware.

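          One quick way to check that hypothesis on the agent itself is to time the system gzip against the same 20 MB file and compare with the ~0.7 MB/s seen through the pure-Java jzlib path. This is only a rough comparison (native code, possibly a different compression level), and it mirrors the date-bracketing approach from the original report:

          node('tegra-cuda') {
              sh 'dd if=/dev/zero of=dummy bs=1M count=20'
              // If native gzip finishes far faster than ~0.7 MB/s on the same data,
              // the bottleneck is the pure-Java deflate, not the ARM CPU as such.
              sh 'date; gzip -c dummy > /dev/null; date'
          }
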
          Sam Van Oort added a comment -

          Closing since the root cause here is that GZIP is demanding on ARM hardware.

          Jesse Glick added a comment -

          Do you have some evidence of this being an issue with Gzip on ARM? If so, it would be appropriate to use an uncompressed transfer method on that platform. Really it might be better to do so all the time—the Remoting transport is generally responsible for compression.

          Of course if you are using https://plugins.jenkins.io/artifact-manager-s3 this would not be an issue.

          Sam Van Oort added a comment -

          jglick I know that the discrepancy vs. raw IO speeds is almost certainly due to GZIP compress/decompress. Using an uncompressed method might be beneficial in some cases (especially with poorly-compressible data).

          I'd have to benchmark the GZIP implementation on that specific platform and compare it to the same one on an Intel/AMD laptop processor – but the benchmarks here show pretty large differences in performance between ARM and Intel processors: https://quixdb.github.io/squash-benchmark/

          And if you're only using a single CPU thread to do compression, with a pure-Java implementation that is potentially less than optimal for the platform, then 0.7 MB/s (roughly 20 MB in 29 s, per the log above) is not an unreasonable compression rate. It's in the ballpark anyway - I'm seeing <10 MB/s compression rates reported for various quad-core+ ARM chips for native-code implementations of that compression algorithm, using multiple threads. Remember we're talking about processors that only have a few watts to play with and fairly small cache sizes.

          Jesse Glick added a comment -

          Could just switch to ArchiverFactory.TAR and do the compression on the master side (and similarly for unstashing). There is generally no benefit to using compression at this level regardless of the processor—a LAN is generally quite capable of handling high bandwidth, and many agent connection methods add transport-level compression anyway. On the other hand it is better to burn CPU on an agent, even at the expense of longer build times, if it saves a little CPU on the master, and we do not want to store stashes uncompressed.

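          A rough Groovy sketch of that idea, assuming hudson.FilePath keeps an archive(ArchiverFactory, OutputStream, String) overload (worth checking against the current core API); the helper itself is hypothetical, not the actual StashManager code. Because the GZIPOutputStream wraps a master-side stream, the agent only builds a plain tar and the deflate work lands on the master JVM:

          import hudson.FilePath
          import hudson.util.io.ArchiverFactory
          import java.util.zip.GZIPOutputStream

          // Hypothetical helper: 'workspace' is a FilePath on the agent,
          // 'stashFile' lives in the build directory on the master.
          void stashWithMasterSideGzip(FilePath workspace, String includes, File stashFile) {
              stashFile.withOutputStream { os ->
                  def gz = new GZIPOutputStream(os)                     // deflate runs on the master
                  workspace.archive(ArchiverFactory.TAR, gz, includes)  // plain tar built on the agent
                  gz.finish()
              }
          }
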
          Sam Van Oort added a comment -

          jglick I'd agree we should burn agent CPU over master CPU, but I do think it's worth saving some storage space on the master. The happy medium would be a high-performance algorithm such as LZ4, LZO, or LZF, which gets most of the benefits of compression and can shrink highly compressible content at a much lower CPU cost than Deflate (used by GZIP).

          I've seen very positive results with those algorithms in the past – they're fast enough that if you're transmitting fairly compressible content (e.g. JSON or XML payloads, or source code) you can see time savings even in a data center with gigabit links.

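          As an illustration of that trade-off (lz4-java is not bundled with Jenkins core, so this is a hypothetical sketch rather than a proposed patch), an LZ4 block stream could sit where the deflater stream sits today:

          import net.jpountz.lz4.LZ4BlockInputStream
          import net.jpountz.lz4.LZ4BlockOutputStream

          // LZ4 block framing trades some compression ratio for far less CPU per byte,
          // which is what matters on small ARM cores.
          OutputStream wrapStashOutput(OutputStream raw) {
              return new LZ4BlockOutputStream(raw)   // default 64 KB block size
          }

          InputStream wrapStashInput(InputStream raw) {
              return new LZ4BlockInputStream(raw)
          }
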
            Assignee: Unassigned
            Reporter: bmerry (Bruce Merry)
            Votes: 2
            Watchers: 6
