JENKINS-36914: stash step is excessively slow on ARM


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Defect
    • Component: workflow-api-plugin

    Description

      When using the stash build step on a Tegra X1 build slave, it is orders of magnitude slower than it should be: a single file of 20MB takes 29 seconds to archive, and larger files take proportionally longer. The device is connected by 1Gb/s ethernet and the file is stored on an SSD, so it should take well under 1s. While this is happening, the java process on the slave is at 100% CPU usage.

      Sample pipeline script:

      stage 'prepare'
      node('tegra-cuda') {
          deleteDir()
          sh 'dd if=/dev/zero of=dummy bs=1M count=20'
          sh 'date'
          stash name: 'source', includes: 'dummy'
          sh 'date'
      }
      

      Output:

      [Pipeline] stage (prepare)
      Entering stage prepare
      Proceeding
      [Pipeline] node
      Running on e5f011df5aa1-e069b02f in /var/lib/jenkins/workspace/ARM stash test
      [Pipeline] {
      [Pipeline] deleteDir
      [Pipeline] sh
      [ARM stash test] Running shell script
      + dd if=/dev/zero of=dummy bs=1M count=20
      20+0 records in
      20+0 records out
      20971520 bytes (21 MB) copied, 0.0732756 s, 286 MB/s
      [Pipeline] sh
      [ARM stash test] Running shell script
      + date
      Mon Jul 25 11:02:05 UTC 2016
      [Pipeline] stash
      Stashed 1 file(s)
      [Pipeline] sh
      [ARM stash test] Running shell script
      + date
      Mon Jul 25 11:02:34 UTC 2016
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] End of Pipeline
      Finished: SUCCESS
      

      Thread dump for the relevant slave:

      Channel reader thread: channel
      
      "Channel reader thread: channel" Id=11 Group=main RUNNABLE (in native)
      	at java.net.SocketInputStream.socketRead0(Native Method)
      	at java.net.SocketInputStream.read(SocketInputStream.java:152)
      	at java.net.SocketInputStream.read(SocketInputStream.java:122)
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
      	-  locked java.io.BufferedInputStream@b24124
      	at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
      	at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
      	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
      	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
      	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      
      main
      
      "main" Id=1 Group=main WAITING on hudson.remoting.Engine@13108c4
      	at java.lang.Object.wait(Native Method)
      	-  waiting on hudson.remoting.Engine@13108c4
      	at java.lang.Thread.join(Thread.java:1281)
      	at java.lang.Thread.join(Thread.java:1355)
      	at hudson.remoting.jnlp.Main.main(Main.java:137)
      	at hudson.remoting.jnlp.Main._main(Main.java:130)
      	at hudson.remoting.jnlp.Main.main(Main.java:96)
      	at hudson.plugins.swarm.SwarmClient.connect(SwarmClient.java:239)
      	at hudson.plugins.swarm.Client.run(Client.java:107)
      	at hudson.plugins.swarm.Client.main(Client.java:68)
      
      Ping thread for channel hudson.remoting.Channel@195804b:channel
      
      "Ping thread for channel hudson.remoting.Channel@195804b:channel" Id=16 Group=main TIMED_WAITING
      	at java.lang.Thread.sleep(Native Method)
      	at hudson.remoting.PingThread.run(PingThread.java:90)
      
      pool-1-thread-274 for channel
      
      "pool-1-thread-274 for channel" Id=1122 Group=main RUNNABLE
      	at com.jcraft.jzlib.Deflate.fill_window(Deflate.java:966)
      	at com.jcraft.jzlib.Deflate.deflate_slow(Deflate.java:1125)
      	at com.jcraft.jzlib.Deflate.deflate(Deflate.java:1587)
      	at com.jcraft.jzlib.Deflater.deflate(Deflater.java:140)
      	at com.jcraft.jzlib.DeflaterOutputStream.deflate(DeflaterOutputStream.java:129)
      	at com.jcraft.jzlib.DeflaterOutputStream.write(DeflaterOutputStream.java:102)
      	at org.apache.commons.compress.utils.CountingOutputStream.write(CountingOutputStream.java:48)
      	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.writeRecord(TarArchiveOutputStream.java:571)
      	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.write(TarArchiveOutputStream.java:435)
      	at hudson.util.io.TarArchiver.visit(TarArchiver.java:100)
      	at hudson.util.DirScanner.scanSingle(DirScanner.java:49)
      	at hudson.util.DirScanner$Glob.scan(DirScanner.java:131)
      	at hudson.FilePath$1.invoke(FilePath.java:463)
      	at hudson.FilePath$1.invoke(FilePath.java:459)
      	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2772)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:120)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:48)
      	at hudson.remoting.Request$2.run(Request.java:326)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at hudson.remoting.Engine$1$1.run(Engine.java:62)
      	at java.lang.Thread.run(Thread.java:745)
      
      	Number of locked synchronizers = 1
      	- java.util.concurrent.ThreadPoolExecutor$Worker@17d329e
      
      pool-1-thread-275 for channel
      
      "pool-1-thread-275 for channel" Id=1126 Group=main RUNNABLE
      	at sun.management.ThreadImpl.dumpThreads0(Native Method)
      	at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:446)
      	at hudson.Functions.getThreadInfos(Functions.java:1196)
      	at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:98)
      	at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:95)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:120)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:48)
      	at hudson.remoting.Request$2.run(Request.java:326)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at hudson.remoting.Engine$1$1.run(Engine.java:62)
      	at java.lang.Thread.run(Thread.java:745)
      
      	Number of locked synchronizers = 1
      	- java.util.concurrent.ThreadPoolExecutor$Worker@a0b835
      
      RemoteInvocationHandler [#1]
      
      "RemoteInvocationHandler [#1]" Id=10 Group=main TIMED_WAITING on java.lang.ref.ReferenceQueue$Lock@d2df5d
      	at java.lang.Object.wait(Native Method)
      	-  waiting on java.lang.ref.ReferenceQueue$Lock@d2df5d
      	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
      	at hudson.remoting.RemoteInvocationHandler$Unexporter.run(RemoteInvocationHandler.java:415)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110)
      	at java.lang.Thread.run(Thread.java:745)
      
      Thread-1
      
      "Thread-1" Id=9 Group=main TIMED_WAITING on hudson.remoting.Channel@195804b
      	at java.lang.Object.wait(Native Method)
      	-  waiting on hudson.remoting.Channel@195804b
      	at hudson.remoting.Channel.join(Channel.java:948)
      	at hudson.remoting.Engine.run(Engine.java:267)
      
      Finalizer
      
      "Finalizer" Id=3 Group=system WAITING on java.lang.ref.ReferenceQueue$Lock@466bb5
      	at java.lang.Object.wait(Native Method)
      	-  waiting on java.lang.ref.ReferenceQueue$Lock@466bb5
      	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
      	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
      	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
      
      process reaper
      
      "process reaper" Id=1123 Group=system TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack@6e9009
      	at sun.misc.Unsafe.park(Native Method)
      	-  waiting on java.util.concurrent.SynchronousQueue$TransferStack@6e9009
      	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
      	at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
      	at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
      	at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:942)
      	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      
      Reference Handler
      
      "Reference Handler" Id=2 Group=system WAITING on java.lang.ref.Reference$Lock@1496b81
      	at java.lang.Object.wait(Native Method)
      	-  waiting on java.lang.ref.Reference$Lock@1496b81
      	at java.lang.Object.wait(Object.java:503)
      	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
      
      Signal Dispatcher
      
      "Signal Dispatcher" Id=4 Group=system RUNNABLE
      

      I've taken a guess at the appropriate component, but I'm not sure which component corresponds to "workflow-basic-steps" (https://jenkins.io/doc/pipeline/steps/workflow-basic-steps/).

          Activity

            svanoort Sam Van Oort added a comment -

             Closing since the root cause here is that GZIP is demanding on ARM hardware.

            jglick Jesse Glick added a comment -

            Do you have some evidence of this being an issue with Gzip on ARM? If so, it would be appropriate to use an uncompressed transfer method on that platform. Really it might be better to do so all the time—the Remoting transport is generally responsible for compression.

            Of course if you are using https://plugins.jenkins.io/artifact-manager-s3 this would not be an issue.

            svanoort Sam Van Oort added a comment -

            jglick I know that the discrepancy vs. raw IO speeds is almost certainly due to GZIP compress/decompress. Using an uncompressed method might be beneficial in some cases (especially with poorly-compressible data).

            I'd have to benchmark the GZIP implementation on that specific platform and compare it to the same one on an Intel/AMD laptop processor – but the benchmarks here show pretty large differences in performance between ARM and Intel processors: https://quixdb.github.io/squash-benchmark/

            And if you're only using a single CPU thread to do the compression, with a pure-Java implementation that is potentially less than optimal for the platform, then 0.7 MB/s is not an unreasonable compression rate. It's in the ballpark anyway - I'm seeing compression rates under 10 MB/s reported for various quad-core+ ARM chips even for native-code implementations of that compression algorithm using multiple threads. Remember we're talking about processors that only have a few watts to play with and fairly small cache sizes.

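            (For reference, the reported numbers work out to roughly 20 MB / 29 s ≈ 0.7 MB/s.) A minimal sketch of such a benchmark, using java.util.zip.Deflater as a stand-in for the jzlib deflater visible in the thread dump, so the result is only a rough proxy for what stash actually does: run it once on the Tegra X1 agent JVM and once on an x86 machine, then compare the reported rates.

            import java.util.zip.Deflater;

            /**
             * Rough single-threaded deflate throughput check. The input is 20 MB of
             * zeros to mirror the dd reproduction above; swap in random bytes for a
             * worst-case (incompressible) run.
             */
            public class DeflateThroughput {
                public static void main(String[] args) {
                    byte[] input = new byte[20 * 1024 * 1024]; // zero-filled, like /dev/zero
                    byte[] scratch = new byte[64 * 1024];      // output is discarded, only timed

                    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
                    long start = System.nanoTime();
                    deflater.setInput(input);
                    deflater.finish();
                    long compressed = 0;
                    while (!deflater.finished()) {
                        compressed += deflater.deflate(scratch);
                    }
                    deflater.end();
                    double seconds = (System.nanoTime() - start) / 1e9;
                    System.out.printf("%d -> %d bytes in %.2f s (%.1f MB/s)%n",
                            input.length, compressed, seconds,
                            input.length / seconds / (1024.0 * 1024.0));
                }
            }
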
            jglick Jesse Glick added a comment -

            Could just switch to ArchiverFactory.TAR and do the compression on the master side (and similarly for unstashing). There is generally no benefit to using compression at this level regardless of the processor—a LAN is generally quite capable of handling high bandwidth, and many agent connection methods add transport-level compression anyway. On the other hand it is better to burn CPU on an agent, even at the expense of longer build times, if it saves a little CPU on the master, and we do not want to store stashes uncompressed.

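            A hypothetical sketch of that idea using the existing hudson.FilePath.archive API: the agent side only produces an uncompressed TAR stream over the remoting channel, and the gzip deflate runs in the master JVM. This is not the actual plugin change; the class and the stashFile parameter are made-up names standing in for wherever the stash ends up being stored.

            import hudson.FilePath;
            import hudson.util.DirScanner;
            import hudson.util.io.ArchiverFactory;

            import java.io.IOException;
            import java.io.OutputStream;
            import java.util.zip.GZIPOutputStream;

            /** Sketch only: archive uncompressed on the agent, deflate on the master. */
            class MasterSideCompressionSketch {
                static void stash(FilePath workspace, String includes, String excludes,
                                  OutputStream stashFile) throws IOException, InterruptedException {
                    // GZIPOutputStream runs where this code runs (assumed to be the master);
                    // the agent end of FilePath.archive streams plain TAR bytes over the channel.
                    try (GZIPOutputStream gz = new GZIPOutputStream(stashFile)) {
                        workspace.archive(ArchiverFactory.TAR, gz, new DirScanner.Glob(includes, excludes));
                    }
                }
            }
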
            svanoort Sam Van Oort added a comment -

            jglick I'd agree we should burn agent CPU over master CPU, but I do think it's worth saving some storage space on the master. The happy medium would be a high-performance algorithm such as LZ4, LZO, or LZF, which gets most of the benefits of compression and can shrink highly compressible content at a much lower CPU cost than Deflate (used by GZIP).

            I've seen very positive results with those algorithms in the past – they're fast enough that if you're transmitting fairly compressible content (e.g. JSON or XML payloads, or source code) you can see time savings even in a data center with gigabit links.

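            A hypothetical illustration of that kind of stream swap using the lz4-java library (which the plugin does not use today): an LZ4BlockOutputStream drops in where a GZIP/Deflater stream sits now, trading some compression ratio for a much lower CPU cost.

            import net.jpountz.lz4.LZ4BlockInputStream;
            import net.jpountz.lz4.LZ4BlockOutputStream;

            import java.io.ByteArrayInputStream;
            import java.io.ByteArrayOutputStream;
            import java.io.IOException;
            import java.io.InputStream;
            import java.io.OutputStream;

            /** Sketch only: LZ4-framed round trip of a 20 MB payload. */
            public class Lz4StashSketch {
                public static void main(String[] args) throws IOException {
                    byte[] payload = new byte[20 * 1024 * 1024]; // 20 MB of zeros, like the dd test

                    ByteArrayOutputStream stash = new ByteArrayOutputStream();
                    long start = System.nanoTime();
                    try (OutputStream out = new LZ4BlockOutputStream(stash)) {
                        out.write(payload);
                    }
                    double seconds = (System.nanoTime() - start) / 1e9;
                    System.out.printf("LZ4: %d -> %d bytes in %.3f s%n",
                            payload.length, stash.size(), seconds);

                    // Decompress to show the framed stream is self-contained.
                    try (InputStream in = new LZ4BlockInputStream(
                            new ByteArrayInputStream(stash.toByteArray()))) {
                        byte[] back = new byte[payload.length];
                        int off = 0, n;
                        while (off < back.length && (n = in.read(back, off, back.length - off)) > 0) {
                            off += n;
                        }
                        System.out.println("round-tripped " + off + " bytes");
                    }
                }
            }
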

            People

              Assignee: Unassigned
              Reporter: bmerry Bruce Merry
              Votes: 2
              Watchers: 6
