
High CPU consumption because of SSH communication

      Our Jenkins instance has been running 1.553 for two weeks (it ran a three-year-old version before) on Linux with 10 slaves: 6 Linux slaves connected through SSH and 4 Windows slaves running the WebStart agent as a service.

      When no job is running, CPU usage is almost 0, as expected.

      When only two or three jobs are running, Jenkins CPU usage rises to between 70% and 160%. atop reports 96% of CPU time spent in IRQ handling (even with virtually no disk access), and most of the CPU consumption is accounted as system time. On average since boot, the master node consumes 100% of one CPU. Even though it has 4 CPUs, job execution takes between 2x and 3x longer than with the older version.

      I configured JMX and did a quick CPU profiling. The top-consuming threads are unnamed and are all related to SSH communication:

      "Thread-13" - Thread t@90
         java.lang.Thread.State: RUNNABLE
      	at java.net.SocketInputStream.socketRead0(Native Method)
      	at java.net.SocketInputStream.read(SocketInputStream.java:152)
      	at java.net.SocketInputStream.read(SocketInputStream.java:122)
      	at com.trilead.ssh2.crypto.cipher.CipherInputStream.fill_buffer(CipherInputStream.java:41)
      	at com.trilead.ssh2.crypto.cipher.CipherInputStream.internal_read(CipherInputStream.java:52)
      	at com.trilead.ssh2.crypto.cipher.CipherInputStream.getBlock(CipherInputStream.java:79)
      	at com.trilead.ssh2.crypto.cipher.CipherInputStream.read(CipherInputStream.java:108)
      	at com.trilead.ssh2.transport.TransportConnection.receiveMessage(TransportConnection.java:232)
      	at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:682)
      	at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:480)
      	at java.lang.Thread.run(Thread.java:744)
      

      So there is a chance the "SSH agent plugin" is involved.

      I am ready to do deeper analysis on my system if required, and of course to test patches.
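
      For reference, the same per-thread CPU figures can also be gathered programmatically with the standard ThreadMXBean; a minimal, illustrative sketch (not part of Jenkins itself):

      import java.lang.management.ManagementFactory;
      import java.lang.management.ThreadInfo;
      import java.lang.management.ThreadMXBean;

      // Illustrative only: print per-thread CPU time for the current JVM so the
      // hottest (here the unnamed "Thread-N" SSH transport) threads can be identified.
      public class TopCpuThreads {
          public static void main(String[] args) {
              ThreadMXBean threads = ManagementFactory.getThreadMXBean();
              if (!threads.isThreadCpuTimeSupported()) {
                  System.out.println("Thread CPU time is not supported on this JVM");
                  return;
              }
              for (long id : threads.getAllThreadIds()) {
                  ThreadInfo info = threads.getThreadInfo(id);
                  long cpuNanos = threads.getThreadCpuTime(id);
                  if (info != null && cpuNanos > 0) {
                      System.out.printf("%-40s %8d ms%n", info.getThreadName(), cpuNanos / 1000000L);
                  }
              }
          }
      }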

          [JENKINS-22320] High CPU consumption because of SSH communication

          Yves Martin added a comment -

          I had to switch my Linux slaves to the WebStart (JNLP) agent.
          My master's CPU consumption dropped to only 10% on average (and often less).


          Yves Martin added a comment -

          My guess is that the SSH library is configured with buffers that are too small (as when used as an interactive shell with urgent packets for echo), which may explain why I measure such a high system share of CPU consumption. This hypothesis could be confirmed with network traces showing many small packets, and with TCP statistics compared against an older version of Jenkins that does not suffer from this bug.


          Thomas Herrlin added a comment - edited

          I am currently experimenting with increasing the fill_buffer() buffer from 2 KB to 64 KB, or even replacing it with a BufferedReader. My first naive preliminary test looks good, but it is still too early to say for certain. See trilead-ssh2-64K-buffer.patch.
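
          For illustration only, a minimal sketch of the kind of buffering change being tested here, assuming the raw socket stream can be wrapped before the cipher layer (the class and wiring are hypothetical, not the actual trilead-ssh2-64K-buffer.patch):

          import java.io.BufferedInputStream;
          import java.io.IOException;
          import java.io.InputStream;
          import java.net.Socket;

          // Hypothetical helper, not the real trilead-ssh2 patch: wraps the raw socket
          // stream in a 64 KB BufferedInputStream so the cipher layer issues fewer,
          // larger reads instead of many small socketRead0 calls.
          public class BufferedTransportStream {
              private static final int BUFFER_SIZE = 64 * 1024; // 64 KB instead of 2 KB

              public static InputStream wrap(Socket socket) throws IOException {
                  return new BufferedInputStream(socket.getInputStream(), BUFFER_SIZE);
              }
          }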


          Thomas Herrlin added a comment - edited

          Preliminary tests were actually not so good. I forgot to remove the Java exclusions when using the jvisualvm CPU sampler. The majority of CPU time is still spent within socketRead0 [native], even after my buffer hack.

          I also tried forcing setTcpNoDelay(false), since src/main/java/hudson/plugins/sshslaves/SSHLauncher.java sets connection.setTCPNoDelay(true).
          Setting it to false should reduce the number of packets and thus the number of interrupts.
          TCP_NODELAY should mostly be used for interactive connections, to get fast non-piggybacked ACKs; it is not so good for bulk transfers, as far as I understand. I find commit c2fcc257 in sshslaves to be contradictory to my understanding of how the TCP protocol works, but perhaps there are real-world cases where ACK piggybacking causes problems.
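
          As a plain-socket illustration of what this flag does (using java.net.Socket directly, not the trilead-ssh2 Connection; the endpoint below is a placeholder):

          import java.io.IOException;
          import java.net.Socket;

          // Illustration only: TCP_NODELAY disables Nagle's algorithm. With it enabled,
          // small writes go out immediately as small packets (good for interactive
          // latency); with it disabled, the kernel coalesces small writes into fewer,
          // larger packets, which generally suits bulk transfers better.
          public class TcpNoDelayExample {
              public static void main(String[] args) throws IOException {
                  Socket socket = new Socket("build-slave.example.com", 22); // placeholder endpoint
                  try {
                      socket.setTcpNoDelay(false); // let Nagle coalesce small writes
                      System.out.println("TCP_NODELAY = " + socket.getTcpNoDelay());
                  } finally {
                      socket.close();
                  }
              }
          }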

          I did a simple "ifconfig ; sleep 3 ; ifconfig" while Jenkins was transferring a big, non-compressible mocked artifact; the average RX packet size was about 1100 bytes.

          The SSH channel buffer size seems to be set to 4 MB in the sshslaves plugin, in src/main/java/hudson/plugins/sshslaves/SSHLauncher.java:

              private void expandChannelBufferSize(Session session, TaskListener listener) {
                  // see hudson.remoting.Channel.PIPE_WINDOW_SIZE for the discussion of why 1MB is in the right ball park
                  // but this particular session is where all the master/slave communication will happen, so
                  // it's worth using a bigger buffer to really better utilize bandwidth even when the latency is even larger
                  // (and since we are draining this pipe very rapidly, it's unlikely that we'll actually accumulate this much data)
                  int sz = 4;
                  m.invoke(session, sz*1024*1024);

          I am using trilead-ssh2-build217-jenkins-3, sshslaves 1.6, Jenkins 1.554.1 with the bundled jetty container.


          Looks like I may be chasing a red herring.

          The slowdowns I see may not actually be related to this ticket, even if I get high CPU usage in fill_buffer -> socketRead0.


          Joshua K added a comment -

          I strace'd and it looks like a lot of the traffic is due to repeatedly sending exceptions over the remoting channel:

          write(8, "q\0~\0\5\0\2\16`\0\0\0\0sr\0 java.lang.ClassNotFoundException\177Z\315f>\324 \216\2\0\1L\0\2exq\0~\0\1xq\0~\0\10pt\0\23java.nio.file.Filesuq\0~\0\r\0\0\0\21sq\0~\0\17\0\0\5_t\0\33jenkins.util.AntClassLoadert\0\23AntClassLoader.javat\0\25findClassInComponentssq\0~\0\17\0\0\5-q\0~\0004q\0~\0005t\0\tfindClasssq\0~\0\17", 233) = 233

          This is plausible, as our slaves/master run on Java 1.6 and java.nio.file.Files is new in 1.7.

          I will try to bump our slaves to use JDK 1.7 and see if that helps.
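
          A minimal sketch of the failure mode described above, assuming a Java 6 agent JVM (the class below is only illustrative):

          // Illustrative only: on a Java 6 JVM this lookup fails, and per the strace
          // output above it is the resulting ClassNotFoundException that gets
          // serialized over the remoting channel again and again.
          public class Java7ClassCheck {
              public static void main(String[] args) {
                  System.out.println("java.version = " + System.getProperty("java.version"));
                  try {
                      Class.forName("java.nio.file.Files"); // present only on Java 7+
                      System.out.println("java.nio.file.Files is available on this JVM");
                  } catch (ClassNotFoundException e) {
                      System.out.println("java.nio.file.Files is missing: " + e);
                  }
              }
          }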


          Yves Martin added a comment -

          After upgrading to 1.625.1, SSH slave communication is as efficient as the WebStart agent.
          As a result, I consider this issue fixed.
          Thank you for the work. Yves


          Roy Arnon added a comment -

          Hi,
          Using Jenkins 1.625.3 and the ssh-slaves plugin 1.10, this issue seems to occur intermittently for us. Lately, it happens at least once a day.
          Most of the threads are wasting CPU time in exactly the same way as in the example above.

          I've attached the call tree from YourKit.

          I can provide more data if required.


          Roy Arnon added a comment -

          Sorry, I attached the image incorrectly:


            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Yves Martin (ymartin1040)
            Votes: 3
            Watchers: 10
