I have a pipeline project that should run on an EC2 instance node.

      I have configured an EC2 connection, and EC2 t3.medium Windows 10 instances are started automatically. This all works fine.

      However, the first build on an EC2 instance always performs very badly (it is slow). The next build on the same instance (without a reboot etc.) is much faster.

       

      @Library('BMS-Libraries')
      import static bms.mail.Email.*
      import static bms.nexus.Nexus.*
      import static bms.utils.Utils.*

      node('AWS_VS2017') {
          stage('Cleanup Build Machine') {
              // Delete the current workspace directory
              deleteDir()
          }

          stage('Preparing Build machine...') {
              retrieveAndExtractBuildTools(this)
          }

          // Do some more ...
      }
      

      I attached a screenshot of the runtime of the different pipeline steps.

       

      I connected to the instance via RDP during the first build, and Task Manager didn't show high CPU or memory consumption.
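
      One way to put numbers on each step, rather than relying on a screenshot, is to wrap every stage in a small timing helper. The following is only a sketch: `timedStage` is a hypothetical helper that is not part of the original pipeline, and it assumes that `System.currentTimeMillis()` is permitted by the script sandbox on the installation in question.

```groovy
@Library('BMS-Libraries')
import static bms.mail.Email.*
import static bms.nexus.Nexus.*
import static bms.utils.Utils.*

// Hypothetical helper (not in the original pipeline): runs a stage body
// and logs how long it took, even if the body throws.
def timedStage(String name, Closure body) {
    stage(name) {
        long start = System.currentTimeMillis()
        try {
            body()
        } finally {
            echo "${name} took ${(System.currentTimeMillis() - start) / 1000} s"
        }
    }
}

node('AWS_VS2017') {
    timedStage('Cleanup Build Machine') {
        deleteDir()
    }
    timedStage('Preparing Build machine...') {
        // Same call as in the pipeline above; it comes from the shared library.
        retrieveAndExtractBuildTools(this)
    }
}
```

      Comparing the logged durations of the first build on a fresh instance with those of the next build on the same instance shows which stage absorbs the delay.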

          [JENKINS-62158] Bad performance on EC2 instance for first build

          Ed Thorne added a comment -

          I don't know that this is limited to the EC2 plugin. I'm seeing a similar issue with a simple Linux JNLP agent. The first job that runs on the agent takes considerably longer than it normally should. Here's an image that shows my results.

          Builds 31 and 36 are after the agent has been rebooted. Each step is doing essentially the same operations:

          • sh 'env'
          • sh w/simple multi-line command (pwd, ls -al, for loop with print/sleep)
          • writeFile the multi-line command to disk to be used as input for sshScript
          • sshScript to a remote instance and execute the same multi-line command

          The main difference is that the first two steps run on the master node while the third runs on a remote JNLP agent.

          For builds 31 and 36 the execution timings show that it takes almost 20 seconds for a 'sh' step to be loaded and started. The 'sshScript' that follows takes about three minutes from the end of the prior 'sh' step completing until output is logged. Under normal circumstances these operations take about two seconds or less to log some form of activity.

          Observing the output of 'top' and checking CloudWatch metrics for the instance I don't see high resource usage or anything that would explain why this first job after reboot is suffering from such horrible performance. 
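
          The four steps listed above could be reproduced with a scripted pipeline along these lines. This is a sketch under assumptions: the node labels, the remote host, the user, and the key path are placeholders, and the split of steps between the master and the agent is only one reading of the description. `sshScript` comes from the SSH Pipeline Steps plugin.

```groovy
// Multi-line command used by both the local 'sh' step and the remote sshScript step.
// Triple single quotes keep Groovy from interpolating $i.
def script = '''\
pwd
ls -al
for i in 1 2 3; do
  echo "iteration $i"
  sleep 1
done
'''

node('master') {                  // label is an assumption
    sh 'env'
    sh script
}

node('jnlp-agent') {              // hypothetical label for the remote JNLP agent
    // Hypothetical connection details for the SSH target.
    def remote = [
        name: 'target',
        host: 'target.example.com',
        user: 'builder',
        identityFile: '/home/jenkins/.ssh/id_rsa',
        allowAnyHosts: true       // skips host key checking; for a test setup only
    ]
    writeFile file: 'steps.sh', text: script
    sshScript remote: remote, script: 'steps.sh'
}
```

          Running this once right after an agent reboot and once again afterwards should show whether the delay follows the agent or the steps themselves.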


          Ed Thorne added a comment -

          I forgot to mention. This is Jenkins 2.234 with Pipeline 2.6 and SSH Pipeline Steps 2.0.0.


          James Green added a comment -

          I'm not sure we are seeing the same bug, but recently (last couple of weeks) our ec2 builds are taking a lot longer too. Always the first build of an ec2 instance, never subsequent builds.

          The big change is upgrading this plugin. According to the agent logs (accessible from the Jenkins web console), the Jenkins master is now awaiting the EC2 instance console output to print the ssh fingerprints to verify the expected keys ahead of connecting. This is acknowledged to take potentially minutes to wait on.

          We'd love to know if there is a workaround for this but we're not familiar with the authentication system in use.

          One way or another, I'm being approached by staff members using Jenkins complaining that this is now far too slow. I'm open to suggestions.


          Ramon Leon added a comment -

          The first time Jenkins builds a job on an EC2 instance, it goes through a process that doesn't happen on subsequent connections:

          • AWS has to create the instance
          • the instance initializes
          • Jenkins creates an init script
          • Jenkins installs the JVM
          • Jenkins installs the OpenSSH clients
          • Jenkins copies the remote client library
          • Jenkins launches the client on the instance

          None of these steps are repeated on subsequent builds.

          In the latest releases of the EC2 plugin we've included a new security step to avoid MitM attacks. This step waits for the console output of the instance (Linux ones) to be ready, and the plugin reads the SSH key to guarantee that the machine it is connecting to is the expected one. This step adds some more time to the initial setup. How much depends on the time it takes for the console to be ready, but it is usually around 5 minutes.

          You can avoid this new gap by lowering the security level to Accept New or Off. None of these security strategies wait for the console to be ready, but they have some security implications. We've provided a wide range of strategies to allow every administrator to decide which one best fits her/his environment. All is documented in the Plugin documentation: https://github.com/jenkinsci/ec2-plugin/#security
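
          To see which strategy each agent template currently uses before changing anything, a read-only snippet for the Jenkins script console might look like the sketch below. It assumes a plugin version that already includes the security step described above, and that the `EC2Cloud` and `SlaveTemplate` classes expose `templates`, `description`, and `hostKeyVerificationStrategy` as they do in the plugin source at the time of writing. Treat those names as assumptions to check against the installed version.

```groovy
import hudson.plugins.ec2.EC2Cloud
import jenkins.model.Jenkins

// Read-only: print the host key verification strategy of every EC2 agent template.
Jenkins.get().clouds.findAll { it instanceof EC2Cloud }.each { cloud ->
    cloud.templates.each { template ->
        println "${cloud.name} / ${template.description}: ${template.hostKeyVerificationStrategy}"
    }
}
```

          The strategy itself is changed per template in the cloud configuration, as described in the linked plugin documentation.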


          Daniel Hoerner added a comment - - edited

          mramonleon the issue is not the startup of the AWS instance. The build is slow after the instance has started (see screenshot 2020-05-04%2016_31_11-Window.jpg). The first step already runs on the AWS slave instance; it is a little slower, but that is OK. The second step (Preparing Build machine), however, is much slower, and by that time all the steps you described have already been performed.

          edward reesman added a comment - - edited

          dhoerner edthorne by any chance did you discover a root cause/workaround for this issue? I am almost certain that I am facing this behavior using the ec2-plugin. I have a fairly resource intensive pipeline that leverages various stages comprised of docker commands (build/run etc). Using the same exact ec2 instance types that I use in our non-ec2-plugin setup, I am getting severely degraded performance (pipeline takes twice as long to run when dynamically launching the node each time). I apologize if this wall of text is a bit hard to grok, but anyhow, curious if there has been any forward progress on this issue.

          P.S.

          When I connect to the instance during these "dead periods" where it seems nothing is happening (no logs being written back to jenkins stdout) - the machine is almost idle, no increased memory usage/cpu load. So I have no idea what is pegging the jenkins job from proceeding more quickly. I have tested with both EBS-optimized on and off and it is the same in both cases.


          Ed Thorne added a comment -

          ereesmanar I simply resorted to having the master perform all the tasks. In my case the ec2-plugin isn't even in play.

          My case was a simple JNLP linux remote agent. It's like pipeline or some other part of Jenkins is having to cold start and load a whole bunch of stuff. Like the other posts, there doesn't appear to be a resource utilization problem on the instance while this is happening. After paying that initial penalty, things would perform normally. But until that penalty was paid, cold start performance was so unacceptable that my boss said to ditch the agent.


          Hung Vo added a comment -

          Hello,

          I saw the same behavior: a new agent, or a machine that has just been restarted, has a delay of 4 - 5 minutes before it actually does any work. So I did some digging. Here is what I can share.

          The order: test when the node was created (1002), when it was already warm (1003), then restart and test again (1004).

          Here is some noticeable activity seen via Process Monitor:

          Doing work with the cache

          Still working with the cache and the JRE

          Yup, still working

          Then it is actually doing something

          Two things I can think of: looking for a way to move .cache to instance store for better I/O, and whitelisting the folder in Windows Defender.

          The Linux node starts much faster for me, though.

          Daniel Hoerner added a comment -

          hungvotrung did adding the jenkins\cache folder to the Windows Defender whitelist improve performance for you?

          Hung Vo added a comment -

          Unfortunately, it doesn't help at all. There is still a 4 - 5 min delay.

          Hung Vo added a comment -

          I'm not sure if it is related, but there seems to be a connection issue between master and agent. I ran netstat when there were 6 Windows agents, and the master had a total of 1744 connections open. I'm not sure how it works behind the scenes, as I haven't dug further, but the total seems too high in my opinion.

          Raihaan Shouhell added a comment -

          Are you using WinRM by any chance?

          Daniel Hoerner added a comment -

          The EC2 plugin always uses WinRM for Windows slaves on AWS.

          Raihaan Shouhell added a comment -

          By default yes, but there is a way to use SSH.

          Have you tried the latest version of the plugin by any chance?

          Daniel Hoerner added a comment -

          I just tried the latest version (1.54 of the plugin) with Jenkins 2.249.3 and nothing has changed.

          I also have some issues with a broken connection during the build, with the following output:

          Node Log:

          Agent successfully connected and online
          ERROR: Connection terminated
          java.io.EOFException
          	at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2763)
          	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3258)
          	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:873)
          	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:350)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
          	at hudson.remoting.Command.readFrom(Command.java:142)
          	at hudson.remoting.Command.readFrom(Command.java:128)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          Caused: java.io.IOException: Unexpected termination of the channel
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)

          Build Log:

          EC2 (AWS Cloud) - AWS Windows 10 with VS 2017 (i-xxxxxxxxxxxxxxxxxxxxxxxx) was marked offline: Connection was broken: java.io.EOFException
          EC2 (AWS Cloud) - AWS Windows 10 with VS 2017 (i-xxxxxxxxxxxxxxxxxxxxxxxx) was marked offline: Connection was broken: java.io.EOFException
          	at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2763)
          	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3258)
          	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:873)
          	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:350)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
          	at hudson.remoting.Command.readFrom(Command.java:142)
          	at hudson.remoting.Command.readFrom(Command.java:128)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          Caused: java.io.IOException: Unexpected termination of the channel
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
          [Pipeline] }
          [Pipeline] // node
          [Pipeline] End of Pipeline
          java.io.IOException: Unable to create live FilePath for EC2 (AWS Cloud) - AWS Windows 10 with VS 2017 (i-xxxxxxxxxxxxxxxxxxxxxxxxxx)
          	at org.jenkinsci.plugins.workflow.support.steps.FilePathDynamicContext.get(FilePathDynamicContext.java:64)
          	at org.jenkinsci.plugins.workflow.support.steps.FilePathDynamicContext.get(FilePathDynamicContext.java:47)
          	at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:94)
          	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:138)
          	at org.jenkinsci.plugins.workflow.cps.CpsThread.getContextVariable(CpsThread.java:135)
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:297)
          	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:74)
          	at hudson.plugins.emailext.EmailExtStep$EmailExtStepExecution.run(EmailExtStep.java:231)
          	at hudson.plugins.emailext.EmailExtStep$EmailExtStepExecution.run(EmailExtStep.java:174)
          	at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1$1.call(AbstractSynchronousNonBlockingStepExecution.java:47)
          	at hudson.security.ACL.impersonate(ACL.java:367)
          	at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1.run(AbstractSynchronousNonBlockingStepExecution.java:44)
          	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          	at java.base/java.lang.Thread.run(Thread.java:834)
          Finished: FAILURE

          Steven Foster added a comment -

          raihaan

          > By default yes, there is a way to use ssh.

          Can you elaborate on this? Looking at the code, I only see a WinRM connection method and an SSH connection method that relies on Unix tools.

          Daniel H added a comment -

          raihaan can you please explain how to use SSH with a Windows agent? The performance issues still exist with the latest version of the plugin!

          Yoshihiro added a comment - - edited

          I am currently having similar problems.
          Does anyone have any new information on this?

          Only at first, file access seems unusually slow.
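
          A simple way to check whether the first read of a file is slower than later reads is to time the same read twice on a fresh agent. The sketch below is only an illustration: the probe path is a placeholder for any large file already present on the instance, and each `bat` step adds its own launch overhead, so only large differences are meaningful.

```groovy
node('AWS_VS2017') {
    // Hypothetical probe file; replace with a large file that exists on the agent.
    def probe = 'C:\\BuildTools\\large-file.bin'
    for (int i = 1; i <= 2; i++) {
        long start = System.currentTimeMillis()
        // Copying to NUL reads the whole file without writing it anywhere.
        bat "copy /b \"${probe}\" NUL"
        echo "read ${i} took ${System.currentTimeMillis() - start} ms"
    }
}
```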


            mramonleon Ramon Leon
            dhoerner Daniel Hoerner