
EC2 Plugin: Windows java.io.IOException: Pipe Closed

      We are seeing a java.io.IOException in two situations when using Windows nodes as agents.
      1. When new nodes are being spun up for Windows jobs, it appears that Jenkins will assign a newly spun up node (after establishing a connection) to service the build, but the connection gets terminated before the build even begins.

      Started by user some_user
      Running as SYSTEM
      Building remotely on EC2 (amazon-ec2) - windows-label (i-0de1740c2107602b3) (windows-label) in workspace D:\dev\jenkins\workspace\WindowsStressTests\SimpleWindowsBuild11
      FATAL: java.io.IOException: Pipe is already closed
      hudson.remoting.FastPipedInputStream$ClosedBy: The pipe was closed at...
      	at hudson.remoting.FastPipedOutputStream.error(FastPipedOutputStream.java:100)
      	at hudson.remoting.FastPipedOutputStream.close(FastPipedOutputStream.java:90)
      	at hudson.plugins.ec2.util.Closeables.closeQuietly(Closeables.java:23)
      	at hudson.plugins.ec2.win.winrm.WindowsProcess$2.run(WindowsProcess.java:146)
      Caused: java.io.IOException: Pipe is already closed
      	at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:154)
      	at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:138)
      	at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:89)
      	at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:85)
      	at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:54)
      	at java.io.OutputStream.write(OutputStream.java:75)
      	at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
      	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
      	at hudson.remoting.Channel.send(Channel.java:721)
      	at hudson.remoting.Channel.close(Channel.java:1436)
      Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to EC2 (amazon-ec2) - windows-label (i-0de1740c2107602b3)
      		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
      		at hudson.remoting.Request.call(Request.java:202)
      		at hudson.remoting.Channel.call(Channel.java:954)
      		at hudson.FilePath.act(FilePath.java:1069)
      		at hudson.FilePath.act(FilePath.java:1058)
      		at hudson.FilePath.mkdirs(FilePath.java:1243)
      		at hudson.model.AbstractProject.checkout(AbstractProject.java:1199)
      		at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
      		at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
      		at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
      		at hudson.model.Run.execute(Run.java:1853)
      		at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
      		at hudson.model.ResourceController.execute(ResourceController.java:97)
      		at hudson.model.Executor.run(Executor.java:427)
      Caused: hudson.remoting.RequestAbortedException
      	at hudson.remoting.Request.abort(Request.java:340)
      	at hudson.remoting.Channel.terminate(Channel.java:1038)
      	at hudson.remoting.Channel.close(Channel.java:1444)
      	at hudson.remoting.Channel.close(Channel.java:1403)
      	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:843)
      	at hudson.slaves.SlaveComputer.access$100(SlaveComputer.java:108)
      	at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:734)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Finished: FAILURE
      

      Afterwards, the node is left in an offline state with the message "Connection was broken" until we manually reconnect it (via the "Launch Agent" button in the Jenkins UI). We see this problem consistently.

      Log from the node:

      EC2 (amazon-ec2) - windows-label(i-0531ef3f07b2cdaee) booted at 1582839470000
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Waiting for password to be available. Sleeping 10s.
      Connecting to (10.93.145.196) with WinRM as david_webb6
      WinRM service responded. Waiting for WinRM service to stabilize on EC2 (amazon-ec2) - windows-label (i-0531ef3f07b2cdaee)
      WinRM should now be ok on EC2 (amazon-ec2) - windows-label (i-0531ef3f07b2cdaee)
      Connected with WinRM.
      Creating tmp directory if it does not exist
      remoting.jar sent remotely. Bootstrapping it
      Launching via WinRM:java  -jar C:\Windows\Temp\remoting.jar -workDir D:\dev\jenkins
      <===[JENKINS REMOTING CAPACITY]===>Remoting version: 3.36.1
      This is a Windows agent
      ERROR: Failed to monitor for Free Swap Space
      java.util.concurrent.TimeoutException
      	at hudson.remoting.Request$1.get(Request.java:316)
      	at hudson.remoting.Request$1.get(Request.java:240)
      	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:114)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:78)
      	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:306)
      ERROR: Failed to monitor for Free Disk Space
      java.util.concurrent.TimeoutException
      	at hudson.remoting.Request$1.get(Request.java:316)
      	at hudson.remoting.Request$1.get(Request.java:240)
      	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:114)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:78)
      	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:306)
      ERROR: Failed to monitor for Free Temp Space
      java.util.concurrent.TimeoutException
      	at hudson.remoting.Request$1.get(Request.java:316)
      	at hudson.remoting.Request$1.get(Request.java:240)
      	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:114)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:78)
      	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:306)
      Agent successfully connected and online
      ERROR: Connection terminated
      hudson.remoting.FastPipedInputStream$ClosedBy: The pipe was closed at...
      	at hudson.remoting.FastPipedOutputStream.error(FastPipedOutputStream.java:100)
      	at hudson.remoting.FastPipedOutputStream.close(FastPipedOutputStream.java:90)
      	at hudson.plugins.ec2.util.Closeables.closeQuietly(Closeables.java:23)
      	at hudson.plugins.ec2.win.winrm.WindowsProcess$2.run(WindowsProcess.java:146)
      Caused: java.io.IOException: Pipe is already closed
      	at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:154)
      	at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:138)
      	at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:89)
      	at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:85)
      	at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:54)
      	at java.io.OutputStream.write(OutputStream.java:75)
      	at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
      	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
      	at hudson.remoting.Channel.send(Channel.java:721)
      	at hudson.remoting.Channel.close(Channel.java:1436)
      	at hudson.remoting.Channel.close(Channel.java:1403)
      	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:843)
      	at hudson.slaves.SlaveComputer.access$100(SlaveComputer.java:108)
      	at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:734)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      ERROR: Connection terminated
      java.io.EOFException
      	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2736)
      	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3211)
      	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:896)
      	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
      	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
      	at hudson.remoting.Command.readFrom(Command.java:140)
      	at hudson.remoting.Command.readFrom(Command.java:126)
      	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
      Caused: java.io.IOException: Unexpected termination of the channel
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
      

      2. In the middle of a build, the node goes into a disconnected state and we see the same error. This occurrence is less frequent and we're unable to recreate it consistently. It may be a different issue altogether, but I wanted to bring it to your attention in case the information helps in solving our problem.

       

          [JENKINS-61314] EC2 Plugin: Windows java.io.IOException: Pipe Closed

          Ramon Leon added a comment -

          For some reason, the connection between Jenkins and the instance was closed and the build cannot complete. It looks like the plugin cannot get the password. See the documentation on how to get the password: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_GetPasswordData.html

          The EC2 plugin uses this method. You can try requesting it yourself with the AWS CLI or any other way.

          If you can give us reproduction steps in a clean environment, that would be a great help in solving the issue.
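
A rough sketch of requesting the password by hand, for anyone who wants to try the suggestion above. It builds the `aws ec2 get-password-data` command line and polls until a non-empty password comes back, similar in spirit to the "Waiting for password to be available. Sleeping 10s." loop in the node log. The function names, the retry count, and the assumption that an empty `PasswordData` prints as an empty line in text output are illustrative and not taken from the plugin.

```python
import subprocess
import time
from typing import Callable, List, Optional


def build_get_password_command(instance_id: str, key_path: Optional[str] = None) -> List[str]:
    """Build the argv for `aws ec2 get-password-data` (hypothetical helper)."""
    argv = ["aws", "ec2", "get-password-data", "--instance-id", instance_id]
    if key_path:
        # With --priv-launch-key the CLI decrypts the password locally
        # using the private key the instance was launched with.
        argv += ["--priv-launch-key", key_path]
    # Print only the PasswordData field, as plain text.
    argv += ["--query", "PasswordData", "--output", "text"]
    return argv


def fetch_password_via_cli(instance_id: str, key_path: Optional[str] = None) -> str:
    """Run the AWS CLI once; assumes `aws` is installed and has credentials."""
    result = subprocess.run(
        build_get_password_command(instance_id, key_path),
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: an empty result means the password is not available yet.
    return result.stdout.strip()


def wait_for_password(
    fetch: Callable[[], str],
    interval_s: float = 10.0,
    max_attempts: int = 90,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it returns a non-empty string or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        password = fetch()
        if password:
            return password
        if attempt < max_attempts:
            sleep(interval_s)
    return None
```

For example, `wait_for_password(lambda: fetch_password_via_cli("i-0531ef3f07b2cdaee", "launch-key.pem"))` would poll the instance from the node log above; the key file name is a placeholder. If this returns a password promptly while the plugin is still waiting, that would point at the plugin rather than at EC2.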


          Bruno Esteves added a comment -

          I'm having the same problem: the connection is lost mid-build and the build fails.


          Joseph LaFreniere added a comment - - edited

          I encountered this exact issue. Was able to consistently reproduce by spinning up a Windows agent via the plugin's UI then triggering a job to run on the resulting agent. Within a few seconds (prior to the job actually executing any of my business logic) the agent would fail with the "pipeline closed" error and after a few minutes of retrying the job would also fail.

          I found a workaround of configuring my Windows cloud to connect via SSH instead of WinRM. Starting with Windows 10 build 1809 and Windows Server 2019, there is an officially supported, native OpenSSH daemon for Windows. See https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_install_firstuse?source=recommendations&tabs=powershell for details. The full workaround that is currently working for me:

          1. Build a Windows AMI based on Server 2019. As part of that AMI:
            1. Install OpenSSH and configure the `sshd` and `ssh-agent` services as described in the above link.
            2. Configure PowerShell as the default SSH shell: `New-ItemProperty -Path "HKLM:\SOFTWARE\OpenSSH" -Name DefaultShell -Value (Get-Command powershell.exe).Path -PropertyType String -Force`
          2. When launching an EC2 instance using the new AMI:
            1. Provide a valid SSH keypair (`sshKeysCredentialsId` in this plugin's configuration)
            2. Select IMDSv2 (`metadataTokensRequired: true` in this plugin's configuration)
            3. Provide the attached PowerShell file as the userdata script (`userData` in this plugin's configuration)
          3. Connect to the new instance via SSH. In this plugin's configuration:
            1. amiType: `{"unixData": {sshPort: "22"}}`
            2. connectBySSHProcess: `true`

          example-ssh-userdata.ps1
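
When trying the SSH workaround above, it can help to confirm that `sshd` on the new instance is actually accepting connections on port 22 before Jenkins attempts to launch the agent. The following is a minimal sketch using only the Python standard library; the function names, retry count, and interval are assumptions for illustration and are not part of the plugin.

```python
import socket
import time
from typing import Callable


def ssh_port_open(host: str, port: int = 22, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def wait_for_ssh(
    host: str,
    port: int = 22,
    attempts: int = 30,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the port accepts a connection or attempts run out."""
    for attempt in range(attempts):
        if ssh_port_open(host, port):
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False
```

This only checks that something is listening on the port. It does not verify the key pair or that PowerShell is the default shell, so a successful check is a necessary condition for the workaround, not proof that it is configured correctly.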


          Christian Bongiorno added a comment -

          OK, so it's been forever and a day since I worked on the project with this issue, but we did solve it. At this point I can only describe the fix, not give specifics, because I no longer have access to that code.

          In a nutshell, the command the Windows agent uses to start the JNLP process has a very low default memory allocation limit. You either have to build an image that raises this limit or raise it as part of the init script. The process was so esoteric that I simply can't remember the details. But that was the fix. Good luck.

          Allan BURDAJEWICZ added a comment - - edited

          chb0jenkins, was it about setting `winrm set winrm/config/winrs '@{MaxMemoryPerShellMB="10240"}'` as described at https://github.com/jenkinsci/ec2-plugin/blob/6d7bfc58ba1d1f479385ded27f78e78a33f1c84d/src/main/resources/hudson/plugins/ec2/SlaveTemplate/help-amiType.html#L24 ? Or is it something else?


          Christian Bongiorno added a comment -

          Yes, I believe that's it.
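
For anyone scripting the setting confirmed above into an AMI build or init script, here is a small sketch that generates the `winrm set` command line for a chosen limit. The helper name is hypothetical, and the quoting simply reproduces the form quoted in the comment above; 10240 is the value from that comment, not a recommendation.

```python
def winrs_memory_command(limit_mb: int) -> str:
    """Return the winrm command that sets MaxMemoryPerShellMB (hypothetical helper)."""
    if limit_mb <= 0:
        raise ValueError("limit_mb must be a positive number of megabytes")
    # Same quoting as the command quoted in the comment thread.
    return "winrm set winrm/config/winrs '@{MaxMemoryPerShellMB=\"%d\"}'" % limit_mb


if __name__ == "__main__":
    # Print the command so it can be pasted into an init script.
    print(winrs_memory_command(10240))
```

The generated line would need to run on the Windows instance itself, for example from the user data script or while baking the image, before the agent is launched over WinRM.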

            thoulen FABRIZIO MANFREDI
            pyieh Pierson Yieh