• Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: docker-plugin
    • Labels: None
    • Environment: Jenkins LTS 2.249.2
      Docker Plugin 1.2.1

      I'm not sure what has triggered this behavior. We've been using the Docker plugin to spin up agents for about 6 months now, and it's worked pretty much flawlessly to this point. We've just recently started seeing this strange behavior where Jenkins will not spin up new agents while jobs are waiting in the queue. The queued jobs just sit there forever. Eventually we get to a point where there are no agents running but multiple jobs queued up.

      We have 11 cloud instances. Each instance has multiple templates associated with it. These cloud instances are all pretty much identical: they serve the same templates and labels. The agents connect via SSH.
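
      For reference, this is the kind of script console snippet I use to double-check that inventory. It is only a rough diagnostic sketch in Java-style syntax (the Groovy console accepts it), not anything from the plugin itself:

      import hudson.model.Node;
      import hudson.slaves.Cloud;
      import jenkins.model.Jenkins;

      Jenkins jenkins = Jenkins.get();

      // List every cloud the controller knows about (all 11 instances should show up here).
      for (Cloud cloud : jenkins.clouds) {
          System.out.println("cloud: " + cloud.name + " (" + cloud.getClass().getSimpleName() + ")");
      }

      // List the agents that currently exist and the labels they expose.
      for (Node node : jenkins.getNodes()) {
          System.out.println("node: " + node.getNodeName() + " labels: " + node.getLabelString());
      }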

      The only way I can get things working again is to restart the service. Once the service comes back up, Jenkins starts servicing the job requests again.

      The only thing I can see in my logs is this entry.

      09-Nov-2020 11:51:56.879 SEVERE [dockerjava-netty-3426-1] com.github.dockerjava.core.async.ResultCallbackTemplate.onError Error during callback
      com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: a96167b9016d2624870640933da629f429ad736a473d690ee70fac3cd97bf211"}

      at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:103)
      at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
      at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
      at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
      at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
      at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1432)
      at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1199)
      at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1243)
      at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
      at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
      at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
      at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
      at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:648)
      at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:583)
      at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:500)
      at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:462)
      at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
      at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      at java.lang.Thread.run(Thread.java:748)

       

      If I search for that container hash in the logs, I see the following prior to the error above. So it looks like the container is created, does its job, and is removed. About 8 minutes after removal, we get the error above about it being missing.

      09-Nov-2020 11:36:22.228 INFO Computer.threadPoolForRemoting [#51286] com.nirima.jenkins.plugins.docker.DockerTemplate.doProvisionNode Started container ID a96167b9016d2624870640933da629f429ad736a473d690ee70fac3cd97bf211 for node DK_COSCOMMON7_D03-00063jpilpdzc from image: pmsplb-cos-tools.dev.datacard.com:8600/centos7_common:latest

      09-Nov-2020 11:43:47.104 INFO Computer.threadPoolForRemoting [#50769] io.jenkins.docker.DockerTransientNode$1.println Stopped container 'a96167b9016d2624870640933da629f429ad736a473d690ee70fac3cd97bf211' for node 'DK_COSCOMMON7_D03-00063jpilpdzc'.
      09-Nov-2020 11:43:48.538 INFO Computer.threadPoolForRemoting [#50769] io.jenkins.docker.DockerTransientNode$1.println Removed container 'a96167b9016d2624870640933da629f429ad736a473d690ee70fac3cd97bf211' for node 'DK_COSCOMMON7_D03-00063jpilpdzc'.

       

      I'm not sure where else to look at this point.  
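
      In the meantime, the check I keep running from the script console while jobs are stuck is a dump of the queue itself. A rough Java-style sketch (the Groovy console accepts this syntax) that prints each queued item, the label it wants, and Jenkins' own reason for holding it:

      import hudson.model.Queue;
      import jenkins.model.Jenkins;

      // Dump everything currently sitting in the build queue and why Jenkins says it is waiting.
      for (Queue.Item item : Jenkins.get().getQueue().getItems()) {
          System.out.println("queued: " + item.task.getFullDisplayName());
          System.out.println("  wants label: " + item.task.getAssignedLabel());
          System.out.println("  why: " + item.getWhy());
          System.out.println("  stuck: " + item.isStuck());
      }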

          [JENKINS-64179] Docker Cloud Stops Creating New Containers

          Matt Wilson added a comment -

          I think I'm seeing a pattern in when I'm getting failures.

          It seems like Jenkins will try to queue a job onto an existing Docker agent that is running a different build.

          In my environment I have several clouds defined.

          I have a node label called "CENTOS7" defined on several clouds, and it has a unique node name in each cloud: "CENTOS7_01", "CENTOS7_02", etc.

          What I'm seeing is this:

          When I trigger a job for "CENTOS7", it will sometimes attempt to attach to an existing running container for that label; i.e. I will see that my job is waiting for "CENTOS7_01_adsafwer" while I have several other clouds that are free to run this job. (Note this doesn't always happen; it normally spins up a new container.)

          When the other job completes, "CENTOS7_01_adsafwer" is deleted as expected.

          My triggered job in the queue reverts to saying it is waiting for the next "CENTOS7" to become available.

          At this point it will sometimes start a new container for this job; other times it will just sit there forever.

          If it gets stuck and I delete the waiting job in the queue and start another one, it will often then happily start a new container and run the job.

          The bigger problem here is that this seems to spread, and eventually it stops starting containers for all jobs.

          At this point I can't figure out what is causing this.
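
          While it's stuck I've also been checking what Jenkins thinks can serve the shared label, roughly with this from the script console. Java-style syntax again; I'm assuming Label.getClouds() and getIdleExecutors() do what their names suggest, so treat it as a sketch, with "CENTOS7" swapped for whatever label the queued job wants:

          import hudson.model.Label;
          import hudson.model.Node;
          import hudson.slaves.Cloud;
          import jenkins.model.Jenkins;

          // "CENTOS7" is the shared label in my setup; substitute the label the stuck job asks for.
          Label label = Jenkins.get().getLabel("CENTOS7");

          // Clouds Jenkins believes are able to provision this label.
          for (Cloud cloud : label.getClouds()) {
              System.out.println("cloud able to provision " + label + ": " + cloud.name);
          }

          // Nodes currently carrying the label (should be empty once containers are removed).
          for (Node node : label.getNodes()) {
              System.out.println("existing node with " + label + ": " + node.getNodeName());
          }

          System.out.println("idle executors for " + label + ": " + label.getIdleExecutors());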


          Matt Wilson added a comment -

          I've noticed this pattern in my debugging:

          1. Run a job; the job completes fine in a container called DK_COSCOMMON7_D15-0000p4hq2f6yg.
          2. The logs indicate there was some kind of socket I/O error, but that the container has been removed:
            I/O error in channel DK_COSCOMMON7_D15-0000p4hq2f6yg
            java.net.SocketException: Socket closed
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
            at java.net.SocketInputStream.read(SocketInputStream.java:171)
            at java.net.SocketInputStream.read(SocketInputStream.java:141)
            at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
            at sun.security.ssl.InputRecord.read(InputRecord.java:503)
            at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
            at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
            at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
            at io.jenkins.docker.client.DockerMultiplexedInputStream.readInternal(DockerMultiplexedInputStream.java:49)
            at io.jenkins.docker.client.DockerMultiplexedInputStream.read(DockerMultiplexedInputStream.java:31)
            at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:92)
            at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
            at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
            at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
            at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
            at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
            Removed Node for node 'DK_COSCOMMON7_D15-0000p4hq2f6yg'.
            Dec 12, 2020 7:39:35 AM INFO io.jenkins.docker.DockerTransientNode$1 println Stopped container '2f42050429e1cb26c2bb4067c5cc031cc08bdff118fe70bf73624f3758fa17d2' for node 'DK_COSCOMMON7_D15-0000p4hq2f6yg'.
            Dec 12, 2020 7:39:35 AM INFO io.jenkins.docker.DockerTransientNode$1 println Removed container '2f42050429e1cb26c2bb4067c5cc031cc08bdff118fe70bf73624f3758fa17d2' for node 'DK_COSCOMMON7_D15-0000p4hq2f6yg'.
          3. When I try to start a new job, the GUI shows that it tries to reuse "DK_COSCOMMON7_D15-0000p4hq2f6yg", but then flips back to the label assigned to that template, DK_COSCOMMON7. The logs show this:

          Dec 12, 2020 7:40:12 AM SEVERE com.github.dockerjava.core.async.ResultCallbackTemplate onError Error during callback
          com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 2f42050429e1cb26c2bb4067c5cc031cc08bdff118fe70bf73624f3758fa17d2"}
          (followed by the same com.github.dockerjava.netty / io.netty stack trace quoted in the description above)
          Dec 12, 2020 7:40:12 AM INFO hudson.slaves.NodeProvisioner lambda$update$6 Image of pmsplb-cos-tools.dev.datacard.com:8600/centos7_common:latest provisioning successfully completed. We have now 113 computer(s)
          Dec 12, 2020 7:40:12 AM INFO io.jenkins.docker.DockerTransientNode$1 println Disconnected computer for node 'DK_COSCOMMON7_D15-0000p4hq2f6yg'.
          Dec 12, 2020 7:40:13 AM INFO io.jenkins.docker.DockerTransientNode$1 println Removed Node for node 'DK_COSCOMMON7_D15-0000p4hq2f6yg'.

          4. At this point my job will sit in the queue forever; it will never start unless I cancel it and then start another one. When I restart it, the cycle continues: job runs, second run doesn't run, cancel, next run goes.

           

          This one really drives me crazy. I have no idea at this point what is going on, or what is causing that socket error. I'm still trying to figure that part out.

           

          The above was using the "attach" container connection method. I've also tried with "ssh", which we normally use, and the pattern is identical, except that the SSH agents don't generate the socket connection error.

          The nodes folder on my Jenkins server does not list the node name post-build, i.e. that is being cleaned up. A console call to list the connected agents does not return that machine either. I have no idea where Jenkins is retaining this information.
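
          For completeness, this is roughly the console call I mean; it comes back without the removed node (Java-style sketch for the Groovy console):

          import hudson.model.Computer;
          import hudson.model.Node;
          import jenkins.model.Jenkins;

          // Nodes as Jenkins currently knows them; the removed container's node is not in this list.
          for (Node node : Jenkins.get().getNodes()) {
              System.out.println("node: " + node.getNodeName());
          }

          // Computers, including offline ones, with their offline cause if any.
          for (Computer computer : Jenkins.get().getComputers()) {
              System.out.println("computer: " + computer.getName()
                      + " offline=" + computer.isOffline()
                      + " cause=" + computer.getOfflineCause());
          }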

          This is with Jenkins version 2.249.3, although I'm about to upgrade to the new LTS version 2.263.1.


          Gregor Tudan added a comment -

          An observation that I made: the problem seems to occur pretty reliably if Jenkins fails to start a build container at some point (timeout, RAM over-usage on the Docker server, ...). It will then oftentimes stop starting new containers for any of the other container templates of the cloud.


          Matt Wilson added a comment -

          I mostly stopped using this plugin a few years ago due to the problem I described above, but I've recently been experimenting with using it again. This morning, after about a month of no problems, I ran into the same issue again. I did notice one thing, though, that I don't think I mentioned initially.

          I have multiple clouds.

          My clouds have multiple images that share the same label.

          The problem I describe only seems to occur when the label is shared across multiple cloud instances. If it's a unique label name, it works.

          Example:

          Clouds 1, 2, 3, and 4 all have an image label called "COMMON" (they all spin up the exact same image). They also all have a unique label: CLOUD1, CLOUD2, CLOUD3, CLOUD4. This unique label could point to a different image, or it could just be a secondary label for "COMMON".

          When I get into this state and run a job using the label "COMMON", nothing happens; no agent gets spun up. If I run a job that uses a unique label, i.e. "CLOUD1", Jenkins spins up a box.

          I just wish there were some way I could "reset" the plugin without having to restart the entire Jenkins service.
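
          The closest thing to a "reset" I've tried so far is cleaning up by hand from the script console and nudging the queue, along these lines. This is only a sketch of the idea (it assumes any leftover io.jenkins.docker.DockerTransientNode entries are safe to remove, and that Queue.scheduleMaintenance() makes the queue re-evaluate waiting items); I haven't confirmed it actually clears the stuck state:

          import hudson.model.Node;
          import io.jenkins.docker.DockerTransientNode;
          import jenkins.model.Jenkins;
          import java.util.ArrayList;
          import java.util.List;

          Jenkins jenkins = Jenkins.get();

          // Collect any leftover docker-plugin nodes whose computer is gone or offline.
          List<Node> stale = new ArrayList<>();
          for (Node node : jenkins.getNodes()) {
              if (node instanceof DockerTransientNode
                      && (node.toComputer() == null || node.toComputer().isOffline())) {
                  stale.add(node);
              }
          }

          // Remove them by hand instead of restarting the whole service.
          for (Node node : stale) {
              System.out.println("removing stale node: " + node.getNodeName());
              jenkins.removeNode(node);
          }

          // Ask the queue to re-evaluate waiting items.
          jenkins.getQueue().scheduleMaintenance();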

           


            Assignee: Unassigned
            Reporter: Matt Wilson (mwils2424)
            Votes: 2
            Watchers: 4