Jenkins / JENKINS-55066

Docker plugin erroneously terminates containers shortly after they start


      Description

      We are seeing an issue where Pipeline jobs using Docker agents (with the Docker plugin, as opposed to Docker containers on regular agents using Pipeline's Docker support) intermittently fail right at the start, during the initial git checkout, with a "FATAL: java.io.IOException: Unexpected termination of the channel" exception. Having enabled debug logging for the Docker plugin, it appears that the plugin is erroneously killing the container because it thinks it is no longer needed.

      Job log:

      [First few lines redacted, this is the Jenkinsfile checkout]
      
      Checking out Revision 0b45f687992585a470e5faf003309b215e3f74f1 (refs/remotes/origin/master)
       > git config core.sparsecheckout # timeout=10
       > git checkout -f 0b45f687992585a470e5faf003309b215e3f74f1
      FATAL: java.io.IOException: Unexpected termination of the channel
      java.io.EOFException
              at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2679)
              at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3154)
              at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
              at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
              at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
              at hudson.remoting.Command.readFrom(Command.java:140)
              at hudson.remoting.Command.readFrom(Command.java:126)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
      Caused: java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
      Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to docker-2ae12755b75761
                      at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
                      at hudson.remoting.Request.call(Request.java:202)
                      at hudson.remoting.Channel.call(Channel.java:954)
                      at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:283)
                      at com.sun.proxy.$Proxy118.withRepository(Unknown Source)
                      at org.jenkinsci.plugins.gitclient.RemoteGitImpl.withRepository(RemoteGitImpl.java:235)
                      at hudson.plugins.git.GitSCM.printCommitMessageToLog(GitSCM.java:1271)
                      at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1244)
                      at hudson.scm.SCM.checkout(SCM.java:504)
                      at hudson.model.AbstractProject.checkout(AbstractProject.java:1208)
                      at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
                      at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
                      at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
                      at hudson.model.Run.execute(Run.java:1815)
                      at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
                      at hudson.model.ResourceController.execute(ResourceController.java:97)
                      at hudson.model.Executor.run(Executor.java:429)
      Caused: hudson.remoting.RequestAbortedException
              at hudson.remoting.Request.abort(Request.java:340)
              at hudson.remoting.Channel.terminate(Channel.java:1038)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:96)
      Finished: FAILURE

      Docker plugin debug log:

      2018-11-15 13:59:56.444+0000 [id=25]    FINE    c.n.j.p.d.s.DockerOnceRetentionStrategy#done: terminating docker-2ae12755b75761 since PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=CompileTest#22494,label=docker-2ae12755b75761,context=CpsStepContext[4:node]:Owner[CompileTest/22494:CompileTest #22494],cookie=561ba1da-fd51-4ee6-9bc3-5d4bb75a9fd0,auth=null} seems to be finished
      2018-11-15 13:59:56.446+0000 [id=2063156]       INFO    i.j.docker.DockerTransientNode$1#println: Disconnected computer for slave 'docker-2ae12755b75761'.
      2018-11-15 13:59:56.448+0000 [id=2063156]       INFO    i.j.docker.DockerTransientNode$1#println: Removed Node for slave 'docker-2ae12755b75761'. 

      Jenkins log:

      2018-11-15 13:59:56.445+0000 [id=2063144]       SEVERE  h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel docker-2ae12755b75761
      java.net.SocketException: Socket closed
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
              at java.net.SocketInputStream.read(SocketInputStream.java:171)
              at java.net.SocketInputStream.read(SocketInputStream.java:141)
              at java.net.SocketInputStream.read(SocketInputStream.java:127)
              at io.jenkins.docker.client.DockerMultiplexedInputStream.readInternal(DockerMultiplexedInputStream.java:41)
              at io.jenkins.docker.client.DockerMultiplexedInputStream.read(DockerMultiplexedInputStream.java:25)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
      2018-11-15 13:59:56.446+0000 [id=2063156]       INFO    i.j.docker.DockerTransientNode$1#println: Disconnected computer for slave 'docker-2ae12755b75761'.
      2018-11-15 13:59:56.448+0000 [id=2063156]       INFO    i.j.docker.DockerTransientNode$1#println: Removed Node for slave 'docker-2ae12755b75761'.
      

       

      The timestamps of the logs seem to indicate that the Docker plugin erroneously thinks that the job, or at least a step in the job, has completed and so the container should be terminated. This happens a couple of times a day on the same job, but most builds do not fail.
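The FINE-level `DockerOnceRetentionStrategy#done` line above only appears once debug logging is enabled for the plugin. As a rough sketch of what that takes (the package names are inferred from the abbreviated logger names `c.n.j.p.d.s.*` and `i.j.docker.*` in the output above, so treat them as assumptions), the relevant `java.util.logging` loggers can be raised to FINE like this; on a live Jenkins instance you would do the equivalent through a custom log recorder in Manage Jenkins:

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class EnableDockerPluginDebugLogging {
    // Keep strong references: java.util.logging only holds loggers weakly.
    static final Logger STRATEGY_LOGGER =
            Logger.getLogger("com.nirima.jenkins.plugins.docker");
    static final Logger NODE_LOGGER =
            Logger.getLogger("io.jenkins.docker");

    public static void main(String[] args) {
        for (Logger logger : new Logger[] { STRATEGY_LOGGER, NODE_LOGGER }) {
            logger.setLevel(Level.FINE);       // record FINE-level messages
            ConsoleHandler handler = new ConsoleHandler();
            handler.setLevel(Level.FINE);      // and actually emit them
            logger.addHandler(handler);
        }
        System.out.println(NODE_LOGGER.getLevel());
    }
}
```

With both loggers at FINE, the "terminating ... seems to be finished" message becomes visible, which is what pins the container removal on the retention strategy rather than on Docker itself.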


          Activity

          Ali Narenji added a comment -

          Is there any workaround for this issue?

          Joao Simas added a comment -

          I'm seeing exactly the same behavior. Did you find any solution?

          pjdarton added a comment -

          As a (heavy) user of the docker-plugin, this one surprises me as "it works for me".

          I think that, to get anywhere debugging this, we'll need a decent test case that'll reproduce the issue ... and does so on the current version of the docker plugin.

          Myra Faisal added a comment -

          I am facing the same issue. Has anyone found a solution?

          pjdarton added a comment -

          If anyone has, they've kept it to themselves.

          As I said before, to fix this, we'll need a decent test case that'll reproduce the issue, using the latest docker-plugin and using standard docker images.

          If you can't reproduce the issue using the latest docker-plugin and standard docker images then I guess that's your solution - use the latest docker-plugin and use standard docker images.

          Amit Dar added a comment -

          This issue started popping up on our site as well.

          Jenkins 2.249.2

          Docker plugin 1.2.1

          From the looks of it, this is not being handled well. Can any of the previous commenters add any info regarding this issue?

          pjdarton added a comment -

          Personally, I'd be suspicious of anything that called itself PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask.

          FYI the docker plugin terminates the container when the task is complete - that's "as designed" (and also "as intended", i.e. I also believe the design is correct).  However, the docker plugin doesn't decide when things are "done", it is told when things are done, so if it's told a task is done when it isn't, this is the kind of symptom you'll see.  My expectation here is that this probably isn't a bug with the docker-plugin at all but instead a bug with whatever is telling the docker plugin it's time to kill the container.
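That division of responsibility can be illustrated with a toy model (purely illustrative Java, not the plugin's actual classes or APIs): the strategy's "done" hook terminates the node unconditionally whenever something reports the task as finished, so a single spurious "finished" report is enough to tear down a channel that is still in use.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class OnceRetentionModel {
    // Toy stand-in for a single-use Docker node; not the plugin's real class.
    static class Node {
        final AtomicBoolean channelOpen = new AtomicBoolean(true);

        void terminate() {
            // Container, computer, and remoting channel all go away here.
            channelOpen.set(false);
        }
    }

    // Mirrors the division of responsibility described above: the strategy
    // never verifies completion itself; it terminates whenever it is TOLD
    // the task finished.
    static void done(Node node, boolean taskReportedFinished) {
        if (taskReportedFinished) {
            node.terminate();
        }
    }

    public static void main(String[] args) {
        Node node = new Node();
        // A spurious "finished" report (e.g. from a placeholder task) is
        // enough to close the channel while a checkout is still using it.
        done(node, true);
        System.out.println("channel open: " + node.channelOpen.get());
    }
}
```

Under this model, the "Unexpected termination of the channel" in the job log is just the downstream symptom; the bug, if any, lives in whatever produced the premature "finished" signal.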

          A quick trip to Google revealed this javadoc, which implies that this is down to the "workflow-durable-task-step" code not doing what it says here ... or maybe the closure is not as "done" as that code thought it was.

           

          So, all I can do is re-iterate what I've said twice before: We need a repro case.

          Someone who's experiencing this issue needs to take the time to reduce it down to just the minimal set of conditions required to make it happen.  If someone does that then we have a solvable bug; until someone does that, all we have is a bunch of folks providing sympathy & empathy for fellow sufferers, but no actual help.

          TL;DR: If this is bugging you, demonstrate the bug; make it easier for people to help you.

          Owen Mehegan added a comment -

          Just wanted to say that I saw this issue while assisting a CloudBees customer, which is what led me to file this bug with the information I gathered from them. But they eventually went silent and we never made any further progress, so I don't have anything more I can offer.

          Matt Wilson added a comment - edited

          I'm not sure if I have this problem exactly, but it's pretty close.  We've been using this plugin since spring, but we've started to see a problem where jobs that run on containers aren't being started.  They just queue up forever until the Jenkins service gets restarted.  This issue seems to have come out of the blue.  It went away for a while, but now it's back.

           

          I've noticed this pattern in my debugging:

          1. run a job, jobs completes fine in a container called DK_COSCOMMON7_D15-0000p4hq2f6yg.
          2. logs indicate there was some kind of socket I/O error, but that the container has been removed:
            I/O error in channel DK_COSCOMMON7_D15-0000p4hq2f6yg
            java.net.SocketException: Socket closed
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
            at java.net.SocketInputStream.read(SocketInputStream.java:171)
            at java.net.SocketInputStream.read(SocketInputStream.java:141)
            at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
            at sun.security.ssl.InputRecord.read(InputRecord.java:503)
            at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
            at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
            at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
            at io.jenkins.docker.client.DockerMultiplexedInputStream.readInternal(DockerMultiplexedInputStream.java:49)
            at io.jenkins.docker.client.DockerMultiplexedInputStream.read(DockerMultiplexedInputStream.java:31)
            at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:92)
            at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
            at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
            at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
            at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
            at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
            Removed Node for node 'DK_COSCOMMON7_D15-0000p4hq2f6yg'.
            Dec 12, 2020 7:39:35 AM INFO io.jenkins.docker.DockerTransientNode$1 printlnStopped container '2f42050429e1cb26c2bb4067c5cc031cc08bdff118fe70bf73624f3758fa17d2' for node 'DK_COSCOMMON7_D15-0000p4hq2f6yg'.
            Dec 12, 2020 7:39:35 AM INFO io.jenkins.docker.DockerTransientNode$1 printlnRemoved container '2f42050429e1cb26c2bb4067c5cc031cc08bdff118fe70bf73624f3758fa17d2' for node 'DK_COSCOMMON7_D15-0000p4hq2f6yg'.
          3. when I try to start a new job, the GUI shows that it tries to reuse "DK_COSCOMMON7_D15-0000p4hq2f6yg", but then flips back to the label assigned to that template, DK_COSCOMMON7.  The logs show this:

          Dec 12, 2020 7:40:12 AM SEVERE com.github.dockerjava.core.async.ResultCallbackTemplate onErrorError during callback com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: 2f42050429e1cb26c2bb4067c5cc031cc08bdff118fe70bf73624f3758fa17d2"} at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:103) at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1432) at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1199) at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1243) at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:648) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:583) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:500) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:462) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748)
          Dec 12, 2020 7:40:12 AM INFO hudson.slaves.NodeProvisioner lambda$update$6Image of pmsplb-cos-tools.dev.datacard.com:8600/centos7_common:latest provisioning successfully completed. We have now 113 computer(s)
          Dec 12, 2020 7:40:12 AM INFO io.jenkins.docker.DockerTransientNode$1 printlnDisconnected computer for node 'DK_COSCOMMON7_D15-0000p4hq2f6yg'.
          Dec 12, 2020 7:40:13 AM INFO io.jenkins.docker.DockerTransientNode$1 printlnRemoved Node for node 'DK_COSCOMMON7_D15-0000p4hq2f6yg'.

          4. at this point my job will sit in the queue forever; it will never start unless I cancel it and then restart it.  When I restart it, the cycle continues: job runs, second run doesn't run, cancel, next run goes.

           

          This one really drives me crazy.  I have no idea at this point what is going on, or what is causing that socket error.  Still trying to figure that part out.

           

          The above was using the "attach" container method.  I've also tried with "ssh", which we normally use, and the pattern is identical, except that the ssh boxes don't generate the socket connection error.

          My nodes folder on my Jenkins server does not list the node name post-build, i.e. that is being cleaned up.  A console call to list the connected slaves does not return that machine either.  No idea where it is retaining this information.

           

          Restarting the Jenkins server clears all this up for a random period of time.  It could be days, weeks, or hours.  I've probably restarted the service 10 times in the last 3 days.

          This is with Jenkins version 2.249.3, although I'm about to upgrade to the new LTS version 2.263.1.

          rmshnair Manikandan added a comment -

          I do see a similar error: somehow the executors are notified that the job is done when it actually is not. This is happening even for a shell script with just a for loop and a sleep. The interesting thing is that this same setup, with the same containers, worked well before, and it is currently working fine in another instance that I have set up with the same Jenkins build to debug this issue. Another point worth noting is that when this problem is occurring, restarting the application does not correct anything; the problem persists after a restart.

           

           
          Apr 05, 2021 4:55:21 PM FINE hudson.model.Executor
          Executor #0 for mavenslave-0002gzpj2u9oj : executing Infra_verifyDockerBuildNode #26 completed Infra_verifyDockerBuildNode #26 in 95,213ms
          Apr 05, 2021 4:55:21 PM FINE com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy
          terminating mavenslave-0002gzpj2u9oj since Infra_verifyDockerBuildNode #26 seems to be finished
          Apr 05, 2021 4:55:21 PM INFO io.jenkins.docker.DockerTransientNode$1 println
          Disconnected computer for node 'mavenslave-0002gzpj2u9oj'.
          Apr 05, 2021 4:55:21 PM FINE hudson.model.Executor
          Executor #0 is interrupted(ABORTED): java.lang.InterruptedException
              at hudson.model.Executor.interrupt(Executor.java:218)
              at hudson.model.Executor.interrupt(Executor.java:201)
              at hudson.model.Executor.interrupt(Executor.java:195)
              at hudson.model.Executor.interrupt(Executor.java:181)
              at hudson.model.Computer$1.run(Computer.java:899)
              at hudson.model.Queue._withLock(Queue.java:1398)
              at hudson.model.Queue.withLock(Queue.java:1275)
              at hudson.model.Computer.setNumExecutors(Computer.java:894)
              at hudson.model.Computer.inflictMortalWound(Computer.java:853)
              at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:238)
              at hudson.model.Queue._withLock(Queue.java:1398)
              at hudson.model.Queue.withLock(Queue.java:1275)
              at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:207)
              at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1634)
              at jenkins.model.Nodes$6.run(Nodes.java:271)
              at hudson.model.Queue._withLock(Queue.java:1398)
              at hudson.model.Queue.withLock(Queue.java:1275)
              at jenkins.model.Nodes.removeNode(Nodes.java:262)
              at jenkins.model.Jenkins.removeNode(Jenkins.java:2164)
              at io.jenkins.docker.DockerTransientNode.terminate(DockerTransientNode.java:251)
              at io.jenkins.docker.DockerTransientNode.terminate(DockerTransientNode.java:179)
              at com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy.lambda$null$0(DockerOnceRetentionStrategy.java:114)
              at hudson.model.Queue._withLock(Queue.java:1398)
              at hudson.model.Queue.withLock(Queue.java:1275)
              at com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy.lambda$done$1(DockerOnceRetentionStrategy.java:111)
              at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
              at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
          Apr 05, 2021 4:55:21 PM INFO io.jenkins.docker.DockerTransientNode$1 println
          Removed Node for node 'mavenslave-0002gzpj2u9oj'.

          mwils2424 Matt Wilson added a comment -

          Manikandan same here. I've got a second instance that still uses this solution. It's smaller, but busy, and they haven't had many issues at all. On my main site I had to stop using this plugin; I couldn't get past one day without having to restart. Too bad, as I really like the functionality and flexibility it gave me.

          pjdarton pjdarton added a comment -

          Matt Wilson Sorry for not responding earlier; looking at your post from December last year...

          Re 2.
          Seeing "Socket closed" at the time when Jenkins is closing the agent down (and hence the socket that talked to that agent) is unremarkable (as far as the docker-plugin is concerned).
          I agree that it's ugly, but (I think) that's all it is - ugly - and not evidence of a fault.
          (but, if I'm wrong, please tell me why in detail...)

          Re 3.
          That's normal for Jenkins.
          When a job is pending, the Jenkins UI tries to guess why it isn't running yet, but that's all it is - a guess.
          So, in this scenario, Jenkins has an agent "DK_COSCOMMON7_D15-0000p4hq2f6yg" that's in the process of closing down (and being deleted) so Jenkins says "waiting for an executor on DK_COSCOMMON7_D15-0000p4hq2f6yg" and then, once that agent has been deleted it then says that it's waiting for an agent with the tags that the job is asking for.
          TL;DR: The Jenkins UI misleads users when there are dynamic (cloud) agents being supplied "on demand".

          Re: "com.github.dockerjava.api.exception.NotFoundException"
          Sadly that's normal; docker-java is overly verbose when it comes to logging exceptions, and here it is logging (as an exception for user attention) a perfectly normal result that's fully handled by the docker-plugin code. It's doing two container-removals asynchronously: one will get there first and remove the container; the second will be told it's already been removed (and the code handles that), but not before docker-java logs this answer as an exception requiring end-user attention (which is wrong - code should not log exceptions it throws).
          TL;DR: Only pay attention to com.github.dockerjava.api.exception stuff when looking for further information surrounding a Jenkins exception that happened at the same time.

          Re: "sit in the queue for ever"
          Hmm... "that shouldn't happen".
          OK, now this sounds like a real issue - what should happen is that the job that's waiting for the executor to appear should result in a new container being created in docker, a new agent being added to Jenkins, and then the job should run on that agent.
          That's what should happen.
          However, I am aware that there's a (long-standing) bug whereby the docker-plugin can get confused w.r.t. which containers are "in progress" and which aren't ... so can you use the Jenkins "Script Console" (you'll need Jenkins admin rights) and do:

          Jenkins.getInstance().clouds.get(com.nirima.jenkins.plugins.docker.DockerCloud).CONTAINERS_IN_PROGRESS
          

          The list printed will show which containers are being created - if this list is not empty, and it looks like the containers are never coming online, you can clear this list to get the Docker plugin to try again:

          Jenkins.getInstance().clouds.get(com.nirima.jenkins.plugins.docker.DockerCloud).CONTAINERS_IN_PROGRESS.clear()
          

          Note: Restarting the Jenkins server will (effectively) do this too, so anything that can be "fixed by restarting" may be because of this. AFAICT it's mostly caused by changes to the docker-plugin's configuration happening at the same time that other things are going on ... and I've never discovered exactly what/why (or it would've been fixed by now!)
          So, if clearing that map un-blocks things then that's another symptom of that bug ... which may then shed additional light on WTF is going on there and thus help resolve it "for good".

          Re: attach vs ssh
          The container connection method should be irrelevant to this.

          Re: node folder
          I don't use folders myself; I have zero experience using them and I'm not aware of there being any unit-testing (in the docker-plugin) that verifies that everything continues to work when using them.
          It is possible that this issue only exists when folders are being used and disappears when they're not in use; if you (or anyone else) can prove/disprove this then that would be useful information.

          rmshnair Manikandan added a comment -

          I had a breakthrough today. However, I am not sure whether these findings are relevant, as so far I do not have any direct evidence for what causes Jenkins to think the job is finished.

          In my setup, I have Docker secured with certificates, with "TLS Verify" on. Today I disabled this in order to trace the network messages between the server and client. Once I disabled TLS Verify, the problem vanished. I am still performing more testing to see whether the two are related. The questions in front of me:

          1. I have had TLS verify on for the past 8-9 months; why this problem now? It showed up for a week last month, then vanished, and came back a week ago.
          2. Why does the error always occur around the 3-minute mark? Are there any calls that try to reach the Docker API through port 2376 for status checks?
          3. How can it succeed the first time and fail later? And if such a failure happens, why is it not reported in any logging?
          4. Is this another false alarm? Would it fail again?

          There may be more questions, as I have only just started investigating these results.

          pjdarton pjdarton added a comment -

          TLS verify shouldn't affect things; the change in behaviour may have as much to do with "someone changed the config" as anything (although that "shouldn't" affect things either ... but then again, Jenkins shouldn't be terminating things before it's finished with them, so somewhere, something that shouldn't happen is happening anyway).
          I believe that the "TLS verify" option is a boolean yes/no passed into the docker-java code to tell it whether or not to suppress verification of the server certificates; if you don't tick this box then any certificates encountered must either be in the Jenkins server JVM's trusted keystore or must be signed by one that's in the keystore.
          e.g. where I work, we have an "internal" CA we use for HTTPS certificates, but that's a self-signed CA, so we need to ensure that every machine & every JVM on every machine has that CA installed as a "trusted certificate", or else folks get told "this website isn't safe, are you sure you want to continue?" etc; the "TLS verify" tick-box in Jenkins effectively says "yeah, we don't care if the certificate isn't trusted, carry on anyway".
          If this was a factor then Jenkins wouldn't be able to talk to docker at all.

          TL;DR: It shouldn't make a difference to this issue.
          ...but I don't dispute that something, somewhere, has made a difference - it just isn't likely to be this, but it could be a timing issue, for example - if you used a proxy to log everything, that'd add a few milliseconds of delay which might be the real reason (or it could be a gazillion other things).

          Re: timing
          The docker-plugin remembers provisioning errors for a while; it'll "turn off" a cloud (or template) which failed, but only temporarily (default 5 minutes).
          There's also a background clean-up task that goes around deleting containers that aren't in use anymore; it ensures that they're a few minutes old before it'll consider deleting them (there is a system property you can set to disable this background cleanup if you suspect it's at fault - check the logs for details, as it's fairly up-front about what it's doing).
          I'd suggest that you set up a custom log-recorder (manage jenkins -> system log, "add new log recorder", call it "docker") set to:

          rmshnair Manikandan added a comment -

          I totally understand and agree with you, but I can now fix and reproduce this issue 100% of the time just by switching TLS verify for the Docker API: when the Docker API is unsecured it works every time, and when I switch back to certificates with TLS verify it fails at exactly 3-4 minutes. What is going on at the 3-4 minute mark after job start - is there any monitoring thread trying to reach the Docker API? I notice some handshake failures on the Docker API side, but I don't yet have enough evidence to confirm the source of that message.

          Any thoughts?

          pjdarton pjdarton added a comment -

          It sounds like "shouldn't" isn't the same as "doesn't" - it seems that theory and practice aren't quite in alignment...
          OK, in which case the logging I suggested is critical; grab the logs for those 3-4 minutes and upload them (redacting any hostnames etc. you want to keep private).

          Something, somewhere, is doing something I didn't expect it to be doing ... but the answers should be in the logs somewhere.

          One other thing: Are you sure you don't have any other docker-related functionality happening at the same time? e.g. the docker-workflow plugin and/or the yet-another-docker-plugin and/or the docker-swarm plugin. While these should all play nicely together, it's possible that the underlying cause of this is interference between them rather than a bug that's purely within the docker-plugin.

          rmshnair Manikandan added a comment -

          Thanks for understanding. Here are the logs; I added only those 4 loggers, as you suggested above.

          Apr 08, 2021 5:27:37 PM FINE com.nirima.jenkins.plugins.docker.DockerCloud provision
          Asked to provision 1 agent(s) for: jenkinsslave
          Apr 08, 2021 5:27:37 PM INFO com.nirima.jenkins.plugins.docker.DockerCloud canAddProvisionedAgent
          Provisioning 'registry.local/scm/infra/docker' on 'slave_JKS_02'; Total containers: 0 (of 5)
          Apr 08, 2021 5:27:37 PM INFO com.nirima.jenkins.plugins.docker.DockerCloud provision
          Will provision 'registry.local/scm/infra/docker', for label: 'jenkinsslave', in cloud: 'slave_JKS_02'
          Apr 08, 2021 5:27:37 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
          Trying to run container for image "registry.local/scm/infra/docker"
          Apr 08, 2021 5:27:37 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
          Trying to run container for node test-000284dm0q22t from image: registry.local/scm/infra/docker
          Apr 08, 2021 5:27:37 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
          Started container ID a00bcf97eb400b74ee6e99ee2013b1be96e80ee8d2e94bb981716202f5e7e073 for node test-000284dm0q22t from image: registry.local/scm/infra/docker
          Apr 08, 2021 5:27:39 PM INFO io.jenkins.docker.client.DockerMultiplexedInputStream readInternal
          stderr from test-000284dm0q22t (a00bcf97eb400b74ee6e99ee2013b1be96e80ee8d2e94bb981716202f5e7e073): Apr 08, 2021 5:27:39 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
          INFO: Using /home/jenkins/agent.log as an agent error log destination; output log will not be generated
          Apr 08, 2021 5:27:40 PM INFO io.jenkins.docker.client.DockerMultiplexedInputStream readInternal
          stderr from test-000284dm0q22t (a00bcf97eb400b74ee6e99ee2013b1be96e80ee8d2e94bb981716202f5e7e073): channel started
          Apr 08, 2021 5:30:20 PM FINE com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy
          terminating test-000284dm0q22t since Infra_verifyDockerBuildNode #51 seems to be finished
          Apr 08, 2021 5:30:20 PM INFO io.jenkins.docker.DockerTransientNode$1 println
          Disconnected computer for node 'test-000284dm0q22t'.
          Apr 08, 2021 5:30:20 PM INFO io.jenkins.docker.DockerTransientNode$1 println
          Removed Node for node 'test-000284dm0q22t'.
          Apr 08, 2021 5:30:20 PM INFO io.jenkins.docker.DockerTransientNode$1 println
          Container 'a00bcf97eb400b74ee6e99ee2013b1be96e80ee8d2e94bb981716202f5e7e073' already stopped for node 'test-000284dm0q22t'.
          Apr 08, 2021 5:30:21 PM INFO io.jenkins.docker.DockerTransientNode$1 println
          Removed container 'a00bcf97eb400b74ee6e99ee2013b1be96e80ee8d2e94bb981716202f5e7e073' for node 'test-000284dm0q22t'.
          Apr 08, 2021 5:30:35 PM INFO hudson.model.AsyncPeriodicWork lambda$doRun$0
          Started DockerContainerWatchdog Asynchronous Periodic Work
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Docker Container Watchdog has been triggered
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog$Statistics writeStatisticsToLog
          Watchdog Statistics: Number of overall executions: 328, Executions with processing timeout: 0, Containers removed gracefully: 8, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 14804 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 637 ms, Average runtime of container retrieval: 90 ms
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog loadNodeMap
          We currently have 13 nodes assigned to this Jenkins instance, which we will check
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Checking Docker Cloud slave_JKS_02 at tcp://Slave02.local:2376/
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Checking Docker Cloud slave_JKS_08 at tcp://Slave08.local:2376/
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog processCloud
          Will not cleanup superfluous containers on DockerCloud [name=slave_JKS_08, dockerURI=tcp://Slave08.local:2376/], as it is disabled
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Checking Docker Cloud slave_JKS_01 at tcp://Slave01.local:2376/
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Docker Container Watchdog check has been completed
          Apr 08, 2021 5:30:35 PM INFO hudson.model.AsyncPeriodicWork lambda$doRun$0
          Finished DockerContainerWatchdog Asynchronous Periodic Work. 209 ms

           

          Show
          rmshnair Manikandan added a comment - Thanks for understanding. Here you go with the logs; I added only those 4 loggers, as you suggested above.

          Apr 08, 2021 5:27:37 PM FINE com.nirima.jenkins.plugins.docker.DockerCloud provision
          Asked to provision 1 agent(s) for: jenkinsslave
          Apr 08, 2021 5:27:37 PM INFO com.nirima.jenkins.plugins.docker.DockerCloud canAddProvisionedAgent
          Provisioning 'registry.local/scm/infra/docker' on 'slave_JKS_02'; Total containers: 0 (of 5)
          Apr 08, 2021 5:27:37 PM INFO com.nirima.jenkins.plugins.docker.DockerCloud provision
          Will provision 'registry.local/scm/infra/docker', for label: 'jenkinsslave', in cloud: 'slave_JKS_02'
          Apr 08, 2021 5:27:37 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
          Trying to run container for image "registry.local/scm/infra/docker"
          Apr 08, 2021 5:27:37 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
          Trying to run container for node test-000284dm0q22t from image: registry.local/scm/infra/docker
          Apr 08, 2021 5:27:37 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
          Started container ID a00bcf97eb400b74ee6e99ee2013b1be96e80ee8d2e94bb981716202f5e7e073 for node test-000284dm0q22t from image: registry.local/scm/infra/docker
          Apr 08, 2021 5:27:39 PM INFO io.jenkins.docker.client.DockerMultiplexedInputStream readInternal
          stderr from test-000284dm0q22t (a00bcf97eb400b74ee6e99ee2013b1be96e80ee8d2e94bb981716202f5e7e073): Apr 08, 2021 5:27:39 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
          INFO: Using /home/jenkins/agent.log as an agent error log destination; output log will not be generated
          Apr 08, 2021 5:27:40 PM INFO io.jenkins.docker.client.DockerMultiplexedInputStream readInternal
          stderr from test-000284dm0q22t (a00bcf97eb400b74ee6e99ee2013b1be96e80ee8d2e94bb981716202f5e7e073): channel started
          Apr 08, 2021 5:30:20 PM FINE com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy
          terminating test-000284dm0q22t since Infra_verifyDockerBuildNode #51 seems to be finished
          Apr 08, 2021 5:30:20 PM INFO io.jenkins.docker.DockerTransientNode$1 println
          Disconnected computer for node 'test-000284dm0q22t'.
          Apr 08, 2021 5:30:20 PM INFO io.jenkins.docker.DockerTransientNode$1 println
          Removed Node for node 'test-000284dm0q22t'.
          Apr 08, 2021 5:30:20 PM INFO io.jenkins.docker.DockerTransientNode$1 println
          Container 'a00bcf97eb400b74ee6e99ee2013b1be96e80ee8d2e94bb981716202f5e7e073' already stopped for node 'test-000284dm0q22t'.
          Apr 08, 2021 5:30:21 PM INFO io.jenkins.docker.DockerTransientNode$1 println
          Removed container 'a00bcf97eb400b74ee6e99ee2013b1be96e80ee8d2e94bb981716202f5e7e073' for node 'test-000284dm0q22t'.
          Apr 08, 2021 5:30:35 PM INFO hudson.model.AsyncPeriodicWork lambda$doRun$0
          Started DockerContainerWatchdog Asynchronous Periodic Work
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Docker Container Watchdog has been triggered
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog$Statistics writeStatisticsToLog
          Watchdog Statistics: Number of overall executions: 328, Executions with processing timeout: 0, Containers removed gracefully: 8, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 14804 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 637 ms, Average runtime of container retrieval: 90 ms
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog loadNodeMap
          We currently have 13 nodes assigned to this Jenkins instance, which we will check
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Checking Docker Cloud slave_JKS_02 at tcp://Slave02.local:2376/
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Checking Docker Cloud slave_JKS_08 at tcp://Slave08.local:2376/
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog processCloud
          Will not cleanup superfluous containers on DockerCloud [name=slave_JKS_08, dockerURI=tcp://Slave08.local:2376/], as it is disabled
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Checking Docker Cloud slave_JKS_01 at tcp://Slave01.local:2376/
          Apr 08, 2021 5:30:35 PM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Docker Container Watchdog check has been completed
          Apr 08, 2021 5:30:35 PM INFO hudson.model.AsyncPeriodicWork lambda$doRun$0
          Finished DockerContainerWatchdog Asynchronous Periodic Work. 209 ms
          fernando_rosado Fernando Rosado Altamirano added a comment - edited

          We have a similar problem, but with very strange behavior.

          • The Docker host has TLS disabled.
          • Every execution terminates before the job finishes, with the same error message: Socket closed
          • I followed the same steps, with similar results: enabled the loggers and so on (the full logs are copied below).
          • As a last resort, I changed the IP of the Docker host:
            # vi /etc/sysconfig/network-scripts/ifcfg-eth0
            # systemctl restart network
            

          If I roll back the change and assign the old IP address again (no other change), the Jenkins Docker agents fail after a few minutes (5 minutes or less).
          Only this change (together with updating the Jenkins configuration to use the new IP) solves the problem.
          This is not a valid solution for us; we want to find the root cause.

          The full log details are attached.
          Jenkins log from the Docker plugin:

          Asked to provision 1 agent(s) for: ansible-test
          Aug 24, 2021 11:48:50 AM INFO com.nirima.jenkins.plugins.docker.DockerCloud canAddProvisionedAgent
          Provisioning 'mydockerhub.internal/dev/ansible2.9-base-centos7:latest' on 'docker-agents'; Total containers: 0 (of 100)
          Aug 24, 2021 11:48:50 AM INFO com.nirima.jenkins.plugins.docker.DockerCloud provision
          Will provision 'mydockerhub.internal/dev/ansible2.9-base-centos7:latest', for label: 'ansible-test', in cloud: 'docker-agents'
          Aug 24, 2021 11:48:50 AM INFO com.nirima.jenkins.plugins.docker.DockerTemplate pullImage
          Pulling image 'mydockerhub.internal/dev/ansible2.9-base-centos7:latest'. This may take awhile...
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM FINEST io.netty.channel.nio.NioEventLoop openSelector
          instrumented a special java.util.Set into: {}
          Aug 24, 2021 11:48:50 AM INFO io.jenkins.docker.client.DockerAPI getOrMakeClient
          Cached connection io.jenkins.docker.client.DockerAPI$SharableDockerClient@188d282f to DockerClientParameters{dockerUri=tcp://docker-agent-host:4243, credentialsId=null, readTimeoutInMsOrNull=300000, connectTimeoutInMsOrNull=60000}
          Aug 24, 2021 11:48:51 AM INFO com.nirima.jenkins.plugins.docker.DockerTemplate pullImage
          Finished pulling image 'mydockerhub.internal/dev/ansible2.9-base-centos7:latest', took 542 ms
          Aug 24, 2021 11:48:51 AM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
          Trying to run container for image "mydockerhub.internal/dev/ansible2.9-base-centos7:latest"
          Aug 24, 2021 11:48:51 AM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
          Trying to run container for node ansible2.9-base-centos7-003xx52xa8gj7 from image: mydockerhub.internal/dev/ansible2.9-base-centos7:latest
          Aug 24, 2021 11:48:51 AM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
          Started container ID 1704c833eb9299adafa8257e6c9896a82cb292be98e139d679034e8646c7bea4 for node ansible2.9-base-centos7-003xx52xa8gj7 from image: mydockerhub.internal/dev/ansible2.9-base-centos7:latest
          Aug 24, 2021 11:48:52 AM FINE io.netty.channel.DefaultChannelPipeline onUnhandledInboundMessage
          Discarded inbound message {} that reached at the tail of the pipeline. Please check your pipeline configuration.
          Aug 24, 2021 11:48:52 AM INFO io.jenkins.docker.client.DockerMultiplexedInputStream readInternal
          stderr from ansible2.9-base-centos7-003xx52xa8gj7 (1704c833eb9299adafa8257e6c9896a82cb292be98e139d679034e8646c7bea4): Aug 24, 2021 11:48:52 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
          INFO: Using /home/jenkins/agent.log as an agent error log destination; output log will not be generated
          Aug 24, 2021 11:48:53 AM INFO io.jenkins.docker.client.DockerMultiplexedInputStream readInternal
          stderr from ansible2.9-base-centos7-003xx52xa8gj7 (1704c833eb9299adafa8257e6c9896a82cb292be98e139d679034e8646c7bea4): channel started
          Aug 24, 2021 11:51:40 AM INFO hudson.model.AsyncPeriodicWork lambda$doRun$0
          Started DockerContainerWatchdog Asynchronous Periodic Work
          Aug 24, 2021 11:51:40 AM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Docker Container Watchdog has been triggered
          Aug 24, 2021 11:51:40 AM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog$Statistics writeStatisticsToLog
          Watchdog Statistics: Number of overall executions: 8274, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 43, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 910 ms, Average runtime of container retrieval: 234 ms
          Aug 24, 2021 11:51:40 AM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog loadNodeMap
          We currently have 20 nodes assigned to this Jenkins instance, which we will check
          Aug 24, 2021 11:51:40 AM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Checking Docker Cloud docker-agents at tcp://docker-agent-host:4243
          Aug 24, 2021 11:51:40 AM INFO com.nirima.jenkins.plugins.docker.DockerContainerWatchdog execute
          Docker Container Watchdog check has been completed
          Aug 24, 2021 11:51:40 AM INFO hudson.model.AsyncPeriodicWork lambda$doRun$0
          Finished DockerContainerWatchdog Asynchronous Periodic Work. 7 ms
          Aug 24, 2021 11:51:52 AM FINE com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy
          terminating ansible2.9-base-centos7-003xx52xa8gj7 since test-job #31 seems to be finished
          Aug 24, 2021 11:51:52 AM INFO io.jenkins.docker.DockerTransientNode$1 println
          Disconnected computer for node 'ansible2.9-base-centos7-003xx52xa8gj7'.
          Aug 24, 2021 11:51:52 AM INFO io.jenkins.docker.DockerTransientNode$1 println
          Can't stop container '1704c833eb9299adafa8257e6c9896a82cb292be98e139d679034e8646c7bea4' for node 'ansible2.9-base-centos7-003xx52xa8gj7' as it does not exist.
          Aug 24, 2021 11:51:52 AM INFO io.jenkins.docker.DockerTransientNode$1 println
          Removed Node for node 'ansible2.9-base-centos7-003xx52xa8gj7'.
          
          

          I also enabled the Docker daemon debug trace, and an explicit container stop/delete shows up:

          Aug 24 12:06:57 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:57.805591390Z" level=debug msg="Assigning addresses for endpoint serene_villani's interface on network bridge"
          Aug 24 12:06:57 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:57.812651917Z" level=debug msg="Programming external connectivity on endpoint serene_villani (eecc35eb5d7c74c16be3f557fc84ddb6ce86bfd7485061c558c0802327a4f49f)"
          Aug 24 12:06:57 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:57.814811628Z" level=debug msg="EnableService 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846 START"
          Aug 24 12:06:57 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:57.814841839Z" level=debug msg="EnableService 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846 DONE"
          Aug 24 12:06:57 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:57.818744124Z" level=debug msg="bundle dir created" bundle=/var/run/docker/containerd/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846 module=libcontainerd namespace=moby root=/var/lib/docker/overlay2/2773a0dca5fc99396f556e5e1b897224cf231a947f45c960bb6bf46c52ee1f6b/merged
          Aug 24 12:06:57 docker-agent-host containerd[1102]: time="2021-08-24T12:06:57.845538905Z" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846 pid=9806
          Aug 24 12:06:58 docker-agent-host kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
          Aug 24 12:06:58 docker-agent-host kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
          Aug 24 12:06:58 docker-agent-host kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth71a50aa: link becomes ready
          Aug 24 12:06:58 docker-agent-host kernel: docker0: port 1(veth71a50aa) entered blocking state
          Aug 24 12:06:58 docker-agent-host kernel: docker0: port 1(veth71a50aa) entered forwarding state
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.371993182Z" level=debug msg="sandbox set key processing took 156.552676ms for container 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.438854804Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/create
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.472247002Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/start
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.560652987Z" level=debug msg="Calling PUT /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/archive?path=%2Fhome%2Fjenkins&noOverwriteDirNonDir=false"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.560874237Z" level=debug msg="container mounted via layerStore: &{/var/lib/docker/overlay2/2773a0dca5fc99396f556e5e1b897224cf231a947f45c960bb6bf46c52ee1f6b/merged 0x55f316e67f40 0x55f316e67f40}" container=8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.561455163Z" level=debug msg="unpigz binary not found, falling back to go gzip library"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.701300673Z" level=debug msg="Calling GET /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/json"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.706687477Z" level=debug msg="Calling GET /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/json"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.711112832Z" level=debug msg="Calling POST /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/exec"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.711243982Z" level=debug msg="form data: {\"AttachStderr\":true,\"AttachStdin\":true,\"AttachStdout\":true,\"Cmd\":[\"/usr/bin/java\",\"-jar\",\"/home/jenkins/remoting-4.5.jar\",\"-noReconnect\",\"-noKeepAlive\",\"-agentLog\",\"/home/jenkins/agent.log\"],\"Tty\":false,\"User\":\"jenkins\",\"containerId\":\"8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846\"}"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.712526832Z" level=debug msg="Calling POST /v1.32/exec/304ccc7fbf96afacc204b489462233dddf6839ff1ceac06d8eaca3446f5df773/start"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.712619072Z" level=debug msg="form data: {\"Detach\":false,\"Tty\":false}"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.712931510Z" level=debug msg="starting exec command 304ccc7fbf96afacc204b489462233dddf6839ff1ceac06d8eaca3446f5df773 in container 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.714921605Z" level=debug msg="attach: stderr: begin"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.714922892Z" level=debug msg="attach: stdout: begin"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.714937302Z" level=debug msg="attach: stdin: begin"
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.718168804Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-added
          Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.811154921Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-started
          Aug 24 12:07:02 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:02.061778175Z" level=debug msg="Calling GET /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/json"
          Aug 24 12:07:12 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:12.467544842Z" level=debug msg="Calling HEAD /_ping"
          Aug 24 12:07:12 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:12.468691041Z" level=debug msg="Calling GET /v1.41/containers/json"
          Aug 24 12:07:17 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:17.753068188Z" level=debug msg="Calling HEAD /_ping"
          Aug 24 12:07:21 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:21.431766056Z" level=debug msg="Calling HEAD /_ping"
          Aug 24 12:07:21 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:21.433187102Z" level=debug msg="Calling GET /v1.41/containers/8f8/json"
          Aug 24 12:11:37 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:37.776240745Z" level=debug msg="Calling GET /containers/json?all=true&filters=%7B%22label%22%3A%5B%22com.nirima.jenkins.plugins.docker.JenkinsId%3Da1fdb6cddc65e3808d7849dcbf333580%22%5D%7D"
          Aug 24 12:11:37 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:37.783347656Z" level=debug msg="Calling POST /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/stop?t=10"
          Aug 24 12:11:37 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:37.783654630Z" level=debug msg="Sending kill signal 15 to container 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846"
          Aug 24 12:11:40 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:40.906046192Z" level=debug msg="Calling GET /containers/json?all=true&filters=%7B%22label%22%3A%5B%22com.nirima.jenkins.plugins.docker.JenkinsId%3Da1fdb6cddc65e3808d7849dcbf333580%22%5D%7D"
          Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.852784520Z" level=info msg="Container 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846 failed to exit within 10 seconds of signal 15 - using the force"
          Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.852972852Z" level=debug msg="Sending kill signal 9 to container 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846"
          Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.937757918Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
          Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.938300206Z" level=debug msg="attach: stdout: end"
          Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.938332963Z" level=debug msg="attach: stderr: end"
          Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.938460037Z" level=debug msg="attach: stdin: end"
          Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.938559126Z" level=debug msg="attach done"
          Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.941605088Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
          Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.000044557Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/delete
          Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.000105917Z" level=info msg="ignoring event" container=8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
          Aug 24 12:11:48 docker-agent-host containerd[1102]: time="2021-08-24T12:11:48.000436444Z" level=info msg="shim disconnected" id=8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846
          Aug 24 12:11:48 docker-agent-host containerd[1102]: time="2021-08-24T12:11:48.000664261Z" level=error msg="copy shim log" error="read /proc/self/fd/12: file already closed"
          Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.002242004Z" level=debug msg="Revoking external connectivity on endpoint serene_villani (eecc35eb5d7c74c16be3f557fc84ddb6ce86bfd7485061c558c0802327a4f49f)"
          Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.010704047Z" level=debug msg="DeleteConntrackEntries purged ipv4:0, ipv6:0"
          Aug 24 12:11:48 docker-agent-host kernel: docker0: port 1(veth71a50aa) entered disabled state
          Aug 24 12:11:48 docker-agent-host kernel: docker0: port 1(veth71a50aa) entered disabled state
          Aug 24 12:11:48 docker-agent-host kernel: device veth71a50aa left promiscuous mode
          Aug 24 12:11:48 docker-agent-host kernel: docker0: port 1(veth71a50aa) entered disabled state
          Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.070537216Z" level=debug msg="Releasing addresses for endpoint serene_villani's interface on network bridge"
          Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.070603344Z" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.2)"
          Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.070725504Z" level=debug msg="Released address PoolID:LocalDefault/172.17.0.0/16, Address:172.17.0.2 Sequence:App: ipam/default/data, ID: LocalDefault/172.17.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65532, Sequence: (0xe0000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:3"
          Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.105589404Z" level=debug msg="Calling DELETE /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846?v=true"
          Aug 24 12:11:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:58.007798583Z" level=debug msg="Calling POST /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/stop?t=10"
          Aug 24 12:16:37 docker-agent-host dockerd[22607]: time="2021-08-24T12:16:37.775109905Z" level=debug msg="Calling GET /containers/json?all=true&filters=%7B%22label%22%3A%5B%22com.nirima.jenkins.plugins.docker.JenkinsId%3Da1fdb6cddc65e3808d7849dcbf333580%22%5D%7D"
          Aug 24 12:16:40 docker-agent-host dockerd[22607]: time="2021-08-24T12:16:40.906616889Z" level=debug msg="Calling GET /containers/json?all=true&filters=%7B%22label%22%3A%5B%22com.nirima.jenkins.plugins.docker.JenkinsId%3Da1fdb6cddc65e3808d7849dcbf333580%22%5D%7D"
          Aug 24 12:17:18 docker-agent-host dockerd[22607]: time="2021-08-24T12:17:18.145111111Z" level=debug msg="Closing buffered stdin pipe"
          Aug 24 12:17:18 docker-agent-host dockerd[22607]: time="2021-08-24T12:17:18.145378406Z" level=debug msg="Closing buffered stdin pipe"
          Aug 24 12:17:18 docker-agent-host dockerd[22607]: time="2021-08-24T12:17:18.145648625Z" level=debug msg="Closing buffered stdin pipe"
          Aug 24 12:17:18 docker-agent-host dockerd[22607]: time="2021-08-24T12:17:18.145677775Z" level=debug msg="Closing buffered stdin pipe"
          

          I think the problem could be related to the watchdog and the filter it uses. It asks the Docker host for the status of the containers that Jenkins has created; the Docker host returns all containers matching the filter (labeled by Jenkins), but for some reason the plugin was unable to identify the running container (the names do not match).
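          The watchdog's query is visible in the dockerd trace above. As a quick sanity check (a sketch, not plugin code; the JenkinsId label value is copied from the daemon log), the JSON label filter URL-encodes to exactly the query string seen in the `GET /containers/json?all=true&filters=...` lines:

```python
import urllib.parse

# The watchdog asks the Docker host for all containers carrying the
# plugin's JenkinsId label. The JSON filter below (label value copied
# from the dockerd debug log above) URL-encodes to the query string
# seen in the daemon trace, confirming what the plugin is filtering on.
label_filter = '{"label":["com.nirima.jenkins.plugins.docker.JenkinsId=a1fdb6cddc65e3808d7849dcbf333580"]}'
encoded = urllib.parse.quote(label_filter, safe="")
print(encoded)
# → %7B%22label%22%3A%5B%22com.nirima.jenkins.plugins.docker.JenkinsId%3Da1fdb6cddc65e3808d7849dcbf333580%22%5D%7D
```

          So the host-side filter only selects by JenkinsId; any per-container matching (e.g. by node name) must happen afterwards on the Jenkins side.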
          Executing this code always gives the same result:

          Jenkins.getInstance().clouds.get(com.nirima.jenkins.plugins.docker.DockerCloud).CONTAINERS_IN_PROGRESS 
          -----
          Result: {}
          
          

          And the logs always show this message from com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy:

          terminating node-name since job-name #37 seems to be finished
          

          This happens even though the job has not finished; for some reason the plugin is unable to detect that the job is still running.
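          To illustrate the mismatch hypothesis, here is a minimal hypothetical sketch (not actual plugin code; all names below are illustrative) of how comparing labeled containers against a stale or empty set of known node names could make a live container look superfluous:

```python
# Hypothetical sketch of a name-matching cleanup step: containers that
# carry the JenkinsId label but whose node name is not in the known set
# are treated as superfluous and scheduled for removal.
def superfluous_containers(labeled_containers, known_node_names):
    """labeled_containers: list of (container_id, node_name_label) pairs."""
    return [cid for cid, node_name in labeled_containers
            if node_name not in known_node_names]

containers = [("1704c833eb92", "ansible2.9-base-centos7-003xx52xa8gj7")]

# If the node list is read before the fresh node registers (or names do
# not match), the running container is flagged for removal:
print(superfluous_containers(containers, set()))
# → ['1704c833eb92']

# With the node list in sync, nothing is flagged:
print(superfluous_containers(containers, {"ansible2.9-base-centos7-003xx52xa8gj7"}))
# → []
```

          If the real cleanup works anything like this, a race between loading the node list and listing containers would explain a healthy agent being torn down mid-build.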

container=8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846 Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.561455163Z" level=debug msg="unpigz binary not found, falling back to go gzip library" Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.701300673Z" level=debug msg="Calling GET /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/json" Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.706687477Z" level=debug msg="Calling GET /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/json" Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.711112832Z" level=debug msg="Calling POST /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/exec" Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.711243982Z" level=debug msg="form data: {\"AttachStderr\":true,\"AttachStdin\":true,\"AttachStdout\":true,\"Cmd\":[\"/usr/bin/java\",\"-jar\",\"/home/jenkins/remoting-4.5.jar\",\"-noReconnect\",\"-noKeepAlive\",\"-agentLog\",\"/home/jenkins/agent.log\"],\"Tty\":false,\"User\":\"jenkins\",\"containerId\":\"8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846\"}" Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.712526832Z" level=debug msg="Calling POST /v1.32/exec/304ccc7fbf96afacc204b489462233dddf6839ff1ceac06d8eaca3446f5df773/start" Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.712619072Z" level=debug msg="form data: {\"Detach\":false,\"Tty\":false}" Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.712931510Z" level=debug msg="starting exec command 304ccc7fbf96afacc204b489462233dddf6839ff1ceac06d8eaca3446f5df773 in container 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846" Aug 24 12:06:58 docker-agent-host dockerd[22607]: 
time="2021-08-24T12:06:58.714921605Z" level=debug msg="attach: stderr: begin" Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.714922892Z" level=debug msg="attach: stdout: begin" Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.714937302Z" level=debug msg="attach: stdin: begin" Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.718168804Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-added Aug 24 12:06:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:06:58.811154921Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-started Aug 24 12:07:02 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:02.061778175Z" level=debug msg="Calling GET /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/json" Aug 24 12:07:12 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:12.467544842Z" level=debug msg="Calling HEAD /_ping" Aug 24 12:07:12 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:12.468691041Z" level=debug msg="Calling GET /v1.41/containers/json" Aug 24 12:07:17 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:17.753068188Z" level=debug msg="Calling HEAD /_ping" Aug 24 12:07:21 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:21.431766056Z" level=debug msg="Calling HEAD /_ping" Aug 24 12:07:21 docker-agent-host dockerd[22607]: time="2021-08-24T12:07:21.433187102Z" level=debug msg="Calling GET /v1.41/containers/8f8/json" Aug 24 12:11:37 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:37.776240745Z" level=debug msg="Calling GET /containers/json?all=true&filters=%7B%22label%22%3A%5B%22com.nirima.jenkins.plugins.docker.JenkinsId%3Da1fdb6cddc65e3808d7849dcbf333580%22%5D%7D" Aug 24 12:11:37 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:37.783347656Z" level=debug msg="Calling POST 
/containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/stop?t=10" Aug 24 12:11:37 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:37.783654630Z" level=debug msg="Sending kill signal 15 to container 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846" Aug 24 12:11:40 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:40.906046192Z" level=debug msg="Calling GET /containers/json?all=true&filters=%7B%22label%22%3A%5B%22com.nirima.jenkins.plugins.docker.JenkinsId%3Da1fdb6cddc65e3808d7849dcbf333580%22%5D%7D" Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.852784520Z" level=info msg="Container 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846 failed to exit within 10 seconds of signal 15 - using the force" Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.852972852Z" level=debug msg="Sending kill signal 9 to container 8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846" Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.937757918Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.938300206Z" level=debug msg="attach: stdout: end" Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.938332963Z" level=debug msg="attach: stderr: end" Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.938460037Z" level=debug msg="attach: stdin: end" Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.938559126Z" level=debug msg="attach done" Aug 24 12:11:47 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:47.941605088Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.000044557Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/delete 
Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.000105917Z" level=info msg="ignoring event" container=8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete" Aug 24 12:11:48 docker-agent-host containerd[1102]: time="2021-08-24T12:11:48.000436444Z" level=info msg="shim disconnected" id=8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846 Aug 24 12:11:48 docker-agent-host containerd[1102]: time="2021-08-24T12:11:48.000664261Z" level=error msg="copy shim log" error="read /proc/self/fd/12: file already closed" Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.002242004Z" level=debug msg="Revoking external connectivity on endpoint serene_villani (eecc35eb5d7c74c16be3f557fc84ddb6ce86bfd7485061c558c0802327a4f49f)" Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.010704047Z" level=debug msg="DeleteConntrackEntries purged ipv4:0, ipv6:0" Aug 24 12:11:48 docker-agent-host kernel: docker0: port 1(veth71a50aa) entered disabled state Aug 24 12:11:48 docker-agent-host kernel: docker0: port 1(veth71a50aa) entered disabled state Aug 24 12:11:48 docker-agent-host kernel: device veth71a50aa left promiscuous mode Aug 24 12:11:48 docker-agent-host kernel: docker0: port 1(veth71a50aa) entered disabled state Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.070537216Z" level=debug msg="Releasing addresses for endpoint serene_villani's interface on network bridge" Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.070603344Z" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.0.2)" Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.070725504Z" level=debug msg="Released address PoolID:LocalDefault/172.17.0.0/16, Address:172.17.0.2 Sequence:App: ipam/default/data, ID: LocalDefault/172.17.0.0/16, DBIndex: 0x0, Bits: 
65536, Unselected: 65532, Sequence: (0xe0000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:3" Aug 24 12:11:48 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:48.105589404Z" level=debug msg="Calling DELETE /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846?v=true" Aug 24 12:11:58 docker-agent-host dockerd[22607]: time="2021-08-24T12:11:58.007798583Z" level=debug msg="Calling POST /containers/8f875599d2e1edbcbbca68d92f16c3242fed6212d00d80720de86f1a5fa3d846/stop?t=10" Aug 24 12:16:37 docker-agent-host dockerd[22607]: time="2021-08-24T12:16:37.775109905Z" level=debug msg="Calling GET /containers/json?all=true&filters=%7B%22label%22%3A%5B%22com.nirima.jenkins.plugins.docker.JenkinsId%3Da1fdb6cddc65e3808d7849dcbf333580%22%5D%7D" Aug 24 12:16:40 docker-agent-host dockerd[22607]: time="2021-08-24T12:16:40.906616889Z" level=debug msg="Calling GET /containers/json?all=true&filters=%7B%22label%22%3A%5B%22com.nirima.jenkins.plugins.docker.JenkinsId%3Da1fdb6cddc65e3808d7849dcbf333580%22%5D%7D" Aug 24 12:17:18 docker-agent-host dockerd[22607]: time="2021-08-24T12:17:18.145111111Z" level=debug msg="Closing buffered stdin pipe" Aug 24 12:17:18 docker-agent-host dockerd[22607]: time="2021-08-24T12:17:18.145378406Z" level=debug msg="Closing buffered stdin pipe" Aug 24 12:17:18 docker-agent-host dockerd[22607]: time="2021-08-24T12:17:18.145648625Z" level=debug msg="Closing buffered stdin pipe" Aug 24 12:17:18 docker-agent-host dockerd[22607]: time="2021-08-24T12:17:18.145677775Z" level=debug msg="Closing buffered stdin pipe" I think the problem could be related with the Watchdog and the filter that is using. It is asking docker host about the status of the containters that jenkins has created. 
Docker host returns all the containers on the filter (labeled by jenkins) and it was unable to identify the running container for some reason (name are not matching) The execution of this code is always the same: Jenkins.getInstance().clouds.get(com.nirima.jenkins.plugins.docker.DockerCloud).CONTAINERS_IN_PROGRESS ----- Result: {} And the logs always shows the message from com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy terminating node-name since job-name #37 seems to be finished Even if the job has not finished. For some reason the plugin is unable to detect that the job still running.
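The watchdog queries visible in the dockerd trace above are plain label filters. As a rough sketch (assuming shell access to the Docker host; the JenkinsId value below is the one that appears in these logs, yours will differ), the same listing can be reproduced by hand:

```shell
# Sketch only: rebuild the container query that DockerContainerWatchdog issues,
# using the example JenkinsId decoded from the dockerd debug log above.
JENKINS_ID="a1fdb6cddc65e3808d7849dcbf333580"
FILTER="label=com.nirima.jenkins.plugins.docker.JenkinsId=${JENKINS_ID}"
# Print the equivalent docker CLI command; run it on the docker host to see
# every container the watchdog considers owned by that Jenkins instance.
echo "docker ps --all --filter '${FILTER}'"
```

Comparing that listing against the agents each Jenkins controller believes it owns can show whether the watchdog is matching containers it did not create.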
          rmshnair Manikandan added a comment -

          I agree that Jenkins is sending an explicit command to stop the agent container, but I cannot see why it decides to stop the container on the assumption that the job is complete.
          fernando_rosado Fernando Rosado Altamirano added a comment -

          Thanks to this blog post:
          https://gitmemory.com/issue/jenkinsci/docker-plugin/678/485778494

          I have found the problem. We had a Jenkins test instance running with the same configuration against the same Docker host (no jobs run on it; we only use it to evaluate upgrades and new plugins). We had copied the configuration from one instance to the other, so both kept the same JenkinsId.
          If you have multiple Jenkins servers using the same Docker host, you MUST ensure that they have different instance IDs; otherwise the background cleanup (DockerContainerWatchdog) will kill the other Jenkins server's containers.
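One way to spot such a collision is to group the containers on the Docker host by their JenkinsId label. This is only a sketch: it assumes docker CLI access on the host, and the sample data below is hypothetical (substitute the real `docker ps` output):

```shell
# On the docker host, the raw listing would come from:
#   docker ps --all --format '{{.Names}}\t{{.Label "com.nirima.jenkins.plugins.docker.JenkinsId"}}'
# Hypothetical sample of that output, with two masters sharing one ID:
sample=$'prod-agent-1\ta1fdb6cddc65e3808d7849dcbf333580\ntest-agent-1\ta1fdb6cddc65e3808d7849dcbf333580'
# Print every JenkinsId that appears more than once. If the containers sharing
# an ID belong to different Jenkins instances, their instance IDs collide and
# each watchdog will treat the other's containers as its own.
printf '%s\n' "$sample" | cut -f2 | sort | uniq -d
```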
          rmshnair Manikandan added a comment -

          Interesting. In my case I had only one Jenkins running in Docker; I will automatically restart the Jenkins server container to try to simulate this scenario...

            People

            Assignee:
            Unassigned
            Reporter:
            Owen Mehegan (owenmehegan)
            Votes:
            6
            Watchers:
            10
