Type: Bug
Resolution: Unresolved
Priority: Critical
Labels: None
Environment: Jenkins LTS 2.249.2, Docker Plugin 1.2.1
I'm not sure what has triggered this behavior. We've been using the docker plugin to spin up agents for about 6 months now, and it has worked pretty much flawlessly until this point. We've just recently started seeing this strange behavior where Jenkins will not spin up new agents while jobs are waiting in the queue. The queued jobs just sit there forever. Eventually we get to a point where there are no agents running but multiple jobs queued up.
We have 11 cloud instances. Each instance has multiple templates associated with it. These cloud instances are all pretty much identical: they serve the same templates and labels. The agents connect via SSH.
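If it helps picture the layout, a rough Script Console sketch like the one below (Java-style syntax, which the Script Console also accepts; DockerCloud.getTemplates() and DockerTemplate.getLabelString() are my reading of the plugin API, so treat the exact getters as assumptions) should dump each cloud and the labels it serves:

import jenkins.model.Jenkins;
import hudson.slaves.Cloud;
import com.nirima.jenkins.plugins.docker.DockerCloud;
import com.nirima.jenkins.plugins.docker.DockerTemplate;

// List every Docker cloud and the label string of each of its templates.
for (Cloud c : Jenkins.get().clouds) {
    if (c instanceof DockerCloud) {
        DockerCloud dc = (DockerCloud) c;
        System.out.println("Cloud: " + dc.name);
        for (DockerTemplate t : dc.getTemplates()) {                 // assumed getter
            System.out.println("  labels: " + t.getLabelString());   // assumed getter
        }
    }
}

The expectation is that all 11 clouds list the same template labels.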
The only way I can get things working again is to restart the service. Once the service comes back up, Jenkins starts servicing the job requests again.
The only thing I can see in my logs is this entry:
09-Nov-2020 11:51:56.879 SEVERE [dockerjava-netty-3426-1] com.github.dockerjava.core.async.ResultCallbackTemplate.onError Error during callback
com.github.dockerjava.api.exception.NotFoundException: {"message":"No such container: a96167b9016d2624870640933da629f429ad736a473d690ee70fac3cd97bf211"}
at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:103)
at com.github.dockerjava.netty.handler.HttpResponseHandler.channelRead0(HttpResponseHandler.java:33)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1432)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1199)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1243)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:648)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:583)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:500)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:462)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
If I look for that container hash in the logs, I see the following entries from before the error above. So it looks like the container is created, does its job, and is removed. About 8 minutes after the removal we get the error above about it being missing:
09-Nov-2020 11:36:22.228 INFO Computer.threadPoolForRemoting [#51286] com.nirima.jenkins.plugins.docker.DockerTemplate.doProvisionNode Started container ID a96167b9016d2624870640933da629f429ad736a473d690ee70fac3cd97bf211 for node DK_COSCOMMON7_D03-00063jpilpdzc from image: pmsplb-cos-tools.dev.datacard.com:8600/centos7_common:latest
09-Nov-2020 11:43:47.104 INFO Computer.threadPoolForRemoting [#50769] io.jenkins.docker.DockerTransientNode$1.println Stopped container 'a96167b9016d2624870640933da629f429ad736a473d690ee70fac3cd97bf211' for node 'DK_COSCOMMON7_D03-00063jpilpdzc'.
09-Nov-2020 11:43:48.538 INFO Computer.threadPoolForRemoting [#50769] io.jenkins.docker.DockerTransientNode$1.println Removed container 'a96167b9016d2624870640933da629f429ad736a473d690ee70fac3cd97bf211' for node 'DK_COSCOMMON7_D03-00063jpilpdzc'.
I'm not sure where else to look at this point.
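In case it points somewhere useful, here is a rough Script Console sketch (Java-style syntax; getContainerId() is my assumption about the DockerTransientNode getter) that would show whether Jenkins is still holding agent records for containers that have already been removed:

import jenkins.model.Jenkins;
import hudson.model.Computer;
import hudson.model.Node;
import io.jenkins.docker.DockerTransientNode;

// Print every docker agent Jenkins still knows about, with its container ID
// and whether its computer is currently offline.
for (Node n : Jenkins.get().getNodes()) {
    if (n instanceof DockerTransientNode) {
        DockerTransientNode dtn = (DockerTransientNode) n;
        Computer c = n.toComputer();
        System.out.println(n.getNodeName()
                + " container=" + dtn.getContainerId()                        // assumed getter
                + " offline=" + (c == null ? "no computer" : c.isOffline()));
    }
}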
I think I'm seeing a pattern in when I'm getting failures.
It seems like Jenkins will try to queue a job onto an existing docker agent that is already running a different build.
In my environment I have several clouds defined.
I have a node label called "CENTOS7" defined on several clouds, and it gets a unique name on each cloud: "CENTOS7_01", "CENTOS7_02", etc.
What I'm seeing is this:
When I trigger a job for "CENTOS7", it will sometimes attempt to attach to an existing running container for that label, i.e. I will see that my job is waiting for "CENTOS7_01_adsafwer", even though I have several other clouds that are free to run the job (this doesn't always happen; it normally spins up a new container).
When the other job completes, "CENTOS7_01_adsafwer" is deleted as expected.
My triggered job in the queue then reverts to saying it is waiting for the next "CENTOS7" to become available.
At this point it will sometimes start a new container for the job; other times it will just sit there forever.
If it gets stuck and I delete the waiting job from the queue and start another one, it will often then happily start a new container and run the job.
The bigger problem is that this seems to spread, and eventually Jenkins stops starting containers for all jobs.
At this point I can't figure out what is causing this.
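For what it's worth, when a job is stuck like this, a small Script Console sketch like the one below (core Jenkins API only) dumps what each queued item reports it is waiting for, which should confirm the "waiting for the next CENTOS7" state described above:

import jenkins.model.Jenkins;
import hudson.model.Queue;

// Print every queued item, the label it is assigned to, and the reason
// Jenkins gives for it still sitting in the queue.
for (Queue.Item item : Jenkins.get().getQueue().getItems()) {
    System.out.println(item.task.getDisplayName()
            + " label=" + item.getAssignedLabel()
            + " why=" + item.getWhy());
}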