
Build nodes stop responding with DockerContainerWatchdog error

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component/s: docker, docker-plugin
    • Labels: None
    • Environment: jenkinsci/blueocean image, Core version 2.249.1
      Docker plugin 1.2.1
      Docker pipeline 1.24
      Docker commons 1.17
      Node with Docker CE 19.03.13
      Jenkins Master host with Docker CE 19.03.13

      I can build projects by hand individually. For the nightly builds, we throw probably 10-15 projects into the queue at a time, and wait for them to filter through. Upon a recent upgrade of both jenkinsci/blueocean and the plugins, these builds now hang indefinitely and no new builds can be started successfully. Restarting Jenkins master fixes the issue.

      Connections are done over a Docker cloud, using the TCP connection and the "attach Docker container" option.

      The errors on the node hosting the build instances are:

      DockerContainerWatchdog Asynchronous Periodic Work thread is still running. Execution aborted.
      Oct 13, 2020 4:15:41 PM INFO hudson.model.AsyncPeriodicWork doRunDockerContainerWatchdog Asynchronous Periodic Work thread is still running. Execution aborted.
      Oct 13, 2020 4:20:41 PM INFO hudson.model.AsyncPeriodicWork doRunDockerContainerWatchdog Asynchronous Periodic Work thread is still running. Execution aborted.
      Oct 13, 2020 4:25:41 PM INFO hudson.model.AsyncPeriodicWork doRunDockerContainerWatchdog Asynchronous Periodic Work thread is still running. Execution aborted.

      The console logs for each build on the Jenkins master show:

      Started by timer
      Obtained Jenkinsfile from git git@gitlab.company.org:ns/repo.git
      Running in Durability level: MAX_SURVIVABILITY
      [Pipeline] Start of Pipeline
      [Pipeline] node
      Still waiting to schedule task
      'Ubuntu 16.04 Kinetic-0006iph8pkfzg on docker' is offline

       EDIT: Just verified as well that I can overload our build agents and the queue will eventually clear (overloaded by 1 extra build). So maybe it's the number of tasks? I try to push something like 16 builds at the same time, with the bandwidth to handle 4 at a time, and each build takes probably 10 minutes.

          [JENKINS-63999] Build nodes stop responding with DockerContainerWatchdog error

          Matt Wilson added a comment -

          We're seeing a fairly similar problem.  This cropped up in the last 5 or 6 days after using the docker agent model for about 6 months now with no issues.

          (I used to see the aborted message in our logs, but have not seen it in the last week when this problem has started to occur)

          Our issue is that the Jenkins master stops spinning up new container build agents.  I'm not even sure yet where the failure point is in my logs.

          Jenkins LTS 2.249.2

          Docker plugin 1.2.1

          Docker version 19.03.13

          All my plugins are generally up to date.  I usually update them all every week or two.

          Our docker agents are using SSH to connect.


          pjdarton added a comment -

          FYI the "still running" message is a symptom of a problem rather than the problem itself.

          The watchdog process asks each docker daemon in turn to list the containers it's running, and it asks Jenkins about what nodes/agents there are, and then it sets about getting rid of containers that have no nodes and nodes that have no containers.
          If it's "still running" then that suggests that something, somewhere, isn't answering its "tell me everything" questions promptly; that's most likely a docker daemon (in my experience, docker daemons can lock up, requiring a reboot to get them working again; they're now better than they used to be years ago but...) but it could be that Jenkins itself is deadlocked and can't answer.

          Firstly, I'd suggest that you go through ALL your docker cloud configs and ensure that you've got a "Connection Timeout" (Configure Clouds -> Docker Cloud details... -> Advanced... ) and "Read Timeout" set to a sensible number, e.g. 10 (seconds). If your docker daemons are on the end of a distant & slow network connection then maybe 30 seconds might be more reasonable; if your docker daemons are very local and ought to be responsive then a connect timeout of 2 seconds and read timeout of 5 might be more reasonable.
          This will ensure that the docker plugin is never left "waiting forever" for an answer from a docker daemon - it'll stop waiting and complain instead.
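          As a rough sketch of how to audit that across every cloud in one go, here's some Java-style code that can be pasted (with minor adaptation) into the Jenkins script console; note that DockerCloud.getDockerApi() and the getConnectTimeout()/getReadTimeout() accessors are assumed to match your docker-plugin version:

          import com.nirima.jenkins.plugins.docker.DockerCloud;
          import hudson.slaves.Cloud;
          import io.jenkins.docker.client.DockerAPI;
          import jenkins.model.Jenkins;

          // List every docker cloud and its timeouts; a value of 0 generally means
          // "no timeout", i.e. a hung docker daemon can block the caller forever.
          for (Cloud cloud : Jenkins.get().clouds) {
              if (cloud instanceof DockerCloud) {
                  DockerAPI api = ((DockerCloud) cloud).getDockerApi();    // assumed accessor
                  System.out.println(cloud.name
                          + ": connectTimeout=" + api.getConnectTimeout()  // assumed accessor
                          + "s, readTimeout=" + api.getReadTimeout() + "s");
              }
          }

          Anything reporting 0 there is worth setting via Configure Clouds as described above.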

          Secondly, if that doesn't fix it, next time it happens, ask Jenkins for a full thread-dump of what's going on, and reports all the threads involving the docker plugin here - that'll let folks (maybe me, maybe someone else) figure out where things got stuck, which may (in turn) help folks figure out why they're stuck and how to un-stick them.

          ...but I'd start by ensuring that you've got connection & read timeouts defined on all your docker clouds.
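          For reference, the built-in $JENKINS_URL/threadDump page (or jstack <pid>) gives that full dump; the following minimal standalone Java sketch uses the standard ThreadMXBean API to show roughly the same information, filtered to threads whose stacks mention docker:

          import java.lang.management.ManagementFactory;
          import java.lang.management.ThreadInfo;
          import java.lang.management.ThreadMXBean;

          public class DumpDockerThreads {
              public static void main(String[] args) {
                  // Capture every live thread together with its lock information and
                  // print only the ones whose stack mentions "docker" - roughly the
                  // subset of threads requested above.
                  ThreadMXBean mx = ManagementFactory.getThreadMXBean();
                  for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                      StackTraceElement[] stack = info.getStackTrace();
                      boolean mentionsDocker = false;
                      for (StackTraceElement frame : stack) {
                          if (frame.getClassName().toLowerCase().contains("docker")) {
                              mentionsDocker = true;
                              break;
                          }
                      }
                      if (!mentionsDocker) {
                          continue;
                      }
                      System.out.printf("\"%s\" Id=%d %s%n",
                              info.getThreadName(), info.getThreadId(), info.getThreadState());
                      for (StackTraceElement frame : stack) {
                          System.out.println("    at " + frame);
                      }
                      System.out.println();
                  }
              }
          }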


          Kevin Broselge added a comment - edited

          We're hitting exactly the same issue: nearly every day our cloud workers get stuck and do not start new jobs.

          Here is a thread dump in this case:

          Thread dump [Jenkins].zip

          It seems like many threads are blocked by jenkins.util.Timer [#2]:

          "jenkins.util.Timer [#2]" Id=37 Group=main WAITING on java.util.concurrent.CountDownLatch$Sync@131c3f8d 
          at java.base@11.0.15/jdk.internal.misc.Unsafe.park(Native Method) 
          - waiting on java.util.concurrent.CountDownLatch$Sync@131c3f8d 
          at java.base@11.0.15/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) 
          at java.base@11.0.15/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885) 
          at java.base@11.0.15/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039) 
          at java.base@11.0.15/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345) 
          at java.base@11.0.15/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232) 
          at com.github.dockerjava.api.async.ResultCallbackTemplate.awaitCompletion(ResultCallbackTemplate.java:91) 
          at com.github.dockerjava.netty.NettyInvocationBuilder$ResponseCallback.awaitResult(NettyInvocationBuilder.java:58) 
          at com.github.dockerjava.netty.NettyInvocationBuilder.get(NettyInvocationBuilder.java:150) 
          at com.github.dockerjava.core.exec.ListContainersCmdExec.execute(ListContainersCmdExec.java:44) 
          at com.github.dockerjava.core.exec.ListContainersCmdExec.execute(ListContainersCmdExec.java:15) 
          at com.github.dockerjava.core.exec.AbstrSyncDockerCmdExec.exec(AbstrSyncDockerCmdExec.java:21) 
          at com.github.dockerjava.core.command.AbstrDockerCmd.exec(AbstrDockerCmd.java:35) 
          at com.nirima.jenkins.plugins.docker.DockerCloud.countContainersInDocker(DockerCloud.java:631) 
          at com.nirima.jenkins.plugins.docker.DockerCloud.canAddProvisionedAgent(DockerCloud.java:649) 
          at com.nirima.jenkins.plugins.docker.DockerCloud.provision(DockerCloud.java:358)
          - locked com.nirima.jenkins.plugins.docker.DockerCloud@751b743d 
          at hudson.slaves.Cloud.provision(Cloud.java:210) 
          at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:727) 
          at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:326) 
          at hudson.slaves.NodeProvisioner.lambda$suggestReviewNow$4(NodeProvisioner.java:198) 
          at hudson.slaves.NodeProvisioner$$Lambda$471/0x0000000840dca840.run(Unknown Source) 
          at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67) 
          at java.base@11.0.15/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          at java.base@11.0.15/java.util.concurrent.FutureTask.run(FutureTask.java:264) 
          at java.base@11.0.15/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) 
          at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
          at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
          at java.base@11.0.15/java.lang.Thread.run(Thread.java:829)
          
          Number of locked synchronizers = 2
           - java.util.concurrent.ThreadPoolExecutor$Worker@630769bf
           - java.util.concurrent.locks.ReentrantLock$NonfairSync@3be63ee0

           

           

          Connection and Read Timeouts are set to 60s.

          I hope this helps

           


          pjdarton added a comment -

          I'm not in a position to help fix anything (as I don't use this plugin anymore) but I can read the stacktrace and explain what you're seeing.

          According to that snippet, Jenkins is asking the cloud plugins to provide more agents (hudson.slaves.Cloud.provision), it's asking the docker-plugin to provision (i.e. create) a new Jenkins agent (com.nirima.jenkins.plugins.docker.DockerCloud.provision), and the docker-plugin (com.nirima.jenkins.plugins.docker.DockerCloud.countContainersInDocker) is asking the docker daemon to list all the containers (com.github.dockerjava.core.exec.ListContainersCmdExec.execute) that the docker-plugin has started so that it can decide whether or not it's allowed to make any more (it all depends on whether you've set a limit on the total number of containers and/or a limit on the number of containers for a particular template).

          ...but in this case, for some reason, the docker daemon hasn't answered - the request has gone from Jenkins into the docker daemon but the response from the docker daemon has not been received by Jenkins yet, which is why that thread is still waiting for its response.
          In theory, it should wait for the read timeout period before giving up ... but the netty transport doesn't have timeouts on all conditions so it's possible that, despite the docker-plugin specifying a 60-second timeout, it might wait forever.
          (FYI there is some work in progress to change the docker-plugin's transport to a more modern / better-tested one, but it's non-trivial)

           

          My guess is that, if you went onto the machine which was running that docker daemon and asked it to do "docker ps -a", it would probably fail to say anything at the command line too, i.e. the docker service itself has likely crashed.

          If that happens, the cure is to restart the docker daemon ... which (if it's that far gone) often requires a reboot of the machine that's running the docker daemon.
          ...but it's probably worth taking a look at the machine's logs in case there's anything useful that was logged before it all went wrong; IME a lack of memory is a common problem - numerous applications allocate themselves a certain percentage of the host machine's RAM (e.g. Java defaults to grabbing 25% of the host's RAM, IBM db2 defaults to grabbing 90%, etc.) and this strategy isn't nice if you want to run e.g. 5 identical containers each running a JVM: each JVM tries to use 1/4 of the host OS's RAM - not 1/4 of its container's fair 20% share, but 1/4 of the host's total RAM - so those 5 JVMs together would try to use 125% of the host's RAM.
          IME not everything responds well to the oom-killer running, and sometimes not even the oom-killer can save the host OS from a lack of memory (especially as Jenkins will just ask for more containers to be started).
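          As a small illustration of that default (the ~25% figure is the usual modern JVM default, but it varies by JVM version and flags), this standalone Java program prints the heap cap the JVM picked for itself:

          public class ShowDefaultHeap {
              public static void main(String[] args) {
                  // With no explicit -Xmx, a modern JVM typically caps its heap at ~25%
                  // of the RAM it can see. In a container with no memory limit that is
                  // 25% of the *host's* RAM, so 5 such containers can together ask for
                  // ~125% of the host - exactly the overcommit described above.
                  double gib = Runtime.getRuntime().maxMemory() / (1024.0 * 1024 * 1024);
                  System.out.printf("Default max heap: %.2f GiB%n", gib);
                  // Illustrative mitigations: start containers with a memory limit
                  // (e.g. docker run --memory=2g) so the container-aware JVM sizes
                  // itself against the cgroup limit, or pass an explicit -Xmx or
                  // -XX:MaxRAMPercentage to each JVM.
              }
          }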

           

          ...and if it's not a crashed docker daemon then it's likely a communication problem that could be fixed by switching the transport from netty (which is rather "unloved" by the docker-java code) to Apache http (which has better quality support in docker-java).
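          For context, docker-java's Apache HttpClient 5 transport takes explicit connect/response timeouts. Below is a minimal standalone sketch of the same "list all containers" call that was stuck in the thread dump above; the docker host URI and timeout values are placeholders, and the class names are from docker-java's transport-httpclient5 module:

          import java.time.Duration;
          import java.util.List;

          import com.github.dockerjava.api.DockerClient;
          import com.github.dockerjava.api.model.Container;
          import com.github.dockerjava.core.DefaultDockerClientConfig;
          import com.github.dockerjava.core.DockerClientImpl;
          import com.github.dockerjava.httpclient5.ApacheDockerHttpClient;

          public class ListContainersWithTimeouts {
              public static void main(String[] args) {
                  // Placeholder docker host; use unix:///var/run/docker.sock for a local daemon.
                  DefaultDockerClientConfig config = DefaultDockerClientConfig
                          .createDefaultConfigBuilder()
                          .withDockerHost("tcp://docker-host.example.org:2376")
                          .build();

                  // The Apache transport enforces both connect and response timeouts,
                  // so a wedged daemon makes this call fail instead of hanging forever.
                  ApacheDockerHttpClient httpClient = new ApacheDockerHttpClient.Builder()
                          .dockerHost(config.getDockerHost())
                          .sslConfig(config.getSSLConfig())
                          .connectionTimeout(Duration.ofSeconds(5))
                          .responseTimeout(Duration.ofSeconds(30))
                          .build();

                  DockerClient client = DockerClientImpl.getInstance(config, httpClient);

                  // The same "list all containers" query that countContainersInDocker
                  // and the watchdog were blocked on in the thread dump above.
                  List<Container> containers = client.listContainersCmd().withShowAll(true).exec();
                  System.out.println("Containers visible to the daemon: " + containers.size());
              }
          }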


          Kevin Broselge added a comment -

          Thanks for your interest!

          The daemon is still running, and a simple Jenkins master restart fixes the problem. There is also nothing special in the logs. Memory shouldn't be a problem here (it gets stuck after the maximum number of nodes has been used).

          We only set a limit on the total container count.

          Is it possible to switch communication from netty to Apache http via configuration? Or do you know the time horizon for the changes?

           


          pjdarton added a comment -

          It's not a configuration thing; it's a code-change thing.

          There's a PR for the changes https://github.com/jenkinsci/docker-plugin/pull/900 ... but those changes depended on https://github.com/jenkinsci/docker-java-api-plugin/pull/26, which delayed things (a lot), and that meant that the plugin code moved on and now there are merge issues to be resolved.
          (and now I no longer use the plugin at work so I can't push things forward anymore myself; the plugin is "up for adoption" in the hope that someone else is willing to take it on)

          Hmm, if it's not a crashed daemon then there's probably something else locking up Jenkins; there are some big locks on the Jenkins model, it gets locked a lot, and sometimes while "things that require async communications" get processed (plugins should not do that, but a lot do, and it's fine until things don't complete as expected).
          Years ago, I did a lot of work to minimise the amount of time the docker-plugin kept things "locked" (as well as adding the timeout configuration to cope with truculent docker daemons) but you will (still) find that Jenkins becomes "very unwell" if something is locking the model for long periods of time (e.g. builds run to completion but aren't marked as "complete" on the main UI etc) ... but that doesn't guarantee that it's the docker-plugin causing it - it might be something else.
          i.e. it might be that the cause of the trouble isn't the docker-plugin itself, it might be that the docker-plugin's woes are just a symptom of that.


          Mark Waite added a comment -

          The pull requests mentioned by pjdarton have been merged and released; https://github.com/jenkinsci/docker-plugin/pull/934 is the pull request.


          Kevin Broselge added a comment -

          I cannot reproduce the issue anymore!


            Assignee: Carlos Sanchez (csanchez)
            Reporter: Zach LaCelle (zlacelle)
            Votes: 5
            Watchers: 10