
Jobs stuck in queue "Jenkins doesn't have label ..."

      After updating to 2.73.3, jobs now get randomly stuck in the queue and Jenkins says that it doesn't have label ... I can see some slave nodes (containers) come online for a split second and then disappear, but the job(s) then stay stuck in the queue forever. The problem is that I do not see anything out of the ordinary in the Jenkins logs.

       

      Downgrading to 2.73.2 and recreating the config.xml (global config file) seems to fix the issue for us.

       

      P.S.: What's even weirder is that (sometimes) some jobs run while others get stuck forever.
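
      A minimal Script Console sketch (read-only, core Jenkins APIs only) that prints every queued item together with the reason Jenkins reports for holding it, e.g. "Jenkins doesn't have label ...":

          // Manage Jenkins » Script Console
          import jenkins.model.Jenkins

          Jenkins.instance.queue.items.each { item ->
              // getWhy() returns the human-readable cause of blockage shown in the queue UI
              println "${item.task.fullDisplayName} -> ${item.why}"
          }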

          [JENKINS-47953] Jobs stuck in queue "Jenkins doesn't have label ..."

          Daniel Beck added a comment -

          Other than the channel pinger, there seems to be nothing in 2.73.3 that would explain this.

          Could you try upgrading again, now that you've reset the configuration and downgraded, to see whether the problem recurs or whether it was a one-off issue?

          Do you still have the logs from the 2.73.3 run? Any error messages in them could point to a specific problem.

          CC oleg_nenashev


          Fadi Farah added a comment - edited

          I'll try upgrading again and see if it happens again, but this definitely isn't a one-off issue; here's why:

          We have 9 Jenkins masters in total (spread across different regions; some are even in Frankfurt and China). 5 of those are installed through APT, and 4 are just jars run by Tomcat.

          Some are installed on Ubuntu 16.04 and some on 14.04, so there is a good variety across all of those masters.


          Oleg Nenashev added a comment -

          By the way, any chance you updated Docker Plugin to 1.0 during the upgrade to 2.73.3?


          Fadi Farah added a comment - edited

          We had updated docker-plugin to 1.0.4 (from 0.16.2) and docker-commons to 1.9 (from 1.8) while we were still on 2.73.2, and we had no problems then.

          Btw, here are the detailed steps we took to get things working again (a quick way to confirm the resulting plugin versions is sketched after this list):

          • Downgraded to 2.73.2 (we were still experiencing this issue).
          • Then, we downgraded docker-plugin to 0.16.2 and docker-commons to 1.8 (we were still experiencing this issue).
          • Finally, we recreated the config.xml file (then everything started working normally again).
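
          A small Script Console sketch to confirm which plugin versions are actually active after such an upgrade/downgrade; the short names docker-plugin and docker-commons are assumptions based on the usual plugin IDs:

              import jenkins.model.Jenkins

              // Print the installed versions of the two plugins discussed in this issue.
              ['docker-plugin', 'docker-commons'].each { name ->
                  def plugin = Jenkins.instance.pluginManager.getPlugin(name)
                  println "${name}: ${plugin ? plugin.version : 'not installed'}"
              }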


          Alexander Komarov added a comment -

          I have two Jenkins masters that I upgraded simultaneously; one works, one doesn't. Both are running Jenkins 2.90 and the latest plugins (1.0.4/1.9, and in fact all other plugins are on latest as of today), and both use the same Docker cloud with more or less the same config.xml.

          The one that doesn't work does not log anything docker/cloud/provisioning related at all (in /log/all), as if provisioning simply isn't happening.
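
          When a master logs nothing provisioning-related at all, it can help to raise the relevant log levels explicitly. The sketch below uses plain java.util.logging from the Script Console; the logger names are assumptions based on the package visible in this issue (com.nirima.jenkins.plugins.docker) plus Jenkins core's hudson.slaves.NodeProvisioner. A Log Recorder under Manage Jenkins » System Log with the same names is the more durable equivalent.

              import java.util.logging.Level
              import java.util.logging.Logger

              // Raise the level so provisioning decisions actually show up in the logs.
              ['com.nirima.jenkins.plugins.docker', 'hudson.slaves.NodeProvisioner'].each { name ->
                  Logger.getLogger(name).setLevel(Level.FINE)
                  println "Logger ${name} set to FINE"
              }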


          Alexander Komarov added a comment -

          Update: both of my Jenkins masters are now correctly provisioning Docker slaves, for no obvious reason. Here is what I did:

          1. "Idle Timeout=1" for each Docker Template under Experimental Options (instead of the default value 0)
          2. Restarted the master.

          I don't know if step 1 was really necessary.  I saved the main configuration (Apply) several times in the meantime.

          On a side note, there are lots of these in the logs after each run, even though the jobs succeed (I'm using JNLP):

           

          Nov 20, 2017 5:46:22 AM jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed
          WARNING: NioChannelHub keys=133 gen=41087: Computer.threadPoolForRemoting [#1] for xx-docker-swarm-01-760c11ed terminated
          java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@2fabfc23[name=Channel to /xxx.xxx.xxx.xxx]
           at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:216)
           at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:646)
           at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:748)
          Caused by: java.io.IOException: Connection reset by peer
           at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
           at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
           at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
           at sun.nio.ch.IOUtil.read(IOUtil.java:197)
           at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
           at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:142)
           at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:359)
           at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:570)
           ... 6 more

          Nicolas De Loof added a comment -

          OK, so it seems the idle timeout defaulting to 0 minutes just kills your agent before it gets assigned to run your job.

          Switching this issue to Minor, as this is more of a UI/UX issue.
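
          To check what retention settings the currently provisioned agents actually carry, a Script Console sketch like the following can help. The idleMinutes property access is duck-typed on purpose, since the exact getter depends on the docker-plugin version; treat it as a diagnostic sketch rather than an official API.

              import hudson.model.Slave
              import jenkins.model.Jenkins

              // For every agent, print its retention strategy and (if exposed) its idle timeout.
              Jenkins.instance.nodes.findAll { it instanceof Slave }.each { node ->
                  def strategy = ((Slave) node).retentionStrategy
                  def idle = strategy?.hasProperty('idleMinutes') ? strategy.idleMinutes : 'n/a'
                  println "${node.nodeName}: ${strategy?.getClass()?.simpleName} (idleMinutes=${idle})"
              }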


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Nicolas De Loof
          Path:
          src/main/resources/com/nirima/jenkins/plugins/docker/strategy/DockerCloudRetentionStrategy/config.groovy
          src/main/resources/com/nirima/jenkins/plugins/docker/strategy/DockerCloudRetentionStrategy/help-idleMinutes.html
          src/main/resources/com/nirima/jenkins/plugins/docker/strategy/DockerOnceRetentionStrategy/config.groovy
          http://jenkins-ci.org/commit/docker-plugin/2d0dda5a20401c34ad25dbf8df2b2948365b4f8e
          Log:
          use default timeout of 10 minutes to avoid JENKINS-47953

