Status: Closed
Jenkins 2.73.3, OpenJDK 8u151, Ubuntu 16.04, Ubuntu 14.04.
After updating to 2.73.3, jobs now get randomly stuck in the queue and Jenkins says that it doesn't have label .. I can see some slave nodes (containers) come online for a split second and then disappear, but the job(s) stay stuck in the queue forever. The problem is that I don't see anything out of the ordinary in the Jenkins logs.
Downgrading to 2.73.2 and recreating the config.xml (global config file) seems to fix the issue for us.
P.S.: What's even weirder is that some jobs run while others get stuck forever (sometimes).
I'll try to upgrade again and see if it happens again, but this definitely isn't a one-off issue. Here's why:
We have 9 Jenkins masters in total (spread across different regions; some are even in Frankfurt and China). 5 of those are installed through APT, and 4 are just jars run by Tomcat.
Some are installed on Ubuntu 16.04 and some on 14.04, so there is a good variety across all of those masters.
By the way, any chance you updated Docker Plugin to 1.0 during the upgrade to 2.73.3?
We updated to 1.0.4 (from 0.16.2) and updated docker-commons to 1.9 (from 1.8) while we were on 2.73.2, and we had no problems there.
By the way, here are the detailed steps we took to get things working again:
- Downgraded to 2.73.2 (we were still experiencing this issue).
- Then, we downgraded docker-plugin to 0.16.2 and docker-commons to 1.8 (we were still experiencing this issue).
- Finally, we recreated the config.xml file (then everything started working normally again).
I have two Jenkins masters that I upgraded simultaneously, one works, one doesn't. Both are running Jenkins 2.90 and the latest plugins (1.0.4/1.9, and in fact all other plugins are on latest as of today), and both are using the same Docker cloud with more or less the same config.xml
The one that doesn't work does not log anything docker/cloud/provisioning related at all (in /log/all), as if it isn't happening.
Update: both of my Jenkins masters are now correctly provisioning Docker slaves, for no obvious reason. Here is what I did:
- Set "Idle Timeout=1" for each Docker Template under Experimental Options (instead of the default value of 0).
- Restarted the master.
I don't know if step 1 was really necessary. I saved the main configuration (Apply) several times in the meantime.
On a side note, I see lots of these in the logs after each run, even though the jobs succeed (I'm using JNLP):
Nov 20, 2017 5:46:22 AM jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed
WARNING: NioChannelHub keys=133 gen=41087: Computer.threadPoolForRemoting [#1] for xx-docker-swarm-01-760c11ed terminated
java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@2fabfc23[name=Channel to /xxx.xxx.xxx.xxx]
	at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:216)
	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:646)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:142)
	at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:359)
	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:570)
	... 6 more
OK, so it seems the idle timeout defaulting to 0 minutes just kills your agent before it gets assigned to run your job.
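The race described above can be illustrated with a small, hypothetical sketch. This is not the actual docker-plugin code; the method and parameter names are assumptions made only to show why an idle timeout of 0 minutes reaps an agent the instant it comes online:

```java
// Hypothetical sketch of a retention-strategy idle check (not the real
// docker-plugin implementation; names are made up for illustration).
public class IdleCheck {
    // Returns true if the agent should be terminated.
    // With idleMinutes == 0, an agent that just came online has already
    // been "idle for at least 0 minutes", so it is killed before the
    // queued job can be assigned to it.
    static boolean shouldTerminate(long idleMillis, int idleMinutes) {
        return idleMillis >= idleMinutes * 60_000L;
    }

    public static void main(String[] args) {
        System.out.println(shouldTerminate(0, 0)); // freshly online, timeout 0: reaped immediately
        System.out.println(shouldTerminate(0, 1)); // timeout 1 minute: survives long enough to take a job
    }
}
```

This matches the observed symptom: containers flash online for a split second and vanish, while the job stays queued.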
Switching this issue to Minor, as this is more of a UI/UX issue.
Code changed in jenkins
User: Nicolas De Loof
use default timeout of 10 minutes to avoid
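The fix referenced in the commit message can be sketched as a fallback that treats a zero/unset timeout as 10 minutes. Only the 10-minute default comes from the commit; the method name below is hypothetical:

```java
// Hypothetical sketch of the fix: fall back to a sane default when the
// configured idle timeout is 0 (names are illustrative, not plugin code).
public class IdleTimeout {
    static final int DEFAULT_IDLE_MINUTES = 10;

    // Treat 0 (the old default) as "unset" so freshly provisioned agents
    // are not reaped before they can pick up a job.
    static int effectiveIdleMinutes(int configured) {
        return configured <= 0 ? DEFAULT_IDLE_MINUTES : configured;
    }

    public static void main(String[] args) {
        System.out.println(effectiveIdleMinutes(0)); // falls back to 10
        System.out.println(effectiveIdleMinutes(5)); // explicit values are kept
    }
}
```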
Other than the channel pinger, there seems to be nothing in 2.73.3 that would explain this.
Could you try upgrading again to see whether the problem reoccurs after you've reset the configuration and downgraded, or whether it was a one-off issue?
Do you still have the logs from the 2.73.3 run? Any error messages there could point to a specific problem.