Resolution: Fixed
durable-task 1.30
The durable-task plugin runs a wrapper process which redirects the user process's stdout/stderr to a file and writes its exit code to another file. Thus there is no need for the agent JVM to hold onto a process handle for the wrapper; it should be fork-and-forget. In fact the Proc is discarded.
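A minimal sketch of the fork-and-forget idea described above, not the actual command line that BourneShellScript generates. The control directory layout and the file names (script.sh, log.txt, result.txt) are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical control directory; the real plugin picks its own layout.
ctl=$(mktemp -d)

# Stand-in for the user's script: prints a line, runs briefly, exits with 3.
cat > "$ctl/script.sh" <<'EOF'
echo "hello from user process"
sleep 1
exit 3
EOF

# Wrapper: run the user script with stdout/stderr sent to a log file, then
# record the exit code. The result file is written under a temporary name and
# renamed so that a reader never sees a half-written file.
# The wrapper is backgrounded and detached from the launcher's stdio, so the
# launcher does not need to keep a handle on it.
(
  sh "$ctl/script.sh" > "$ctl/log.txt" 2>&1
  echo $? > "$ctl/result.txt.tmp"
  mv "$ctl/result.txt.tmp" "$ctl/result.txt"
) < /dev/null > /dev/null 2>&1 &

echo "launcher returned while the user process is still running"

# Later, whoever is interested polls the files instead of waiting on a process.
while [ ! -f "$ctl/result.txt" ]; do sleep 1; done
code=$(cat "$ctl/result.txt")
log=$(cat "$ctl/log.txt")
echo "exit code: $code"
echo "log: $log"
```

The point of the sketch is that nothing after the `&` depends on a live process handle: completion is detected purely from the result file.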
Unfortunately, the current implementation in BourneShellScript does not actually allow the Proc to exit until the user process also exits. On a regular agent this does not matter much. But when you run sh steps inside a container on a Kubernetes agent, ContainerExecDecorator and ContainerExecProc keep a WebSocket open for the duration of the launched process. This consumes resources on the Kubernetes API server, and it is possible to run out of connections. It also consumes three master-side Java threads per sh, like
"OkHttp http://…/..." #361 prio=5 os_prio=0 tid=… nid=… runnable […]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at okio.Okio$2.read(Okio.java:140)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
    at okio.RealBufferedSource.request(RealBufferedSource.java:68)
    at okio.RealBufferedSource.require(RealBufferedSource.java:61)
    at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
    at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"OkHttp WebSocket http://…/..." #359 prio=5 os_prio=0 tid=… nid=… waiting on condition […]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <…> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"pool-73-thread-1" #358 prio=5 os_prio=0 tid=… nid=… waiting on condition […]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at io.fabric8.kubernetes.client.utils.NonBlockingInputStreamPumper.run(NonBlockingInputStreamPumper.java:57)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
To see the problem, you can run
while (true) {
    podTemplate(label: BUILD_TAG, containers: [containerTemplate(name: 'ubuntu', image: 'ubuntu', command: 'sleep', args: 'infinity')]) {
        node (BUILD_TAG) {
            container('ubuntu') {
                branches = [:]
                // TODO cannot use collectEntries because: java.io.NotSerializableException: groovy.lang.IntRange
                for (int x = 0; x < 1000; x += 5) {
                    def _x = x
                    branches["sleep$x"] = {
                        sleep time: _x, unit: 'SECONDS'
                        sh """set +x; while :; do echo -n "$_x "; date; sleep 10; done"""
                    }
                }
                parallel branches
            }
        }
    }
}
and watch via
while :; do jstack $pid | fgrep '"' | sort | egrep -i 'ok|pool' > /tmp/x; clear; cat /tmp/x; sleep 5; done
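To put numbers on the watch output instead of eyeballing it, the thread names can be tallied. The following is only a sketch: the helper name count_threads is hypothetical, the name patterns are taken from the three threads in the dump above, and the "pool" count may include unrelated pool threads in a real Jenkins JVM. The sample input is made up for illustration.

```shell
#!/bin/sh
# count_threads (hypothetical helper): read a jstack-style dump on stdin and
# tally the three kinds of thread names that appear in the dump above.
count_threads() {
  grep '^"' | awk '
    /^"OkHttp WebSocket /           { ws++;   next }  # WebSocket writer thread
    /^"OkHttp /                     { http++; next }  # WebSocket reader thread
    /^"pool-[0-9]+-thread-[0-9]+"/  { pool++; next }  # generic pool threads (may over-count)
    END { printf "okhttp=%d websocket=%d pool=%d\n", http, ws, pool }
  '
}

# Against a live JVM this would be used as:
#   jstack "$pid" | count_threads

# Made-up sample so the helper can be tried without a running Jenkins.
sample='"OkHttp http://example.invalid/..." #361 prio=5
   java.lang.Thread.State: RUNNABLE
"OkHttp WebSocket http://example.invalid/..." #359 prio=5
"pool-73-thread-1" #358 prio=5
"main" #1 prio=5'

summary=$(printf '%s\n' "$sample" | count_threads)
echo "$summary"
```

If each running sh step really holds three threads, the three counters should rise and fall together as the parallel branches start and finish.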
- causes
JENKINS-58656 Wrapper process leaves zombie when no init process present
- Open
JENKINS-59668 Run wrapper process in the background fails with the latest changes
- Closed
JENKINS-65984 BourneShellScript background process should be an option for sh step
- Open
- is duplicated by
JENKINS-56939 pipeline gets stuck: root cause unclear
- Resolved
JENKINS-58463 Job build failed by "Interrupted while waiting for websocket connection, you should increase the Max connections to Kubernetes API"
- Resolved
- relates to
JENKINS-25503 Use setsid instead of nohup
- Resolved
The fix for this issue causes one of our pipelines to get stuck. Passing
parameter reverts to the old behavior, so I suspect that running the sh build step in a subshell is the cause.
Pipeline code is:
It never gets to the "test" stage; it seems that the "sh" step in the "build" stage never ends.
If needed, I'll gladly provide more information.