Jenkins / JENKINS-58290

WebSocket / OkHttp thread leak from BourneShellScript + ContainerExecDecorator


Details

    • Released As: durable-task 1.30

    Description

      The durable-task plugin runs a wrapper process which redirects the user process' stdout/err to a file and sends its exit code to another file. Thus there is no need for the agent JVM to hold onto a process handle for the wrapper; it should be fork-and-forget. In fact the Proc is discarded.

      Unfortunately, the current implementation in BourneShellScript does not actually allow the Proc to exit until the user process also exits. On a regular agent this does not matter much. But when you run sh steps inside container blocks on a Kubernetes agent, ContainerExecDecorator and ContainerExecProc keep a WebSocket open for the duration of the launched process. This consumes resources on the Kubernetes API server; it is possible to run out of connections. It also consumes three master-side Java threads per sh step, like

      "OkHttp http://…/..." #361 prio=5 os_prio=0 tid=… nid=… runnable […]
         java.lang.Thread.State: RUNNABLE
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
              at java.net.SocketInputStream.read(SocketInputStream.java:171)
              at java.net.SocketInputStream.read(SocketInputStream.java:141)
              at okio.Okio$2.read(Okio.java:140)
              at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
              at okio.RealBufferedSource.request(RealBufferedSource.java:68)
              at okio.RealBufferedSource.require(RealBufferedSource.java:61)
              at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
              at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
              at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
              at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
              at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
              at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
              at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      
      "OkHttp WebSocket http://…/..." #359 prio=5 os_prio=0 tid=… nid=… waiting on condition […]
         java.lang.Thread.State: TIMED_WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <…> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
              at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
              at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
              at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      
      "pool-73-thread-1" #358 prio=5 os_prio=0 tid=… nid=… waiting on condition […]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at io.fabric8.kubernetes.client.utils.NonBlockingInputStreamPumper.run(NonBlockingInputStreamPumper.java:57)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      

      To see the problem, you can run

      while (true) {
          podTemplate(label: BUILD_TAG, containers: [containerTemplate(name: 'ubuntu', image: 'ubuntu', command: 'sleep', args: 'infinity')]) {
              node (BUILD_TAG) {
                  container('ubuntu') {
                      branches = [:] // TODO cannot use collectEntries because: java.io.NotSerializableException: groovy.lang.IntRange
                      for (int x = 0; x < 1000; x += 5) {
                          def _x = x
                          branches["sleep$x"] = {
                              sleep time: _x, unit: 'SECONDS'
                              sh """set +x; while :; do echo -n "$_x "; date; sleep 10; done"""
                          }
                      }
                      parallel branches
                  }
              }
          }
      }
      

      and watch via

      while :; do jstack $pid | fgrep '"' | sort | egrep -i 'ok|pool' > /tmp/x; clear; cat /tmp/x; sleep 5; done
      

      Attachments

      Issue Links

      Activity

            jglick Jesse Glick created issue -
            jglick Jesse Glick made changes -
            Field Original Value New Value
            Status Open [ 1 ] In Progress [ 3 ]
            jglick Jesse Glick made changes -
            Remote Link This issue links to "CloudBees-internal issue (Web Link)" [ 23142 ]
            jglick Jesse Glick made changes -
            Remote Link This issue links to "durable-task #98 (Web Link)" [ 23143 ]
            jglick Jesse Glick made changes -
            Remote Link This issue links to "kubernetes #522 (Web Link)" [ 23144 ]
            jglick Jesse Glick made changes -
            Status In Progress [ 3 ] In Review [ 10005 ]
            jglick Jesse Glick made changes -
            Link This issue relates to JENKINS-25503 [ JENKINS-25503 ]
            jglick Jesse Glick made changes -
            Released As durable-task 1.30
            Resolution Fixed [ 1 ]
            Status In Review [ 10005 ] Resolved [ 5 ]
            jglick Jesse Glick made changes -
            Link This issue is duplicated by JENKINS-56939 [ JENKINS-56939 ]
            jglick Jesse Glick made changes -
            Link This issue is duplicated by JENKINS-58463 [ JENKINS-58463 ]
            daniel_jeznach_img Daniel Jeznach added a comment - edited

            Hello,
            the fix for this issue causes one of our pipelines to get stuck. Passing

            -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true
            

            reverts to the old behavior, so I suspect that running the sh build step in a subshell is the cause.
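            For reference, a Java system property like this one is passed to the Jenkins controller JVM at startup, e.g. (the jenkins.war path is illustrative and depends on the installation):

```shell
# Set the durable-task switch as a -D system property when launching Jenkins
# (path to jenkins.war is illustrative; service installs set JVM flags in
# their own config instead).
java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true -jar jenkins.war
```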

            Pipeline code is:

            pipeline {
                agent { label "xxxxxx" + (env.JOB_NAME.find(/-(u14|u16|u18|w7|w10)(?=-)/) ?:'') }
                
                options {
                    timeout(time: 60, unit: 'MINUTES')
                }
                    
                environment {
                    TMP = "${env.WORKSPACE}/tmp"
                    PROJECT_NAME = "${env.JOB_NAME}".find(/^[^-]+/)
                    PROJECT_BRANCH = "${env.JOB_NAME}".find(/(?<=^[^-]+-)[^-]+/)
                    PROJECT_RELEASE = "${env.PROJECT_BRANCH}".replace('MAIN', '9999rc')
                    P4_BRANCH = "REL/${env.PROJECT_BRANCH}".replace('REL/MAIN', 'MAIN')
                }
                
                parameters {
                    string(name: 'change', defaultValue: '', description: 'Changelist to build.', trim: true)
                    string(name: 'status', defaultValue: '', description: 'Changelist status (shelved/submitted).', trim: true)
                    string(name: 'review', defaultValue: '', description: 'Helix Swarm review number.', trim: true)
                    string(name: 'pass', defaultValue: '', description: 'Success callback URL.', trim: true)
                    string(name: 'fail', defaultValue: '', description: 'Failure callback URL.', trim: true)
                }
                
                triggers {
                    pollSCM('@midnight')
                }
                
                stages {
                    stage('sync') {
                        steps {
                            p4sync(
                                credential: 'xxxxx',
                                format: 'xxxx.xxxx-${NODE_NAME}-${JOB_NAME}-${EXECUTOR_NUMBER}_ws',
                                populate: forceClean(),
                                source: depotSource("//path/${env.P4_BRANCH}/...")
                            )
                    
                            buildName "${env.P4_CHANGELIST} ${params.status?:''} ${params.change?:''}".trim()
                            sh '''sed 's/\\\$Change: [0-9]\\+ \\\$/\$Change: '$P4_CHANGELIST' \$/' -i xxxxx/version.py'''
                                
                            script {
                                def dot = env.PROJECT_RELEASE.find(/rc$/) ? '' : '.'
                                env.PACKAGE_NAME = "${env.PROJECT_NAME}-${env.PROJECT_RELEASE}${dot}${env.P4_CHANGELIST}.tar.gz"
                            }
                            echo "PROJECT_NAME = $PROJECT_NAME"
                            echo "PROJECT_BRANCH = $PROJECT_BRANCH"
                            echo "PROJECT_RELEASE = $PROJECT_RELEASE"
                            echo "P4_BRANCH = $P4_BRANCH"
                            echo "PACKAGE_NAME = $PACKAGE_NAME"
                        }
                    }
                    stage('build') {
                        steps { 
                            sh '''
                            #!/bin/bash -xe
                    
                            # Clean up possible orphans from other test sessions
                            pkill -P1 -U $(id -u)  || true
                            
                            mkdir $TMP
                            
                            virtualenv venv
                            . venv/bin/activate
                            
                            python setup.py sdist
                            pip install dist/${PROJECT_NAME}-${PROJECT_RELEASE}*.tar.gz
                            xxxxxx master --print-requirements | xargs pip install
                            pip install -r tests/test_requirements.txt
                            '''
                        }
                    }
                    stage('test') { 
                        steps { 
                            sh '''
                                . venv/bin/activate
                                pytest -n 4 \
                                  --timeout=300 \
                                  --junit-xml ${WORKSPACE}/testresults.xml \
                                  --verbose \
                                  --cov=${PROJECT_NAME} \
                                  --cov-branch \
                                  --cov-report xml:${WORKSPACE}/coverage.xml \
                                  --replay-record-dir=testreplay \
                                  tests || true
                            '''
                            junit 'testresults.xml'
                            cobertura coberturaReportFile: 'coverage.xml'
                        }
                    }
                }
            }
            

            It never gets to the "test" stage; it seems the "sh" step in the "build" stage never ends.
            If needed, I'll gladly provide more information.

            daniel_jeznach_img Daniel Jeznach added a comment - edited

            Hello, the fix for this issue causes one of our pipelines to get stuck. Passing the -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true parameter reverts to the old behavior, so I suspect running the sh build step in a subshell as the cause. The pipeline begins as follows (the sync/build/test stages appear in full above):

            pipeline {
                agent { label "xxxxxx" + (env.JOB_NAME.find(/-(u14|u16|u18|w7|w10)(?=-)/) ?: '') }
                options { timeout(time: 60, unit: 'MINUTES') }
                environment {
                    TMP = "${env.WORKSPACE}/tmp"
                    PROJECT_NAME = "${env.JOB_NAME}".find(/^[^-]+/)
                    PROJECT_BRANCH = "${env.JOB_NAME}".find(/(?<=^[^-]+-)[^-]+/)
                    PROJECT_RELEASE = "${env.PROJECT_BRANCH}".replace('MAIN', '9999rc')
                    P4_BRANCH = "REL/${env.PROJECT_BRANCH}".replace('REL/MAIN', 'MAIN')
                }
                parameters {
                    string(name: 'change', defaultValue: '', description: 'Changelist to build.', trim: true)
                    string(name: 'status', defaultValue: '', description: 'Changelist status (shelved/submitted).', trim: true)
                    string(name: 'review', defaultValue: '', description: 'Helix Swarm review number.', trim: true)
                    string(name: 'pass', defaultValue: '', description: 'Success callback URL.', trim: true)
                    string(name: 'fail', defaultValue: '', description: 'Failure callback URL.', trim: true)
                }
                triggers { pollSCM('@midnight') }
                stages {
                    stage('sync') {
                        steps {
                            p4sync(
                                credential: 'xxxxx',
                                // ...continues with the sync/build/test stages shown above...
            jglick Jesse Glick made changes -
            Link This issue causes JENKINS-58656 [ JENKINS-58656 ]
            jglick Jesse Glick added a comment -

            daniel_jeznach_img you can file a separate issue (using the is caused by option to Link) with complete, minimal steps to reproduce from scratch.

            Offhand I suspect the pkill command is to blame, perhaps by killing too much. It is possible you did not need this to begin with—exiting a node block already sends a termination signal to all processes which inherited the (IIRC) JENKINS_MAGIC_COOKIE environment variable passed to any nested sh steps.
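A non-authoritative sketch of why a broad pkill can break the new launcher: the detached wrapper outlives its launcher and is reparented away from it, so matching every child of PID 1 catches the wrapper too ("my_leaked_daemon" below is a placeholder name, not a real process).

```shell
# Orphan a process the same way a fire-and-forget launcher does, then
# observe that it gets reparented (typically to PID 1 or a subreaper).
# That is why "pkill -P1 -U $(id -u)" can kill the durable-task wrapper
# itself, leaving the sh step waiting forever for jenkins-result.txt.
sh -c '(sleep 5 &)'                    # inner sleep outlives its parent shell
sleep 1                                # give the kernel time to reparent it
orphan=$(pgrep -n -u "$(id -u)" -x sleep)
ppid=$(ps -o ppid= -p "$orphan" | tr -d ' ')
echo "orphan PID $orphan now has PPID $ppid"

# A narrower cleanup targets only known leftover commands instead of every
# child of init ("my_leaked_daemon" is a placeholder):
pkill -u "$(id -u)" -x my_leaked_daemon 2>/dev/null || true
kill "$orphan" 2>/dev/null             # tidy up the demo process
```

Note that the cookie variable actually exported to nested sh steps appears as JENKINS_SERVER_COOKIE in the launch commands quoted later in this thread; exiting a node block signals processes that inherited it, so manual orphan cleanup is often unnecessary.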


            daniel_jeznach_img Daniel Jeznach added a comment -

            jglick Yes, pkill was the culprit in this case. Thank you for the quick and accurate response.

            v2v Victor Martinez made changes -
            Link This issue causes JENKINS-59668 [ JENKINS-59668 ]
            laurent1 Laurent Hory added a comment -

            Hi,

            The fix for this issue also causes a freeze during pytest execution on a Docker node.

            The sh command blocks during execution.

            With the diagnostics parameter, execution works correctly:

            -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true
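For reference, one way to set this system property persistently (assuming a Debian/Ubuntu package install; the file path and the JAVA_ARGS variable are assumptions for that packaging, and other installation methods pass JVM options differently):

```shell
# /etc/default/jenkins (Debian-style packaging; path and variable name are
# assumptions, adjust for your installation method)
JAVA_ARGS="$JAVA_ARGS -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true"
```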

             

             

             The context is a pipeline that launches pytest with tox:

            docker_img_version='0.17'
            launch_container_with_user='jenkins'
            parser_name='sweep_tools'
            github_cred_id='xxxxxx'
            github_url='git@github.com:xxx.git'
            pipeline {
              agent {
                docker {
                  image 'xxxxx/analytics/ci_process:'+docker_img_version
                  label 'slaves_host'
                  args '-u jenkins -e POSTGRESQL_USER=xxxx -e POSTGRESQL_PASSWORD=xxxx -e POSTGRESQL_DATABASE=xxxxx'
                  customWorkspace "/home/jenkins/workspace/${env.JOB_NAME}"
                }
              }
              stages {
                stage('Delete tox cache') {
                  steps {
                    dir("${env.WORKSPACE}/.tox") {
                      deleteDir()
                    }
                  }
                }
                stage('Test execution') {
                  steps {
                    sh 'tox'
                    junit 'tests/results/results.xml'
                    cobertura autoUpdateHealth: false, autoUpdateStability: false,  coberturaReportFile: 'tests/results/coverage.xml', conditionalCoverageTargets: '70, 0, 0', failUnhealthy: false, failUnstable: false, lineCoverageTargets: '80, 0, 0', maxNumberOfBuilds: 0, methodCoverageTargets: '80, 0, 0', onlyStable: false, sourceEncoding: 'ASCII', zoomCoverageChart: false
                  }
                }
              }
            }
            
            

             The code in the tox.ini file is:

            [tox]
            envlist = py36

            [testenv]
            install_command = pip install {opts} {packages}
            deps =
            	pytest
            	pytest-cov
            	pytest-flask
            	chardet
            commands = pytest --cov=sweep_tools --cov-report term-missing --cov-report xml:tests/results/coverage.xml --junitxml tests/results/results.xml
            

             

            The actual command line executed by the step, which freezes just after pytest finishes, is:

            sh -c "({ while [ -d '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d' -a \! -f '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-log.txt'; sleep 3; done } & jsc=durable-07cc39a960911d2b0363cd9d28761c7c; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/script.sh' > '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt.tmp' '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt'; wait) >&- 2>&- &"

             

            I can fix the freeze with the following command line (without the trailing &):

            sh -c "({ while [ -d '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d' -a \! -f '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-log.txt'; sleep 3; done } & jsc=durable-07cc39a960911d2b0363cd9d28761c7c; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/script.sh' > '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt.tmp' '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt'; wait) >&- 2>&-"

             

            The impacted line in the code is in BourneShellScript.java:

            cmd.addAll(Arrays.asList("sh", "-c", "(" + cmdString + ") >&- 2>&- &"));

             

            It can be replaced with:

            cmd.addAll(Arrays.asList("sh", "-c", "(" + cmdString + ") >&- 2>&-"));
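For illustration, a minimal self-contained sketch of this launch pattern (directory and file names are placeholders, not the plugin's real ones) shows what the trailing & changes: the outer sh returns immediately while the detached wrapper subshell keeps running and records the exit code on its own.

```shell
# Minimal sketch of the durable-task launch pattern (all paths below are
# placeholders created in a throwaway temp directory):
d=$(mktemp -d)
printf 'sleep 1\necho hello\n' > "$d/script.sh"

# The trailing "&" plus the closed descriptors (">&- 2>&-") make the launch
# fire-and-forget: the outer sh returns at once, while the detached wrapper
# subshell keeps running and records the exit code when the script ends.
sh -c "( sh '$d/script.sh' > '$d/log.txt' 2>&1; echo \$? > '$d/result.txt' ) >&- 2>&- &"

if [ ! -f "$d/result.txt" ]; then
    echo "launcher returned before the script finished"
fi
sleep 3                      # wait for the detached wrapper to complete
cat "$d/result.txt"          # exit status recorded by the wrapper
```

Dropping the trailing &, as proposed above, makes the outer sh wait for the wrapper to finish instead, which restores the old blocking behavior that this issue's fix was meant to avoid.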

             

            laurent1 Laurent Hory made changes -
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            jglick Jesse Glick added a comment -

            laurent1 please do not reopen. If you believe you have discovered a regression, file a separate issue linked to this one with complete, minimal steps to reproduce from scratch.

            jglick Jesse Glick made changes -
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]
            sbeaulie Samuel Beaulieu made changes -
            Link This issue causes JENKINS-65984 [ JENKINS-65984 ]

            People

              Assignee: jglick Jesse Glick
              Reporter: jglick Jesse Glick
              Votes: 0
              Watchers: 6
