JENKINS-72792

Builds hang sometimes for withMaven and ssh-agent steps execution

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component/s: kubernetes-plugin
    • Labels: None
    • Jenkins: 2.426.3
      Kubernetes plugin: 4174.v4230d0ccd951

      Approximately 5% of the builds executed on our Jenkins instance using the Kubernetes plugin hang. The issue seems to occur during the execution of the `withMaven` step (from the Pipeline Maven Integration plugin) and the `ssh-agent` step (from the SSH Agent plugin).

      We have another Jenkins instance (with the same core version and the same plugins) that runs agents as Swarm services via the Docker Swarm plugin, and it does not exhibit this issue.

      It is hard to determine whether the problem lies in the Kubernetes plugin itself or in how other plugins execute their steps when running inside a `container` step.

      Currently, after examining the logs from these plugins (SSH Agent and Pipeline Maven Integration) combined with Log Recorder tracking logs from org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator at the FINEST level, we have identified a potential issue with concurrent process execution within the Kubernetes plugin.

      When a step hangs, the log shows the entry: "onOpen: java.util.concurrent.CountDownLatch@2a09a562[Count = 1]". In all other cases the count is 0. If I understand correctly, in the hanging case the thread keeps waiting for a countDown() call that never occurs.

      What could be causing this behavior for certain pipeline steps?

      The log entry mentioned above is emitted here: ContainerExecDecorator.java#L484

      Changes related to instantiating and invoking CountDownLatch were introduced as part of the issue: JENKINS-67664
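
      The latch behaviour described above can be illustrated with a minimal sketch. It is not code from the Kubernetes plugin: the class name `LatchHangSketch` and the helper `awaitFinished` are hypothetical, and the timeout is an assumption added only so the example terminates. It shows that a latch created with a count of 1 keeps a waiting thread blocked until countDown() is called.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not taken from ContainerExecDecorator. */
final class LatchHangSketch {

    /**
     * Waits for the latch to reach 0, but gives up after the timeout
     * instead of blocking indefinitely like a plain await() would.
     */
    static boolean awaitFinished(CountDownLatch finished, long timeoutMillis) {
        try {
            return finished.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report "not finished".
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        CountDownLatch finished = new CountDownLatch(1);

        // Nobody has called countDown() yet, so the wait times out.
        System.out.println("released: " + awaitFinished(finished, 100));

        // Once countDown() is called, the count is 0 and waiters are released.
        finished.countDown();
        System.out.println("released: " + awaitFinished(finished, 100));
    }
}
```

      With an unbounded await(), the first wait in this sketch would never return, which matches the symptom of a step that hangs until the build is aborted.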

          Łukasz Jackiewicz added a comment -

          After 24 hours the build was interrupted with the following stack trace:

          [Pipeline] End of Pipeline
          Agent was removed
          java.lang.InterruptedException
            at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1048)
            at java.base/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)
            at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecProc.join(ContainerExecProc.java:100)
            at org.jenkinsci.plugins.pipeline.maven.WithMavenStepExecution2.readFromProcess(WithMavenStepExecution2.java:696)
            at org.jenkinsci.plugins.pipeline.maven.WithMavenStepExecution2.obtainMavenExec(WithMavenStepExecution2.java:523)
            at org.jenkinsci.plugins.pipeline.maven.WithMavenStepExecution2.setupMaven(WithMavenStepExecution2.java:373)
            at org.jenkinsci.plugins.pipeline.maven.WithMavenStepExecution2.doStart(WithMavenStepExecution2.java:212)
            at org.jenkinsci.plugins.workflow.steps.GeneralNonBlockingStepExecution.lambda$run$0(GeneralNonBlockingStepExecution.java:77)
            at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
            at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
            at java.base/java.lang.Thread.run(Thread.java:840)
          org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 47c35f7b-6a87-47b4-8547-5ad708eb137c
          Finished: ABORTED

          Łukasz Jackiewicz added a comment -

          I configured a log recorder for ContainerExecDecorator and ContainerExecProc at the FINEST logging level:

          Mar 11, 2024 1:42:01 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
          Launch proc with environment: []
          Mar 11, 2024 1:42:01 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
          Executing sh script inside container maven of pod testpipeline-with-withmaven-13-7c8nq-hfrsb-5tsnv
          Mar 11, 2024 1:42:01 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
          onOpen : java.util.concurrent.CountDownLatch@60882bfb[Count = 1]
          Mar 11, 2024 1:42:01 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
          Launching with env vars: {BUILD_DISPLAY_NAME=#13, BUILD_ID=13, BUILD_NUMBER=13, BUILD_TAG=jenkins-testPipeline_with_withMaven-13, BUILD_URL=https://JENKINS_URL/job/testPipeline_with_withMaven/13/, CI=true, GITLAB_OBJECT_KIND=none, HUDSON_HOME=/var/jenkins_home, HUDSON_SERVER_COOKIE=5af12d216b445803, HUDSON_URL=https://JENKINS_URL/, JENKINS_HOME=/var/jenkins_home, JENKINS_SERVER_COOKIE=5af12d216b445803, JENKINS_URL=https://JENKINS_URL/, JOB_BASE_NAME=testPipeline_with_withMaven, JOB_DISPLAY_URL=https://JENKINS_URL/job/testPipeline_with_withMaven/display/redirect, JOB_NAME=testPipeline_with_withMaven, JOB_URL=https://JENKINS_URL/job/testPipeline_with_withMaven/, library.sharedLibrary.version=develop, POD_CONTAINER=maven, POD_LABEL=testPipeline_with_withMaven_13-7c8nq, RUN_ARTIFACTS_DISPLAY_URL=https://JENKINS_URL/job/testPipeline_with_withMaven/13/display/redirect?page=artifacts, RUN_CHANGES_DISPLAY_URL=https://JENKINS_URL/job/testPipeline_with_withMaven/13/display/redirect?page=changes, RUN_DISPLAY_URL=https://JENKINS_URL/job/testPipeline_with_withMaven/13/display/redirect, RUN_TESTS_DISPLAY_URL=https://JENKINS_URL/job/testPipeline_with_withMaven/13/display/redirect?page=tests, STAGE_NAME=Test stage 1}
          Mar 11, 2024 1:42:01 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
          Executing command: "printenv" "MAVEN_HOME"
          [25 μs.]
          Mar 11, 2024 1:42:01 PM INFO org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1 doLaunch
          Created process inside pod: [testpipeline-with-withmaven-13-7c8nq-hfrsb-5tsnv], container: [maven][233 ms]
          Mar 11, 2024 1:42:01 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecProc
          Waiting for websocket to close on command finish (java.util.concurrent.CountDownLatch@60882bfb[Count = 1])
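
          For readers who want to raise the same loggers outside the Jenkins log recorder UI, the following is a rough sketch using plain java.util.logging. The class `FinestLoggingSketch` and the method `enableFinest` are hypothetical names; only the two logger names come from the log above. It is an assumption that setting the level in code has the same effect as the log recorder in a given setup, since handlers apply their own level filtering.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper; the logger names are the ones seen in the log above. */
final class FinestLoggingSketch {

    static final String DECORATOR =
            "org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator";
    static final String PROC =
            "org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecProc";

    // Strong references keep the configured loggers from being garbage collected.
    static final Logger DECORATOR_LOGGER = Logger.getLogger(DECORATOR);
    static final Logger PROC_LOGGER = Logger.getLogger(PROC);

    /** Lowers the threshold of both loggers so FINEST records are accepted. */
    static void enableFinest() {
        DECORATOR_LOGGER.setLevel(Level.FINEST);
        PROC_LOGGER.setLevel(Level.FINEST);
    }

    public static void main(String[] args) {
        enableFinest();
        System.out.println(DECORATOR_LOGGER.isLoggable(Level.FINEST));
    }
}
```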

          Michael Cornel added a comment -

          Hey ljackiewicz,

          We are currently investigating a similar issue (builds that hang after a Gradle step), but as I understand the code right now, your observation is not yet the root cause of your issue.

          What I think happens is:

          1. ContainerExecDecorator initializes the "started" and "finished" latches to 1 (code). The important point is that they stay at 1 until the corresponding event has occurred and are counted down to 0 as soon as it has. This unblocks threads waiting for those events.
          2. The listener for WebSocket events is registered (but not yet called) (code).
          3. Using the kubernetes-client library, a process is executed in a pod. This is similar to kubectl exec and opens a WebSocket to receive the output of this process.
          4. When the WebSocket is open, this triggers the open() event of the listener created before. This should produce the onOpen log entry you are seeing. Count = 1 means "is not finished yet" (code), which is probably expected when we are talking about a Maven build.

          A typical reason for the hanging builds could be that the created process does not exit, or that the console exit call (code) does not work for some reason. I have no idea why it would happen in only 5% of the builds; it could be a deadlock caused by multithreaded events that happen in the wrong order.

          If you solved your issue in the meantime, I would be interested in your findings!
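
          The four steps above can be sketched roughly as follows. This is a simplified model, not the plugin's code: the class `ExecLatchSketch`, the nested `Listener`, and the methods `onOpen`, `onClose` and `join` are hypothetical stand-ins, and the timeout in `join` is an assumption added so the example cannot block forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Simplified, hypothetical model of the two-latch lifecycle described above. */
final class ExecLatchSketch {

    /** Stand-in for the WebSocket listener registered before the exec call. */
    static final class Listener {
        // Both latches start at 1 and are counted down once per event.
        final CountDownLatch started = new CountDownLatch(1);
        final CountDownLatch finished = new CountDownLatch(1);

        /** WebSocket opened: the process has started but has not finished. */
        void onOpen() {
            started.countDown();
        }

        /** WebSocket closed: the process is considered finished. */
        void onClose() {
            finished.countDown();
        }
    }

    /** Waits for the "finished" event, giving up after the timeout. */
    static boolean join(Listener listener, long timeoutMillis) {
        try {
            return listener.finished.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Listener listener = new Listener();
        listener.onOpen();
        // At this point "finished" still has a count of 1, as in the onOpen log entry.
        System.out.println("finished count after onOpen: " + listener.finished.getCount());
        listener.onClose();
        System.out.println("joined: " + join(listener, 100));
    }
}
```

          In this model a hang corresponds to onClose() never being called, for example because the process does not exit or the close event is lost, so a count of 1 at onOpen is not by itself evidence of a bug.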

          Allan BURDAJEWICZ added a comment -

          There are other affected ProcStarter usages whose reports also blame ContainerExecDecorator and the CountDownLatch mechanism. They also hang intermittently at the same spot:

          https://issues.jenkins.io/browse/JENKINS-65488
          https://issues.jenkins.io/browse/JENKINS-71708
          https://github.com/jfrog/jenkins-artifactory-plugin/issues/741
          https://github.com/jfrog/jenkins-artifactory-plugin/issues/950

          It is quite easy to reproduce the problem with the Artifactory plugin and the rtMaven and rtGradle steps. I am not sure how easy it is with ssh-agent and withMaven.

          It is not clear to me how well the ContainerExecDecorator behaves for *non durable task steps*. Durable tasks seem to work well at the moment, but they are different in that the launch is quick and fire-and-forget, and the (workflow) durable task plugin monitors the status of the launched process.

          It is hard to tell whether the intermittent hangs discussed here are caused by a bug in the ContainerExecDecorator or by the command it executes through the launcher. I tend to think it is the decorator, since those commands work well through remoting.

          cc vlatombe jglick I am curious about what your thoughts are on this.

          Jesse Glick added a comment -

          I have no idea offhand. Anyway, ContainerExecDecorator is overdue for a complete rewrite, either to not use the pod/exec API at all, or at least to use it only once to start some listening process in the container that would communicate with the agent container over a named pipe or similar.

            Assignee: Unassigned
            Reporter: Łukasz Jackiewicz (ljackiewicz)