• Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • logstash-plugin
    • jenkins 2.298
      plugins: workflow-cps:2.92, logstash:2.4.0, workflow-job:2.41, etc.
      OS: CentOS7 - 5.11.9-1.el7.elrepo.x86_64
      java: java-11-openjdk-11.0.12.0.7-0.el7_9.x86_64


      From time to time we hit deadlocks when the logstash plugin is enabled. Only a reboot helps. With the same configuration but without logstash, we do not get the deadlock.

      Running CpsFlowExecution[Owner[Platform/devops/incubator/devops-configuration-plugin-incubator/feature-build/295:Platform/devops/incubator/devops-configuration-plugin-incubator/feature-build #295]] locked on java.io.PrintStream@2316b2cd (owned by jenkins.util.Timer [#9]):
       at org.jenkinsci.plugins.workflow.job.console.NewNodeConsoleNote.print(NewNodeConsoleNote.java:74)
       at org.jenkinsci.plugins.workflow.job.WorkflowRun.onNewHead(WorkflowRun.java:1051)
       at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1473)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.notifyNewHead(CpsThreadGroup.java:472)
       at org.jenkinsci.plugins.workflow.cps.FlowHead.setNewHead(FlowHead.java:157)
       at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onProgramEnd(CpsFlowExecution.java:1255)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:424)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$000(CpsThreadGroup.java:96)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.call(CpsThreadGroup.java:312)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.call(CpsThreadGroup.java:276)
       at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.call(CpsVmExecutorService.java:67)
       at java.base@11.0.8/java.util.concurrent.FutureTask.run(FutureTask.java:264)
       at hudson.remoting.SingleLaneExecutorService.run(SingleLaneExecutorService.java:136)
       at jenkins.util.ContextResettingExecutorService.run(ContextResettingExecutorService.java:28)
       at jenkins.security.ImpersonatingExecutorService.run(ImpersonatingExecutorService.java:59)
       at java.base@11.0.8/java.util.concurrent.Executors.call(Executors.java:515)
       at java.base@11.0.8/java.util.concurrent.FutureTask.run(FutureTask.java:264)
       at java.base@11.0.8/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       at java.base@11.0.8/java.util.concurrent.ThreadPoolExecutor.run(ThreadPoolExecutor.java:628)
       at java.base@11.0.8/java.lang.Thread.run(Thread.java:834)
      , jenkins.util.Timer [#9] locked on org.jenkinsci.plugins.workflow.cps.CpsFlowExecution@3608cc31 (owned by Running CpsFlowExecution[Owner[Platform/devops/incubator/devops-configuration-plugin-incubator/feature-build/295:Platform/devops/incubator/devops-configuration-plugin-incubator/feature-build #295]]):
       at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getCurrentHeads(CpsFlowExecution.java:981)
       at org.jenkinsci.plugins.workflow.flow.FlowExecution.isComplete(FlowExecution.java:208)
       at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.isComplete(CpsFlowExecution.java:1238)
       at org.jenkinsci.plugins.workflow.cps.RunningFlowActions.createFor(RunningFlowActions.java:52)
       at org.jenkinsci.plugins.workflow.cps.RunningFlowActions.createFor(RunningFlowActions.java:42)
       at hudson.model.Actionable.createFor(Actionable.java:114)
       at hudson.model.Actionable.getAction(Actionable.java:335)
       at jenkins.plugins.logstash.persistence.BuildData.updateResult(BuildData.java:279)
       at jenkins.plugins.logstash.LogstashWriter.write(LogstashWriter.java:181)
       at jenkins.plugins.logstash.LogstashWriter.write(LogstashWriter.java:118)
       at jenkins.plugins.logstash.LogstashOutputStream.eol(LogstashOutputStream.java:64)
       at hudson.console.LineTransformationOutputStream.eol(LineTransformationOutputStream.java:60)
       at hudson.console.LineTransformationOutputStream.write(LineTransformationOutputStream.java:56)
       at hudson.console.LineTransformationOutputStream.write(LineTransformationOutputStream.java:74)
       at java.base@11.0.8/java.io.PrintStream.write(PrintStream.java:559)
       at java.base@11.0.8/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:233)
       at java.base@11.0.8/sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:312)
       at java.base@11.0.8/sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
       at java.base@11.0.8/java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:184)
       at java.base@11.0.8/java.io.PrintStream.newLine(PrintStream.java:625)
       at java.base@11.0.8/java.io.PrintStream.println(PrintStream.java:883)
       at jenkins.model.CauseOfInterruption.print(CauseOfInterruption.java:129)
       at hudson.model.Executor.recordCauseOfInterruption(Executor.java:283)
       at org.jenkinsci.plugins.workflow.job.WorkflowRun.lambda-bash(WorkflowRun.java:392)
       at org.jenkinsci.plugins.workflow.job.WorkflowRun20188Lambda90/0x000000084f76bc40.run(Unknown Source)
       at jenkins.security.ImpersonatingScheduledExecutorService.run(ImpersonatingScheduledExecutorService.java:58)
       at java.base@11.0.8/java.util.concurrent.Executors.call(Executors.java:515)
       at java.base@11.0.8/java.util.concurrent.FutureTask.run(FutureTask.java:264)
       at java.base@11.0.8/java.util.concurrent.ScheduledThreadPoolExecutor.run(ScheduledThreadPoolExecutor.java:304)
       at java.base@11.0.8/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       at java.base@11.0.8/java.util.concurrent.ThreadPoolExecutor.run(ThreadPoolExecutor.java:628)
       at java.base@11.0.8/java.lang.Thread.run(Thread.java:834)
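
      The dump above shows a lock-order inversion: the pipeline thread holds the CpsFlowExecution monitor and waits for the PrintStream, while jenkins.util.Timer [#9] holds the PrintStream and waits for CpsFlowExecution. A minimal sketch of that cycle follows. It is an illustration only, not Jenkins code: the class and lock names are hypothetical, and it uses ReentrantLock.tryLock with a timeout (an assumption, the real code uses synchronized monitors) so the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the two monitors named in the thread dump. */
class LockOrderDemo {
    // Assumed stand-ins: one lock for the PrintStream, one for the CpsFlowExecution.
    static final ReentrantLock streamLock = new ReentrantLock();
    static final ReentrantLock executionLock = new ReentrantLock();

    /** Takes 'first', waits until the peer holds its own first lock, then tries 'second'. */
    static boolean acquireBoth(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHoldFirst, CountDownLatch bothTried)
            throws InterruptedException {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // With synchronized blocks this step would block forever.
            boolean gotSecond = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (gotSecond) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await(); // keep holding 'first' until the peer has given up too
            return gotSecond;
        } finally {
            first.unlock();
        }
    }

    /** Returns {pipelineGotStream, timerGotExecution}. */
    static boolean[] run() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] results = new boolean[2];
        // Pipeline thread: execution monitor first, then the console stream.
        Thread pipeline = new Thread(() -> {
            try {
                results[0] = acquireBoth(executionLock, streamLock, bothHoldFirst, bothTried);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Timer thread: console stream first, then the execution monitor.
        Thread timer = new Thread(() -> {
            try {
                results[1] = acquireBoth(streamLock, executionLock, bothHoldFirst, bothTried);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        pipeline.start();
        timer.start();
        try {
            pipeline.join();
            timer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }
}
```

      Calling LockOrderDemo.run() should report that neither thread obtained its second lock, which is the situation the two stack traces describe.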
      

          [JENKINS-66258] get deadlock with enabled logstash-plugin

          Allan BURDAJEWICZ added a comment - Might have been introduced by https://issues.jenkins.io/browse/JENKINS-49114 .

          Jesse Glick added a comment -

          LogstashOutputStream.eol ought to do any other work asynchronously.
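
          One way to read this suggestion, sketched under assumptions: the stream only copies the finished line while the caller holds the PrintStream lock, and a single worker thread does everything else later. AsyncLineOutputStream and its line handler are hypothetical names; this is a plain java.io.OutputStream, not the plugin's LogstashOutputStream or Jenkins' LineTransformationOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical line-buffering stream that hands each completed line to a worker thread. */
class AsyncLineOutputStream extends OutputStream {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    // A single worker keeps lines in the order they were written.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Consumer<String> lineHandler;

    AsyncLineOutputStream(Consumer<String> lineHandler) {
        this.lineHandler = lineHandler;
    }

    @Override
    public synchronized void write(int b) {
        if (b == '\n') {
            // Copy the line on the caller's thread, then return without doing further work.
            String line = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            buffer.reset();
            // The handler runs later on the worker, outside any lock held by the caller.
            worker.execute(() -> lineHandler.accept(line));
        } else {
            buffer.write(b);
        }
    }

    @Override
    public void close() {
        worker.shutdown();
        try {
            worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Writes two lines through a PrintStream and returns what the handler saw. */
    static List<String> demo() {
        List<String> handled = new CopyOnWriteArrayList<>();
        AsyncLineOutputStream out = new AsyncLineOutputStream(handled::add);
        PrintStream console = new PrintStream(out, true);
        console.print("first\nsecond\n");
        console.close(); // closes the underlying stream, which waits for the worker to drain
        return handled;
    }
}
```

          In the plugin, the handler would be the place for work such as the BuildData.updateResult call seen in the trace, so that it never runs under the PrintStream monitor. Whether that fits the plugin's ordering and shutdown requirements is not established here.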

          Allan BURDAJEWICZ added a comment - - edited

          jglick Despite the bad practice of doing this synchronously, I wonder whether RunningFlowActions could be improved so that it is not consulted on Actionable#getAction(Class). I know this is not the root cause here, but if the factory overrode TransientActionFactory#actionType, I think it would prevent this from happening in other cases. For example, this factory also contributed to prometheus deadlocks: https://github.com/jenkinsci/prometheus-plugin/issues/635. WDYT?

          Jesse Glick added a comment -

          getPersistentAction perhaps. Not familiar with the code in this plugin.

          Devin Nusbaum added a comment - - edited

          allan_burdajewicz FWIW, if you really do see RunningFlowActions.createFor come up frequently in deadlocks of this nature due to it needing to acquire the monitor for CpsFlowExecution, then to me it seems expedient to have that factory override actionType either by introducing a new Action subtype to use as a supertype for CpsThreadDumpAction and PauseUnpauseAction (e.g. RunningFlowAction), or by splitting the factory in two.

          Looking briefly at logstash and prometheus, both of them are calling Run.getAction(AbstractTestResultAction.class), so I don't think PersistentAction is directly useful, since it is only for FlowNode actions.
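
          To illustrate the actionType idea discussed in this thread, here is a simplified, self-contained model. It is a sketch under assumptions: DemoAction, DemoActionFactory, and ActionLookupDemo are hypothetical stand-ins, not the Jenkins Actionable or TransientActionFactory classes, and the compatibility check is a plain approximation of how a lookup can skip factories whose declared type cannot match the requested one.

```java
import java.util.List;

// Hypothetical stand-ins for Jenkins action types.
interface DemoAction {}
class TestResultAction implements DemoAction {}
class RunningFlowAction implements DemoAction {}

/** Simplified model of a transient action factory. */
abstract class DemoActionFactory {
    int createForCalls = 0;

    /** Most specific type this factory produces; the broad default matches every lookup. */
    Class<? extends DemoAction> actionType() {
        return DemoAction.class;
    }

    abstract List<DemoAction> create();

    final List<DemoAction> createFor() {
        createForCalls++; // in the real issue, this is where a monitor would be acquired
        return create();
    }
}

class ActionLookupDemo {
    static <T extends DemoAction> T getAction(List<DemoAction> persisted,
                                              List<DemoActionFactory> factories,
                                              Class<T> type) {
        // Persisted actions are checked first and need no factory at all.
        for (DemoAction a : persisted) {
            if (type.isInstance(a)) {
                return type.cast(a);
            }
        }
        for (DemoActionFactory f : factories) {
            Class<? extends DemoAction> produced = f.actionType();
            // Skip factories whose declared type cannot overlap with the requested type.
            if (!type.isAssignableFrom(produced) && !produced.isAssignableFrom(type)) {
                continue;
            }
            for (DemoAction a : f.createFor()) {
                if (type.isInstance(a)) {
                    return type.cast(a);
                }
            }
        }
        return null;
    }

    /** Returns {calls on the broad factory, calls on the narrowed factory}. */
    static int[] demo() {
        DemoActionFactory broad = new DemoActionFactory() {
            @Override List<DemoAction> create() {
                return List.of(new RunningFlowAction());
            }
        };
        DemoActionFactory narrow = new DemoActionFactory() {
            @Override Class<? extends DemoAction> actionType() {
                return RunningFlowAction.class;
            }
            @Override List<DemoAction> create() {
                return List.of(new RunningFlowAction());
            }
        };
        getAction(List.of(), List.of(broad, narrow), TestResultAction.class);
        return new int[] { broad.createForCalls, narrow.createForCalls };
    }
}
```

          In this model, a lookup for TestResultAction still invokes the factory that keeps the broad default, but never the one that declares RunningFlowAction as its type. That is the effect the proposal aims for with RunningFlowActions.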

          Allan BURDAJEWICZ added a comment - - edited

          Thanks for your input, dnusbaum and jglick. I searched for occurrences in the jenkinsci and cloudbees GitHub orgs, JIRA, CloudBees internal JIRA, and support cases. In most cases it shows up with logstash and prometheus. On GitHub, it was mentioned in some PRs, such as https://github.com/jenkinsci/workflow-cps-plugin/pull/596 and https://github.com/jenkinsci/workflow-job-plugin/pull/101.

          I am happy to propose a PR for workflow-cps if it can prevent further issues and reduce stress on pipeline execution and load (fewer unnecessary calls to CpsFlowExecution.getCurrentHeads).

          For logstash here, maybe a hot fix is to do the following:

          build.getActions().stream()
                    .filter(action -> action instanceof AbstractTestResultAction)
                    .findFirst()
                    .orElse(null)
          

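          We know that AbstractTestResultAction is not transient, so, if present, it is already in the list returned by getActions() and no transient action factory has to be consulted.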

          Allan BURDAJEWICZ added a comment - Proposed https://github.com/jenkinsci/logstash-plugin/pull/148

            jbochenski Jakub Bochenski
            sappersd Dmitry
            Votes: 0
            Watchers: 6