JENKINS-52362

Jenkins hangs due to "Running CpsFlowExecution unresponsive"


Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved

    Description

      Three times in the last two weeks, we've had our Jenkins server stop responding to requests. When I check syslog, I see errors like this:

      Jun 30 16:07:18 jenkins [jenkins]: Jun 30, 2018 4:07:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
      Jun 30 16:07:18 jenkins [jenkins]: INFO: Running CpsFlowExecutionOwner[project/263:project #263] unresponsive for 5 sec
      Jun 30 16:07:18 jenkins [jenkins]: Jun 30, 2018 4:07:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
      Jun 30 16:07:18 jenkins [jenkins]: INFO: Running CpsFlowExecutionOwner[project/368:project #368] unresponsive for 5 sec
      Jun 30 16:07:18 jenkins [jenkins]: Jun 30, 2018 4:07:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
      Jun 30 16:07:18 jenkins [jenkins]: INFO: Running CpsFlowExecutionOwner[project/318:project #318] unresponsive for 5 sec

      These seem to persist indefinitely and there don't seem to be any other relevant messages in the log. The Web UI just hangs until nginx times out.

      The Java process will then refuse to stop when I try to restart the service and I have to kill it with kill -9.
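
      When the controller gets into this state, a thread dump of the Jenkins JVM is usually the most useful thing to attach. Below is a minimal, hedged sketch for the Script Console (Manage Jenkins, then Script Console); the output path is only an example, and if the web UI is too far gone to reach the console, running jstack against the Jenkins PID from the OS gives the same information.

      // Hedged sketch, not an official procedure: dump all JVM thread stacks
      // from the Jenkins Script Console. The output file path is illustrative.
      def sb = new StringBuilder()
      Thread.getAllStackTraces().each { thread, frames ->
          sb.append("\"${thread.name}\" state=${thread.state}\n")
          frames.each { frame -> sb.append("    at ${frame}\n") }
          sb.append('\n')
      }
      new File('/tmp/jenkins-threads.txt').text = sb.toString()
      println "wrote ${Thread.getAllStackTraces().size()} thread stacks"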

       

      Attachments

        Activity

          ganthore Mark Austin added a comment (edited)

          Edit: Root cause was the Performance plugin running perfReport on ~240 MB of test data.

          We've been experiencing a similar issue for the past 2 days.

          Attached is a thread dump: ganthore-threads.dump

          We're running core 2.344 and the latest Pipeline plugins.

          imqyh yuhang qiu added a comment

          Version: Jenkins 2.355 with latest plugins.

          Problem:

          1. Some jobs cannot finish even though they are actually done, and keep logging the same output.
          2. The nodes running those jobs report "unresponsive for 5 sec / 10 sec...", and the timer can reset back to 5 sec (so it is not a deadlock).
          3. Other jobs on those nodes cannot start or finish.
          4. jstack on the Jenkins process shows the jobs waiting on one thread, and tee appears in that thread's backtrace. (I forgot to snapshot/save it...)

          Reason:

          1. We use tee in our pipelines.
          2. Perhaps the command wrapped by tee does not close its file descriptor properly, the network is unstable (the EOF packet is lost), or tee itself has a bug.
          3. Such huge, never-ending output holds the lock for too long.
          4. The node starts reporting unresponsive, and the unresponsive timer resets whenever the lock is acquired.
          5. Jobs on those nodes cannot start or finish.

          After we removed all tee usage from all jobs, the problem disappeared, though the root cause may be different in other users' reports.

          If tee shows up in your backtrace, it may be the same problem; try removing it.

          Report by: Aliyun PolarDB Testing team.
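
          A minimal declarative-pipeline sketch of the workaround described above. It assumes the tee in question is the Pipeline tee step from workflow-basic-steps; if the shell tee utility is meant instead, the same idea applies. Stage names, script names, and log file names are illustrative only.

          // Hedged sketch: avoid streaming a huge log through the tee step by
          // letting the shell write the file and archiving it afterwards.
          pipeline {
              agent any
              stages {
                  stage('Build') {
                      steps {
                          // Before (suspected trigger):
                          // tee('build.log') {
                          //     sh './run-tests.sh'
                          // }

                          // After: write the log on the agent, then publish it.
                          sh './run-tests.sh > build.log 2>&1'
                          archiveArtifacts artifacts: 'build.log', allowEmptyArchive: true
                      }
                  }
              }
          }

          If the script can fail, moving archiveArtifacts into a post { always { ... } } block keeps the log for failed runs as well.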

          mcascone Max Cascone added a comment

          We are having this same problem; I'm not sure whether it's related to the CPS issue in the original subject. Several times a day, jobs will just stop at various spots in their pipelines. These are all declarative pipelines, with a Linux controller and Windows agents. There is plenty of disk space.

          Jenkins 2.346.2
          Plugins mostly all up to date.

          That -Xmx256m looks suspiciously low.

          This is from the jenkins.xml:

          <executable>/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre/bin</executable> -->
            <arguments>-Xrs -Xmx256m -Dhudson.lifecycle=hudson.lifecycle.WindowsServiceLifecycle -Djsse.enableSNIExtension=false -jar "%BASE%\jenkins.war" --httpPort=8080</arguments>
          

          memory:

          $ free -h
                        total        used        free      shared  buff/cache   available
          Mem:           7.6G        3.6G        1.2G        256M        2.9G        3.5G
          Swap:          1.0G        345M        678M
          

          Should we bump up the -Xmx to 4g?
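
          One way to confirm which heap limit the running controller actually has (the jenkins.xml above looks like a Windows-service config, so it may not be what this Linux JVM was started with) is the Script Console. A read-only, hedged sketch, with nothing assumed beyond the console itself:

          // Hedged sketch: print the effective JVM heap limits from the Script Console.
          def rt = Runtime.getRuntime()
          def mb = { long bytes -> String.format('%.0f MB', bytes / (1024 * 1024)) }
          println "max heap (-Xmx):     ${mb(rt.maxMemory())}"
          println "currently allocated: ${mb(rt.totalMemory())}"
          println "free in allocated:   ${mb(rt.freeMemory())}"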

          touseef Touseef added a comment

          mcascone, not sure whether you have already fixed it. Try updating to OpenJDK 11.

          mmarquezvacas Miguel added a comment (edited)

          touseef, we're experiencing the same issue. Do you mean that updating the JDK version will fix it?

          Which specific version helps? The latest one?


          People

            Assignee: Unassigned
            Reporter: pdouglas Philip Douglas
            Votes: 37
            Watchers: 54
