Jenkins / JENKINS-52362

Jenkins hangs due to "Running CpsFlowExecution unresponsive"


    Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved

      Description

      Three times in the last two weeks, we've had our Jenkins server stop responding to requests. When I check syslog, I see errors like this:

      Jun 30 16:07:18 jenkins [jenkins]: Jun 30, 2018 4:07:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
      Jun 30 16:07:18 jenkins [jenkins]: INFO: Running CpsFlowExecutionOwner[project/263:project #263] unresponsive for 5 sec
      Jun 30 16:07:18 jenkins [jenkins]: Jun 30, 2018 4:07:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
      Jun 30 16:07:18 jenkins [jenkins]: INFO: Running CpsFlowExecutionOwner[project/368:project #368] unresponsive for 5 sec
      Jun 30 16:07:18 jenkins [jenkins]: Jun 30, 2018 4:07:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
      Jun 30 16:07:18 jenkins [jenkins]: INFO: Running CpsFlowExecutionOwner[project/318:project #318] unresponsive for 5 sec

      These seem to persist indefinitely and there don't seem to be any other relevant messages in the log. The Web UI just hangs until nginx times out.

      The Java process will then refuse to stop when I try to restart the service and I have to kill it with kill -9.
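The maintainers below repeatedly ask for thread dumps when this happens. Before resorting to kill -9, a dump can usually be captured with `jstack <pid>` or `kill -3 <pid>`; as an illustrative sketch only (not from this ticket), the same information is also available in-process via the standard java.lang.management API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDump {
    // Produce a dump similar to `jstack`/`kill -3`, including locked
    // monitors and ownable synchronizers -- the part that reveals which
    // thread is holding the lock everyone else is blocked on.
    public static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append(info); // ThreadInfo.toString() is jstack-like
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```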

       

        Attachments

          Activity

          pdouglas Philip Douglas created issue -
          tmay81 Tyler May added a comment -

          We're having the same issue. Jenkins will run fine for a while, then lock up and we see the same errors in the logs. We're using Jenkins 2.127, workflow-cps 2.51:

          INFO: Running CpsFlowExecutionOwner[git/org/master/146:git/org/master #146] unresponsive for 5 sec
          Jul 05, 2018 3:28:36 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0

          raca Ivan Aracki added a comment - - edited

          This is happening randomly, maybe 30% of the time. I'm using Jenkins to build and run spring-boot docker containers. Jenkins itself also runs in a container.

          The message I am getting:

           Jul 09, 2018 9:13:40 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
           INFO: org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#336]: checking /var/jenkins_home/workspace/tonicdm on unresponsive for 5.1 sec
          venkatemadi venkat reddy added a comment - - edited

          We are also seeing the same issue.
          Jenkins ver. 2.121.1

          Jenkins log

          Jul 22, 2018 6:53:27 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
          INFO: Running CpsFlowExecutionOwner[-Project/872:Project #872 unresponsive for 22 hr

           

           

          jniedrauer Josiah Niedrauer added a comment -

          Also seeing this issue. It starts looping with the unresponsive time going up, but otherwise no change. It never seems to recover from this state, even after 12+ hours.

          Running jenkins from the Docker tag: jenkins/jenkins:lts
          Currently at version: 2.121.3

          Log output:

          Sep 01, 2018 11:42:41 PM com.squareup.okhttp.internal.Platform$JdkWithJettyBootPlatform getSelectedProtocol
          INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path?
          Sep 01, 2018 11:43:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
          INFO: org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#570]: checking REDACTED on Docker (i-06939e1a358dc4ce5) unresponsive for 5 sec
          Sep 01, 2018 11:43:27 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
          INFO: org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#570]: checking REDACTED on Docker (i-06939e1a358dc4ce5) unresponsive for 13 sec
          Sep 01, 2018 11:44:00 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
          

           

          jniedrauer Josiah Niedrauer added a comment -

          I have not been able to reproduce this since adding swap to my jenkins master docker host. I think this condition may somehow be triggered by low memory.

          svanoort Sam Van Oort made changes -
          Field Original Value New Value
          Component/s durable-task-plugin [ 18622 ]
          Component/s workflow-cps-plugin [ 21713 ]
          Component/s workflow-durable-task-step-plugin [ 21715 ]
          Component/s core [ 15593 ]
          svanoort Sam Van Oort made changes -
          Labels pipeline threads
          svanoort Sam Van Oort added a comment -

          Josiah Niedrauer, Philip Douglas: please can you grab and attach a thread dump from when you see this issue?

          eagle_rainbow Nico Schmoigl made changes -
          Assignee Nico Schmoigl [ eagle_rainbow ]
          eagle_rainbow Nico Schmoigl made changes -
          Assignee Nico Schmoigl [ eagle_rainbow ]
          eagle_rainbow Nico Schmoigl added a comment -

          Similar issue also reproducible here on an older machine. Setup (amongst others):

          • Jenkins core 2.60.3
          • Yet Another Docker Plugin 0.1.0-rc47
          • Jenkins running in docker container, connecting to another docker server for running jobs on the slave.

          System locks up, but apparently continues running internally.

          The HTTP server can't have been down entirely, as GET requests to the /api/json endpoint (which we use for "availability pinging") kept being answered at the usual response times.

          Jenkins runs jobs every 5 minutes, so we can pinpoint the time of the lock-up quite well.

          I could cross-check: there was more than 120 MB of free heap for the Java process, plus a further 3 GB of RAM.

          Operating system and docker logs around that time look very unsuspicious.

          svanoort Sam Van Oort added a comment -

          Nico Schmoigl Please could you attach a thread dump from Jenkins?

          jniedrauer Josiah Niedrauer added a comment -

          Sam Van Oort I have not been able to reproduce this after weeks of heavy use. It may have been adding swap that fixed it, or it may have been that a new version was pushed to the lts tag. I am not sure. In either case, it is unlikely that I will be able to get you a thread dump.

          eagle_rainbow Nico Schmoigl added a comment -

          Please could you attach a thread dump from Jenkins?

          ... will try to when the server goes down next time.
          It might get a little tricky, as this is a production instance, and when it is down, the pressure is usually high to get it back up and running again as soon as possible.

          svanoort Sam Van Oort added a comment -

          Nico Schmoigl Ack, we'll wait until you have a chance to grab the thread dump – otherwise it'll depend on whether this gets fixed by another bugfix.

          eagle_rainbow Nico Schmoigl added a comment - - edited

          Just for the record: At the same server, we experienced a deadlock situation today, which may be related to this issue:

          Today, we had a very sluggish server (long response latency). On checking, we found several hanging inbound GET requests which took ages (>1,100,000 ms) to complete. A thread dump showed that several threads were blocked by a lock in jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:369), indirectly triggered by hudson.plugins.performance.actions.PerformanceProjectAction.doRespondingTimeGraph. It was not just the performance plugin: other GET requests, such as /job/.../test/trend and /job/.../jacoco/graph (both also locked in jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber), were affected.

          After a little analysis, we found one of the /job/.../performance/throughputGraph requests running (state "running"), apparently in an endless loop. It also held the lock of the critical monitor, which blocked all the other requests. The interesting (triggering) block within this thread, to me, was:

          ...
          at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1189)
          at hudson.util.XStream2.unmarshal(XStream2.java:114)
          at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1173)
          at hudson.XmlFile.unmarshal(XmlFile.java:160)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.reload(WorkflowRun.java:603)
          at hudson.model.Run.<init>(Run.java:325)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.<init>(WorkflowRun.java:209)
          at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
          at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
          at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
          at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
          at jenkins.model.lazy.LazyBuildMixIn.loadBuild(LazyBuildMixIn.java:165)
          at jenkins.model.lazy.LazyBuildMixIn$1.create(LazyBuildMixIn.java:142)
          

          Killing the thread did the trick - and the rest started to work again. Afterwards, we had to restart the server - but that was due to another problem, which is unrelated to this one here.

          However, the situation today was different than before: Today, we had a significant load average / CPU load during that situation. In the previous situation, load average / CPU load was very normal - also for hours before the blocking event. 

          Given that the symptoms are different, I am currently not sure whether we just saw the "early stage" of yet another occurrence of this issue, which we could cure with a courageous thread kill, or whether this was something totally different. In any case, it makes sense to look closely at the list of pending locks if the issue reappears.

          eagle_rainbow Nico Schmoigl added a comment -

          Intermediate feedback: So far, the problem has not reappeared on our server. The only thing we have changed (after the issue I documented in https://issues.jenkins-ci.org/browse/JENKINS-52362?focusedCommentId=349554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-349554 ) was the "read timeout" setting in the YAD configuration section, from "empty" to 120.

          I will update this ticket in case we experience yet another situation where the server stops responding.

          svanoort Sam Van Oort added a comment -

          Nico Schmoigl I'm glad to see you haven't had a recurrence. That sounds like you were held up due to I/O issues – only one operation can try to load a build from disk at a time, and if that takes a while and other operations are depending on that build being loaded they will be blocked until it finishes loading. We did some improvements over the last few releases of workflow-cps and workflow-job to support lazy-loading of the FlowExecution which can make these cases significantly faster (especially where we are using the Performance-Optimized durability mode).

          Looking at performance-plugin, it seems that the Performance plugin causes this due to trying to load all builds for each run (which is bound to create a bottleneck for other things trying to access the builds as well).

          Now, what I'd love to see is the full stack trace for that "endless loop" – it's possible in some rare cases that the build onLoad ends up invoking itself either directly or indirectly, but that suggests a critical bug somewhere (perhaps Performance Plugin, perhaps a combination of plugins).

          Also, what is "YAD plugin" short for? "Yet Another Docker plugin"? If so, I'm very curious how that could be related because the timeout there applies to communications with the Docker server.
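The "only one operation can load a build from disk at a time" behavior can be sketched roughly like this (illustrative only; this is not Jenkins' actual AbstractLazyLoadRunMap code, and the class below is invented for the example):

```java
import java.util.HashMap;
import java.util.Map;

// Invented illustration of the bottleneck: a coarse lock around lazy
// loading means one slow load (disk I/O, XStream unmarshalling, or a
// re-entrant onLoad) blocks every caller asking for *any* build number.
class LazyRunMap {
    private final Map<Integer, String> loaded = new HashMap<>();

    synchronized String getByNumber(int n) {
        // All readers funnel through this monitor; if loadFromDisk(42)
        // takes 30 s, a request for build 7 waits those 30 s too.
        return loaded.computeIfAbsent(n, this::loadFromDisk);
    }

    private String loadFromDisk(int n) {
        return "build #" + n; // stand-in for unmarshalling build.xml
    }
}
```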

          eagle_rainbow Nico Schmoigl made changes -
          eagle_rainbow Nico Schmoigl added a comment -

          We did some improvements over the last few releases of workflow-cps and workflow-job

          That sounds promising. We'll have a look at this some time later.

          Now, what I'd love to see is the full stack trace for that "endless loop".

          Before going into detail here, please let me reiterate one thing: It is not proven to me that the "endless loop" case really was an "early version" of the original bug report here. It has happened on the same server at roughly the same time. Having said this, let's have a look at the thread trace.

          I have attached an anonymized version of the thread trace I created when the system was in that "endless loop" state (20180919-hangingjenkinsthreads-logs.txt). I suggest starting your analysis by searching for the term "main-voter", which is one of our jobs – and, based on my analysis, the job that caused the situation.
          Although we enabled quite strict retention on that job, we still have ~250 builds of it. Moreover, each (successful) build has around 1600 (mostly very small) log files in the build's folder (BTW, they also give us a hard time with our backup strategy).

          Also, what is "YAD plugin" short for? "Yet Another Docker plugin"?

          Yes, correct.

          If so, I'm very curious how that could be related because the timeout there applies to communications with the Docker server.

          Well, this is yet another guess of ours – here's the story:
          Remember that I had written

          Afterwards, we had to restart the server

          The reason for that was the thread "jenkins.util.Timer 6". If you look at its stack, you'll see that it's blocked in a ListContainersCmdExec request. That's an HTTP REST call to the docker server, asking for the list of all containers running on the host (mainly – it's a little more complicated than this). With an additional tool, we found out that it must have been hanging there for hours (so much for "read timeout" – an empty setting there means "infinity"). It was in blocked I/O state, waiting for the result to come back from the docker host.
          We don't know exactly what the docker host did (replied?), but usually such calls take only 1-2 seconds to answer – on very busy hosts it may be up to half a minute or so. You would expect our docker host to respond within less than a second. Apparently, the missing response blocked the YAD plugin and no further containers could be created (which mainly meant that the build queue was blocked, as nearly all our jobs require a new container). We could observe that this also had a negative effect on the management of already-running containers/nodes for currently-running jobs (containers were hanging strangely). It wouldn't surprise me if that also had an indirect bad influence on many other things...
          We also tried to kill that thread (I know that the YAD plugin checks the state of the containers on the host regularly), but as it was in state "blocked I/O" the JVM wouldn't allow us to do that. In the end, the only way to fix this was to restart Jenkins. We then set the read timeout, hoping to prevent it from waiting for eternity again.

          But again: a big chunk of the story here is guesswork, which we put together like a mosaic from the small pieces we saw. That's why nobody should take it as "granted knowledge".
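One detail supporting the read-timeout guess: in standard Java HTTP clients, a read timeout of 0 (the default) means "wait forever", so a peer that silently stops responding leaves the calling thread blocked in read() indefinitely. An illustrative sketch with standard HttpURLConnection (the endpoint URL is hypothetical, and the YAD plugin's real client may be configured differently):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ReadTimeoutDemo {
    public static HttpURLConnection configure(String endpoint) throws Exception {
        // openConnection() performs no network I/O yet
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        // Defaults are 0 == infinite: that is the "blocked I/O, cannot
        // even be killed" state described above. Finite timeouts turn a
        // silent hang into a SocketTimeoutException that code can handle.
        conn.setConnectTimeout(10_000);  // ms
        conn.setReadTimeout(120_000);    // ms, mirroring the "120" setting
        return conn;
    }
}
```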

          johnwu John Wu made changes -
          Attachment jenkins-thread.dump.xml [ 44700 ]
          johnwu John Wu made changes -
          Attachment jenkins-thread.dump.xml [ 44700 ]
          johnwu John Wu made changes -
          Attachment jenkins-thread-dump.txt [ 44701 ]
          johnwu John Wu added a comment -

          We encountered the same issue (running Jenkins ver. 2.60.3). I also attached our thread dump FYI. Restarting the master node solved the issue (for now).

          Our Jenkins is deployed to Kubernetes but we don't have Yet Another Docker plugin installed.

          jenkins-thread-dump.txt

          eagle_rainbow Nico Schmoigl made changes -
          Attachment 20181012-statebefore.txt [ 44769 ]
          eagle_rainbow Nico Schmoigl added a comment -

          Short update from our case:

          • Yesterday we had another case where we almost stumbled into this situation again.
          • The situation was that one user had issued many requests to the main page of a job (called "JenkinsJob102" below) where the complex graphs (jacoco/test coverage/...) are rendered. We again found the HTTP request threads hanging for a long time.
          • CPU load was up to 500% (i.e. 5 cores were busy). Note that some CPU capacity was still free.
          • I/O load apparently was not the biggest problem (otherwise we would have seen a different picture in "top").
          • Jenkins stopped job processing. HTTP response time was in the area of 30s and longer.
          • We did not yet see the log message "INFO: Running CpsFlowExecutionOwner[...] unresponsive for 5 sec" (or similar), but the execution of Pipeline jobs ceased, so I would have expected that to happen very soon.

          I attached a thread dump for you to this ticket (20181012-statebefore.txt). We detected two culprits:

          1. Lock 0x00000000d7102070 was the reason the GET requests started to queue up again (search for "#3231" in the thread dump file). All hanging GET HTTP request threads were against JenkinsJob102.
            We first killed the thread with "#3231" in its name, as it was the current owner of the lock. CPU load dropped briefly, but then all the remaining threads kicked in. We then manually killed the rest of the threads as well, as we were very confident that these requests were leftovers no user would ever need anymore. That took roughly 15 minutes, as Jenkins' performance was bad.
            Once they were gone, CPU load was back at around our usual 20%. Yet the job queue was still not processing.
          2. Taking another thread dump snapshot (which unfortunately I lost shortly thereafter), we then found that the Yet Another Docker plugin (YAD) was waiting for a response from our docker server again. It held a lock in the method "getClient()", so other threads provisioning new slaves could not acquire the lock (nearly all our jobs in the queue require a docker-based slave in one way or another). Having cross-checked with the docker server (which was not expecting to send anything anymore), we also killed the thread waiting for the answer from the docker server, which would never come.
            With that, provisioning of slaves resumed and the job queue started to shrink.

          Having the Jenkins server in good shape again, we dared to try reproducing the situation: one user logged on and opened four browser tabs pointing to the main page of job "JenkinsJob102". He then did a single browser refresh (F5) in each of these tabs. CPU load was almost immediately up to 500% again and we had roughly a dozen hanging GET request threads (note, though, that the number of "hanging threads" was much lower than I would have expected based on the requests the user made - so some of the requests must have completed in virtually no time). Response time was again as bad as we had experienced before. The stack traces also looked similar.

          We killed the HTTP GET request threads again; this time we did not run into a YAD lock problem, though. The system then ran fine thereafter.

          In the builds folder of job "JenkinsJob102", we currently have 316 builds.

          The moral of this story, to me, is that we have two distinct locking issues, which can compound each other.
          What I cannot explain properly yet, though, is why the "heavy GET requests" can sometimes be answered in virtually no time (obviously they aren't cached...), yet as soon as the number of requests to the resource goes up, each execution takes much more CPU capacity (excluding wait time) than before.
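          The lock-owner hunt described above (finding thread "#3231" as the holder of 0x00000000d7102070) can also be done programmatically. Below is a minimal, self-contained sketch - the thread and lock names are made up for illustration, not taken from the dump - that uses ThreadMXBean to list BLOCKED threads together with the owner of the monitor each one is waiting on:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class MonitorPileUpDemo {
    // Stands in for the contended object (lock 0x00000000d7102070 in the dump).
    private static final Object PAGE_LOCK = new Object();

    public static void main(String[] args) throws Exception {
        // One thread holds the monitor for a long time...
        Thread holder = new Thread(() -> {
            synchronized (PAGE_LOCK) {
                try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
            }
        }, "holder#3231");
        holder.setDaemon(true);
        holder.start();
        Thread.sleep(200); // let the holder grab the monitor first

        // ...and several "HTTP GET" threads pile up behind it.
        for (int i = 0; i < 3; i++) {
            Thread t = new Thread(() -> { synchronized (PAGE_LOCK) { } }, "http-GET-" + i);
            t.setDaemon(true);
            t.start();
        }
        Thread.sleep(500); // give the waiters time to block

        // List every BLOCKED thread with the owner of the monitor it waits on --
        // the programmatic equivalent of searching the jstack output by lock address.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo ti : mx.dumpAllThreads(false, false)) {
            if (ti.getThreadState() == Thread.State.BLOCKED) {
                System.out.println(ti.getThreadName() + " blocked on " + ti.getLockName()
                        + " held by " + ti.getLockOwnerName());
            }
        }
    }
}
```

          This prints the same information that jstack renders as "waiting to lock <0x...>" / "locked <0x...>"; grouping waiters by lock makes a pile-up like the one on JenkinsJob102 obvious at a glance.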

          Hide
          eagle_rainbow Nico Schmoigl added a comment -

          After a longer period of not experiencing the problem, we just encountered it again. I took a jstack snapshot and found the following interesting thread stack trace:

          "Running CpsFlowExecution[Owner[apppname1/appname2/appname3/appname4/134675:appname1/appname2/appname3/appname4 #134675]]" #164285 daemon prio=5 os_prio=0 tid=0x00007f668d046000 nid=0x2b16 waiting on condition [0x00007f65db018000]
             java.lang.Thread.State: WAITING (parking)
                  at sun.misc.Unsafe.park(Native Method)
                  - parking to wait for  <0x00000000d6edcc30> (a org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment)
                  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
                  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
                  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
                  at org.codehaus.groovy.util.LockableObject.lock(LockableObject.java:37)
                  at org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment.removeEntry(AbstractConcurrentMapBase.java:173)
                  at org.codehaus.groovy.util.ManagedConcurrentMap$Entry.finalizeReference(ManagedConcurrentMap.java:81)
                  at org.codehaus.groovy.util.ManagedConcurrentMap$EntryWithValue.finalizeReference(ManagedConcurrentMap.java:115)
                  at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$EntryWithValue.finalizeReference(GroovyClassValuePreJava7.java:51)
                  at org.codehaus.groovy.util.ReferenceManager$CallBackedManager.removeStallEntries0(ReferenceManager.java:108)
                  at org.codehaus.groovy.util.ReferenceManager$CallBackedManager.removeStallEntries(ReferenceManager.java:93)
                  at org.codehaus.groovy.util.ReferenceManager$CallBackedManager.afterReferenceCreation(ReferenceManager.java:117)
                  at org.codehaus.groovy.util.ReferenceManager$1.afterReferenceCreation(ReferenceManager.java:135)
                  at org.codehaus.groovy.util.ManagedReference.<init>(ManagedReference.java:36)
                  at org.codehaus.groovy.util.ManagedReference.<init>(ManagedReference.java:40)
                  at org.codehaus.groovy.util.ManagedLinkedList$Element.<init>(ManagedLinkedList.java:40)
                  at org.codehaus.groovy.util.ManagedLinkedList.add(ManagedLinkedList.java:102)
                  at org.codehaus.groovy.reflection.ClassInfo$GlobalClassSet.add(ClassInfo.java:478)
                  - locked <0x00000000d6e6aa68> (a org.codehaus.groovy.util.ManagedLinkedList)
                  at org.codehaus.groovy.reflection.ClassInfo$1.computeValue(ClassInfo.java:83)
                  at org.codehaus.groovy.reflection.ClassInfo$1.computeValue(ClassInfo.java:79)
                  at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$EntryWithValue.<init>(GroovyClassValuePreJava7.java:37)
                  at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment.createEntry(GroovyClassValuePreJava7.java:64)
                  at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment.createEntry(GroovyClassValuePreJava7.java:55)
                  at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:157)
                  at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:100)
                  at org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:38)
                  at org.codehaus.groovy.reflection.GroovyClassValuePreJava7.get(GroovyClassValuePreJava7.java:94)
                  at org.codehaus.groovy.reflection.ClassInfo.getClassInfo(ClassInfo.java:144)
                  at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:258)
                  at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:883)
                  at org.codehaus.groovy.runtime.callsite.CallSiteArray.createCallStaticSite(CallSiteArray.java:75)
                  at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56)
                  at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194)
                  at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:198)
                  at dpGetOnboardingCredentials.<clinit>(dpGetOnboardingCredentials.groovy)
                  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
                  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
                  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
                  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
                  at java.lang.Class.newInstance(Class.java:442)
                  at org.jenkinsci.plugins.workflow.cps.global.UserDefinedGlobalVariable.getValue(UserDefinedGlobalVariable.java:54)
                  at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:99)
                  at sun.reflect.GeneratedMethodAccessor220.invoke(Unknown Source)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                  at java.lang.reflect.Method.invoke(Method.java:498)
                  at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
                  at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
                  at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1218)
                  at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1027)
                  at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
                  at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
                  at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
                  at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:157)
                  at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
                  at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:133)
                  at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155)
                  at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:159)
                  at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
                  at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
                  at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
                  at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
                  at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
                  at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
                  at sun.reflect.GeneratedMethodAccessor179.invoke(Unknown Source)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                  at java.lang.reflect.Method.invoke(Method.java:498)
                  at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
                  at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.dispatch(CollectionLiteralBlock.java:55)
                  at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.item(CollectionLiteralBlock.java:45)
                  at sun.reflect.GeneratedMethodAccessor188.invoke(Unknown Source)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                  at java.lang.reflect.Method.invoke(Method.java:498)
                  at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
                  at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
                  at com.cloudbees.groovy.cps.Next.step(Next.java:83)
                  at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
                  at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
                  at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:122)
                  at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:261)
                  at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
                  at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:19)
                  at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:35)
                  at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:32)
                  at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
                  at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:32)
                  at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
                  at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331)
                  at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82)
                  at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243)
                  at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231)
                  at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
                  at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                  at java.lang.Thread.run(Thread.java:745)

          The interesting part is that I also cannot find any job holding the lock, and this has been the case for more than 5 minutes.

          I "killed" the thread using the monitoring plugin. Very soon thereafter, the next job ran into the same problem; its stack trace reads:

          "Executor #-1 for master : executing appname5/appname6 #105" #168783 daemon prio=5 os_prio=0 tid=0x00007f668d137800 nid=0x49df waiting on condition [0x00007f65dd9d0000]
             java.lang.Thread.State: WAITING (parking)
                  at sun.misc.Unsafe.park(Native Method)
                  - parking to wait for  <0x00000000d6ed59e0> (a org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment)
                  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
                  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
                  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
                  at org.codehaus.groovy.util.LockableObject.lock(LockableObject.java:37)
                  at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:104)
                  at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:100)
                  at org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:38)
                  at org.codehaus.groovy.reflection.GroovyClassValuePreJava7.get(GroovyClassValuePreJava7.java:94)
                  at org.codehaus.groovy.reflection.ClassInfo.getClassInfo(ClassInfo.java:144)
                  at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:258)
                  at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:883)
                  at org.codehaus.groovy.runtime.callsite.CallSiteArray.createCallStaticSite(CallSiteArray.java:75)
                  at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56)
                  at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194)
                  at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:198)
                  at WorkflowScript.<clinit>(WorkflowScript)
                  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
                  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
                  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
                  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
                  at java.lang.Class.newInstance(Class.java:442)
                  at org.codehaus.groovy.runtime.InvokerHelper.createScript(InvokerHelper.java:434)
                  at groovy.lang.GroovyShell.parse(GroovyShell.java:700)
                  at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.doParse(CpsGroovyShell.java:133)
                  at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.reparse(CpsGroovyShell.java:127)
                  at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.parseScript(CpsFlowExecution.java:557)
                  at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.start(CpsFlowExecution.java:518)
                  at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:269)
                  at hudson.model.ResourceController.execute(ResourceController.java:97)
                  at hudson.model.Executor.run(Executor.java:405)

          The jobs are at a very early stage. The second job, for example, only has the following lines in its console output:

          Started by user [User name censored]
          

          Does this mean anything to anyone?
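          One possible explanation for "cannot find any job holding the lock": the lock in these traces is a java.util.concurrent-style lock (org.codehaus.groovy.util.LockableObject builds on AbstractQueuedSynchronizer, as the acquire() frames show). For such locks, a plain jstack only prints "parking to wait for <0x...>" on the waiter; the owning thread is never marked with a "locked <0x...>" line, so the owner is invisible in an ordinary dump. Taking the dump with `jstack -l` adds a "Locked ownable synchronizers" section, which shows owners for standard j.u.c. locks such as ReentrantLock (whether it does for LockableObject depends on whether that implementation records its owner). A minimal sketch with hypothetical thread names, showing that the JVM does know the owner of a parked-on lock and can report it via ThreadMXBean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class ParkedLockOwnerDemo {
    public static void main(String[] args) throws Exception {
        ReentrantLock lock = new ReentrantLock(); // stands in for the Groovy segment lock
        CountDownLatch acquired = new CountDownLatch(1);

        // The "hidden" owner: it holds the lock, yet its jstack trace
        // would show no "locked <...>" line for it.
        Thread owner = new Thread(() -> {
            lock.lock();
            acquired.countDown();
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        }, "hidden-owner");
        owner.setDaemon(true);
        owner.start();
        acquired.await();

        // The waiter parks on the synchronizer, like the CpsFlowExecution
        // thread in the dump ("parking to wait for <0x...>").
        Thread waiter = new Thread(lock::lock, "parked-waiter");
        waiter.setDaemon(true);
        waiter.start();
        Thread.sleep(500); // give the waiter time to park

        // ThreadMXBean resolves the owner even though plain jstack does not show it.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo info = mx.getThreadInfo(waiter.getId());
        System.out.println(info.getThreadState() + " on " + info.getLockName()
                + ", owner = " + info.getLockOwnerName());
    }
}
```

          The waiter is reported as WAITING on the lock's sync object, with "owner = hidden-owner" - exactly the piece of information missing from the dumps above.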

           

          Show
          eagle_rainbow Nico Schmoigl added a comment - After a longer period of not having experienced the problem, we just encountered it again. I could take a jstack snapshot and I found the following interesting thread stack trace: "Running CpsFlowExecution[Owner[apppname1/appname2/appname3/appname4/134675:appname1/appname2/appname3/appname4 #134675]]" #164285 daemon prio=5 os_prio=0 tid=0x00007f668d046000 nid=0x2b16 waiting on condition [0x00007f65db018000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000d6edcc30> (a org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at org.codehaus.groovy.util.LockableObject.lock(LockableObject.java:37) at org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment.removeEntry(AbstractConcurrentMapBase.java:173) at org.codehaus.groovy.util.ManagedConcurrentMap$Entry.finalizeReference(ManagedConcurrentMap.java:81) at org.codehaus.groovy.util.ManagedConcurrentMap$EntryWithValue.finalizeReference(ManagedConcurrentMap.java:115) at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$EntryWithValue.finalizeReference(GroovyClassValuePreJava7.java:51) at org.codehaus.groovy.util.ReferenceManager$CallBackedManager.removeStallEntries0(ReferenceManager.java:108) at org.codehaus.groovy.util.ReferenceManager$CallBackedManager.removeStallEntries(ReferenceManager.java:93) at org.codehaus.groovy.util.ReferenceManager$CallBackedManager.afterReferenceCreation(ReferenceManager.java:117) at 
org.codehaus.groovy.util.ReferenceManager$1.afterReferenceCreation(ReferenceManager.java:135) at org.codehaus.groovy.util.ManagedReference.<init>(ManagedReference.java:36) at org.codehaus.groovy.util.ManagedReference.<init>(ManagedReference.java:40) at org.codehaus.groovy.util.ManagedLinkedList$Element.<init>(ManagedLinkedList.java:40) at org.codehaus.groovy.util.ManagedLinkedList.add(ManagedLinkedList.java:102) at org.codehaus.groovy.reflection.ClassInfo$GlobalClassSet.add(ClassInfo.java:478) - locked <0x00000000d6e6aa68> (a org.codehaus.groovy.util.ManagedLinkedList) at org.codehaus.groovy.reflection.ClassInfo$1.computeValue(ClassInfo.java:83) at org.codehaus.groovy.reflection.ClassInfo$1.computeValue(ClassInfo.java:79) at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$EntryWithValue.<init>(GroovyClassValuePreJava7.java:37) at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment.createEntry(GroovyClassValuePreJava7.java:64) at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment.createEntry(GroovyClassValuePreJava7.java:55) at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:157) at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:100) at org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:38) at org.codehaus.groovy.reflection.GroovyClassValuePreJava7.get(GroovyClassValuePreJava7.java:94) at org.codehaus.groovy.reflection.ClassInfo.getClassInfo(ClassInfo.java:144) at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:258) at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:883) at org.codehaus.groovy.runtime.callsite.CallSiteArray.createCallStaticSite(CallSiteArray.java:75) at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56) at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:198) at dpGetOnboardingCredentials.<clinit>(dpGetOnboardingCredentials.groovy) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at org.jenkinsci.plugins.workflow.cps.global.UserDefinedGlobalVariable.getValue(UserDefinedGlobalVariable.java:54) at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:99) at sun.reflect.GeneratedMethodAccessor220.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93) at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325) at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1218) at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1027) at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42) at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113) at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:157) at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23) at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:133) at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155) at 
org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:159) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129) at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17) at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82) at sun.reflect.GeneratedMethodAccessor179.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72) at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.dispatch(CollectionLiteralBlock.java:55) at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.item(CollectionLiteralBlock.java:45) at sun.reflect.GeneratedMethodAccessor188.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72) at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21) at com.cloudbees.groovy.cps.Next.step(Next.java:83) at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174) at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163) at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:122) at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:261) at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163) at 
org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:19)
at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:35)
at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:32)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:32)
at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331)
at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82)
at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243)
at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231)
at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The interesting part is that I also cannot find any job holding the lock - and this for more than 5 minutes. I "killed" the thread using the monitoring plugin.

Very soon thereafter, the next job ran into the same problem. Its thread reads:

"Executor #-1 for master : executing appname5/appname6 #105" #168783 daemon prio=5 os_prio=0 tid=0x00007f668d137800 nid=0x49df waiting on condition [0x00007f65dd9d0000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000d6ed59e0> (a org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at org.codehaus.groovy.util.LockableObject.lock(LockableObject.java:37)
at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:104)
at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:100)
at org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:38)
at org.codehaus.groovy.reflection.GroovyClassValuePreJava7.get(GroovyClassValuePreJava7.java:94)
at org.codehaus.groovy.reflection.ClassInfo.getClassInfo(ClassInfo.java:144)
at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:258)
at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:883)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.createCallStaticSite(CallSiteArray.java:75)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:198)
at WorkflowScript.<clinit>(WorkflowScript)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at org.codehaus.groovy.runtime.InvokerHelper.createScript(InvokerHelper.java:434)
at groovy.lang.GroovyShell.parse(GroovyShell.java:700)
at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.doParse(CpsGroovyShell.java:133)
at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.reparse(CpsGroovyShell.java:127)
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.parseScript(CpsFlowExecution.java:557)
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.start(CpsFlowExecution.java:518)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:269)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:405)

The jobs are in a very early state. The second job, for example, only has the following line in the console output:

Started by user [User name censored]

Does this say anything to anyone?
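For anyone else stuck at "cannot find any job holding the lock": the JVM itself can usually name the owning thread. Below is a minimal sketch (plain java.lang.management, nothing Jenkins-specific; the class and method names are mine) that prints, for every thread currently waiting on a monitor or java.util.concurrent lock, which thread holds it. It can be run inside the affected JVM, for example via the script console.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class LockHolderDump {

    /**
     * Returns one line per thread that is blocked or parked on a lock,
     * naming the lock and (where the JVM knows it) the owning thread.
     */
    public static String lockOwners() {
        StringBuilder sb = new StringBuilder();
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // true, true = also collect locked monitors and ownable synchronizers,
        // so lock ownership can be resolved for ReentrantLock-style locks too.
        for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
            if (ti.getLockInfo() != null) {
                sb.append(String.format("%s waits on %s held by %s%n",
                        ti.getThreadName(), ti.getLockInfo(), ti.getLockOwnerName()));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(lockOwners());
    }
}
```

Note that getLockOwnerName() returns null when the thread is parked on something without a recorded owner (e.g. a plain condition wait), which may be exactly why the UI showed no holder.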
          vivek Vivek Pandey made changes -
          Labels pipeline threads pipeline threads triaged-2018-11
          mykola Mykola Marzhan added a comment -

          Downgrading to Jenkins 2.140 helped me; I don't see any issues anymore.
          BTW, the issue exists on versions 2.138.3 and 2.151.

          bherfurth Ben Herfurth added a comment -

          Latest LTS version 2.150.1 also has it.

           

          This is really annoying. We have to restart the complete server a few times a day.

          bherfurth Ben Herfurth made changes -
          Priority Major [ 3 ] Blocker [ 1 ]
          jrioux Joe Rioux added a comment -

          We are experiencing the same, and also have to reboot our server several times per day.
          Then sometimes it goes for days without a problem.

          I filed this bug - https://issues.jenkins-ci.org/browse/JENKINS-54894

          We are going to try creating a new server to resolve this since we also have 2 other servers, configured nearly the same as the one we keep having to reboot. The new one will be identical to the 2 we don't have a problem on.

          daxroc Damien Roche made changes -
          daxroc Damien Roche added a comment -

          I took a thread dump while we were hitting this issue today. We seem to run into a deadlock with EC2 provisioning - possibly related to the changes introduced to spawn instances without delay. Have others enabled this feature?

          Our symptoms: the Jenkins queue grows (not substantially) - jobs don't seem to execute even though we have available executors. New jobs fail to even launch - nothing in the logs either. To all intents and purposes we don't get any lightweight executors. We see the following log entry multiple times for jobs that were in progress.

          Jan 02, 2019 10:02:36 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0 INFO: Running CpsFlowExecution[Owner[<Omitted>/master #59]] unresponsive for 55 min
          

          hudson.plugins.ec2.AmazonEC2Cloud is blocking 3 threads.

           jenkins.util.Timer [#7] thread obtained hudson.plugins.ec2.AmazonEC2Cloud's lock & did not release it. Due to that 3 threads are BLOCKED as shown in the below graph. If threads are BLOCKED for prolonged period, application will become unresponsive. Examine 'jenkins.util.Timer [#7]' stacktrace to see why lock is not released.
          

jenkins.util.Timer [#7] - priority:5 - threadId:0x00007f21a00f9800 - nativeId:0x1e3ac - nativeId (decimal):123820 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c06ca768> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at hudson.model.Queue._withLock(Queue.java:1438)
at hudson.model.Queue.withLock(Queue.java:1301)
at jenkins.model.Nodes.updateNode(Nodes.java:193)
at jenkins.model.Jenkins.updateNode(Jenkins.java:2095)
at hudson.model.Node.save(Node.java:140)
at hudson.util.PersistedList.onModified(PersistedList.java:173)
at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
at hudson.model.Slave.<init>(Slave.java:198)
at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:138)
at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
at hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:963)
at hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:660)
at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:632)
at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:463)
at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:587)
- locked <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:602)
at hudson.plugins.ec2.NoDelayProvisionerStrategy.apply(NoDelayProvisionerStrategy.java:48)
at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Handling POST /cloud/ec2-us-east-1/provision from <omitted> : <omitted>
Stack Trace is:
java.lang.Thread.State: BLOCKED (on object monitor)
at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:568)
- waiting to lock <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.EC2Cloud.doProvision(EC2Cloud.java:358)
at java.lang.invoke.LambdaForm$DMH/565760380.invokeVirtual_LL_L(LambdaForm$DMH)
at java.lang.invoke.LambdaForm$BMH/222162113.reinvoke(LambdaForm$BMH)
at java.lang.invoke.LambdaForm$MH/775458576.invoker(LambdaForm$MH)
at java.lang.invoke.LambdaForm$MH/2100864823.invokeExact_MT(LambdaForm$MH)
at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:396)
at org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:408)
at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:212)
at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:145)
at org.kohsuke.stapler.MetaClass$11.doDispatch(MetaClass.java:537)
at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:739)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:870)
at org.kohsuke.stapler.MetaClass$4.doDispatch(MetaClass.java:282)
at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:739)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:870)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:668)
at org.kohsuke.stapler.Stapler.service(Stapler.java:238)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
at org.jenkinsci.plugins.ssegateway.Endpoint$SSEListenChannelFilter.doFilter(Endpoint.java:243)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at com.cloudbees.jenkins.support.slowrequest.SlowRequestFilter.doFilter(SlowRequestFilter.java:37)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at io.jenkins.blueocean.auth.jwt.impl.JwtAuthenticationFilter.doFilter(JwtAuthenticationFilter.java:61)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at com.smartcodeltd.jenkinsci.plugin.assetbundler.filters.LessCSS.doFilter(LessCSS.java:47)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at io.jenkins.blueocean.ResourceCacheControl.doFilter(ResourceCacheControl.java:134)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at hudson.plugins.greenballs.GreenBallFilter.doFilter(GreenBallFilter.java:59)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239)
at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215)
at net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:88)
at org.jvnet.hudson.plugins.monitoring.HudsonMonitoringFilter.doFilter(HudsonMonitoringFilter.java:114)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at jenkins.metrics.impl.MetricsFilter.doFilter(MetricsFilter.java:125)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at jenkins.telemetry.impl.UserLanguages$AcceptLanguageFilter.doFilter(UserLanguages.java:128)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:64)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249)
at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:67)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:90)
at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:171)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:503)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
at java.lang.Thread.run(Thread.java:748)

EC2 alive slaves monitor thread
Stack Trace is:
java.lang.Thread.State: BLOCKED (on object monitor)
at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:748)
- waiting to lock <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
at hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:475)
at hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:443)
at hudson.plugins.ec2.EC2SlaveMonitor.execute(EC2SlaveMonitor.java:43)
at hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:101)
at java.lang.Thread.run(Thread.java:748)
          jenkins.util.Timer [#10] 
          Stack Trace is: 
          java.lang.Thread.State: BLOCKED (on object monitor)
          at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:748)
          - waiting to lock <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud)
          at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
          at hudson.plugins.ec2.CloudHelper.getInstanceWithRetry(CloudHelper.java:25)
          at hudson.plugins.ec2.EC2Computer.getState(EC2Computer.java:127)
          at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:112)
          at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:90)
          at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:48)
          at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
          at hudson.model.Queue._withLock(Queue.java:1381)
          at hudson.model.Queue.withLock(Queue.java:1258)
          at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
          at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
          at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
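Stripped of the Jenkins specifics, these traces all show one pattern: a timer thread enters a cloud-wide synchronized section and then does slow work inside it (the EC2 API round-trips and Node.save(), which in turn wants the Queue lock), so every other periodic task that needs the same cloud lock stays BLOCKED for the whole duration. A minimal self-contained sketch of that pattern (hypothetical names, not the plugin's actual code):

```java
public class CloudLockPattern {
    private final Object cloudLock = new Object();
    private final long slowCallMillis;

    public CloudLockPattern(long slowCallMillis) {
        this.slowCallMillis = slowCallMillis;
    }

    /** ~ EC2Cloud.getNewOrExistingAvailableSlave: holds the cloud-wide lock
     *  across a slow remote call (and, in the real trace, the Queue lock). */
    public void provision() {
        synchronized (cloudLock) {
            sleep(slowCallMillis); // stand-in for the AWS call / Node.save()
        }
    }

    /** ~ EC2SlaveMonitor / EC2RetentionStrategy: BLOCKED until provision()
     *  releases the lock. Returns how long the caller was blocked, in ms. */
    public long monitorSlaves() {
        long start = System.nanoTime();
        synchronized (cloudLock) {
            // instance-state check would go here
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If the slow call has no upper bound (no network timeout), monitorSlaves() and every other caller stay blocked indefinitely, which would match the "unresponsive for 55 min" pings above.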
          Show
          daxroc Damien Roche added a comment - I took a thread dump during our encounter of this issue in the log today. We seem to encounter a Deadlock with Ec2 provisioning - possibly related to the changes introduced to spawn instances without delay. Have others enabled this feature? Our symptoms: Jenkins queue will grow (not substantially) - jobs don't seem to execute tho we have available executors. New jobs fail to even launch - nothing in logs either. All intensive purposes we don't get any lightweight executors. We see the following log entry multiple times for jobs that were in progress. Jan 02, 2019 10:02:36 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0 INFO: Running CpsFlowExecution[Owner[<Omitted>/master #59]] unresponsive for 55 min hudson.plugins.ec2.AmazonEC2Cloud  is blocking 3 threads.  jenkins.util.Timer [#7] thread obtained hudson.plugins.ec2.AmazonEC2Cloud 's lock & did not release it. Due to that 3 threads are BLOCKED as shown in the below graph. If threads are BLOCKED for prolonged period, application will become unresponsive. Examine ' jenkins.util.Timer [#7]' stacktrace to see why lock is not released. 
jenkins.util.Timer [#7] - priority:5 - threadId:0x00007f21a00f9800 - nativeId:0x1e3ac - nativeId (decimal):123820 - state:WAITING stackTrace: java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000c06ca768> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) at hudson.model.Queue._withLock(Queue.java:1438) at hudson.model.Queue.withLock(Queue.java:1301) at jenkins.model.Nodes.updateNode(Nodes.java:193) at jenkins.model.Jenkins.updateNode(Jenkins.java:2095) at hudson.model.Node.save(Node.java:140) at hudson.util.PersistedList.onModified(PersistedList.java:173) at hudson.util.PersistedList.replaceBy(PersistedList.java:85) at hudson.model.Slave.<init>(Slave.java:198) at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:138) at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49) at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42) at hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:963) at hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:660) at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:632) at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:463) at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:587) - locked <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud) at 
hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:602) at hudson.plugins.ec2.NoDelayProvisionerStrategy.apply(NoDelayProvisionerStrategy.java:48) at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320) at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61) at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Handling POST /cloud/ec2-us-east-1/provision from <omitted> : <omitted>  Stack Trace is:  java.lang.Thread.State: BLOCKED (on object monitor) at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:568) - waiting to lock <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud) at hudson.plugins.ec2.EC2Cloud.doProvision(EC2Cloud.java:358) at java.lang.invoke.LambdaForm$DMH/565760380.invokeVirtual_LL_L(LambdaForm$DMH) at java.lang.invoke.LambdaForm$BMH/222162113.reinvoke(LambdaForm$BMH) at java.lang.invoke.LambdaForm$MH/775458576.invoker(LambdaForm$MH) at java.lang.invoke.LambdaForm$MH/2100864823.invokeExact_MT(LambdaForm$MH) at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627) at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:396) at 
org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:408) at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:212) at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:145) at org.kohsuke.stapler.MetaClass$11.doDispatch(MetaClass.java:537) at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58) at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:739) at org.kohsuke.stapler.Stapler.invoke(Stapler.java:870) at org.kohsuke.stapler.MetaClass$4.doDispatch(MetaClass.java:282) at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58) at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:739) at org.kohsuke.stapler.Stapler.invoke(Stapler.java:870) at org.kohsuke.stapler.Stapler.invoke(Stapler.java:668) at org.kohsuke.stapler.Stapler.service(Stapler.java:238) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154) at org.jenkinsci.plugins.ssegateway.Endpoint$SSEListenChannelFilter.doFilter(Endpoint.java:243) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at com.cloudbees.jenkins.support.slowrequest.SlowRequestFilter.doFilter(SlowRequestFilter.java:37) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at io.jenkins.blueocean.auth.jwt.impl.JwtAuthenticationFilter.doFilter(JwtAuthenticationFilter.java:61) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at com.smartcodeltd.jenkinsci.plugin.assetbundler.filters.LessCSS.doFilter(LessCSS.java:47) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at io.jenkins.blueocean.ResourceCacheControl.doFilter(ResourceCacheControl.java:134) at 
hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at hudson.plugins.greenballs.GreenBallFilter.doFilter(GreenBallFilter.java:59) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239) at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215) at net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:88) at org.jvnet.hudson.plugins.monitoring.HudsonMonitoringFilter.doFilter(HudsonMonitoringFilter.java:114) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at jenkins.metrics.impl.MetricsFilter.doFilter(MetricsFilter.java:125) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at jenkins.telemetry.impl.UserLanguages$AcceptLanguageFilter.doFilter(UserLanguages.java:128) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:64) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84) at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at 
org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249) at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:67) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:90) at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:171) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:503)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
at java.lang.Thread.run(Thread.java:748)

"EC2 alive slaves monitor thread" stack trace:

java.lang.Thread.State: BLOCKED (on object monitor)
at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:748)
- waiting to lock <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
at hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:475)
at hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:443)
at hudson.plugins.ec2.EC2SlaveMonitor.execute(EC2SlaveMonitor.java:43)
at hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:101)
at java.lang.Thread.run(Thread.java:748)

"jenkins.util.Timer [#10]" stack trace:

java.lang.Thread.State: BLOCKED (on object monitor)
at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:748)
- waiting to lock <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
at hudson.plugins.ec2.CloudHelper.getInstanceWithRetry(CloudHelper.java:25)
at hudson.plugins.ec2.EC2Computer.getState(EC2Computer.java:127)
at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:112)
at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:90)
at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:48)
at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
at hudson.model.Queue._withLock(Queue.java:1381)
at hudson.model.Queue.withLock(Queue.java:1258)
at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
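The dump above shows multiple threads BLOCKED waiting on the same monitor (`<0x00000000d5800000>`). As a sketch of how to confirm this pattern on an affected instance: capture a thread dump and count how many threads pile up behind one lock. The file path below is an illustrative choice, and the way you find the Jenkins PID depends on your setup (assumptions, not part of the original report):

```shell
# Obtain a thread dump with jstack (ships with the JDK); <jenkins-pid> is
# whatever your Jenkins master's PID is, e.g. found via `pgrep -f jenkins`:
#
#   jstack -l <jenkins-pid> > /tmp/jenkins-threaddump.txt
#
# `kill -3 <jenkins-pid>` also works (it writes the dump to the JVM's
# stdout/log), which can help when the process already refuses to respond.

DUMP=/tmp/jenkins-threaddump.txt
if [ -f "$DUMP" ]; then
  # How many threads are BLOCKED at all:
  echo "BLOCKED threads: $(grep -c 'java.lang.Thread.State: BLOCKED' "$DUMP")"
  # Which monitors they are queued on; one address dominating the list
  # points at a single held lock, as in the traces above:
  grep -o 'waiting to lock <0x[0-9a-f]*>' "$DUMP" | sort | uniq -c | sort -rn
fi
```

The thread that appears as "locked <0x...>" (rather than "waiting to lock") in the full dump is the one actually holding the monitor, and its own stack shows what it is stuck doing.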
          eagle_rainbow Nico Schmoigl added a comment -

          Damien Roche, we had similar issues before, though I fear they are only loosely related to this one. In our case we are not using the EC2 cloud but the Docker cloud plugin (YAD). The root cause for us was that the plugin did not have a connection timeout configured. If a connection to the cloud manager then fails, the thread waits forever for an answer. To keep state consistent, however, the thread had acquired a lock beforehand, which is then never released... and that is where all the trouble started.

          That is why I would suggest having a look at your timeout values (I don't know whether they are configurable in the case of EC2) and, if applicable, posting them here for further cross-checking. If they are too high, you should fix that first.
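The failure mode described in this comment (network I/O with no timeout performed while holding a lock) can be sketched as follows. This is a minimal hypothetical class for illustration, not the actual EC2 or Docker plugin code; the 10-second timeout is an arbitrary example value:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch of the "blocking I/O under a lock" anti-pattern and its fix.
class CloudConnector {
    private final Object connectionLock = new Object();

    // Anti-pattern: connect with NO timeout while holding the lock. If the
    // remote endpoint never answers, this thread blocks forever inside the
    // synchronized block, and every other thread needing connectionLock
    // (monitors, retention checks, UI requests) goes BLOCKED behind it.
    void connectWithoutTimeout(String host, int port) throws Exception {
        synchronized (connectionLock) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port)); // may hang indefinitely
            }
        }
    }

    // Safer: bound the wait. On an unreachable endpoint this throws
    // SocketTimeoutException after ~10 s, the synchronized block exits,
    // and the lock is released for the other threads.
    void connectWithTimeout(String host, int port) throws Exception {
        synchronized (connectionLock) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), 10_000);
            }
        }
    }
}
```

An even better design keeps the slow I/O outside the critical section entirely and only takes the lock to publish the result, so a dead endpoint cannot wedge unrelated threads at all.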

          daxroc Damien Roche added a comment -

          Our timeouts are set quite high; I'll reduce those today. It looks like this is a known issue in the EC2 plugin: JENKINS-53858

          bpleines Branden added a comment -

          Also running into this issue. As others have noted, restarting Jenkins resolves it short term, but after a couple of days of uptime the problem resurfaces. Adding the versions of our affected environment.

           

          Jenkins Core: 2.289.1

          workflow-aggregator: 2.6
          workflow-api: 2.45
          workflow-basic-steps: 2.23
          workflow-cps: 2.92
          workflow-cps-global-lib: 2.19
          workflow-durable-task-step: 2.39
          workflow-job: 2.41
          workflow-multibranch: 2.26
          workflow-scm-step: 2.13
          workflow-step-api: 2.23
          workflow-support: 3.8
          pipeline-build-step: 2.13
          pipeline-graph-analysis: 1.11
          pipeline-input-step: 2.12
          pipeline-milestone-step: 1.3.2
          pipeline-model-api: 1.8.5
          pipeline-model-declarative-agent: N/A
          pipeline-model-definition: 1.8.5
          pipeline-model-extensions: 1.8.5
          pipeline-rest-api: 2.19
          pipeline-stage-step: 2.5
          pipeline-stage-tags-metadata: 1.8.5
          pipeline-stage-view: 2.19
          pipeline-utility-steps: 2.8.0

           


            People

            Assignee: Unassigned
            Reporter: pdouglas Philip Douglas
            Votes: 29
            Watchers: 43