Jenkins / JENKINS-40613

DurableTaskStep.Execution.getWorkspace() should be subject to timeout

      Execution.check wraps most of its body in a timeout, but not the call to getWorkspace. If an agent is unresponsive, this can leave a Timer thread blocked indefinitely:

      "jenkins.util.Timer [#2] / waiting for hudson.remoting.Channel@...:..." ...
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
          at java.lang.Object.wait(Native Method)
          at hudson.remoting.Request.call(Request.java:147)
          - locked <...> (a hudson.remoting.UserRequest)
          at hudson.remoting.Channel.call(Channel.java:780)
          at hudson.FilePath.act(FilePath.java:1007)
          at hudson.FilePath.act(FilePath.java:996)
          at hudson.FilePath.isDirectory(FilePath.java:1532)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:174)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:244)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:233)
      

      Remote calls like isDirectory should have a timeout applied. Unfortunately, Jenkins core APIs commit the cardinal sin of distributed computing: they offer calls like FilePath.isDirectory() which throw InterruptedException but have no built-in timeout, so extra work is needed to bound any Remoting call with a timeout.
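
      A minimal sketch of the kind of wrapper this implies, assuming a controller-side executor and a hypothetical REMOTE_TIMEOUT_SECONDS constant (an illustration only, not the plugin's actual implementation):

          import hudson.FilePath;
          import java.io.IOException;
          import java.util.concurrent.ExecutionException;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.Future;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;

          class TimedRemoteCall {
              // Hypothetical limit; a real implementation would pick its own value.
              private static final long REMOTE_TIMEOUT_SECONDS = 20;

              private static final ExecutorService EXECUTOR = Executors.newCachedThreadPool();

              // Run the blocking Remoting call on a separate thread so the Timer thread
              // waits a bounded time instead of parking in Request.call indefinitely.
              static boolean isDirectoryWithTimeout(FilePath ws) throws IOException, InterruptedException {
                  Future<Boolean> f = EXECUTOR.submit(ws::isDirectory);
                  try {
                      return f.get(REMOTE_TIMEOUT_SECONDS, TimeUnit.SECONDS);
                  } catch (TimeoutException x) {
                      f.cancel(true); // interrupt the thread stuck waiting on the agent
                      throw new IOException("no response from agent within " + REMOTE_TIMEOUT_SECONDS + "s", x);
                  } catch (ExecutionException x) {
                      Throwable cause = x.getCause();
                      throw cause instanceof IOException ? (IOException) cause : new IOException(cause);
                  }
              }
          }

      On timeout the worker thread is interrupted, which should unblock Request.call (it is parked in Object.wait, as the thread dump above shows) and propagate an InterruptedException there.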


          Jesse Glick added a comment -

          Still remains in the JENKINS-38381 rewrite, though it is called much less frequently.


          Jesse Glick added a comment -

          Unfortunately I have not been able to reproduce the problem. For example, if I attach an agent (using CommandLauncher), obtain the Channel/FilePath (simulating the ws field here being populated), kill -9 the agent JVM, and call .isDirectory(), the command immediately fails:

          java.io.IOException: remote file operation failed: /tmp/slave at hudson.remoting.Channel@3b3848de:remote: hudson.remoting.ChannelClosedException: channel is already closed
              at hudson.FilePath.act(FilePath.java:1014)
              at hudson.FilePath.act(FilePath.java:996)
              at hudson.FilePath.isDirectory(FilePath.java:1532)
              at …
          Caused by: hudson.remoting.ChannelClosedException: channel is already closed
              at hudson.remoting.Channel.send(Channel.java:578)
              at hudson.remoting.Request.call(Request.java:130)
              at hudson.remoting.Channel.call(Channel.java:796)
              at hudson.FilePath.act(FilePath.java:1007)
              ... 84 more
          Caused by: java.io.IOException: Unexpected termination of the channel
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
          Caused by: java.io.EOFException
              at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2626)
              at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1321)
              at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
              at hudson.remoting.Command.readFrom(Command.java:94)
              at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:70)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
          

          So there is some sort of failure mode of Remoting which is not trivially reproduced.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/com/cloudbees/jenkins/support/impl/ThreadDumps.java
          http://jenkins-ci.org/commit/support-core-plugin/b7c48dbcef52be2ffab5cac57d177500f4165c55
          Log:
          Merge pull request #100 from jglick/Timer

          JENKINS-40613 Report a blocked Timer

          Compare: https://github.com/jenkinsci/support-core-plugin/compare/b131f686a77c...b7c48dbcef52
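
          For context on what that change reports, a hedged sketch of how a thread-dump scan can flag a jenkins.util.Timer thread parked inside a Remoting call (illustration only, not the support-core plugin's actual code; the class name is hypothetical):

          import java.util.Map;

          class BlockedTimerReport {
              // Scan all live threads and flag Timer threads whose stack shows them
              // waiting inside hudson.remoting.Request.call, i.e. waiting on an agent.
              static void printBlockedTimers() {
                  for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                      Thread t = e.getKey();
                      if (!t.getName().startsWith("jenkins.util.Timer")) {
                          continue;
                      }
                      for (StackTraceElement frame : e.getValue()) {
                          if ("hudson.remoting.Request".equals(frame.getClassName())
                                  && "call".equals(frame.getMethodName())) {
                              System.err.println("Timer thread blocked on a Remoting call: " + t.getName());
                              break;
                          }
                      }
                  }
              }
          }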


          Jesse Glick added a comment -

          Still see similar hangs all over the place, but no leads on how to reproduce.

          JENKINS-44785 tracks a general API fix for Remoting to time out calls properly.

          JENKINS-41854 notes a case where a hypothesized bug should be resulting in a ChannelClosedException.
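
          As an illustration of the pattern such an API-level fix could make routine, a hedged sketch using the existing Channel.callAsync and MasterToSlaveCallable APIs (the helper class and timeout parameter are hypothetical):

          import hudson.remoting.Channel;
          import hudson.remoting.Future;
          import java.io.File;
          import java.io.IOException;
          import java.util.concurrent.ExecutionException;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;
          import jenkins.security.MasterToSlaveCallable;

          class AsyncIsDirectory {
              // Issue the remote call asynchronously and bound the wait on the controller side,
              // instead of blocking unconditionally in Channel.call.
              static boolean isDirectory(Channel channel, final String path, long timeoutSec)
                      throws IOException, InterruptedException {
                  Future<Boolean> f = channel.callAsync(new MasterToSlaveCallable<Boolean, IOException>() {
                      @Override public Boolean call() throws IOException {
                          return new File(path).isDirectory(); // runs on the agent
                      }
                  });
                  try {
                      return f.get(timeoutSec, TimeUnit.SECONDS);
                  } catch (TimeoutException x) {
                      f.cancel(true);
                      throw new IOException("agent did not respond within " + timeoutSec + "s", x);
                  } catch (ExecutionException x) {
                      throw new IOException(x.getCause());
                  }
              }
          }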


          Renjith Pillai added a comment -

          We have a similar hang with this stack trace:

          jenkins.util.Timer [#9] / waiting for hudson.remoting.Channel@53047d0d:Swarm-mo-7232f481e
          java.lang.Object.wait(Native Method)
          hudson.remoting.Request.call(Request.java:147)
          hudson.remoting.Channel.call(Channel.java:796)
          hudson.FilePath.act(FilePath.java:1007)
          hudson.FilePath.act(FilePath.java:996)
          hudson.FilePath.isDirectory(FilePath.java:1532)
          org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:183)
          org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:294)
          org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:279)
          java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          java.util.concurrent.FutureTask.run(FutureTask.java:266)
          java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          java.lang.Thread.run(Thread.java:745)

           


          Basil Crow added a comment -

          We hit a similar hang with this stack trace:

          "jenkins.util.Timer [#10] / waiting for slave3 id=36883650" #42 daemon prio=5 os_prio=0 tid=0x00007f516407d800 nid=0x70c in Object.wait() [0x00007f5162810000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at hudson.remoting.Request.call(Request.java:147)
              - locked <0x00000006290146b8> (a hudson.remoting.UserRequest)
              at hudson.remoting.Channel.call(Channel.java:829)
              at hudson.FilePath.act(FilePath.java:987)
              at hudson.FilePath.act(FilePath.java:976)
              at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:121)
              at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:304)
              at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:279)
              at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
          

          In the Jenkins job itself, the following was printed:

          [Pipeline] withEnv
          [Pipeline] {
          [Pipeline] wrap
          [Pipeline] {
          [Pipeline] sh
          06:56:58 Running shell script
          06:57:08 Cannot contact slave3: java.io.IOException: java.lang.InterruptedException
          06:57:47 + rm -f /var/tmp/jenkins_slaves/path/to/script.out
          06:57:58 + rm -f /var/tmp/jenkins_slaves/path/to/script.out
          06:58:07 + rm -f /var/tmp/jenkins_slaves/path/to/script.out
          06:58:18 + rm -f /var/tmp/jenkins_slaves/path/to/script.out
          [... more of the same ...]
          

          I looked at the slave logs. There was nothing printed except for the successful launch of the agent about a week earlier. Similarly, the Jenkins console logs didn't show any output related to the slave. The only debugging information I could get was the stack trace mentioned above.


          Basil Crow added a comment -

          When I cancelled the above stuck job, the following was printed:

          java.io.EOFException
          	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2638)
          	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3113)
          	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:853)
          	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:349)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
          Caused: java.io.IOException: Unexpected termination of the channel
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
          Caused: hudson.remoting.ChannelClosedException: channel is already closed
          	at hudson.remoting.Channel.send(Channel.java:605)
          	at hudson.remoting.Request.call(Request.java:130)
          	at hudson.remoting.Channel.call(Channel.java:829)
          	at hudson.FilePath.act(FilePath.java:987)
          Caused: java.io.IOException: remote file operation failed: /var/tmp/jenkins_slaves/path/to/job@8 at hudson.remoting.Channel@1df7d99e:slave3
          	at hudson.FilePath.act(FilePath.java:994)
          	at hudson.FilePath.act(FilePath.java:976)
          	at hudson.FilePath.mkdirs(FilePath.java:1159)
          	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:79)
          	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:67)
          	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49)
          	at hudson.security.ACL.impersonate(ACL.java:260)
          	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          Finished: FAILURE
          


            Assignee: Unassigned
            Reporter: Jesse Glick (jglick)
            Votes: 0
            Watchers: 4