Jenkins / JENKINS-40613

DurableTaskStep.Execution.getWorkspace() should be subject to timeout


Details

    Description

      DurableTaskStep.Execution.check wraps most of its body in a timeout, but the getWorkspace call is not covered. If an agent is unresponsive, this can leave a Timer thread blocked indefinitely:

      "jenkins.util.Timer [#2] / waiting for hudson.remoting.Channel@...:..." ...
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
          at java.lang.Object.wait(Native Method)
          at hudson.remoting.Request.call(Request.java:147)
          - locked <...> (a hudson.remoting.UserRequest)
          at hudson.remoting.Channel.call(Channel.java:780)
          at hudson.FilePath.act(FilePath.java:1007)
          at hudson.FilePath.act(FilePath.java:996)
          at hudson.FilePath.isDirectory(FilePath.java:1532)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:174)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:244)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:233)
      

      Remote calls like isDirectory should have a timeout applied. Unfortunately, the Jenkins core APIs commit the cardinal sin of distributed computing: they offer calls such as FilePath.isDirectory() which throw InterruptedException but have no built-in timeout, so extra effort is required to apply a timeout to any Remoting call.
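      One way to bound such a call today, until Remoting offers timed calls natively, is to run the blocking FilePath operation on a separate executor and wait on the Future with a timeout. The sketch below is illustrative only (the helper class, executor, and method names are made up, and this is not the fix ultimately applied here):

      import hudson.FilePath;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      /** Hypothetical helper: bounds a blocking FilePath.isDirectory() call. */
      class TimedRemoteCalls {
          private static final ExecutorService REMOTE_CALLS = Executors.newCachedThreadPool();

          static boolean isDirectoryWithTimeout(FilePath path, long timeout, TimeUnit unit) throws Exception {
              Future<Boolean> f = REMOTE_CALLS.submit(() -> path.isDirectory());
              try {
                  return f.get(timeout, unit);
              } catch (TimeoutException x) {
                  f.cancel(true); // interrupt the thread stuck in Request.call
                  throw x;
              }
          }
      }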

          Activity

            jglick Jesse Glick added a comment -

            Still remains in JENKINS-38381 rewrite, though called much less frequently.

            jglick Jesse Glick added a comment -

            Unfortunately I have not been able to reproduce the problem. For example, if I attach an agent (using CommandLauncher), obtain the Channel/FilePath (simulating the ws field here being populated), kill -9 the agent JVM, and call .isDirectory(), the command immediately fails:

            java.io.IOException: remote file operation failed: /tmp/slave at hudson.remoting.Channel@3b3848de:remote: hudson.remoting.ChannelClosedException: channel is already closed
                at hudson.FilePath.act(FilePath.java:1014)
                at hudson.FilePath.act(FilePath.java:996)
                at hudson.FilePath.isDirectory(FilePath.java:1532)
                at …
            Caused by: hudson.remoting.ChannelClosedException: channel is already closed
                at hudson.remoting.Channel.send(Channel.java:578)
                at hudson.remoting.Request.call(Request.java:130)
                at hudson.remoting.Channel.call(Channel.java:796)
                at hudson.FilePath.act(FilePath.java:1007)
                ... 84 more
            Caused by: java.io.IOException: Unexpected termination of the channel
                at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
            Caused by: java.io.EOFException
                at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2626)
                at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1321)
                at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
                at hudson.remoting.Command.readFrom(Command.java:94)
                at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:70)
                at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
            

            So there is some sort of failure mode of Remoting which is not trivially reproduced.
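            For reference, a rough sketch of the manual check described above (the agent name, path, and wrapper class are placeholders; it assumes a live Jenkins controller, e.g. a test harness, with the agent JVM killed via kill -9 beforehand):

            import hudson.FilePath;
            import hudson.model.Computer;
            import jenkins.model.Jenkins;

            class DeadChannelCheck {
                static void check() throws Exception {
                    Computer c = Jenkins.get().getComputer("test-agent"); // hypothetical agent name
                    FilePath ws = new FilePath(c.getChannel(), "/tmp/slave"); // simulates the cached ws field
                    // With a cleanly dead agent this fails fast with ChannelClosedException,
                    // as in the trace above, rather than hanging like the Timer thread in the description.
                    System.out.println(ws.isDirectory());
                }
            }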


            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/com/cloudbees/jenkins/support/impl/ThreadDumps.java
            http://jenkins-ci.org/commit/support-core-plugin/b7c48dbcef52be2ffab5cac57d177500f4165c55
            Log:
            Merge pull request #100 from jglick/Timer

            JENKINS-40613 Report a blocked Timer

            Compare: https://github.com/jenkinsci/support-core-plugin/compare/b131f686a77c...b7c48dbcef52

            jglick Jesse Glick added a comment -

            Still see similar hangs all over the place, but no leads on how to reproduce.

            JENKINS-44785 tracks a general API fix for Remoting to time out calls properly.

            JENKINS-41854 notes a case where a hypothesized bug should result in a ChannelClosedException.


            vrenjith Renjith Pillai added a comment -

            We have a similar hang with this stack trace:

            jenkins.util.Timer [#9] / waiting for hudson.remoting.Channel@53047d0d:Swarm-mo-7232f481e
            java.lang.Object.wait(Native Method)
            hudson.remoting.Request.call(Request.java:147)
            hudson.remoting.Channel.call(Channel.java:796)
            hudson.FilePath.act(FilePath.java:1007)
            hudson.FilePath.act(FilePath.java:996)
            hudson.FilePath.isDirectory(FilePath.java:1532)
            org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:183)
            org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:294)
            org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:279)
            java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            java.util.concurrent.FutureTask.run(FutureTask.java:266)
            java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
            java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
            java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            java.lang.Thread.run(Thread.java:745)


            basil Basil Crow added a comment -

            We hit a similar hang with this stack trace:

            "jenkins.util.Timer [#10] / waiting for slave3 id=36883650" #42 daemon prio=5 os_prio=0 tid=0x00007f516407d800 nid=0x70c in Object.wait() [0x00007f5162810000]
               java.lang.Thread.State: TIMED_WAITING (on object monitor)
                at java.lang.Object.wait(Native Method)
                at hudson.remoting.Request.call(Request.java:147)
                - locked <0x00000006290146b8> (a hudson.remoting.UserRequest)
                at hudson.remoting.Channel.call(Channel.java:829)
                at hudson.FilePath.act(FilePath.java:987)
                at hudson.FilePath.act(FilePath.java:976)
                at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:121)
                at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:304)
                at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:279)
                at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
                at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at java.lang.Thread.run(Thread.java:748)
            

            In the Jenkins job itself, the following was printed:

            [Pipeline] withEnv
            [Pipeline] {
            [Pipeline] wrap
            [Pipeline] {
            [Pipeline] sh
            06:56:58 Running shell script
            06:57:08 Cannot contact slave3: java.io.IOException: java.lang.InterruptedException
            06:57:47 + rm -f /var/tmp/jenkins_slaves/path/to/script.out
            06:57:58 + rm -f /var/tmp/jenkins_slaves/path/to/script.out
            06:58:07 + rm -f /var/tmp/jenkins_slaves/path/to/script.out
            06:58:18 + rm -f /var/tmp/jenkins_slaves/path/to/script.out
            [... more of the same ...]
            

            I looked at the slave logs. There was nothing printed except for the successful launch of the agent about a week earlier. Similarly, the Jenkins console logs didn't show any output related to the slave. The only debugging information I could get was the stack trace mentioned above.

            basil Basil Crow added a comment -

            When I cancelled the above stuck job, the following was printed:

            java.io.EOFException
            	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2638)
            	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3113)
            	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:853)
            	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:349)
            	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
            	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
            	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
            Caused: java.io.IOException: Unexpected termination of the channel
            	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
            Caused: hudson.remoting.ChannelClosedException: channel is already closed
            	at hudson.remoting.Channel.send(Channel.java:605)
            	at hudson.remoting.Request.call(Request.java:130)
            	at hudson.remoting.Channel.call(Channel.java:829)
            	at hudson.FilePath.act(FilePath.java:987)
            Caused: java.io.IOException: remote file operation failed: /var/tmp/jenkins_slaves/path/to/job@8 at hudson.remoting.Channel@1df7d99e:slave3
            	at hudson.FilePath.act(FilePath.java:994)
            	at hudson.FilePath.act(FilePath.java:976)
            	at hudson.FilePath.mkdirs(FilePath.java:1159)
            	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:79)
            	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:67)
            	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49)
            	at hudson.security.ACL.impersonate(ACL.java:260)
            	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at java.lang.Thread.run(Thread.java:748)
            Finished: FAILURE
            

            People

              Assignee: Unassigned
              Reporter: Jesse Glick
              Votes: 0
              Watchers: 4
