[JENKINS-30655] Remoting blocks when the slave disconnects during copying files

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Component/s: core, remoting
    • Environment: Jenkins 1.599 and 1.609.3, Copy artifact 1.35

      We are copying a big file from another job to the test job, which takes 4 to 5 minutes. The problem is that if the slave disconnects during those 4 to 5 minutes, Copy Artifact doesn't notice and doesn't stop. As a result, the job runs forever. Cancelling the job has no effect at this point, and even disconnecting the slave doesn't stop it. The only way out is to restart the master, and I mean rebooting the master machine, because a soft restart of the Jenkins process also hangs. This is really ugly when it happens, so the priority is Blocker.


          ikedam added a comment -

          I'm still bisecting the version... (it takes a long time to download old Jenkins WAR files)

          Jenkins   CopyArtifact   remoting   Disconnect the slave
          1.554     1.33           2.33       Build fails
          1.554.3   1.33           2.36       Build fails
          1.560     1.33           2.39       Build hangs


          ikedam added a comment -

          Bisecting completed.
          This looks like it was introduced in Jenkins 1.560.

          Jenkins   CopyArtifact   remoting   Disconnect the slave
          1.559     1.33           2.37       Build fails
          1.560     1.33           2.39       Build hangs


          ikedam added a comment -

          I found that the hang doesn't reproduce with Jenkins 1.559 + remoting 2.39 (I built that by modifying the source code), so it is likely caused by changes in core rather than changes in remoting.

          The hang in my environment appears to be caused by d4c74bf.
          With this change reverted, the hang no longer reproduces.

          > k76154

          Please let us know the following:

          • How do you launch your slaves?
            • The suspected change affects only JNLP slaves. It might not be related to this problem if you use SSH slaves.
          • How do you cancel jobs?
            • You appear to use the build-timeout plugin. Aborting via build-timeout and aborting by clicking the "x" button work in different ways.


          JY Hsu added a comment -

          I connect through JNLP and Java Web Start, because I am running mobile tests and must have UI access. This is the only way to get it. None of the headless ways of connecting the slave are able to launch the emulator/simulator.

          I tried cancelling both with the timeout plugin and by clicking the x. Neither of them worked.


          Silviu Marchis added a comment -

          I think we have a similar problem running Jenkins 1.625.3.
          We cannot cancel the job.
          The job has been running for 14 days, and the end of the job log is:

          [EnvInject] - Variables injected successfully.
          [EnvInject] - Injecting as environment variables the properties content 
          LOG_DIR=$WORKSPACE/module/Jobname/log
          
          [EnvInject] - Variables injected successfully.
          
          
          "Executor #1 for Slavename : executing Jobname #1159" Id=166016 Group=main BLOCKED on hudson.remoting.ProxyOutputStream@2ec1aa81 owned by "Computer.threadPoolForRemoting [#2813] : IO ID=406741 : seq#=406740" Id=166006
          	at hudson.remoting.ProxyOutputStream.flush(ProxyOutputStream.java:152)
          	-  blocked on hudson.remoting.ProxyOutputStream@2ec1aa81
          	at hudson.remoting.RemoteOutputStream.flush(RemoteOutputStream.java:114)
          	at java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
          	at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
          	at hudson.plugins.copyartifact.FingerprintingCopyMethod.copyOne(FingerprintingCopyMethod.java:85)
          	at hudson.plugins.copyartifact.CopyArtifact.perform(CopyArtifact.java:531)
          	at hudson.plugins.copyartifact.CopyArtifact.perform(CopyArtifact.java:436)
          	at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
          	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:785)
          	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.build(MavenModuleSetBuild.java:919)
          	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:671)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
          	at hudson.model.Run.execute(Run.java:1741)
          	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:531)
          	at hudson.model.ResourceController.execute(ResourceController.java:98)
          	at hudson.model.Executor.run(Executor.java:408)
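
          One possible reading of the thread dump above, offered as an illustration rather than a confirmed diagnosis: the executor thread is in the BLOCKED state, waiting to enter a monitor on a ProxyOutputStream that another thread owns. In Java, Thread.interrupt() does not wake a thread that is BLOCKED on a monitor. If cancelling a build amounts to interrupting the executor thread (an assumption here), that would be consistent with cancel having no effect. The sketch below uses only the JDK; the class name, LOCK object, and thread names are hypothetical stand-ins and none of it is Jenkins or Remoting code.

```java
import java.util.concurrent.CountDownLatch;

public class BlockedInterruptDemo {
    // Hypothetical stand-in for the contended stream object seen in the thread dump.
    private static final Object LOCK = new Object();

    /** Returns the waiting thread's state after it was interrupted while blocked on LOCK. */
    static Thread.State stateAfterInterrupt() throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // "Owner" thread: takes the monitor and keeps it until told to release,
        // playing the role of an I/O thread that never finishes.
        Thread owner = new Thread(() -> {
            synchronized (LOCK) {
                lockHeld.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "owner");

        // "Executor" thread: tries to enter the same monitor and has to wait.
        Thread executor = new Thread(() -> {
            synchronized (LOCK) {
                // Reached only after the owner releases the monitor.
            }
        }, "executor");

        owner.start();
        lockHeld.await();          // the owner now holds LOCK
        executor.start();
        while (executor.getState() != Thread.State.BLOCKED) {
            Thread.sleep(10);      // wait until the executor is really blocked
        }

        executor.interrupt();      // the "cancel" request in this sketch
        Thread.sleep(200);         // give the interrupt time to take effect, if it could
        Thread.State state = executor.getState();

        release.countDown();       // clean up so both threads can finish
        owner.join();
        executor.join();
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("State after interrupt: " + stateAfterInterrupt());
    }
}
```

          In this sketch the waiting thread only proceeds once the owner releases the monitor; the interrupt merely sets a flag that nothing checks while the thread is blocked.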

          Oleg Nenashev added a comment -

          From what I see, it's still an issue in the latest version.


          Sven Hickstein added a comment -

          This issue is still present in the current LTS release (2.73.2). Is there any workaround? We have a big master with hundreds of users. Restarting the Jenkins master is not an option whenever some Raspberry Pi goes offline and a job is blocked.

          Oleg Nenashev added a comment -

          Added it to my EPIC scope.
          hickstein Which Remoting version is being used on your master?


          Sven Hickstein added a comment -

          We currently use Remoting version 3.10.2 (LTS 2.73.2).

          Oleg Nenashev added a comment -

          Unfortunately, I have no capacity to work on Remoting in the medium term, so I will unassign it and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.


            Assignee: Unassigned
            Reporter: JY Hsu (k76154)
            Votes: 4
            Watchers: 7