[JENKINS-30655] Remoting blocks when the slave disconnects during copying files

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Components: core, remoting
    • Environment: Jenkins 1.599 and 1.609.3, Copy artifact 1.35

      We are copying a big file from another job to the test job, which takes 4~5 minutes. If the slave disconnects during those 4~5 minutes, copy artifact does not notice and does not stop, so the job runs forever. Cancelling the job has no effect at this point, and even disconnecting the slave doesn't stop the job. The only way out is to restart the master, and I mean reboot the master machine, because soft restarting the Jenkins process also hangs during the restart. This is really ugly when it happens, so the priority is blocker.


          ikedam added a comment -

          This looks like an issue in Jenkins core rather than in copyartifact, because copyartifact relies on Jenkins core to copy files from remotes.

          • Could you get thread dumps of both the master and the slave when the problem occurs? See the following page for instructions: https://wiki.jenkins-ci.org/display/JENKINS/Obtaining+a+thread+dump
          • Could you give more details about "soft restarting the Jenkins process will also hang during the restart"? What exactly happens?
          • I highly recommend using a Jenkins LTS version, as non-LTS versions are often unstable. Could you check whether the problem reproduces with the latest LTS version? https://jenkins-ci.org/#stable
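          For readers who want to see what a thread dump contributes here, the following is a minimal Java sketch that lists the threads currently in the BLOCKED state together with the lock they wait on and its owner. It uses only the standard java.lang.management API; the class name BlockedThreadReport is made up for this illustration, and the code only inspects the JVM it runs in, so it would have to run inside the process being diagnosed. It is not the procedure from the wiki page above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Jenkins.
public class BlockedThreadReport {

    /** Returns one line per thread of the current JVM that is BLOCKED on a monitor. */
    public static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        // Request monitor/synchronizer details only if this JVM supports them,
        // otherwise dumpAllThreads would throw UnsupportedOperationException.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());
        for (ThreadInfo info : infos) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                report.add("\"" + info.getThreadName() + "\" BLOCKED on "
                        + info.getLockName() + " owned by \""
                        + info.getLockOwnerName() + "\"");
            }
        }
        return report;
    }

    public static void main(String[] args) {
        for (String line : blockedThreads()) {
            System.out.println(line);
        }
    }
}
```

          The lock name and lock owner are the two fields that matter for this issue: they show which monitor an executor is waiting for and which thread is holding it.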

          JY Hsu added a comment -

          Soft restarting means going to https://jenkins-url/restart. When the problem happens, this restart never finishes. It appears that the whole Jenkins process is messed up at that point, and only rebooting the host machine resolves it. But this is a Jenkins server used by several product teams, so having to reboot every couple of days is a really big problem.


          JY Hsu added a comment -

          This is happening again even after I upgraded to LTS build 1.609.3. Attached is the thread dump. The slave having the problem is IC_Mac_01. The job name is LCMI_UnitTest.


          ikedam added a comment -

          This looks like the place where the block occurs:

          Executor #0 for IC_Mac_01 : executing LCMI_UnitTest #630
          "Executor #0 for IC_Mac_01 : executing LCMI_UnitTest #630" Id=14887 Group=main BLOCKED on hudson.remoting.Channel@2bdb2bdb owned by "Ping thread for channel hudson.remoting.Channel@2bdb2bdb:IC_Mac_01" Id=12907
          	at hudson.remoting.ProxyOutputStream.flush(ProxyOutputStream.java:153)
          	-  blocked on hudson.remoting.Channel@2bdb2bdb
          	-  locked hudson.remoting.ProxyOutputStream@2570257
          	at hudson.remoting.RemoteOutputStream.flush(RemoteOutputStream.java:114)
          	at java.io.FilterOutputStream.flush(FilterOutputStream.java:134)
          	at java.io.FilterOutputStream.close(FilterOutputStream.java:151)
          	at hudson.remoting.RemoteOutputStream.close(RemoteOutputStream.java:118)
          	at org.apache.commons.io.IOUtils.closeQuietly(IOUtils.java:303)
          	at org.apache.commons.io.IOUtils.closeQuietly(IOUtils.java:274)
          	at hudson.FilePath$41.invoke(FilePath.java:2020)
          	at hudson.FilePath$41.invoke(FilePath.java:2010)
          	at hudson.FilePath.act(FilePath.java:989)
          	at hudson.FilePath.act(FilePath.java:967)
          	at hudson.FilePath.copyTo(FilePath.java:2010)
          	at hudson.plugins.copyartifact.FingerprintingCopyMethod.copyOne(FingerprintingCopyMethod.java:80)
          	at hudson.plugins.copyartifact.FingerprintingCopyMethod.copyAll(FingerprintingCopyMethod.java:64)
          

          It looks like this is caused by the remoting module in Jenkins core.
          I'll leave the assignee as is, expecting that a maintainer of Jenkins core will take over this issue.
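          One plausible reason why such a build cannot be cancelled is a general property of Java monitors: a thread that is BLOCKED waiting to enter a synchronized block does not react to Thread.interrupt(). The sketch below demonstrates only that JVM behaviour. The class name is made up, a plain Object stands in for the Channel monitor from the dump, and modelling an abort as an interrupt is an assumption; none of this is Jenkins or remoting code.

```java
// Hypothetical demo for illustration; not Jenkins or remoting code.
public class MonitorInterruptDemo {

    /**
     * Starts a worker that tries to enter a monitor held by the caller,
     * interrupts it, and returns the worker's state after the interrupt.
     */
    public static Thread.State stateAfterInterrupt() throws InterruptedException {
        final Object channel = new Object(); // stand-in for the contended monitor
        Thread worker = new Thread(() -> {
            synchronized (channel) {
                // Reached only after the caller releases the monitor.
            }
        }, "executor-stand-in");
        worker.setDaemon(true);

        Thread.State observed;
        synchronized (channel) { // the caller plays the role of the lock owner
            worker.start();
            while (worker.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10); // wait until the worker is stuck at monitor entry
            }
            worker.interrupt();   // assumption: an abort request modelled as an interrupt
            Thread.sleep(200);    // give the interrupt time to take effect, if it could
            observed = worker.getState();
        }
        worker.join(); // with the monitor released, the worker finishes normally
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(stateAfterInterrupt()); // prints BLOCKED
    }
}
```

          The worker leaves the BLOCKED state only when the owner releases the monitor. If the owning thread is itself stuck, for example on I/O to an unreachable peer, the waiting thread stays blocked no matter how often it is interrupted.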


          ikedam added a comment -

          k76154, could you attach the following logs from the time the slave was disconnected? That might help the investigation.

          • The console log of the build (LCMI_UnitTest #630)
          • Jenkins system logs, e.g. /var/log/jenkins/jenkins.log (the location depends on how you launch Jenkins).


          ikedam added a comment -

          Jenkins 1.609.3 uses remoting-2.52.


          ikedam added a comment -

          I tried to reproduce this on Windows 8.1 (64-bit), with a slave launched via Java Web Start on the same machine.

          Jenkins   CopyArtifact   Disconnect the slave   Abort the hanging build
          1.509.4   1.33           Build fails            N/A
          1.580.3   1.33           Build hangs            Build is aborted
          1.580.3   1.36           Build hangs            Build is aborted
          1.609.3   1.36           Build hangs            Build is aborted

          • There appears to be a regression between Jenkins 1.509.4 and 1.580.3. I'll bisect the versions to find the one that introduced it.
          • I could always abort the blocked builds.


          ikedam added a comment -

          k76154 Would you attach logs (console logs of builds and Jenkins system logs) when you abort blocked builds?


          JY Hsu added a comment -

          Started by upstream project "LCMI20.0 Test Step" build number 69
          originally caused by:
          Started by upstream project "LCMI20.0 Pipeline" build number 85
          originally caused by:
          Started by upstream project "LCMI20.0 Start Pipeline" build number 105
          originally caused by:
          Started by an SCM change
          [EnvInject] - Loading node environment variables.
          Building remotely on IC_Mac_01 in workspace /Users/Automation/Jenkins/workspace/LCMI_UnitTest

          Deleting project workspace... done

          [EnvInject] - Executing scripts and injecting environment variables after the SCM step.
          [EnvInject] - Injecting as environment variables the properties content
          XCTOOL=/Users/automation/xctool
          BUILD_HOME=/Users/automation/Jenkins/workspace/LCMI_UnitTest/B_LCMI20.0_Connections.test/buildartifacts

          [EnvInject] - Variables injected successfully.
          Build timed out (after 10 minutes). Marking the build as aborted.
          Build timed out (after 10 minutes). Marking the build as failed.

          I have no access to the master machine, so I can only provide the job log


          ikedam added a comment -

          Jenkins   CopyArtifact   remoting   Disconnect the slave
          1.554.3   1.33           2.36       Build fails
          1.565.3   1.33           2.46       Build sometimes hangs
          1.580.3   1.33           2.47       Build always hangs


          JY Hsu added a comment -

          I don't see the problem if I manually disconnect the slave. It looks like the problem happens when the slave doesn't disconnect but instead has a network problem. Or it may not be related to the network at all, and the copy simply went into a deadlock.


          ikedam added a comment -

          Does that mean no additional logs are output even after you click the abort button ("x" button)?


          JY Hsu added a comment - - edited

          The problem happened again on 10/08 around 6:30PM~7:00PM. See the attached Jenkins log


          ikedam added a comment -

          I'm still bisecting the version... (it takes much time to download old Jenkins war files)

          Jenkins   CopyArtifact   remoting   Disconnect the slave
          1.554     1.33           2.33       Build fails
          1.554.3   1.33           2.36       Build fails
          1.560     1.33           2.39       Build hangs


          ikedam added a comment -

          Bisecting completed.
          The problem appears to have been introduced in Jenkins 1.560.

          Jenkins   CopyArtifact   remoting   Disconnect the slave
          1.559     1.33           2.37       Build fails
          1.560     1.33           2.39       Build hangs
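
          The bisection reported in this thread was done by hand, installing one Jenkins version at a time. As a rough sketch of the search strategy only, the Java code below runs a binary search over an ordered list of versions. The class name, the firstBad helper, and the lookup set that stands in for the manual "disconnect the slave and see whether the build hangs" test are all illustrative assumptions; the version outcomes are the ones reported in the tables of this thread, with "sometimes hangs" counted as a hang.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper for illustration; the real test was performed manually.
public class VersionBisect {

    /**
     * Returns the index of the first "bad" version. Assumes the list is ordered
     * oldest to newest, the first entry is good, the last entry is bad, and
     * every version after the first bad one is also bad.
     */
    public static <T> int firstBad(List<T> versions, Predicate<? super T> isBad) {
        int good = 0;                  // highest index known to be good
        int bad = versions.size() - 1; // lowest index known to be bad
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (isBad.test(versions.get(mid))) {
                bad = mid;
            } else {
                good = mid;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Versions and outcomes as reported in this thread.
        List<String> versions = Arrays.asList("1.554", "1.559", "1.560", "1.565.3", "1.580.3");
        Set<String> hangs = new HashSet<>(Arrays.asList("1.560", "1.565.3", "1.580.3"));
        int index = firstBad(versions, hangs::contains);
        System.out.println(versions.get(index)); // prints 1.560
    }
}
```

          Each probe halves the remaining range, which matters when every probe means downloading and starting another Jenkins war file.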


          ikedam added a comment -

          I found that the hang doesn't reproduce with Jenkins-1.559 + remoting-2.39 (I built that by modifying the source code), so this should be caused by changes in core rather than changes in remoting.

          The hang in my environment appears to be caused by d4c74bf.
          With this change reverted, I can no longer reproduce the hang.

          > k76154

          Please let us know the following:

          • How do you launch your slaves?
            • The suspected change affects only JNLP slaves. It might not be related to this problem if you use SSH slaves.
          • How do you cancel jobs?
            • You appear to use the build-timeout plugin. Aborting by build-timeout and aborting by clicking the "x" button work in different ways.


          JY Hsu added a comment -

          I connect through JNLP and Java Web Start, because I am running mobile tests and must have UI access. This is the only way to get it. None of the headless ways of connecting the slave can launch the emulator/simulator.

          I cancelled with both timeout plugin and clicking the x. Neither of them worked.


          Silviu Marchis added a comment -

          I think we have a similar problem running Jenkins 1.625.3.
          We cannot cancel the job.
          The job has been running for 14 days, and the last job log output is:

          [EnvInject] - Variables injected successfully.
          [EnvInject] - Injecting as environment variables the properties content 
          LOG_DIR=$WORKSPACE/module/Jobname/log
          
          [EnvInject] - Variables injected successfully.
          
          
          "Executor #1 for Slavename : executing Jobname #1159" Id=166016 Group=main BLOCKED on hudson.remoting.ProxyOutputStream@2ec1aa81 owned by "Computer.threadPoolForRemoting [#2813] : IO ID=406741 : seq#=406740" Id=166006
          	at hudson.remoting.ProxyOutputStream.flush(ProxyOutputStream.java:152)
          	-  blocked on hudson.remoting.ProxyOutputStream@2ec1aa81
          	at hudson.remoting.RemoteOutputStream.flush(RemoteOutputStream.java:114)
          	at java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
          	at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
          	at hudson.plugins.copyartifact.FingerprintingCopyMethod.copyOne(FingerprintingCopyMethod.java:85)
          	at hudson.plugins.copyartifact.CopyArtifact.perform(CopyArtifact.java:531)
          	at hudson.plugins.copyartifact.CopyArtifact.perform(CopyArtifact.java:436)
          	at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
          	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:785)
          	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.build(MavenModuleSetBuild.java:919)
          	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:671)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
          	at hudson.model.Run.execute(Run.java:1741)
          	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:531)
          	at hudson.model.ResourceController.execute(ResourceController.java:98)
          	at hudson.model.Executor.run(Executor.java:408)

          Oleg Nenashev added a comment -

          From what I see it's still an issue in the last version


          Sven Hickstein added a comment -

          This issue is still present in the current LTS release (2.73.2). Is there any workaround? We have a big master with hundreds of users. Restarting the Jenkins master is not an option whenever some Raspberry Pi goes offline and a job is blocked.

          Oleg Nenashev added a comment -

          Added it to my EPIC scope.
          hickstein Which Remoting version is being used on your master?


          Sven Hickstein added a comment -

          We currently use remoting version 3.10.2 (LTS 2.73.2).

          Oleg Nenashev added a comment -

          Unfortunately I have no capacity to work on Remoting in the medium term, so I will unassign this issue and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.


            Assignee: Unassigned
            Reporter: JY Hsu (k76154)
            Votes: 4
            Watchers: 7