Type: Bug
Resolution: Unresolved
Priority: Blocker
Jenkins 1.599 and 1.609.3, Copy artifact 1.35
We are copying a big file from another job to the test job, which takes 4 to 5 minutes. The problem is that if the slave disconnects during those 4 to 5 minutes, the copy artifact step doesn't notice and doesn't stop. As a result, the job runs forever. Cancelling the job has no effect at that point, and even disconnecting the slave doesn't stop it. The only way out is to restart the master, and I mean reboot the master machine, because soft restarting the Jenkins process also hangs during the restart. This is really ugly when it happens, so the priority is blocker.
- Jenkins-ThreadDump.txt (291 kB, JY Hsu)
- jenkins.log-20151012.gz (187 kB, JY Hsu)
[JENKINS-30655] Remoting blocks when the slave disconnects during copying files
Soft restarting means going to https://jenkins-url/restart. When the problem happens, this restart never finishes. It appears that the whole Jenkins process is in a bad state at that point, and only rebooting the host machine resolves it. But this is a Jenkins server used by several product teams, so having to reboot every couple of days is a really big problem.
This is happening again even after I upgraded to LTS build 1.609.3. Attached is the thread dump. The slave that is having the problem is IC_Mac_01. The job name is LCMI_UnitTest.
This looks like the place where the block occurs:
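A thread dump of this kind records, for each thread, its state and the owner of any lock it is waiting on. As a minimal sketch of how that information can be pulled out programmatically, the following uses the standard `java.lang.management` API to list only the threads in the `BLOCKED` state. The class name `BlockedThreadReport` is hypothetical, and the code only inspects the JVM it runs in; how to run it inside a Jenkins master is not covered here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jenkins or remoting.
final class BlockedThreadReport {

    /** Returns one line per thread that is currently BLOCKED on a monitor. */
    static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        // false/false: no need to collect owned monitors or synchronizers here.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                report.add(String.format(
                        "\"%s\" Id=%d BLOCKED on %s owned by \"%s\" Id=%d",
                        info.getThreadName(),
                        info.getThreadId(),
                        info.getLockName(),
                        info.getLockOwnerName(),
                        info.getLockOwnerId()));
            }
        }
        return report;
    }

    public static void main(String[] args) {
        for (String line : blockedThreads()) {
            System.out.println(line);
        }
    }
}
```

Each reported line names both the waiting thread and the thread that owns the lock, which is the pair needed to see who is holding up a build.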
Executor #0 for IC_Mac_01 : executing LCMI_UnitTest #630
"Executor #0 for IC_Mac_01 : executing LCMI_UnitTest #630" Id=14887 Group=main BLOCKED on hudson.remoting.Channel@2bdb2bdb owned by "Ping thread for channel hudson.remoting.Channel@2bdb2bdb:IC_Mac_01" Id=12907
    at hudson.remoting.ProxyOutputStream.flush(ProxyOutputStream.java:153)
    - blocked on hudson.remoting.Channel@2bdb2bdb
    - locked hudson.remoting.ProxyOutputStream@2570257
    at hudson.remoting.RemoteOutputStream.flush(RemoteOutputStream.java:114)
    at java.io.FilterOutputStream.flush(FilterOutputStream.java:134)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:151)
    at hudson.remoting.RemoteOutputStream.close(RemoteOutputStream.java:118)
    at org.apache.commons.io.IOUtils.closeQuietly(IOUtils.java:303)
    at org.apache.commons.io.IOUtils.closeQuietly(IOUtils.java:274)
    at hudson.FilePath$41.invoke(FilePath.java:2020)
    at hudson.FilePath$41.invoke(FilePath.java:2010)
    at hudson.FilePath.act(FilePath.java:989)
    at hudson.FilePath.act(FilePath.java:967)
    at hudson.FilePath.copyTo(FilePath.java:2010)
    at hudson.plugins.copyartifact.FingerprintingCopyMethod.copyOne(FingerprintingCopyMethod.java:80)
    at hudson.plugins.copyartifact.FingerprintingCopyMethod.copyAll(FingerprintingCopyMethod.java:64)
It looks like this is caused by the remoting module in Jenkins core.
I'll leave the assignee unset, expecting that a maintainer of Jenkins core will take over this issue.
k76154 Would you attach the following logs from the time when the slave was disconnected? That might help the investigation.
- The console log of the build (LCMI_UnitTest #630)
- Jenkins system logs, e.g. /var/log/jenkins/jenkins.log (it depends on how you launch Jenkins).
I tested the reproduction on Windows 8.1 (64-bit), with a slave launched via Java Web Start on the same machine.
Jenkins | CopyArtifact | Disconnect the slave | Abort the hanging build |
---|---|---|---|
1.509.4 | 1.33 | Build fails | N/A |
1.580.3 | 1.33 | Build hangs | Build is aborted |
1.580.3 | 1.36 | Build hangs | Build is aborted |
1.609.3 | 1.36 | Build hangs | Build is aborted |
- There appears to be a regression between Jenkins 1.509.4 and 1.580.3. I'll bisect the versions to find the one that introduced it.
- I could always abort the blocked builds.
k76154 Would you attach the logs (console logs of builds and Jenkins system logs) from when you abort blocked builds?
Started by upstream project "LCMI20.0 Test Step" build number 69
originally caused by:
Started by upstream project "LCMI20.0 Pipeline" build number 85
originally caused by:
Started by upstream project "LCMI20.0 Start Pipeline" build number 105
originally caused by:
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on IC_Mac_01 in workspace /Users/Automation/Jenkins/workspace/LCMI_UnitTest
Deleting project workspace... done
[EnvInject] - Executing scripts and injecting environment variables after the SCM step.
[EnvInject] - Injecting as environment variables the properties content
XCTOOL=/Users/automation/xctool
BUILD_HOME=/Users/automation/Jenkins/workspace/LCMI_UnitTest/B_LCMI20.0_Connections.test/buildartifacts
[EnvInject] - Variables injected successfully.
Build timed out (after 10 minutes). Marking the build as aborted.
Build timed out (after 10 minutes). Marking the build as failed.
I have no access to the master machine, so I can only provide the job log.
Jenkins | CopyArtifact | remoting | Disconnect the slave |
---|---|---|---|
1.554.3 | 1.33 | 2.36 | Build fails |
1.565.3 | 1.33 | 2.46 | Build sometimes hangs |
1.580.3 | 1.33 | 2.47 | Build always hangs |
I don't see the problem if I manually disconnect the slave. It looks like the problem happens when the slave doesn't disconnect but instead has a network problem. Or it may not be related to the network at all, and the copy simply ran into a deadlock.
Does that mean no additional logs are output even after you click the abort button ("x" button)?
The problem happened again on 10/08 between around 6:30 PM and 7:00 PM. See the attached Jenkins log.
I'm still bisecting the versions... (it takes a long time to download old Jenkins war files)
Jenkins | CopyArtifact | remoting | Disconnect the slave |
---|---|---|---|
1.554 | 1.33 | 2.33 | Build fails |
1.554.3 | 1.33 | 2.36 | Build fails |
1.560 | 1.33 | 2.39 | Build hangs |
Bisecting completed.
This appears to have been introduced in Jenkins 1.560.
Jenkins | CopyArtifact | remoting | Disconnect the slave |
---|---|---|---|
1.559 | 1.33 | 2.37 | Build fails |
1.560 | 1.33 | 2.39 | Build hangs |
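The bisection described in this thread can be sketched as a binary search over release numbers. This is only an illustration: the class `VersionBisect` is hypothetical, weekly releases are simplified to their minor number (554 for 1.554), and the predicate is a stand-in for a manual reproduction run that simply encodes the result reported above (fails up to 1.559, hangs from 1.560). It also assumes that once the hang appears, every later version hangs too.

```java
import java.util.function.IntPredicate;

// Hypothetical helper illustrating the bisection; not part of Jenkins.
final class VersionBisect {

    /**
     * Finds the first version that hangs.
     * Precondition: 'good' does not hang, 'bad' hangs, and every version
     * after the first hanging one hangs as well.
     */
    static int firstBad(int good, int bad, IntPredicate hangs) {
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (hangs.test(mid)) {
                bad = mid;   // regression is at mid or earlier
            } else {
                good = mid;  // regression is after mid
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Stand-in for "install 1.<minor>, disconnect the slave, observe the build".
        IntPredicate hangs = minor -> minor >= 560;
        System.out.println("1." + firstBad(554, 580, hangs)); // prints 1.560
    }
}
```

With 26 candidate releases between 1.554 and 1.580, this needs only about five reproduction runs, which matters when each run means downloading an old war file.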
I found that the hang doesn't reproduce with Jenkins-1.559 + remoting-2.39 (I built that by modifying the source code), so this should be caused by changes in core rather than changes in remoting.
The hang in my environment appears to be caused by d4c74bf.
With this change reverted, the hang no longer reproduces.
k76154 Please let us know the following:
- How do you launch your slaves?
  - The suspected change affects only JNLP slaves. It might not be relevant to this problem if you use SSH slaves.
- How do you cancel jobs?
  - You appear to use the build-timeout plugin. Aborting via build-timeout and aborting by clicking the "x" button work in different ways.
I connect through JNLP and Java Web Start, because I am running mobile tests and must have UI access. This is the only way to get it. None of the headless ways of connecting the slave can launch the emulator/simulator.
I cancelled with both the timeout plugin and by clicking the "x". Neither of them worked.
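One possible reading of why neither cancellation path helped, assuming both ultimately interrupt the executor thread (an assumption, not confirmed in this thread): the thread dump earlier in this issue shows the executor in the `BLOCKED` state, waiting for a monitor owned by another thread, and a Java thread waiting to enter a `synchronized` block does not react to `Thread.interrupt()`. The sketch below demonstrates that behavior in isolation. `BlockedInterruptDemo` and `channelLock` are hypothetical names, and nothing here is Jenkins or remoting code.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demo; channelLock stands in for the monitor seen in the dump.
final class BlockedInterruptDemo {

    /** Returns the waiter's state after interrupt() and after the lock is released. */
    static Thread.State[] run() {
        final Object channelLock = new Object();
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        // Takes the lock and keeps it, like an operation stuck while holding it.
        Thread owner = new Thread(() -> {
            synchronized (channelLock) {
                lockHeld.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "lock-owner");

        // Tries to enter the same monitor, like the executor calling flush().
        Thread waiter = new Thread(() -> {
            synchronized (channelLock) {
                // Not reached until the owner releases the lock.
            }
        }, "executor");

        try {
            owner.start();
            lockHeld.await();
            waiter.start();
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);
            }
            waiter.interrupt();            // what an abort is assumed to do
            waiter.join(500);
            Thread.State afterInterrupt = waiter.getState();

            release.countDown();           // only releasing the lock frees the waiter
            waiter.join(5000);
            owner.join(5000);
            return new Thread.State[] { afterInterrupt, waiter.getState() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Thread.State[] states = run();
        System.out.println("after interrupt: " + states[0]); // BLOCKED
        System.out.println("after release:   " + states[1]); // TERMINATED
    }
}
```

If the lock owner itself never finishes, for example because it is waiting on a connection that is no longer there, the waiter stays blocked regardless of how many times it is interrupted.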
I think we have a similar problem running Jenkins 1.625.3.
We cannot cancel the job.
The job has been running for 14 days, and the last job log is:
[EnvInject] - Variables injected successfully.
[EnvInject] - Injecting as environment variables the properties content
LOG_DIR=$WORKSPACE/module/Jobname/log
[EnvInject] - Variables injected successfully.
"Executor #1 for Slavename : executing Jobname #1159" Id=166016 Group=main BLOCKED on hudson.remoting.ProxyOutputStream@2ec1aa81 owned by "Computer.threadPoolForRemoting [#2813] : IO ID=406741 : seq#=406740" Id=166006
    at hudson.remoting.ProxyOutputStream.flush(ProxyOutputStream.java:152)
    - blocked on hudson.remoting.ProxyOutputStream@2ec1aa81
    at hudson.remoting.RemoteOutputStream.flush(RemoteOutputStream.java:114)
    at java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
    at hudson.plugins.copyartifact.FingerprintingCopyMethod.copyOne(FingerprintingCopyMethod.java:85)
    at hudson.plugins.copyartifact.CopyArtifact.perform(CopyArtifact.java:531)
    at hudson.plugins.copyartifact.CopyArtifact.perform(CopyArtifact.java:436)
    at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:75)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:785)
    at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.build(MavenModuleSetBuild.java:919)
    at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:671)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
    at hudson.model.Run.execute(Run.java:1741)
    at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:531)
    at hudson.model.ResourceController.execute(ResourceController.java:98)
    at hudson.model.Executor.run(Executor.java:408)
This issue is still present in the current LTS release (2.73.2). Is there any workaround? We have a big master with hundreds of users. Restarting the Jenkins master is not an option whenever some Raspberry Pi goes offline and a job is blocked.
Added it to my EPIC scope.
hickstein Which Remoting version is being used on your master?
Unfortunately I have no capacity to work on Remoting in the medium term, so I will unassign it and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.
This is better considered an issue in Jenkins core, since copyartifact relies on Jenkins core functionality for copying files from remotes.