Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-55597

Jenkins JNLP agent disconnects when it fails to delete an old build workspace

      The agent stays up, but the status in the dashboard keeps changing from online to offline and back. The server-side log at https://X/computer/Y/log shows these messages in a loop:

      JNLP agent connected from Y/Z
      Remoting version: 3.28
      This is a Unix agent
      
      deleting obsolete workspace /home/infinispan/workspace/Openshift_Services_PR-141
      
      java.nio.file.FileSystemException: /home/infinispan/workspace/Openshift_Services_PR-141/openshift.local.clusterup/openshift.local.volumes/pods: Operation not permitted
      	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
      	at sun.nio.fs.UnixFileAttributeViews$Posix.setMode(UnixFileAttributeViews.java:238)
      	at sun.nio.fs.UnixFileAttributeViews$Posix.setPermissions(UnixFileAttributeViews.java:260)
      	at java.nio.file.Files.setPosixFilePermissions(Files.java:2045)
      	at hudson.Util.makeWritable(Util.java:337)
      	at hudson.Util.tryOnceDeleteFile(Util.java:297)
      	at hudson.Util.deleteFile(Util.java:272)
      Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from nat-pool-rdu-t.redhat.com/66.187.233.202:45016
      		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1743)
      		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
      		at hudson.remoting.Channel.call(Channel.java:957)
      		at hudson.FilePath.act(FilePath.java:1072)
      		at hudson.FilePath.act(FilePath.java:1061)
      		at hudson.FilePath.deleteRecursive(FilePath.java:1268)
      		at jenkins.branch.WorkspaceLocatorImpl$Collector.onOnline(WorkspaceLocatorImpl.java:534)
      		at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:692)
      		at jenkins.slaves.DefaultJnlpSlaveReceiver.afterChannel(DefaultJnlpSlaveReceiver.java:168)
      		at org.jenkinsci.remoting.engine.JnlpConnectionState$4.invoke(JnlpConnectionState.java:421)
      		at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:312)
      		at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterChannel(JnlpConnectionState.java:418)
      		at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler$1.run(JnlpProtocol4Handler.java:334)
      		at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      		at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      Caused: java.io.IOException: Unable to delete '/home/infinispan/workspace/Openshift_Services_PR-141/openshift.local.clusterup/openshift.local.volumes/pods'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
      	at hudson.Util.deleteFile(Util.java:277)
      	at hudson.FilePath.deleteRecursive(FilePath.java:1305)
      	at hudson.FilePath.deleteContentsRecursive(FilePath.java:1314)
      	at hudson.FilePath.deleteRecursive(FilePath.java:1296)
      	at hudson.FilePath.deleteContentsRecursive(FilePath.java:1314)
      	at hudson.FilePath.deleteRecursive(FilePath.java:1296)
      	at hudson.FilePath.deleteContentsRecursive(FilePath.java:1314)
      	at hudson.FilePath.deleteRecursive(FilePath.java:1296)
      	at hudson.FilePath.access$1600(FilePath.java:213)
      	at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1274)
      	at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1270)
      	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3086)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:212)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
      	at hudson.remoting.Request$2.run(Request.java:369)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:93)
      	at java.lang.Thread.run(Thread.java:748)
      ERROR: Connection terminated
      java.nio.channels.ClosedChannelException
      	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
      	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
      	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
      	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
      	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
      	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:784)
      	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
      	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
      	at hudson.remoting.Channel.close(Channel.java:1452)
      	at hudson.remoting.Channel.close(Channel.java:1405)
      	at jenkins.slaves.DefaultJnlpSlaveReceiver.afterChannel(DefaultJnlpSlaveReceiver.java:173)
      	at org.jenkinsci.remoting.engine.JnlpConnectionState$4.invoke(JnlpConnectionState.java:421)
      	at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:312)
      	at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterChannel(JnlpConnectionState.java:418)
      	at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler$1.run(JnlpProtocol4Handler.java:334)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      The agent log doesn't include any of this, even after I configured j.u.l with `.level = ALL`.

      IMO failing to delete a build directory is not a fatal error, it should only affect builds that try to use exactly the same directory.

       

      Additionally, while the node is online, it may start to run a build, but then it immediately goes offline and the build fails with this error:

      Error when executing always post condition:
      org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
      Perhaps you forgot to surround the code with a step that provides this, such as: node
      	at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:261)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:260)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:176)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
      	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
      	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
      	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:157)
      	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:155)
      	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:159)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
      	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
      	at WorkflowScript.run(WorkflowScript:79)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.delegateAndExecute(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:133)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.runPostConditions(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:789)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.catchRequiredContextForNode(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:418)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.catchRequiredContextForNode(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:416)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.runPostConditions(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:788)
      	at com.cloudbees.groovy.cps.CpsDefaultGroovyMethods.each(CpsDefaultGroovyMethods:2030)
      	at com.cloudbees.groovy.cps.CpsDefaultGroovyMethods.each(CpsDefaultGroovyMethods:2015)
      	at com.cloudbees.groovy.cps.CpsDefaultGroovyMethods.each(CpsDefaultGroovyMethods:2056)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.runPostConditions(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:783)
      	at org.jenkinsci.plugins.pipeline.modeldefinition.ModelInterpreter.executePostBuild(jar:file:/var/lib/jenkins/plugins/pipeline-model-definition/WEB-INF/lib/pipeline-model-definition.jar!/org/jenkinsci/plugins/pipeline/modeldefinition/ModelInterpreter.groovy:760)
      	at ___cps.transform___(Native Method)
      	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
      	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
      	at sun.reflect.GeneratedMethodAccessor327.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
      	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
      	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
      	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$101(SandboxContinuable.java:34)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.lambda$run0$0(SandboxContinuable.java:59)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:121)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:58)
      	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:182)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:332)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:83)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:244)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:232)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
      

      It would be great if the error message would say explicitly that the build is failing because the node was disconnected, and if the agent-side logs included the error as well.

          [JENKINS-55597] Jenkins JNLP agent disconnects when it fails to delete an old build workspace

          Ryan Lopez added a comment -

          We are running into the same problem described above. Jenkins tries to delete files which are created by a Docker script as owned by root with no regular user write permissions. Jenkins tries to clean up the obsolete workspaces and immediately disconnects and reconnects. If anyone knows of a workaround or fix, we are very interested. Thanks!

          Ryan Lopez added a comment - We are running into the same problem described above. Jenkins tries to delete files which are created by a Docker script as owned by root with no regular user write permissions. Jenkins tries to clean up the obsolete workspaces and immediately disconnects and reconnects. If anyone knows of a workaround or fix, we are very interested. Thanks!

          Sean Talts added a comment -

          Same. My workaround so far has just been to log in to the machines and manually rm -rf the entire workspace. Unfortunately I have to do this every time Jenkins is restarted to e.g. install updates.

          Sean Talts added a comment - Same. My workaround so far has just been to log in to the machines and manually rm -rf the entire workspace. Unfortunately I have to do this every time Jenkins is restarted to e.g. install updates.

          Liam Newman added a comment -

          I believe this is fixed in Jenkins v2.177.
          Please comment if this should be reopened.

          Liam Newman added a comment - I believe this is fixed in Jenkins v2.177. Please comment if this should be reopened.

          James Howe added a comment -

          This also appears to affect SSH agents in 2.164.3, although the exception is a channel EOF instead of the FileSystemException.
          Deleting the directory manually allows the agent to startup correctly.

          Can this fix be ported to LTS please?

          James Howe added a comment - This also appears to affect SSH agents in 2.164.3, although the exception is a channel EOF instead of the FileSystemException. Deleting the directory manually allows the agent to startup correctly. Can this fix be ported to LTS please?

          Liam Newman added a comment - - edited

          jameshowe
          It is my understanding the LTS backporting is reserved for security issues and we missed the window for 2.164.3.
          The next LTS will be out in June, but the base looks like it will be 2.176 - missed by one. I will see what I can do to get this pulled into that release.

          In the meanwhile, do you have test instance you can try on 2.177 and verify this is fixed? That would be very helpful.

          Liam Newman added a comment - - edited jameshowe It is my understanding the LTS backporting is reserved for security issues and we missed the window for 2.164.3. The next LTS will be out in June, but the base looks like it will be 2.176 - missed by one. I will see what I can do to get this pulled into that release. In the meanwhile, do you have test instance you can try on 2.177 and verify this is fixed? That would be very helpful.

          James Howe added a comment - - edited

          To justify a security patch, any user who can submit a job that uses Docker can probably use this to make a DoS attack.
          That's how I'm getting it - the docker container runs as root and can write to the workspace, resulting in files the jenkins user cannot delete.

          James Howe added a comment - - edited To justify a security patch, any user who can submit a job that uses Docker can probably use this to make a DoS attack. That's how I'm getting it - the docker container runs as root and can write to the workspace, resulting in files the jenkins user cannot delete.

          Liam Newman added a comment -

          Okay, I was wrong about security only, it turns out to be just adding a label, which I've done for JENKINS-57111. We'll see how that goes.

          Liam Newman added a comment - Okay, I was wrong about security only, it turns out to be just adding a label, which I've done for JENKINS-57111 . We'll see how that goes.

          Liam Newman added a comment -

          jameshowe
          This will be in LTS 2.176.1.

          Liam Newman added a comment - jameshowe This will be in LTS 2.176.1.

            Unassigned Unassigned
            danberindei Dan Berindei
            Votes:
            4 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: