Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-54496

Node monitoring should not log exceptions for disconnected nodes

    • Icon: Improvement Improvement
    • Resolution: Fixed
    • Icon: Minor Minor
    • core, remoting
    • None
    • Jenkins 2.220

      In an environment in which Jenkins nodes are being created and removed a lot, e.g. when using the docker-plugin to create single-use containers as slave nodes, the node monitoring functionality can fill the logs full of spurious exceptions, e.g.

      Oct 13, 2018 10:18:14 PM io.jenkins.docker.DockerTransientNode$1 println
      INFO: Disconnected computer for node 'docker-hst02-00030qwaqtu3g'.
      Oct 13, 2018 10:18:14 PM io.jenkins.docker.DockerTransientNode$1 println
      INFO: Removed Node for node 'docker-hst02-00030qwaqtu3g'.
      Oct 13, 2018 10:18:14 PM hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor monitorDetailed
      WARNING: Failed to monitor docker-hst02-00030qwaqtu3g for Clock Difference
      java.util.concurrent.ExecutionException: java.io.IOException: Cannot locate RemoteClassLoader.ClassLoaderProxy(2) in the channel exported table
      	at hudson.remoting.Channel$2.adapt(Channel.java:992)
      	at hudson.remoting.Channel$2.adapt(Channel.java:986)
      	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:114)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
      	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
      Caused by: java.io.IOException: Cannot locate RemoteClassLoader.ClassLoaderProxy(2) in the channel exported table
      	at hudson.remoting.MultiClassLoaderSerializer$Input.readClassLoader(MultiClassLoaderSerializer.java:107)
      	at hudson.remoting.MultiClassLoaderSerializer$Input.resolveClass(MultiClassLoaderSerializer.java:128)
      	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
      	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
      	at hudson.remoting.UserRequest.deserialize(UserRequest.java:291)
      	at hudson.remoting.UserRequest$NormalResponse.retrieve(UserRequest.java:326)
      	at hudson.remoting.Channel$2.adapt(Channel.java:990)
      	... 5 more
      Caused by: java.util.concurrent.ExecutionException: Invalid object ID 2 iota=7
      	at hudson.remoting.ExportTable.diagnoseInvalidObjectId(ExportTable.java:478)
      	at hudson.remoting.ExportTable.get(ExportTable.java:397)
      	at hudson.remoting.Channel.getExportedObject(Channel.java:780)
      	at hudson.remoting.MultiClassLoaderSerializer$Input.readClassLoader(MultiClassLoaderSerializer.java:105)
      	... 14 more
      
      Oct 13, 2018 10:18:14 PM hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor monitorDetailed
      WARNING: Failed to monitor docker-hst02-00030qwaqtu3g for Free Temp Space
      java.util.concurrent.ExecutionException: java.io.IOException: Cannot locate RemoteClassLoader.ClassLoaderProxy(2) in the channel exported table
      	at hudson.remoting.Channel$2.adapt(Channel.java:992)
      	at hudson.remoting.Channel$2.adapt(Channel.java:986)
      	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:114)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
      	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
      Caused by: java.io.IOException: Cannot locate RemoteClassLoader.ClassLoaderProxy(2) in the channel exported table
      	at hudson.remoting.MultiClassLoaderSerializer$Input.readClassLoader(MultiClassLoaderSerializer.java:107)
      	at hudson.remoting.MultiClassLoaderSerializer$Input.resolveClass(MultiClassLoaderSerializer.java:128)
      	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
      	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
      	at hudson.remoting.UserRequest.deserialize(UserRequest.java:291)
      	at hudson.remoting.UserRequest$NormalResponse.retrieve(UserRequest.java:326)
      	at hudson.remoting.Channel$2.adapt(Channel.java:990)
      	... 5 more
      Caused by: java.util.concurrent.ExecutionException: Invalid object ID 2 iota=7
      	at hudson.remoting.ExportTable.diagnoseInvalidObjectId(ExportTable.java:478)
      	at hudson.remoting.ExportTable.get(ExportTable.java:397)
      	at hudson.remoting.Channel.getExportedObject(Channel.java:780)
      	at hudson.remoting.MultiClassLoaderSerializer$Input.readClassLoader(MultiClassLoaderSerializer.java:105)
      	... 14 more
      
      Oct 13, 2018 10:18:15 PM io.jenkins.docker.DockerTransientNode$1 println
      INFO: Stopped container 'f42781f352e6b9a08e66654e9d385999a4a210bb0956f3dc9f8125e169e245fe' for node 'docker-hst02-00030qwaqtu3g'.
      
      Oct 13, 2018 10:18:15 PM hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor monitorDetailed
      WARNING: Failed to monitor docker-hst02-00030qwaqtu3g for Free Disk Space
      java.util.concurrent.ExecutionException: java.io.IOException: Cannot locate RemoteClassLoader.ClassLoaderProxy(2) in the channel exported table
      	at hudson.remoting.Channel$2.adapt(Channel.java:992)
      	at hudson.remoting.Channel$2.adapt(Channel.java:986)
      	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:114)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
      	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
      Caused by: java.io.IOException: Cannot locate RemoteClassLoader.ClassLoaderProxy(2) in the channel exported table
      	at hudson.remoting.MultiClassLoaderSerializer$Input.readClassLoader(MultiClassLoaderSerializer.java:107)
      	at hudson.remoting.MultiClassLoaderSerializer$Input.resolveClass(MultiClassLoaderSerializer.java:128)
      	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
      	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
      	at hudson.remoting.UserRequest.deserialize(UserRequest.java:291)
      	at hudson.remoting.UserRequest$NormalResponse.retrieve(UserRequest.java:326)
      	at hudson.remoting.Channel$2.adapt(Channel.java:990)
      	... 5 more
      Caused by: java.util.concurrent.ExecutionException: Invalid object ID 2 iota=7
      	at hudson.remoting.ExportTable.diagnoseInvalidObjectId(ExportTable.java:478)
      	at hudson.remoting.ExportTable.get(ExportTable.java:397)
      	at hudson.remoting.Channel.getExportedObject(Channel.java:780)
      	at hudson.remoting.MultiClassLoaderSerializer$Input.readClassLoader(MultiClassLoaderSerializer.java:105)
      	... 14 more
      
      Oct 13, 2018 10:18:15 PM hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor monitorDetailed
      WARNING: Failed to monitor docker-hst02-00030qwaqtu3g for Free Swap Space
      java.util.concurrent.ExecutionException: java.io.IOException: Cannot locate RemoteClassLoader.ClassLoaderProxy(2) in the channel exported table
      	at hudson.remoting.Channel$2.adapt(Channel.java:992)
      	at hudson.remoting.Channel$2.adapt(Channel.java:986)
      	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:114)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
      	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
      Caused by: java.io.IOException: Cannot locate RemoteClassLoader.ClassLoaderProxy(2) in the channel exported table
      	at hudson.remoting.MultiClassLoaderSerializer$Input.readClassLoader(MultiClassLoaderSerializer.java:107)
      	at hudson.remoting.MultiClassLoaderSerializer$Input.resolveClass(MultiClassLoaderSerializer.java:128)
      	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
      	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
      	at hudson.remoting.UserRequest.deserialize(UserRequest.java:291)
      	at hudson.remoting.UserRequest$NormalResponse.retrieve(UserRequest.java:326)
      	at hudson.remoting.Channel$2.adapt(Channel.java:990)
      	... 5 more
      Caused by: java.util.concurrent.ExecutionException: Invalid object ID 2 iota=7
      	at hudson.remoting.ExportTable.diagnoseInvalidObjectId(ExportTable.java:478)
      	at hudson.remoting.ExportTable.get(ExportTable.java:397)
      	at hudson.remoting.Channel.getExportedObject(Channel.java:780)
      	at hudson.remoting.MultiClassLoaderSerializer$Input.readClassLoader(MultiClassLoaderSerializer.java:105)
      	... 14 more
      
      Oct 13, 2018 10:18:15 PM hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor monitorDetailed
      WARNING: Failed to monitor docker-hst02-00030qwaqtu3g for Response Time
      java.util.concurrent.ExecutionException: java.io.IOException: Cannot locate RemoteClassLoader.ClassLoaderProxy(2) in the channel exported table
      	at hudson.remoting.Channel$2.adapt(Channel.java:992)
      	at hudson.remoting.Channel$2.adapt(Channel.java:986)
      	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
      	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:114)
      	at hudson.node_monitors.ResponseTimeMonitor$1.monitor(ResponseTimeMonitor.java:57)
      	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
      Caused by: java.io.IOException: Cannot locate RemoteClassLoader.ClassLoaderProxy(2) in the channel exported table
      	at hudson.remoting.MultiClassLoaderSerializer$Input.readClassLoader(MultiClassLoaderSerializer.java:107)
      	at hudson.remoting.MultiClassLoaderSerializer$Input.resolveClass(MultiClassLoaderSerializer.java:128)
      	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
      	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
      	at hudson.remoting.UserRequest.deserialize(UserRequest.java:291)
      	at hudson.remoting.UserRequest$NormalResponse.retrieve(UserRequest.java:326)
      	at hudson.remoting.Channel$2.adapt(Channel.java:990)
      	... 5 more
      Caused by: java.util.concurrent.ExecutionException: Invalid object ID 2 iota=7
      	at hudson.remoting.ExportTable.diagnoseInvalidObjectId(ExportTable.java:478)
      	at hudson.remoting.ExportTable.get(ExportTable.java:397)
      	at hudson.remoting.Channel.getExportedObject(Channel.java:780)
      	at hudson.remoting.MultiClassLoaderSerializer$Input.readClassLoader(MultiClassLoaderSerializer.java:105)
      	... 14 more
      

      This is ugly and obscures "real errors" within the log. This causes real-world problems when trying to debug actual slave disconnection issues, as the log is full of "normal" exceptions that obscure any abnormal exceptions.

      Recommendation:
      Node monitoring code should only report exceptions to the main jenkins.err.log file if the slave node still exists by the time the exception is caught ; there's no point logging huge exception traces for a slave node which has already been removed.
      If the node doesn't exist then either suppress the logging completely or log it at a lower severity so it doesn't fill the main logfile.

          [JENKINS-54496] Node monitoring should not log exceptions for disconnected nodes

          Jeff Thompson added a comment -

          I'm not really sure what the desired behavior here is. It seems like the ask is that Jenkins would be better able to distinguish between an ephemeral and a permanent agent and then react or log differently in the two situations. It seems like these log messages might be important for permanent agents but excessive for ephemeral ones. Maybe this type of monitoring should not be enabled for ephemeral agents.

          Alternatively, the ask might be to improve the detection of when an agent no longer exists. Looking at the stack trace and the code, this operation is only called if Jenkins thinks the node exists.

          Jeff Thompson added a comment - I'm not really sure what the desired behavior here is. It seems like the ask is that Jenkins would be better able to distinguish between an ephemeral and a permanent agent and then react or log differently in the two situations. It seems like these log messages might be important for permanent agents but excessive for ephemeral ones. Maybe this type of monitoring should not be enabled for ephemeral agents. Alternatively, the ask might be to improve the detection of when an agent no longer exists. Looking at the stack trace and the code, this operation is only called if Jenkins thinks the node exists.

          pjdarton added a comment - - edited

          The node existed when the method started its work, but the node is removed during execution (causing the exception). It's a race condition, but one that's very common with high-throughout dynamic slaves.
          I strongly suspect that, if the code checked again "does this node exist" before logging the exception, it would discover that the node does not exist anymore (and, in my opinion, should either log at a lower level, or not log at all)

          pjdarton added a comment - - edited The node existed when the method started its work, but the node is removed during execution (causing the exception). It's a race condition, but one that's very common with high-throughout dynamic slaves. I strongly suspect that, if the code checked again "does this node exist" before logging the exception, it would discover that the node does not exist anymore (and, in my opinion, should either log at a lower level, or not log at all)

          pjdarton added a comment - - edited

          I've done a bit of digging...
          I believe the problem stems from the fact that monitorDetailed() gets the list of Computer instances from Jenkins at the start, but this list can change while the monitoring is taking place such that, by the time Future.get returns with an answer (or an exception), the Computer being monitored may have been removed from Jenkins.
          ...and if it has been removed from Jenkins and the monitoring threw an exception of the form "can't talk to this slave node anymore" then that's pretty much "expected behavior" rather than something worthy of logging a warning & exception trace.
          i.e. before this method logs a big complaint to the logfile, it should check to see if the slave node being monitored still exists in Jenkins as, if Jenkins has forgotten about it, we shouldn't go logging about it.

          I think that the answer may be to modify AbstractAsyncNodeMonitorDescriptor's private void error(Computer c, Throwable x) method so that instead of this:

              private void error(Computer c, Throwable x) {
                  if (c instanceof SlaveComputer) {
                      Functions.printStackTrace(x, ((SlaveComputer) c).getListener().error("Failed to monitor for " + getDisplayName()));
                  } else {
                      LOGGER.log(WARNING, "Failed to monitor " + c.getDisplayName() + " for " + getDisplayName(), x);
                  }
              }
          

          it does this

              private void error(Computer c, Throwable x) {
                  // JENKINS-54496: don't log if c was removed from Jenkins after we'd started monitoring
                  final boolean cIsStillCurrent = Jenkins.get().getComputer(c.getName()) == c;
                  if (!cIsStillCurrent) {
                      return;
                  }
                  if (c instanceof SlaveComputer) {
                      Functions.printStackTrace(x, ((SlaveComputer) c).getListener().error("Failed to monitor for " + getDisplayName()));
                  } else {
                      LOGGER.log(WARNING, "Failed to monitor " + c.getDisplayName() + " for " + getDisplayName(), x);
                  }
              }
          

          If I can persuade my dev environmemt to build me a custom jenkins.war with this code change, I'll be able to try that out and see if that resolves the issue ... but in the meantime, WDYT?

          pjdarton added a comment - - edited I've done a bit of digging... I believe the problem stems from the fact that monitorDetailed() gets the list of Computer instances from Jenkins at the start, but this list can change while the monitoring is taking place such that, by the time Future.get returns with an answer (or an exception), the Computer being monitored may have been removed from Jenkins. ...and if it has been removed from Jenkins and the monitoring threw an exception of the form "can't talk to this slave node anymore" then that's pretty much "expected behavior" rather than something worthy of logging a warning & exception trace. i.e. before this method logs a big complaint to the logfile, it should check to see if the slave node being monitored still exists in Jenkins as, if Jenkins has forgotten about it, we shouldn't go logging about it. I think that the answer may be to modify AbstractAsyncNodeMonitorDescriptor's private void error(Computer c, Throwable x) method so that instead of this : private void error(Computer c, Throwable x) { if (c instanceof SlaveComputer) { Functions.printStackTrace(x, ((SlaveComputer) c).getListener().error( "Failed to monitor for " + getDisplayName())); } else { LOGGER.log(WARNING, "Failed to monitor " + c.getDisplayName() + " for " + getDisplayName(), x); } } it does this private void error(Computer c, Throwable x) { // JENKINS-54496: don 't log if c was removed from Jenkins after we' d started monitoring final boolean cIsStillCurrent = Jenkins.get().getComputer(c.getName()) == c; if (!cIsStillCurrent) { return ; } if (c instanceof SlaveComputer) { Functions.printStackTrace(x, ((SlaveComputer) c).getListener().error( "Failed to monitor for " + getDisplayName())); } else { LOGGER.log(WARNING, "Failed to monitor " + c.getDisplayName() + " for " + getDisplayName(), x); } } If I can persuade my dev environmemt to build me a custom jenkins.war with this code change, I'll be able to try that out and see if that resolves the issue ... but in the meantime, WDYT?

          Jeff Thompson added a comment - - edited

          This makes good sense. I'm glad you were able to dig into it further and isolate the spurious messages you've been seeing. It would be great if you can see if that eliminates the issue for you. Even if you can't verify it it's probably a reasonable check. If you want to submit a PR for it, mention me so I can be sure to review ("jeffret-b" on GitHub). Or I could see about submitting the PR sometime.

          Jeff Thompson added a comment - - edited This makes good sense. I'm glad you were able to dig into it further and isolate the spurious messages you've been seeing. It would be great if you can see if that eliminates the issue for you. Even if you can't verify it it's probably a reasonable check. If you want to submit a PR for it, mention me so I can be sure to review ("jeffret-b" on GitHub). Or I could see about submitting the PR sometime.

          pjdarton added a comment -

          I was able to build a jenkins.war this afternoon; it's going "live" on my main Jenkins server tonight and so, with the workload that gets each day, I should be able to tell if it's resolved within a day or two (assuming no other work gets in the way, that is).

          If it works, I'll submit a PR.

          pjdarton added a comment - I was able to build a jenkins.war this afternoon; it's going "live" on my main Jenkins server tonight and so, with the workload that gets each day, I should be able to tell if it's resolved within a day or two (assuming no other work gets in the way, that is). If it works, I'll submit a PR.

          pjdarton added a comment -

          Well, it seems to be working. I'd normally expect to see over a dozen of those exceptions during the course of a morning and the log is quiet (well, it's still very active, but it's no longer getting lots of those exceptions ... which will make it easier for me to concentrate on tracking down the other exceptions I'm seeing now )

          PR-4461 submitted (by me - I'm "pjdarton" on here and on github).  Changes are as suggested above - just the addition of those 5 lines.

          Note that github doesn't seem to recognise "jeffret-g" so either it's not letting me see you, or there's a typo ... I did find a "jeffret-b" user with real name matching yours so I'm asssuming a b/g typo ... and if that's not the case, I've just pinged the wrong person...

          pjdarton added a comment - Well, it seems to be working. I'd normally expect to see over a dozen of those exceptions during the course of a morning and the log is quiet (well, it's still very active, but it's no longer getting lots of those exceptions ... which will make it easier for me to concentrate on tracking down the other exceptions I'm seeing now ) PR-4461 submitted (by me - I'm "pjdarton" on here and on github).  Changes are as suggested above - just the addition of those 5 lines. Note that github doesn't seem to recognise "jeffret-g" so either it's not letting me see you, or there's a typo ... I did find a "jeffret-b" user with real name matching yours so I'm asssuming a b/g typo ... and if that's not the case, I've just pinged the wrong person...

          Jeff Thompson added a comment -

          Argghh. Yes. I use the "g" version on a number of other places and mistyped it once I started. I got the PR. I'll review it today hopefully.

          Jeff Thompson added a comment - Argghh. Yes. I use the "g" version on a number of other places and mistyped it once I started. I got the PR. I'll review it today hopefully.

          Oleg Nenashev added a comment -

          It was released in Jenkins 2.220

          Oleg Nenashev added a comment - It was released in Jenkins 2.220

            jthompson Jeff Thompson
            pjdarton pjdarton
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: