Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-21251

Slave channel timeout seemingly caused by usage of System.currentTimeMillis()

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: Major Major
    • core, remoting
    • Windows XP, 32bit, Java 1.6
      Jenkins ver. 1.509.3

      My projects run on separate slave machines/nodes.
      Some of these projects have testcases that change the system clock.
      This causes Jenkins and the slave to immediately disconnect. It always happens.

      I've tried to dig into it a bit and here's what I have:
      I'm looking at hudson.remoting.PingThread.java
      The timeout on the channel ping is enforced by calculating the difference between 2 counter values, each obtained with System.currentTimeMillis().
      However System.currentTimeMillis() is not the right way of obtaining such counters - the values are not monotonically increasing.
      If wall-clock time changes for any reason (like maybe Daylight savings or a manual change) then the value returned might become smaller than a moment ago.
      Obtaining the counter values with System.nanoTime() would solve this.

      STACK TRACE below:

      FATAL: Unable to delete script file C:\WINDOWS\TEMP\hudson2959765976006672203.bat
      00:44:00 hudson.util.IOException2: remote file operation failed: C:\WINDOWS\TEMP\hudson2959765976006672203.bat at hudson.remoting.Channel@62ec4e:UTM-RT5AUTOM027
      00:44:00 at hudson.FilePath.act(FilePath.java:901)
      00:44:00 at hudson.FilePath.act(FilePath.java:878)
      00:44:00 at hudson.FilePath.delete(FilePath.java:1263)
      00:44:00 at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:101)
      00:44:00 at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:60)
      00:44:00 at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
      00:44:00 at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:802)
      00:44:00 at hudson.model.Build$BuildExecution.build(Build.java:199)
      00:44:00 at hudson.model.Build$BuildExecution.doRun(Build.java:160)
      00:44:00 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:584)
      00:44:00 at hudson.model.Run.execute(Run.java:1592)
      00:44:00 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      00:44:00 at hudson.model.ResourceController.execute(ResourceController.java:88)
      00:44:00 at hudson.model.Executor.run(Executor.java:237)
      00:44:00 Caused by: hudson.remoting.ChannelClosedException: channel is already closed
      00:44:00 at hudson.remoting.Channel.send(Channel.java:494)
      00:44:00 at hudson.remoting.Request.call(Request.java:129)
      00:44:00 at hudson.remoting.Channel.call(Channel.java:672)
      00:44:00 at hudson.FilePath.act(FilePath.java:894)
      00:44:00 ... 13 more
      00:44:00 Caused by: java.io.IOException: Unexpected termination of the channel
      00:44:00 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      00:44:00 Caused by: java.io.EOFException
      00:44:00 at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2554)
      00:44:00 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
      00:44:00 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
      00:44:00 at hudson.remoting.Command.readFrom(Command.java:92)
      00:44:00 at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:59)
      00:44:00 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      00:44:00 FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      00:44:00 hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      00:44:00 at hudson.remoting.Request.call(Request.java:174)
      00:44:00 at hudson.remoting.Channel.call(Channel.java:672)
      00:44:00 at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:158)
      00:44:00 at $Proxy45.join(Unknown Source)
      00:44:00 at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:925)
      00:44:00 at hudson.Launcher$ProcStarter.join(Launcher.java:360)
      00:44:00 at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:91)
      00:44:00 at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:60)
      00:44:00 at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
      00:44:00 at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:802)
      00:44:00 at hudson.model.Build$BuildExecution.build(Build.java:199)
      00:44:00 at hudson.model.Build$BuildExecution.doRun(Build.java:160)
      00:44:00 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:584)
      00:44:00 at hudson.model.Run.execute(Run.java:1592)
      00:44:00 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      00:44:00 at hudson.model.ResourceController.execute(ResourceController.java:88)
      00:44:00 at hudson.model.Executor.run(Executor.java:237)
      00:44:00 Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      00:44:00 at hudson.remoting.Request.abort(Request.java:299)
      00:44:00 at hudson.remoting.Channel.terminate(Channel.java:732)
      00:44:00 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
      00:44:00 Caused by: java.io.IOException: Unexpected termination of the channel
      00:44:00 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      00:44:00 Caused by: java.io.EOFException
      00:44:00 at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2554)
      00:44:00 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
      00:44:00 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
      00:44:00 at hudson.remoting.Command.readFrom(Command.java:92)
      00:44:00 at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:59)
      00:44:00 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

          [JENKINS-21251] Slave channel timeout seemingly caused by usage of System.currentTimeMillis()

          Daniel Beck added a comment -

          Are you able to reproduce the issue on recent Jenkins versions?

          Daniel Beck added a comment - Are you able to reproduce the issue on recent Jenkins versions?

          James Nord added a comment - - edited

          The code is susceptable to the clock changing when it is about to send the first Ping.
          Using System.nanoTime will have performance implications for Windows systems - so I am going to attempt a more conservative fix first.

          James Nord added a comment - - edited The code is susceptable to the clock changing when it is about to send the first Ping. Using System.nanoTime will have performance implications for Windows systems - so I am going to attempt a more conservative fix first.

          Code changed in jenkins
          User: James Nord
          Path:
          src/main/java/hudson/remoting/AtmostOneThreadExecutor.java
          src/main/java/hudson/remoting/Channel.java
          src/main/java/hudson/remoting/PingThread.java
          src/main/java/hudson/remoting/Request.java
          src/main/java/hudson/remoting/SingleLaneExecutorService.java
          src/main/java/hudson/remoting/SynchronousExecutorService.java
          http://jenkins-ci.org/commit/remoting/29ec965650bba57145df14f0fee73acbd8ce5248
          Log:
          [FIXED JENKINS-21251] Initial calculation of wait time was susseptable to
          clock drift.

          The calculation on how long to wait in various places on the code was
          suseptable to the clock changing as multiple calls where made to
          System.currentTimeMillis() for the initial calculation.

          Changed this so that for the initial calculation only one call will be
          made to System.currentTimeMillis().
          WHilst this does not guarantee that the actuall amount of time will have
          been waited for (and subsequent calculations are suseptable) at least on
          the hotstop VM Object.wait() will not have spurious wake-ups.
          Therefore the overhead of moving to System.nanoTime() is currently not
          considered to be worth the performance overhead.

          Compare: https://github.com/jenkinsci/remoting/compare/ff1a0669d363^...29ec965650bb

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: James Nord Path: src/main/java/hudson/remoting/AtmostOneThreadExecutor.java src/main/java/hudson/remoting/Channel.java src/main/java/hudson/remoting/PingThread.java src/main/java/hudson/remoting/Request.java src/main/java/hudson/remoting/SingleLaneExecutorService.java src/main/java/hudson/remoting/SynchronousExecutorService.java http://jenkins-ci.org/commit/remoting/29ec965650bba57145df14f0fee73acbd8ce5248 Log: [FIXED JENKINS-21251] Initial calculation of wait time was susseptable to clock drift. The calculation on how long to wait in various places on the code was suseptable to the clock changing as multiple calls where made to System.currentTimeMillis() for the initial calculation. Changed this so that for the initial calculation only one call will be made to System.currentTimeMillis(). WHilst this does not guarantee that the actuall amount of time will have been waited for (and subsequent calculations are suseptable) at least on the hotstop VM Object.wait() will not have spurious wake-ups. Therefore the overhead of moving to System.nanoTime() is currently not considered to be worth the performance overhead. Compare: https://github.com/jenkinsci/remoting/compare/ff1a0669d363 ^...29ec965650bb

          lacostej added a comment -

          Question: could using NTP to regularly reset clock drifts trigger this issue ?

          lacostej added a comment - Question: could using NTP to regularly reset clock drifts trigger this issue ?

          Code changed in jenkins
          User: James Nord
          Path:
          src/main/java/hudson/remoting/AtmostOneThreadExecutor.java
          src/main/java/hudson/remoting/Channel.java
          src/main/java/hudson/remoting/PingThread.java
          src/main/java/hudson/remoting/Request.java
          src/main/java/hudson/remoting/SingleLaneExecutorService.java
          src/main/java/hudson/remoting/SynchronousExecutorService.java
          http://jenkins-ci.org/commit/remoting/75fb3559e354418000760573eac19efcd86c267f
          Log:
          JENKINS-21251 Initial calculation of wait time was susseptable to
          clock drift.

          The calculation on how long to wait in various places on the code was
          suseptable to the clock changing as multiple calls where made to
          System.currentTimeMillis() for the initial calculation.

          Changed this so that we use a single call to System.nanoTime and made just
          a single call for the initial calculation.

          There is still a potential issue for any callers of Channel.getLastHeard()
          this will be addressed in a future change and has only been noted in this
          commit.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: James Nord Path: src/main/java/hudson/remoting/AtmostOneThreadExecutor.java src/main/java/hudson/remoting/Channel.java src/main/java/hudson/remoting/PingThread.java src/main/java/hudson/remoting/Request.java src/main/java/hudson/remoting/SingleLaneExecutorService.java src/main/java/hudson/remoting/SynchronousExecutorService.java http://jenkins-ci.org/commit/remoting/75fb3559e354418000760573eac19efcd86c267f Log: JENKINS-21251 Initial calculation of wait time was susseptable to clock drift. The calculation on how long to wait in various places on the code was suseptable to the clock changing as multiple calls where made to System.currentTimeMillis() for the initial calculation. Changed this so that we use a single call to System.nanoTime and made just a single call for the initial calculation. There is still a potential issue for any callers of Channel.getLastHeard() this will be addressed in a future change and has only been noted in this commit.

          James Nord added a comment -

          lacostej -
          if you use ntp deamon then probably not as it corrects the clock over a longer time.
          if you use cron and ntpdate then yes.

          James Nord added a comment - lacostej - if you use ntp deamon then probably not as it corrects the clock over a longer time. if you use cron and ntpdate then yes.

          Code changed in jenkins
          User: James Nord
          Path:
          src/main/java/hudson/remoting/AtmostOneThreadExecutor.java
          src/main/java/hudson/remoting/Channel.java
          src/main/java/hudson/remoting/PingThread.java
          src/main/java/hudson/remoting/Request.java
          src/main/java/hudson/remoting/SingleLaneExecutorService.java
          src/main/java/hudson/remoting/SynchronousExecutorService.java
          http://jenkins-ci.org/commit/remoting/7f35e250f5d535a9dbe16e67a47443aab568e7c4
          Log:
          Merge pull request #33 from jenkinsci/JENKINS-21251_take2

          JENKINS-21251 Slave channel timeout seemingly caused by usage of System.currentTimeMillis()

          Compare: https://github.com/jenkinsci/remoting/compare/c4df5cb9c458...7f35e250f5d5

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: James Nord Path: src/main/java/hudson/remoting/AtmostOneThreadExecutor.java src/main/java/hudson/remoting/Channel.java src/main/java/hudson/remoting/PingThread.java src/main/java/hudson/remoting/Request.java src/main/java/hudson/remoting/SingleLaneExecutorService.java src/main/java/hudson/remoting/SynchronousExecutorService.java http://jenkins-ci.org/commit/remoting/7f35e250f5d535a9dbe16e67a47443aab568e7c4 Log: Merge pull request #33 from jenkinsci/ JENKINS-21251 _take2 JENKINS-21251 Slave channel timeout seemingly caused by usage of System.currentTimeMillis() Compare: https://github.com/jenkinsci/remoting/compare/c4df5cb9c458...7f35e250f5d5

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          changelog.html
          core/pom.xml
          pom.xml
          http://jenkins-ci.org/commit/jenkins/72e0b9c53a15306e0a7b7117100f4d1b7ec250b8
          Log:
          [FIXED JENKINS-6167 JENKINS-21251 JENKINS-21160]

          Integrated a newer version of remoting & args4j

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: changelog.html core/pom.xml pom.xml http://jenkins-ci.org/commit/jenkins/72e0b9c53a15306e0a7b7117100f4d1b7ec250b8 Log: [FIXED JENKINS-6167 JENKINS-21251 JENKINS-21160] Integrated a newer version of remoting & args4j

          Oleg Nenashev added a comment -

          Moving to Resolved since there is a "take2". Please reopen if the issue still happens

          Oleg Nenashev added a comment - Moving to Resolved since there is a "take2". Please reopen if the issue still happens

            Unassigned Unassigned
            benc666 Ben Combrink
            Votes:
            3 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: