[JENKINS-48990] Long builds with no logs fail at Ubuntu after 1-2 hours

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Component/s: durable-task-plugin
    • Labels: None
    • Environment:
      Jenkins ver. 2.89.2
      Remoting Version 3.14
      Latest plugins for 01/17/2018
      Agent ubuntu.1, launched via ssh
      Ubuntu 14.04.5 LTS

      The simplest Jenkinsfile to reproduce:

      pipeline {
          agent { label "ubuntu.1" }
      
          options {
              disableConcurrentBuilds()
              ansiColor('xterm')
              timestamps()
          }
      
          stages {
              stage('Sleep') {
                  steps {
                      sh "sleep 99999999"
                  }
              }
          }
      }

      Log ends with:

      08:54:24 [Org_repository-5BKEJ4KU7KWRDM4GMA5EGH4UVHK4U74AML3VWSVHEXZJBWCI2QTQ] Running shell script
      08:54:25 + sleep 99999999
      11:30:55 Cannot contact ubuntu.1: java.lang.InterruptedException
      Post stage
      [Pipeline] archiveArtifacts
      11:31:11 Archiving artifacts
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // timestamps
      [Pipeline] }
      [Pipeline] // ansiColor
      [Pipeline] }
      [Pipeline] // timeout
      [Pipeline] End of Pipeline
      
      GitHub has been notified of this commit’s build result
      
      ERROR: script returned exit code -1
      Finished: FAILURE
      

      With logging for durable-task-plugin set to ALL, I see "heartbeat touches apparently not running in ..."

      Since the failures started about a month ago, this seems to be a regression introduced by https://github.com/jenkinsci/durable-task-plugin/commit/5c98ca855a9a2fb0043888c1bab9cc5f41c8773a

       

      I also ran 'ps' once a minute on the same agent host in parallel with this job. The sh processes running the heartbeat and the sleep disappear right after the build fails. These are the processes defined in https://github.com/jenkinsci/durable-task-plugin/blob/5c98ca855a9a2fb0043888c1bab9cc5f41c8773a/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java#L154
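
For readers unfamiliar with the heartbeat mentioned above, the general idea can be sketched as follows. This is a simplified illustration, not the plugin's actual wrapper script: the directory layout, file names, 1-second interval, and the short `sleep 3` standing in for the long-running step are all assumptions made for the example. A background loop keeps touching the log file until a result file appears, so an observer can tell a silent but live script from a dead one by the log's modification time.

```shell
#!/bin/sh
# Illustrative sketch only: paths, file names, and timings are hypothetical,
# not the durable-task plugin's real layout.
control_dir=$(mktemp -d)
log_file="$control_dir/log.txt"
result_file="$control_dir/result.txt"

# Heartbeat: refresh the log's timestamp while the control directory
# exists and no result has been recorded yet.
(
    while [ -d "$control_dir" ] && [ ! -f "$result_file" ]; do
        touch "$log_file"
        sleep 1
    done
) &
heartbeat_pid=$!

# The "build step": produces no output, like the sleep in the report.
sh -c 'sleep 3' >> "$log_file" 2>&1

# Record the exit code; the heartbeat loop stops once this file exists.
echo $? > "$result_file"

wait "$heartbeat_pid"
echo "exit code: $(cat "$result_file")"
```

If both the heartbeat loop and the step disappear at the same moment, as described in the report, something outside the script is likely terminating the whole process group rather than the step failing on its own.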

      Looking in the workspace folder and the @tmp folder, I see no pid file. Is that expected?
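
The once-a-minute 'ps' check described in the report could look roughly like the sketch below. The interval, snapshot count, output file, and grep pattern are placeholders chosen so the example finishes quickly; a real run on the agent host would use an interval of 60 seconds and a pattern matching the processes of interest.

```shell
#!/bin/sh
# Hypothetical values: use interval=60 and a larger count for a real run.
interval=1
count=2
pattern='[s]leep'   # the [s] keeps grep from matching its own command line
out_file=$(mktemp)

i=0
while [ "$i" -lt "$count" ]; do
    {
        # Timestamp marker so snapshots can be lined up with the build log.
        date '+=== %Y-%m-%d %H:%M:%S'
        # All processes with PID, parent PID, elapsed time, and command line.
        ps -eo pid,ppid,etime,args | grep -E "$pattern" || echo "(no matching processes)"
    } >> "$out_file"
    i=$((i + 1))
    sleep "$interval"
done
echo "snapshots written to $out_file"
```

Comparing the last snapshot that still lists the processes with the timestamp of the "Cannot contact" line narrows down when they were terminated.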


          Oleg Nenashev added a comment -

          Which Remoting version do you use on the agent side?
          Please also provide the agent logs.


          Alexander Vorobiev added a comment -

          oleg_nenashev, the agent is configured using the Jenkins SSH Slaves plugin.

          Could you tell me how to check the Remoting version and where to find the agent logs?

          In the Jenkins root folder on the agent:

          $ ls -l
          total 748
          -rw-rw-r--  1 jenkins jenkins 745674 Jan 18 07:00 slave.jar
          drwxrwxr-x  2 jenkins jenkins   4096 Jan 18 07:00 support
          drwxrwxr-x 97 jenkins jenkins  12288 Jan 21 10:15 workspace

          $ ls -l support/
          total 8
          -rw-rw-r-- 1 jenkins jenkins 1082 Jan 18 06:57 all_2018-01-18_14.47.55.log
          -rw-rw-r-- 1 jenkins jenkins  440 Jan 18 07:02 all_2018-01-18_15.00.30.log

          $ cat support/all_2018-01-18_15.00.30.log
          2018-01-18 15:01:29.609+0000 [id=45]	INFO	h.r.RemoteInvocationHandler$Unexporter#reportStats: rate(1min) = 52.3±59.7/sec; rate(5min) = 101.9±43.4/sec; rate(15min) = 113.9±27.2/sec; rate(total) = 10.9±34.6/sec; N = 11
          2018-01-18 15:02:29.609+0000 [id=45]	INFO	h.r.RemoteInvocationHandler$Unexporter#reportStats: rate(1min) = 19.3±44.1/sec; rate(5min) = 83.5±55.5/sec; rate(15min) = 106.6±38.4/sec; rate(total) = 5.2±24.6/sec; N = 23

          Oleg Nenashev added a comment -

          vorobievalex, see https://speakerdeck.com/onenashev/day-of-jenkins-2017-dealing-with-agent-connectivity-issues?slide=56

          Alexander Vorobiev added a comment -

          oleg_nenashev, the Remoting version is 3.14.
          I started the agent with the option -slaveLog agent.log.
          The only line written to the log between the agent start and the error in question was 'channel startedchannel started'.

          I followed the file the whole time:

          $ tail --follow=name --retry agent.log
          channel startedchannel started

          Alexander Vorobiev added a comment -

          This problem initially appeared on one server, but it now affects other servers as well, breaking our automation processes.

          Is there any way to make agent logging more verbose to see what happens?

          Jeff Thompson added a comment - - edited

          Yes, it is possible to configure agent logging for finer or more verbose output. You can read about it here: https://github.com/jenkinsci/remoting/blob/master/docs/logging.md . In summary, you add a java.util.logging properties file and then reference it via the `-loggingConfig` parameter to the agent, for example: `-loggingConfig jenkins-logging.properties`.

          Without further information, it will be difficult to diagnose anything from the Remoting side. Remoting issues commonly involve something in the networking or system environment terminating the connection from outside the process. The trick is to determine what is doing that. In one instance (JENKINS-52922), Nush Ahmd discovered that setting hudson.slaves.ChannelPinger.pingIntervalSeconds kept the channel from getting disconnected.

          This case looks more like a durable-task issue, so unless more information is provided I intend to remove Remoting from this report.
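
As a rough illustration of the logging setup described in this comment, the sketch below writes a java.util.logging properties file and prints a launch command that references it. The property keys are standard java.util.logging settings; the chosen levels, the log file pattern, the rotation limits, and the temporary directory are assumptions for the example, and how the option actually reaches the agent depends on how the agent is launched (for an SSH-launched agent, the start command is configured on the master side).

```shell
#!/bin/sh
# Hypothetical location; the file name matches the example in the comment.
config_dir=$(mktemp -d)
config_file="$config_dir/jenkins-logging.properties"

cat > "$config_file" <<'EOF'
# Send log records to the console and to a rotating set of files.
handlers=java.util.logging.ConsoleHandler, java.util.logging.FileHandler

# Default level for all loggers.
.level=INFO

# More detail for the remoting classes (assumed to be the area of interest).
hudson.remoting.level=FINE

java.util.logging.ConsoleHandler.level=FINE
java.util.logging.FileHandler.level=FINE
java.util.logging.FileHandler.pattern=agent-%g.log
java.util.logging.FileHandler.limit=10485760
java.util.logging.FileHandler.count=5
java.util.logging.FileHandler.formatter=java.util.logging.SimpleFormatter
EOF

# Printed rather than executed: the real launch command varies by setup.
echo "java -jar slave.jar -loggingConfig $config_file"
```

Raising a level to FINER or FINEST gives more detail at the cost of much larger log files.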


          Jeff Thompson added a comment -

          My configuration was different but my long-running test job ran for seven hours until I killed it. I ran with current versions on Mac.


            Assignee: Unassigned
            Reporter: Alexander Vorobiev (vorobievalex)
            Votes: 0
            Watchers: 6