JENKINS-48990

Long builds with no logs fail on Ubuntu after 1-2 hours

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Component: durable-task-plugin
    • Labels: None
    • Environment: Jenkins ver. 2.89.2
      Remoting Version 3.14
      Latest plugins for 01/17/2018
      Agent ubuntu.1, launched via ssh
      Ubuntu 14.04.5 LTS

      The simplest Jenkinsfile to reproduce:

      pipeline {
          agent { label "ubuntu.1" }
      
          options {
              disableConcurrentBuilds()
              ansiColor('xterm')
              timestamps()
          }
      
          stages {
              stage('Sleep') {
                  steps {
                      sh "sleep 99999999"
                  }
              }
          }
      }

      Log ends with:

      08:54:24 [Org_repository-5BKEJ4KU7KWRDM4GMA5EGH4UVHK4U74AML3VWSVHEXZJBWCI2QTQ] Running shell script
      08:54:25 + sleep 99999999
      11:30:55 Cannot contact ubuntu.1: java.lang.InterruptedException
      Post stage
      [Pipeline] archiveArtifacts
      11:31:11 Archiving artifacts
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // timestamps
      [Pipeline] }
      [Pipeline] // ansiColor
      [Pipeline] }
      [Pipeline] // timeout
      [Pipeline] End of Pipeline
      
      GitHub has been notified of this commit’s build result
      
      ERROR: script returned exit code -1
      Finished: FAILURE
      

      With logging for durable-task-plugin set to ALL, I see "heartbeat touches apparently not running in ..."

      Since the failures started about a month ago, this seems to be a regression introduced by https://github.com/jenkinsci/durable-task-plugin/commit/5c98ca855a9a2fb0043888c1bab9cc5f41c8773a

      I also ran 'ps' once a minute on the same agent host in parallel with this job. The sh processes running the heartbeat and the sleep disappear right after the build fails. These processes are defined in https://github.com/jenkinsci/durable-task-plugin/blob/5c98ca855a9a2fb0043888c1bab9cc5f41c8773a/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java#L154

      Looking in the workspace folder and the @tmp folder, I see no pid file. Is that expected?
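
The once-a-minute 'ps' check mentioned above could be scripted roughly as follows. This is only a sketch, not the script that was actually used: the helper names `snapshot` and `watch_processes`, and the log file name in the usage note, are hypothetical.

```shell
#!/bin/sh
# Sketch of a periodic process check; not the reporter's actual script.
# snapshot and watch_processes are hypothetical helper names.

# Print a UTC timestamp header followed by every process whose command
# line matches the given pattern.
snapshot() {
    pattern=$1
    printf '=== %s ===\n' "$(date -u '+%Y-%m-%d %H:%M:%S')"
    # grep -v grep drops the grep process itself; || true keeps the
    # function from failing when nothing matches.
    ps -eo pid,ppid,etime,args | grep -e "$pattern" | grep -v grep || true
}

# Take `count` snapshots, sleeping `interval` seconds between them.
watch_processes() {
    pattern=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        snapshot "$pattern"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `watch_processes 'sleep 99999999' 60 180 >> ps-watch.log` would record the build's sleep process once a minute for about three hours. A pattern for the heartbeat loop would have to be chosen by inspecting the full 'ps' output on the agent first.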

          Alexander Vorobiev created issue -

          Oleg Nenashev added a comment -

          Which Remoting version do you use on the agent side?
          Please also provide the agent logs.

          Alexander Vorobiev added a comment -

          oleg_nenashev, the agent is configured using the Jenkins SSH Slaves plugin.

          Could you tell me how to check the Remoting version and where to find the agent logs?

          In the Jenkins root folder on the agent:

          $ ls -l
          total 748
          -rw-rw-r--  1 jenkins jenkins 745674 Jan 18 07:00 slave.jar
          drwxrwxr-x  2 jenkins jenkins   4096 Jan 18 07:00 support
          drwxrwxr-x 97 jenkins jenkins  12288 Jan 21 10:15 workspace

          $ ls -l support/
          total 8
          -rw-rw-r-- 1 jenkins jenkins 1082 Jan 18 06:57 all_2018-01-18_14.47.55.log
          -rw-rw-r-- 1 jenkins jenkins  440 Jan 18 07:02 all_2018-01-18_15.00.30.log

          $ cat support/all_2018-01-18_15.00.30.log
          2018-01-18 15:01:29.609+0000 [id=45]	INFO	h.r.RemoteInvocationHandler$Unexporter#reportStats: rate(1min) = 52.3±59.7/sec; rate(5min) = 101.9±43.4/sec; rate(15min) = 113.9±27.2/sec; rate(total) = 10.9±34.6/sec; N = 11
          2018-01-18 15:02:29.609+0000 [id=45]	INFO	h.r.RemoteInvocationHandler$Unexporter#reportStats: rate(1min) = 19.3±44.1/sec; rate(5min) = 83.5±55.5/sec; rate(15min) = 106.6±38.4/sec; rate(total) = 5.2±24.6/sec; N = 23

          Oleg Nenashev added a comment - vorobievalex see https://speakerdeck.com/onenashev/day-of-jenkins-2017-dealing-with-agent-connectivity-issues?slide=56

          Alexander Vorobiev added a comment -

          oleg_nenashev, the Remoting version is 3.14.
          I started the agent with the option -slaveLog agent.log
          The only line written to the log between the agent start and the error in question was 'channel startedchannel started'.

          I followed the file the whole time:

          $ tail --follow=name --retry agent.log
          channel startedchannel started

          Alexander Vorobiev added a comment -

          This problem initially appeared on one server, but it now affects other servers as well and is breaking our automation processes.

          Is there any way to make the agent logging more verbose to see what happens?

            Assignee: Unassigned
            Reporter: Alexander Vorobiev (vorobievalex)
            Votes: 0
            Watchers: 6