
Batch steps on slaves randomly hang when complete

      Batch steps that succeed are hanging, more frequently since the upgrade to Jenkins 1.6.16 + WF 1.7; I think this is recent, I do not recall encountering such issues with Jenkins 1.6.09 + WF 1.5. This is highly problematic for workflow scripts that rely on large numbers of batch steps. Note, the slave nodes in question may be considered "high-latency" with response times occasionally in seconds.

      Reproduced 2 out of 4 times using the following test idiom. Increasing the loop below to 1000 iterations will probably make it a 100% reproduction, and parallelizing anecdotally seems to increase the reproduction rate:

      node('slave') {
        for( int i = 0; i < 100; ++i ) {
          echo "i=${i}"
          bat *<some batch step that takes variable time to run, eg scm or make>*
        }
      }
      

          [JENKINS-28759] Batch steps on slaves randomly hang when complete

          A C created issue -

          Jesse Glick added a comment -

          Did you check with Workflow 1.8?


          A C added a comment -

          Still occurs with WF 1.8

          A C made changes -
          Environment Original: Jenkins 1.6.16
          Workflow 1.7
          Windows 8.1 Slaves
          New: Jenkins 1.6.16
          Workflow 1.8
          Windows 8.1 Slaves

          Tomas Zaleniakas added a comment -

          I can confirm this. As the scripts take more time, it hangs. I'm on Workflow 1.8 and Jenkins 1.620.

          Tomas Zaleniakas added a comment -

          Also, the timeout step does nothing for this issue. I tried on Win 7 and Win Server 2012 R2, and both behave the same. It seems the durable task has a serious problem, because the same script runs fine in a freestyle job.

          A C added a comment -

          I abandoned the Workflow plugin a while back due to this issue.

          Jesse Glick added a comment - - edited

          I would certainly like to solve this but I need to be able to reproduce it, and so far I cannot. I run batch steps on XP or 2012 R2 and they finish just fine.

          You should create a custom logger recording messages from org.jenkinsci.plugins.workflow.steps.durable_task and org.jenkinsci.plugins.durabletask at FINE. When the script hangs, pay attention to what log messages are coming in. Also go to the slave workspace and look for a control directory, which will be named something like .123abc45, and check what files are present. As seen here and here, jenkins-log.txt should contain the full output of the script; jenkins-wrap.bat should contain a short controller script (using cmd.exe); jenkins-main.bat the contents of your script; and jenkins-result.txt should exist with the contents 0 if the process has exited. Also check Task Manager to see whether the main and controller scripts are running.

          I have an unconfirmed report of a situation where the main and controller scripts are not running, yet jenkins-result.txt does not exist, suggesting that the controller script somehow failed to record the exit of the main script. If this is the case for you, I need to understand what could cause that.
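The control-directory check described above can be scripted so that it is easy to repeat each time a step hangs. The following is a minimal sketch in Python, not part of the plugin: the four file names come from the comment above, while the function names and the rule for spotting a control directory (a dot-prefixed subdirectory of the workspace holding at least one of those files) are assumptions made for illustration.

```python
from pathlib import Path

# File names as listed in the comment above.
CONTROL_FILES = (
    "jenkins-log.txt",
    "jenkins-wrap.bat",
    "jenkins-main.bat",
    "jenkins-result.txt",
)


def find_control_dirs(workspace):
    """Yield likely control directories directly under the workspace.

    Assumption: a control directory is a dot-prefixed subdirectory
    (e.g. ".123abc45") containing at least one of the control files.
    """
    for child in sorted(Path(workspace).iterdir()):
        if not (child.is_dir() and child.name.startswith(".")):
            continue
        if any((child / name).is_file() for name in CONTROL_FILES):
            yield child


def inspect_control_dir(control_dir):
    """Report which control files exist and the recorded exit code, if any."""
    control_dir = Path(control_dir)
    report = {name: (control_dir / name).is_file() for name in CONTROL_FILES}
    result_file = control_dir / "jenkins-result.txt"
    # The exit code is only known once the result file has been written.
    if result_file.is_file():
        report["exit_code"] = result_file.read_text().strip()
    else:
        report["exit_code"] = None
    return report
```

Calling `find_control_dirs` with the slave workspace path and passing each hit to `inspect_control_dir` gives one report per control directory. A report whose `exit_code` is `None` while Task Manager shows neither script running would match the unconfirmed scenario described above.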


          Jesse Glick added a comment -
          node('…windows…') {
            env.JAVA_HOME = /C:\Program Files\Java\jdk1.8.0_51/
            for( int i = 0; i < 100; ++i ) {
              echo "i=${i}"
              bat(/…\bin\mvn -f …\animal-sniffer\pom.xml clean package/)
            }
          }
          

          passed for me after 75 minutes. WF 1.9, Durable Task 1.6, Jenkins 1.620, master on Linux with a JNLP slave on 2012 R2.

          Maybe the “high latency” bit is the key? The reporter also filed JENKINS-28604 which I also have no leads on and which mentioned latency.

          Or perhaps this only happens when a single bat step takes quite a long time? What order of magnitude are we talking about here? Are there any suspicious things running on the slave machine like antivirus checkers?


            Assignee: Unassigned
            Reporter: A C (sumdumgai)
            Votes: 19
            Watchers: 29