Batch steps on slaves randomly hang when complete

      Batch steps that succeed are hanging, and more frequently since the upgrade to Jenkins 1.6.16 + WF 1.7. I think this is recent; I do not recall encountering such issues with Jenkins 1.6.09 + WF 1.5. This is highly problematic for workflow scripts that rely on large numbers of batch steps. Note that the slave nodes in question may be considered "high-latency", with response times occasionally in the seconds.

      Reproduced 2 out of 4 times using the following test idiom. Increasing the loop below to 1000 iterations will probably make it reproduce 100% of the time, and parallelizing anecdotally seems to increase the reproduction rate:

      node('slave') {
        for( int i = 0; i < 100; ++i ) {
          echo "i=${i}"
          bat *<some batch step that takes variable time to run, eg scm or make>*
        }
      }
      

          [JENKINS-28759] Batch steps on slaves randomly hang when complete

          Jesse Glick added a comment -

          Did you check with Workflow 1.8?

          A C added a comment -

          Still occurs with WF 1.8

          Tomas Zaleniakas added a comment -

          I can confirm this. As the scripts take more time, it hangs. I'm on Workflow 1.8 and Jenkins 1.620.

          Tomas Zaleniakas added a comment -

          The timeout step also does nothing for this issue. I tried on Win 7 and Win Server 2012 R2, and both behave the same. It seems like Durable Task has a serious problem, because the same script runs fine in a freestyle job.

          A C added a comment -

          I abandoned Workflow plugin awhile back due to this issue.

          Jesse Glick added a comment - - edited

          I would certainly like to solve this but I need to be able to reproduce it, and so far I cannot. I run batch steps on XP or 2012 R2 and they finish just fine.

          You should create a custom logger recording messages from org.jenkinsci.plugins.workflow.steps.durable_task and org.jenkinsci.plugins.durabletask at FINE. When the script hangs, pay attention to what log messages are coming in. Also go to the slave workspace and look for a control directory, which will be named something like .123abc45, and check what files are present. As seen here and here, jenkins-log.txt should contain the full output of the script; jenkins-wrap.bat should contain a short controller script (using cmd.exe); jenkins-main.bat the contents of your script; and jenkins-result.txt should exist with the contents 0 if the process has exited. Also check Task Manager to see whether the main and controller scripts are running.

          I have an unconfirmed report of a situation where the main and controller scripts are not running, yet jenkins-result.txt does not exist, suggesting that the controller script somehow failed to record the exit of the main script. If this is the case for you, I need to understand what could cause that.
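The control-directory check described above can be sketched as a small script. This is a rough illustration in Python, not part of the Durable Task plugin: the function name `inspect_control_dir` and the three state labels are assumptions made for this example; only the four file names come from the comment.

```python
from pathlib import Path

# File names as listed in the comment above.
EXPECTED = (
    "jenkins-wrap.bat",
    "jenkins-main.bat",
    "jenkins-log.txt",
    "jenkins-result.txt",
)


def inspect_control_dir(control_dir):
    """Summarize a control directory (hypothetical helper, for illustration)."""
    root = Path(control_dir)
    present = {name: (root / name).is_file() for name in EXPECTED}

    exit_code = None
    if present["jenkins-result.txt"]:
        # The wrapper writes the exit code as text, possibly with
        # trailing whitespace, so strip before parsing.
        text = (root / "jenkins-result.txt").read_text(errors="replace").strip()
        if text.lstrip("-").isdigit():
            exit_code = int(text)

    if exit_code is not None:
        state = "exited"           # an exit code was recorded
    elif present["jenkins-log.txt"]:
        state = "running-or-lost"  # output exists, but no exit code yet
    else:
        state = "not-started"      # no log file has been created
    return {"present": present, "exit_code": exit_code, "state": state}
```

Pass the path of the control directory (the one named something like .123abc45) to `inspect_control_dir`. A result of "running-or-lost" while Task Manager shows neither script running would match the unconfirmed report mentioned above.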

          Jesse Glick added a comment -
          node('…windows…') {
            env.JAVA_HOME = /C:\Program Files\Java\jdk1.8.0_51/
            for( int i = 0; i < 100; ++i ) {
              echo "i=${i}"
              bat(/…\bin\mvn -f …\animal-sniffer\pom.xml clean package/)
            }
          }
          

          passed for me after 75 minutes. WF 1.9, Durable Task 1.6, Jenkins 1.620, master on Linux with a JNLP slave on 2012 R2.

          Maybe the “high latency” bit is the key? The reporter also filed JENKINS-28604 which I also have no leads on and which mentioned latency.

          Or perhaps this only happens when a single bat step takes quite a long time? What order of magnitude are we talking about here? Are there any suspicious things running on the slave machine like antivirus checkers?

          Tomas Zaleniakas added a comment -

          So I managed to find the issue with your help. The Durable Task error was that it cannot access the jenkins-log.txt file. That batch process creates some process that does not exit afterwards, so we had another batch script to clean it up. I merged those two batch files and now the builds don't hang.

          Jesse Glick added a comment -

          Not quite sure I followed that. It sounds like you know of a way to reproduce this from scratch. If so, that would be very valuable information.

          Piyush Jain added a comment - - edited

          I am using the latest Workflow plugin, 1.10, on Jenkins 1.609.2 LTS. I am seeing the same issue while executing a batch command: it does not return to the command prompt.

          All four of these files are auto-generated by Jenkins Workflow in the slave workspace:

          1. jenkins-wrap.bat (on windows slave)
          call "C:\JenkinsSlave\workspace\osp_test_pipeline_taml\.0401c727\jenkins-main.bat" > "C:\JenkinsSlave\workspace\osp_test_pipeline_taml\.0401c727\jenkins-log.txt" 2>&1
          echo %ERRORLEVEL% > "C:\JenkinsSlave\workspace\osp_test_pipeline_taml\.0401c727\jenkins-result.txt"

          2. jenkins-main.bat (on windows slave)
          cd %ITEST_HOME%
          %OSP_SDVT_3RDGEN_BASELINE_COMMAND%

          3. jenkins-result.txt (on windows slave)
          0

          4. jenkins-log.txt (on windows slave)

          C:\JenkinsSlave\workspace\osp_test_pipeline_taml>cd C:\Program Files (x86)\Spirent Communications\iTest 4.2

          C:\Program Files (x86)\Spirent Communications\iTest 4.2>itestcli -rw -P project://SDVT_Smokecheck/Tier2/Parmeters/OSP_Tier1_SmokeCheck.ffpt -p upgrade/imagepath_ta5k=/TAML.D2287/OSP/ -p upgrade/aucfilename_ta5k=OSP-TAMLD2287.auc -w C:\iTest\3GEN project://SDVT_Smokecheck/Tier2/TestCases/Smoke_System_GreenUp.fftc
          itestcli Command Processor Version 4.2.2
          Copyright (c) 2006-2013, Spirent Communications

          Executing testcase: 'project://SDVT_Smokecheck/Tier2/TestCases/Smoke_System_GreenUp.fftc' ...
          Execution started at: Fri Sep 04 03:59:40 CDT 2015
          Test owner:

          Sev. Origin Session Step Index Procedure Message
          ===========================================================================================================
          Info execution main Execution started
          Pass analysis 18.1 3.1 main Baseline test passed
          Pass execution 18.1 3.1 main Test case Smoke_System_GreenUp has passed.
          Info execution Execution completed (2s)

          Execution finished at: Fri Sep 04 03:59:43 CDT 2015
          Execution Status: Pass

          ---------------------------------------------------

          This script runs fine and shows execution status Pass, but the Jenkins build hangs after that in Workflow-based builds. It seems the exit command is not getting automatically written into the jenkins-wrap.bat and jenkins-main.bat files.

          can someone look into it & fix it asap
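The two-line jenkins-wrap.bat quoted above follows a simple pattern: run the main script with its output redirected to a log, then write the exit code to a result file. A rough Python emulation of that pattern is sketched below to make the ordering explicit. It is not the plugin's code; `run_wrapped` is a hypothetical name, and only the two file names are taken from the listing.

```python
import subprocess
from pathlib import Path


def run_wrapped(command, control_dir):
    """Emulate the wrapper: log the command's output, then record its exit code."""
    root = Path(control_dir)
    with open(root / "jenkins-log.txt", "wb") as log:
        # Counterpart of: call jenkins-main.bat > jenkins-log.txt 2>&1
        completed = subprocess.run(
            command, stdout=log, stderr=subprocess.STDOUT
        )
    # Counterpart of: echo %ERRORLEVEL% > jenkins-result.txt
    # This line is reached only after the command above has returned.
    (root / "jenkins-result.txt").write_text(f"{completed.returncode}\n")
    return completed.returncode
```

In this emulation the result file is written only after the child command returns, so a missing jenkins-result.txt would mean either that the first step has not returned or that the second step never ran. Whether the real wrapper behaves the same way in every failing case is not established in this thread.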

          Jesse Glick added a comment -

          piyushjkjain your issue sounds more like JENKINS-27419, perhaps. Impossible to be sure without a way to reproduce from scratch.

          Tomas Zaleniakas added a comment -

          So I got time to try to reproduce the hangs. In my case the issue was the one described at
          https://wiki.jenkins-ci.org/display/JENKINS/Spawning+processes+from+build
          Our NUnit tests launch Chrome and do not close it after execution. I checked the logs: regular freestyle jobs, after a while, print
          "Process leaked file descriptors. See http://wiki.jenkins-ci.org/display/JENKINS/Spawning+processes+from+build for more information"
          and continue to run, but Workflow does not detect this and hangs forever. So I think the Workflow durable task needs to detect leaked file descriptors and report them, so you know what is going on.

          Jesse Glick added a comment -

          Sounds like a plausible root cause to investigate.

          Jeremy Riley added a comment - - edited

          I have a problem with batch steps hanging too.

          In my case the hang seems to occur when two parallel steps are run on the same slave. The first is OK, but the second always hangs.

          When the hang occurs there are only two of the auto-generated files in the workspace jenkins-main.bat and jenkins-wrap.bat. The last line of output in the console output is [Workflow test@2] Running batch script.

          I have seen the problem with a single 'bat' command and also with multiple 'bat' commands. My latest attempt is:

          def msbuild = { args ->
            bat "echo Hello world"
            bat "echo ${msBuildPath} ${args}"
            bat "${msBuildPath} ${args}"
          }

          The hang occurs starting the first 'bat' command.

          If I remove all the parallel steps from the script, it runs without problem. If I use separate projects, they run without problem.

          Is there any way to turn on extra logging around the process launch?

          Jeremy Riley added a comment -

          When I use -Dhudson.slaves.WorkspaceList="=" to change the separator character for concurrent workspaces the problem is no longer seen.

          Jesse Glick added a comment -

          One user reporting a similar symptom has found that the root cause was actually JENKINS-29924. That does seem to result in visible stack traces in the system log, though, so I doubt it explains most of the reports.

          Jesse Glick added a comment -

          Everyone observing this, please try updating to Durable Task 1.7 in case the fix of JENKINS-27419 addressed this too. I sort of doubt it, but I have no real hypothesis for what is causing this so it is possible.

          Jesse Glick added a comment -

          I discovered a potential bug in platform-independent code which might explain at least certain cases of this issue. Some other observations like jeremyriley’s do not sound related.

          Jeremy Riley added a comment -

          I tried the new Durable Task and the problem I described is still present.

          Anthony Burns added a comment - - edited

          I'm able to reproduce this issue on Windows Server 2012 with Jenkins 2.3 and Pipeline 2.1. This machine is the master.

          Originally I was running 5 batch scripts spread over 3 stages, and the first stage would hang on the first of its 2 batch scripts. I was able to force the stage to continue by manually running the jenkins-wrap.bat file from the workspace. I've managed to work around this issue for now by splitting each batch file into its own stage.

          FYI, my first thought was that it may have something to do with the first two batch scripts running from the same directory (inside a dir()) in my Jenkinsfile. However, separating the two batch scripts into individual dirs did not fix the issue. Even after I separated the first two batch scripts into separate stages, two later batch scripts (sharing a stage, but not a directory) also got hung up on the first batch script in the stage.

          Gijs Kuijer added a comment -

          This one is probably related to this issue: https://issues.jenkins-ci.org/browse/JENKINS-34150

          Gijs Kuijer added a comment - - edited

          https://issues.jenkins-ci.org/browse/JENKINS-34150 probably resolves this issue as well.

          Jesse Glick added a comment -

          Any known way to reproduce from scratch? There are a lot of related issues which are all probably duplicates, but it is unclear what the trigger conditions are.

          mcrooney added a comment - - edited

          I wonder if this is specific to batch steps, or a general Pipeline issue, as we rarely but regularly see Pipeline hang indefinitely after shell steps are finished at:

          + exit 0
          

          They always have to be hard-killed:

          + exit 0
          Aborted by Example User
          Click here to forcibly terminate running steps
          Terminating stage
          Click here to forcibly kill entire build
          Hard kill!
          Finished: ABORTED
          

          Would this be a different bug?

          Daniel Aguado Araujo added a comment - - edited

          Workaround: run those steps with PowerShell.

          I have been affected by this bug for a few days, since some changes on my VM builders. I use the Swarm client.

            Assignee: Unassigned
            Reporter: A C (sumdumgai)
            Votes: 19
            Watchers: 29