  1. Jenkins
  2. JENKINS-50892

Pipeline jobs stuck after restart

      Description

      In our instances, pipeline jobs very often get stuck after a restart. The pipeline seems to detect that it needs to continue and tries to resume execution where it was interrupted, but nothing is actually executed. From the UI it only looks as if it is running: the computer is occupied by the job and the build is in the running state, but it appears to be in an endless waiting cycle.

      Interestingly, this state is probably saved persistently: I do not hit it every time, but once it occurs, I can restart Jenkins as many times as I want and the build never recovers (it always gets stuck).

      I believe the problem is somewhere in program.dat, but it is not as easily human-readable as XML, so I am not sure where the difference is.

      I did the same with the previous 2 runs and they were able to recover, but the third one did not.

      I am adding a screenshot of the successful recoveries (one of them ended in failure, but it did not get stuck, which counts as success in this case) and a screenshot of the build with the issue. Since the state seems to be persistent (as described above), I archived the Jenkins home and am attaching it along with the Jenkins war. The archived Jenkins home also contains the pipeline versions.

       

      Thank you for your help!

       

            Activity

            lvotypkova Lucie Votypkova added a comment -

            I am sorry, I cannot upload the war and the home because of the size limit. I added only the "jobs" home; I will send you the rest by e-mail if you are interested.

            jglick Jesse Glick added a comment -

            Maybe related to JENKINS-50199 Sam Van Oort? Just a casual guess, have not looked at the details.

            svanoort Sam Van Oort added a comment -

            Lucie Votypkova Have you tried adding this startup setting on your Jenkins master? "-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300"

            Off the top of my head, this one looks like it's an issue with the Shell step and how it detects whether the shell step is running or dead, rather than a resume issue. Adding a long-enough timeout with that setting may resolve this.
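One way to pass this setting, as a minimal sketch that assumes Jenkins is started directly from the war file. The path to jenkins.war is a placeholder; installations started by a service manager or a container would set the JVM options in their own configuration instead.

```shell
# System properties must come before -jar so that the JVM receives them
# as properties instead of passing them to Jenkins as program arguments.
java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 \
     -jar jenkins.war
```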

            lvotypkova Lucie Votypkova added a comment -

            While checking my laptop, I realized that the bug may not be on the pipeline side. I checked my processes, and it seems that the shell step is made durable by running it in a way that keeps a result file. That run seems to fail and hang. Here is an example from my processes (I list only two, but there are more like these from my attempts to reproduce it):

             

            lvotypko 7007 0.0 0.0 113248 668 pts/2 S Apr17 0:17 sh -c { while [ -d '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40' -a ! -f '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt' ]; do touch '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-log.txt'; sleep 3; done } & jsc=durable-da19427a0c4143110a7079e69bb900f7; JENKINS_SERVER_COOKIE=$jsc '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/script.sh' > '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-log.txt' 2>&1; echo $? > '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt.tmp'; mv '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt.tmp' '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt'; wait

            lvotypko 10819 0.0 0.0 113248 792 pts/10 S 10:28 0:06 sh -c { while [ -d '/tmp/workspace/reproducer@tmp/durable-f0a2ec55' -a ! -f '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-result.txt' ]; do touch '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-log.txt'; sleep 3; done } & jsc=durable-bc6ed2e151065a52c7c18babc726f6be; JENKINS_SERVER_COOKIE=$jsc '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/script.sh' > '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-log.txt' 2>&1; echo $? > '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-result.txt.tmp'; mv '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-result.txt.tmp' '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-result.txt'; wait

             

            It is clear that one of them is from yesterday and is still running!

            This explains why it happens so often on our instances (and the restart somehow plays a role in it): we use shell steps frequently, and often for long-running tasks.
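For readability, the one-line wrapper command from the process listing can be laid out as a small standalone script. This is only a sketch for experimenting outside Jenkins: CONTROL_DIR is a hypothetical stand-in for the durable-* directory, the user script is a placeholder that exits with code 3 and is started through sh instead of being executed directly, and the heartbeat interval is shortened from 3 seconds to 1.

```shell
#!/bin/sh
# Sketch of the wrapper command from the ps output, laid out for readability.
# CONTROL_DIR stands in for the durable-* directory (hypothetical path).
CONTROL_DIR=$(mktemp -d)

# Placeholder user script: prints one line and exits with code 3.
printf 'echo hello\nexit 3\n' > "$CONTROL_DIR/script.sh"

# Background heartbeat: touch the log until the result file appears
# (or until the control directory is removed).
{ while [ -d "$CONTROL_DIR" -a ! -f "$CONTROL_DIR/jenkins-result.txt" ]; do
    touch "$CONTROL_DIR/jenkins-log.txt"; sleep 1
  done; } &

# Foreground: run the user script, capture its output, then publish the
# exit code by writing a temp file and renaming it into place.
sh "$CONTROL_DIR/script.sh" > "$CONTROL_DIR/jenkins-log.txt" 2>&1
echo $? > "$CONTROL_DIR/jenkins-result.txt.tmp"
mv "$CONTROL_DIR/jenkins-result.txt.tmp" "$CONTROL_DIR/jenkins-result.txt"
wait    # returns once the heartbeat loop has seen the result file

RESULT=$(cat "$CONTROL_DIR/jenkins-result.txt")
LOG=$(cat "$CONTROL_DIR/jenkins-log.txt")
echo "exit code: $RESULT, log: $LOG"
rm -rf "$CONTROL_DIR"
```

The foreground part is the only place where jenkins-result.txt gets written, so anything that stops the foreground part before the echo leaves the heartbeat loop without its exit condition.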

             

            lvotypkova Lucie Votypkova added a comment -

            The only command in script.sh is "sleep 60", but grepping the process list for sleep 60 returns nothing. So it seems that it hangs in the functionality before or after it. Because script.sh and jenkins-log.txt exist and jenkins-log.txt has the expected output, I guess that something goes wrong after sleep 60 is executed.

            svanoort Sam Van Oort added a comment -

            Lucie Votypkova You should be able to manually invoke the command on that build agent with some modification for debugging – I wonder if something is wacky in the bash config there, perhaps something that triggers a premature failure? The 'set -e' option springs to mind here, but there could be something else at play.

            lvotypkova Lucie Votypkova added a comment -

            OK, I am not sure whether this is the same thing that happens on our instance (I made a local reproducer that does the same thing: it calls a shell step), but it is clear that the pipeline execution side works correctly: I manually wrote the jenkins-result.txt.tmp file with the number 0, and the build, which had been stuck for about an hour, finished successfully right away.

            I will watch our instance to see whether I hit the same issue as in my reproducer (it looks that way from the UI perspective). For now I will concentrate on the durable shell step.

            lvotypkova Lucie Votypkova added a comment -

            Sam, OK, I will do so. But I think the connection with Jenkins may play a role: it has never happened without a restart. I tried the same shell step without a restart several times and it never happened. But I will check it. Maybe I can modify the shell step code locally to get more information, and I will look more closely at how the durable shell step works; maybe I will get some idea. Thanks for the cooperation!

            lvotypkova Lucie Votypkova added a comment -

            I concentrated on the script (leaving Jenkins out and only running the bash script that the pipeline runs) and I can reproduce it: when it is interrupted in the middle, it hangs. Do I understand correctly that it should not only start the execution but also wait for it to finish (in this case, for sleep 60)? I thought it would be executed in the background, because the pipeline waits for the result file and the log can be read from jenkins-log.txt. Or am I wrong?

            lvotypkova Lucie Votypkova added a comment -

            I checked the script and it is badly written. It periodically checks in a while loop whether the file "jenkins-result.txt" exists, but when the script is interrupted (due to a restart) while script.sh is running but not yet finished, these commands are never executed:

            echo $? > '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt.tmp'

            mv '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt.tmp' '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt'

            As a result, the file "jenkins-result.txt" will never exist, but the background loop that checks for its existence will run forever.

            Would it not be better to do all of it in the background? I will write a test for it; it should not be a problem to catch this with a test. I sent pull requests so you can see it.
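The failure mode described here can be tried outside Jenkins. The following is a minimal sketch, not the plugin's code or its test: CONTROL_DIR and wrapper.sh are hypothetical names, "sleep 2" plays the role of script.sh, and the heartbeat interval is shortened to 1 second. It also assumes the signal reaches only the wrapper's main shell process (kill -15 on a single PID), which may differ from what an agent disconnect really does.

```shell
#!/bin/sh
# Sketch: send SIGTERM to the wrapper while the user script runs, then check
# whether the heartbeat survives and whether the result file gets written.
CONTROL_DIR=$(mktemp -d)   # hypothetical stand-in for the durable-* directory

# Wrapper with the same shape as the command in this thread.
cat > "$CONTROL_DIR/wrapper.sh" <<EOF
{ while [ -d '$CONTROL_DIR' -a ! -f '$CONTROL_DIR/jenkins-result.txt' ]; do
    touch '$CONTROL_DIR/jenkins-log.txt'; sleep 1
  done; } &
sleep 2 > '$CONTROL_DIR/jenkins-log.txt' 2>&1
echo \$? > '$CONTROL_DIR/jenkins-result.txt.tmp'
mv '$CONTROL_DIR/jenkins-result.txt.tmp' '$CONTROL_DIR/jenkins-result.txt'
wait
EOF

sh "$CONTROL_DIR/wrapper.sh" 2>/dev/null &
WRAPPER_PID=$!
sleep 1
kill -15 "$WRAPPER_PID"    # terminate only the wrapper's main shell
sleep 2                    # by now the user script (sleep 2) has finished

if [ -f "$CONTROL_DIR/jenkins-result.txt" ]; then
  RESULT_WRITTEN=yes
else
  RESULT_WRITTEN=no
fi

# Heartbeat check: is the log still being touched after the user script ended?
touch "$CONTROL_DIR/marker"
sleep 2
if [ -n "$(find "$CONTROL_DIR/jenkins-log.txt" -newer "$CONTROL_DIR/marker")" ]; then
  HEARTBEAT_ALIVE=yes
else
  HEARTBEAT_ALIVE=no
fi
echo "result written: $RESULT_WRITTEN, heartbeat alive: $HEARTBEAT_ALIVE"

# Cleanup: writing the result file by hand releases the heartbeat loop.
printf '%s\n' -1 > "$CONTROL_DIR/jenkins-result.txt"
sleep 2
rm -rf "$CONTROL_DIR"
```

With a typical sh this should report that no result file was written while the log file keeps being touched. The heartbeat loop runs in a forked subshell, so it outlives the main shell that was supposed to record the exit code.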

            lvotypkova Lucie Votypkova added a comment -

            I created a reproducer for this behavior [1]. I believe it would be useful to use not only the result file as an indicator that the script is still running, but also a file with the PID of the process. In my test I run the script "sleep 22". I looked up the script that launched it, then waited for the process that directly executes 'sleep 22'. While 'sleep 22' is running, I kill the launching process (the parent of the 'sleep 22' process) with signal 15 (which is what usually happens when the connection is interrupted). The result is that after waiting 30 seconds, the 'sleep 22' process has run and finished, but the launching process is still running and the result file does not exist. It stays in this state forever.

            [1] https://github.com/lvotypko/durable-task-plugin/commit/a21ed041f67431a18d106760e244e0e8b5c94cc3

            jglick Jesse Glick added a comment -

            Durable tasks support killing of the user process. If the wrapper process is killed, all bets are off, which is why the plugin will just wait for a while to see if the output file has been touched, and after a while give up and return exit code -1. There is no need to track PIDs (we used to do that and it was a nightmare). So there are two questions of interest about your environment:

            • What about interrupting the connection is causing the launching process to be killed? This sounds like a problem with your agent launcher. I have seen it happen with Docker-based agents that were not using the -i option.
            • What do you mean that the launching process is still running? You just said you killed the launching process.
            lvotypkova Lucie Votypkova added a comment -

            The whole point is that the process is not killed by signal 15.

            lvotypkova Lucie Votypkova added a comment - - edited

            And of course the touch is still executed, because there is no result file. The whole problem is that this process does not die after kill -15, but the part that creates the result file (echo $?) is not executed for some reason. I am not a shell expert and I am not sure which part is interrupted (by signal 15), but it is definitely not the while loop with touch. The script.sh is executed (I can watch it from the console; you can reproduce it manually very easily too), but when I send kill -15 to the process while script.sh is being executed, it never executes echo $? and never writes the result file (nor the rest of the execution). I am not sure why. I thought that surviving kill -15 was a feature, so that it survives, for example, disconnection of the agent during a Jenkins restart. But in that case the whole execution should survive, not only the while loop with touch.

            lvotypkova Lucie Votypkova added a comment -

            I think it is because the while loop with touch is sent to the background, so kill -15 interrupts only the rest of the script, and wait is waiting for the process with the while loop. But I am not 100% sure.

            lvotypkova Lucie Votypkova added a comment -

            I can fix it by sending the script execution (together with echo $? > tmp result file and cp tmp result file to result file) to the background too. But I am not sure whether that is the correct approach.
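The idea of backgrounding the script execution as well can be sketched as follows. This illustrates the approach only and is not the content of any commit to the plugin: CONTROL_DIR and wrapper.sh are hypothetical names, "sleep 2" stands in for script.sh, the temp file is renamed with mv as in the wrapper command shown earlier in the thread, and the signal is assumed to hit only the wrapper's main shell.

```shell
#!/bin/sh
# Sketch: both the heartbeat and the user script run in background groups,
# so terminating the wrapper's main shell does not skip the exit-code write.
CONTROL_DIR=$(mktemp -d)   # hypothetical stand-in for the durable-* directory

cat > "$CONTROL_DIR/wrapper.sh" <<EOF
{ while [ -d '$CONTROL_DIR' -a ! -f '$CONTROL_DIR/jenkins-result.txt' ]; do
    touch '$CONTROL_DIR/jenkins-log.txt'; sleep 1
  done; } &
{ sleep 2 > '$CONTROL_DIR/jenkins-log.txt' 2>&1
  echo \$? > '$CONTROL_DIR/jenkins-result.txt.tmp'
  mv '$CONTROL_DIR/jenkins-result.txt.tmp' '$CONTROL_DIR/jenkins-result.txt'; } &
wait
EOF

sh "$CONTROL_DIR/wrapper.sh" 2>/dev/null &
WRAPPER_PID=$!
sleep 1
kill -15 "$WRAPPER_PID"    # terminate only the wrapper's main shell
sleep 3                    # give the user script time to finish

RESULT=$(cat "$CONTROL_DIR/jenkins-result.txt" 2>/dev/null)
echo "recorded exit code: ${RESULT:-none}"
sleep 1
rm -rf "$CONTROL_DIR"
```

The trade-off raised in the comment still applies: in this form a plain kill -15 of the wrapper no longer stops anything, so the user script and the heartbeat can only be stopped by signalling them individually.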

            lvotypkova Lucie Votypkova added a comment -

            If you agree with sending the execution to the background too, you can check this commit https://github.com/lvotypko/durable-task-plugin/commit/541f1ddb31839891489231c94024bedd7f0d6aad - it includes a test that reproduces the issue (interruptScriptExecution).

            jglick Jesse Glick added a comment -

            I see a test but I do not follow what actual scenario it is simulating. What is sending SIGTERM, to which process, and why?

            To give some background, the intended behavior is that:

            • The agent JVM launches a wrapper script, throwing away the Proc so that it is not waiting for the wrapper script process to finish.
            • The wrapper script launches the user script, collecting stdout/stderr to a file, and waiting for it to finish, sending the exit code to another file.
            • The agent JVM periodically checks for an exit code and/or new log output, and updates the status of the sh step accordingly.
            • If the agent is killed and restarted (including both Jenkins master restarts and Remoting disconnections), the wrapper script and user script can continue to run uninterrupted.
            • The wrapper script also periodically touches the log file to confirm that it is still there. If the agent sees no updates to the log file, it assumes that the whole process tree died (for example, the computer was rebooted, with a fresh agent connection) and aborts the sh step with the special code -1.

            So the system is robust against the user script being abruptly killed (that would just be a nonzero exit code) or the user script and the wrapper script both being killed (that would be exit code -1). If the wrapper script alone is killed, then the step will seem to hang for a while after the user script completes, since there is no exit code file, after which the timeout should expire and the step end with code -1. The wrapper script uses { … } rather than (…) for the background touch loop in an attempt to avoid forking another copy of /bin/sh.
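The agent-side check described above can be sketched in shell. This is an illustration of the described decision only, not the plugin's Java implementation: the function name check_status, its timeout parameter, and the treatment of a missing log file are assumptions, and the value 300 merely mirrors the interval suggested earlier in this thread.

```shell
#!/bin/sh
# Sketch of the agent-side decision (hypothetical helper, not plugin code).
# Prints the recorded exit code, "-1" if the heartbeat looks dead, or "running".
check_status() {
  dir=$1        # control directory with jenkins-log.txt / jenkins-result.txt
  timeout=$2    # seconds without a log update before giving up (assumed)

  if [ -f "$dir/jenkins-result.txt" ]; then
    cat "$dir/jenkins-result.txt"
    return
  fi
  if [ ! -f "$dir/jenkins-log.txt" ]; then
    echo "-1"   # assumption: a missing log counts as a dead process tree
    return
  fi
  now=$(date +%s)
  # GNU stat first, BSD/macOS stat as a fallback.
  mtime=$(stat -c %Y "$dir/jenkins-log.txt" 2>/dev/null || stat -f %m "$dir/jenkins-log.txt")
  if [ $((now - mtime)) -gt "$timeout" ]; then
    echo "-1"
  else
    echo "running"
  fi
}

# Usage with a scratch directory:
DIR=$(mktemp -d)
touch "$DIR/jenkins-log.txt"
FRESH=$(check_status "$DIR" 300)               # log just touched
touch -t 200001010000 "$DIR/jenkins-log.txt"   # pretend the heartbeat stopped long ago
STALE=$(check_status "$DIR" 300)
echo 0 > "$DIR/jenkins-result.txt"             # wrapper recorded an exit code
FINISHED=$(check_status "$DIR" 300)
echo "$FRESH $STALE $FINISHED"
rm -rf "$DIR"
```

A check of this shape keeps answering "running" for as long as the log file is being touched and no result file exists, which is the combination reported in this issue.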

            Now if something about this story is not working, we want to fix it, but I need to understand exactly what is going wrong.

            lvotypkova Lucie Votypkova added a comment -

            Thank you for describing the intended behavior. Now it is easier for me to understand.

            Our issue: we have slaves connected through SSH. When Jenkins restarts or crashes, the connection is usually closed. So the slave.jar process is interrupted too, and it interrupts the wrapper script as well (I am not sure about this part, which is why I am discussing the issue here first instead of sending a pull request). When Jenkins is up again, the pipeline job correctly detects that the build is not finished and that the sh script was launched, and it waits for the result, but in this case endlessly. Since pipeline jobs are designed to survive a restart, Jenkins is allowed to restart during a build, so it quite often happens that a build gets stuck forever and the job has to be killed manually. In such a case I checked the processes on the slave: the wrapper was running and the periodic touches were happening, but the result file did not appear.

            My theory is that closing the connection sends signal 15 to the wrapper script process too, and that is what my test checks. I reproduced it manually on my laptop with Jenkins alone, without a slave: I run Jenkins from a console, run a pipeline job with sh sleep 60, and when I see that sleep 60 is running, I close the console with Jenkins. When I start Jenkins again, I get what I described in previous comments: the pipeline job is stuck endlessly and there are endless wrapper script executions. My point is that when the wrapper script gets kill -15, it should either die completely (both the periodic touches of the log file and the script execution) or survive completely (so that if the script dies, the periodic touches die too). What is wrong is that the touches survive but the script execution does not (or rather, the part that writes the exit code and the result file is not executed). I think the better solution would be to send both to the background (one of them has to run in the background anyway, because they need to run in parallel). But that would make the wrapper script practically resistant to kill -15 (on the other hand, it is already partially resistant, at least the part with touch), and my concern is whether that was the intent or whether it is an acceptable solution. Trying to detect when the touches got stuck is harder than sending both to the background. Since you added nohup for some cases [1], which gives some resistance but not as strong as sending to the background, I am not sure how to handle this issue.

             

            jglick Jesse Glick added a comment -

            the connection is usually closed. So the slave.jar process is interrupted too and it interrupts the wrapper script too

            Then this is at least the main problem. It should not be sending an interrupt to the wrapper script. This is using ssh-slaves plugin?

            wrapper was running and periodic touches was done

            Hmm, so what kind of interrupt was sent then?

            I reproduced it manually on laptop just only Jenkins threads without slave

            If you are not using agents, and you stop Jenkins with Ctrl-C, then this is a known limitation discussed in JENKINS-25503. I do not consider it much of a priority since no one should be using master executors to begin with. Of course a solution may improve behavior with certain agent launchers too.

            it should behave that it kills all (the periodic touches the log file and the script execution) or it survives all (the script dies but the periodic touches too)

            Yes of course, we expect the whole wrapper process to either be working, or killed.

            lvotypkova Lucie Votypkova added a comment -

            Yes, we use the ssh-slaves plugin. I will try to use it in my test too, so we can see whether I am right. I was not sure what exactly happens to the ssh slave connection, slave.jar and its child processes when Jenkins is restarted. It is not a problem to change my test a little. Originally I wanted to test only the durable shell task, to show that there is a way (and signal 15 is not unusual) for it to get stuck in a state we do not want: touch is running, the result file is not created, but the script has finished.
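
The three moving parts named in this comment (the periodic touch, the script run, and the result file) can be laid out as a minimal shell sketch. This is an illustration only, not the durable-task plugin's actual wrapper: the control directory, the file names `log` and `result`, and the stand-in build step are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical control directory and file names, for illustration only.
ctl=$(mktemp -d)
log="$ctl/log"
result="$ctl/result"

# 1. Heartbeat: keep touching the log until a result file shows up
#    (or the control directory disappears).
{ while [ -d "$ctl" ] && [ ! -f "$result" ]; do touch "$log"; sleep 1; done; } &

# 2. Run the user script (a stand-in here), capturing its output in the log.
sh -c 'echo building; exit 3' >> "$log" 2>&1

# 3. Publish the exit status atomically: write to a temp name, then rename.
echo $? > "$result.tmp"
mv "$result.tmp" "$result"

# 4. Wait for the heartbeat to notice the result file and stop.
wait
status=$(cat "$result")
echo "recorded exit status: $status"
rm -rf "$ctl"
```

In this simplified form, if the shell running steps 2 and 3 is killed before the `mv`, nothing ever writes the result file and the loop from step 1 keeps touching the log indefinitely. That matches the stuck state described here: heartbeat alive, no result.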

            jglick Jesse Glick added a comment -

            I suspect PR 75 would solve the problem, though I would prefer to understand the root cause.

            jglick Jesse Glick added a comment - - edited

            Changes in JENKINS-47791 introduced the pair of sh processes which apparently causes this issue.

            in an attempt to avoid forking another copy of /bin/sh

            Apparently a failed attempt. Maybe { works to avoid forking most of the time, but not when & is in use?
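
The suspicion that `{ ...; }` stops being fork-free once `&` is added can be checked with a small experiment. The sketch below is not taken from the plugin. It records the PID of the process that executes a backgrounded brace group by asking a child `sh` for its `$PPID`; the temporary file is only a way to get that value back to the main shell.

```shell
#!/bin/sh
# Record the PID of the shell that actually runs a backgrounded { ...; } group.
tmp=$(mktemp)

# `sh -c 'echo "$PPID"'` prints the PID of its parent, i.e. whichever shell
# process is executing the group. The trailing `:` keeps the child sh from
# being the last command in the group, so a shell that exec()s the final
# command in place cannot hide the intermediate process.
{ sh -c 'echo "$PPID"' > "$tmp"; :; } &
wait

group_pid=$(cat "$tmp")
rm -f "$tmp"

echo "main shell PID:       $$"
echo "background group PID: $group_pid"
```

If the two PIDs differ, the brace group ran in a separate shell process. That would account for the pair of sh processes mentioned above.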

            jglick Jesse Glick added a comment -

            Anyone consistently seeing this problem, please try installing this build (using Plugin Manager » Advanced) and let me know if it helps.

            andreler André Leruitte added a comment -

            Thanks for the suggestion, Jesse Glick.

            I upgraded the durable-task plugin on Saturday night, and unfortunately we are still running into never-ending jobs.

             

            These 4 jobs will never finish, and cannot be aborted:

             

            Our only way to clear them is to restart that (dockerized) Jenkins instance. We are also running into startup performance issues (which we still have to diagnose) that make each Jenkins restart take up to 30 minutes...

             

             

            jglick Jesse Glick added a comment -

            André Leruitte your issue may or may not have anything to do with the issue reported here. There are dozens of reasons why a build might hang. It is necessary to perform detailed diagnostics to confirm a particular problem.

            jglick Jesse Glick added a comment -

            Patch merged but as yet unreleased.

            svanoort Sam Van Oort added a comment -

            Released as durable-task 2.23

            lvotypkova Lucie Votypkova added a comment -

            Thank you, Jesse. I did a quick test and it seems to have solved our problem (there may be other issues, but not this one).

            robinro Robin Roth added a comment -

            The change broke use of the official Alpine-based jnlp-slave image, since this image uses busybox for ps, which does not support -p.

            The image has now been fixed to include a working ps version: https://github.com/jenkinsci/docker-jnlp-slave/issues/65

            merlindam Damien Merlin added a comment - - edited

            Hi, I noticed an issue with this fix: it generates error messages on my Windows nodes (where I use Cygwin):

            [e:\jenkins\workspace\test-node-windows-ps] Running shell script
            ps: unknown option -- o
            Try `ps --help' for more information.

            So my trouble is with this code:

            cmd = String.format("pid=$$; { while ps -o pid -p $pid | grep -q $pid && [ -d '%s' -a ! -f '%s' ]; do touch '%s'; sleep 3; done } & jsc=%s; %s=$jsc '%s' > '%s' 2> '%s'; echo $? > '%s.tmp'; mv '%s.tmp' '%s'; wait",

            On Cygwin I would expect: ps -p $pid

            $ ps -p 12588
              PID PPID PGID WINPID TTY UID STIME COMMAND
              12588 10484 12588 1836 pty1 1053975 09:58:55 /usr/bin/bash

            because the following does not work:

            $ ps -o pid -p 12588
            ps: unknown option -- o
            Try `ps --help' for more information.

            The same command works on Debian jessie, for example:

            $ ps -o pid -p 102008
              PID
            102008

            So that explains my error message, and it means the fix does not work in a Cygwin configuration.

            My concerns were addressed by https://issues.jenkins-ci.org/browse/JENKINS-52881
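
One way to test whether a PID is still alive without depending on which `ps` options a platform supports is `kill -0`, which POSIX specifies as performing the existence and permission check without delivering a signal. The sketch below is only an illustration of that technique. It is not necessarily what JENKINS-52881 changed, the helper name `is_alive` is made up for the example, and its behavior on Cygwin or BusyBox has not been verified in this thread.

```shell
#!/bin/sh
# is_alive PID: succeed if a process with that PID exists and may be signalled.
# `kill -0` sends no signal; it only performs the existence/permission check.
is_alive() {
  kill -0 "$1" 2>/dev/null
}

# Usage: start a short-lived child, then check it before and after it is gone.
sleep 5 &
child=$!
if is_alive "$child"; then before=alive; else before=gone; fi

kill "$child"
wait "$child" 2>/dev/null || true   # reap the child so its PID is really released

if is_alive "$child"; then after=alive; else after=gone; fi
echo "before: $before, after: $after"
```

One caveat: `kill -0` also fails for a live process owned by another user, so it suits a wrapper watching its own shell. It is not a general process probe.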

             

            kudiaborm Marley Kudiabor added a comment -

            Sam Van Oort, durable-task is currently on 1.26; does 2.23 refer to the Jenkins version itself?

            jglick Jesse Glick added a comment - - edited

            No, the durable-task plugin. For most Pipeline issues, the version of Jenkins core is irrelevant. I presume Sam Van Oort meant something else, since there is no such version of this plugin.


              People

              Assignee:
              jglick Jesse Glick
              Reporter:
              lvotypkova Lucie Votypkova