Type: Bug
Resolution: Fixed
Priority: Major
On our instances, Pipeline jobs very often get stuck after a restart. The Pipeline seems to detect that it needs to continue and tries to resume execution where it was interrupted, but nothing is actually executed. It only looks as if it is running in the UI: the computer is occupied by the job and the build is in the running state, but it appears to be in an endless wait cycle.
Interestingly, this state is probably saved persistently. I do not hit it every time, but once a build is in this state, I can restart Jenkins as many times as I want and it never recovers (it is always stuck).
I believe the problem is somewhere in program.dat, but unlike XML it is not easily human-readable, so I am not sure where the difference is.
I did the same with the previous 2 runs and they were able to recover, but the third one did not.
I am attaching a screenshot of the successful recoveries (one of them ended in failure, but it did not get stuck, which counts as success in this case) and a screenshot of the build with the issue. Since the state seems to be persistent (as described above), I archived the Jenkins home and am attaching it too, together with the Jenkins war. The archived Jenkins home also contains the Pipeline versions.
Thank you for your help!
is related to:
- JENKINS-53857 No option to disable build continuation when computer forcefully restarted (Closed)
relates to:
- JENKINS-47791 Eliminate ProcessLiveness (Resolved)
- JENKINS-46283 pipeline hangs after executing sh step command (Resolved)
- JENKINS-52881 durable-task plugin v1.23 kills jobs on Cygwin/MSys agents (Resolved)
- JENKINS-48300 Pipeline shell step aborts prematurely with ERROR: script returned exit code -1 (Resolved)
[JENKINS-50892] Pipeline jobs stuck after restart
Maybe related to JENKINS-50199 svanoort? Just a casual guess, have not looked at the details.
lvotypkova Have you tried adding this startup setting on your Jenkins master? "-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300"
Off the top of my head, this one looks like it's an issue with the Shell step and how it detects whether the shell step is running or dead, rather than a resume issue. Adding a long-enough timeout with that setting may resolve this.
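One common way to pass such a setting is as a JVM system property on the command line that starts Jenkins. The sketch below only assembles and prints that command line. The `jenkins.war` location and the variable names are assumptions; installations that start Jenkins from a service wrapper or a container would set the property in their own startup configuration instead.

```shell
#!/bin/sh
# Property name and value are taken from the suggestion above.
prop="org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL"
interval=300

# Assumption: Jenkins is started directly from a war file in the current directory.
jenkins_cmd="java -D${prop}=${interval} -jar jenkins.war"

# Print the command instead of running it, so the sketch has no side effects.
echo "$jenkins_cmd"
```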
While checking my laptop I realized that the bug may not be on the Pipeline side. I checked my processes, and it seems that the shell step is made durable by running it in a way that keeps a result file. That run seems to fail and hang. Here is an example of my processes (I show only two, but there are more like them from my attempts to reproduce the issue):
lvotypko 7007 0.0 0.0 113248 668 pts/2 S Apr17 0:17 sh -c { while [ -d '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40' -a ! -f '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt' ]; do touch '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-log.txt'; sleep 3; done } & jsc=durable-da19427a0c4143110a7079e69bb900f7; JENKINS_SERVER_COOKIE=$jsc '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/script.sh' > '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-log.txt' 2>&1; echo $? > '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt.tmp'; mv '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt.tmp' '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt'; wait
lvotypko 10819 0.0 0.0 113248 792 pts/10 S 10:28 0:06 sh -c { while [ -d '/tmp/workspace/reproducer@tmp/durable-f0a2ec55' -a ! -f '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-result.txt' ]; do touch '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-log.txt'; sleep 3; done } & jsc=durable-bc6ed2e151065a52c7c18babc726f6be; JENKINS_SERVER_COOKIE=$jsc '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/script.sh' > '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-log.txt' 2>&1; echo $? > '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-result.txt.tmp'; mv '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-result.txt.tmp' '/tmp/workspace/reproducer@tmp/durable-f0a2ec55/jenkins-result.txt'; wait
It is clear that one of them is from yesterday and is still running!
This explains why it happens quite often on our instances (and why a restart somehow plays a role): we use shell steps quite often, and often for long-running tasks.
The only command in script.sh is "sleep 60", but grepping the process list for sleep 60 returns nothing. So it seems that it hangs in the functionality before or after that. Because script.sh and jenkins-log.txt exist and jenkins-log.txt has the expected output, I guess that something goes wrong after sleep 60 is executed.
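For readability, the single-line wrapper command from the process listing can be laid out over several lines. The following is a sketch, not the plugin's code: the scratch directory stands in for `<workspace>@tmp/durable-<id>`, the `JENKINS_SERVER_COOKIE` assignment is left out, and the heartbeat sleeps 1 second instead of 3 so the demo finishes quickly.

```shell
#!/bin/sh
# Hypothetical control directory for the demo.
dir=$(mktemp -d)
printf '#!/bin/sh\necho hello\n' > "$dir/script.sh"
chmod +x "$dir/script.sh"

sh -c '
  dir=$1
  # Heartbeat: runs in the background for as long as the control directory
  # exists and no result file has been written.
  { while [ -d "$dir" ] && [ ! -f "$dir/jenkins-result.txt" ]; do
      touch "$dir/jenkins-log.txt"
      sleep 1
    done; } &
  # Foreground: run the user script, capturing stdout and stderr in the log.
  "$dir/script.sh" > "$dir/jenkins-log.txt" 2>&1
  # Record the exit code: write a temporary file, then rename it.
  echo $? > "$dir/jenkins-result.txt.tmp"
  mv "$dir/jenkins-result.txt.tmp" "$dir/jenkins-result.txt"
  # Wait for the heartbeat loop to notice the result file and exit.
  wait
' wrapper "$dir"

result=$(cat "$dir/jenkins-result.txt")
log=$(cat "$dir/jenkins-log.txt")
rm -rf "$dir"
echo "exit code: $result, log: $log"
```

Note that the two commands writing the result file run in the foreground shell, after the user script returns, which is the part that matters for the rest of this discussion.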
lvotypkova You should be able to manually invoke the command on that build agent with some modification for debugging – I wonder if something is wacky in the bash config there, perhaps something that triggers a premature failure? The 'set -e' option springs to mind here, but there could be something else at play.
OK, I am not sure whether this is the same thing that happens on our instance (I made a local reproducer which does the same thing: it calls a shell step), but it is clear that the Pipeline execution itself is correct. I manually wrote the jenkins-result.txt.tmp file with the number 0, and the build, which had been stuck for about one hour, finished with success immediately.
I will watch our instance to see whether I hit the same issue as in my reproducer (from the UI it looks like it). For now I will concentrate on the durable shell step.
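The manual workaround described above can be sketched by mirroring the wrapper's own last two commands: write the exit code to the `.tmp` file, then rename it to `jenkins-result.txt`. The helper name `write_result` is hypothetical, and the rename step is an assumption based on the wrapper command in the process listing rather than on the comment itself.

```shell
#!/bin/sh
# write_result DIR CODE: hypothetical helper that records CODE as the step's exit status.
write_result() {
  echo "$2" > "$1/jenkins-result.txt.tmp"
  mv "$1/jenkins-result.txt.tmp" "$1/jenkins-result.txt"
}

# Demo against a scratch directory standing in for <workspace>@tmp/durable-<id>.
dir=$(mktemp -d)
write_result "$dir" 0
recorded=$(cat "$dir/jenkins-result.txt")
rm -rf "$dir"
echo "recorded exit code: $recorded"
```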
Sam, OK, I will do so. But I think the connection to Jenkins can play a role: it has never happened without a restart. I tried the same shell step without a restart several times and it never happened. I will check it anyway. Maybe I can modify the shell step code locally to get more information, and I will look more closely at how the durable shell step works; maybe I will get some idea. Thanks for the cooperation!
I concentrated on the script alone (without Jenkins, running only the shell command that the Pipeline runs) and I can reproduce it: when it is interrupted in the middle, it hangs. Do I understand correctly that it should not only start the execution but also wait for it to finish (in this case for sleep 60)? I thought that it should be executed in the background, because the Pipeline waits for the result file and the log can be read from jenkins-log.txt. Or am I wrong?
I checked the script and it is badly written. In the while block it periodically checks whether the file "jenkins-result.txt" exists, but when the script is interrupted (due to a restart) while script.sh is running but not yet finished, these commands are never executed:
echo $? > '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt.tmp'
mv '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt.tmp' '/home/lvotypko/workspace/workspace/Reproducers/pipeline-stuck-after-restart@tmp/durable-1be3ec40/jenkins-result.txt'
As a result, the file "jenkins-result.txt" never exists, and the background loop that checks for its existence runs forever.
Would it not be better to do it all in the background? I will write a test for it; it should not be a problem to catch this with a test. I have sent pull requests so you can see it.
I created a reproducer for this behavior [1]. I believe it would be useful to use not only the result file as an indicator that the script is still running, but also a file with the PID of the process. In my test I run the script "sleep 22". I check for the script that launched it, then wait for the process that directly executes 'sleep 22'. While 'sleep 22' is running, I kill the launching process (the parent of the 'sleep 22' process) with signal 15 (which is what usually happens when the connection is interrupted). The result is that after waiting 30 seconds, the 'sleep 22' process has run and finished, but the launching process is still running and the result file does not exist. It stays in this state forever.
[1] https://github.com/lvotypko/durable-task-plugin/commit/a21ed041f67431a18d106760e244e0e8b5c94cc3
Durable tasks support killing of the user process. If the wrapper process is killed, all bets are off, which is why the plugin will just wait for a while to see if the output file has been touched, and after a while give up and return exit code -1. There is no need to track PIDs (we used to do that and it was a nightmare). So there are two questions of interest about your environment:
- What about interrupting the connection is causing the launching process to be killed? This sounds like a problem with your agent launcher. I have seen it happen with Docker-based agents that were not using the -i option.
- What do you mean that the launching process is still running? You just said you killed the launching process.
And of course the touch is still executed, because there is no result file. The whole problem is that this process does not die after kill -15, yet the part that creates the result file (echo $?) is for some reason not executed. I am not a shell expert and I am not sure which part is interrupted by signal 15, but it is definitely not the while loop with touch. script.sh is executed (I can watch it from the console; you can reproduce it manually very easily too), but when I send kill -15 to the process while script.sh is running, it never executes echo $?, never writes the result file, and never runs the rest of the command. I am not sure why. I thought that surviving kill -15 was a feature, so that it survives, for example, an agent disconnection during a Jenkins restart. But in that case the whole execution should survive, not only the while loop with touch.
I think it is because the while loop with touch is sent to the background, so kill -15 interrupts only the rest of the script, and wait is waiting for the process with the while loop. But I am not 100% sure.
I can fix it by sending the script execution (together with echo $? > tmp result file and copying the tmp result file to the result file) to the background too. But I am not sure if that is the correct approach.
If you agree with sending the execution to the background too, you can check this commit: https://github.com/lvotypko/durable-task-plugin/commit/541f1ddb31839891489231c94024bedd7f0d6aad. It includes a test that reproduces the issue (interruptScriptExecution).
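The behavior described above can be reproduced outside Jenkins with a sketch like the following. It is not the test from the commit: the scratch directory, the `loop.pid` file, the 1-second heartbeat, and the `sleep 2` user script are all assumptions made to keep the demo short. It starts a wrapper of the same shape as the one in the process listing, sends SIGTERM to the wrapper shell only, and then checks what is left.

```shell
#!/bin/sh
dir=$(mktemp -d)   # hypothetical control directory
printf '#!/bin/sh\nsleep 2\n' > "$dir/script.sh"
chmod +x "$dir/script.sh"

# Start the wrapper in the background so that we know its PID.
sh -c '
  dir=$1
  { while [ -d "$dir" ] && [ ! -f "$dir/jenkins-result.txt" ]; do
      touch "$dir/jenkins-log.txt"; sleep 1
    done; } &
  echo $! > "$dir/loop.pid"   # illustration only: remember the heartbeat PID
  "$dir/script.sh" > "$dir/jenkins-log.txt" 2>&1
  echo $? > "$dir/jenkins-result.txt.tmp"
  mv "$dir/jenkins-result.txt.tmp" "$dir/jenkins-result.txt"
  wait
' wrapper "$dir" > /dev/null 2>&1 &
wrapper_pid=$!

sleep 1
kill -15 "$wrapper_pid"                  # SIGTERM to the wrapper shell only
wait "$wrapper_pid" 2>/dev/null || true
sleep 3                                  # long enough for "sleep 2" to finish

if [ -f "$dir/jenkins-result.txt" ]; then result_written=yes; else result_written=no; fi
if kill -0 "$(cat "$dir/loop.pid")" 2>/dev/null; then loop_alive=yes; else loop_alive=no; fi

# Clean up: stop the orphaned heartbeat loop and remove the directory.
kill "$(cat "$dir/loop.pid")" 2>/dev/null || true
rm -rf "$dir"
echo "result written: $result_written, heartbeat alive: $loop_alive"
```

With a typical /bin/sh the heartbeat loop, which was forked into its own process by `&`, outlives the wrapper, while the commands that would write the result file are lost together with the wrapper shell.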
I see a test but I do not follow what actual scenario it is simulating. What is sending SIGTERM, to which process, and why?
To give some background, the intended behavior is that:
- The agent JVM launches a wrapper script, throwing away the Proc so that it is not waiting for the wrapper script process to finish.
- The wrapper script launches the user script, collecting stdout/stderr to a file, and waiting for it to finish, sending the exit code to another file.
- The agent JVM periodically checks for an exit code and/or new log output, and updates the status of the sh step accordingly.
- If the agent is killed and restarted (including both Jenkins master restarts and Remoting disconnections), the wrapper script and user script can continue to run uninterrupted.
- The wrapper script also periodically touches the log file to confirm that it is still there. If the agent sees no updates to the log file, it assumes that the whole process tree died (for example, the computer was rebooted, with a fresh agent connection) and aborts the sh step with the special code -1.
So the system is robust against the user script being abruptly killed—that would just be a nonzero exit code—or the user script and the wrapper script both being killed—that would be exit code -1. If the wrapper script alone is killed, then the step will seem to hang for a while after the user script completes, since there is no exit code file, after which the timeout should expire and the step end with code -1. The wrapper script uses {…} rather than (…) for the background touch loop in an attempt to avoid forking another copy of /bin/sh.
Now if something about this story is not working, we want to fix it, but I need to understand exactly what is going wrong.
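The agent-side check in the intended behavior described above can be approximated in shell as follows. This is only a sketch of the decision logic: the plugin performs the check from the agent JVM, the function name `check_step` and the 300-second timeout are hypothetical, and `stat -c %Y` assumes GNU coreutils.

```shell
#!/bin/sh
# check_step DIR TIMEOUT: print the exit code if one was recorded, "running"
# while the log file is still being touched, or -1 if it has gone stale.
check_step() {
  dir=$1; timeout=$2
  if [ -f "$dir/jenkins-result.txt" ]; then
    cat "$dir/jenkins-result.txt"
    return
  fi
  now=$(date +%s)
  # Modification time of the log in seconds since the epoch (0 if missing).
  touched=$(stat -c %Y "$dir/jenkins-log.txt" 2>/dev/null || echo 0)
  if [ $((now - touched)) -gt "$timeout" ]; then
    printf '%s\n' "-1"
  else
    echo running
  fi
}

# Demo on a scratch directory.
d=$(mktemp -d)
touch "$d/jenkins-log.txt"
fresh=$(check_step "$d" 300)                 # log was just touched
touch -t 200001010000 "$d/jenkins-log.txt"   # pretend the heartbeat stopped long ago
stale=$(check_step "$d" 300)
echo 3 > "$d/jenkins-result.txt"             # pretend the user script exited with 3
done_code=$(check_step "$d" 300)
rm -rf "$d"
echo "$fresh $stale $done_code"
```

The case reported in this issue corresponds to the "running" branch never ending: the heartbeat keeps the log fresh, so the stale check does not fire, and no result file ever appears.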
Thank you for describing the intended behavior. Now it is easier for me to understand.
Our issue: we have slaves connected through ssh. When Jenkins restarts or crashes, the connection is usually closed. So the slave.jar process is interrupted too, and it interrupts the wrapper script as well (I am not sure about this part, which is why I am discussing the issue here first instead of sending a pull request). When Jenkins is up again, the Pipeline job correctly detects that the build is not finished and that the sh script was launched, and it waits for the result, but in this case endlessly. Since Pipeline jobs are designed to survive restarts, Jenkins is allowed to restart during a build, so it quite often happens that a build gets stuck forever and has to be killed manually. In such a case I checked the processes on the slave: the wrapper was running and the periodic touches were being done, but the result file did not appear.
My theory is that closing the connection also sends signal 15 to the wrapper script process, and that is what my test tests. I reproduced it manually on my laptop with Jenkins alone, without a slave: I run Jenkins from a console, run a Pipeline job with sh sleep 60, and when I see that sleep 60 is running, I close the console with Jenkins. When I start Jenkins again, I get what I described in my previous comments: the Pipeline job is stuck endlessly and the wrapper script processes keep running endlessly. My point is that when the wrapper script gets kill -15, either everything should be killed (the periodic touching of the log file and the script execution) or everything should survive (and if the script dies, the periodic touches should die too). What is wrong is that the touches survive but the script execution does not (or rather, the part that writes the result file is not executed). I think the better solution would be to send both to the background (one of them has to be in the background anyway, because they must run in parallel). That would make the wrapper script practically resistant to kill -15 (although it is already partially resistant, at least the part with touch), and my question is whether that was intended or is an acceptable solution, because trying to detect that the touches got stuck is harder than sending both to the background. Since nohup was added for some cases [1], which gives some resistance but not as strong as sending to the background, I am not sure how to handle this issue.
"the connection is usually closed. So the slave.jar process is interrupted too and it interrupts the wrapper script too"
Then this is at least the main problem. It should not be sending an interrupt to the wrapper script. This is using ssh-slaves plugin?
"wrapper was running and periodic touches was done"
Hmm, so what kind of interrupt was sent then?
"I reproduced it manually on laptop just only Jenkins threads without slave"
If you are not using agents, and you stop Jenkins with Ctrl-C, then this is a known limitation discussed in JENKINS-25503. I do not consider it much of a priority since no one should be using master executors to begin with. Of course a solution may improve behavior with certain agent launchers too.
"it should behave that it kills all (the periodic touches the log file and the script execution) or it survives all (the script dies but the periodic touches too)"
Yes of course, we expect the whole wrapper process to either be working, or killed.
Yes, we use the ssh-slaves plugin. I will try to use it in my test too, so we can see whether I am right. I was not sure what exactly happens with the ssh slave connection, slave.jar, and its child processes when Jenkins is rebooted. It is not a problem to change my test a little. Originally I only wanted to test the durable shell task, to show that there is a way (and signal 15 is nothing unusual) for it to get stuck in a state we do not want: touch is running, the result file is not created, but the script has finished.
I suspect PR 75 would solve the problem, though I would prefer to understand the root cause.
Changes in JENKINS-47791 introduced the pair of sh processes which apparently causes this issue.
"in an attempt to avoid forking another copy of /bin/sh"
Apparently a failed attempt. Maybe { works to avoid forking most of the time, but not when & is in use?
Anyone consistently seeing this problem, please try installing this build (using Plugin Manager » Advanced) and let me know if it helps.
Thanks for the suggestion, jglick.
I upgraded the durable-task plugin on Saturday night, and unfortunately we are still running into never-ending jobs.
These 4 jobs will never finish, and cannot be aborted:
Our only way to clear them is to restart that (dockerized) Jenkins instance. We are also running into startup performance issues (that we still have to diagnose) that makes each Jenkins restart take up to 30 minutes...
andreler your issue may or may not have anything to do with the issue reported here. There are dozens of reasons why a build might hang. It is necessary to perform detailed diagnostics to confirm a particular problem.
Thank you, Jesse. I did a quick test and it seems that it solved our problem (maybe there are other issues, but not this one).
The change broke usage of the official jnlp-slave image based on Alpine, since this image uses busybox for ps, which does not support -p.
The image has now been fixed to include a working ps version: https://github.com/jenkinsci/docker-jnlp-slave/issues/65
Hi, I noticed an issue with this fix: it generates error messages on my Windows nodes (where I use Cygwin):
[e:\jenkins\workspace\test-node-windows-ps] Running shell script
ps: unknown option -- o
Try `ps --help' for more information.
So my trouble is with this code:
cmd = String.format("pid=$$; { while ps -o pid -p $pid | grep -q $pid && [ -d '%s' -a ! -f '%s' ]; do touch '%s'; sleep 3; done } & jsc=%s; %s=$jsc '%s' > '%s' 2> '%s'; echo $? > '%s.tmp'; mv '%s.tmp' '%s'; wait",
On Cygwin I would expect ps -p $pid:
$ ps -p 12588
PID PPID PGID WINPID TTY UID STIME COMMAND
12588 10484 12588 1836 pty1 1053975 09:58:55 /usr/bin/bash
because the following does not work:
$ ps -o pid -p 12588
ps: unknown option -- o
Try `ps --help' for more information.
The same command works fine on Debian jessie, for example:
$ ps -o pid -p 102008
PID
102008
That explains my error message, so the fix does not work for a Cygwin configuration.
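The idea behind the changed loop, stopping the heartbeat once the wrapper shell itself is gone, can be sketched without depending on any particular `ps` implementation. The sketch below swaps in `kill -0 "$pid"` for the `ps -o pid -p $pid | grep -q $pid` test quoted above. That substitution is an assumption made for portability of the demo and is not what the plugin shipped; the scratch directory, the `heartbeat.exited` marker file, and the short sleeps are likewise hypothetical.

```shell
#!/bin/sh
dir=$(mktemp -d)   # hypothetical control directory
printf '#!/bin/sh\nsleep 2\n' > "$dir/script.sh"
chmod +x "$dir/script.sh"

sh -c '
  dir=$1
  pid=$$   # PID of the wrapper shell itself
  # The heartbeat ends when the wrapper is gone, the directory is removed,
  # or a result file appears. kill -0 only tests whether the PID exists.
  { while kill -0 "$pid" 2>/dev/null && [ -d "$dir" ] && [ ! -f "$dir/jenkins-result.txt" ]; do
      touch "$dir/jenkins-log.txt"; sleep 1
    done
    touch "$dir/heartbeat.exited"   # illustration only: mark that the loop ended
  } &
  "$dir/script.sh" > "$dir/jenkins-log.txt" 2>&1
  echo $? > "$dir/jenkins-result.txt.tmp"
  mv "$dir/jenkins-result.txt.tmp" "$dir/jenkins-result.txt"
  wait
' wrapper "$dir" > /dev/null 2>&1 &
wrapper_pid=$!

sleep 1
kill -15 "$wrapper_pid"
wait "$wrapper_pid" 2>/dev/null || true   # reap it so the PID really disappears
sleep 3

if [ -f "$dir/heartbeat.exited" ]; then heartbeat_stopped=yes; else heartbeat_stopped=no; fi
rm -rf "$dir"
echo "heartbeat stopped after wrapper was killed: $heartbeat_stopped"
```

Once the heartbeat stops, the log file is no longer touched, so the step can time out with code -1 instead of waiting forever.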
My concerns were addressed by https://issues.jenkins-ci.org/browse/JENKINS-52881
svanoort, the durable-task plugin is currently on 1.26. Does 2.23 imply the Jenkins version itself?
No, the durable-task plugin. For most Pipeline issues, the version of Jenkins core is irrelevant. I presume svanoort meant something else, since there is no such version of this plugin.
I am sorry, I cannot upload the war and the home because of the size limit. I added only the "jobs home"; I will send you the rest by e-mail if you are interested.