I have a similar issue that happens occasionally but often enough to be a nuisance.
The Jenkins pipeline log throws the message:
"Failed to get job status from Tower: Unexpected error code returned (503)"
Both AWX and Jenkins run within OpenShift.
AWX is typically not very busy - we rarely have more than 3 jobs running at the same time.
The issue seems to happen most frequently while an AWX job is waiting for a deployment to OpenShift to finish.
I have not found any useful logs either in AWX or in Jenkins, but maybe I'm not looking in the right location. The closest I found was this, and I don't know if it is related or not:
[Ansible-Tower] Building GET request to https://awxserver/api/v2/jobs/57602/
[Ansible-Tower] Forcing cert trust
[Ansible-Tower] Request completed with (503)
[Ansible-Tower] Deleting oAuth token 15396 for awx
[Ansible-Tower] Forcing cert trust
[Ansible-Tower] Calling for oAuth token delete at https://awxserver/api/v2/tokens/15396/
[Ansible-Tower] Request completed with (200)
We are running AWX 9.
The deployment part on which AWX is waiting typically takes 10-15 minutes, the rest of the job that involves AWX takes maybe an additional 5 minutes.
We are using a pipeline, and we are not using async at this time.
I'd also appreciate a retry feature, or some advice on how to diagnose this issue, since it fails our pipelines at random even when the AWX job itself completes successfully.
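To make the retry request concrete, here is a minimal sketch of the kind of behaviour meant: keep polling the job status and tolerate a few consecutive 503 responses before giving up. This is not how the plugin works today. `wait_for_job`, `fetch_status`, `max_transient` and the set of "transient" HTTP codes are all hypothetical names and assumptions made for illustration. The only thing taken from AWX itself is that `GET /api/v2/jobs/<id>/` returns a `status` field.

```python
import time

# Job states the AWX/Tower API reports once a job has stopped running.
TERMINAL_STATES = {"successful", "failed", "error", "canceled"}
# HTTP codes treated as transient. This set is an assumption, not plugin behaviour.
TRANSIENT_CODES = {502, 503, 504}


def wait_for_job(fetch_status, max_transient=5, poll_seconds=10.0, sleep=time.sleep):
    """Poll until the job reaches a terminal state, tolerating short outages.

    fetch_status is a hypothetical callable returning (http_code, status_or_None),
    for example a thin wrapper around GET /api/v2/jobs/<id>/.
    """
    consecutive_failures = 0
    while True:
        code, status = fetch_status()
        if code == 200:
            # A good response resets the failure counter.
            consecutive_failures = 0
            if status in TERMINAL_STATES:
                return status
        elif code in TRANSIENT_CODES:
            consecutive_failures += 1
            if consecutive_failures > max_transient:
                raise RuntimeError(
                    f"giving up after {consecutive_failures} consecutive HTTP {code} responses"
                )
        else:
            # Anything else (401, 404, ...) is not worth retrying.
            raise RuntimeError(f"unexpected HTTP {code}")
        sleep(poll_seconds)
```

Counting only consecutive failures means a single 503 during a 15-minute deployment would not fail the pipeline, while an AWX instance that is really down would still be reported after `max_transient` polls.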
For some background on this: are you running Tower or AWX?
Also, how long are your jobs running for? Are we talking days or hours?
Do you know what is making your instance become unresponsive during that time?
Also, are you using a pipeline or a freestyle job in Jenkins? If you are using a pipeline, have you tried running with the async option?
In general, I feel that a Tower infrastructure should be stable enough to remain up during long-running jobs, rather than having the plugin "tolerate" unstable connections. However, if there are special circumstances to consider, I might be willing to do something within the plugin to make it more forgiving of bad Tower connections.