Jenkins / JENKINS-62525

Add some reconnection mechanism to Ansible Tower job monitoring

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Component: ansible-tower-plugin
    • Labels: None

      Ansible Tower jobs can run for a long time, and it is quite normal for the connection to be dropped or to time out. Currently, these situations mark the Jenkins build as failed because the code throws an exception. I think the Ansible Tower plugin should have some reconnection mechanism to handle these situations (e.g. fail only after 5 consecutive connection timeouts or drops during the configured period).
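The tolerance described above (fail only after N consecutive failures, with the counter reset by any successful poll) could be sketched as a wrapper around the status-poll call. This is a minimal illustration, not the plugin's actual code; the class and method names (`TolerantPoller`, `pollWithTolerance`) are made up for the example:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class TolerantPoller {
    /**
     * Runs one poll attempt, tolerating up to maxConsecutiveFailures
     * transient connection errors in a row. Because the method returns on
     * the first success, a monitoring loop that calls it once per poll
     * survives occasional drops during a long job, while a dead connection
     * still fails the build after the configured number of attempts.
     */
    public static <T> T pollWithTolerance(Callable<T> poll,
                                          int maxConsecutiveFailures,
                                          long retryDelayMillis) throws Exception {
        int consecutiveFailures = 0;
        while (true) {
            try {
                return poll.call();
            } catch (IOException e) { // connection dropped or timed out
                consecutiveFailures++;
                if (consecutiveFailures >= maxConsecutiveFailures) {
                    throw new IOException("Giving up after " + consecutiveFailures
                            + " consecutive connection failures", e);
                }
                Thread.sleep(retryDelayMillis); // back off before retrying
            }
        }
    }
}
```

A fixed delay is used here for brevity; a real implementation might prefer exponential backoff so repeated retries do not hammer a struggling Tower/AWX instance.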


          John Westcott added a comment -

          For some background on this; are you running Tower or AWX?

          Also, how long are your jobs running for? Are we talking days or hours?

          Do you know what is making your instance become unresponsive during that time?

          Also, are you using a pipeline or a freestyle job in Jenkins? If you are using a pipeline have you tried running with the async option?

          In general, I feel that a Tower infrastructure should be stable enough to remain up during long-running jobs, rather than having the plugin "tolerate" unstable connections. However, if there are special circumstances to be considered, I might be willing to do something within the plugin to make it more forgiving of bad Tower connections.


          Adam Medziński added a comment -

          I'm running AWX. I have jobs that last for several hours and I'm using a pipeline job. I use the async option with Groovy code that retries status queries - but it seems strange, since it really just blocks the entire pipeline while waiting for the result of the task (so I have to use async to get sync behavior). When I used the sync call, I got a timeout exception after several minutes.


          John Westcott added a comment -

          Do you see your AWX server timeout in the web browser around the same time that Jenkins is timing out?


          Richard added a comment - edited

          I have a similar issue that happens occasionally but often enough to be a nuisance.

          The Jenkins pipeline log throws the message:

          "Failed to get job status from Tower: Unexpected error code returned (503)"

          Both AWX and Jenkins run within Openshift.

          AWX is typically not very busy - we rarely have more than 3 jobs running at the same time.

          The issue seems to happen most frequently while AWX is busy waiting for a deployment to Openshift to finish.

          I have not found any useful logs either in AWX or in Jenkins, but maybe I'm not looking in the right location. The closest I found was this, and I don't know if it is related or not:

          [Ansible-Tower] Building GET request to https://awxserver/api/v2/jobs/57602/

          [Ansible-Tower] Forcing cert trust

          [Ansible-Tower] Request completed with (503)

          [Ansible-Tower] Deleting oAuth token 15396 for awx

          [Ansible-Tower] Forcing cert trust

          [Ansible-Tower] Calling for oAuth token delete  at https://awxserver/api/v2/tokens/15396/

          [Ansible-Tower] Request completed with (200)

          We are running AWX 9.

          The deployment part on which AWX is waiting typically takes 10-15 minutes, the rest of the job that involves AWX takes maybe an additional 5 minutes.

          We are using a pipeline, and we are not using async at this time.

          I'd also appreciate a retry feature, or some advice on how to track down this issue, as it fails our pipelines randomly even though the AWX job may complete successfully.
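The 503 in the log above suggests a retry feature would also need to decide which HTTP responses are worth retrying at all. A sketch of such a classification follows; the set of status codes is an assumption for illustration, not the plugin's actual behavior:

```java
public class TowerStatusCheck {
    /**
     * Returns true for HTTP statuses that typically indicate a transient
     * condition (an overloaded or briefly unavailable server) where a
     * retry is reasonable, and false for statuses that a retry cannot fix.
     */
    public static boolean isTransient(int httpStatus) {
        switch (httpStatus) {
            case 429: // too many requests
            case 502: // bad gateway
            case 503: // service unavailable, as seen in the log above
            case 504: // gateway timeout
                return true;
            default:  // 2xx is success; 4xx like 401/404 will not improve on retry
                return false;
        }
    }
}
```

With a check like this, a poll returning 503 could be counted against a consecutive-failure budget instead of immediately failing the build, while a 401 or 404 would still fail fast.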


            Assignee: johnwestcottiv John Westcott
            Reporter: medzin Adam Medziński
            Votes: 0
            Watchers: 3