
Jenkins does not wait for "Back-off pulling image" to resolve

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: Jenkins 2.346.3
      Kubernetes plugin 3697.v771155683e38
      Kubernetes 1.16
      AWS Elastic Container Registry

      I am referring to this code: https://github.com/jenkinsci/kubernetes-plugin/blob/e44cd177808637b163ab12bfb90c8520adfe1bc7/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pod/retention/Reaper.java#L472

      As I see it, the Kubernetes plugin terminates the job as soon as it notices the "Back-off pulling image" message in the container status.

      An example of the resulting error message:

      Unable to pull Docker image "registry.net/project:version". Check if image tag name is spelled correctly. 

      There are various possible reasons for this:

      • Invalid image tag, or the registry doesn't exist
      • Failed authorization against "registry.net"
      • Rate limits on "registry.net" (this may be the cause of my issue)
      • Network issues
      • etc.

      With that in mind, I think the Kubernetes plugin should apply an exponential backoff retry to the image pull check rather than terminating the agent on the first "Back-off pulling image" status.
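
      Purely as an illustration of the idea (this is not plugin code; the method name waitForImagePull, the imagePullSucceeded closure and all timing values are assumptions), the proposed behaviour could look roughly like this: re-check the image pull status with an exponentially growing delay and only give up once a deadline has passed.

      // Sketch only, not part of the kubernetes-plugin. All names and timings are
      // illustrative assumptions.
      boolean waitForImagePull(Closure<Boolean> imagePullSucceeded,
                               long initialDelayMs = 5000,
                               long maxDelayMs = 60000,
                               long deadlineMs = 600000) {
          long delay = initialDelayMs
          long start = System.currentTimeMillis()
          while (System.currentTimeMillis() - start < deadlineMs) {
              if (imagePullSucceeded()) {
                  return true              // the pull recovered, keep the agent alive
              }
              Thread.sleep(delay)          // back off before checking the container status again
              delay = Math.min(delay * 2, maxDelayMs)
          }
          return false                     // still failing after the deadline, fail the build as today
      }

      A check like this would still fail builds for permanently broken image references, while surviving transient causes such as rate limits or authorization hiccups.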

          [JENKINS-69591] Jenkins does not wait for "Back-off pulling image" to resolve

          Paul H added a comment -

          I've been battling this issue for months. It is particularly annoying when you have a lot of parallel branches. In my case it most likely relates to containerd occasionally offering incorrect auth to our private registry (which is an open issue with containerd as far as I can tell). While containerd retries and completes the image pull successfully, the Reaper sees the initial image pull failure and (in my case) fails the parallel branch. So yes, it would be good if the Reaper's TerminateAgentOnImagePullBackOff listener ignored more 'transient' image pull errors. As a workaround I've been using the code below (in an init.groovy.d script) to simply delete the TerminateAgentOnImagePullBackOff listener. It's a bit brutal, but it saves me hours in failed pipeline runs.


          import org.csanchez.jenkins.plugins.kubernetes.pod.retention.Reaper
          import hudson.ExtensionList

          // Remove the TerminateAgentOnImagePullBackOff Reaper listener so that
          // image pull back-off no longer terminates the agent and fails the build.
          def listeners = ExtensionList.lookup(Reaper.Listener.class)
          for (int i = 0; i < listeners.size(); i++) {
            if (listeners.get(i).getClass().toString().contains("TerminateAgentOnImagePullBackOff")) {
              println "Deleting this Reaper Listener: " + listeners.get(i).getClass().toString()
              listeners.remove(i)
              break
            }
          }
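
          Note: scripts placed in $JENKINS_HOME/init.groovy.d/ run at controller startup, so this removal is re-applied after every restart; the same snippet can also be run from the Script Console to apply it to an already running controller.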


          Vincent Latombe added a comment -

          It's an interesting use case.

          The current Reaper listeners are considered fatal, which is why they fail the build. We could either add a generic enable/disable flag on each listener and let the user control it per Kubernetes cloud, or define a new attribute for this particular listener, allowing it to fail the build only after it has been in this condition for a given time.
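
          A rough sketch of the second option (purely illustrative, not plugin code; the class ImagePullBackOffGrace, the field gracePeriodMillis and the method shouldTerminate are invented names for this example): remember when a pod first entered ImagePullBackOff and only treat it as fatal once a configurable grace period has elapsed.

          import java.util.concurrent.ConcurrentHashMap

          // Sketch only: track the first time each pod was seen in ImagePullBackOff and
          // report it as fatal only after a configurable grace period.
          class ImagePullBackOffGrace {
              long gracePeriodMillis = 5 * 60 * 1000    // could be exposed per listener or per cloud
              private final Map<String, Long> firstSeen = new ConcurrentHashMap<String, Long>()

              boolean shouldTerminate(String podName, boolean inBackOff) {
                  if (!inBackOff) {
                      firstSeen.remove(podName)         // the pull recovered, reset the timer
                      return false
                  }
                  Long since = firstSeen.computeIfAbsent(podName) { String k -> System.currentTimeMillis() }
                  return System.currentTimeMillis() - since > gracePeriodMillis
              }
          }

          Resetting the timer whenever the pull recovers keeps the behaviour fatal for genuinely broken image references while tolerating the transient containerd and registry hiccups described above.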


            Assignee: Unassigned
            Reporter: Ruslan S (ruslans)
            Votes: 0
            Watchers: 4

              Created:
              Updated: