Jenkins / JENKINS-66822

Jenkins is trying to create an agent pod forever

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Environment: Kubernetes plugin 1.30.1
      Jenkins 2.263.4 LTS

      When Jenkins launches a new agent pod, it never times out: if the current pod is failing, it just keeps trying to launch a new one.

      For example, a misconfiguration on our side caused the agent pod to be created in a namespace with no pull credentials. A cron-triggered job got stuck and kept trying to create a pod for two days (until it was manually aborted), even though the misconfiguration had long since been fixed. The running build never picked up the fix, because the job would have had to be stopped and rebuilt.

      I went through all the relevant documentation looking for a suitable timeout setting, but nothing worked. I thought slaveConnectTimeout would do the job, but it didn't help, and neither did setting the org.csanchez.jenkins.plugins.kubernetes.PodTemplate.connectionTimeout system property.
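      For reference, here is roughly where that setting lives (a sketch only; the label, value, and workload are illustrative):

      ```groovy
      // slaveConnectTimeout caps how long Jenkins waits for an agent to connect
      // to an already-created pod; in our case it did not stop the plugin from
      // provisioning replacement pods indefinitely when pod creation kept failing.
      podTemplate(label: 'my-agent', slaveConnectTimeout: 300) {
          node('my-agent') {
              sh 'echo hello'   // placeholder workload
          }
      }
      ```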

      What I'm looking for is an option that fails the job if it can't launch an agent pod within X seconds.


          Odd Will added a comment -

          Have you considered trying the timeout option for the whole pipeline?

          https://www.jenkins.io/doc/book/pipeline/syntax/#options
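
          In declarative syntax, that suggestion would look roughly like this (a sketch; the pod YAML and stage contents are placeholders):

          ```groovy
          pipeline {
              // Illustrative: any Kubernetes agent definition works here
              agent { kubernetes { yaml podYaml } }
              options {
                  // Aborts the entire run after 30 minutes, agent wait included --
                  // which is exactly why a single value is hard to pick for jobs
                  // with very different runtimes
                  timeout(time: 30, unit: 'MINUTES')
              }
              stages {
                  stage('Build') {
                      steps { sh 'make' }   // placeholder step
                  }
              }
          }
          ```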


          Lior Tzur added a comment -

          oddwill, I have considered it, but it's not good for my use case.

          I use this Groovy pipeline to run many different jobs with a wide range of runtimes (from a few minutes to a whole day). I wouldn't want an initialization timeout longer than 10 minutes, but a pipeline-wide timeout that short would fail the long-running jobs.


          Mor Cohen added a comment -

          Hey, we also encounter this problem.

          Adding a timeout to the whole pipeline is not suitable for us, as we have some jobs that should run for over 16 hours. 


          Lior Tzur added a comment - - edited

          Hi mocohen, I'll share the hack we eventually settled on; maybe it can help you.
          We run the pipeline flow in parallel with a pod connection-check function.

          parallel(
              'Main': {
                  podTemplate(template) {
                      node(POD_LABEL) {
                          // pipeline flow
                      }
                  }
              },
              'Connection check': podConnectionTimeout(),
              failFast: true
          )

          // Runs on master, outside the pod agent, so it can watch the pod externally
          def podConnectionTimeout(timeout = 300, samplingInterval = 5, namespace = 'jobs') {
            return {
              node('master') {
                def kubectlContext = 'my_cluster'
                def isRunning = false

                // Poll until the pod reports Running, or the timeout budget is spent
                for (int i = 0; i < timeout / samplingInterval; i++) {
                  sleep samplingInterval
                  isRunning = sh(
                    script: "set +x\nkubectl get pods --selector job_name=${env.JOB_BASE_NAME},build_number=${env.BUILD_NUMBER} --namespace ${namespace} --context ${kubectlContext} | grep Running || true",
                    returnStdout: true
                  ).trim()
                  if (isRunning)
                    break
                  else
                    println "Job's container isn't ready yet"
                }
                if (!isRunning)
                  error 'Pod creation exceeded timeout'
              }
            }
          }

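          A possible simplification of the polling loop above, sketched with `kubectl wait` (untested; note that `kubectl wait` fails immediately when no pod matches the selector yet, so the initial sleep matters):

          ```groovy
          // Hypothetical variant: delegate the polling to kubectl itself
          def podConnectionTimeoutWait(timeout = 300, namespace = 'jobs') {
            return {
              node('master') {
                sleep 10   // give the plugin a moment to create the pod object
                def rc = sh(
                  script: "kubectl wait pod --selector job_name=${env.JOB_BASE_NAME},build_number=${env.BUILD_NUMBER} " +
                          "--for=condition=Ready --timeout=${timeout}s --namespace ${namespace}",
                  returnStatus: true
                )
                if (rc != 0)
                  error 'Pod creation exceeded timeout'
              }
            }
          }
          ```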

          Mor Cohen added a comment - - edited

          liortzur thanks for sharing your solution, it's really appreciated.

           

          We run entire pipelines with a k8s pod as the agent (i.e. top-level agents). I think your solution is a great fit for pod agents that run in a specific stage of the pipeline (i.e. stage agents). Please correct me if I'm wrong.


          Lior Tzur added a comment -

          mocohen, we also run the entire pipeline on a k8s pod agent as the top-level agent; it wraps the whole code. The only thing that runs on master (outside the pipeline's pod) is the connection check (it has to be external).


          Mor Cohen added a comment -

          Hey liortzur, I have managed to use your solution, thank you.

           

          That said, I do believe the plugin should handle this problem itself, so I created this PR: https://github.com/jenkinsci/kubernetes-plugin/pull/1118


          Lior Tzur added a comment - - edited

          You're totally right, mocohen, and you rock for creating this PR!
          Once it's merged, we'll get rid of this hack.


            Assignee: Unassigned
            Reporter: Lior Tzur (liortzur)
            Votes: 0
            Watchers: 3