Jenkins / JENKINS-66822

Jenkins is trying to create an agent pod forever

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Environment: Kubernetes plugin 1.30.1
      Jenkins 2.263.4 LTS

      When Jenkins launches a new agent pod, it never times out: if the current pod is failing, it just keeps trying to launch a new one.

      For example, a misconfiguration on our side caused the agent pod to be created in a namespace with no pull credentials. A cron-triggered job got stuck and kept trying to create a pod for two days (until it was manually aborted), even though the misconfiguration had long since been fixed. The running build never picked up the fix, because the job would have had to be stopped and rebuilt.

      I went through all the relevant documentation looking for a suitable timeout setting, but nothing worked. I thought slaveConnectTimeout would do the job, but it didn't help, and neither did setting the org.csanchez.jenkins.plugins.kubernetes.PodTemplate.connectionTimeout system property.
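      For reference, here is roughly where that setting lives (a sketch only; the label, value, and workload are illustrative):

      ```groovy
      // slaveConnectTimeout caps how long Jenkins waits for an agent to connect
      // to an already-created pod; in our case it did not stop the plugin from
      // provisioning replacement pods indefinitely when pod creation kept failing.
      podTemplate(label: 'my-agent', slaveConnectTimeout: 300) {
          node('my-agent') {
              sh 'echo hello'   // placeholder workload
          }
      }
      ```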

      What I'm looking for is an option that fails the job if it can't launch an agent pod within X seconds.


          Odd Will added a comment -

          Have you considered trying the timeout option for the whole pipeline?

          https://www.jenkins.io/doc/book/pipeline/syntax/#options
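
          In declarative syntax, that suggestion would look roughly like this (a sketch; the pod YAML and stage contents are placeholders):

          ```groovy
          pipeline {
              // Illustrative: any Kubernetes agent definition works here
              agent { kubernetes { yaml podYaml } }
              options {
                  // Aborts the entire run after 30 minutes, agent wait included --
                  // which is exactly why a single value is hard to pick for jobs
                  // with very different runtimes
                  timeout(time: 30, unit: 'MINUTES')
              }
              stages {
                  stage('Build') {
                      steps { sh 'make' }   // placeholder step
                  }
              }
          }
          ```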


          Lior Tzur added a comment -

          oddwill, I have considered it, but it's not good for my use case.

          I use this Groovy pipeline to run many different jobs with a wide range of runtimes (from a few minutes to a whole day). I wouldn't want an initialization timeout longer than 10 minutes, but a pipeline-wide timeout that short would fail the long-running jobs.


          Mor Cohen added a comment -

          Hey, we also encounter this problem.

          Adding a timeout to the whole pipeline is not suitable for us, as we have some jobs that should run for over 16 hours. 


          Lior Tzur added a comment - - edited

          Hi mocohen, I'll share the hack we eventually settled on; maybe it can help you.
          We run the pipeline flow in parallel with a pod connection-check function.

          parallel(
              'Main': {
                  podTemplate(template) {
                      node(POD_LABEL) {
                          // pipeline flow
                      }
                  }
              },
              'Connection check': podConnectionTimeout(),
              failFast: true
          )

          // Runs on master, outside the pod agent, so it can watch the pod externally
          def podConnectionTimeout(timeout = 300, samplingInterval = 5, namespace = 'jobs') {
            return {
              node('master') {
                def kubectlContext = 'my_cluster'
                def isRunning = false

                // Poll until the pod reports Running, or the timeout budget is spent
                for (int i = 0; i < timeout / samplingInterval; i++) {
                  sleep samplingInterval
                  isRunning = sh(
                    script: "set +x\nkubectl get pods --selector job_name=${env.JOB_BASE_NAME},build_number=${env.BUILD_NUMBER} --namespace ${namespace} --context ${kubectlContext} | grep Running || true",
                    returnStdout: true
                  ).trim()
                  if (isRunning)
                    break
                  else
                    println "Job's container isn't ready yet"
                }
                if (!isRunning)
                  error 'Pod creation exceeded timeout'
              }
            }
          }

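          A possible simplification of the polling loop above, sketched with `kubectl wait` (untested; note that `kubectl wait` fails immediately when no pod matches the selector yet, so the initial sleep matters):

          ```groovy
          // Hypothetical variant: delegate the polling to kubectl itself
          def podConnectionTimeoutWait(timeout = 300, namespace = 'jobs') {
            return {
              node('master') {
                sleep 10   // give the plugin a moment to create the pod object
                def rc = sh(
                  script: "kubectl wait pod --selector job_name=${env.JOB_BASE_NAME},build_number=${env.BUILD_NUMBER} " +
                          "--for=condition=Ready --timeout=${timeout}s --namespace ${namespace}",
                  returnStatus: true
                )
                if (rc != 0)
                  error 'Pod creation exceeded timeout'
              }
            }
          }
          ```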

          Mor Cohen added a comment - - edited

          liortzur thanks for sharing your solution, it's really appreciated.

           

          We run entire pipelines with a k8s pod as the agent (i.e. top-level agents). I think your solution is a great fit for pod agents that run in a specific stage of the pipeline (i.e. stage agents). Please correct me if I'm wrong.


          Lior Tzur added a comment -

          mocohen, we also run the entire pipeline on a k8s pod agent as the top-level agent; it wraps the whole code. The only thing that runs on master (outside the pipeline's pod) is the connection check (it has to be external).


          Mor Cohen added a comment -

          Hey liortzur, I have managed to use your solution, thank you.

           

          That said, I do believe the plugin should handle this problem itself, so I created this PR: https://github.com/jenkinsci/kubernetes-plugin/pull/1118


          Lior Tzur added a comment - - edited

          You're totally right, mocohen, and you rock for creating this PR!
          Once it's merged, we'll get rid of this hack.


            Assignee: Unassigned
            Reporter: Lior Tzur (liortzur)
            Votes: 0
            Watchers: 3