Jenkins / JENKINS-66209

Jenkins Controller crashing with exitCode: 1


    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component/s: core
    • Labels:
      None
    • Environment:
      Jenkins 2.277.4 (running on a Kubernetes cluster, managed using a Helm chart)
      Agents run as pods defined using Kubernetes pod templates (latest jenkins/jnlp-slave image)

      Description

      For the last 4-5 months we have been experiencing Jenkins controller outages (across multiple versions). The Kubernetes pod fails with exit code 1 and then self-heals:
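By the usual shell and container-runtime convention, an exit code above 128 means the process was killed by a signal (code - 128), while small codes such as 1 are returned by the process itself. A minimal sketch of that convention using only Python's standard signal module (describe_exit_code is a hypothetical helper name, not part of any Jenkins or Kubernetes tooling):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a container exit code using the common 128 + N convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        # 128 + N: the process was terminated by signal N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    # Codes 1..128: the process chose its own exit status.
    return f"process exited on its own with status {code}"


for code in (0, 1, 137, 143):
    print(code, "->", describe_exit_code(code))
```

Under this convention, the Exit Code 1 with Reason Error reported for the controller suggests the JVM terminated itself rather than being OOM-killed (which Kubernetes usually reports as OOMKilled with 137), although the convention alone does not say why it exited.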

      Output from kubectl describe pod:

      Containers:
        jenkins:
          Container ID:   docker://f494b5939813d0582d7901894a16740178d6c905ff7f87db0bdbbacb63a64367
          Image:          xxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/jenkins-controller:2.277.4-lts-alpine
          Image ID:       docker-pullable://xxxxxxxx5.dkr.ecr.ap-southeast-2.amazonaws.com/jenkins-controller@sha256:56687bb853764312fe1f28d3b8c161738022f4e80da1b9b81ba82ba929426d1d
          Ports:          8080/TCP, 50000/TCP
          Host Ports:     0/TCP, 0/TCP
          Args:
            --httpPort=8080
          State:          Running
            Started:      Fri, 23 Jul 2021 10:24:42 +1000
          Last State:     Terminated
            Reason:       Error
            Exit Code:    1
            Started:      Fri, 23 Jul 2021 09:38:37 +1000
            Finished:     Fri, 23 Jul 2021 10:24:40 +1000
          Ready:          True
          Restart Count:  2
          Limits:
            cpu:     3
            memory:  22Gi
          Requests:
            cpu:     2
            memory:  20Gi
          Liveness:   http-get http://:http/login delay=0s timeout=5s period=20s #success=1 #failure=100
          Readiness:  http-get http://:http/login delay=0s timeout=5s period=10s #success=1 #failure=100
          Startup:    http-get http://:http/login delay=0s timeout=5s period=100s #success=1 #failure=5
          Environment:
            POD_NAME:                  jenkins-blue-0 (v1:metadata.name)
            JAVA_OPTS:
            JENKINS_OPTS:
            JENKINS_SLAVE_AGENT_PORT:  50000
            JAVA_OPTS:                 XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxRAMPercentage=50.0 -Xloggc:/var/jenkins_home/log/gc%t.log -XX:NumberOfGCLogFiles=5 -XX:+UseGCLogFileRotation -XX:GCLogFileSize=20m -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:+AlwaysPreTouch -Duser.timezone=Australia/Sydney
            CASC_JENKINS_CONFIG:       /var/jenkins_home/casc_configs
          Mounts:
            /var/jenkins_config from jenkins-config (ro)
            /var/jenkins_home from jenkins-home (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from jenkins-controller-sa-token-67wfs (ro)
      Conditions:
        Type              Status
        Initialized       True
        Ready             True
        ContainersReady   True
        PodScheduled      True
      Volumes:
        plugins:
          Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
          Medium:
          SizeLimit:  <unset>
        jenkins-config:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      jenkins-blue
          Optional:  false
        jenkins-home:
          Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
          ClaimName:  jenkins-blue
          ReadOnly:   false
        sc-config-volume:
          Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
          Medium:
          SizeLimit:  <unset>
        jenkins-controller-sa-token-67wfs:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  jenkins-controller-sa-token-67wfs
          Optional:    false
      QoS Class:       Burstable
      Node-Selectors:  <none>
      Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                       node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason     Age                 From                                                      Message
        ----     ------     ----                ----                                                      -------
        Warning  Unhealthy  45m (x9 over 91m)   kubelet, ip-10-90-111-59.ap-southeast-2.compute.internal  Liveness probe failed: Get "http://10.90.110.107:8080/login": dial tcp 10.90.110.107:8080: connect: connection refused
        Warning  Unhealthy  45m (x19 over 91m)  kubelet, ip-10-90-111-59.ap-southeast-2.compute.internal  Readiness probe failed: Get "http://10.90.110.107:8080/login": dial tcp 10.90.110.107:8080: connect: connection refused
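The probe and memory settings in the describe output can be turned into concrete time and size budgets. A rough sketch, assuming each probe's tolerance is simply period times failure threshold (ignoring the 5 s timeout and the fact that liveness and readiness checks only begin once the startup probe succeeds), and assuming the JVM derives its maximum heap from the 22Gi container limit via -XX:MaxRAMPercentage=50.0. The Probe class and max_heap_bytes helper are illustrative names only:

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class Probe:
    """HTTP probe settings as shown by kubectl describe (seconds)."""

    name: str
    period: int
    failure_threshold: int

    def tolerance_seconds(self) -> int:
        # Consecutive failures needed, times the probe interval.
        # Ignores the per-check timeout and any initial delay.
        return self.period * self.failure_threshold


def max_heap_bytes(memory_limit_bytes: int, max_ram_percentage: float) -> int:
    # Assumes the JVM sizes the heap from the container memory limit;
    # the real value may also be rounded for alignment.
    return int(memory_limit_bytes * max_ram_percentage / 100)


probes = [
    Probe("liveness", period=20, failure_threshold=100),
    Probe("readiness", period=10, failure_threshold=100),
    Probe("startup", period=100, failure_threshold=5),
]
for p in probes:
    print(f"{p.name}: ~{p.tolerance_seconds()} s of failures before it counts as failed")

heap = max_heap_bytes(22 * GIB, 50.0)
print(f"max heap ~{heap / GIB:.1f} GiB of a 22 GiB limit")
```

With these numbers the liveness probe needs roughly 2000 s (about 33 minutes) of consecutive failures before the kubelet restarts the container, and the heap tops out near 11 GiB, leaving the other ~11 GiB of the limit for metaspace, thread stacks and other off-heap memory.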

       

      There isn't much in the Jenkins core logs apart from the following:

      webroot: EnvVars.masterEnvVars.get("JENKINS_HOME")
      2021-07-22 23:38:54.897+0000 [id=1] INFO winstone.Logger#logInternal: Beginning extraction from war file
      Running from: /usr/share/jenkins/jenkins.war
      2021-07-22 23:38:54.709+0000 [id=1] INFO org.eclipse.jetty.util.log.Log#initialized: Logging initialized @916ms to org.eclipse.jetty.util.log.JavaUtilLog
      exitCode: 0
      Scanning success.
      exitCode: 1
      exitCode: 1
      exitCode: 1
      exitCode: 1
      Terminated Kubernetes instance for agent jenkins/test-automation-virtual-device-virtual-device-e2e-390818--hmjsm
      Disconnected computer test-automation-virtual-device-virtual-device-e2e-390818--hmjsm
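The Jenkins log timestamps are in UTC (+0000), while kubectl describe prints times in +1000. A small sketch that puts both on one clock, using only the standard datetime module and timestamps copied from this report (the variable names are illustrative):

```python
from datetime import datetime, timedelta, timezone

AEST = timezone(timedelta(hours=10))  # the +1000 offset kubectl prints

# "Last State: Terminated" timestamps and the restart time from kubectl describe.
last_started = datetime(2021, 7, 23, 9, 38, 37, tzinfo=AEST)
last_finished = datetime(2021, 7, 23, 10, 24, 40, tzinfo=AEST)
restarted = datetime(2021, 7, 23, 10, 24, 42, tzinfo=AEST)

# Jenkins writes its own log timestamps in UTC ("+0000").
war_extraction = datetime.strptime(
    "2021-07-22 23:38:54.897+0000", "%Y-%m-%d %H:%M:%S.%f%z"
)

print("previous run lasted:", last_finished - last_started)
print("war extraction in +1000:", war_extraction.astimezone(AEST))
print("war extraction after container start:", war_extraction - last_started)
print("gap before restart:", restarted - last_finished)
```

If this arithmetic holds, the war-extraction line was written about 18 s after the container started at 09:38:37 +1000, so these log lines most likely belong to the run that ended with exit code 1 about 46 minutes later, with the replacement container starting 2 s after that.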

       

      For the past week we have been restarting the Jenkins controller nightly, but it hasn't helped.
       
       


          Activity

          Tim Jacomb (timja) added a comment:

          Do you have any metrics that show the controller CPU and memory usage over time?
          Any performance issues?
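One way to collect CPU and memory usage over time is to sample kubectl top, which requires metrics-server in the cluster. A minimal sketch, assuming the controller pod is jenkins-blue-0 (from the describe output) and that it runs in a namespace called jenkins (an assumption; the report only shows agent pods in that namespace). parse_top_line and sample are hypothetical helper names:

```python
import subprocess
import time
from datetime import datetime, timezone
from typing import Tuple

POD = "jenkins-blue-0"  # from the describe output
NAMESPACE = "jenkins"   # assumption: replace with the controller's namespace

_MEM_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_top_line(line: str) -> Tuple[float, int]:
    """Parse one `kubectl top pod --no-headers` line into (cores, bytes)."""
    _name, cpu, mem = line.split()[:3]
    # CPU is printed in millicores ("1500m") or whole cores ("2").
    cores = int(cpu[:-1]) / 1000 if cpu.endswith("m") else float(cpu)
    # Memory is usually printed with a binary suffix such as "15360Mi".
    unit = mem[-2:]
    if unit in _MEM_UNITS:
        mem_bytes = int(mem[:-2]) * _MEM_UNITS[unit]
    else:
        mem_bytes = int(mem)
    return cores, mem_bytes


def sample(interval_s: int = 60, count: int = 60) -> None:
    """Print controller CPU and memory once per interval (needs metrics-server)."""
    for _ in range(count):
        out = subprocess.run(
            ["kubectl", "top", "pod", POD, "-n", NAMESPACE, "--no-headers"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
        cores, mem_bytes = parse_top_line(out)
        ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
        print(f"{ts} cpu={cores:.2f} cores mem={mem_bytes / 1024 ** 3:.2f} GiB")
        time.sleep(interval_s)
```

Calling sample(interval_s=60, count=60) prints one line per minute for an hour. The GC logs that JAVA_OPTS already writes under /var/jenkins_home/log/ are another source of heap usage over time.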


            People

            Assignee:
            Unassigned
            Reporter:
            Nikhil (nikhilp)
            Votes:
            0
            Watchers:
            2

              Dates

              Created:
              Updated: