JENKINS-73325

Jenkins losing connection to GCE VM / GCE VM shutting down

      (This is a copy of https://github.com/jenkinsci/google-compute-engine-plugin/issues/467 since there seems to be no interaction besides some users discussing issues with each other.)

      I'm not sure where to look for the cause of this error. I'm not blaming this project in particular, but the problem is very hard to pin down.

      We were using Jenkins 2.440.2 with GCE Plugin 4.563.vfa_446a_7e00a_d before without any of these problems. After upgrading to 2.452.1 (including all of the plugins, with the GCE Plugin going to 4.573.v7dcd6a_37a_ee2), the problems began, with strange errors like the ones pasted under the actual results.

      It did not happen every time, but quite often (maybe 20% yes, 80% no). Rolling back the plugin to 4.563.vfa_446a_7e00a_d (including restarting Jenkins to apply) did not help.

      We even upgraded to Jenkins 2.452.2 including all plugins (GCE is now at the latest release 4.575.v6969b_7c435eb_).

      I checked the GCP logs some more. It looks like, for whatever reason, GCP is receiving a DELETE for the VM while the job is still running.

      Here are the logs for a successful job (I added the NOTICE/ERROR depending on the icon):

      NOTICE 2024-06-18 08:15:15.251 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
      NOTICE 2024-06-18 08:15:21.166 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
      NOTICE 2024-06-18 08:34:54.780 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
      NOTICE 2024-06-18 08:35:40.626 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
      

      And here are the logs for a failing job:

      NOTICE 2024-06-18 07:46:45.363 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
      NOTICE 2024-06-18 07:47:00.887 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
      NOTICE 2024-06-18 08:04:13.367 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
      NOTICE 2024-06-18 08:04:59.185 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
      ERROR 2024-06-18 08:05:02.081 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
      

      In Jenkins itself it looks like this (note: times are UTC here):

      ...
      [2024-06-18T06:04:43.784Z] PASS src/view/store/tracking/suspendData/tracking.suspend.data.setSuspendDataIfSectionIsVisited.epic.test.ts
      [2024-06-18T06:04:44.494Z] Cannot contact gcp-rre-unittest-debian12-jkt5ag: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@1362ebbd:gcp-rre-unittest-debian12-jkt5ag": Remote call on gcp-rre-unittest-debian12-jkt5ag failed. The channel is closing down or has closed down
      [2024-06-18T06:05:02.226Z] Could not connect to gcp-rre-unittest-debian12-jkt5ag to send interrupt signal to process
      

      So from my perspective (without much insight), it looks like the delete at 2024-06-18 08:04:13.367 CEST is causing the trouble. A delete is sent and GCP starts to shut the VM down. Jenkins then loses the connection and wants to clean up (the VMs are configured as "one shot" instances), so it sends another delete at 2024-06-18 08:04:59.185, which then causes the error at 2024-06-18 08:05:02.081 (since the VM is already gone).

      There is nothing unusual in the Jenkins log at 06:04:13 UTC (2024-06-18 08:04:13.367 CEST in GCP): just some PASSes, and not a single entry for that exact second.
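
Lining up the two logs by hand is error-prone because GCP shows local time and Jenkins shows UTC. The following is a minimal sketch of the conversion, assuming the GCP console timestamps are in the `Europe/Berlin` zone (CEST in June); the helper name `toJenkinsUtc` is made up for illustration.

```groovy
import java.time.LocalDateTime
import java.time.ZoneId
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter

// Convert a GCP console timestamp (local time in the given zone) into the
// UTC form used by the Jenkins console log.
// Assumption: the console shows times in Europe/Berlin unless told otherwise.
def toJenkinsUtc(String gcpLocalTime, String zone = 'Europe/Berlin') {
    def parser = DateTimeFormatter.ofPattern('yyyy-MM-dd HH:mm:ss.SSS')
    def utc = LocalDateTime.parse(gcpLocalTime, parser)
            .atZone(ZoneId.of(zone))
            .withZoneSameInstant(ZoneOffset.UTC)
    return utc.format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
}

// First delete seen in GCP for the failing job
println toJenkinsUtc('2024-06-18 08:04:13.367')
```

With the values from the failing job, the first delete maps to 06:04:13.367 UTC, about 31 seconds before the "Cannot contact" message at 06:04:44.494 UTC in the Jenkins log.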

      I don't know who (which plugin) might cause this. Can this GCE plugin even cause this?


          J added a comment -

          I got some more information on this. We made the log level more verbose and added a dedicated logger for this.

          Looks like the plugin is sometimes "forgetting" about the VM:

          Plugin log:

          Aug 26, 2024 7:00:46 AM INFO com.google.jenkins.plugins.computeengine.InstanceConfiguration provision
          Sent insert request for instance configuration [Debian12 agent for RRE unittests]
          Aug 26, 2024 7:00:46 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineComputerLauncher launch
          Launch will wait 300000 for operation operation-1724648445057-6208f01ee1a77-f70c6933-7f833c72 to complete...
          Aug 26, 2024 7:00:46 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$0
          Waiting 300000ms for node gcp-rre-unittest-debian12-xqphpu to connect
          Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Launching instance: gcp-rre-unittest-debian12-xqphpu
          Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          bootstrap
          Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Getting keypair...
          Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Using autogenerated ssh keypair
          Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Authenticating as jenkins
          Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          No public address found. Fall back to internal address.
          Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Connecting to 192.168.75.37 on port 22, with timeout 10000.
          Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Failed to connect via ssh: There was a problem while connecting to 192.168.75.37:22
          Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Waiting for SSH to come up. Sleeping 5.
          Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          No public address found. Fall back to internal address.
          Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Connecting to 192.168.75.37 on port 22, with timeout 10000.
          Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Connected via SSH.
          Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Verifying: java -fullversion
          Aug 26, 2024 7:01:18 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Copying agent.jar to: /tmp
          Aug 26, 2024 7:01:18 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
          Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar
          Aug 26, 2024 7:01:26 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$0
          40479ms elapsed waiting for node gcp-rre-unittest-debian12-xqphpu to connect
          Aug 26, 2024 7:04:12 AM INFO com.google.jenkins.plugins.computeengine.CleanLostNodesWork terminateInstance
          Remote instance gcp-rre-unittest-debian12-xqphpu not found locally, removing it
          

          At the same time the VM was doing work:

          ...
          07:02:21  Agent: gcp-rre-unittest-debian12-xqphpu
          ...
          07:03:22  + yarn run test:ci
          07:03:22  yarn run v1.22.22
          07:03:22  $ yarn generate && craco test --coverage
          07:03:22  $ yarn dependency test && yarn create-plugin-list && yarn create-view-model && yarn create-template-list && yarn create-themes && yarn bundle-messages
          07:03:22  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/dependency.ts test
          07:03:25  Successfully checked view/edit dependencies array in package.json.
          07:03:25  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createPluginList.ts
          07:03:27  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createViewModel.ts
          07:03:30  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createTemplateList.ts
          07:03:32  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createThemes.ts
          07:04:56  Cannot contact gcp-rre-unittest-debian12-xqphpu: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@70fb0402:gcp-rre-unittest-debian12-xqphpu": Remote call on gcp-rre-unittest-debian12-xqphpu failed. The channel is closing down or has closed down
          07:05:02  Agent gcp-rre-unittest-debian12-xqphpu was deleted; cancelling node body
          07:05:02  Could not connect to gcp-rre-unittest-debian12-xqphpu to send interrupt signal to process
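
The `CleanLostNodesWork` message in the plugin log suggests a periodic sweep that compares the instances GCE reports with the nodes Jenkins knows about. The following is a toy model of that kind of sweep, not the plugin's actual code: `findLostInstances` and the sample data are hypothetical, and the real matching criteria have not been checked against the plugin source. It only illustrates why an instance that is missing from the local list would be deleted even while it is running a build.

```groovy
// Hypothetical model of a "lost node" sweep. NOT the plugin's implementation.
// Anything that exists remotely but has no matching local node counts as lost.
def findLostInstances(Collection<String> remoteInstances, Collection<String> localNodes) {
    return (remoteInstances as Set) - (localNodes as Set)
}

// The VM exists in GCE and is busy running the job ...
def remote = ['gcp-rre-unittest-debian12-xqphpu']
// ... but the sweep's view of the local nodes does not contain it.
def local = []

findLostInstances(remote, local).each { name ->
    println "Remote instance ${name} not found locally, removing it"
}
```

Under this model the question becomes why the local list used by the sweep would not contain a node that is connected and running a job.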
          


          Allan BURDAJEWICZ added a comment -

          jekoe Is it possible that you have several clouds configured with the same <instanceId>...</instanceId>? You can run a Groovy script like the following from Manage Jenkins > Script Console to double-check:

          import com.google.jenkins.plugins.computeengine.ComputeEngineCloud
          import jenkins.model.Jenkins

          // List every GCE cloud with its instanceId and the jenkins_cloud_id label
          // of each instance configuration, so that duplicate ids are easy to spot.
          Jenkins.get().clouds.findAll { cloud -> cloud instanceof ComputeEngineCloud }.each { cloud ->
              println "- cloud: " + cloud.cloudName
              println "  instanceId: " + cloud.instanceId
              println "  instanceConfigurations:"
              cloud.configurations.each { configuration ->
                  println "  - namePrefix: " + configuration.namePrefix
                  println "    jenkins_cloud_id: " + configuration.googleLabels["jenkins_cloud_id"]
              }
          }
          return

          This is something I recently discovered in an environment showing the same symptoms.
          It can easily happen if, for example, you use Configuration as Code to define your GCE clouds and copy/paste an existing configuration to add additional ones.
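
If there are many clouds, scanning the printed list for repeated ids can be tedious. As a rough sketch, the check can be reduced to a grouping step; `findSharedInstanceIds` and the sample values are hypothetical, and the input is assumed to be a list of maps built from the same `cloudName` and `instanceId` properties the script above reads.

```groovy
// Group cloud descriptors by instanceId and keep only ids used by more than one cloud.
def findSharedInstanceIds(List<Map> clouds) {
    def grouped = clouds.groupBy { it.instanceId }
    def shared = grouped.findAll { id, group -> group.size() > 1 }
    return shared.collectEntries { id, group -> [(id): group*.name] }
}

// In the Script Console the input could be built like this (assumption, untested here):
// def clouds = Jenkins.get().clouds
//     .findAll { it instanceof ComputeEngineCloud }
//     .collect { [name: it.cloudName, instanceId: it.instanceId] }

// Made-up sample values for illustration only
def clouds = [
    [name: 'gce-a', instanceId: 'id-1'],
    [name: 'gce-b', instanceId: 'id-1'],
    [name: 'gce-c', instanceId: 'id-2']
]

findSharedInstanceIds(clouds).each { id, names ->
    println "instanceId ${id} is shared by: ${names.join(', ')}"
}
```

An empty result would mean every cloud has its own instanceId, which would rule out this particular explanation.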

            evanbrown Evan Brown
            jekoe J