JENKINS-73325

Jenkins losing connection to GCE VM / GCE VM shutting down


      (This is a copy of https://github.com/jenkinsci/google-compute-engine-plugin/issues/467, since there seems to be no interaction there besides some users discussing the issue with each other.)

      I'm not sure where to look for the cause of this error. I'm not blaming this project specifically, but the problem is very hard to pin down.

      We were previously using Jenkins 2.440.2 with GCE Plugin 4.563.vfa_446a_7e00a_d without any of these problems. After upgrading to 2.452.1 along with all plugins (including the GCE Plugin, to 4.573.v7dcd6a_37a_ee2), the problems started, with strange errors like the ones pasted into the actual results.

      It did not happen every time, but quite often (roughly 20% of runs failed, 80% succeeded). Rolling back the plugin to 4.563.vfa_446a_7e00a_d (and restarting Jenkins to apply it) did not help.

      We even upgraded to Jenkins 2.452.2 including all plugins (GCE is now at the latest release 4.575.v6969b_7c435eb_).

      I looked at the GCP logs more closely. It looks like, for whatever reason, GCP is receiving a DELETE for the VM while the job is still running.

      Here are the logs for a successful job (I added NOTICE/ERROR according to the severity icon shown):

      NOTICE 2024-06-18 08:15:15.251 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
      NOTICE 2024-06-18 08:15:21.166 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
      NOTICE 2024-06-18 08:34:54.780 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
      NOTICE 2024-06-18 08:35:40.626 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
      

      And here are the logs for a failing job:

      NOTICE 2024-06-18 07:46:45.363 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
      NOTICE 2024-06-18 07:47:00.887 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
      NOTICE 2024-06-18 08:04:13.367 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
      NOTICE 2024-06-18 08:04:59.185 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
      ERROR 2024-06-18 08:05:02.081 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
      
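      To line the GCP entries up with the Jenkins console output (which is in UTC), the pasted lines can be converted and grouped per VM. This is a minimal sketch, assuming the exact pasted format (severity, timestamp, `CEST`, `Compute Engine <operation> <zone>:<vm>`) and a fixed UTC+2 offset for CEST, which holds for these June dates; `parse_gcp_lines` and `FAILING_RUN` are hypothetical names used only for illustration:

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumption: CEST is a fixed UTC+2 offset (true for the June 2024 entries above).
CEST = timezone(timedelta(hours=2))

# Matches the pasted format, e.g.
# NOTICE 2024-06-18 08:04:13.367 CEST Compute Engine delete europe-west3-c:<vm> ...
LINE_RE = re.compile(
    r"^(?P<severity>[A-Z]+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) CEST\s+"
    r"Compute Engine (?P<op>\w+)\s+"
    r"(?P<zone>[\w-]+):(?P<vm>[\w-]+)"
)


def parse_gcp_lines(text):
    """Group pasted GCP log lines by VM name, converting timestamps to UTC."""
    events = defaultdict(list)
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip blank lines and anything not in the pasted format
        local = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=CEST)
        events[m["vm"]].append((local.astimezone(timezone.utc), m["severity"], m["op"]))
    return dict(events)


# The failing run, copied from the GCP log excerpt above.
FAILING_RUN = """
NOTICE 2024-06-18 07:46:45.363 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
NOTICE 2024-06-18 07:47:00.887 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
NOTICE 2024-06-18 08:04:13.367 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
NOTICE 2024-06-18 08:04:59.185 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
ERROR 2024-06-18 08:05:02.081 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
"""

for vm, evs in parse_gcp_lines(FAILING_RUN).items():
    for ts, severity, op in evs:
        # Print UTC time with millisecond precision next to severity and operation.
        print(vm, ts.strftime("%H:%M:%S.%f")[:-3], "UTC", severity, op)
```

      With the times in UTC, the first delete of the failing VM lands at 06:04:13.367, which can be read directly against the Jenkins console timestamps.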

      In Jenkins itself it looks like this (note: times are UTC here):

      ...
      [2024-06-18T06:04:43.784Z] PASS src/view/store/tracking/suspendData/tracking.suspend.data.setSuspendDataIfSectionIsVisited.epic.test.ts
      [2024-06-18T06:04:44.494Z] Cannot contact gcp-rre-unittest-debian12-jkt5ag: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@1362ebbd:gcp-rre-unittest-debian12-jkt5ag": Remote call on gcp-rre-unittest-debian12-jkt5ag failed. The channel is closing down or has closed down
      [2024-06-18T06:05:02.226Z] Could not connect to gcp-rre-unittest-debian12-jkt5ag to send interrupt signal to process
      

      So from my perspective (without much insight), it looks like the delete at 2024-06-18 08:04:13.367 CEST is causing the trouble: a delete is sent and GCP starts shutting the VM down. Jenkins then loses the connection and wants to clean up (the VMs are configured as "one shot" instances), so it sends another delete at 2024-06-18 08:04:59.185, which then fails with the error at 2024-06-18 08:05:02.081 because the VM is already gone.
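      The sequence described above can be checked with simple time arithmetic. This sketch takes the timestamps quoted in this report, with the GCP ones shifted from CEST to UTC (minus two hours), and prints how many seconds after the first delete entry each event happened. The event labels are descriptive only and do not assert what caused each entry:

```python
from datetime import datetime

# Timestamps quoted in this report, all in UTC
# (GCP entries converted from CEST = UTC+2; Jenkins entries are already UTC).
EVENTS = [
    ("GCP: delete (first entry)", "06:04:13.367"),
    ("Jenkins: ChannelClosedException", "06:04:44.494"),
    ("GCP: delete (second entry)", "06:04:59.185"),
    ("GCP: delete ERROR", "06:05:02.081"),
    ("Jenkins: could not send interrupt signal", "06:05:02.226"),
]


def offsets(events):
    """Return (label, seconds since the first event) for each event."""
    fmt = "%H:%M:%S.%f"
    t0 = datetime.strptime(events[0][1], fmt)
    return [
        (label, round((datetime.strptime(ts, fmt) - t0).total_seconds(), 3))
        for label, ts in events
    ]


for label, secs in offsets(EVENTS):
    print(f"+{secs:7.3f}s  {label}")
```

      By these numbers the agent channel closed about 31 s after the first delete entry, and the delete ERROR came about 49 s after it, just before Jenkins failed to send the interrupt signal, which is consistent with the delete preceding the disconnect rather than following it.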

      There is nothing unusual in the Jenkins output around 06:04:13 UTC (i.e. 2024-06-18 08:04:13.367 CEST in GCP): just some PASS lines, and not a single entry for the exact second 06:04:13.

      I don't know which component (or plugin) might be causing this. Can the GCE plugin even cause this?
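      One way to narrow down who issued the DELETE is to look at the full audit-log entries rather than the one-line summaries. In Cloud Logging, an audit entry's `protoPayload.authenticationInfo.principalEmail` names the calling account, and long-running operations are usually logged as a pair of entries sharing one `operation.id` (one flagged `first`, one flagged `last`). If that holds here, two NOTICE delete lines per VM could be the start and completion of a single call rather than two separate calls. The sketch below assumes entries exported as a JSON list (for example with `gcloud logging read` and `--format=json`); field availability may vary, and `summarize_deletes` / `distinct_delete_calls` are hypothetical helper names:

```python
def summarize_deletes(entries):
    """Extract Compute Engine instance-delete audit entries.

    Assumes each entry has the Cloud Logging LogEntry shape; any field that is
    missing comes back as None (or False for the first/last flags).
    """
    rows = []
    for e in entries:
        payload = e.get("protoPayload", {})
        # Matches e.g. "v1.compute.instances.delete" and "beta.compute.instances.delete".
        if not payload.get("methodName", "").endswith("compute.instances.delete"):
            continue
        op = e.get("operation", {})
        rows.append({
            "timestamp": e.get("timestamp"),
            "severity": e.get("severity"),
            "caller": payload.get("authenticationInfo", {}).get("principalEmail"),
            "operation_id": op.get("id"),
            "first": op.get("first", False),
            "last": op.get("last", False),
        })
    return rows


def distinct_delete_calls(rows):
    """Count distinct delete operations; start and completion entries share an id."""
    return len({r["operation_id"] for r in rows})
```

      Load the export with `json.load` and pass the resulting list to `summarize_deletes`. If `distinct_delete_calls` reports more than one operation, or the `caller` differs from the service account the Jenkins controller uses, that would point at where the unexpected DELETE comes from.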

            evanbrown Evan Brown
            jekoe J