JENKINS-55356

Some workflow jobs fail after restart on Java 11 server

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Component/s: workflow-job-plugin
    • Environment: Jenkins JDK 11 docker image and current plugins
      Multibranch Pipeline builds for git plugin, git client plugin, and platformlabeler plugin

    Description

      While running Java 11 based Jenkins in a docker container using a pre-release of the workflow support plugin which includes the fix for the null pointer exception, a

      jenkins-url/safeRestart
      

      will cause several of the Pipeline jobs that were running to fail when Jenkins tries to resume the jobs.

      Build log output from the failed builds has included messages like the following (more common on Windows agents, but also seen on Linux agents):

      07:40:50 [INFO] Running org.jenkinsci.plugins.gitclient.PushTest
      Resuming build at Fri Dec 28 07:48:06 MST 2018 after Jenkins restart
      Waiting to resume part of Git Client Plugin Folder » Git Client Branches - Jenkinsfile » beta-3.0 #6: Waiting for next available executor on ‘coleen-pc2-ssh’
      Ready to run at Fri Dec 28 07:48:49 MST 2018
      07:48:49 Timeout set to expire in 17 min
      [Pipeline] }
      [Pipeline] // withEnv
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // timeout
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] }
      07:48:50 Failed in branch windows-8-2.150.1
      [Pipeline] // parallel
      [Pipeline] }
      [Pipeline] // timestamps
      [Pipeline] End of Pipeline
      ERROR: missing workspace C:\J\S\workspace\der_git-client-pipeline_beta-3.0 on coleen-pc2-ssh
      

      and this (seems to fail on Windows and on Linux):

      08:48:33 [INFO] --------------------------------[ hpi ]---------------------------------
      Resuming build at Fri Dec 28 08:51:46 MST 2018 after Jenkins restart
      Waiting to resume part of Git Client Plugin Folder » Git Client Branches - Jenkinsfile » beta-3.0 #7: Waiting to resume part of Git Client Plugin Folder » Git Client Branches - Jenkinsfile » beta-3.0 #7: ???
      ???
      Waiting to resume part of Git Client Plugin Folder » Git Client Branches - Jenkinsfile » beta-3.0 #7: ???
      Waiting to resume part of Git Client Plugin Folder » Git Client Branches - Jenkinsfile » beta-3.0 #7: ???
      Ready to run at Fri Dec 28 08:52:07 MST 2018
      08:52:07 Timeout set to expire in 54 min
      08:52:07 Timeout set to expire in 54 min
      08:52:07 Timeout set to expire in 54 min
      08:52:07 Timeout set to expire in 54 min
      [Pipeline] }
      [Pipeline] // withEnv
      [Pipeline] }
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] // withEnv
      [Pipeline] }
      [Pipeline] }
      [Pipeline] // timeout
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] // timeout
      [Pipeline] }
      08:52:08 Failed in branch windows-8
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] }
      08:52:08 Failed in branch windows-8-2.150.1
      08:57:14 process apparently never started in /home/mwaite/testing-a.markwaite.net-agent/workspace/der_git-client-pipeline_beta-3.0@tmp/durable-3af9d572
      [Pipeline] }
      [Pipeline] // withEnv
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // timeout
      [Pipeline] }
      08:57:14 process apparently never started in /home/mwaite/testing-a.markwaite.net-agent/workspace/der_git-client-pipeline_beta-3.0@tmp/durable-1ba13955
      [Pipeline] // node
      [Pipeline] }
      08:57:14 Failed in branch linux-8
      [Pipeline] }
      [Pipeline] // withEnv
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // timeout
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] }
      08:57:14 Failed in branch linux-8-2.150.1
      [Pipeline] // parallel
      [Pipeline] }
      [Pipeline] // timestamps
      [Pipeline] End of Pipeline
      ERROR: missing workspace C:\J\S\workspace\der_git-client-pipeline_beta-3.0 on coleen-pc2-ssh
      

      and this (Windows and Linux):

      09:12:13 [INFO] --- maven-help-plugin:3.1.1:evaluate (default-cli) @ git-client ---
      Resuming build at Fri Dec 28 09:13:35 MST 2018 after Jenkins restart
      Waiting to resume part of Git Client Plugin Folder » Git Client Branches - Jenkinsfile » beta-3.0 #8: ???
      Waiting to resume part of Git Client Plugin Folder » Git Client Branches - Jenkinsfile » beta-3.0 #8: ???
      Waiting to resume part of Git Client Plugin Folder » Git Client Branches - Jenkinsfile » beta-3.0 #8: ???
      Waiting to resume part of Git Client Plugin Folder » Git Client Branches - Jenkinsfile » beta-3.0 #8: ???
      Ready to run at Fri Dec 28 09:13:51 MST 2018
      09:13:51 Timeout set to expire in 55 min
      09:13:51 Timeout set to expire in 55 min
      09:13:51 Timeout set to expire in 55 min
      09:13:51 Timeout set to expire in 55 min
      [Pipeline] }
      [Pipeline] // withEnv
      [Pipeline] }
      [Pipeline] }
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] // withEnv
      [Pipeline] // withEnv
      [Pipeline] }
      [Pipeline] }
      [Pipeline] }
      [Pipeline] // timeout
      [Pipeline] // stage
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] }
      [Pipeline] }
      [Pipeline] // timeout
      [Pipeline] // node
      [Pipeline] // timeout
      [Pipeline] }
      09:13:52 Failed in branch windows-8-2.150.1
      [Pipeline] }
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] // node
      [Pipeline] }
      09:13:52 Failed in branch linux-8
      [Pipeline] }
      09:13:52 Failed in branch windows-8
      09:18:58 process apparently never started in /home/mwaite/testing-a.markwaite.net-agent/workspace/r_git-client-pipeline_beta-3.0_2@tmp/durable-9d0a69fe
      [Pipeline] }
      [Pipeline] // withEnv
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // timeout
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] }
      09:18:58 Failed in branch linux-8-2.150.1
      [Pipeline] // parallel
      [Pipeline] }
      [Pipeline] // timestamps
      [Pipeline] End of Pipeline
      ERROR: missing workspace C:\J\S\workspace\der_git-client-pipeline_beta-3.0 on mark-pc4-ssh
      

      The "process apparently never started" message and the "missing workspace" message are visible in both the failed git client plugin builds and the failed git plugin builds.

      The problem does not seem to repeat on a Java 8 environment, just on a Java 11 environment.
      The problem seems to occur less frequently on a Java 11 environment running on a larger computer. The frequently failing computer has 8 GB RAM and an older Intel i5 processor, while the less frequently failing computer has 32 GB RAM and a newer Intel i5 processor. The 32 GB machine has also shown the failure multiple times. On that machine, the failure occurred during a restart while the agents and the server were very busy.

      The Docker image includes all the plugins that were used in the failure case. I've duplicated the failures on at least two different machines.

        Activity

          dnusbaum Devin Nusbaum added a comment - edited

          The error message comes from workflow-durable-task-step. It would be interesting to check whether the directory mentioned in the error actually exists on the agent, or if it is missing, or exists but has a different name (perhaps a randomized suffix or `-` instead of `_` or something, which could be related to recent branch-api changes).

          Another thing to point out is that FilePath#isDirectory will return false in some failure cases (see the Javadoc for File#isDirectory). Perhaps we should update this line to use NIO methods so it throws exceptions instead of returning false in some cases (another case of JENKINS-47324).

          Edit: I filed https://github.com/jenkinsci/jenkins/pull/3864 for that issue.

          Double edit: Also to clarify, if FilePath#isDirectory had thrown an exception, then this code would have been called, which would have caused Pipeline to attempt to connect again rather than aborting the build immediately. That issue wouldn't really explain why we are seeing this on Java 11 and not Java 8, but seems worth investigating.
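To illustrate the difference discussed in this comment, here is a minimal, self-contained sketch that contrasts `java.io.File#isDirectory` with an NIO-based check built on `Files.readAttributes`. The class `DirectoryCheck` and its method names are hypothetical and written only for illustration. This is not the code from workflow-durable-task-step, `FilePath`, or the pull request mentioned above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical illustration class, not part of Jenkins.
public class DirectoryCheck {

    // Legacy check: a false result can mean "missing", "not a directory",
    // or "the check itself could not be performed" (for example an I/O error).
    public static boolean legacyIsDirectory(File file) {
        return file.isDirectory();
    }

    // NIO check: a missing path or an I/O failure surfaces as an IOException
    // instead of being folded into a false return value.
    public static boolean strictIsDirectory(Path path) throws IOException {
        return Files.readAttributes(path, BasicFileAttributes.class).isDirectory();
    }

    public static void main(String[] args) throws IOException {
        Path workspace = Files.createTempDirectory("workspace");
        Path missing = workspace.resolve("does-not-exist");

        System.out.println("legacy, existing: " + legacyIsDirectory(workspace.toFile()));
        System.out.println("strict, existing: " + strictIsDirectory(workspace));
        System.out.println("legacy, missing:  " + legacyIsDirectory(missing.toFile()));
        try {
            strictIsDirectory(missing);
        } catch (NoSuchFileException e) {
            System.out.println("strict, missing:  threw NoSuchFileException");
        }
        Files.delete(workspace);
    }
}
```

With the legacy call, a caller cannot tell a genuinely missing workspace from a check that failed for some other reason. With the NIO call, the caller receives an exception and can decide whether to retry.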

          basil Basil Crow added a comment -

          I experienced this failure mode twice on January 16, two weeks after upgrading Jenkins from 2.138.1 LTS (with workflow-job 2.25, workflow-cps 2.54, and workflow-durable-task-step 2.21) to 2.150.1 LTS (with workflow-job 2.31, workflow-cps 2.61, and workflow-durable-task-step 2.27). I am not running Java 11. The job has been running daily and has only failed twice with this failure mode, so the error is transient.

          ERROR: missing workspace /var/tmp/jenkins_slaves/jenkins-ops/workspace/devops-gate/master/sync-ova-into-dcod on scale-dc2
          ERROR: missing workspace /var/tmp/jenkins_slaves/jenkins-ops/workspace/devops-gate/master/sync-ova-into-dcod@2 on dc3
          

          When this error occurred on the 16th, I logged into these machines and checked the given directories on the command line. In both cases the directories existed. So I suspect there may have been some transient I/O error at the time.

          oleg_nenashev Oleg Nenashev added a comment -

          dnusbaum what is the status here? We are about to proceed with Java 11 GA in Jenkins. I do not think it is a blocker, but it would be nice to get your feedback.

          dnusbaum Devin Nusbaum added a comment -

          oleg_nenashev Unchanged from my perspective. Given that basil mentioned that they have seen the issue on Java 8, it seems like this is not something specific to Java 11. I closed the PR I mentioned in this comment because I was not able to perform the testing necessary to feel confident about the change, and it wasn't clear to me what kinds of failures would have been exposed by switching to NIO. It would probably be safe to reopen it but change the behavior to return false if the directory does not exist, which would be much closer to the original behavior, but I'm not sure what the benefit would be. If anyone is able to come up with a self-contained and consistent reproduction case, then I would be more than happy to take a look, but without any other ideas I am just grasping at straws for now.
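The variant described in this comment (NIO-based, but still returning false when the directory does not exist) could look roughly like the sketch below. The class `LenientDirectoryCheck` and the method `isDirectoryOrFalseIfMissing` are hypothetical names chosen for illustration. The closed pull request may have been structured differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical illustration class, not part of Jenkins.
public class LenientDirectoryCheck {

    // Returns false when the path does not exist, as the legacy check does,
    // but lets every other IOException propagate to the caller.
    public static boolean isDirectoryOrFalseIfMissing(Path path) throws IOException {
        try {
            return Files.readAttributes(path, BasicFileAttributes.class).isDirectory();
        } catch (NoSuchFileException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path workspace = Files.createTempDirectory("workspace");
        Path file = Files.createFile(workspace.resolve("regular-file"));
        Path missing = workspace.resolve("does-not-exist");

        System.out.println("existing directory: " + isDirectoryOrFalseIfMissing(workspace));
        System.out.println("regular file:       " + isDirectoryOrFalseIfMissing(file));
        System.out.println("missing path:       " + isDirectoryOrFalseIfMissing(missing));

        Files.delete(file);
        Files.delete(workspace);
    }
}
```

Under this assumption, only the "does not exist" case keeps the old false result. Other I/O failures would reach the caller as exceptions, and it is unclear from this thread whether that would have changed the outcome of the failing builds.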

          oleg_nenashev Oleg Nenashev added a comment -

          Platform SIG meeting: As far as markewaite is concerned, it is resolved.


          People

            Assignee: dnusbaum Devin Nusbaum
            Reporter: markewaite Mark Waite