JENKINS-67389

Pipeline program.dat file corrupted after non-graceful shutdown

    • workflow-support 804.vba10a18a1476

      While testing Jenkins backup and restoration, we have seen cases where program.dat is corrupted on disk. See jenkinsci/workflow-cps-plugin#480 and jenkinsci/workflow-support-plugin#123. The two stack traces we have seen are as follows:

      Partially-written program.dat (seen on ext4):

      java.io.IOException: Unsupported protocol version 101
      	at org.jboss.marshalling.river.RiverUnmarshaller.start(RiverUnmarshaller.java:1349)
      	at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.readPickles(RiverReader.java:175)
      	at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.restorePickles(RiverReader.java:142)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.loadProgramAsync(CpsFlowExecution.java:784)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:750)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:691)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:550)
      	at …
      Caused: java.io.IOException: Failed to load build state
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:865)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:863)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:917)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:38)
      	at …
      

      Completely empty program.dat (seen on XFS):

      java.io.EOFException
      	at java.io.DataInputStream.readFully(DataInputStream.java:197)
      	at java.io.DataInputStream.readLong(DataInputStream.java:416)
      	at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.parseHeader(RiverReader.java:113)
      	at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.restorePickles(RiverReader.java:139)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.loadProgramAsync(CpsFlowExecution.java:784)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:750)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:699)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:558)
      	...
      Caused: java.io.IOException: Failed to load build state
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:865)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:863)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:917)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:38)
      	...
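For reference, the second trace is what DataInputStream.readLong produces whenever fewer than eight bytes are available, so a zero-length program.dat fails before any pickle is read. A minimal Java sketch of that behavior follows. The names EmptyHeaderDemo and headerTooShort are hypothetical, and this is not the RiverReader code; it only assumes, as the trace shows, that a long is the first thing read from the file.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmptyHeaderDemo {
    /**
     * Hypothetical helper: returns true if the file ends before a full
     * 8-byte long can be read, which is the failure in the trace above.
     */
    static boolean headerTooShort(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            in.readLong(); // throws EOFException if fewer than 8 bytes remain
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // createTempFile yields a zero-length file, like the empty program.dat seen on XFS.
        Path empty = Files.createTempFile("program", ".dat");
        try {
            System.out.println("empty file too short: " + headerTooShort(empty));
        } finally {
            Files.delete(empty);
        }
    }
}
```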
      

      CpsThreadGroup writes program.dat to a temp file and then attempts to atomically rename it to program.dat. Notably, we never call FileChannel.force (fsync) to ensure that the data has been committed to disk before the rename. From some reading, it seems that without fsync the filesystem may be free to reorder the writes and the rename, or to buffer the writes and still perform the rename, so a non-graceful shutdown can leave program.dat partially written or empty.

      I filed jenkinsci/workflow-support-plugin#129 to write program.dat through a single FileChannel, so that we can invoke FileChannel.force before closing the channel, in an attempt to prevent this corruption.
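The general pattern (write to a temp file, force, then rename) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the pull request: DurableWrite and writeDurably are hypothetical names, and the payload is a plain byte array rather than the serialized program state.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class DurableWrite {
    /**
     * Hypothetical helper: writes data to a temp file in the target's
     * directory, forces it to disk, then renames it over the target.
     */
    static void writeDurably(Path target, byte[] data) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Same directory as the target, so the rename stays on one filesystem.
        Path tmp = Files.createTempFile(dir, "atomic", ".tmp");
        try {
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                ByteBuffer buf = ByteBuffer.wrap(data);
                while (buf.hasRemaining()) {
                    ch.write(buf); // a single write call may be partial
                }
                // Ask the OS to commit content and metadata before the rename.
                ch.force(true);
            }
            // Whether an existing target is replaced is implementation specific;
            // on POSIX filesystems this is typically a rename(2) that replaces it.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // no-op once the move has succeeded
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("durable-demo");
        Path target = dir.resolve("program.dat");
        writeDurably(target, new byte[] {1, 2, 3});
        System.out.println("bytes written: " + Files.size(target));
    }
}
```

The sketch does not sync the parent directory after the rename, so it only addresses the case where the rename reaches disk before the file contents do.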

       

       

            dnusbaum Devin Nusbaum