Bug
Resolution: Fixed
Major
None
workflow-support 804.vba10a18a1476
While testing Jenkins backup and restoration, we have seen cases where program.dat is corrupted on disk. See jenkinsci/workflow-cps-plugin#480 and jenkinsci/workflow-support-plugin#123. The two stack traces we have seen are as follows:
Partially-written program.dat (seen on ext4):
java.io.IOException: Unsupported protocol version 101
    at org.jboss.marshalling.river.RiverUnmarshaller.start(RiverUnmarshaller.java:1349)
    at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.readPickles(RiverReader.java:175)
    at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.restorePickles(RiverReader.java:142)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.loadProgramAsync(CpsFlowExecution.java:784)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:750)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:691)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:550)
    at …
Caused: java.io.IOException: Failed to load build state
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:865)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:863)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:917)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:38)
    at …
Completely empty program.dat (seen on XFS):
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readLong(DataInputStream.java:416)
    at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.parseHeader(RiverReader.java:113)
    at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.restorePickles(RiverReader.java:139)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.loadProgramAsync(CpsFlowExecution.java:784)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:750)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:699)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:558)
    ...
Caused: java.io.IOException: Failed to load build state
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:865)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:863)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:917)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:38)
    ...
CpsThreadGroup writes the program to a temporary file and then attempts to atomically rename that file to program.dat. Notably, we never call FileChannel.force (fsync) to ensure that all data has been committed to disk before the rename. From some reading, it seems possible that without fsync a filesystem is free to reorder the writes and the rename, or to keep the writes buffered and still perform the rename. Either behavior would explain a program.dat that is truncated or empty after a backup or restore.
I filed jenkinsci/workflow-support-plugin#129 to write program.dat entirely through a single FileChannel, so that we can invoke FileChannel.force before closing the channel, as an attempt to prevent this corruption.
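For illustration only, here is a minimal sketch of the write-force-rename pattern described above, using plain java.nio. It is not the code from jenkinsci/workflow-support-plugin#129; the class name DurableWrite, the method writeDurably, and the temp file prefix are hypothetical and chosen just for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of any Jenkins plugin.
class DurableWrite {
    /**
     * Writes data to a temp file next to target, forces it to disk,
     * and only then renames it over target.
     */
    static void writeDurably(Path target, byte[] data) throws IOException {
        // Keep the temp file in the same directory so the rename stays on one filesystem.
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "atomic", ".tmp");
        try {
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE,
                    StandardOpenOption.TRUNCATE_EXISTING)) {
                ByteBuffer buf = ByteBuffer.wrap(data);
                // A single write call may be partial, so loop until the buffer is drained.
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                // Ask the OS to flush content and metadata to the storage device (fsync)
                // before the channel is closed and before the rename happens.
                ch.force(true);
            }
            // Whether an existing target is replaced by ATOMIC_MOVE is implementation
            // specific; on typical POSIX filesystems it is replaced via rename(2).
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up the temp file on failure.
            Files.deleteIfExists(tmp);
        }
    }
}
```

A caller would invoke something like DurableWrite.writeDurably(buildDir.resolve("program.dat"), bytes), where buildDir and bytes are placeholders. The point of the ordering is that the rename can only ever expose a file whose contents were already forced to disk.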