Status: Open
There appears to be a race condition between the initialization of tasks and the initialization of nodes. This appears to be within remoting, but I have included my version of ec2 plugin because we see this on ec2 agents.
We are seeing jobs being deleted from nodes after a reboot. This appears to be caused by WorkspaceLocatorImpl.java in branch-api-plugin: when a computer comes online, it checks for jobs that exist on the computer but do not exist in Jenkins (via getItemByFullName). If the jobs have not finished loading at that point, jobs that do exist would look missing.
It seems that either branch-api-plugin needs a change to wait for jobs to be loaded, or Jenkins should wait for jobs to be loaded before launching nodes.
As an aside, we found this issue because it shows up for us as a very long startup time. The controller runs out of heap space because large objects are allocated, while connected to nodes, to receive the stack traces of exceptions thrown on those nodes. The exceptions are caused by Jenkins trying to delete the folder of a job in progress that it does not have permission to delete. From here I found that this was caused by the remoting plugin trying to delete the build
I looked into this more, and this seems to be exclusively an issue in branch-api-plugin: Jenkins loads nodes at the same time it loads system configuration, and the milestone for loading jobs is only reached later. branch-api-plugin should not expect jobs to be loaded when nodes come online. See WorkspaceLocatorImpl.java L#586 (onOnline).
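To make the ordering problem concrete, here is a minimal, self-contained sketch of one possible fix: defer the "job exists on the node but not in Jenkins" check until the job-loading milestone has been reached. This is not the branch-api-plugin code. The `Milestone` enum is a simplified stand-in for Jenkins' init milestones, and `DeferredWorkspaceCleanup`, `loadJob`, `reach` and the `deleted` list are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation; none of these types come from Jenkins or branch-api-plugin.
public class DeferredWorkspaceCleanup {
    // Simplified stand-in for the startup milestones (assumption: nodes can come
    // online once system configuration is loaded, before jobs are loaded).
    enum Milestone { STARTED, SYSTEM_CONFIG_LOADED, JOB_LOADED, COMPLETED }

    private Milestone current = Milestone.STARTED;
    private final Set<String> knownJobs = new HashSet<>();
    private final List<Runnable> deferred = new ArrayList<>();
    // Records "node:job" entries that the check decided to delete.
    final List<String> deleted = new ArrayList<>();

    void loadJob(String fullName) {
        knownJobs.add(fullName);
    }

    private boolean jobsLoaded() {
        return current.compareTo(Milestone.JOB_LOADED) >= 0;
    }

    // Advance startup; once jobs are loaded, run any checks that were postponed.
    void reach(Milestone milestone) {
        current = milestone;
        if (jobsLoaded()) {
            List<Runnable> pending = new ArrayList<>(deferred);
            deferred.clear();
            pending.forEach(Runnable::run);
        }
    }

    // Called when a node comes online with the jobs found in its workspaces.
    void onOnline(String node, List<String> workspaceJobs) {
        Runnable check = () -> {
            for (String job : workspaceJobs) {
                // Stand-in for a getItemByFullName lookup returning null.
                if (!knownJobs.contains(job)) {
                    deleted.add(node + ":" + job);
                }
            }
        };
        if (jobsLoaded()) {
            check.run();
        } else {
            deferred.add(check); // too early to trust the lookup
        }
    }

    public static void main(String[] args) {
        DeferredWorkspaceCleanup sim = new DeferredWorkspaceCleanup();
        sim.reach(Milestone.SYSTEM_CONFIG_LOADED);
        // The node comes online before any job has been loaded.
        sim.onOnline("agent-1", List.of("folder/job-a", "folder/removed-job"));
        sim.loadJob("folder/job-a");
        sim.reach(Milestone.JOB_LOADED);
        System.out.println(sim.deleted);
    }
}
```

Without the deferral, the check at `onOnline` would run against an empty job set and flag both entries, including the job that still exists. With it, only `folder/removed-job` is flagged once jobs are loaded.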