branch-api automatically cleans up workspaces for jobs that belong to multibranch projects when those projects are deleted; see the implementation of ItemListener.onDeleted here. When a job is deleted, an instance of WorkspaceLocatorImpl.CleanupTask is submitted to the remoting thread pool to clean up the workspace for that job on every agent connected to Jenkins.
The remoting thread pool is unbounded, so if the multibranch project is large and/or there are a lot of agents, this can result in hundreds or even thousands of temporary threads being created to clean up all of the workspaces. The majority of these threads will be idle, as only one of them can be active for a given agent at a time, so in effect the thread pool is being used as an inefficient queue without any ordering guarantees.
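The thread growth described above can be reproduced outside Jenkins with a small sketch. It assumes the remoting thread pool behaves like a standard unbounded cached pool, and it uses a single ReentrantLock as a stand-in for whatever serializes cleanup on one agent. The class name UnboundedPoolDemo and the method peakThreads are hypothetical and are not part of branch-api.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo, not branch-api code: shows how an unbounded pool ends up
// with one thread per pending cleanup when only one task per agent can run.
final class UnboundedPoolDemo {

    // Submits jobCount simulated cleanup tasks for a single busy agent and
    // returns the largest number of threads the pool ever held.
    static int peakThreads(int jobCount) {
        // Same configuration that Executors.newCachedThreadPool() uses.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());

        // Assumption: one lock per agent models "only one active task per agent".
        ReentrantLock agentLock = new ReentrantLock();

        agentLock.lock(); // simulate a cleanup that is already in progress
        try {
            for (int i = 0; i < jobCount; i++) {
                pool.execute(() -> {
                    agentLock.lock(); // every task parks here on its own thread
                    try {
                        // the actual workspace deletion would happen here
                    } finally {
                        agentLock.unlock();
                    }
                });
            }
        } finally {
            agentLock.unlock(); // let the parked tasks drain one at a time
        }

        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pool.getLargestPoolSize();
    }
}
```

Because no worker becomes idle while the agent is busy, each submission creates a new thread, so the peak thread count in this sketch grows linearly with the number of deleted jobs.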
There are a lot of potential mitigating factors that in practice prevent this behavior from being a problem for many Jenkins instances:
- Deletion of multibranch projects might be infrequent
- Multibranch projects that are deleted may only have a few branches
- There may not be very many agents connected to Jenkins
- Agents may be short-lived and/or may not have any workspaces for the jobs being deleted
- Workspace cleanup tasks may complete very quickly with no network issues
- Jenkins may have enough memory headroom that the memory required by the workspace cleanup threads is not significant
However, there are also worst-case scenarios where the current behavior can result in memory exhaustion and agent disconnections, such as Jenkins instances with minimal memory headroom, many long-lived static agents, a somewhat flaky network, and frequent deletions of large multibranch projects.
To improve the behavior in those scenarios, there are a few things we could try:
- Submitting the workspace cleanup tasks to a queue instead of directly to the remoting thread pool, and only dispatching a few of the tasks from the queue to the remoting thread pool at a time. Probably relatively easy to implement on top of the existing code.
- Batching workspace cleanup tasks by agent to reduce the amount of remoting activity involved in all of the individual deletions. Would probably require significant refactoring of the existing code.
- Adding an option to disable automatic workspace cleanup for users whose instances experience the worst-case behavior, letting them take responsibility for cleaning up workspaces on their agents some other way.
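As a rough illustration of the first option, the sketch below puts a queue in front of an existing executor and only hands a fixed number of tasks to it at a time. It is a minimal sketch, not a proposed patch: ThrottledDispatcher, maxInFlight, and runDemo are hypothetical names, the delegate is a plain cached thread pool standing in for the remoting thread pool, and error handling for a rejected submission is omitted.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: queue cleanup tasks and dispatch only a few at a time.
final class ThrottledDispatcher {
    private final Executor delegate;   // stand-in for the remoting thread pool
    private final int maxInFlight;     // assumed tunable limit
    private final Queue<Runnable> pending = new ArrayDeque<>();
    private int inFlight = 0;

    ThrottledDispatcher(Executor delegate, int maxInFlight) {
        this.delegate = delegate;
        this.maxInFlight = maxInFlight;
    }

    // Enqueue a task; it runs once fewer than maxInFlight tasks are active.
    synchronized void submit(Runnable task) {
        pending.add(task);
        dispatchNext();
    }

    private synchronized void dispatchNext() {
        while (inFlight < maxInFlight && !pending.isEmpty()) {
            Runnable task = pending.poll();
            inFlight++;
            delegate.execute(() -> {
                try {
                    task.run();
                } finally {
                    onTaskFinished(); // free the slot even if the task throws
                }
            });
        }
    }

    private synchronized void onTaskFinished() {
        inFlight--;
        dispatchNext();
    }

    // Runs taskCount short tasks and returns the peak observed concurrency.
    static int runDemo(int taskCount, int maxInFlight) {
        ExecutorService pool = Executors.newCachedThreadPool();
        ThrottledDispatcher dispatcher = new ThrottledDispatcher(pool, maxInFlight);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(taskCount);
        try {
            for (int i = 0; i < taskCount; i++) {
                dispatcher.submit(() -> {
                    peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                    try {
                        Thread.sleep(2); // simulated cleanup work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                        done.countDown();
                    }
                });
            }
            done.await(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }
}
```

With this shape, pending tasks sit in an ordinary FIFO queue rather than each occupying a thread, so the number of threads used for cleanup stays bounded by the chosen limit regardless of how many jobs or agents are involved. The same wrapper could in principle be keyed per agent, which would be a step toward the batching option.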