We've occasionally seen hung threads that live far longer than their jobs and cause system instability. The root cause is that, after a build log is compressed, an additional line is appended to it ('Creating placeholder flownodes because failed loading originals.'), which corrupts the gz archive. If we remove the appended line, the log can be extracted.
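For anyone who needs to read an affected log, here is a minimal Python sketch of the manual recovery described above. It assumes the appended text is the only damage and that it sits after an otherwise complete gzip stream. The function name `split_gzip_and_trailer` is hypothetical and the sample log content is made up for illustration.

```python
import gzip
import zlib
from typing import Tuple


def split_gzip_and_trailer(raw: bytes) -> Tuple[bytes, bytes]:
    """Return (decompressed log, bytes found after the end of the gzip stream)."""
    # wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(31)
    log = decomp.decompress(raw)
    if not decomp.eof:
        # The stream never reached its end marker, so this is not the
        # "extra line appended" case this sketch is meant for.
        raise ValueError("gzip stream is truncated or otherwise damaged")
    # Anything after the gzip trailer is left untouched in unused_data.
    return log, decomp.unused_data


if __name__ == "__main__":
    # Simulate the failure: a valid gzip log with a plain-text line appended.
    original = b"Started by user\nFinished: SUCCESS\n"
    appended = b"Creating placeholder flownodes because failed loading originals.\n"
    damaged = gzip.compress(original) + appended

    log, trailer = split_gzip_and_trailer(damaged)
    print(log.decode(), end="")
    print("trailing data:", trailer.decode(), end="")
```

Reading the file in binary mode and passing its bytes to the function should return both the original log and the stray line, so the trailer can also be checked to confirm it is the message quoted above.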
The workaround is to move the build folder on the master and kill any remaining threads; often we must also reboot the master. This has happened multiple times so far, and we've set up thread duration monitoring jobs to detect threads and builds that run longer than X ms. Advice on additional ways of capturing relevant log information would be appreciated.
The only place I've found the offending line is: