
Massive parallel builds sometimes cause errors in LogRotation

      We submit about 120 parallel builds, each with a 15-second execution time, so the issue may be caused by frequent log rotation running without locks.

      The jobs seem to be OK, but there are a lot of SEVERE messages in the Jenkins log.

      SEVERE: Executor threw an exception
      java.util.NoSuchElementException
      at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:154)
      at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067)
      at java.util.AbstractMap$2$1.next(AbstractMap.java:385)
      at hudson.util.RunList.subList(RunList.java:143)
      at hudson.tasks.LogRotator.perform(LogRotator.java:119)
      at hudson.model.Job.logRotate(Job.java:404)
      at hudson.model.Run.execute(Run.java:1655)
      at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      at hudson.model.ResourceController.execute(ResourceController.java:88)
      at hudson.model.Executor.run(Executor.java:237)
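
One plausible reading of this trace is that the rotation code reads the number of builds, and another executor then deletes builds before the sub-list iteration reaches that count. The following is only a minimal sketch of that failure mode, not the actual Jenkins code: `StaleSizeDemo`, its `subList` helper and the map of fake builds are hypothetical stand-ins, and the "other executor" is simulated on a single thread so the outcome is deterministic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListMap;

public class StaleSizeDemo {
    // Hypothetical helper: walks an iterator up to an index that was computed earlier,
    // calling next() without checking hasNext().
    static List<Integer> subList(Iterable<Integer> builds, int from, int to) {
        List<Integer> out = new ArrayList<>();
        Iterator<Integer> it = builds.iterator();
        for (int i = 0; i < from; i++) {
            it.next(); // skip the builds that are to be kept
        }
        for (int i = from; i < to; i++) {
            out.add(it.next()); // throws if the collection shrank in the meantime
        }
        return out;
    }

    // Returns true if the stale size leads to a NoSuchElementException.
    static boolean raceReproduced() {
        ConcurrentSkipListMap<Integer, String> builds = new ConcurrentSkipListMap<>();
        for (int n = 1; n <= 10; n++) {
            builds.put(n, "build #" + n);
        }
        int size = builds.size(); // executor A reads the size
        builds.remove(9);         // executor B's rotation deletes builds (simulated here)
        builds.remove(10);
        try {
            subList(builds.keySet(), 5, size); // A iterates up to the stale size
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NoSuchElementException reproduced: " + raceReproduced());
    }
}
```

If this reading is right, the exception is a symptom of two rotations overlapping on the same job rather than of a corrupt build record.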

        1. BuildFlowPluginTest.cs (0.7 kB)
        2. LogRotator.txt (1 kB)
        3. Run.txt (1 kB)
        4. StressTest_Job_config_1.png (116 kB)
        5. StressTest_Job_config_2.png (68 kB)
        6. StressTest_Job_config_3.png (85 kB)
        7. StressTest_SubJob_config_1.png (102 kB)
        8. StressTest_SubJob_config_2.png (87 kB)
        9. StressTest_SubJob_config_3.png (57 kB)

          [JENKINS-19743] Massive parallel builds sometimes cause errors in LogRotation

          Nebu Kadnezar added a comment -

          We've built a Jenkins test project in which we can reproduce the issue. The following requirements and project setup describe how to reproduce the error deterministically.

          Software requirements:

          • Jenkins version 1.598
          • Build Flow Plugin version 0.9.1
          • Java version 1.7.0_72
          • C# Script execution engine version 3.8.10.0
          • C# Script BuildFlowPluginTest.cs

          Test environment:

          • The Jenkins master runs on a virtual machine with Windows 7 SP1 32 bit
          • 8 Jenkins slaves on physical machines with Windows 7 SP1 64 bit
          • 6 Jenkins slaves on virtual machines with Windows 7 SP1 32 bit

          Jenkins projects

          We have two Jenkins Build Flow projects:

          • StressTest_Job
          • StressTest_Job_2
            Both are executing the same StressTest_SubJob.
          1. Create the Build Flow projects StressTest_Job and StressTest_Job_2, configured as shown in the screenshots (see attached StressTest_Job_config_1 - StressTest_Job_config_3).
          2. Create a Jenkins project StressTest_SubJob similar to the screenshots (see attached StressTest_SubJob_config_1 - StressTest_SubJob_config_3). The only difference from our configuration is the "Source Code Management" section: set it to "None" and, in the "Build\Command" section, add a copy command that copies "BuildFlowPluginTest.cs" to the Jenkins slave workspace for execution.

          Sample command code for the build section (Windows batch):

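          @echo off
          rem Copy the test script from the network share into the slave workspace.
          xcopy /e "\\Server\project\BuildFlowPluginTest.cs" %workspace%
          rem Send 10 echo requests to the host named by %Parameter%.
          ping %Parameter% -n 10
          rem Run the test script with the C# script engine.
          cscs BuildFlowPluginTest.cs 10000
          ping %Parameter% -n 10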

          See also the attached C# script BuildFlowPluginTest.cs code.


          Nebu Kadnezar added a comment -

          We’ve found that the following source changes to Jenkins version 1.606 work around the issue.

          1. Adaptation of the delete() method in Run.java

          Run.java
          /**
           * Deletes this build and its entire log
           *
           * @throws IOException
           *      if we fail to delete.
           */
          public void delete() throws IOException {

              File rootDir = getRootDir();
              if (!rootDir.isDirectory()) {
                  LOGGER.log(Level.WARNING, "IOException: " + rootDir + " looks to have already been deleted; siblings: " + Arrays.toString(project.getBuildDir().list()));
                  //throw new IOException(this + ": " + rootDir + " looks to have already been deleted; siblings: " + Arrays.toString(project.getBuildDir().list()));
              }

              RunListener.fireDeleted(this);

              synchronized (this) { // avoid holding a lock while calling plugin impls of onDeleted
                  File tmp = new File(rootDir.getParentFile(), '.' + rootDir.getName());

                  if (tmp.exists()) {
                      Util.deleteRecursive(tmp);
                  }
                  // TODO on Java 7 prefer: Files.move(rootDir.toPath(), tmp.toPath(), StandardCopyOption.ATOMIC_MOVE)
                  boolean renamingSucceeded = rootDir.renameTo(tmp);
                  Util.deleteRecursive(tmp);
                  // some user reported that they see some left-over .xyz files in the workspace,
                  // so just to make sure we've really deleted it, schedule the deletion on VM exit, too.
                  if (tmp.exists())
                      tmp.deleteOnExit();

                  if (!renamingSucceeded) {
                      LOGGER.log(Level.WARNING, rootDir + " is in use");
                      //throw new IOException(rootDir + " is in use");
                  }
                  LOGGER.log(FINE, "{0}: {1} successfully deleted", new Object[] {this, rootDir});

                  removeRunFromParent();
              }
          }
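
The TODO in the code above points at `Files.move` with `ATOMIC_MOVE`. As a rough sketch of how a delete could treat "already deleted by someone else" as a normal outcome instead of an error, something along these lines might work. `IdempotentBuildDirDelete` and both of its methods are hypothetical names for illustration, not Jenkins API, and the sketch only covers the case where the directory is already gone; it does not make two deletes that overlap mid-walk safe.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

public class IdempotentBuildDirDelete {
    /** Returns true if this caller removed the directory, false if it was already gone. */
    static boolean deleteBuildDir(Path rootDir) throws IOException {
        // Same naming scheme as the workaround: a dot-prefixed sibling directory.
        Path tmp = rootDir.resolveSibling("." + rootDir.getFileName());
        deleteTree(tmp); // clear leftovers from an earlier, interrupted attempt
        try {
            Files.move(rootDir, tmp, StandardCopyOption.ATOMIC_MOVE);
        } catch (NoSuchFileException e) {
            return false; // another caller already removed it; nothing left to do
        }
        deleteTree(tmp);
        return true;
    }

    /** Recursively deletes dir if it exists. */
    static void deleteTree(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.deleteIfExists(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.deleteIfExists(d);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("builds");
        Path build = Files.createDirectory(base.resolve("1000"));
        System.out.println("first delete removed it: " + deleteBuildDir(build));
        System.out.println("second delete removed it: " + deleteBuildDir(build));
    }
}
```

The boolean return value lets the caller decide whether a lost race deserves a log line at all, rather than choosing between an exception and a WARNING.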
          

          2. Adaptation of the perform(Job<?,?> job) method in LogRotator.java

          LogRotator.java
          public void perform(Job<?,?> job) throws IOException, InterruptedException {
              LOGGER.log(FINE, "Running the log rotation for {0} with numToKeep={1} daysToKeep={2} artifactNumToKeep={3} artifactDaysToKeep={4}", new Object[] {job, numToKeep, daysToKeep, artifactNumToKeep, artifactDaysToKeep});

              // always keep the last successful and the last stable builds
              Run lsb = job.getLastSuccessfulBuild();
              Run lstb = job.getLastStableBuild();

              if (numToKeep != -1) {
                  // Note that RunList.size is deprecated, and indeed here we are loading all the builds of the job.
                  // However we would need to load the first numToKeep anyway, just to skip over them;
                  // and we would need to load the rest anyway, to delete them.
                  // (Using RunMap.headMap would not suffice, since we do not know if some recent builds have been deleted for other reasons,
                  // so simply subtracting numToKeep from the currently last build number might cause us to delete too many.)

                  try {
                      List<? extends Run<?,?>> builds = job.getBuilds();
                      for (Run r : copy(builds.subList(Math.min(builds.size(), numToKeep), builds.size()))) {
                          if (shouldKeepRun(r, lsb, lstb)) {
                              continue;
                          }
                          LOGGER.log(FINE, "{0} is to be removed", r);
                          r.delete();
                      }
                  } catch (Exception e) {
                      LOGGER.log(FINE, "subList creating failed", e);
                  }
              }
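
The workaround above swallows the exception. A different direction, in line with the "log rotation without locks" suspicion in the issue description, would be to make sure only one rotation runs per job at a time. The sketch below is an assumption-laden illustration of that idea, not a patch: `RotationGuard`, `rotateExclusively` and the use of the job name as a key are all hypothetical, and it is untested against Jenkins itself.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class RotationGuard {
    // One lock per job name (hypothetical key; any stable per-job identifier would do).
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    /**
     * Runs the rotation unless another thread is already rotating the same job.
     * Returns true if the rotation ran, false if it was skipped.
     */
    static boolean rotateExclusively(String jobName, Runnable rotation) {
        ReentrantLock lock = LOCKS.computeIfAbsent(jobName, k -> new ReentrantLock());
        if (!lock.tryLock()) {
            return false; // a rotation for this job is already in progress; skip this one
        }
        try {
            rotation.run();
            return true;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        boolean ran = rotateExclusively("StressTest_SubJob", () -> System.out.println("rotating"));
        System.out.println("rotation ran: " + ran);
    }
}
```

Skipping instead of waiting keeps short builds from queuing up behind each other. The trade-off is that a skipped rotation leaves old builds in place until the next build finishes.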
          


          Daniel Beck added a comment -

          nebukadnezar: Could you open a pull request for your fix on GitHub?


          Nebu Kadnezar added a comment -

          Hi Daniel,

          This is not a fix, only a working workaround for the issue. See the attached Run.txt and LogRotator.txt files for the diff patches.


          Viktor Szathmary added a comment -

          Using Jenkins ver. 1.636, we are seeing the following exception on a job allowing concurrent builds and having lots of short jobs:

          hudson.model.Run execute
          SEVERE: Failed to rotate log
          java.util.NoSuchElementException
          at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
          at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
          at java.util.AbstractMap$2$1.next(AbstractMap.java:396)
          at hudson.util.RunList.subList(RunList.java:139)
          at hudson.tasks.LogRotator.perform(LogRotator.java:125)
          at hudson.model.Job.logRotate(Job.java:467)
          at hudson.model.Run.execute(Run.java:1805)
          at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          at hudson.model.ResourceController.execute(ResourceController.java:98)
          at hudson.model.Executor.run(Executor.java:410)

          Viktor Szathmary added a comment -

          We are also seeing this one (reported above as well). So it seems log rotation for concurrent jobs has a race condition.

          Nov 06, 2015 6:01:49 AM hudson.model.Run execute
          SEVERE: Failed to rotate log
          java.io.IOException: Redacted #173213: /redacted/builds/173213 looks to have already been deleted; siblings: [....lots of job ids....]
          at hudson.model.Run.delete(Run.java:1483)
          at hudson.tasks.LogRotator.perform(LogRotator.java:144)
          at hudson.model.Job.logRotate(Job.java:467)
          at hudson.model.Run.execute(Run.java:1805)
          at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          at hudson.model.ResourceController.execute(ResourceController.java:98)
          at hudson.model.Executor.run(Executor.java:410)

          Klaus Schniedergers added a comment -

          I am seeing the same (SEVERE: Failed to rotate log), on 1.625.3 (most recent LTS).

          Once the SEVERE log entries start, Jenkins service starts to use all CPU it can get and becomes very slow/unresponsive.

          Gavin Shelley added a comment -

          We see this on 1.611. We too have 100s of parallel jobs running. This pollutes the logs with scary messages. Not sure if it actually affects us apart from logs, but it would be nice to see this resolved in case it is the cause of the occasional unexplained problem.


          M Chon added a comment -

          I am getting these errors on the machine we build pull requests on:

          Nov 09, 2017 10:11:53 AM SEVERE hudson.model.Run execute
          Failed to rotate log
          java.io.IOException: My-Build #1000: /var/lib/jenkins/jobs/My-Build/builds/1000 looks to have already been deleted; siblings: [1021, 1027, 1047, 1091, 1010, 1034, 1040, 1013, 1041, 1083, 1016, 1049, 1030, 1070, 1004, 1099, 1073, 1024, 1009, 1084, 1039, 1001, 1094, 1100, 1057, 1003, 1007, 1052, 1065, lastSuccessfulBuild, 1045, 1026, 1022, 1061, 1054, 1044, 1093, .1000, 1087, 1063, 1072, 1018, 1096, 1074, 1019, lastUnstableBuild, 1092, 1031, 1033, 1005, 1051, 1043, 1068, 1075, 1095, 1079, 1036, 1032, 1029, 1048, 1042, legacyIds, lastFailedBuild, lastUnsuccessfulBuild, 1025, 1078, 1080, 1090, 1046, 1069, 999, 1014, 1020, lastStableBuild, 1067, 1053, 1028, 1002, 1064, 1059, 1082, 1056, 1017, 1071, 1077, 1097, 1037, 1086, 1076, 1008, 1006, 1081, 1088, 1058, 998, 1023, 1060, 1050, 1012, 1062, 1066, 1038, 1098, 1085, 1055, 1015, 1011, 1089, 1035]
          at hudson.model.Run.delete(Run.java:1483)
          at hudson.tasks.LogRotator.perform(LogRotator.java:131)
          at hudson.model.Job.logRotate(Job.java:474)
          at hudson.model.Run.execute(Run.java:1784)
          at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          at hudson.model.ResourceController.execute(ResourceController.java:98)
          at hudson.model.Executor.run(Executor.java:404)

          Nov 09, 2017 10:11:53 AM SEVERE hudson.model.Run execute
          Failed to rotate log
          java.io.IOException: /var/lib/jenkins/jobs/My-Build/builds/1000 is in use
          at hudson.model.Run.delete(Run.java:1503)
          at hudson.tasks.LogRotator.perform(LogRotator.java:131)
          at hudson.model.Job.logRotate(Job.java:474)
          at hudson.model.Run.execute(Run.java:1784)
          at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          at hudson.model.ResourceController.execute(ResourceController.java:98)
          at hudson.model.Executor.run(Executor.java:404)


          Yuriy Inetov added a comment -

          I have the same problem:

           

          2021-03-28 03:25:07.984+0000 [id=1870887] SEVERE hudson.model.Run#execute: Failed to rotate log
          java.util.NoSuchElementException
           at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
           at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
           at java.util.AbstractMap$2$1.next(AbstractMap.java:418)
           at hudson.util.RunList.subList(RunList.java:154)
           at hudson.tasks.LogRotator.perform(LogRotator.java:160)
           at hudson.model.Job.logRotate(Job.java:469)
           at hudson.model.Run.execute(Run.java:1971)
           at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
           at hudson.model.ResourceController.execute(ResourceController.java:97)
           at hudson.model.Executor.run(Executor.java:429)

           

          The statistics in /monitoring (JavaMelody) say:
          12 hits/min on 15 errors, 500k same errors per month

          Debian 9, Jenkins 2.263.4

           

          Maybe there is a way to at least disable this type of error?
          As the number of saved logs increases, I get a heavy load on the CPU (I do not know whether it is connected with this or not).
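
On the question of disabling this type of error: the log lines quoted in this thread suggest that the message is logged through `java.util.logging` under a logger named `hudson.model.Run`. Assuming that is the case, a filter on that logger could drop only the "Failed to rotate log" records. This is a sketch, not a tested Jenkins configuration: the class `RotateLogNoiseFilter` is hypothetical, the logger name and message text are taken from the quoted output, and where such code would be installed inside Jenkins is left open. It hides the symptom and does nothing about the underlying race or the CPU load.

```java
import java.util.logging.Filter;
import java.util.logging.Level;
import java.util.logging.Logger;

public class RotateLogNoiseFilter {
    // Assumption: message text as it appears in the SEVERE entries quoted in this thread.
    static final String NOISY_MESSAGE = "Failed to rotate log";

    /** Builds a filter that rejects SEVERE records carrying exactly the noisy message. */
    static Filter suppressRotationNoise() {
        return record -> !(record.getLevel() == Level.SEVERE
                && NOISY_MESSAGE.equals(record.getMessage()));
    }

    /** Installs the filter on the named logger and returns the logger. */
    static Logger install(String loggerName) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setFilter(suppressRotationNoise());
        return logger; // callers should keep this reference; loggers can be garbage collected
    }

    public static void main(String[] args) {
        // Assumption: logger name derived from "hudson.model.Run execute" in the log output.
        Logger run = install("hudson.model.Run");
        run.severe("Failed to rotate log");  // rejected by the filter
        run.severe("Something else failed"); // still logged
    }
}
```

Other SEVERE messages from the same logger still get through, so unrelated build failures stay visible.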
           


            Unassigned Unassigned
            oleg_nenashev Oleg Nenashev
            Votes: 14
            Watchers: 19