JENKINS-29281

mesos-plugin crashes jenkins with a large build queue


    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Component: mesos-plugin
    • Environment: jenkins-1.609.1, mesos-plugin-0.6.0, mesos-0.21.1-1.1.centos65.x86_64

      We have 19 static mesos slaves and a single master. It works pretty well with our build server, whose build queues reach about a hundred jobs that typically take 5-60 minutes to execute.

      On our regression servers, we have build queues of up to 800 jobs of about 10 minutes duration. Eventually jenkins crashes. It does not crash if we disable mesos.

      We observe:

      • hundreds of 'offline' jenkins slaves are created - up to 700 - most of which appear to terminate without ever executing a job
      • 19 mesos slaves connected and executing tasks

      Other, less critical issues include:

      • huge logs - I turned them down using a setting in logging.properties (see the sketch after this list), but they still report INFO items; I think Vinod is addressing that. Without the logging.properties filter we get gigabytes of "INFO: Offer not sufficient for slave request:" messages.
      • spam in config-history/config, as an entry is created for each offline mesos slave (up around the 10000 mark of spurious changes logged - I harvest them using a simple script, see the cleanup sketch after this list, but it's a nuisance)
      • spam in logs/slave/mesos-jenkins* - tens of thousands of entries for these 'offline' mesos instances - again, I can clean this up in our weekly janitor job
      • the '# of executors per slave' setting is 1 but it is not respected - slaves were most often running 2 tasks each, and our tasks are too big for that. I therefore set the CPU requirement to 1.1 as an alternative way to restrict this.
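      For reference, the logging.properties filter I mean is along these lines - a sketch rather than the exact file; it assumes java.util.logging is configured via -Djava.util.logging.config.file and relies on the org.jenkinsci.plugins.mesos logger name visible in the log excerpts below:

      {{# quieten the mesos plugin loggers so the per-offer INFO spam is dropped
      org.jenkinsci.plugins.mesos.level = WARNING

      # leave everything else at the default level
      handlers = java.util.logging.ConsoleHandler
      .level = INFO
      java.util.logging.ConsoleHandler.level = INFO}}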
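      The harvest/janitor step amounts to something like the following sketch - JENKINS_HOME, the config-history layout (timestamped directories each holding a config.xml) and the slave-log path are assumptions for illustration; only the mesos-jenkins- name prefix comes from the slave names above:

      {{#!/usr/bin/env python
      # Sketch: remove config-history entries and slave log directories left
      # behind by the throw-away mesos-jenkins-* slaves.  Paths are assumptions.
      import glob
      import os
      import shutil

      JENKINS_HOME = "/var/lib/jenkins"

      # 1. config-history/config entries whose config.xml mentions a mesos slave
      for entry in glob.glob(os.path.join(JENKINS_HOME, "config-history", "config", "*")):
          config_xml = os.path.join(entry, "config.xml")
          if not os.path.isfile(config_xml):
              continue
          with open(config_xml) as fh:
              spurious = "mesos-jenkins-" in fh.read()
          if spurious:
              shutil.rmtree(entry)

      # 2. per-slave log directories for the dead mesos instances
      for entry in glob.glob(os.path.join(JENKINS_HOME, "logs", "slave", "mesos-jenkins-*")):
          shutil.rmtree(entry, ignore_errors=True)}}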

      I have tried mesos-plugin 0.7.0 (fails completely with an NPE) and a 0.8.0-SNAPSHOT from a few days ago (nothing runs at all; the log shows "WARNING: Not launching mesos-jenkins-cef7d675-99b7-4ed5-b8cf-d5da02d07e65 because the Mesos Jenkins scheduler is not running").

      I have tried jenkins-1.599 with the same result.

      There's nothing useful in the jenkins.log when it crashes - the last messages on one machine last night were as follows:

      {{INFO: Disconnecting offline computer mesos-jenkins-b0365b1a-8d39-4843-bd54-5e39a3103894
      Jul 08, 2015 12:41:14 AM org.jenkinsci.plugins.mesos.MesosSlave terminate
      INFO: Terminating slave mesos-jenkins-b0365b1a-8d39-4843-bd54-5e39a3103894
      Jul 08, 2015 12:41:14 AM org.jenkinsci.plugins.mesos.JenkinsScheduler terminateJenkinsSlave
      INFO: Terminating jenkins slave mesos-jenkins-b0365b1a-8d39-4843-bd54-5e39a3103894
      Jul 08, 2015 12:41:14 AM org.jenkinsci.plugins.mesos.JenkinsScheduler terminateJenkinsSlave
      INFO: Removing enqueued mesos task mesos-jenkins-b0365b1a-8d39-4843-bd54-5e39a3103894
      Jul 08, 2015 12:41:14 AM org.jenkinsci.plugins.mesos.JenkinsScheduler terminateJenkinsSlave
      INFO: Terminating jenkins slave mesos-jenkins-2c08c203-9570-45d0-8bb8-bff24622d616
      Jul 08, 2015 12:41:14 AM org.jenkinsci.plugins.mesos.JenkinsScheduler terminateJenkinsSlave
      INFO: Removing enqueued mesos task mesos-jenkins-2c08c203-9570-45d0-8bb8-bff24622d616
      Jul 08, 2015 12:41:14 AM org.jenkinsci.plugins.mesos.JenkinsScheduler terminateJenkinsSlave
      INFO: Terminating jenkins slave mesos-jenkins-924ddaaf-5d06-4c93-8810-87cc2704fc7c
      Jul 08, 2015 12:41:14 AM org.jenkinsci.plugins.mesos.JenkinsScheduler terminateJenkinsSlave
      INFO: Removing enqueued mesos task mesos-jenkins-924ddaaf-5d06-4c93-8810-87cc2704fc7c
      }}

      This was in /var/log/messages at the same time, showing that java had exhausted system memory and invoked the kernel oom-killer:

      {{Jul 8 00:35:00 ci14bldmst01v kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
      Jul 8 00:35:00 ci14bldmst01v kernel: java cpuset=/ mems_allowed=0
      Jul 8 00:35:00 ci14bldmst01v kernel: Pid: 25076, comm: java Not tainted 2.6.32-504.23.4.el6.x86_64 #1
      Jul 8 00:35:00 ci14bldmst01v kernel: Call Trace:
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff810d4241>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff81127500>] ? dump_header+0x90/0x1b0
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff8122ee7c>] ? security_real_capable_noaudit+0x3c/0x70
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff81127982>] ? oom_kill_process+0x82/0x2a0
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff811278c1>] ? select_bad_process+0xe1/0x120
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff81127dc0>] ? out_of_memory+0x220/0x3c0
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff811346ff>] ? __alloc_pages_nodemask+0x89f/0x8d0
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff8116c9aa>] ? alloc_pages_current+0xaa/0x110
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff811248f7>] ? __page_cache_alloc+0x87/0x90
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff811242de>] ? find_get_page+0x1e/0xa0
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff81125897>] ? filemap_fault+0x1a7/0x500
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff8114ed04>] ? __do_fault+0x54/0x530
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff8114f2d7>] ? handle_pte_fault+0xf7/0xb00
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff8114ff79>] ? handle_mm_fault+0x299/0x3d0
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff8104d096>] ? __do_page_fault+0x146/0x500
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff8100bcae>] ? invalidate_interrupt1+0xe/0x20
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff8153001e>] ? do_page_fault+0x3e/0xa0
      Jul 8 00:35:00 ci14bldmst01v kernel: [<ffffffff8152d3d5>] ? page_fault+0x25/0x30
      Jul 8 00:35:00 ci14bldmst01v kernel: Mem-Info:
      Jul 8 00:35:00 ci14bldmst01v kernel: Node 0 DMA per-cpu:
      Jul 8 00:35:00 ci14bldmst01v kernel: CPU 0: hi: 0, btch: 1 usd: 0
      Jul 8 00:35:00 ci14bldmst01v kernel: CPU 1: hi: 0, btch: 1 usd: 0
      Jul 8 00:35:00 ci14bldmst01v kernel: Node 0 DMA32 per-cpu:
      Jul 8 00:35:00 ci14bldmst01v kernel: CPU 0: hi: 186, btch: 31 usd: 0
      Jul 8 00:35:00 ci14bldmst01v kernel: CPU 1: hi: 186, btch: 31 usd: 0
      Jul 8 00:35:00 ci14bldmst01v kernel: Node 0 Normal per-cpu:
      Jul 8 00:35:00 ci14bldmst01v kernel: CPU 0: hi: 186, btch: 31 usd: 0
      Jul 8 00:35:00 ci14bldmst01v kernel: CPU 1: hi: 186, btch: 31 usd: 39
      Jul 8 00:35:00 ci14bldmst01v kernel: active_anon:617910 inactive_anon:230791 isolated_anon:0
      Jul 8 00:35:00 ci14bldmst01v kernel: active_file:5 inactive_file:352 isolated_file:0
      Jul 8 00:35:00 ci14bldmst01v kernel: unevictable:0 dirty:0 writeback:3 unstable:0
      Jul 8 00:35:00 ci14bldmst01v kernel: free:21831 slab_reclaimable:7141 slab_unreclaimable:41175
      Jul 8 00:35:00 ci14bldmst01v kernel: mapped:0 shmem:0 pagetables:12690 bounce:0
      Jul 8 00:35:00 ci14bldmst01v kernel: Node 0 DMA free:15684kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15292kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      }}

            Assignee: Vinod Kone
            Reporter: Bob Hepple
            Votes: 0
            Watchers: 1
