  1. Jenkins
  2. JENKINS-66880

Pipeline Jobs fail with: Agent <AgentID> was deleted; cancelling node body


    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Major
    • mesos-plugin
    • None
    • Jenkins 2.289.2
      Mesos Cloud 1.7.2

      Hello,

      After upgrading from Jenkins 2.164 to Jenkins 2.289.2, we are running into a very peculiar issue.
      Pipeline jobs keep dying for no apparent reason. Most jobs, but not all of them, stop after about 10-15 minutes.

      The job logs show only:

      Agent <AgentID> was deleted; cancelling node body
      

      The sub-job, as expected, shows "Calling Pipeline Cancelled" and the job ends with a FlowInterruptedException.

       

      When the sub-job is run independently of the pipeline, it works without a hitch. I have looked at all logs related to the job itself and to Mesos, but Mesos is simply doing what Jenkins instructs it to do: there are no errors and no indicators of resource problems. My first thought was undersized agents, but analysis showed no such thing; even properly sized agents experience this issue.

       

      The Mesos-related logs in Jenkins (Logger org.jenkinsci.plugins.mesos -> Log level ALL) merely show:

      <AgentID> with slave org.jenkinsci.plugins.mesos.MesosSlave[<AgentID>] is not pending deletion or the slave is null
      

      Until at some point it changes to:

      <AgentID> with slave org.jenkinsci.plugins.mesos.MesosSlave[ null ] is not pending deletion or the slave is null
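
      To pin down when the slave reference changes from the agent ID to null, the Jenkins log could be scanned with something like the following Python sketch. It assumes the messages match the two lines quoted above exactly; the function name and the sample agent IDs are hypothetical and only serve as illustration.

```python
import re

# Pattern for the message quoted in this report. The text inside the
# brackets is either the agent ID or "null" (possibly padded with spaces).
LINE_RE = re.compile(
    r"(?P<agent>\S+) with slave org\.jenkinsci\.plugins\.mesos\.MesosSlave"
    r"\[\s*(?P<slave>[^\]]*?)\s*\] is not pending deletion or the slave is null"
)


def first_null_slave_lines(lines):
    """Map each agent ID to the 1-based number of the first line on which
    its slave reference is reported as null."""
    first_null = {}
    for number, line in enumerate(lines, start=1):
        match = LINE_RE.search(line)
        if match and match.group("slave") == "null":
            # setdefault keeps only the first occurrence per agent.
            first_null.setdefault(match.group("agent"), number)
    return first_null


# Made-up sample lines in the format quoted above.
sample = [
    "agent-1 with slave org.jenkinsci.plugins.mesos.MesosSlave[agent-1] is not pending deletion or the slave is null",
    "agent-2 with slave org.jenkinsci.plugins.mesos.MesosSlave[agent-2] is not pending deletion or the slave is null",
    "agent-1 with slave org.jenkinsci.plugins.mesos.MesosSlave[ null ] is not pending deletion or the slave is null",
    "agent-1 with slave org.jenkinsci.plugins.mesos.MesosSlave[ null ] is not pending deletion or the slave is null",
]
result = first_null_slave_lines(sample)
```

      Comparing the reported line (or its timestamp, if the log has one) with the time of the "was deleted; cancelling node body" message in the job log might show whether the agent object disappears before or after the pipeline is cancelled.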
      

      The Mesos master and slave logs show only bog-standard status and termination messages, with no context about what is going wrong.

       

      This basically blocks any pipeline from executing on this system. I should also clarify that the issue is not deterministic. While most jobs fail after roughly ten minutes, they do not consistently fail at the same point; the failure time varies by at least a few minutes. Jobs that were triggered manually sometimes fail only after 30 minutes to an hour.

            vinodkone Vinod Kone
            markus_bauerbe Markus Bauer
            Votes: 0
            Watchers: 1
