Bug
Resolution: Not A Defect
Major
None
Jenkins 2.289.2
Mesos Cloud 1.7.2
Hello,
After upgrading from Jenkins 2.164 to Jenkins 2.289.2, we're running into a very peculiar issue.
Pipeline jobs keep dying for no apparent reason. Most jobs - but not all of them - stop after about 10-15 minutes.
Looking at the job logs shows only:
Agent <AgentID> was deleted; cancelling node body
The sub job - as expected - shows "Calling Pipeline Cancelled" and the job ends with a FlowInterruptedException.
When the sub job is run independently of the pipeline, it works without a hitch. I've looked at all logs related to the job itself and to Mesos, but Mesos is simply doing as Jenkins instructs it: there are no errors or any indications of resource problems. (My first thought was undersized agents, but analysis showed no such thing; even properly sized agents experience this issue.)
Any Mesos-related logs in Jenkins (Logger org.jenkinsci.plugins.mesos -> Log level ALL) merely show:
<AgentID> with slave org.jenkinsci.plugins.mesos.MesosSlave[<AgentID>] is not pending deletion or the slave is null
Until at some point it changes to:
<AgentID> with slave org.jenkinsci.plugins.mesos.MesosSlave[ null ] is not pending deletion or the slave is null
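For context, the messages above were captured with a Jenkins Log Recorder, which routes the plugin's java.util.logging output at the chosen level. The sketch below is only an illustration of that mechanism (the class name and handler setup are mine, not the plugin's code): it raises the org.jenkinsci.plugins.mesos logger to ALL and attaches a handler so that FINE-level messages like the ones quoted are not dropped.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class MesosLogCapture {
    public static void main(String[] args) {
        // Roughly what a Jenkins Log Recorder for this package at level ALL does:
        // lower the logger threshold so FINE/FINER plugin messages are recorded.
        Logger mesosLogger = Logger.getLogger("org.jenkinsci.plugins.mesos");
        mesosLogger.setLevel(Level.ALL);

        // Attach a handler that also passes everything through; the default
        // console handler would otherwise filter out sub-INFO messages.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL);
        mesosLogger.addHandler(handler);

        // Example of a message that would now be visible in the recorder.
        mesosLogger.fine("agent is not pending deletion or the slave is null");
    }
}
```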
The Mesos master and slave logs show only bog-standard status and termination messages, with no context for what is going wrong.
This effectively blocks any pipeline from executing on this system. I should also clarify: the issue is not deterministic. While most jobs fail after ten-odd minutes, they do not fail consistently at the same time; the failure time varies by at least a few minutes, and jobs triggered manually sometimes fail only after 30 minutes to an hour.