
Jobs are started on master instead of EC2 slaves randomly

    • EC2-Plugin 2.0.2

      Jenkins master runs on AWS Linux 2. Jenkins uses the EC2 plugin to create slaves whenever needed, and many jobs are assigned to slaves using labels.

      Since upgrading to EC2 plugin 1.49 (and to Jenkins 2.217, which contains Remoting 4.0), some jobs are started on the master node instead of the started slaves, seemingly at random. The AWS slave is started, but the workspace is created on the master (in the user's home directory, which should have been used on the slave). The job's console log claims it is running on the slave, but that is not true.

      Maybe this is not related to the EC2 plugin, as I don't see any change related to this problem in the 1.49 release history.

      Attachment: I took a snapshot of a node's script console page while, according to the Jenkins logs, it was being used for a build. I queried the hostname, and although the name of the node suggests it is a slave node, the hostname belongs to the master. And of course the workspace was created on the master.

          [JENKINS-61051] Jobs are started on master instead of EC2 slaves randomly

          Laszlo Gaal added a comment -

          I have run into a similar problem with Jenkins v2.204.2 and EC2 plugin v1.49.1. In our case the master was actually overloaded by the misdirected job, and the Jenkins process was killed by the OOM-killer.

          One symptom I found was that the Jenkins log lines that normally record the EC2 plugin's connection attempt to the newly created worker were missing the IP address, printing "null" instead:

          Regular log entry:

          2020-03-10 04:47:57.202+0000 [id=797295]        INFO    hudson.plugins.ec2.EC2Cloud#log: Connecting to 172.31.26.224 on port 22, with timeout 10000. 

          Bad log line (only 2 instances in several weeks, immediately before the failure):

          2020-03-10 04:47:57.113+0000 [id=797326]        INFO    hudson.plugins.ec2.EC2Cloud#log: Connecting to null on port 22, with timeout 10000. 

          Observe the "null" instead of a valid IP address.


          Jeff Thompson added a comment -

          That sounds like an issue in the EC2 plugin, possibly a timing problem. Presumably, if the IP address isn't specified, it runs the job on the master.


          Laszlo Gaal added a comment -

          Just ran into this again. jthompson: yeah, it looks like either a timing problem or a race.

          As a workaround I installed roadblocks on the master that should fail such an errant job very early in the startup/config phase, before it has a chance to consume all memory and trigger an OOM-kill. We'll see if it's enough; I'd really hate to downgrade the plugin again.
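          For illustration, a minimal sketch of the kind of roadblock meant above: an early sanity step that refuses to run on the controller host. The hostname comparison and the CONTROLLER_HOST variable are assumptions for the sketch, not necessarily what was actually deployed.

          // Hypothetical early guard: abort immediately if this build step is
          // executing on the controller instead of an EC2 worker.
          public class NodeGuard {
              public static void main(String[] args) throws Exception {
                  String hostname = java.net.InetAddress.getLocalHost().getHostName();
                  // Assumption: the controller's hostname is known and never shared by workers.
                  String controller = System.getenv().getOrDefault("CONTROLLER_HOST", "jenkins-master");
                  if (hostname.equals(controller)) {
                      System.err.println("Refusing to run on the controller (" + hostname + ")");
                      System.exit(1); // fail fast, before the job can exhaust the master's memory
                  }
                  System.out.println("Running on worker " + hostname + ", continuing.");
              }
          }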


          Gabor V added a comment -

          Any idea who from the EC2 plugin team could work on this bug? To whom should we assign it?


          Raihaan Shouhell added a comment -

          EC2 just launches and manages agents; it doesn't actually do anything with regard to assigning agents.
          That null does look suspicious.

          Does your master use the same PEM as your agents? I'm assuming that your agents are Linux and use SSH as well.

          Laszlo Gaal added a comment -

          raihaan, yes, they do use the same keys, and I've realized that assigning different keys to them would be a useful workaround.

          However, I had never had this problem before upgrading to 1.49.1, so having the same keys does not by itself cause the problem, although it makes the failing case that much more severe.


          Laszlo Gaal added a comment - edited

          Just saw https://github.com/jenkinsci/ec2-plugin/pull/447, which seems likely to fix this issue; one of the comments actually refers to the

          Connecting to null on port 22 

          pattern I described in an earlier comment.


          Laszlo Gaal added a comment -

          Unfortunately, I saw this again on Jenkins 2.346.3 with EC2 plugin v1.68.

          Symptoms are the same: "Connecting to null on port 22", then starting the job on the controller node.

          I wonder if adding a null check to https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/ssh/EC2UnixLauncher.java#L430 (in addition to checking for "0.0.0.0" as the IP address) would be enough to force the controller to wait for the worker to come up and initialize fully.
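          For concreteness, a minimal sketch of that check (the address lookup is abstracted behind a Supplier because the plugin's actual resolution code is not reproduced here; the names are illustrative, not the plugin's API):

          import java.util.function.Supplier;

          // Sketch of the suggested guard: keep polling until the resolved address is
          // neither null, empty, nor "0.0.0.0" before attempting the SSH connection.
          final class AddressWait {
              static String waitForUsableAddress(Supplier<String> addressLookup, long pollMillis)
                      throws InterruptedException {
                  while (true) {
                      String host = addressLookup.get();
                      if (host != null && !host.isEmpty() && !"0.0.0.0".equals(host)) {
                          return host; // usable address: safe to attempt the SSH connection
                      }
                      // No address yet: wait and re-poll instead of connecting to "null".
                      Thread.sleep(pollMillis);
                  }
              }
          }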

          jthompson, I agree that this is most likely a problem with the EC2 plugin, not Remoting, as this log line is emitted while the worker node is still in its startup phase, before the job is handed to the worker. Could maybe thoulen, raihaan or julienduchesne offer some insight?

          My apologies if the direct ping goes against etiquette; I tried, but could not find an ec2-maintainers alias.


          Raihaan Shouhell added a comment -

          Should be fixed in ec2 2.0.2.

          Laszlo Gaal added a comment -

          raihaan, thanks a lot for the quick fix and release; it was a very pleasant surprise indeed.

          I'll keep an eye on the scenario; hopefully this fix makes it go away for good.


            Assignee: Raihaan Shouhell (raihaan)
            Reporter: Gabor V (gaborv)
            Votes: 2
            Watchers: 5