  1. Jenkins
  2. JENKINS-26020

Will not start builds even though there are available slots on executor


Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Incomplete
    • core
    • None
    • LTS 1.580.1

    Description

      Sometimes our nodes are unable to start new builds even though free executor slots are available.

      A workaround for slaves is to disconnect and reconnect the slave, after which builds are scheduled on it again.

      I have observed that when this happens, the affected slave has fewer running threads than an idle slave.

      I am attaching thread dumps taken while the problem was occurring and after a disconnect/reconnect.

      We have seen this issue on Windows (JNLP) slaves and Linux (SSH) slaves, as well as on the master node, which runs Linux.
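
      The thread-count observation above can be checked mechanically by diffing the thread names in two dumps. The following is a minimal sketch, not part of the original report. It assumes each thread entry starts with the thread name in double quotes, as in jstack-style output; the function names and the sample thread names in the usage note are hypothetical.

```python
import re

# Assumption: every thread entry begins with the thread name in double
# quotes at the start of a line, as in jstack-style dumps.
THREAD_HEADER = re.compile(r'^\s*"([^"]+)"')


def thread_names(dump_text):
    """Return the set of thread names found in a thread dump."""
    names = set()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_threads(stuck_dump, healthy_dump):
    """Thread names present in the healthy dump but absent from the stuck one."""
    return sorted(thread_names(healthy_dump) - thread_names(stuck_dump))
```

      Calling `missing_threads(stuck, healthy)` on the dump taken during the hang and the dump taken after the reconnect lists the threads that only exist in the working state. Thread names that embed counters (pool or handler numbers, for example) will show up as noise and may need normalizing first.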

          Activity

            ki82 Christian Bremer created issue -
            ki82 Christian Bremer added a comment - $200 is up for grabs for this issue at: https://freedomsponsors.org/issue/598/will-not-start-builds-even-though-there-are-available-slots-on-executor
            danielbeck Daniel Beck added a comment -

            Any interesting errors getting logged?

            ki82 Christian Bremer added a comment -

            We get ~5000 JnlpSlaveHandshake errors per hour:

            Dec 15, 2014 8:40:10 AM jenkins.slaves.JnlpSlaveHandshake error
            WARNING: TCP slave agent connection handler #150398 with /10.33.21.14:62740 is aborted: generic_AESL-JENKINS07 is already connected to this master. Rejecting this connection.

            We get these errors all the time, including when we can schedule on all slaves.

            ki82 Christian Bremer added a comment -

            Other than that, I see no errors in the log at the times when scheduling on a node fails, although I might have missed some because our logs are flooded.
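
            With a log this noisy, it can help to tally the rejected handshakes per hour and per slave, so the worst offenders stand out. The following is a rough sketch only, based on the two-line log format quoted above. The function name is hypothetical, and the timestamp parsing assumes an English locale.

```python
import re
from collections import Counter
from datetime import datetime

# First line of an entry: "<timestamp> <logger class> <method>".
HEADER = re.compile(
    r'^\s*(\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) (\S+) \S+\s*$'
)
# Second line of an entry: the rejection message naming the slave.
REJECTED = re.compile(r'is aborted: (\S+) is already connected')


def count_rejections(lines):
    """Count rejected handshakes keyed by (hour, slave name)."""
    counts = Counter()
    stamp = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            # Only remember timestamps that belong to JnlpSlaveHandshake entries.
            if header.group(2).endswith("JnlpSlaveHandshake"):
                stamp = datetime.strptime(header.group(1), "%b %d, %Y %I:%M:%S %p")
            else:
                stamp = None
            continue
        rejected = REJECTED.search(line)
        if rejected and stamp is not None:
            hour = stamp.strftime("%Y-%m-%d %H:00")
            counts[(hour, rejected.group(1))] += 1
            stamp = None
    return counts
```

            Feeding it the log file, for example `count_rejections(open(path))` with `path` pointing at the master's log, gives a counter whose `most_common()` shows which slaves reconnect most often.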
            oleg_nenashev Oleg Nenashev added a comment -

            > Dec 15, 2014 8:40:10 AM jenkins.slaves.JnlpSlaveHandshake error
            > WARNING: TCP slave agent connection handler #150398 with /10.33.21.14:62740 is aborted: generic_AESL-JENKINS07 is already connected to this master. Rejecting this connection.

            This seems to be unrelated. This kind of error usually occurs when two jenkins-slave processes are running for the same slave. On Windows machines it can, in rare cases, happen after improper service termination, etc. You can also configure the Jenkins slave to use a longer interval between reconnect attempts.

            rtyler R. Tyler Croy made changes -
            Field: Workflow
            Original Value: JNJira [ 160008 ]
            New Value: JNJira + In-Review [ 180215 ]
            oleg_nenashev Oleg Nenashev added a comment - edited

            The Windows service issue should be fixed by JENKINS-39231. I do not see anything else we can diagnose here.

            oleg_nenashev Oleg Nenashev made changes -
            Link: This issue is related to JENKINS-39231 [ JENKINS-39231 ]
            oleg_nenashev Oleg Nenashev added a comment -

            I see no way to proceed with this issue without more info. There are also no other commenters or voters, so I'm closing it as Incomplete. Feel free to reopen it if you have additional info.

            oleg_nenashev Oleg Nenashev made changes -
            Resolution: set to Incomplete [ 4 ]
            Status: Open [ 1 ] -> Resolved [ 5 ]

            People

              Assignee: Unassigned
              Reporter: ki82 Christian Bremer
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: