Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-23821

Billing-time-expiry termination timeouts don't have a delay

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed (View Workflow)
    • Priority: Trivial
    • Resolution: Fixed
    • Component/s: ec2-plugin
    • Labels:
    • Environment:
      EC2-plugin 1.23
    • Similar Issues:

      Description

      The EC2-plugin supports negative expiry times. As far as I can tell that's undocumented, I noticed when investigating JENKINS-23792 .

      Anyway, it's not fully thought out. If the idle time is -5 (minutes) and a node finishes a job at 23:56, then the next idle check fires at 23:57, it will be terminated as idle because it's within 5 minutes of the end of the hourly billing period, even though it's only been idle for 1 minute and is quite likely to have another job come in any second.

      For reference, see ec2-plugin/src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java, _check(...).

      I'm not sure what the answer is here. I'm not sure it's right to assume that the node must be both idle for abs minutes and within n minutes of its idle time, but that's the most obvious answer. It'd mean that a setting of 5 would expire a node if it was idle for 5 mins and within 5 mins of billing period end. So if it went idle at 11:54 we'd terminate it at 11:59.

      I lean toward that answer in the absence of better ideas. Suggestions?

        Attachments

          Activity

          Hide
          acoulton Andrew Coulton added a comment -

          I guess it depends on usecase.

          If the instance is stopped at -5 minutes and another job comes in at -4, there will be a short delay on that next job while the instance is restarted, but it won't really cost any more than it would have done if the instance had been left running.

          But if I have a job finish at -2 and then keep the instance running 5 minutes in case another job comes, then I have to pay a whole extra hour for no reason.

          Generally for us, if a build fails and there's nothing in the queue it's at least a few minutes, often longer, before we've found the failure and pushed the fix. So the behaviour as-is works fairly well for us.

          Perhaps actually what's needed is a configurable idle time (after last job completion) and a "Wait till the end of an instance-hour to stop instances" checkbox. Stopping an instance takes a fairly consistent amount of time, and I don't think there's a usecase where you'd intentionally want to stop it with say half an hour left? So the minutes left cutoff could just be hardcoded - say 2 minutes before the hour, to be safe.

          Then the logic would be something like:

          if (time_since_last_build_finished > idle_time) {
            if (wait_till_end_of_instance_hour && (minutes_of_instance_hour_left > 2)) {
              return;
            }
            stopInstance();
          }
          

          We'd probably just configure it with the wait checkbox ticked and an idle time of 0, but users with more regular builds could configure a longer idle time at risk of sometimes pushing the instance into an extra hour.

          Show
          acoulton Andrew Coulton added a comment - I guess it depends on usecase. If the instance is stopped at -5 minutes and another job comes in at -4, there will be a short delay on that next job while the instance is restarted, but it won't really cost any more than it would have done if the instance had been left running. But if I have a job finish at -2 and then keep the instance running 5 minutes in case another job comes, then I have to pay a whole extra hour for no reason. Generally for us, if a build fails and there's nothing in the queue it's at least a few minutes, often longer, before we've found the failure and pushed the fix. So the behaviour as-is works fairly well for us. Perhaps actually what's needed is a configurable idle time (after last job completion) and a "Wait till the end of an instance-hour to stop instances" checkbox. Stopping an instance takes a fairly consistent amount of time, and I don't think there's a usecase where you'd intentionally want to stop it with say half an hour left? So the minutes left cutoff could just be hardcoded - say 2 minutes before the hour, to be safe. Then the logic would be something like: if (time_since_last_build_finished > idle_time) { if (wait_till_end_of_instance_hour && (minutes_of_instance_hour_left > 2)) { return ; } stopInstance(); } We'd probably just configure it with the wait checkbox ticked and an idle time of 0, but users with more regular builds could configure a longer idle time at risk of sometimes pushing the instance into an extra hour.
          Hide
          ringerc Craig Ringer added a comment -

          Interesting point - stopping the instance then starting it almost immediately actually doesn't matter that much unless your instances do lots of expensive work on startup.

          I suspect it's much more of a problem if you're using instance store root and terminating your instances, where bringup could take quite a while. Then again most people doing that are probably using custom AMIs that're as pre-configured as possible already.

          Perhaps it's not worth doing anything about.

          I've put together a patch that'll document it and will leave it at that for now. I think I'll switch to negative timeouts and see how it goes on my deployment.

          Show
          ringerc Craig Ringer added a comment - Interesting point - stopping the instance then starting it almost immediately actually doesn't matter that much unless your instances do lots of expensive work on startup. I suspect it's much more of a problem if you're using instance store root and terminating your instances, where bringup could take quite a while. Then again most people doing that are probably using custom AMIs that're as pre-configured as possible already. Perhaps it's not worth doing anything about. I've put together a patch that'll document it and will leave it at that for now. I think I'll switch to negative timeouts and see how it goes on my deployment.
          Hide
          ringerc Craig Ringer added a comment -

          This pull request incorporates the documentation change:

          https://github.com/jenkinsci/ec2-plugin/pull/101

          Show
          ringerc Craig Ringer added a comment - This pull request incorporates the documentation change: https://github.com/jenkinsci/ec2-plugin/pull/101

            People

            Assignee:
            francisu Francis Upton
            Reporter:
            ringerc Craig Ringer
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved: