Overview

Workload autoscaling is configured by setting a strategy, a target value, and in some cases a metric percentile. Together these values determine when the workload scales up and down.

As the system scales up, traffic is not sent to new replicas until they pass the readiness probe, if one is configured. If no probe is configured, or if the probe is only a basic TCP port check, requests can reach new replicas before they are ready to respond. This can cause delays or errors for end-user traffic.

You can configure autoscaling in the default options for a workload (defaultOptions) and in any of the location-specific options.

Scaling Strategies

The scaling strategy is set using autoscaling.metric.

  • Disabled (disabled)
    • Scaling will be disabled.
  • Concurrent Requests Quantity (concurrency)
    • The average number of requests executing at a given point in time across all the replicas, calculated as (requests * requestDuration)/(timePeriod * replicas).
    • Example: A workload with 5 replicas received 1000 requests with an average response time of 50ms (0.05 seconds) over a 1 second period. The concurrent requests metric for that period is (1000 * 0.05)/(1 * 5) = 10.
  • Requests Per Second (rps)
    • The raw number of requests received by a workload each second divided by the number of replicas. Requests are counted even if they have not yet been completed.
  • Percentage of CPU Utilization (cpu)
    • The percentage of CPU consumed by system and user processes in the container(s) as specified in the container cpu field.
  • Request Latency (latency)
    • The request response time (at a configurable percentile) in milliseconds, averaged across all replicas.
  • Memory Utilization (memory)
    • The percentage of memory consumed by system and user processes in the container(s) as specified in the container memory field.
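
The concurrency and rps definitions above can be sketched in Python as follows. This is only an illustration of the arithmetic; the function and parameter names are hypothetical and are not part of any Control Plane API.

```python
def concurrent_requests(requests: int, avg_duration_s: float,
                        period_s: float, replicas: int) -> float:
    """(requests * requestDuration) / (timePeriod * replicas)."""
    if period_s <= 0 or replicas <= 0:
        raise ValueError("period_s and replicas must be positive")
    return (requests * avg_duration_s) / (period_s * replicas)


def rps_per_replica(requests: int, period_s: float, replicas: int) -> float:
    """Requests received per second, divided by the number of replicas."""
    if period_s <= 0 or replicas <= 0:
        raise ValueError("period_s and replicas must be positive")
    return requests / period_s / replicas


# The worked example: 1000 requests, 50 ms each, over 1 second, on 5 replicas.
print(concurrent_requests(1000, 0.05, 1, 5))
print(rps_per_replica(1000, 1, 5))
```

The first call should reproduce the value of 10 from the concurrency example. The same traffic expressed as rps is 200 per replica.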

Caveats when choosing a workload type and a scaling strategy:

  • Serverless workloads cannot use the latency scaling strategy or any of the Multi Metric scaling strategies.
  • Standard workloads cannot use the concurrency scaling strategy.

The scale to zero functionality is only available for Serverless workloads, and only when using the rps or concurrency scaling strategies.
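
The restrictions above can be expressed as a small check. The following Python sketch is illustrative only: the function name and the workload type strings are hypothetical, and it assumes that scale to zero is requested by setting minScale to 0.

```python
from typing import List, Optional


def strategy_problems(workload_type: str, metric: Optional[str] = None,
                      multi: bool = False, min_scale: int = 1) -> List[str]:
    """Return the caveats violated by a workload type / strategy combination."""
    problems = []
    if workload_type == "serverless":
        if metric == "latency":
            problems.append("Serverless workloads cannot use the latency strategy")
        if multi:
            problems.append("Serverless workloads cannot use Multi Metric strategies")
    if workload_type == "standard" and metric == "concurrency":
        problems.append("Standard workloads cannot use the concurrency strategy")
    # Assumption: scale to zero corresponds to minScale == 0.
    if min_scale == 0:
        if workload_type != "serverless" or metric not in ("rps", "concurrency"):
            problems.append("Scale to zero requires a Serverless workload "
                            "using rps or concurrency")
    return problems


print(strategy_problems("standard", metric="concurrency"))
print(strategy_problems("serverless", metric="rps", min_scale=0))
```

An empty list means the combination does not conflict with any of the caveats listed here.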

Target

The target is the value that the system will try to keep the metric near but below.

Autoscaling Standard Workloads

For standard workloads, Control Plane runs two asynchronous control loops:

  1. The Scaling Decision Loop
  2. The Metric Calculation Loop

Because of this asynchronous structure, autoscaling decisions may be made based on a metric value that is as old as the metric’s collection rate (usually 20 seconds).

The Scaling Decision Loop

A workload’s scale is evaluated every 15 seconds, using the value most recently calculated by the Metric Calculation Loop. Each time an evaluation is made, the chosen metric is averaged across all available replicas and compared against the scale target. When scaling up, Control Plane does not enforce a stabilization window; the number of pods increases as soon as the scaling algorithm dictates. When scaling down, a stabilization window of 5 minutes is used; the highest number of pods recommended by the scaling algorithm within the past 5 minutes is applied to the running workload.
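
The stabilization behavior described above can be sketched as follows. This is a minimal Python illustration, not Control Plane's implementation: the class name is hypothetical, and the recommended replica counts are taken as inputs because how they are derived from the metric is not covered here.

```python
from collections import deque


class ScaleDownStabilizer:
    """Applies the highest recommendation seen within the window."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self._history = deque()  # (timestamp in seconds, recommended replicas)

    def apply(self, now_s: float, recommended: int) -> int:
        self._history.append((now_s, recommended))
        # Forget recommendations older than the stabilization window.
        while self._history[0][0] < now_s - self.window_s:
            self._history.popleft()
        # A higher recommendation wins immediately (no scale-up delay);
        # a lower one only takes effect once older, higher values expire.
        return max(replicas for _, replicas in self._history)


stabilizer = ScaleDownStabilizer()
for t, rec in [(0, 5), (15, 2), (30, 8), (315, 2), (331, 2)]:
    print(t, rec, stabilizer.apply(t, rec))
```

Taking the maximum over the window covers both directions: an increase is applied on the evaluation that recommends it, while a decrease waits until no higher recommendation remains in the past 5 minutes.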

The Metric Calculation Loop

Requests per Second

Every 20 seconds, Control Plane calculates the average number of requests per second over the past 60 seconds.
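
As a rough illustration of this sliding window, the Python sketch below averages request arrivals over the trailing 60 seconds. The function name and the representation of requests as a list of arrival timestamps are assumptions made for the example.

```python
from typing import Iterable


def average_rps(request_timestamps_s: Iterable[float], now_s: float,
                window_s: float = 60.0) -> float:
    """Average requests per second over the trailing window ending at now_s."""
    recent = [t for t in request_timestamps_s if now_s - window_s < t <= now_s]
    return len(recent) / window_s


# Hypothetical traffic: 10 requests arriving in each of seconds 1..60, then none.
arrivals = [float(second) for second in range(1, 61) for _ in range(10)]

# Recomputed every 20 seconds, always looking back 60 seconds.
for now in (60, 80, 100, 120):
    print(now, average_rps(arrivals, now))
```

Because the window is longer than the calculation interval, the value decays gradually after traffic stops rather than dropping to zero at once.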

Latency

Every 20 seconds, Control Plane calculates latency from the workload's response times for the requests it received, averaged over the past 60 seconds at the specified percentile (p50, p75, or p99).
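
To make the idea of a latency percentile concrete, the Python sketch below reads a percentile from a set of response times. It uses the nearest-rank method, which is an assumption chosen for simplicity; the method Control Plane actually uses is not stated here, and the function names are hypothetical.

```python
from typing import Sequence

PERCENTILES = {"p50": 50, "p75": 75, "p99": 99}


def nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, giving a 1-based rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def latency_metric(latencies_ms: Sequence[float], metric_percentile: str = "p50") -> float:
    return nearest_rank(latencies_ms, PERCENTILES[metric_percentile])


samples_ms = [120, 80, 95, 300]  # hypothetical response times in milliseconds
print(latency_metric(samples_ms, "p50"))
print(latency_metric(samples_ms, "p99"))
```

With a skewed sample like this one, p50 and p99 differ widely, which is why the choice of percentile matters when picking a target.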

CPU

Every 15 seconds, Control Plane calculates the average CPU usage over the past 15 seconds.

Memory

Every 15 seconds, Control Plane calculates the average memory usage over the past 15 seconds.

Autoscaling Serverless Workloads

The current capacity is evaluated every 2 seconds and compared against the scale target. To avoid rapid changes, requests completed are averaged over the previous 60 seconds. If a scaling decision results in a scale increase above 200%, scale-down decisions are suspended and the averaging window shrinks to 6 seconds for the next 60 seconds. This allows for rapid scaling when a burst of traffic is detected.
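
The switch between the two averaging windows can be sketched as follows. This Python example is a simplified reading of the paragraph above, not the actual autoscaler: all names are hypothetical, and it assumes that "a scale increase above 200%" means the desired replica count is more than twice the current count.

```python
from typing import Optional

STABLE_WINDOW_S = 60   # normal averaging window
BURST_WINDOW_S = 6     # shorter window used after a burst is detected
BURST_DURATION_S = 60  # how long the shorter window stays in effect
BURST_RATIO = 2.0      # assumption: "above 200%" means desired > 2 x current


class WindowSelector:
    def __init__(self) -> None:
        self.burst_started_at: Optional[float] = None

    def in_burst(self, now_s: float) -> bool:
        return (self.burst_started_at is not None
                and now_s - self.burst_started_at < BURST_DURATION_S)

    def observe(self, now_s: float, current: int, desired: int) -> int:
        """Record a scaling decision and return the averaging window to use."""
        if current > 0 and desired / current > BURST_RATIO:
            self.burst_started_at = now_s
        return BURST_WINDOW_S if self.in_burst(now_s) else STABLE_WINDOW_S

    def allow_scale_down(self, now_s: float) -> bool:
        # Scale-down decisions are suspended while the burst window is active.
        return not self.in_burst(now_s)


selector = WindowSelector()
print(selector.observe(0, current=2, desired=3))  # modest increase
print(selector.observe(2, current=2, desired=5))  # more than double
print(selector.allow_scale_down(10))
```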

Special considerations for the latency scaling strategy

Because request latency is represented as a distribution, when using the latency scaling strategy, you must choose a metric percentile by setting the autoscaling.metricPercentile property to one of the following values:

  • p50
  • p75
  • p99

Options

  • Minimum Scale (autoscaling.minScale)
    • The minimum allowed number of replicas.
    • Control Plane can scale the workload down to 0 when there is no traffic and scale up immediately to fulfill new requests.
    • Must be between 0 and Maximum Scale inclusive.
  • Maximum Scale (autoscaling.maxScale)
    • The maximum allowed number of replicas.
  • Scale to Zero Delay (autoscaling.scaleToZeroDelay)
    • The amount of time (in seconds) with no requests received before a workload is scaled down to 0.
    • Must be between 30 and 3600 inclusive.
  • Maximum Concurrency (autoscaling.maxConcurrency)
    • A hard maximum for the number of concurrent requests allowed to a replica.
    • If no replica has capacity for a request, the request is queued and delivered as soon as a replica with capacity becomes available.
    • Capacity can become available when requests complete or when a new replica is added by scale-out.
    • A value of 0 allows all requests.
    • Must be between 0 and 30000 inclusive.
  • Metric (autoscaling.metric)
    • Controls the metric which will be used for scaling decisions. The goal is to maintain the target across all replicas of a deployment. Options include:
      • concurrency: Uses the number of concurrent requests for the target.
      • cpu: Uses % processor time for the target.
      • memory: Uses memory in Mi for the target.
      • rps: Uses requests per second for the target.
      • latency: Uses the average request response time for the target. Not available for Serverless workloads.
  • Multi Metric (autoscaling.multi)
    • Allows specifying multiple metrics for autoscaling decisions.
    • Each metric must be unique and is defined with a target value.
    • Not available for Serverless workloads.
  • Metric Percentile (autoscaling.metricPercentile)
    • The latency metric is represented as a distribution, so a percentile within the distribution must be chosen to be used with the target.
    • The default value is p50.
    • Control Plane supports p50, p75, and p99 metric percentiles.
    • For example, if the percentile is p50 and the target is 100, when the 50th percentile of latency is greater than 100ms for the workload, a scale-up decision will be made.
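
The ranges listed in these options can be checked before a spec is applied. The Python sketch below is a hypothetical helper, not part of any Control Plane tooling; it only encodes the limits stated in this list and treats the autoscaling block as a plain dictionary.

```python
from typing import Any, Dict, List


def validate_autoscaling(opts: Dict[str, Any]) -> List[str]:
    """Return a list of range violations for an autoscaling options block."""
    errors = []
    min_scale = opts.get("minScale")
    max_scale = opts.get("maxScale")
    if min_scale is not None and max_scale is not None:
        if not 0 <= min_scale <= max_scale:
            errors.append("minScale must be between 0 and maxScale inclusive")
    delay = opts.get("scaleToZeroDelay")
    if delay is not None and not 30 <= delay <= 3600:
        errors.append("scaleToZeroDelay must be between 30 and 3600 inclusive")
    max_concurrency = opts.get("maxConcurrency")
    if max_concurrency is not None and not 0 <= max_concurrency <= 30000:
        errors.append("maxConcurrency must be between 0 and 30000 inclusive")
    percentile = opts.get("metricPercentile")
    if percentile is not None and percentile not in ("p50", "p75", "p99"):
        errors.append("metricPercentile must be one of p50, p75, p99")
    return errors


print(validate_autoscaling({"metric": "latency", "target": 100, "minScale": 1,
                            "maxScale": 5, "metricPercentile": "p50"}))
print(validate_autoscaling({"metric": "rps", "target": 100, "minScale": 5,
                            "maxScale": 1}))
```

The first call should report no violations; the second has minScale above maxScale and should report one.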

Capacity AI is not available if CPU Utilization is selected, because CPU resources cannot be allocated dynamically while replicas are being scaled based on CPU usage. Additionally, Capacity AI cannot be enabled when Multi Metric is set.

Examples

Concurrency

spec:
  defaultOptions:
    autoscaling:
      metric: concurrency
      maxConcurrency: 0
      maxScale: 5
      minScale: 1
      scaleToZeroDelay: 300
      target: 100

CPU

spec:
  defaultOptions:
    autoscaling:
      metric: cpu
      target: 80
      minScale: 1
      maxScale: 5

Memory

spec:
  defaultOptions:
    autoscaling:
      metric: memory
      target: 80
      minScale: 1
      maxScale: 5

RPS

spec:
  defaultOptions:
    autoscaling:
      metric: rps
      target: 100
      minScale: 1
      maxScale: 5

Latency

spec:
  defaultOptions:
    autoscaling:
      metric: latency
      target: 100
      minScale: 1
      maxScale: 5
      metricPercentile: p50

Multi Metric

spec:
  defaultOptions:
    autoscaling:
      minScale: 1
      maxScale: 5
      multi:
        - metric: cpu
          target: 80
        - metric: memory
          target: 80