Overview

Workload auto-scaling is configured by setting a strategy, a target value, and, in some cases, a metric percentile. Together these values determine when the workload scales up and down.

As the system scales up, traffic will not be sent to the new replicas until they pass the readiness probe, if one is configured. If no probe is configured, or if the probe is a basic TCP port check, requests may reach the new replicas before they are ready to respond, which can cause delays or errors for end-user traffic.

You can configure autoscaling in the default options for a workload (defaultOptions) and in any of the location-specific options.

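For illustration, here is a minimal sketch of where these settings might live in a workload spec, expressed as a Python dict. The defaultOptions and autoscaling.* field names appear on this page; the localOptions, location, and target names are assumptions about the spec layout and may differ from the actual schema.

```python
# Hypothetical workload spec fragment (Python dict form). Field names other
# than defaultOptions and the autoscaling.* settings documented on this page
# are assumptions, not confirmed schema.
workload_spec = {
    "defaultOptions": {
        "autoscaling": {
            "metric": "concurrency",  # scaling strategy (see Scaling Strategies below)
            "target": 100,            # target value; field name is an assumption
            "minScale": 1,
            "maxScale": 5,
        },
    },
    "localOptions": [  # location-specific overrides; field name is an assumption
        {
            "location": "aws-us-west-2",
            "autoscaling": {"metric": "cpu", "target": 80},
        },
    ],
}
```
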
Scaling Strategies

The scaling strategy is set using autoscaling.metric.

  • Disabled (disabled)
    • Scaling will be disabled.
  • Concurrent Requests Quantity (concurrency)
    • The average number of requests executing at a given point in time across all the replicas, calculated as (requests * requestDuration)/(timePeriod * replicas).
    • Example: A workload with 5 replicas received 1000 requests with an average response time of 50ms (0.05 seconds) over a 1 second period. The concurrent requests metric for that period is (1000 * 0.05)/(1 * 5) = 10. (See the sketch after this list.)
  • Requests Per Second (rps)
    • The raw number of requests received by a workload each second divided by the number of replicas. Requests are counted even if they have not yet been completed.
  • Percentage of CPU Utilization (cpu)
    • The percentage of CPU consumed by system and user processes in the container(s), relative to the amount specified in the container's cpu field.
  • Request Latency (latency)
    • The request response time (at a configurable percentile) in milliseconds, averaged across all replicas.

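To make the formula concrete, here is a minimal sketch of the concurrent-requests calculation. The function name and signature are purely illustrative and not part of any Control Plane API.

```python
def concurrent_requests(requests: int, avg_duration_s: float,
                        period_s: float, replicas: int) -> float:
    """Average number of requests executing at once across all replicas:
    (requests * requestDuration) / (timePeriod * replicas)."""
    return (requests * avg_duration_s) / (period_s * replicas)


# The worked example above: 1000 requests, 50ms (0.05s) average response
# time, over a 1 second period, across 5 replicas.
print(concurrent_requests(1000, 0.05, 1.0, 5))  # -> 10.0
```
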
Caveats when choosing a workload type and a scaling strategy:

  • Serverless workloads cannot use the latency scaling strategy
  • Standard workloads cannot use the concurrency scaling strategy

The scale to zero functionality is only available for Serverless workloads, and only when using the rps or concurrency scaling strategies.

Autoscaling Standard Workloads

For standard workloads, Control Plane runs two asynchronous control loops:

  1. The Scaling Decision Loop
  2. The Metric Calculation Loop

Because of this asynchronous structure, autoscaling decisions may be based on a metric value that is as old as the metric’s collection interval (usually 20 seconds).

The Scaling Decision Loop

A workload’s scale is evaluated every 15 seconds, using the value most recently calculated by the [metric calculation loop](#standard-metric-calculations). Each time an evaluation is made, the chosen metric is averaged across all available replicas and compared against the scale target. When scaling up, Control Plane does not enforce a stabilization window; the number of replicas increases as soon as the scaling algorithm dictates. When scaling down, a stabilization window of 5 minutes is used: the highest number of replicas recommended by the scaling algorithm within the past 5 minutes is applied to the running workload.

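The following is a rough sketch of the scale-down stabilization behavior described above, not Control Plane's actual implementation; the class and method names are illustrative.

```python
from collections import deque


class DownscaleStabilizer:
    """Scale-ups apply immediately; scale-downs use the highest replica
    count recommended within the stabilization window (5 minutes)."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self.history: deque[tuple[float, int]] = deque()  # (timestamp, recommendation)

    def decide(self, now: float, recommended: int, current: int) -> int:
        self.history.append((now, recommended))
        # Drop recommendations older than the stabilization window.
        while self.history and now - self.history[0][0] > self.window_s:
            self.history.popleft()
        if recommended >= current:
            return recommended  # scale up: no stabilization window is enforced
        return max(r for _, r in self.history)  # scale down: most conservative recent value
```

Called every 15 seconds with the latest recommendation, this keeps a workload from shrinking further than the highest recommendation seen in the past 5 minutes allows.
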
The Metric Calculation Loop

Requests per Second

Every 20 seconds, Control Plane calculates the average number of requests per second over the past 60 seconds.

Latency

Every 20 seconds, Control Plane calculates request latency at the specified percentile (p50, p75, or p99), averaged over the past 60 seconds. Latency is measured as the workload's response time from the moment a request is received.

CPU

Every 15 seconds, Control Plane calculates the average CPU usage over the past 15 seconds.

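As an illustration of the windowed averaging described above, here is a rough sketch of a requests-per-second calculation (samples kept for the past 60 seconds, averaged, and divided by the replica count). The class and method names are illustrative only.

```python
from collections import deque


class RollingRps:
    """Average requests per second over a sliding window, per replica."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.samples: deque[tuple[float, int]] = deque()  # (timestamp, request count)

    def record(self, timestamp: float, request_count: int) -> None:
        self.samples.append((timestamp, request_count))
        # Keep only samples inside the 60-second window.
        while self.samples and timestamp - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def value(self, replicas: int) -> float:
        total = sum(count for _, count in self.samples)
        return (total / self.window_s) / max(replicas, 1)
```
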
Autoscaling Serverless Workloads

The current capacity is evaluated every 2 seconds and compared against the scale target. To avoid rapid changes, requests completed are averaged over the previous 60 seconds. If a scaling decision would result in a scale increase above 200%, scale-down decisions are suspended and the metric is instead averaged over 6 seconds for the next 60 seconds. This allows for rapid scaling when a burst of traffic is detected.

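A heavily simplified sketch of the burst behavior described above follows. The interpretation of "a scale increase above 200%" as the proposed scale being more than double the current scale is an assumption, as is expressing the logic as a single window-selection function.

```python
def averaging_window_s(current_scale: int, proposed_scale: int) -> int:
    """Pick the metric-averaging window: 60 seconds in steady state, or a
    short 6-second window when a burst is detected (assumed here to mean the
    proposed scale is more than double the current scale). While the short
    window is in effect, scale-down decisions are suspended."""
    if current_scale > 0 and proposed_scale > 2 * current_scale:
        return 6
    return 60
```
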
Special considerations for the latency scaling strategy

Because request latency is represented as a distribution, when using the latency scaling strategy, you must choose a metric percentile by setting the autoscaling.metricPercentile property to one of the following values:

  • p50
  • p75
  • p99

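To make the percentile definition concrete, here is a small nearest-rank sketch; it is a generic calculation, not Control Plane's internal estimator.

```python
import math


def latency_percentile(latencies_ms: list[float], percentile: int) -> float:
    """Nearest-rank percentile: the value below which roughly `percentile`
    percent of the observed latencies fall."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


samples = [80.0, 90.0, 95.0, 110.0, 120.0, 130.0, 150.0, 200.0, 250.0, 300.0]
for p in (50, 75, 99):
    print(f"p{p} = {latency_percentile(samples, p)} ms")  # p50 = 120.0 ms, etc.
```
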
Options

  • Minimum Scale (autoscaling.minScale)
    • The minimum allowed number of replicas. When set to 0, a workload can scale down to 0 replicas when there is no traffic and scale up immediately to fulfill new requests. (Must be between 0 and Maximum Scale, inclusive.)
  • Maximum Scale (autoscaling.maxScale)
    • The maximum allowed number of replicas.
  • Scale to Zero Delay (autoscaling.scaleToZeroDelay)
    • The amount of time (in seconds) with no requests received before a workload is scaled down to 0. (Must be between 30 and 3600, inclusive.)
  • Maximum Concurrency (autoscaling.maxConcurrency)
    • The maximum allowed number of requests actively running against a single replica. If no replica is available that is processing fewer than the configured maximum number of concurrent requests, the system queues the request and waits for a replica to become available; this does not trigger a scaling decision. The purpose of this setting is to prevent a single replica from taking more traffic than it is designed to process. (See the sketch after this list.)
    • If, for example, Maximum Concurrency = 100, the scaling strategy is concurrency, and the target is 100, the results would not be desirable for most end-user traffic: by the time the system decides to scale up, existing replicas are already at their concurrency limit, so requests are queued until an existing request completes or the new replica becomes available.
    • Must be between 0 and 1000 inclusive.
  • Metric Percentile (autoscaling.metricPercentile)
    • The nth percentile is a value below which n percent of the values in a distribution lie. For example, if the 50th percentile of a latency distribution is 200ms, 50% of requests took less than 200ms.
    • This may only be set when using the latency scaling strategy. The default value is p50.
    • Control Plane supports the p50, p75, and p99 metric percentiles.

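As a simplified sketch of the Maximum Concurrency behavior described above: the routing helper below is illustrative, and picking the least-loaded replica is an assumption made for the example, not documented behavior.

```python
def route_request(in_flight: dict[str, int], max_concurrency: int) -> str | None:
    """Choose a replica processing fewer than max_concurrency requests.
    If none exists, return None so the request waits in a queue for a
    replica to free up; queueing alone does not trigger a scaling decision."""
    eligible = [name for name, active in in_flight.items() if active < max_concurrency]
    if not eligible:
        return None  # all replicas are at their concurrency cap; queue the request
    return min(eligible, key=lambda name: in_flight[name])  # least-loaded (assumption)
```
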
Capacity AI is not available if the CPU Utilization scaling strategy is selected, because CPU resources cannot be dynamically allocated while the number of replicas is being scaled based on their CPU usage.