Autoscaling
Overview
Workload autoscaling is configured by setting a strategy, a target value, and in some cases a metric percentile. Together these values determine when the workload will scale up and down.
As the system scales up, traffic will not be sent to the new replicas until they pass their readiness probe, if one is configured. If no probe is configured, or if it is a basic TCP port check, requests will hit the new replicas before they are ready to respond. This can cause delays or errors for end-user traffic.
You can configure autoscaling in the default options for a workload (`defaultOptions`) and in any of the location-specific options.
Scaling Strategies
The scaling strategy is set using `autoscaling.metric`.
- Disabled (`disabled`) - Scaling will be disabled.
- Concurrent Requests Quantity (`concurrency`) - The average number of requests executing at a given point in time across all the replicas: `(requests * requestDuration) / (timePeriod * replicas)`. (See the sketch after this list.)
  - Example: A workload with 5 replicas received 1000 requests with an average response time of 50ms (0.05 seconds) over a 1 second period. The concurrent requests metric for that period is `(1000 * 0.05) / (1 * 5) = 10`.
- Requests Per Second (`rps`) - The raw number of requests received by the workload each second, divided by the number of replicas. Requests are counted even if they have not yet been completed.
- Percentage of CPU Utilization (`cpu`) - The percentage of CPU consumed by system and user processes in the container(s), as specified in the container's `cpu` field.
- Request Latency (`latency`) - The request response time (at a configurable percentile) in milliseconds, averaged across all replicas.
Caveats when choosing a workload type and a scaling strategy:
- Serverless workloads cannot use the `latency` scaling strategy.
- Standard workloads cannot use the `concurrency` scaling strategy.
The scale to zero functionality is only available for Serverless workloads, and only when using the `rps` or `concurrency` scaling strategies.
Autoscaling Standard Workloads
For standard workloads, Control Plane runs two asynchronous control loops:
- The Scaling Decision Loop
- The Metric Calculation Loop
Because of this asynchronous structure, autoscaling decisions may be made based on a metric value that is as old as the metric’s collection interval (usually 20 seconds).
The Scaling Decision Loop
A workload’s scale is evaluated every 15 seconds, using the value most recently calculated by the [Metric Calculation Loop](#the-metric-calculation-loop). Each time an evaluation is made, the chosen metric is averaged across all available replicas and compared against the scale target. When scaling up, Control Plane does not enforce a stabilization window; the number of pods increases as soon as the scaling algorithm dictates. When scaling down, a stabilization window of 5 minutes is used: the highest number of pods recommended by the scaling algorithm within the past 5 minutes is applied to the running workload.
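The scale-down stabilization window can be pictured with a small sketch. This is an illustration of the behavior described above under assumed names, not Control Plane’s implementation:

```python
# Illustrative sketch: keep 5 minutes of recommendations and apply the
# highest one, so scale-up takes effect immediately while scale-down
# waits out the stabilization window. Names are hypothetical.
from collections import deque

WINDOW_S = 300  # 5-minute scale-down stabilization window

recommendations = deque()  # (timestamp, recommended_pods), one entry per 15s evaluation

def stabilized_replicas(recommended_pods: int, now: float) -> int:
    recommendations.append((now, recommended_pods))
    # Drop recommendations older than the stabilization window.
    while recommendations and recommendations[0][0] < now - WINDOW_S:
        recommendations.popleft()
    # The max lets a higher (scale-up) recommendation win instantly, while
    # a lower (scale-down) one only wins after 5 minutes of lower values.
    return max(pods for _, pods in recommendations)
```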
The Metric Calculation Loop
Requests per Second
Every 20 seconds, Control Plane calculates the average number of requests per second over the past 60 seconds.
Latency
Every 20 seconds, Control Plane calculates request latency at the specified percentile (p50, p75, or p99), averaged over the past 60 seconds of response times.
CPU
Every 15 seconds, Control Plane calculates the average CPU usage over the past 15 seconds.
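As a rough illustration of these rolling averages, the requests-per-second calculation might look like the following sketch (the sampling mechanics are assumed):

```python
# Hypothetical sketch: per-second request counts kept in a 60-second
# window, re-averaged on each 20-second metric calculation tick.
from collections import deque

counts = deque(maxlen=60)  # one request count per second, last 60 seconds

def record_second(request_count: int) -> None:
    counts.append(request_count)

def average_rps(replicas: int) -> float:
    """Average requests per second over the past 60 seconds, per replica."""
    if not counts or replicas == 0:
        return 0.0
    return sum(counts) / len(counts) / replicas
```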
Autoscaling Serverless Workloads
The current capacity is evaluated every 2 seconds and compared against the scale target. To avoid rapid changes, the metric is averaged over the previous 60 seconds. If a scaling decision would result in a scale increase above 200%, the autoscaler suspends scale-down decisions and averages over a 6-second window for the next 60 seconds. This allows rapid scaling when a burst of traffic is detected.
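One way to picture this burst behavior is the sketch below. The threshold arithmetic and names are our assumptions; in particular, how an increase "above 200%" is measured is not specified here:

```python
import math

# Illustrative only: a 2-second evaluation that normally uses the
# 60-second average, but switches to a 6-second average and suspends
# scale-down for 60 seconds after a burst is detected.

def evaluate(avg_60s: float, avg_6s: float, current: int,
             target: float, burst_until: float, now: float) -> tuple[int, float]:
    stable = math.ceil(avg_60s / target)
    if stable > 2 * current:        # scale increase above 200% (interpretation assumed)
        burst_until = now + 60      # use the short window for the next 60 seconds
    if now < burst_until:
        burst = math.ceil(avg_6s / target)
        return max(burst, current), burst_until  # no scale-down during a burst
    return stable, burst_until
```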
Special considerations for the `latency` scaling strategy
Because request latency is represented as a distribution, when using the `latency` scaling strategy you must choose a metric percentile by setting the `autoscaling.metricPercentile` property to one of the following values:
- `p50`
- `p75`
- `p99`
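As a quick illustration of what a percentile means here, a nearest-rank computation over made-up latency samples:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Value below which `pct` percent of samples lie (nearest-rank method)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 18, 25, 40, 55, 70, 90, 120, 200, 450]  # hypothetical samples
for p in (50, 75, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
# p50 = 55 ms, p75 = 120 ms, p99 = 450 ms
```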
Options
- Minimum Scale (`autoscaling.minScale`) - The minimum allowed number of replicas. A workload can scale down to 0 when there is no traffic and scale up immediately to fulfill new requests. (Must be between 0 and Maximum Scale, inclusive.)
- Maximum Scale (`autoscaling.maxScale`) - The maximum allowed number of replicas.
- Scale to Zero Delay (`autoscaling.scaleToZeroDelay`) - The amount of time (in seconds) with no requests received before a workload is scaled down to 0. (Must be between 30 and 3600, inclusive.)
- Maximum Concurrency (`autoscaling.maxConcurrency`) - The maximum number of requests allowed to run concurrently against a single replica. If no replica is processing fewer than the configured maximum number of concurrent requests, the system will queue the request and wait for a replica to become available; this does not trigger a scaling decision. The purpose of this setting is to prevent a single replica from taking more traffic than it is designed to process.
  - If, for example, Max Concurrency = 100, Scaling Strategy = ‘Concurrent Requests’, and Target = 100, the results would not be desirable for most end-user traffic: when the system decides to scale up, it will queue requests until an existing request completes or the new replica becomes available.
  - Must be between 0 and 1000, inclusive.
- Metric Percentile (`autoscaling.metricPercentile`) - The nth percentile is the value below which n percent of the values in a distribution lie.
  - This may only be set when using the `latency` scaling strategy. The default value is `p50`.
  - e.g. If the 50th percentile of a latency distribution is 200ms, 50% of requests took less than 200ms.
  - Control Plane supports the p50, p75, and p99 metric percentiles.
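Putting the options together, here is a hedged sketch written as a plain Python dict that mirrors the dotted option paths above; the surrounding manifest structure and the `target` field name are assumptions, so check your own workload spec before relying on it:

```python
# Sketch of a workload's defaultOptions.autoscaling block for a Serverless
# workload scaling on rps. Field names follow the dotted paths above;
# "target" is assumed, and the enclosing manifest format may differ.
default_options = {
    "autoscaling": {
        "metric": "rps",          # disabled | concurrency | rps | cpu | latency
        "target": 100,            # scale target for the chosen metric (name assumed)
        "minScale": 0,            # 0..maxScale; 0 allows scale to zero (Serverless only)
        "maxScale": 10,
        "scaleToZeroDelay": 300,  # seconds with no requests; 30..3600
        "maxConcurrency": 500,    # per-replica request cap; 0..1000
        # "metricPercentile": "p75",  # only valid with the latency strategy
    }
}
```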
Capacity AI is not available when CPU Utilization is the selected scaling strategy, because CPU resources cannot be dynamically allocated while replicas are scaled based on their CPU usage.