Workload auto-scaling is configured by setting a strategy, a target value, and, in some cases, a metric percentile. Together these values determine when the workload will scale up and down.
As the system scales up, traffic is not sent to new replicas until they pass the readiness probe, if one is configured. If no probe is configured, or if it is a basic TCP port check, requests will reach the new replicas before they are ready to respond, which can cause delays or errors for end-user traffic.
You can configure autoscaling in the default options for a workload (`defaultOptions`) and in any of the location-specific options.
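For orientation, here is a minimal sketch of where these settings live in a workload manifest. The `defaultOptions` and `autoscaling` paths come from this page; the `spec` wrapper, the `localOptions` block for location-specific overrides, and all values are assumptions for illustration.

```yaml
# Minimal sketch of autoscaling placement (manifest shape assumed).
spec:
  defaultOptions:              # applies to every location unless overridden
    autoscaling:
      metric: cpu              # scaling strategy (see the strategies below)
      target: 80               # 'target' is an assumed name for the target value
  localOptions:                # assumed name for location-specific options
    - location: aws-us-west-2  # illustrative location
      autoscaling:
        metric: cpu
        target: 60             # a tighter target for this location only
```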
The scaling strategy is set using `autoscaling.metric`. The available strategies are:

- `disabled`
- `concurrency`: targets the number of concurrent requests per replica, calculated as `(requests * requestDuration) / (timePeriod * replicas)`. For example, 1,000 requests averaging 0.05 seconds each over a 1-second period across 5 replicas yields `(1000 * 0.05) / (1 * 5) = 10`.
- `rps`
- `cpu`
- `latency`
- `memory`
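To tie the worked example above to configuration, a concurrency-based setup that scales toward 10 concurrent requests per replica might look like this sketch (`autoscaling.metric` is documented on this page; `target` is an assumed property name):

```yaml
# Sketch: keep each replica near, but below, 10 concurrent requests.
autoscaling:
  metric: concurrency  # strategy from the list above
  target: 10           # assumed property name for the target value
```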
Caveats when choosing a workload type and a scaling strategy:

- Serverless workloads cannot use the `latency` scaling strategy or any of the Multi Metric scaling strategies.
- Standard workloads cannot use the `concurrency` scaling strategy.
- The scale-to-zero functionality is only available for Serverless workloads, and only when using the `rps` or `concurrency` scaling strategies.
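As a concrete illustration of the scale-to-zero caveat, a Serverless-style configuration might look like the following sketch. Here, a minimum scale of 0 is assumed to be what permits scaling to zero, and `target` is an assumed property name; the delay value is illustrative.

```yaml
# Sketch: a Serverless workload that can scale to zero when idle.
autoscaling:
  metric: rps            # scale to zero requires rps or concurrency
  target: 100            # assumed property name for the target value
  minScale: 0            # assumed to permit scaling to zero
  scaleToZeroDelay: 300  # seconds without traffic before scaling to zero (illustrative)
```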
The target is the value that the system will try to keep the metric near but below.
For standard workloads, Control Plane runs two asynchronous control loops:

- A scale evaluation loop, which compares the most recently calculated metric value against the scale target and adjusts the replica count.
- A metric calculation loop, which periodically recalculates the value of the chosen metric.
Because of this asynchronous structure, autoscaling decisions may be based on a metric value that is as old as the metric’s collection interval (usually 20 seconds).
A workload’s scale is evaluated every 15 seconds, using the value most recently calculated by the [metric calculation loop](#standard-metric-calculations). Each time an evaluation is made, the chosen metric is averaged across all available replicas and compared against the scale target. When scaling up, Control Plane does not enforce a stabilization window; the number of pods increases as soon as the scaling algorithm dictates. When scaling down, a stabilization window of 5 minutes is used: the highest number of pods recommended by the scaling algorithm within the past 5 minutes is applied to the running workload.
Standard metric calculations

- `rps`: Every 20 seconds, Control Plane calculates the average number of requests per second over the past 60 seconds.
- `latency`: Every 20 seconds, Control Plane calculates latency from the workload’s response times, averaged over the past 60 seconds at the specified percentile (p50, p75, or p99).
- `cpu`: Every 15 seconds, Control Plane calculates the average CPU usage over the past 15 seconds.
- `memory`: Every 15 seconds, Control Plane calculates the average memory usage over the past 15 seconds.
For Serverless workloads, the current capacity is evaluated every 2 seconds and compared against the scale target. To avoid rapid changes, the evaluation averages requests completed over the previous 60 seconds. If a scaling decision results in a scale increase above 200%, scale-down decisions are suspended and the average is instead taken over 6-second windows for the next 60 seconds. This allows rapid scaling when a burst of traffic is detected.
Special considerations for the `latency` scaling strategy

Because request latency is represented as a distribution, when using the `latency` scaling strategy you must choose a metric percentile by setting the `autoscaling.metricPercentile` property to one of the following values:

- `p50`
- `p75`
- `p99`
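For example, a latency-based configuration targeting the 75th percentile might look like this sketch. The `autoscaling.metric` and `autoscaling.metricPercentile` paths are documented on this page; `target` and its unit (assumed here to be milliseconds) are assumptions.

```yaml
# Sketch: scale when p75 response time approaches the target.
autoscaling:
  metric: latency        # not available for Serverless workloads
  metricPercentile: p75  # one of p50, p75, p99
  target: 200            # assumed property name; unit assumed to be milliseconds
```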
The following autoscaling properties are available:

- Minimum Scale (`autoscaling.minScale`): The lowest number of replicas the workload will scale down to. Must be between 0 and the Maximum Scale, inclusive.
- Maximum Scale (`autoscaling.maxScale`): The highest number of replicas the workload will scale up to.
- Scale to Zero Delay (`autoscaling.scaleToZeroDelay`): The number of seconds without traffic before a Serverless workload is scaled down to zero replicas.
- Maximum Concurrency (`autoscaling.maxConcurrency`): A hard limit on the number of concurrent requests allowed per replica.
- Metric (`autoscaling.metric`): The scaling strategy. One of:
  - `concurrency`: Uses the number of concurrent requests for the target.
  - `cpu`: Uses % processor time for the target.
  - `memory`: Uses memory in Mi for the target.
  - `rps`: Uses requests per second for the target.
  - `latency`: Uses the average request response time for the target. Not available for Serverless workloads.
- Multi Metric (`autoscaling.multi`): Configures scaling on more than one metric, each with its own target.
- Metric Percentile (`autoscaling.metricPercentile`): The percentile used by the `latency` strategy. One of `p50`, `p75`, or `p99`. Defaults to `p50`.

Capacity AI is not available when CPU Utilization (`cpu`) is selected, because CPU resources cannot be allocated dynamically while replicas are scaled based on their CPU usage. Additionally, Capacity AI cannot be enabled when Multi Metric scaling is set.
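Putting the properties together, here is a hedged sketch of a complete autoscaling block for a standard workload. The property names under `autoscaling` are the ones documented above; `target`, the `defaultOptions` placement, and all values are illustrative assumptions.

```yaml
# Sketch: a standard workload scaling on CPU between 2 and 10 replicas.
defaultOptions:
  autoscaling:
    metric: cpu          # one of: disabled, concurrency, cpu, memory, rps, latency
    target: 80           # assumed property name; % processor time for the cpu strategy
    minScale: 2          # between 0 and maxScale, inclusive
    maxScale: 10
    maxConcurrency: 500  # hard per-replica limit on concurrent requests
```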