Overview

Workload auto-scaling is configured by setting a strategy, a target value, and, in some cases, a metric percentile. Together these values determine when the workload scales up and down.

As the system scales up, traffic will not be sent to the new replicas until they pass the readiness probe, if one is configured. If no probe is configured, or if the probe is a basic TCP port check, requests may reach the new replicas before they are ready to respond, which can cause delays or errors for end-user traffic.

You can configure autoscaling in the default options for a workload (defaultOptions) and in any of the location-specific options.

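For illustration, here is a minimal sketch of where these settings might live in a workload spec, expressed as a Python dict. The defaultOptions and autoscaling.* field names appear on this page; the localOptions, location, and target names are assumptions about the spec layout and may differ from the actual schema.

```python
# Hypothetical workload spec fragment (Python dict form). Field names other
# than defaultOptions and the autoscaling.* settings documented on this page
# are assumptions, not confirmed schema.
workload_spec = {
    "defaultOptions": {
        "autoscaling": {
            "metric": "concurrency",  # scaling strategy (see Scaling Strategies below)
            "target": 100,            # target value; field name is an assumption
            "minScale": 1,
            "maxScale": 5,
        },
    },
    "localOptions": [  # location-specific overrides; field name is an assumption
        {
            "location": "aws-us-west-2",
            "autoscaling": {"metric": "cpu", "target": 80},
        },
    ],
}
```
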
Scaling Strategies

The scaling strategy is set using autoscaling.metric.

  • Disabled (disabled)
    • Scaling will be disabled.
  • Concurrent Requests Quantity (concurrency)
    • The average number of requests executing at a given point in time across all the replicas, calculated as (requests * requestDuration)/(timePeriod * replicas).
    • Example: A workload with 5 replicas received 1000 requests with an average response time of 50ms (0.05 seconds) over a 1 second period. The concurrent requests metric for that period is (1000 * 0.05)/(1 * 5) = 10. (See the sketch after this list.)
  • Requests Per Second (rps)
    • The raw number of requests received by a workload each second divided by the number of replicas. Requests are counted even if they have not yet been completed.
  • Percentage of CPU Utilization (cpu)
    • The percentage of CPU consumed by system and user processes in the container(s), relative to the amount specified in the container's cpu field.
  • Request Latency (latency)
    • The request response time (at a configurable percentile) in milliseconds, averaged across all replicas.

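To make the formula concrete, here is a minimal sketch of the concurrent-requests calculation. The function name and signature are purely illustrative and not part of any Control Plane API.

```python
def concurrent_requests(requests: int, avg_duration_s: float,
                        period_s: float, replicas: int) -> float:
    """Average number of requests executing at once across all replicas:
    (requests * requestDuration) / (timePeriod * replicas)."""
    return (requests * avg_duration_s) / (period_s * replicas)


# The worked example above: 1000 requests, 50ms (0.05s) average response
# time, over a 1 second period, across 5 replicas.
print(concurrent_requests(1000, 0.05, 1.0, 5))  # -> 10.0
```
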
Caveats when choosing a workload type and a scaling strategy:

  • Serverless workloads cannot use the latency scaling strategy
  • Standard workloads cannot use the concurrency scaling strategy

The scale to zero functionality is only available for Serverless workloads, and only when using the rps or concurrency scaling strategies.

Autoscaling Standard Workloads

For standard workloads, Control Plane runs two asynchronous control loops:

  1. The Scaling Decision Loop
  2. The Metric Calculation Loop

Because of this asynchronous structure, autoscaling decisions may be based on a metric value that is as old as the metric’s collection interval (usually 20 seconds).

The Scaling Decision Loop

A workload’s scale is evaluated every 15 seconds, using the value most recently calculated by the [metric calculation loop](#standard-metric-calculations). Each time an evaluation is made, the chosen metric is averaged across all available replicas and compared against the scale target. When scaling up, Control Plane does not enforce a stabilization window; the number of replicas increases as soon as the scaling algorithm dictates. When scaling down, a stabilization window of 5 minutes is used: the highest number of replicas recommended by the scaling algorithm within the past 5 minutes is applied to the running workload.

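The following is a rough sketch of the scale-down stabilization behavior described above, not Control Plane's actual implementation; the class and method names are illustrative.

```python
from collections import deque


class DownscaleStabilizer:
    """Scale-ups apply immediately; scale-downs use the highest replica
    count recommended within the stabilization window (5 minutes)."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self.history: deque[tuple[float, int]] = deque()  # (timestamp, recommendation)

    def decide(self, now: float, recommended: int, current: int) -> int:
        self.history.append((now, recommended))
        # Drop recommendations older than the stabilization window.
        while self.history and now - self.history[0][0] > self.window_s:
            self.history.popleft()
        if recommended >= current:
            return recommended  # scale up: no stabilization window is enforced
        return max(r for _, r in self.history)  # scale down: most conservative recent value
```

Called every 15 seconds with the latest recommendation, this keeps a workload from shrinking further than the highest recommendation seen in the past 5 minutes allows.
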
The Metric Calculation Loop

Requests per Second

Every 20 seconds, Control Plane calculates the average number of requests per second over the past 60 seconds.

Latency

Every 20 seconds, Control Plane calculates request latency at the specified percentile (p50, p75, or p99), averaged over the past 60 seconds. Latency is measured as the workload's response time from the moment a request is received.

CPU

Every 15 seconds, Control Plane calculates the average CPU usage over the past 15 seconds.

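As an illustration of the windowed averaging described above, here is a rough sketch of a requests-per-second calculation (samples kept for the past 60 seconds, averaged, and divided by the replica count). The class and method names are illustrative only.

```python
from collections import deque


class RollingRps:
    """Average requests per second over a sliding window, per replica."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.samples: deque[tuple[float, int]] = deque()  # (timestamp, request count)

    def record(self, timestamp: float, request_count: int) -> None:
        self.samples.append((timestamp, request_count))
        # Keep only samples inside the 60-second window.
        while self.samples and timestamp - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def value(self, replicas: int) -> float:
        total = sum(count for _, count in self.samples)
        return (total / self.window_s) / max(replicas, 1)
```
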
Autoscaling Serverless Workloads

The current capacity is evaluated every 2 seconds and compared against the scale target. To avoid rapid changes, requests completed are averaged over the previous 60 seconds. If a scaling decision would result in a scale increase above 200%, scale-down decisions are suspended and the metric is instead averaged over 6 seconds for the next 60 seconds. This allows for rapid scaling when a burst of traffic is detected.

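A heavily simplified sketch of the burst behavior described above follows. The interpretation of "a scale increase above 200%" as the proposed scale being more than double the current scale is an assumption, as is expressing the logic as a single window-selection function.

```python
def averaging_window_s(current_scale: int, proposed_scale: int) -> int:
    """Pick the metric-averaging window: 60 seconds in steady state, or a
    short 6-second window when a burst is detected (assumed here to mean the
    proposed scale is more than double the current scale). While the short
    window is in effect, scale-down decisions are suspended."""
    if current_scale > 0 and proposed_scale > 2 * current_scale:
        return 6
    return 60
```
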
Special considerations for the latency scaling strategy

Because request latency is represented as a distribution, when using the latency scaling strategy, you must choose a metric percentile by setting the autoscaling.metricPercentile property to one of the following values:

  • p50
  • p75
  • p99

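To make the percentile definition concrete, here is a small nearest-rank sketch; it is a generic calculation, not Control Plane's internal estimator.

```python
import math


def latency_percentile(latencies_ms: list[float], percentile: int) -> float:
    """Nearest-rank percentile: the value below which roughly `percentile`
    percent of the observed latencies fall."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


samples = [80.0, 90.0, 95.0, 110.0, 120.0, 130.0, 150.0, 200.0, 250.0, 300.0]
for p in (50, 75, 99):
    print(f"p{p} = {latency_percentile(samples, p)} ms")  # p50 = 120.0 ms, etc.
```
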
Options

  • Minimum Scale (autoscaling.minScale)
    • The minimum allowed number of replicas. When set to 0, a workload can scale down to 0 replicas when there is no traffic and scale up immediately to fulfill new requests. (Must be between 0 and Maximum Scale, inclusive.)
  • Maximum Scale (autoscaling.maxScale)
    • The maximum allowed number of replicas.
  • Scale to Zero Delay (autoscaling.scaleToZeroDelay)
    • The amount of time (in seconds) with no requests received before a workload is scaled down to 0. (Must be between 30 and 3600, inclusive.)
  • Maximum Concurrency (autoscaling.maxConcurrency)
    • The maximum allowed number of requests actively running against a single replica. If no replica is available that is processing fewer than the configured maximum number of concurrent requests, the system queues the request and waits for a replica to become available; this does not trigger a scaling decision. The purpose of this setting is to prevent a single replica from taking more traffic than it is designed to process. (See the sketch after this list.)
    • If, for example, Maximum Concurrency = 100, the scaling strategy is concurrency, and the target is 100, the results would not be desirable for most end-user traffic: by the time the system decides to scale up, existing replicas are already at their concurrency limit, so requests are queued until an existing request completes or the new replica becomes available.
    • Must be between 0 and 1000 inclusive.
  • Metric Percentile (autoscaling.metricPercentile)
    • The nth percentile is a value below which n percent of the values in a distribution lie. For example, if the 50th percentile of a latency distribution is 200ms, 50% of requests took less than 200ms.
    • This may only be set when using the latency scaling strategy. The default value is p50.
    • Control Plane supports the p50, p75, and p99 metric percentiles.

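As a simplified sketch of the Maximum Concurrency behavior described above: the routing helper below is illustrative, and picking the least-loaded replica is an assumption made for the example, not documented behavior.

```python
def route_request(in_flight: dict[str, int], max_concurrency: int) -> str | None:
    """Choose a replica processing fewer than max_concurrency requests.
    If none exists, return None so the request waits in a queue for a
    replica to free up; queueing alone does not trigger a scaling decision."""
    eligible = [name for name, active in in_flight.items() if active < max_concurrency]
    if not eligible:
        return None  # all replicas are at their concurrency cap; queue the request
    return min(eligible, key=lambda name: in_flight[name])  # least-loaded (assumption)
```
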
Capacity AI is not available if the CPU Utilization scaling strategy is selected, because CPU resources cannot be dynamically allocated while the number of replicas is being scaled based on their CPU usage.