Kubernetes HPA: Auto-Scaling Pods Under Load

Manually adjusting replica counts during traffic spikes is both slow and error-prone. Kubernetes Horizontal Pod Autoscaler (HPA) automatically scales Pod count up or down based on CPU usage, memory consumption, or custom metrics. This guide covers HPA configuration from scratch, Metrics Server setup, and production best practices.

How Does HPA Work?

HPA queries Pod metrics from the Metrics API at regular intervals (default 15 seconds). It compares the current metric value against the target and calculates the required replica count. The formula is straightforward:

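💡 Formula: desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric)). For example, if 3 Pods are at 80% CPU and the target is 50%: ceil(3 * (80 / 50)) = ceil(4.8) = 5 Pods needed. HPA skips scaling when the ratio currentMetric / targetMetric is close enough to 1.0 (within a tolerance of 10% by default), which avoids reacting to small fluctuations.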

Metrics Server Setup

To scale on CPU or memory, HPA needs Metrics Server installed in the cluster. Metrics Server collects CPU and memory metrics from the kubelet on each node and exposes them through the resource Metrics API (metrics.k8s.io).

terminal

# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify installation
kubectl get deployment metrics-server -n kube-system

# Check node and pod metrics
kubectl top nodes
kubectl top pods

CPU-Based HPA Configuration

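The most common HPA scenario is scaling on CPU usage. In the following example, HPA adds Pods when average CPU utilization across the Pods rises above 60% of their requested CPU and removes Pods when it falls below that target, always staying between 2 and 20 replicas.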

hpa-cpu.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

⚠️ Important: A Utilization target is a percentage of each container's resources.requests.cpu, so the containers in the Deployment must define CPU requests. Without CPU requests, HPA cannot calculate utilization and will not scale on CPU.
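As a minimal sketch of what such a Deployment might look like, the manifest below sets CPU and memory requests on its container. The name web-app comes from the HPA example above; the image, labels, container name, and all resource values are hypothetical placeholders, not recommendations.

```yaml
# Sketch of a Deployment the HPA above could target.
# Image, labels, and resource values are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app                # must match scaleTargetRef.name in the HPA
spec:
  # replicas is left unset so the HPA owns the replica count
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web
        image: example.com/web-app:1.0   # hypothetical image
        resources:
          requests:
            cpu: 250m          # CPU utilization is measured against this value
            memory: 256Mi      # needed if the HPA also targets memory utilization
          limits:
            memory: 512Mi
```

With a 250m CPU request, a 60% utilization target corresponds to an average usage of about 150m per Pod.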

Scaling with Memory and Multiple Metrics

You can add memory usage as a target alongside CPU. When multiple metrics are defined, HPA calculates a desired replica count for each metric separately and uses the highest of them.

hpa-multi.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60

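The behavior section controls scaling speed. scaleDown.stabilizationWindowSeconds: 300 makes HPA use the highest replica count recommended during the last 5 minutes, preventing premature scale-downs, and the Percent policy removes at most 10% of the current Pods per minute. scaleUp is configured more aggressively than scaleDown for faster response: a 30-second window and up to 50% more Pods per minute. Note that this is still more conservative than the Kubernetes defaults for scale-up, which use no stabilization window and allow the larger of 4 Pods or 100% growth every 15 seconds.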
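Several policies can also be combined in one direction. The fragment below is a sketch with illustrative values, not tuned recommendations: scale-up may add either 4 Pods or 100% of the current count every 15 seconds, whichever allows the larger change, while scale-down removes at most 1 Pod per minute.

```yaml
# Fragment of an HPA spec; values are illustrative assumptions.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    selectPolicy: Max          # apply whichever policy permits the larger change
    policies:
    - type: Pods
      value: 4
      periodSeconds: 15
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 1
      periodSeconds: 60
```

Setting selectPolicy to Min instead would pick the most restrictive policy, and Disabled turns off scaling in that direction entirely.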

HPA Best Practices

Set Accurate Resource Requests: CPU and memory request values should reflect your application's consumption under normal load. Requests that are too low cause unnecessary scaling; requests that are too high cause insufficient scaling.
Use Scale-Down Stabilization: The default 300-second stabilization window works well for most scenarios. It prevents unnecessary scale-downs during brief traffic dips.
Define Readiness Probes: Newly created Pods shouldn't receive traffic before they're ready. Without a readiness probe, a Pod is marked ready as soon as its containers start, so it can be sent requests before the application is able to serve them, which increases response times.
Don't Use HPA and VPA on the Same Metric: If Vertical Pod Autoscaler (VPA) and HPA both act on the same metric (for example, CPU), they will conflict. Use VPA for memory only and HPA for CPU, or consider Multidimensional Pod Autoscaler.
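To illustrate the readiness probe recommendation, here is a minimal container fragment from a Deployment's Pod template. The /healthz path, port 8080, image, and timing values are assumptions for the example; use whatever health endpoint your application actually exposes.

```yaml
# Container fragment from a Deployment's Pod template.
# Path, port, image, and timings are hypothetical; adapt them to your application.
containers:
- name: web
  image: example.com/web-app:1.0   # hypothetical image
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /healthz               # assumed health endpoint
      port: 8080
    initialDelaySeconds: 5         # wait before the first check
    periodSeconds: 10              # check every 10 seconds
    failureThreshold: 3            # mark unready after 3 consecutive failures
```

Until the probe succeeds, the Pod is kept out of the Service's endpoints, so a freshly scaled-up Pod only receives traffic once it can respond.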

For Kubernetes fundamentals, check our Introduction to Kubernetes guide. For application deployment, see our Helm Chart guide. For monitoring infrastructure, read our Prometheus + Grafana guide. The Kubernetes HPA documentation and Metrics Server project are valuable additional resources.

Frequently Asked Questions

How fast does HPA scale?

HPA checks metrics every 15 seconds by default. With the default behavior (no scale-up stabilization window), scale-up decisions are applied immediately, but it can take 30 seconds to several minutes for new Pods to become ready (image pull + startup).

Can HPA minReplicas be 0?

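By default, HPA requires minReplicas of at least 1. A value of 0 is only accepted when the alpha HPAScaleToZero feature gate is enabled and the HPA uses at least one Object or External metric. For scale-to-zero in practice, use KEDA (Kubernetes Event-Driven Autoscaling). KEDA supports scaling from 0 based on external metrics like queue length or HTTP request count.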
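As a rough sketch of scale-to-zero with KEDA, the ScaledObject below scales a Deployment between 0 and 10 replicas based on a Prometheus query. It assumes KEDA is already installed in the cluster; the Deployment name worker, the Prometheus address, the query, and the threshold are all hypothetical values chosen for illustration.

```yaml
# Sketch of a KEDA ScaledObject; assumes KEDA is installed in the cluster.
# Target name, Prometheus address, query, and threshold are hypothetical.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker                   # hypothetical Deployment to scale
  minReplicaCount: 0               # allow scaling down to zero
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
      query: sum(rate(http_requests_total{app="worker"}[2m]))
      threshold: "100"             # target value per replica
```

KEDA handles the transition between 0 and 1 replicas itself and creates an HPA behind the scenes for scaling between 1 and maxReplicaCount.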

What CPU target percentage should I set?

The general recommendation is 50-70%. 50% provides more aggressive scaling (more Pods, lower latency), while 70% is more cost-efficient. Adjust based on your application's response time requirements.

Can I use custom metrics with HPA?

Yes. With Prometheus Adapter or KEDA, you can scale based on HTTP request count, queue depth, database connection count, and more. The autoscaling/v2 API supports custom and external metric types.
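For example, an HPA can target a per-Pod custom metric with the Pods metric type. The sketch below assumes a metrics adapter such as Prometheus Adapter already exposes a metric named http_requests_per_second through the custom metrics API; that metric name, the HPA name, and the target value are assumptions for illustration.

```yaml
# Sketch of an HPA using a custom per-Pod metric.
# Assumes an adapter exposes http_requests_per_second via the custom metrics API.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa-rps            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical metric name
      target:
        type: AverageValue
        averageValue: "100"        # aim for about 100 requests/second per Pod
```

The same replica formula applies: if 4 Pods average 150 requests per second against a target of 100, HPA asks for ceil(4 * 150 / 100) = 6 Pods.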

Conclusion

Kubernetes HPA automatically responds to traffic fluctuations, ensuring both performance and cost efficiency. Install Metrics Server, set accurate resource requests, use scale-down stabilization, and define readiness probes. When custom metrics are needed, extend HPA with Prometheus Adapter or KEDA.
