Stop Wasting CPU: The QoS & VPA Playbook
By Ramy Bouchareb · Kubernetes internals · DevOps
The scheduler’s blind spot
Here is the single most important thing most people miss about Kubernetes scheduling: the scheduler only looks at resource requests — not limits, not actual usage. When it’s deciding where to place a pod, it asks one question: “Does this node have enough requested capacity left?” That’s it.
This creates a fascinating gap between what your cluster thinks is happening and what is actually happening on the nodes. You can request 250m CPU for a pod that really only burns 30m most of the time — and the scheduler happily reserves all 250m on the node’s accounting sheet, leaving it unavailable to everyone else.
⚠️ The overcommitment trap: When your requests are significantly higher than actual usage, you’re holding CPU hostage. Nodes appear “full” to the scheduler while pods idle at a fraction of their reserved capacity.
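Here's a minimal sketch of the kind of manifest that creates this gap. The image and names are placeholders; the 250m request mirrors the example that follows:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gestion                 # the app from the example; substitute your own
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gestion
  template:
    metadata:
      labels:
        app: gestion
    spec:
      containers:
        - name: gestion
          image: registry.example.com/gestion:latest   # placeholder image
          resources:
            requests:
              cpu: 250m   # the scheduler books all 250m on the node
            # no limit set: actual usage (~30m idle) is invisible to placement
```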
What the data actually shows
The two screenshots below come from a Kibana Lens visualization monitoring the gestion app in a real cluster over the course of a working afternoon. They tell a story that words alone cannot.
[Chart] Kubernetes overcommitment for kubernetes.labels.app: "gestion", 10:00–21:00. Requests average: 250m (constant ceiling) • Usage average: ~50m (typical idle) • Average waste: 80% of reserved CPU unused.
The green line (Average Waste %) hovers around 80–100% throughout the day. The white line (Requests Average) is a flat 250m. The red line (Usage Average) barely registers near zero at idle, with orange Max Usage spikes during actual work. This is overcommitment made visible.
What you’re seeing is textbook overcommitment. The green line — average waste — sits at 80% for hours. That means 200m out of every 250m CPU requested is being held on the node’s books but never actually consumed. Yet to the scheduler, those millicores are “taken”.
The orange spikes tell the other half of the story: when the app actually works, it bursts up to ~500m — double its request. Without a limit, this is fine. With a tight limit, those spikes would get throttled. This is exactly the tradeoff you need to understand.
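For context on why a tight limit would hurt here: Kubernetes enforces CPU limits through the kernel's CFS quota, a hard ceiling per scheduling period. A sketch of what a tight limit would do to this app, with illustrative values:

```yaml
resources:
  requests:
    cpu: 250m
  limits:
    cpu: 250m   # CFS quota: ~25ms of CPU time per 100ms period,
                # so the ~500m bursts get throttled to half speed
                # even when idle cores are sitting on the node
```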
Pod QoS classes: the eviction ladder
Kubernetes uses three Quality of Service classes to decide who gets evicted first when a node runs out of actual resources. You don’t choose a QoS class directly — Kubernetes infers it from how you set requests and limits on your containers.
- Guaranteed (QoS Tier 1): Every container has requests = limits for both CPU and memory. Kubernetes pins resources exclusively to this pod. → Evicted last.
- Burstable (QoS Tier 2): At least one container sets a request or a limit, but the pod doesn't meet the Guaranteed bar (for instance, requests lower than limits, or only CPU set). Pod gets a baseline but can burst into slack capacity. → Evicted second.
- BestEffort (QoS Tier 3): No requests or limits set at all. Kubernetes gives whatever’s left over and takes it back first when things get tight. → Evicted first.
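To make the inference concrete, here's a minimal sketch of the resources block that produces each class (values are illustrative):

```yaml
# Guaranteed: requests == limits for CPU and memory on every container
resources:
  requests: { cpu: 500m, memory: 256Mi }
  limits:   { cpu: 500m, memory: 256Mi }
---
# Burstable: at least one request or limit set, but not matching everywhere
resources:
  requests: { cpu: 75m }
  limits:   { cpu: 600m }
---
# BestEffort: no requests and no limits; the resources block is simply absent
```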
Choosing the right class for your workload
Guaranteed is right for mission-critical services: databases, payment processors, auth backends. You’re trading resource efficiency for predictability and eviction protection. The scheduler reserves exact capacity, so your pod won’t be disturbed by noisy neighbours.
Burstable is the sweet spot for most application pods — including the gestion app we’re looking at. Set a modest request based on typical idle usage, allow a higher limit to accommodate spikes, and the scheduler gets accurate placement data while your app can still breathe when it needs to.
The practical rule: If your Kibana chart shows a wide gap between your request line and your usage average, you’re a prime candidate for Burstable class with a right-sized request — which is exactly what VPA is designed to calculate for you.
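You don't have to guess which class a running pod landed in; Kubernetes records it in the pod status:

```yaml
# kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
status:
  qosClass: Burstable    # Guaranteed | Burstable | BestEffort
```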
The overcommitment math
CPU requested: 250m • Actual avg usage: ~50m • Wasted per pod: 200m
Scale that across 20 replicas and you’re holding 4,000 millicores hostage — four full CPU cores that your scheduler treats as occupied. That either forces you to over-provision nodes, or starves other workloads that genuinely need capacity.
Vertical Pod Autoscaler: letting the cluster teach itself
Manually right-sizing requests is guesswork. You look at Grafana, pick a p95 value, add a buffer, and hope you got it right. VPA takes a different approach: it runs as a controller inside your cluster, watches historical CPU and memory usage per container, and continuously recommends — or directly applies — better request and limit values.
- Observe — The VPA Recommender reads metrics from the metrics server and builds a statistical model of each container’s resource consumption over time.
- Recommend — It writes a recommendation to the VPA object with lowerBound, target, and upperBound values — a recommended range anchored to actual behaviour.
- Apply (optional) — In updateMode: Auto, VPA evicts and restarts pods with the new resource values baked in. In Off mode, it only shows recommendations — you apply manually.
- Adapt — As your workload patterns change — seasonal traffic, new features, load shifts — VPA's model continuously updates. No manual re-tuning needed.
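A minimal VPA manifest for this workload might look like the sketch below, started in Off mode so it only publishes recommendations. The name and bounds are illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gestion-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gestion
  updatePolicy:
    updateMode: "Off"          # recommend only; flip to "Auto" to let VPA apply
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 25m             # floor for recommendations
        maxAllowed:
          cpu: "1"             # ceiling, so Auto mode can't balloon the pod
```

Once the Recommender has seen some history, kubectl describe vpa gestion-vpa prints the lowerBound/target/upperBound range per container.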
VPA + HPA caveat: Don't use VPA in Auto mode together with the Horizontal Pod Autoscaler on the same CPU or memory metrics — they'll fight each other. The safe pairing is VPA (Auto) for right-sizing + HPA on custom metrics, or VPA in Off mode (recommendations only) to inform your manual tuning.
Making it visible: Kibana + Lens
All the theory above becomes genuinely actionable when you can see it happening in real time. The Elastic Kubernetes integration ships a set of default dashboards, but the real power comes from building a custom Lens visualization that puts your key signals on one canvas:
What Kibana collects
- Per-pod CPU and memory usage from the node metrics pipeline
- Container resource requests and limits from the Kubernetes state
- Pod lifecycle events, restarts, and OOM kills
- Node-level allocatable vs. allocated capacity
What Lens lets you build
- Formula fields: requests − usage = gap (overcommitment)
- Waste % as a derived metric, updated per minute
- Max vs. average usage overlaid to spot burst headroom
- Time-range zoom to isolate deployment events or incidents
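As a sketch, the two derived metrics might be expressed as Lens formulas like these. The field names are assumptions based on the Elastic Kubernetes integration (pod usage in nanocores, requests from kube-state-metrics); check your own index before copying:

```
Gap (cores):
  average(kubernetes.container.cpu.request.cores)
    - average(kubernetes.pod.cpu.usage.nanocores) / 1000000000

Waste %:
  100 * (1 - average(kubernetes.pod.cpu.usage.nanocores) / 1000000000
    / average(kubernetes.container.cpu.request.cores))
```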
The chart we’ve been looking at plots six signals together: usage average, requests average, the instantaneous gap (last value), the rolling average gap, average waste percentage, and max usage. The combination lets you answer three questions at a glance: How much am I wasting right now? Is it trending better or worse? And when spikes happen, are they staying below the node’s actual capacity?
Pair that visibility with VPA recommendations and you have a feedback loop: Kibana shows the problem, VPA quantifies the fix, you apply it, Kibana confirms the result. That’s the whole workflow.
Putting it all together
Observe: Kibana Lens shows 80% CPU waste → Right-size: VPA recommends target ≈ 60–80m → Classify: Set requests=75m, limit=600m → Burstable QoS
For the gestion app specifically: if VPA recommends a target of around 60–80m based on historical data, you could set a request of 75m and a limit of 600m (to allow for the ~500m bursts we saw in the orange line). That makes it Burstable, reduces your scheduler’s reserved footprint by 70%, and still protects your pod from being throttled during real work.
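In manifest form, the change is just this (the 75m/600m values come from the reasoning above; your VPA target will differ):

```yaml
resources:
  requests:
    cpu: 75m     # ≈ VPA target; what the scheduler now books per replica
  limits:
    cpu: 600m    # headroom above the ~500m observed bursts
```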
Kubernetes doesn’t waste your CPU by accident — it wastes it because you told it to. Requests are a contract with the scheduler, not a suggestion. Get them right, and your cluster suddenly has room to breathe.
Have questions about your cluster’s overcommitment profile? Drop a comment or connect — I’m happy to talk through the specifics.