
Use maxSkew: 2 with Kubernetes Topology Spread Constraints

maxSkew: 1 on a topologySpreadConstraints config looks like the obviously correct choice — maximum spread, tightest guarantee. We ran it that way in production until it caused a partial outage. Turns out maxSkew: 2 is almost always the safer default, and the difference only shows up in the failure case.

The phantom domain problem

With topologyKey: kubernetes.io/hostname and whenUnsatisfiable: DoNotSchedule, the Kubernetes scheduler counts every node registered in the API as a topology domain — including nodes that exist but can’t accept pods. A node that’s resource-exhausted but not tainted, or registered but not yet Ready, still participates in the skew calculation. Its count is 0.

The skew formula is:

skew = count_on_candidate_node (including the incoming pod) - min_count_across_all_domains

So if Node C is stuck at 0, no other node can go above 1 pod without violating maxSkew: 1. Pod 3 can’t go to Node A (skew would be 2-0 = 2). Can’t go to Node B either. Can’t go to Node C — it has no capacity. The pod goes Pending.
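The blocking behaviour is easy to reproduce with a toy model. This is a sketch, not the real kube-scheduler (which runs many plugins and scores nodes differently); it models only the DoNotSchedule skew check, with hypothetical node names and capacities:

```python
def schedule(replicas, nodes, capacity, max_skew):
    """Greedily place pods, rejecting any placement that would
    push skew above max_skew. Returns (per-node counts, pending)."""
    counts = {n: 0 for n in nodes}
    pending = 0
    for _ in range(replicas):
        fits = []
        for n in nodes:
            if counts[n] >= capacity[n]:
                continue  # node has no room (e.g. resource-exhausted)
            trial = dict(counts)
            trial[n] += 1  # skew is checked with the incoming pod placed
            if trial[n] - min(trial.values()) <= max_skew:
                fits.append(n)
        if fits:
            # prefer the least-loaded candidate, like the spread score
            counts[min(fits, key=counts.get)] += 1
        else:
            pending += 1
    return counts, pending

# Node C is the phantom: registered (so it counts as a domain at 0)
# but with no usable capacity.
cap = {"A": 10, "B": 10, "C": 0}
print(schedule(4, ["A", "B", "C"], cap, max_skew=1))
# → ({'A': 1, 'B': 1, 'C': 0}, 2)   two pods Pending
print(schedule(4, ["A", "B", "C"], cap, max_skew=2))
# → ({'A': 2, 'B': 2, 'C': 0}, 0)   all four land
```

The phantom node never receives a pod either way; the only thing maxSkew: 1 buys you here is blocking the nodes that could have taken it.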

The scenario table

4 replicas, 3 nodes (A, B, C), whenUnsatisfiable: DoNotSchedule:

| Scenario | maxSkew: 1 | maxSkew: 2 |
| --- | --- | --- |
| All 3 nodes healthy | 4/4 (2-1-1) | 4/4 (2-1-1) |
| Node C down and tainted | 4/4 (2-2) | 4/4 (2-2) |
| Node C exists but has no capacity | 2/4 (1-1-0) — 2 pods stuck Pending | 4/4 (2-2-0) |
| Scale to 6, C resource-constrained | 2/6 (1-1-0) — 4 pods stuck Pending | 4/6 (2-2-0) — 2 still Pending, but 4 land |

The first two rows are identical. The difference only appears when a node is in the API but can’t schedule. maxSkew: 2 allows the remaining healthy nodes to absorb the load. The scheduler still optimizes for even spread — you don’t get everything dumped on one node — but it’s not blocked by the phantom.

Why Karpenter makes this worse

With Karpenter, a new node registers in the API before it’s Ready. That registration creates a phantom domain at 0. Under maxSkew: 1, the scheduler sees the new node at 0 and blocks scheduling on all existing nodes. Pods stay Pending. Karpenter sees Pending pods and decides it needs another node. That node also registers before it’s Ready, creating another phantom. The constraint gets harder to satisfy, not easier.

With maxSkew: 2, the in-flight node’s 0-count doesn’t block the existing healthy nodes. Pods land 2-2-0 and the new node comes up clean. The feedback loop never starts.
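The feedback loop can be sketched with the same skew check. This is a deliberately simplified model (hypothetical node names; it ignores that an in-flight node does eventually go Ready) whose point is that each provisioning round tightens the constraint rather than relaxing it:

```python
def rounds_until_placed(replicas, max_skew, max_rounds=5):
    """Toy model: every round that ends with Pending pods provisions
    one more node, which registers (count 0) before it is Ready.
    Returns the round in which all pods land, or None."""
    ready = ["A", "B"]            # existing healthy nodes
    phantoms = ["karpenter-1"]    # registered, not yet Ready
    for r in range(1, max_rounds + 1):
        counts = {n: 0 for n in ready + phantoms}
        pending = 0
        for _ in range(replicas):
            fits = []
            for n in ready:  # only Ready nodes can actually take pods
                trial = dict(counts)
                trial[n] += 1
                # phantoms still drag min_count down to 0
                if trial[n] - min(trial.values()) <= max_skew:
                    fits.append(n)
            if fits:
                counts[min(fits, key=counts.get)] += 1
            else:
                pending += 1
        if pending == 0:
            return r
        # Karpenter reacts to Pending pods with another node --
        # itself a fresh phantom domain at 0
        phantoms.append(f"karpenter-{len(phantoms) + 1}")
    return None

print(rounds_until_placed(4, max_skew=1))  # → None: never converges
print(rounds_until_placed(4, max_skew=2))  # → 1: all pods land first round
```

Under maxSkew: 1 the model never converges: every new node is another zero-count domain, so the healthy nodes stay capped at one pod each. Under maxSkew: 2 the phantom is simply ignored by the pods that have somewhere to go.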

The fix

topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app

That’s it. Under normal conditions the behaviour is identical to maxSkew: 1. The difference only surfaces when a node is registered but can’t schedule — which in a dynamic cluster is not an edge case, it’s Tuesday.
