
Use maxSkew: 2 with Kubernetes Topology Spread Constraints

maxSkew: 1 on a topologySpreadConstraints config looks like the obviously correct choice — maximum spread, tightest guarantee. We ran it that way in production until it caused a partial outage. Turns out maxSkew: 2 is almost always the safer default, and the difference only shows up in the failure case.

The phantom domain problem

With topologyKey: kubernetes.io/hostname and whenUnsatisfiable: DoNotSchedule, the Kubernetes scheduler counts every node registered in the API as a topology domain — including nodes that exist but can’t accept pods. A node that’s resource-exhausted but not tainted, or registered but not yet Ready, still participates in the skew calculation. Its count is 0.

The skew formula is:

skew = count_on_candidate_node (including the incoming pod) - min_count_across_all_domains

So if Node C is stuck at 0, no other node can go above 1 pod without violating maxSkew: 1. Pod 3 can’t go to Node A (skew would be 2-0 = 2). Can’t go to Node B either. Can’t go to Node C — it has no capacity. The pod goes Pending.
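The blocking behaviour is easy to reproduce with a toy model. This is a sketch, not the real kube-scheduler (which runs many plugins and scores nodes differently); it models only the DoNotSchedule skew check, with hypothetical node names and capacities:

```python
def schedule(replicas, nodes, capacity, max_skew):
    """Greedily place pods, rejecting any placement that would
    push skew above max_skew. Returns (per-node counts, pending)."""
    counts = {n: 0 for n in nodes}
    pending = 0
    for _ in range(replicas):
        fits = []
        for n in nodes:
            if counts[n] >= capacity[n]:
                continue  # node has no room (e.g. resource-exhausted)
            trial = dict(counts)
            trial[n] += 1  # skew is checked with the incoming pod placed
            if trial[n] - min(trial.values()) <= max_skew:
                fits.append(n)
        if fits:
            # prefer the least-loaded candidate, like the spread score
            counts[min(fits, key=counts.get)] += 1
        else:
            pending += 1
    return counts, pending

# Node C is the phantom: registered (so it counts as a domain at 0)
# but with no usable capacity.
cap = {"A": 10, "B": 10, "C": 0}
print(schedule(4, ["A", "B", "C"], cap, max_skew=1))
# → ({'A': 1, 'B': 1, 'C': 0}, 2)   two pods Pending
print(schedule(4, ["A", "B", "C"], cap, max_skew=2))
# → ({'A': 2, 'B': 2, 'C': 0}, 0)   all four land
```

The phantom node never receives a pod either way; the only thing maxSkew: 1 buys you here is blocking the nodes that could have taken it.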

The scenario table

4 replicas, 3 nodes (A, B, C), whenUnsatisfiable: DoNotSchedule:

| Scenario | maxSkew: 1 | maxSkew: 2 |
| --- | --- | --- |
| All 3 nodes healthy | 4/4 (2-1-1) | 4/4 (2-1-1) |
| Node C down and tainted | 4/4 (2-2) | 4/4 (2-2) |
| Node C exists but has no capacity | 2/4 (1-1-0) — 2 pods stuck Pending | 4/4 (2-2-0) |
| Scale to 6, C resource-constrained | 2/6 (1-1-0) — 4 pods stuck Pending | 4/6 (2-2-0) — 2 still Pending, but 4 land |

The first two rows are identical. The difference only appears when a node is in the API but can’t schedule. maxSkew: 2 allows the remaining healthy nodes to absorb the load. The scheduler still optimizes for even spread — you don’t get everything dumped on one node — but it’s not blocked by the phantom.

Why Karpenter makes this worse

With Karpenter, a new node registers in the API before it’s Ready. That registration creates a phantom domain at 0. Under maxSkew: 1, the scheduler sees the new node at 0 and blocks scheduling on all existing nodes. Pods stay Pending. Karpenter sees Pending pods and decides it needs another node. That node also registers before it’s Ready, creating another phantom. The constraint gets harder to satisfy, not easier.

With maxSkew: 2, the in-flight node’s 0-count doesn’t block the existing healthy nodes. Pods land 2-2-0 and the new node comes up clean. The feedback loop never starts.
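The feedback loop can be sketched with the same skew check. This is a deliberately simplified model (hypothetical node names; it ignores that an in-flight node does eventually go Ready) whose point is that each provisioning round tightens the constraint rather than relaxing it:

```python
def rounds_until_placed(replicas, max_skew, max_rounds=5):
    """Toy model: every round that ends with Pending pods provisions
    one more node, which registers (count 0) before it is Ready.
    Returns the round in which all pods land, or None."""
    ready = ["A", "B"]            # existing healthy nodes
    phantoms = ["karpenter-1"]    # registered, not yet Ready
    for r in range(1, max_rounds + 1):
        counts = {n: 0 for n in ready + phantoms}
        pending = 0
        for _ in range(replicas):
            fits = []
            for n in ready:  # only Ready nodes can actually take pods
                trial = dict(counts)
                trial[n] += 1
                # phantoms still drag min_count down to 0
                if trial[n] - min(trial.values()) <= max_skew:
                    fits.append(n)
            if fits:
                counts[min(fits, key=counts.get)] += 1
            else:
                pending += 1
        if pending == 0:
            return r
        # Karpenter reacts to Pending pods with another node --
        # itself a fresh phantom domain at 0
        phantoms.append(f"karpenter-{len(phantoms) + 1}")
    return None

print(rounds_until_placed(4, max_skew=1))  # → None: never converges
print(rounds_until_placed(4, max_skew=2))  # → 1: all pods land first round
```

Under maxSkew: 1 the model never converges: every new node is another zero-count domain, so the healthy nodes stay capped at one pod each. Under maxSkew: 2 the phantom is simply ignored by the pods that have somewhere to go.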

The fix

topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app

That’s it. Under normal conditions the behaviour is identical to maxSkew: 1. The difference only surfaces when a node is registered but can’t schedule — which in a dynamic cluster is not an edge case, it’s Tuesday.
