Scheduling

Use maxSkew: 2 with Kubernetes Topology Spread Constraints

maxSkew: 1 on a topologySpreadConstraints config looks like the obviously correct choice — maximum spread, tightest guarantee. We ran it that way in production until it caused a partial outage. It turns out maxSkew: 2 is almost always the safer default, and the difference only shows up in the failure case.

The phantom domain problem

With topologyKey: kubernetes.io/hostname and whenUnsatisfiable: DoNotSchedule, the Kubernetes scheduler counts every node registered in the API as a topology domain — including nodes that exist but can't accept pods. A node that's resource-exhausted but not tainted, or registered but not yet Ready, still participates in the skew calculation. Its count is 0.
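The failure mode above can be sketched with a toy model of the skew check. This is not the real kube-scheduler code, and the node names and pod counts are invented; it only illustrates how a zero-count phantom domain interacts with maxSkew:

```python
def allows_placement(pod_counts, target, max_skew):
    """Toy model of a DoNotSchedule check: would placing one more pod on
    `target` keep (max domain count - min domain count) within max_skew?"""
    counts = dict(pod_counts)
    counts[target] += 1
    return max(counts.values()) - min(counts.values()) <= max_skew

# Two Ready nodes running one pod each, plus a registered-but-unschedulable
# "phantom" node that still counts as a topology domain with 0 pods.
counts = {"node-a": 1, "node-b": 1, "phantom": 0}

# With maxSkew: 1, placing on any real node yields skew 2 - 0 = 2 > 1,
# so every candidate is rejected and the new pod stays Pending.
print(allows_placement(counts, "node-a", 1))  # False

# With maxSkew: 2, the same placement is within tolerance.
print(allows_placement(counts, "node-a", 2))  # True
```

The phantom node can never host a pod, so its count stays pinned at 0 and drags the minimum down; maxSkew: 2 gives the real nodes room to keep accepting pods anyway.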

Four days, 277 sessions, one brutal Sunday time slot: scheduling SCALE 23x as a platform team manager

There are 277 sessions at SCALE 23x this year. I know this because I extracted all of them from the schedule webarchive files and scored every single one. I'm not proud of how long this took. But it surfaced some genuinely interesting tradeoffs — and the pattern of what conflicted with what tells you something real about where platform engineering is right now.

The scheduling problem is different when you manage a team

When I was an IC, conference scheduling was mostly about depth. Find the three talks that will blow your mind and plan the rest around them. Everything else is hallway track.