Scheduling

Use maxSkew: 2 with Kubernetes Topology Spread Constraints

maxSkew: 1 on a topologySpreadConstraints config looks like the obviously correct choice — maximum spread, tightest guarantee. We ran it that way in production until it caused a partial outage. It turns out maxSkew: 2 is almost always the safer default, and the difference only shows up in the failure case.

The phantom domain problem

With topologyKey: kubernetes.io/hostname and whenUnsatisfiable: DoNotSchedule, the Kubernetes scheduler counts every node registered in the API as a topology domain — including nodes that exist but can't accept pods. A node that's resource-exhausted but not tainted, or registered but not yet Ready, still participates in the skew calculation. Its count is 0.
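The failure mode above can be sketched with a toy model of the skew check. This is not the real kube-scheduler code, and the node names and pod counts are invented; it only illustrates how a zero-count phantom domain interacts with maxSkew:

```python
def allows_placement(pod_counts, target, max_skew):
    """Toy model of a DoNotSchedule check: would placing one more pod on
    `target` keep (max domain count - min domain count) within max_skew?"""
    counts = dict(pod_counts)
    counts[target] += 1
    return max(counts.values()) - min(counts.values()) <= max_skew

# Two Ready nodes running one pod each, plus a registered-but-unschedulable
# "phantom" node that still counts as a topology domain with 0 pods.
counts = {"node-a": 1, "node-b": 1, "phantom": 0}

# With maxSkew: 1, placing on any real node yields skew 2 - 0 = 2 > 1,
# so every candidate is rejected and the new pod stays Pending.
print(allows_placement(counts, "node-a", 1))  # False

# With maxSkew: 2, the same placement is within tolerance.
print(allows_placement(counts, "node-a", 2))  # True
```

The phantom node can never host a pod, so its count stays pinned at 0 and drags the minimum down; maxSkew: 2 gives the real nodes room to keep accepting pods anyway.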

Four days, 277 sessions, one brutal Sunday time slot: scheduling SCALE 23x as a platform team manager

There are 277 sessions at SCALE 23x this year. I know this because I extracted all of them from the schedule webarchive files and scored every single one. I'm not proud of how long this took. But it surfaced some genuinely interesting tradeoffs — and the pattern of what conflicted with what tells you something real about where platform engineering is right now.

The scheduling problem is different when you manage a team

When I was an IC, conference scheduling was mostly about depth. Find the three talks that will blow your mind and plan the rest around them. Everything else is hallway track.