A Readiness Probe That Can't Fail Is Just Wallpaper

Writing up a production outage report today, I was making the case that workloads should fail their readiness probes when they can’t reach their dependencies — databases, caches, anything required to do useful work. The collaborating Claude session put it better than I had:

A probe that doesn’t fail when its workload can’t reach its database isn’t a probe — it’s wallpaper.

That’s the whole thing. A readiness probe answers one question: is this pod ready to serve traffic? If the answer depends on a database connection and you’re not checking for that, you’re not answering the question — you’re decorating the pod spec.

Why the shallow probe fails you during an outage

When a required dependency goes down, Kubernetes keeps routing traffic to your pods as long as their probes stay green. Every request lands on a pod that can’t do the work, and the user sees the error. You’re now relying on application-layer retries or circuit breakers to save you, neither of which you should need if the probe were doing its job.

A probe that reflects real dependency health lets Kubernetes do what it’s designed for: stop routing traffic to pods that aren’t ready. When the dependency recovers, the probe passes again, traffic resumes, no manual intervention.
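
To make the remove-and-restore cycle concrete, here is a toy model of how consecutive probe results turn into a ready or not-ready state. It is a simplified sketch, not the kubelet's actual logic: the function name and the starting state are assumptions, while the defaults mirror Kubernetes' documented failureThreshold of 3 and successThreshold of 1.

```python
def readiness_timeline(probe_results, failure_threshold=3,
                       success_threshold=1, ready=True):
    """Return the pod's ready state after each probe result.

    probe_results is a sequence of booleans (True = probe passed).
    The pod flips to not-ready after `failure_threshold` consecutive
    failures and back to ready after `success_threshold` consecutive
    successes.
    """
    timeline = []
    failures = successes = 0
    for ok in probe_results:
        if ok:
            successes += 1
            failures = 0
            if not ready and successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0
            if ready and failures >= failure_threshold:
                ready = False
        timeline.append(ready)
    return timeline


# The dependency is down for four probes in a row, then recovers.
results = [True, False, False, False, False, True, True]
print(readiness_timeline(results))
# [True, True, True, False, False, True, True]
```

The pod drops out of rotation on the third consecutive failure and comes back on the first success, with nobody touching it. A shallow probe would have reported True for every entry in that list.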

What the difference looks like

Shallow probe — passes as long as the HTTP server is up:

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080

Probe that actually answers the question — /ready checks the database connection:

readinessProbe:
  httpGet:
    path: /ready  # verifies DB connection; returns 503 if unreachable
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

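The specific timing values matter less than the endpoint; tune those to your startup time and to how quickly you want a broken pod pulled from rotation. The /ready handler should try to verify every dependency the pod needs to serve a request. If any of them is unreachable, return a failing status. Kubernetes treats any code from 200 to 399 as success, so "non-2xx" is not quite enough: return a 4xx or 5xx, such as 503. That’s it.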
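
A handler along those lines could look like the sketch below. It uses only the Python standard library, and everything named in it is an assumption for illustration: the hostnames `db.internal` and `cache.internal`, the ports, and the helpers `tcp_check`, `evaluate`, and `serve`. A TCP connect is also a weak stand-in for a real check; in practice you would more likely run a trivial query through your existing connection pool.

```python
import json
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer


def tcp_check(host, port, timeout=0.4):
    """Build a check that passes if a TCP connection can be opened."""
    def check():
        with socket.create_connection((host, port), timeout=timeout):
            return True
    return check


# Hypothetical dependencies; replace with what the pod actually needs.
CHECKS = {
    "database": tcp_check("db.internal", 5432),
    "cache": tcp_check("cache.internal", 6379),
}


def evaluate(checks):
    """Run every check and return (http_status, per-dependency results)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # Any error (refused, timeout, DNS failure) counts as not ready.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class ReadyHandler(BaseHTTPRequestHandler):
    checks = CHECKS

    def do_GET(self):
        if self.path != "/ready":
            self.send_error(404)
            return
        status, results = evaluate(self.checks)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep probe traffic out of the access log


def serve(port=8080):
    """Start the readiness endpoint; call this from the app's entry point."""
    HTTPServer(("", port), ReadyHandler).serve_forever()
```

The per-check timeout is kept short on purpose: the probe's timeoutSeconds defaults to 1 second, and a handler that hangs on a dead dependency fails the probe by timing out rather than by answering. Returning the per-dependency results in the body costs nothing and tells whoever is debugging which dependency took the pod out of rotation.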

The wallpaper version looks like a probe, passes CI, and appears in every health check dashboard. It just doesn’t help when things go wrong — which is the only time probes matter.
