Nobody cares that your Kubernetes cluster is healthy (and what to measure instead)

A few weeks ago, our new principal engineer sat down with our team and said something that stung a little: “I can see your cluster is up. I have no idea if anyone finds it useful.”

That’s a hard sentence to sit with when you’ve spent months tuning alerts and building dashboards.

I manage a team of SREs. We look after EKS, ArgoCD, Loki, Backstage, Karpenter, and a handful of other tools that together form what we loosely call “the platform.” We’re good at keeping things running. We have alerts. We have runbooks. We have dashboards full of green lights.

What we don’t have is any idea whether the engineering teams we serve are actually having a good experience.

The gap between “is it up?” and “is it good?”

We can tell you the ArgoCD API server’s memory usage. We cannot tell you how long it takes a developer to get from git push to a running workload.

We know Loki’s ingester is healthy. We have no idea if log queries are fast enough for developers to actually debug a production incident.

This is the difference between engineering observability and product analytics 1. The first asks “is the service working as intended?” The second asks “are people getting value from this, and what should we improve?” We have a mountain of the first and essentially zero of the second.

Nobody outside our team looks at ingester health. What they experience is “I searched for logs and it took 45 seconds” or “I tried to scaffold a new service in Backstage and the template failed.” That’s the product. Right now, we’re almost completely blind to it.

What “product focused” actually means

Around the same time as the principal engineer conversation, our director started pushing the team to be “more product focused” — to stop thinking of our platforms as infrastructure we maintain and start treating them as products we own. My first reaction was that this was vague management-speak. Roadmaps. Stakeholder decks. Quarterly business reviews.

I’ve changed my mind.

Product thinking for a platform team means something concrete: you have users (developers), and you’re responsible for understanding their needs, measuring their experience, and iterating to improve it 2. That’s not how most SRE teams operate. We operate more like a utility — keep the electricity flowing, deal with outages when they happen.

Product thinking flips that. It’s not enough that the electricity works. You need to know whether people can easily plug their stuff in, and whether they’re quietly running extension cords to somewhere else because your outlets are in the wrong place.

I have a real suspicion that’s happening with some of our tools. We might find teams have quietly adopted workarounds for things we thought were working fine. We won’t know until we look.

Both the principal engineer and our director were making the same point from different angles: stop operating infrastructure and start owning products. The difference isn’t just philosophical — it changes what you measure, what you build, and how you decide what to work on next.

SLOs are our way in

We’ve started building our first SLOs, and even this early it’s already shifted how the team thinks. We’re instrumenting Loki (query latency at various percentiles) and Karpenter (time for workloads to land on newly provisioned nodes).

What makes SLOs different from our existing monitoring is the perspective shift. We’re not measuring “is the Karpenter controller healthy?” — we’re measuring the thing the developer actually waits for 3. The controller can be perfectly healthy while provisioning takes long enough that a developer is sitting there wondering if their deployment is broken.
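To make the shift concrete, here is a minimal Python sketch of an SLI measured from the developer’s side: the share of pods that were scheduled within a target wait time. The `PodTiming` record, its field names, and the 120-second target are all assumptions for illustration, not values from our setup or from Karpenter itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PodTiming:
    """Hypothetical record: when a pod was created and when it landed on a node."""
    name: str
    created_at: datetime
    scheduled_at: datetime

    @property
    def wait_seconds(self) -> float:
        # The thing the developer actually waits for.
        return (self.scheduled_at - self.created_at).total_seconds()


def fraction_within(pods: list[PodTiming], threshold_seconds: float) -> float:
    """Share of pods whose wait was at or under the threshold (the SLI)."""
    if not pods:
        raise ValueError("need at least one pod to compute an SLI")
    good = sum(1 for pod in pods if pod.wait_seconds <= threshold_seconds)
    return good / len(pods)


# Made-up sample data: four pods that waited 30 s, 60 s, 90 s and 300 s.
start = datetime(2024, 1, 1, 12, 0, 0)
sample = [
    PodTiming(f"pod-{i}", start, start + timedelta(seconds=wait))
    for i, wait in enumerate([30, 60, 90, 300])
]
sli = fraction_within(sample, threshold_seconds=120)  # 120 s is an assumed target
```

Note that nothing in this calculation looks at controller health. A controller that never restarts can still produce a poor number here, which is the point.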

The plan is to extend this pattern across the rest of our stack:

| Platform | What we plan to measure |
| --- | --- |
| ArgoCD | Sync success rate, time from commit to sync completion |
| Backstage | Scaffolder template success rate, catalog page load time |
| EKS | API server latency, pod scheduling time |
| Loki | Query P95 latency, ingestion lag |
| Karpenter | Node provisioning time, bin-packing efficiency |

Most of this data is already sitting in Prometheus or in the tools’ own metrics endpoints. The hard part isn’t instrumentation — it’s deciding which measurements actually represent the developer’s experience rather than just the system’s health.
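For a percentile-based measure such as query P95 latency, the arithmetic is simple once the samples are in hand. The sketch below uses the nearest-rank method in plain Python; it is an illustration only, and the function names and the idea of working from raw latency samples are assumptions (in practice the number would more likely come from histogram metrics).

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be an integer between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(latencies: list[float], pct: int, target_seconds: float) -> bool:
    """True when the chosen percentile is at or under the target."""
    return percentile(latencies, pct) <= target_seconds
```

A percentile target tolerates a small tail of slow queries. Whether that tail is acceptable depends on who is waiting on it, which is a product decision rather than a monitoring one.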

One thing I’m already realising: per-component SLOs won’t be enough on their own. For ArgoCD, “sync duration” in isolation isn’t what matters — it’s the full path from a developer pushing code to their workload actually running. Measuring the seams between tools is where the real insight probably lives. We haven’t figured out how to do that well yet.
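One possible starting point, sketched under heavy assumptions: if each tool could emit a timestamped event keyed by commit SHA, the seams fall out of a simple join. The stage names and the event shape below are hypothetical, and this is not something we have built.

```python
from datetime import datetime

# Hypothetical stage names, in the order a change moves through them.
STAGES = ("pushed", "image_built", "synced", "running")


def seam_durations(events):
    """Turn (commit_sha, stage, timestamp) events into per-seam durations in seconds.

    Commits that have not reached every stage are skipped.
    """
    by_commit: dict[str, dict[str, datetime]] = {}
    for sha, stage, timestamp in events:
        by_commit.setdefault(sha, {})[stage] = timestamp

    result: dict[str, dict[str, float]] = {}
    for sha, stamps in by_commit.items():
        if any(stage not in stamps for stage in STAGES):
            continue  # incomplete path, nothing to measure yet
        seams = {
            f"{a}->{b}": (stamps[b] - stamps[a]).total_seconds()
            for a, b in zip(STAGES, STAGES[1:])
        }
        # The end-to-end number is what the developer experiences.
        seams["total"] = (stamps[STAGES[-1]] - stamps[STAGES[0]]).total_seconds()
        result[sha] = seams
    return result
```

The per-seam breakdown shows where the time goes, while the total is the figure a developer would recognise as “how long my deploy took.”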

But SLOs alone won’t answer the important questions

SLOs will tell us how well something works. They won’t tell us:

  • How many people are actually using it
  • Whether they want to use it or are just stuck with it
  • What friction they hit that never surfaces as an incident

Right now, I can’t answer basic questions like: how many teams actively use ArgoCD? How many Backstage scaffolder runs happen per month? What percentage of production workloads run on Karpenter-managed nodes?
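Answering the first of those questions is mostly counting. Here is a minimal sketch, assuming usage can be exported as `(team, platform, day)` tuples from audit or query logs; that event shape and the 30-day window are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def active_teams(events, as_of: date, window_days: int = 30) -> dict[str, int]:
    """Count distinct teams per platform with at least one usage event in the window.

    events: iterable of (team, platform, day) tuples (a hypothetical export format).
    The window covers the window_days days ending on as_of, inclusive.
    """
    cutoff = as_of - timedelta(days=window_days)
    teams_by_platform: dict[str, set[str]] = defaultdict(set)
    for team, platform, day in events:
        if cutoff < day <= as_of:
            teams_by_platform[platform].add(team)
    return {platform: len(teams) for platform, teams in teams_by_platform.items()}
```

Counting distinct teams rather than raw events keeps one heavy user from looking like broad adoption.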

You can have perfect reliability on a product nobody uses. I don’t think that’s the case for us — but I can’t prove it either way. That’s the problem.

The DevEx paper by Noda, Storey, Forsgren, and Greiler (2023) makes a compelling case that you need both system telemetry and self-reported developer feedback to understand what’s actually going on 4. Quantitative data tells you what; only qualitative data tells you why. We’re planning a short developer experience survey — maybe 10 questions — to complement the SLO data. I’m genuinely nervous about what we’ll find.

The scorecard we’re building toward

For each platform we own, I want us to track four things:

  • Adoption — who uses it and how much
  • Reliability — are we meeting our SLOs
  • Satisfaction — what do developers think (from surveys)
  • Toil — how many support requests does it generate

None of this exists today. The SLOs on Loki and Karpenter are the first pieces. But the scorecard is the target. If we get there, we’ll have something we’ve never had: a way to talk about our platforms as products with measurable health, not just infrastructure we keep alive.
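As a rough picture of the record behind such a scorecard, here is a minimal Python sketch. The class, its field names, and the units in the comments are hypothetical; the only things taken from the list above are the four dimensions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PlatformScorecard:
    """One row of a hypothetical scorecard; None means 'not measured yet'."""
    platform: str
    active_teams: Optional[int] = None       # adoption
    slo_attainment: Optional[float] = None   # reliability, as a fraction from 0 to 1
    satisfaction: Optional[float] = None     # mean survey score (scale is an assumption)
    support_requests: Optional[int] = None   # toil, per reporting period

    def missing(self) -> list[str]:
        """Names of the dimensions that have no data yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Example: a platform with an SLO in place but nothing else measured.
loki = PlatformScorecard("Loki", slo_attainment=0.97)  # 0.97 is a made-up value
gaps = loki.missing()
```

Making the gaps explicit is deliberate: an empty field is an honest “we don’t know,” which is different from a zero.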

Getting there also means getting disciplined about protecting time for product work. Google’s guidance is pretty direct about capping operational work at 50% of team time 5. I have no idea what our current split is. I suspect it’s heavily skewed toward reactive work. Finding out is on the list.
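Finding out is mostly arithmetic once time is categorised. A minimal sketch, assuming hours are logged per category; the category names and which of them count as operational are assumptions.

```python
# Hypothetical categories treated as operational (reactive) work.
OPERATIONAL = frozenset({"incidents", "tickets", "on_call"})


def operational_share(hours_by_category: dict[str, float]) -> float:
    """Fraction of total logged hours spent on operational work."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    ops = sum(h for cat, h in hours_by_category.items() if cat in OPERATIONAL)
    return ops / total


def over_cap(hours_by_category: dict[str, float], cap: float = 0.5) -> bool:
    """True when operational work exceeds the cap (50% by default)."""
    return operational_share(hours_by_category) > cap
```

The hard part is the categorisation, not the division: the result is only as good as the team’s agreement on what counts as operational.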

Why this blind spot is so common

I can think of a few reasons:

SRE culture is reliability-first. That’s in the name. We’re trained to think in uptime, latency, error rates — all measured at the system level. Measuring user experience feels like someone else’s job.

We don’t think of developers as users. We think of them as colleagues who file tickets. The framing matters more than you’d expect. When you call them “users,” you start asking different questions.

There’s no obvious tool for internal platform analytics. We have Prometheus and Grafana for system metrics. We have PagerDuty for incidents. What’s the equivalent for “how many teams adopted our thing this month and do they like it”? It’s a gap. You end up cobbling something together from query logs and Slack message counts.

Our success has been invisible. When the platform works, nobody notices. When it breaks, everyone notices. That asymmetry optimises us for preventing visible failure, not for driving visible improvement. Product thinking is how you make the improvement visible.

What we’re doing next

Here’s the plan:

  1. Finish the SLO rollout — extend from Loki and Karpenter to ArgoCD, Backstage, and EKS. Target: end of Q2.
  2. Add basic adoption metrics — usage counters for each platform, broken down by team. Most of this data already exists; we just need to collect and visualise it.
  3. Run a developer experience survey — 10 questions, short and focused.
  4. Build the scorecard — one Grafana dashboard showing adoption, reliability, satisfaction, and toil for each platform.

We’ll probably get through half of it. But even getting the SLOs and adoption metrics in place will put us in a meaningfully better position than we’re in today. Once the scorecard exists, we can find out how skewed our operational vs. product work split actually is — and define a quarter’s worth of initiatives with real data behind them instead of gut feel.

If I were starting from zero, I’d pick one platform, define one SLO from the user’s perspective, and ask one team how their experience actually is. That’s enough to start seeing the gap.

We don’t have our answers yet. But at least we’re about to start looking.