Kubernetes Health Checks: Getting Probes Right in Production

Kubernetes health checks are one of those features that look obvious in the docs and then surprise you the first time a deployment silently breaks in production. I've spent more time than I'd like debugging pods that were technically running but serving nothing useful, all because the probes were either misconfigured or missing entirely.

This isn't a repeat of the Kubernetes basics. It's specifically about liveness, readiness, and startup probes — what they actually do, how they interact, and the mistakes that cost me real downtime.

Three Probes, Three Jobs

Kubernetes ships with three probe types. They're easy to mix up because they all ask "is this container okay?", but each one triggers a different action from the kubelet, the node agent that actually runs them.

Liveness answers: should this container be restarted? If it fails, Kubernetes kills and restarts the container. Use it to catch deadlocks or corrupted state the app can't recover from on its own.

Readiness answers: should this container receive traffic? If it fails, the pod is removed from the Service endpoints. The container keeps running — it just stops getting requests. Use it for slow startup, dependency checks, or graceful degradation.

Startup answers: has the container finished initialising? It gates the other two probes. While the startup probe is active, liveness and readiness don't run. This matters a lot for slow-starting JVM apps or .NET services loading large caches.

The Classic Mistake: Using Liveness for Everything

The most common error I see is using a liveness probe to check external dependencies — a database, a cache, a downstream API. The logic seems reasonable: if the database is down, the app isn't healthy, so restart it.

The problem is that restarting the pod won't fix the database. You end up in a crash loop, Kubernetes backs off with exponential delays, and now you've taken a partial outage and turned it into a full one.
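To get a feel for how quickly a crash loop eats time, here's a small sketch of the restart delays. It assumes the kubelet's documented back-off defaults (a 10-second base that doubles on each restart, capped at 5 minutes); the function name is mine, not a Kubernetes API.

```python
def restart_backoff_delays(restarts: int, base_seconds: int = 10, cap_seconds: int = 300) -> list[int]:
    """Approximate wait before each successive restart of a crash-looping container.

    Assumption: 10s base delay, doubling per restart, capped at 300s.
    """
    return [min(base_seconds * 2 ** attempt, cap_seconds) for attempt in range(restarts)]


delays = restart_backoff_delays(7)
print(delays)       # [10, 20, 40, 80, 160, 300, 300]
print(sum(delays))  # 910 seconds of waiting across seven restarts
```

Seven pointless restarts already cost about 15 minutes of back-off, and the database is no closer to being fixed.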

Liveness should only check things the pod itself can fix by restarting. If a restart won't solve the problem, liveness is the wrong probe.

Readiness is the right tool for external dependencies. When the database is unreachable, the pod stops receiving traffic without restarting. When it recovers, the pod becomes ready again. No restart loop, no cascading failure.

Startup Probes Save Slow Services

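Before startup probes existed, the workaround was setting initialDelaySeconds to something large enough to cover the worst-case startup time. The problem with that: the delay is fixed, so it applies on every start and restart whether the app needs it or not. A container that comes up in 5 seconds still waits the full delay before its first probe, and one that hangs early isn't detected until the delay has elapsed. A 60-second delay on every restart adds up fast.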

Startup probes handle this cleanly. You give the container a generous window to finish starting up — say, 30 checks at 10-second intervals — and once the startup probe passes, the tighter liveness and readiness probes take over.

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

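That gives the container up to 5 minutes to start (30 checks × 10 seconds). If the startup probe still hasn't passed by then, the container is killed and restarted. Once it passes, liveness and readiness run with their normal settings. For a .NET app loading a 2GB model file, this is the difference between a working deployment and a perpetual restart loop.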

HTTP vs TCP vs Exec

HTTP probes are the most common and the most useful. They hit an endpoint and check the response code. Anything in the 200-399 range counts as success. You control the path, port, and headers.
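The success rule is simple enough to write down. The sketch below mimics it with Python's standard library so you can poke at an endpoint the way a probe roughly would; it is not how the kubelet is implemented, and the function names and default timeout are my own choices.

```python
import urllib.error
import urllib.request


def is_probe_success(status_code: int) -> bool:
    """HTTP probes treat any status from 200 up to and including 399 as success."""
    return 200 <= status_code < 400


def http_probe(url: str, timeout_seconds: float = 1.0) -> bool:
    """Send a GET to url and apply the probe success rule to the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return is_probe_success(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the status code is still available.
        return is_probe_success(err.code)
    except OSError:
        # Connection refused, DNS failure, or timeout: the probe fails.
        return False
```

Note that a 404 or a 500 both fail, while a timeout fails without ever producing a status code at all.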

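TCP probes just check that the port accepts a connection. They're useful when your service doesn't speak HTTP: a raw TCP server, a database sidecar, or a gRPC service without an HTTP health endpoint. For gRPC specifically, newer Kubernetes versions also have a built-in grpc probe type that calls the standard gRPC health checking service, which tells you more than a bare TCP check.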

Exec probes run a command inside the container. They're flexible but expensive — each probe forks a new process. Fine for low-frequency checks, not great for anything running every few seconds.

For most HTTP services, the pattern I use is:

A /health/live endpoint that returns 200 if the process is running and not deadlocked. No external calls.
A /health/ready endpoint that checks database connectivity, cache availability, and anything else needed to serve traffic.
A /health/startup endpoint that returns 200 only after initial data loading is complete.

Keeping these separate in code makes the intent explicit and avoids the accidental coupling of liveness and external state.
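Here's a minimal sketch of that three-endpoint split. My real services are .NET, but the shape is the same in any language, so I've used Python's standard library to keep it self-contained. AppState, check_dependencies, and serve are hypothetical names for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AppState:
    """Hypothetical state the application updates as it runs."""

    def __init__(self):
        # Set once initial data loading has finished.
        self.startup_complete = threading.Event()
        # Placeholder: swap in real database and cache pings.
        self.check_dependencies = lambda: True


def evaluate(path: str, state: AppState) -> int:
    """Map a health path to the HTTP status code it should return."""
    if path == "/health/live":
        # No external calls: if this code runs, the process is responsive.
        return 200
    if path == "/health/startup":
        return 200 if state.startup_complete.is_set() else 503
    if path == "/health/ready":
        ready = state.startup_complete.is_set() and state.check_dependencies()
        return 200 if ready else 503
    return 404


def make_handler(state: AppState):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Ignore any query string when matching the path.
            self.send_response(evaluate(self.path.split("?", 1)[0], state))
            self.end_headers()

        def log_message(self, format, *args):
            pass  # keep probe traffic out of the logs

    return HealthHandler


def serve(state: AppState, port: int = 8080) -> None:
    """Blocking call; bind to all interfaces so the kubelet can reach the pod IP."""
    ThreadingHTTPServer(("0.0.0.0", port), make_handler(state)).serve_forever()
```

The point of routing everything through evaluate is that the liveness branch physically cannot reach the dependency check, so nobody can couple the two by accident.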

Threshold and Timing Settings

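Every probe has five main knobs: initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold, and successThreshold. timeoutSeconds is the easiest to forget: it's how long the kubelet waits for a response before counting the attempt as a failure, and it defaults to just 1 second.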

initialDelaySeconds is mostly obsolete if you're using startup probes. Without them, set it high enough to avoid false failures during startup.

periodSeconds controls how often the probe runs. The default is 10 seconds. For liveness, something between 10 and 30 is usually fine. Checking every second is almost always overkill and adds unnecessary load.

failureThreshold is how many consecutive failures before action is taken. The default is 3. For liveness, this means 3 failures in a row before Kubernetes restarts the container. Raising this gives transient failures more room to recover — useful during garbage collection pauses or brief network hiccups.

successThreshold controls how many consecutive successes are needed before a failed probe counts as passing again. The default is 1, and it has to stay at 1 for liveness and startup probes, so in practice it's a readiness setting. Raising it to 2 or 3 prevents flapping: the pod won't start getting traffic again until it's been consistently ready.
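The two thresholds are easier to reason about as a tiny state machine. This is a simplified model of the consecutive-count logic, not the kubelet's code, and ProbeTracker is a name I made up.

```python
class ProbeTracker:
    """Tracks pass/fail state from a stream of individual probe results."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 1,
                 initially_passing: bool = False):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.passing = initially_passing
        self._successes = 0
        self._failures = 0

    def record(self, success: bool) -> bool:
        """Record one probe result and return the resulting state."""
        if success:
            self._successes += 1
            self._failures = 0  # any success resets the failure streak
            if not self.passing and self._successes >= self.success_threshold:
                self.passing = True
        else:
            self._failures += 1
            self._successes = 0  # any failure resets the success streak
            if self.passing and self._failures >= self.failure_threshold:
                self.passing = False
        return self.passing


tracker = ProbeTracker(failure_threshold=3, success_threshold=2, initially_passing=True)
results = [False, False, True, False, False, False, True, True]
print([tracker.record(r) for r in results])
# [True, True, True, True, True, False, False, True]
```

Two failures followed by a success change nothing, it takes three failures in a row to flip the state, and a single success afterwards isn't enough to flip it back.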

A Working Example for a .NET API

Here's a configuration I've used for a .NET 8 API that connects to PostgreSQL and Redis:

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 20
  periodSeconds: 5

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 15
  failureThreshold: 3
  timeoutSeconds: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 2
  timeoutSeconds: 5

The startup probe gives up to 100 seconds for the app to initialise. Liveness checks every 15 seconds — enough to catch a deadlock without hammering the process. Readiness requires two consecutive successes to restore traffic, which prevents flapping during a brief database reconnection.
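If you want to sanity-check numbers like these before deploying, the arithmetic fits in a few lines. The sketch below gives rough upper bounds only (real timing depends on where in the probe cycle the problem starts), and ProbeTiming is a hypothetical helper whose defaults mirror the Kubernetes ones.

```python
from dataclasses import dataclass


@dataclass
class ProbeTiming:
    period_seconds: int = 10
    failure_threshold: int = 3
    success_threshold: int = 1
    timeout_seconds: int = 1

    def seconds_to_fail(self) -> int:
        """Rough upper bound from the first bad moment to the probe being marked failed."""
        return self.period_seconds * self.failure_threshold + self.timeout_seconds

    def seconds_to_recover(self) -> int:
        """Rough upper bound from recovery to the probe being marked passing again."""
        return self.period_seconds * self.success_threshold


startup = ProbeTiming(period_seconds=5, failure_threshold=20)
liveness = ProbeTiming(period_seconds=15, failure_threshold=3, timeout_seconds=3)
readiness = ProbeTiming(period_seconds=10, failure_threshold=3,
                        success_threshold=2, timeout_seconds=5)

print(startup.period_seconds * startup.failure_threshold)  # 100: startup budget
print(liveness.seconds_to_fail())                          # 48: deadlock to restart
print(readiness.seconds_to_fail())                         # 35: outage to traffic removed
print(readiness.seconds_to_recover())                      # 20: recovery to traffic restored
```

So with this configuration a deadlocked process can sit there for the better part of a minute before it's restarted, which is a trade-off I'm happy with for this service but might not suit yours.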

Probes and Rolling Deployments

Readiness probes are especially important during rolling updates. Kubernetes won't route traffic to a new pod until its readiness probe passes, and the rollout only keeps removing old pods as new ones become ready (within the limits set by maxSurge and maxUnavailable). If the readiness probe is wrong, whether too permissive or not checking the right things, users can hit new pods that aren't actually ready to serve them.

I've seen deployments go sideways because the readiness probe was just a TCP check on the port, which passed the moment the process started, before the app had finished connecting to its dependencies. The fix was switching to an HTTP check on an endpoint that does real dependency validation.
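It's worth seeing how little a TCP check proves. In the sketch below, tcp_probe is my own stand-in for what a TCP probe does (attempt a connection, nothing more), and the listener is a socket that has started listening but never accepts or answers a single request.

```python
import socket


def tcp_probe(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """Succeed if a TCP connection can be opened; no data is sent or read."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


# A socket that listens but never calls accept(): the "app" is not serving anything.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

print(tcp_probe("127.0.0.1", port))  # True, even though nothing handles requests
listener.close()
```

The connection succeeds because the operating system completes the handshake on the app's behalf. That's exactly the window in which the pod was marked ready before it could do any work.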

Debugging Probe Failures

When probes are misbehaving, kubectl describe pod is your first stop. The Events section shows probe failures with details. For readiness, check whether the pod is listed in the Service's endpoints with kubectl get endpoints.

If you're unsure whether your health endpoint is returning what you expect, exec into the pod and curl it directly:

kubectl exec -it <pod-name> -- curl -v http://localhost:8080/health/ready

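That takes Services and ingress out of the equation and shows you the exact status code and body the endpoint returns. Two caveats: the image needs curl installed, and the kubelet probes the pod IP rather than localhost, so an app that listens only on 127.0.0.1 will pass this check and still fail the real probe.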

Worth Getting Right

Probes are a small configuration surface with a big impact. A misconfigured liveness probe can turn a minor database blip into a full restart cascade. A missing readiness probe means users hit pods that aren't actually ready. A too-tight startup threshold means your service never successfully deploys on a slow node.

Getting them right doesn't take long once you understand what each probe is for. The key is keeping liveness and readiness concerns separate, using startup probes for anything that doesn't start instantly, and testing the health endpoints you actually configure — not just assuming they work.