The difference between a self-healing cluster and a CrashLoopBackOff spiral is three YAML fields most engineers configure wrong.
1. Introduction
Kubernetes is designed to be self-healing. When a container crashes, it restarts. When a node fails, Pods reschedule. When a deployment rolls out, traffic shifts gradually. But none of this intelligence is automatic — it depends entirely on Kubernetes knowing the truth about your application’s health at every moment.
That truth comes from probes.
Without probes, Kubernetes operates blind. It assumes your application is healthy the moment the container process starts. It routes live traffic to Pods that are still warming up. It keeps running Pods that are deadlocked, out of memory, or silently serving 500 errors to every request. It marks rollouts as successful before a single request has been successfully handled.
With correctly configured probes, the picture changes entirely. Kubernetes knows when a slow-starting JVM application is actually ready to serve. It detects a deadlocked goroutine pool and restarts the container. It removes a Pod from the load balancer during a scheduled maintenance window and adds it back when the operation completes. It blocks a broken deployment from progressing until new Pods prove they can handle traffic.
This guide provides a complete mechanical understanding of Kubernetes health checks: the three probe types, the four check methods, every configuration parameter, the container startup sequence, the interaction with Service endpoints, and the production failure patterns that result from misconfiguration.
2. Overview of Kubernetes Health Checks
Kubernetes provides three distinct probe types, each serving a different role in the container lifecycle. Engineers who treat them as interchangeable — or who configure only one — are leaving significant reliability on the table.
startupProbe
Question answered: “Has the application finished starting up?”
The startupProbe runs before any other probe. While it is active, both livenessProbe and readinessProbe are suspended. Its sole purpose is to give slow-starting applications sufficient time to initialize without triggering premature liveness restarts.
Once startupProbe succeeds for the first time, it stops running entirely. It does not repeat.
livenessProbe
Question answered: “Is the application still alive and worth keeping?”
The livenessProbe runs continuously throughout the container's lifetime. When it fails beyond the configured failureThreshold, kubelet kills and restarts the container. This is the mechanism behind Kubernetes self-healing.
Use it to detect states the application cannot recover from on its own: deadlocks, memory corruption, infinite loops, exhausted thread pools with no ability to drain.
readinessProbe
Question answered: “Is the application ready to receive traffic right now?”
The readinessProbe also runs continuously. When it fails, Kubernetes removes the Pod's IP from the Service endpoints. Traffic stops reaching the Pod. The container is not restarted. When the probe passes again, the Pod is re-added to the endpoint list and traffic resumes.
Use it to signal temporary unavailability: cache warming, database connection establishment, downstream dependency degradation, or scheduled maintenance.
Probe Role Summary
- startupProbe — runs from container start until its first success. On failure (× failureThreshold): container restarted. On success: probe stops, liveness/readiness begin. Repeats: no (stops after first success).
- livenessProbe — runs continuously after startup succeeds. On failure (× failureThreshold): container killed and restarted. On success: no action. Repeats: yes, every periodSeconds.
- readinessProbe — runs continuously after startup succeeds. On failure: Pod removed from Service endpoints. On success: Pod added to Service endpoints. Repeats: yes, every periodSeconds.
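The summary above can be sketched as a toy state machine. This is a deliberate simplification of kubelet's real logic (it ignores thresholds on the readiness side and treats readiness as a direct flag); the class and method names are illustrative, not Kubernetes API:

```python
class ProbeSimulator:
    """Toy model of kubelet's probe handling for one container."""
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.startup_done = False
        self.ready = False            # is the Pod in Service endpoints?
        self.restarts = 0
        self._liveness_failures = 0

    def startup_result(self, ok):
        if self.startup_done:
            return                    # startupProbe never repeats after success
        if ok:
            self.startup_done = True  # liveness/readiness may now begin

    def liveness_result(self, ok):
        assert self.startup_done, "liveness is suspended during startup"
        if ok:
            self._liveness_failures = 0
        else:
            self._liveness_failures += 1
            if self._liveness_failures >= self.failure_threshold:
                self.restarts += 1          # container killed + restarted
                self.startup_done = False   # sequence repeats from Phase 1
                self.ready = False
                self._liveness_failures = 0

    def readiness_result(self, ok):
        assert self.startup_done, "readiness is suspended during startup"
        self.ready = ok               # endpoint membership tracks the result

sim = ProbeSimulator()
sim.startup_result(False)             # slow start: failure, no restart yet
sim.startup_result(True)              # startup complete, startupProbe stops
sim.readiness_result(True)            # Pod added to endpoints
for _ in range(3):
    sim.liveness_result(False)        # deadlock detected
print(sim.restarts, sim.ready)  # → 1 False
```

Running the scenario shows the key asymmetry: three liveness failures restart the container, and the restart also pulls the Pod out of the endpoint list until the next startup/readiness cycle completes.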
3. How Kubernetes Executes Probes
Probes are executed by kubelet — the node agent running on every Kubernetes worker node. The API server does not run probes. The scheduler does not run probes. kubelet owns the entire probe lifecycle.
┌──────────────────────────────────────────────────────────┐
│ NODE │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ kubelet │ │
│ │ │ │
│ │ Probe scheduler (per container, per probe) │ │
│ │ │ │ │
│ │ ├── httpGet ──▶ HTTP request to container│ │
│ │ ├── tcpSocket ──▶ TCP dial to container │ │
│ │ ├── exec ──▶ command inside container │ │
│ │ └── grpc ──▶ gRPC health check │ │
│ │ │ │
│ │ Result: Success / Failure / Unknown │ │
│ │ │ │ │
│ │ ├── Update Pod status conditions │ │
│ │ ├── Report to API server │ │
│ │ └── Take action (restart / endpoint mgmt) │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Container Runtime (containerd / CRI-O) │ │
│ │ Runs exec probes, manages container processes │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘

kubelet tracks three possible probe states:
- Success — the check passed
- Failure — the check failed (counts toward failureThreshold)
- Unknown — the check could not be completed (treated as a failure for restart purposes)
Probe results are reflected in the Pod’s status.conditions and influence the Ready condition that controls Service endpoint membership.
4. Probe Types: Check Methods
4.1 HTTP Get (httpGet)
kubelet sends an HTTP GET request to the container. Any response with status code 200–399 is a success. Anything else — including network errors, timeouts, and 4xx/5xx responses — is a failure.
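The success rule reduces to a single range check; a small illustrative helper:

```python
def http_probe_ok(status_code: int) -> bool:
    """kubelet treats any 2xx or 3xx response as probe success."""
    return 200 <= status_code < 400

# Only the first three of these status codes pass the probe
print([c for c in (200, 301, 399, 400, 404, 500, 503) if http_probe_ok(c)])
# → [200, 301, 399]
```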
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
    scheme: HTTP
    httpHeaders:
      - name: X-Health-Check
        value: kubelet

Best for: Web services, REST APIs, any HTTP server. The most common probe type in production.
Important: kubelet performs the HTTP check from the node, targeting the Pod's IP address rather than loopback. An application that binds only to 127.0.0.1 will fail the probe even though it answers a local curl from inside the container — make sure the health endpoint listens on the container's network interface (e.g. 0.0.0.0).
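The binding requirement is easy to demonstrate with a minimal stdlib sketch: the server binds 0.0.0.0 so it is reachable on any interface, the way a probe reaching the Pod IP would need. The /health/live path and handler are illustrative:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        code = 200 if self.path == "/health/live" else 404
        self.send_response(code)
        self.end_headers()

    def log_message(self, *args):   # keep the demo quiet
        pass

# Bind all interfaces so the probe can reach the Pod IP, not just loopback
server = HTTPServer(("0.0.0.0", 0), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
status = urllib.request.urlopen(f"http://127.0.0.1:{port}/health/live").status
print(status)  # → 200
server.shutdown()
```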
4.2 TCP Socket (tcpSocket)
kubelet attempts to open a TCP connection to the specified port. If the connection is established, the probe succeeds. The connection is immediately closed — no data is sent or received.
readinessProbe:
  tcpSocket:
    port: 5432

Best for: Databases, message brokers, and any service that speaks a binary protocol rather than HTTP. Use this for PostgreSQL, MySQL, Redis, Kafka, and similar workloads where an HTTP endpoint is not available.
Limitation: TCP success only means the port is open and accepting connections. It does not validate that the application is actually processing requests correctly.
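The limitation is easy to demonstrate: a bare listening socket with no application logic behind it still passes a TCP dial. A sketch:

```python
import socket

# A "server" that accepts connections but has no application logic at all
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen()
port = listener.getsockname()[1]

# This is all a tcpSocket probe does: dial the port, then close immediately
try:
    socket.create_connection(("127.0.0.1", port), timeout=1).close()
    tcp_probe_ok = True
except OSError:
    tcp_probe_ok = False

print(tcp_probe_ok)  # → True, yet no request could ever be served
listener.close()
```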
4.3 Command Execution (exec)
kubelet executes a command inside the container. Exit code 0 is success. Any non-zero exit code is failure.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - "pg_isready -U postgres -h localhost"

Best for: Databases and legacy applications without HTTP endpoints, custom health logic that cannot be expressed as a network check, or verifying filesystem state (e.g., checking a PID file exists).
Warning: exec probes spawn a new process inside the container for every check. With a short periodSeconds on CPU-constrained containers, this overhead accumulates. Avoid exec probes for high-frequency checks on resource-limited workloads.
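Exec-probe semantics reduce to the command's exit code; a sketch using subprocess, where the two commands are stand-ins for a real health command like pg_isready:

```python
import subprocess
import sys

def exec_probe(command: list) -> bool:
    """Exit code 0 = success; any non-zero exit code = failure."""
    return subprocess.run(command).returncode == 0

healthy = exec_probe([sys.executable, "-c", "raise SystemExit(0)"])
broken = exec_probe([sys.executable, "-c", "raise SystemExit(2)"])
print(healthy, broken)  # → True False
```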
4.4 gRPC (grpc)
Uses the standard gRPC Health Checking Protocol. kubelet calls the grpc.health.v1.Health/Check RPC. A SERVING response is success.
livenessProbe:
  grpc:
    port: 50051
    service: "myapp.Service"

Best for: gRPC-native microservices. Requires the application to implement the gRPC health protocol, which most major gRPC frameworks support natively.
Note: gRPC probes require Kubernetes 1.24+, where the GRPCContainerProbe feature gate is enabled by default (the feature reached GA in 1.27).
Probe Method Comparison
- httpGet — HTTP/HTTPS; web services, REST APIs; validates app logic (if the endpoint is meaningful); low overhead.
- tcpSocket — TCP; databases, binary protocols; does not validate app logic (port open only); very low overhead.
- exec — process execution; legacy apps, custom checks; validates app logic (if the command is meaningful); medium overhead.
- grpc — gRPC; gRPC microservices; validates app logic; low overhead.
5. Probe Parameters and Configuration Options
Core Parameters
- initialDelaySeconds (default 0) — seconds to wait after container start before the first probe.
- periodSeconds (default 10) — how often, in seconds, to run the probe.
- timeoutSeconds (default 1) — seconds after which the probe times out; a timeout counts as a failure.
- successThreshold (default 1) — minimum consecutive successes for the probe to be considered passing.
- failureThreshold (default 3) — consecutive failures before action is taken (restart or endpoint removal).
- terminationGracePeriodSeconds (defaults to the Pod-level value, 30) — probe-level override for the grace period on liveness failure.
How Thresholds Work Together
The total time before Kubernetes acts on a failing liveness probe:
initialDelaySeconds + (failureThreshold × periodSeconds)

Example with defaults (initialDelaySeconds: 0, failureThreshold: 3, periodSeconds: 10):

0 + (3 × 10) = 30 seconds before container restart

httpGet Fields
- path — optional (default: /); URL path to request (e.g., /health).
- port — required; port number or named port.
- scheme — optional (default: HTTP); HTTP or HTTPS.
- host — optional (default: Pod IP); override hostname for the request.
- httpHeaders — optional; custom headers as a list of {name, value} pairs.
tcpSocket Fields
- port — required; port number or named port to dial.
- host — optional (default: Pod IP); override host address.
exec Fields
- command — required; command and args as a string array, executed directly rather than through a shell.
grpc Fields
- port — required; port number of the gRPC server.
- service — optional; service name passed to the Health/Check RPC.
Parameter Configuration Example
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15   # Wait 15s after container start
  periodSeconds: 20         # Check every 20s
  timeoutSeconds: 5         # Fail if no response in 5s
  failureThreshold: 3       # Restart after 3 consecutive failures
  successThreshold: 1       # 1 success to consider healthy

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
  successThreshold: 2       # Require 2 consecutive successes before re-adding to endpoints
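Given parameters like the ones above, the worst-case time before kubelet acts can be sanity-checked with a small helper (a sketch; the argument names mirror the YAML fields):

```python
def seconds_until_restart(initial_delay=0, failure_threshold=3, period=10):
    """Worst-case time from container start to liveness-triggered restart."""
    return initial_delay + failure_threshold * period

print(seconds_until_restart())                # defaults → 30
print(seconds_until_restart(initial_delay=15,
                            failure_threshold=3,
                            period=20))       # the liveness config above → 75
```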
6. Startup Sequence of a Kubernetes Container
Understanding the order of operations is critical for correct probe configuration. Many production issues stem from engineers assuming all three probes run simultaneously from container start.
Container Created
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PHASE 1: STARTUP │
│ │
│ startupProbe runs (if configured) │
│ livenessProbe ──── SUSPENDED │
│ readinessProbe ──── SUSPENDED │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ startupProbe polls every periodSeconds │ │
│ │ │ │
│ │ Failure × failureThreshold ──▶ Container RESTART │ │
│ │ First Success ──────────────▶ Phase 2 begins │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼ (startupProbe passes or not configured)
┌─────────────────────────────────────────────────────────────┐
│ PHASE 2: RUNNING (both probes active simultaneously) │
│ │
│ readinessProbe ──── polls every periodSeconds │
│ │ │
│ ├── Failure: Pod IP removed from Service endpoints │
│ │ Pod stays running, no restart │
│ └── Success: Pod IP added to Service endpoints │
│ Pod receives traffic │
│ │
│ livenessProbe ──── polls every periodSeconds │
│ │ │
│ ├── Failure × failureThreshold: Container RESTARTED │
│ └── Success: No action │
└─────────────────────────────────────────────────────────────┘
│
▼ (livenessProbe triggers restart)
Container Terminated → terminationGracePeriodSeconds → New Container
│
└──▶ Sequence repeats from Phase 1

Key Timing Insight
Without startupProbe, a slow-starting application faces this dangerous window:
t=0 Container starts, JVM begins loading
t=10 livenessProbe fires (first check) → app not ready → FAILURE 1
t=20 livenessProbe fires → app still loading → FAILURE 2
t=30 livenessProbe fires → app still loading → FAILURE 3 → RESTART
t=0 Container restarts. Loop repeats forever. CrashLoopBackOff.

With startupProbe (failureThreshold: 30, periodSeconds: 10 = 300s budget):
t=0 Container starts, JVM begins loading
t=10 startupProbe fires → not ready → failure 1 of 30 (no restart)
...
t=120 startupProbe fires → app ready → SUCCESS → startup complete
t=130 livenessProbe and readinessProbe begin

7. Interaction with Kubernetes Networking
The readinessProbe has a direct and immediate effect on traffic routing through Kubernetes Services.
When a Service selects Pods by label, it maintains an Endpoints object (or EndpointSlice in modern clusters) listing the IP addresses of all Pods currently eligible to receive traffic. The endpoint controller watches Pod Ready conditions and updates this list continuously.
┌───────────────────────────────────────────────────────────────┐
│ TRAFFIC ROUTING FLOW │
│ │
│ Client Request │
│ │ │
│ ▼ │
│ ┌──────────┐ selector: app=myapp │
│ │ Service │ ──────────────────────────────────────────┐ │
│ └──────────┘ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐│
│ │ EndpointSlice ││
│ │ addresses: [10.0.0.1, 10.0.0.3] ← only Ready Pods ││
│ │ NOT included: 10.0.0.2 ← readinessProbe failing ││
│ └──────────────────────────────────────────────────────────┘│
│ │
│ Pod 10.0.0.1 Ready: true ← receives traffic │
│ Pod 10.0.0.2 Ready: false ← excluded from endpoints │
│ Pod 10.0.0.3 Ready: true ← receives traffic │
└───────────────────────────────────────────────────────────────┘

During Rolling Updates
This mechanism is what makes zero-downtime rolling updates possible:
1. New Pod starts → readinessProbe not yet passing → Pod excluded from endpoints
2. New Pod passes readiness → added to endpoints → traffic begins routing to it
3. Old Pod receives SIGTERM → readinessProbe can immediately fail → removed from endpoints
4. Old Pod drains in-flight requests during terminationGracePeriodSeconds
5. Old Pod process exits
Because the new Pod joins the endpoint list (step 2) before the old Pod leaves it (step 3), traffic is always covered by at least one Ready Pod.
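The invariant above can be sketched as a toy rollout: the old Pod leaves the endpoint list only after the new Pod has joined it, so every intermediate state has at least one member. The function and Pod IPs are illustrative, not the real endpoint controller:

```python
def rolling_update(endpoints, old_pod, new_pod):
    """Simulate one replacement step of a zero-downtime rollout."""
    history = []
    # 1. New Pod starts; readiness not yet passing → excluded from endpoints
    history.append(list(endpoints))
    # 2. New Pod passes readiness → added to endpoints
    endpoints.append(new_pod)
    history.append(list(endpoints))
    # 3-5. Old Pod gets SIGTERM, fails readiness, drains, and exits
    endpoints.remove(old_pod)
    history.append(list(endpoints))
    return history

steps = rolling_update(["10.0.0.1"], old_pod="10.0.0.1", new_pod="10.0.0.4")
assert all(steps), "every intermediate state has at least one Ready Pod"
print(steps)  # → [['10.0.0.1'], ['10.0.0.1', '10.0.0.4'], ['10.0.0.4']]
```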
Readiness Gates
For advanced use cases, ReadinessGates allow external systems to contribute to a Pod's readiness condition. A service mesh or custom controller can set a condition that Kubernetes includes in the overall readiness evaluation — useful for ensuring a sidecar proxy is fully initialized before the Pod receives traffic.
8. Common Production Problems
Problem 1: CrashLoopBackOff from Aggressive livenessProbe
Scenario: A Spring Boot application with a 45-second startup time. livenessProbe configured with initialDelaySeconds: 10, failureThreshold: 3, periodSeconds: 10.
t=10 liveness fires → app loading → FAIL 1
t=20 liveness fires → app loading → FAIL 2
t=30 liveness fires → app loading → FAIL 3 → RESTART
t=0 Container restarts → repeat → CrashLoopBackOff

Fix: Add startupProbe with sufficient budget to cover worst-case startup time.
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  failureThreshold: 30   # 30 × 10s = 5 minutes budget
  periodSeconds: 10

Problem 2: Traffic Sent to Unready Pods
Scenario: No readinessProbe configured. Rolling update deploys new Pods. New Pods are added to Service endpoints immediately on container start, before the application finishes initializing. Clients receive connection refused or 503 errors for 20–40 seconds.
Fix: Always configure readinessProbe. Separate it from livenessProbe — use a dedicated /health/ready endpoint that checks downstream dependencies.
Problem 3: readinessProbe Takes Down Healthy Pods
Scenario: readinessProbe calls an endpoint that checks a third-party payment API. The payment API has a 5-minute outage. All Pods fail readiness and are removed from endpoints. The application is completely unavailable even though it could serve non-payment requests.
Fix: Readiness probes should check local application health, not external dependency health. External dependency checks belong in application-level circuit breakers, not Kubernetes probes.
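One hedged way to structure this in application code: readiness reflects only local state, while the external dependency is tracked by a circuit-breaker-style flag that degrades the feature instead of failing the probe. All names here are illustrative:

```python
class App:
    def __init__(self):
        self.cache_warmed = False    # local state: drives readiness
        self.payment_api_up = True   # external state: circuit breaker only

    def ready(self) -> bool:
        """Readiness checks local health only, never external dependencies."""
        return self.cache_warmed

    def handle_payment(self) -> str:
        if not self.payment_api_up:  # circuit breaker: degrade, don't die
            return "payments temporarily unavailable"
        return "payment processed"

app = App()
app.cache_warmed = True
app.payment_api_up = False           # the 5-minute upstream outage
print(app.ready(), app.handle_payment())
# → True payments temporarily unavailable
```

The Pod stays in the endpoint list and keeps serving every non-payment request for the duration of the outage.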
Problem 4: Network Delays Causing Probe Flapping
Scenario: timeoutSeconds: 1 on a probe calling an endpoint that occasionally takes 1.2 seconds under load. Probes intermittently fail and succeed, causing Pods to flap in and out of Service endpoints. Clients experience intermittent errors.
Fix: Set timeoutSeconds to a realistic value based on observed p99 response times at peak load. Use successThreshold: 2 on readinessProbe to require consistent success before re-adding to endpoints.
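The damping effect of successThreshold can be simulated: with a threshold of 2, a single lucky success between failures is not enough to re-add the Pod. This is a simplified sketch of the threshold bookkeeping, not kubelet's actual implementation:

```python
def endpoint_membership(results, success_threshold=2, failure_threshold=3):
    """Track Ready state over a sequence of probe results (True/False)."""
    ready, successes, failures, timeline = True, 0, 0, []
    for ok in results:
        if ok:
            successes, failures = successes + 1, 0
            if not ready and successes >= success_threshold:
                ready = True          # re-added to endpoints
        else:
            failures, successes = failures + 1, 0
            if ready and failures >= failure_threshold:
                ready = False         # removed from endpoints
        timeline.append(ready)
    return timeline

# A flapping probe: one isolated success does not restore membership
flappy = [False, False, False, True, False, False, False, True, True]
print(endpoint_membership(flappy, success_threshold=2))
# → [True, True, False, False, False, False, False, False, True]
```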
Problem 5: Misconfigured Port in Probe
Scenario: Application listens on port 8080 for application traffic, port 9090 for metrics and health. livenessProbe configured with port: 8080 pointing at the app port, but the /health path is only served on 9090.
Symptom: Probe always returns 404. Container restarts continuously.
Fix: Always verify probe port and path match exactly. Use named ports to avoid numeric port confusion:
ports:
  - name: http
    containerPort: 8080
  - name: health
    containerPort: 9090

livenessProbe:
  httpGet:
    path: /health/live
    port: health   # uses named port, less error-prone
9. Debugging Health Checks
Identify Probe Failures
# Show probe configuration and recent events
kubectl describe pod <pod-name>

# In the Events section, look for:
# Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
# Warning Unhealthy Readiness probe failed: dial tcp: connection refused
Check Pod Conditions
kubectl get pod <pod-name> -o jsonpath='{.status.conditions}' | jq .
# Look for:
# type: Ready, status: "False", reason: ContainersNotReady
# type: ContainersReady, status: "False"

Watch Events in Real Time
kubectl get events --sort-by='.metadata.creationTimestamp' -n <namespace> -w
# Filter for probe-related:
kubectl get events -n <namespace> | grep -i "unhealthy\|probe"

Check Container Logs Around Restart Time
# Logs from the previous (crashed) container instance
kubectl logs <pod-name> --previous

# Logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name> --previous
Verify Endpoint Membership
# Check if pod is in service endpoints
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>

# Check EndpointSlices (modern clusters)
kubectl get endpointslices -l kubernetes.io/service-name=<service-name>
Monitor Rollout Health
kubectl rollout status deployment/<deployment-name>
# "Waiting for deployment rollout to finish: 1 out of 3 new replicas have been updated"
# If stuck: new pods are failing readiness

kubectl get rs -l app=<app-label>
# If new RS shows READY < DESIRED, readiness probe is blocking progression
Manual Probe Testing
Test your probe endpoints directly from inside the cluster to rule out networking issues:
# Exec into a pod and test the health endpoint manually
kubectl exec -it <pod-name> -- curl -v http://localhost:8080/health/ready

# Or use a debug pod
kubectl run debug --image=curlimages/curl -it --rm -- \
  curl http://<pod-ip>:8080/health/ready
10. Best Practices for Production Systems
Use Three Separate Endpoints
Do not route all probes to the same URL. Each probe has a different semantic purpose and should query different aspects of application state:
GET /health/startup → Is initialization complete?
GET /health/live → Is the process alive and not deadlocked?
GET /health/ready   → Is the app ready to handle requests?

Always Use startupProbe for JVM, Python, and Large Runtimes
JVM warmup, Python import chains, and applications loading large ML models all have startup times incompatible with aggressive liveness timeouts. A 300-second startup budget via startupProbe with failureThreshold: 30, periodSeconds: 10 is safer and more explicit than inflating initialDelaySeconds on livenessProbe.
Set timeoutSeconds Based on Observed Latency
timeoutSeconds: 1 (the default) is dangerously low for any endpoint that touches a database or makes a downstream call. Measure your health endpoint's p99 latency under load and set timeoutSeconds to at least 2× that value.
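One hedged way to derive the setting from measurements, using statistics.quantiles on observed health-endpoint latencies (the sample distribution here is synthetic):

```python
import math
import statistics

def recommended_timeout(latencies_ms):
    """timeoutSeconds ≈ 2 × p99 latency, rounded up to a whole second."""
    p99_ms = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile
    return max(1, math.ceil(2 * p99_ms / 1000))

# 1000 synthetic samples: mostly fast, with a slow tail under load
samples = [120] * 950 + [900] * 40 + [1400] * 10
print(recommended_timeout(samples))  # → 3  (p99 ≈ 1.4s → 2.8s → rounds to 3)
```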
Readiness ≠ Liveness
Never use the same logic for both probes. A Pod that is temporarily not ready (waiting for a cache warm, holding a maintenance mode flag) should not be restarted. A Pod that has not been ready for 10 minutes probably should be. These are different conditions requiring different probes.
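A minimal sketch of that separation in application code: liveness answers "can the process still make progress" (here approximated by acquiring an internal lock), while readiness additionally consults a maintenance flag. The class, lock check, and flag are illustrative design choices, not a prescribed pattern:

```python
import threading

class HealthState:
    """Liveness and readiness answer different questions."""
    def __init__(self):
        self.maintenance_mode = False
        self._lock = threading.Lock()

    def live(self) -> bool:
        # Alive = the process can still make progress (lock is acquirable,
        # i.e. no component is holding it forever in a deadlock)
        if not self._lock.acquire(timeout=1):
            return False
        self._lock.release()
        return True

    def ready(self) -> bool:
        # Ready = alive AND willing to take traffic right now
        return self.live() and not self.maintenance_mode

state = HealthState()
state.maintenance_mode = True
print(state.live(), state.ready())  # → True False: drained, not restarted
```

Returning live=True with ready=False is exactly the "remove from endpoints but do not restart" behavior the maintenance scenario needs.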
PodDisruptionBudget + readinessProbe Together
A PodDisruptionBudget protects against too many Pods being unavailable at once. The readinessProbe dynamically removes unavailable Pods from traffic. Together, they prevent both voluntary disruptions (node drains) and organic degradation from routing traffic to broken Pods simultaneously.
Monitor Probe Metrics
kube-state-metrics exposes kube_pod_container_status_ready and probe-related metrics. Set alerts for:
- Pod Ready: false for more than 5 minutes → immediate page
- Container restart count increasing → alert at more than 2 restarts in 10 minutes
- Probe failure events in a namespace → alert at more than 5 per minute
- Endpoint count drops below minimum → alert when below the PDB's minAvailable
11. Real Configuration Examples
Web Service (REST API)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.2.0
          ports:
            - name: http
              containerPort: 8080
            - name: management
              containerPort: 9090
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: management
            failureThreshold: 20
            periodSeconds: 5        # 100 second startup budget
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: management
            initialDelaySeconds: 0  # startupProbe handles delay
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: management
            initialDelaySeconds: 0
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 2

PostgreSQL Database
containers:
  - name: postgres
    image: postgres:15
    ports:
      - containerPort: 5432
    startupProbe:
      exec:
        command:
          - /bin/sh
          - -c
          - "pg_isready -U $POSTGRES_USER -d $POSTGRES_DB"
      failureThreshold: 30
      periodSeconds: 5    # 150 second startup budget
    livenessProbe:
      exec:
        command:
          - /bin/sh
          - -c
          - "pg_isready -U $POSTGRES_USER"
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      exec:
        command:
          - /bin/sh
          - -c
          - "pg_isready -U $POSTGRES_USER -d $POSTGRES_DB"
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
      successThreshold: 1

gRPC Microservice
containers:
  - name: grpc-service
    image: registry.example.com/grpc-service:v2.0.0
    ports:
      - name: grpc
        containerPort: 50051
    startupProbe:
      grpc:
        port: 50051
        service: "grpc.health.v1.Health"
      failureThreshold: 15
      periodSeconds: 5
    livenessProbe:
      grpc:
        port: 50051
        service: "grpc.health.v1.Health"
      periodSeconds: 20
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      grpc:
        port: 50051
        service: "myapp.PaymentService"   # service-specific check
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
      successThreshold: 1

Slow-Starting Application (ML Model Server)
containers:
  - name: model-server
    image: registry.example.com/model-server:v1.0.0
    ports:
      - containerPort: 8501
    startupProbe:
      httpGet:
        path: /v1/models/mymodel   # model must be loaded
        port: 8501
      failureThreshold: 60
      periodSeconds: 10            # 600 second (10 min) startup budget
      timeoutSeconds: 10
    livenessProbe:
      tcpSocket:
        port: 8501                 # just verify process is alive
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /v1/models/mymodel:predict
        port: 8501
      periodSeconds: 15
      timeoutSeconds: 10
      failureThreshold: 2
      successThreshold: 1

Redis Cache
containers:
  - name: redis
    image: redis:7-alpine
    ports:
      - containerPort: 6379
    livenessProbe:
      exec:
        command:
          - redis-cli
          - ping
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    readinessProbe:
      exec:
        command:
          - redis-cli
          - ping
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 3
      successThreshold: 1

12. Conclusion
Kubernetes health probes are not configuration boilerplate — they are the nervous system of your cluster’s self-healing capability. Every meaningful Kubernetes behavior that engineers rely on in production — zero-downtime rollouts, automatic restarts, traffic routing, safe node drains — depends on probes being configured correctly.
The three probes serve three fundamentally different purposes. startupProbe buys slow applications the time they need to initialize without triggering false positive restarts. livenessProbe detects unrecoverable application states and triggers container restarts. readinessProbe dynamically manages traffic routing based on real-time application availability. Conflating their roles or omitting any of them creates reliability gaps that manifest as CrashLoopBackOff spirals, traffic sent to unready Pods, and deployment rollouts that block or cascade incorrectly.
The engineers who get this right share a common habit: they treat probe configuration as application-specific, not generic. They measure actual startup times and set budgets accordingly. They build dedicated health endpoints with clear semantics. They test probes under realistic load before deploying to production. They alert on probe failure rates and Pod readiness conditions as first-class signals.
Configure probes thoughtfully, and Kubernetes delivers on its self-healing promise. Treat them as an afterthought, and you will configure them in production at 3 AM under pressure — which is the worst possible time to learn how they work.
Quick Reference: Probe Configuration Checklist
□ startupProbe configured for any app with startup time > 30s
□ livenessProbe checks process health only (no external dependencies)
□ readinessProbe checks local app readiness only (no external dependencies)
□ Separate health endpoints for liveness vs readiness
□ timeoutSeconds > p99 latency of health endpoint under load
□ failureThreshold × periodSeconds > acceptable flap window
□ successThreshold ≥ 2 on readinessProbe for stability
□ Named ports used in probe configuration
□ Probe endpoints tested manually before production deployment
□ Probe failure alerts configured in monitoring

Article maintained at doc.thedevops.dev | Last updated: March 2026