The difference between a self-healing cluster and a CrashLoopBackOff spiral is three YAML fields most engineers configure wrong.

1. Introduction

Kubernetes is designed to be self-healing. When a container crashes, it restarts. When a node fails, Pods reschedule. When a deployment rolls out, traffic shifts gradually. But none of this intelligence is automatic — it depends entirely on Kubernetes knowing the truth about your application’s health at every moment.

That truth comes from probes.

Without probes, Kubernetes operates blind. It assumes your application is healthy the moment the container process starts. It routes live traffic to Pods that are still warming up. It keeps running Pods that are deadlocked, out of memory, or silently serving 500 errors to every request. It marks rollouts as successful before a single request has been successfully handled.

With correctly configured probes, the picture changes entirely. Kubernetes knows when a slow-starting JVM application is actually ready to serve. It detects a deadlocked goroutine pool and restarts the container. It removes a Pod from the load balancer during a scheduled maintenance window and adds it back when the operation completes. It blocks a broken deployment from progressing until new Pods prove they can handle traffic.

This guide provides a complete mechanical understanding of Kubernetes health checks: the three probe types, the four check methods, every configuration parameter, the container startup sequence, the interaction with Service endpoints, and the production failure patterns that result from misconfiguration.


2. Overview of Kubernetes Health Checks

Kubernetes provides three distinct probe types, each serving a different role in the container lifecycle. Engineers who treat them as interchangeable — or who configure only one — are leaving significant reliability on the table.

startupProbe

Question answered: “Has the application finished starting up?”

The startupProbe runs before any other probe. While it is active, both livenessProbe and readinessProbe are suspended. Its sole purpose is to give slow-starting applications sufficient time to initialize without triggering premature liveness restarts.

Once startupProbe succeeds for the first time, it stops running entirely. It does not repeat.
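
A minimal sketch (endpoint, port, and budget are illustrative; size the budget to your app's worst-case startup):

startupProbe: 
  httpGet: 
    path: /health/startup    # assumed endpoint; use whatever your app exposes 
    port: 8080 
  failureThreshold: 30       # tolerate up to 30 failed checks 
  periodSeconds: 10          # one check every 10s → 300s startup budget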

livenessProbe

Question answered: “Is the application still alive and worth keeping?”

The livenessProbe runs continuously throughout the container's lifetime. When it fails beyond the configured failureThreshold, kubelet kills and restarts the container. This is the mechanism behind Kubernetes self-healing.

Use it to detect states the application cannot recover from on its own: deadlocks, memory corruption, infinite loops, exhausted thread pools with no ability to drain.

readinessProbe

Question answered: “Is the application ready to receive traffic right now?”

The readinessProbe also runs continuously. When it fails, Kubernetes removes the Pod's IP from the Service endpoints. Traffic stops reaching the Pod. The container is not restarted. When the probe passes again, the Pod is re-added to the endpoint list and traffic resumes.

Use it to signal temporary unavailability: cache warming, database connection establishment, downstream dependency degradation, or scheduled maintenance.

Probe Role Summary

| Probe | Runs When | On Failure | On Success | Repeats |
|-------|-----------|------------|------------|---------|
| startupProbe | Container start, until first success | Container restarted | Probe stops; liveness/readiness begin | No (stops after first success) |
| livenessProbe | After startup succeeds, continuously | Container killed + restarted | No action | Yes, every periodSeconds |
| readinessProbe | After startup succeeds, continuously | Pod removed from Service endpoints | Pod added to Service endpoints | Yes, every periodSeconds |


3. How Kubernetes Executes Probes

Probes are executed by kubelet — the node agent running on every Kubernetes worker node. The API server does not run probes. The scheduler does not run probes. kubelet owns the entire probe lifecycle.

┌──────────────────────────────────────────────────────────┐ 
│                        NODE                              │ 
│                                                          │ 
│   ┌─────────────────────────────────────────────────┐   │ 
│   │                    kubelet                      │   │ 
│   │                                                 │   │ 
│   │  Probe scheduler (per container, per probe)     │   │ 
│   │       │                                         │   │ 
│   │       ├── httpGet  ──▶ HTTP request to container│   │ 
│   │       ├── tcpSocket ──▶ TCP dial to container   │   │ 
│   │       ├── exec    ──▶ command inside container  │   │ 
│   │       └── grpc    ──▶ gRPC health check         │   │ 
│   │                                                 │   │ 
│   │  Result: Success / Failure / Unknown            │   │ 
│   │       │                                         │   │ 
│   │       ├── Update Pod status conditions          │   │ 
│   │       ├── Report to API server                  │   │ 
│   │       └── Take action (restart / endpoint mgmt) │   │ 
│   └─────────────────────────────────────────────────┘   │ 
│                                                          │ 
│   ┌──────────────────────────────────────────────────┐  │ 
│   │  Container Runtime (containerd / CRI-O)          │  │ 
│   │  Runs exec probes, manages container processes   │  │ 
│   └──────────────────────────────────────────────────┘  │ 
└──────────────────────────────────────────────────────────┘

kubelet tracks three possible probe states:

  • Success — the check passed
  • Failure — the check failed (counts toward failureThreshold)
  • Unknown — the check could not be completed (treated as failure for restart purposes)

Probe results are reflected in the Pod’s status.conditions and influence the Ready condition that controls Service endpoint membership.
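
For example, a Pod whose readinessProbe is failing typically reports conditions like these (abridged; exact fields vary):

status: 
  conditions: 
    - type: ContainersReady 
      status: "False" 
      reason: ContainersNotReady 
    - type: Ready 
      status: "False" 
      reason: ContainersNotReady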


4. Probe Types: Check Methods

4.1 HTTP Get (httpGet)

kubelet sends an HTTP GET request to the container. Any response with status code 200–399 is a success. Anything else — including network errors, timeouts, and 4xx/5xx responses — is a failure.

livenessProbe: 
  httpGet: 
    path: /health/live 
    port: 8080 
    scheme: HTTP 
    httpHeaders: 
      - name: X-Health-Check 
        value: kubelet

Best for: Web services, REST APIs, any HTTP server. The most common probe type in production.

Important: The HTTP check is performed by kubelet on the node, which connects to the Pod's IP, not to localhost inside the container. An application that binds its health endpoint only to the loopback interface (127.0.0.1) will fail httpGet probes even though it responds locally; make sure the endpoint listens on the Pod's network interface (typically 0.0.0.0).
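
You can approximate kubelet's check from the node or a debug pod. A sketch, with the Pod IP as a placeholder:

# Same request kubelet would send for the probe above 
curl -v -H "X-Health-Check: kubelet" http://<pod-ip>:8080/health/live 
# Status 200–399 → probe success; anything else, or a timeout → failure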


4.2 TCP Socket (tcpSocket)

kubelet attempts to open a TCP connection to the specified port. If the connection is established, the probe succeeds. The connection is immediately closed — no data is sent or received.

readinessProbe: 
  tcpSocket: 
    port: 5432

Best for: Databases, message brokers, and any service that speaks a binary protocol rather than HTTP. Use this for PostgreSQL, MySQL, Redis, Kafka, and similar workloads where an HTTP endpoint is not available.

Limitation: TCP success only means the port is open and accepting connections. It does not validate that the application is actually processing requests correctly.
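
You can reproduce the check manually with netcat (Pod IP is a placeholder):

# -z: connect without sending data; -v: verbose. Mirrors what the probe does. 
nc -vz <pod-ip> 5432 
# Exit code 0 → connection established → probe would succeed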


4.3 Command Execution (exec)

kubelet executes a command inside the container. Exit code 0 is success. Any non-zero exit code is failure.

livenessProbe: 
  exec: 
    command: 
      - /bin/sh 
      - -c 
      - "pg_isready -U postgres -h localhost"

Best for: Databases and legacy applications without HTTP endpoints, custom health logic that cannot be expressed as a network check, or verifying filesystem state (e.g., checking a PID file exists).

Warning: exec probes spawn a new process inside the container for every check. With a short periodSeconds interval on CPU-constrained containers, this overhead accumulates. Avoid exec probes for high-frequency checks on resource-limited workloads.


4.4 gRPC (grpc)

Uses the standard gRPC Health Checking Protocol. kubelet calls the grpc.health.v1.Health/Check RPC. A SERVING response is success.

livenessProbe: 
  grpc: 
    port: 50051 
    service: "myapp.Service"

Best for: gRPC-native microservices. Requires the application to implement the gRPC health protocol, which most major gRPC frameworks support natively.

Note: gRPC probes require Kubernetes 1.24+, where the GRPCContainerProbe feature gate is enabled by default; the feature graduated to stable in 1.27.
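
To exercise the same RPC by hand, the grpcurl CLI (a separate tool, not part of kubectl) can call the health service directly. Pod IP is a placeholder:

grpcurl -plaintext -d '{"service": "myapp.Service"}' \ 
  <pod-ip>:50051 grpc.health.v1.Health/Check 
# Expected response: {"status": "SERVING"}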


Probe Method Comparison

| Method | Protocol | Use Case | Validates App Logic | Overhead |
|--------|----------|----------|---------------------|----------|
| httpGet | HTTP/HTTPS | Web services, REST APIs | Yes (if endpoint is meaningful) | Low |
| tcpSocket | TCP | Databases, binary protocols | No (port open only) | Very low |
| exec | Process exec | Legacy apps, custom checks | Yes (if command is meaningful) | Medium |
| grpc | gRPC | gRPC microservices | Yes | Low |


5. Probe Parameters and Configuration Options

Core Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| initialDelaySeconds | 0 | Seconds to wait after container start before the first probe. |
| periodSeconds | 10 | How often (in seconds) to run the probe. |
| timeoutSeconds | 1 | Seconds after which the probe times out; a timeout counts as a failure. |
| successThreshold | 1 | Minimum consecutive successes to consider the probe passing. Must be 1 for liveness and startup probes. |
| failureThreshold | 3 | Consecutive failures before action is taken (restart or endpoint removal). |
| terminationGracePeriodSeconds | Pod-level default (30) | Probe-level override for the grace period on liveness failure. |

How Thresholds Work Together

The total time before Kubernetes acts on a failing liveness probe:

initialDelaySeconds + (failureThreshold × periodSeconds)

Example with defaults (initialDelaySeconds: 0, failureThreshold: 3, periodSeconds: 10):

0 + (3 × 10) = 30 seconds before container restart

httpGet Fields

| Field | Required | Description |
|-------|----------|-------------|
| path | Yes | URL path to request (e.g., /health) |
| port | Yes | Port number or named port |
| scheme | No (default: HTTP) | HTTP or HTTPS |
| host | No (default: Pod IP) | Override hostname for the request |
| httpHeaders | No | Custom headers as a list of {name, value} pairs |

tcpSocket Fields

| Field | Required | Description |
|-------|----------|-------------|
| port | Yes | Port number or named port to dial |
| host | No (default: Pod IP) | Override host address |

exec Fields

| Field | Required | Description |
|-------|----------|-------------|
| command | Yes | Command and args as a string array. Not run in a shell; wrap in /bin/sh -c if shell features are needed. |

grpc Fields

| Field | Required | Description |
|-------|----------|-------------|
| port | Yes | Port number for the gRPC server |
| service | No | Service name to pass to the Health/Check RPC |

Parameter Configuration Example

livenessProbe: 
  httpGet: 
    path: /health/live 
    port: 8080 
  initialDelaySeconds: 15    # Wait 15s after container start 
  periodSeconds: 20          # Check every 20s 
  timeoutSeconds: 5          # Fail if no response in 5s 
  failureThreshold: 3        # Restart after 3 consecutive failures 
  successThreshold: 1        # 1 success to consider healthy

readinessProbe: 
  httpGet: 
    path: /health/ready 
    port: 8080 
  initialDelaySeconds: 5 
  periodSeconds: 10 
  timeoutSeconds: 3 
  failureThreshold: 3 
  successThreshold: 2        # Require 2 consecutive successes before re-adding to endpoints


6. Startup Sequence of a Kubernetes Container

Understanding the order of operations is critical for correct probe configuration. Many production issues stem from engineers assuming all three probes run simultaneously from container start.

Container Created 
       │ 
       ▼ 
┌─────────────────────────────────────────────────────────────┐ 
│  PHASE 1: STARTUP                                           │ 
│                                                             │ 
│  startupProbe runs (if configured)                          │ 
│  livenessProbe  ──── SUSPENDED                              │ 
│  readinessProbe ──── SUSPENDED                              │ 
│                                                             │ 
│  ┌─────────────────────────────────────────────────────┐   │ 
│  │ startupProbe polls every periodSeconds              │   │ 
│  │                                                     │   │ 
│  │   Failure × failureThreshold ──▶ Container RESTART  │   │ 
│  │   First Success ──────────────▶ Phase 2 begins      │   │ 
│  └─────────────────────────────────────────────────────┘   │ 
└─────────────────────────────────────────────────────────────┘ 
       │ 
       ▼ (startupProbe passes or not configured) 
┌─────────────────────────────────────────────────────────────┐ 
│  PHASE 2: RUNNING (both probes active simultaneously)       │ 
│                                                             │ 
│  readinessProbe ──── polls every periodSeconds              │ 
│  │                                                          │ 
│  ├── Failure: Pod IP removed from Service endpoints         │ 
│  │           Pod stays running, no restart                  │ 
│  └── Success: Pod IP added to Service endpoints             │ 
│               Pod receives traffic                          │ 
│                                                             │ 
│  livenessProbe ──── polls every periodSeconds               │ 
│  │                                                          │ 
│  ├── Failure × failureThreshold: Container RESTARTED        │ 
│  └── Success: No action                                     │ 
└─────────────────────────────────────────────────────────────┘ 
       │ 
       ▼ (livenessProbe triggers restart) 
Container Terminated → terminationGracePeriodSeconds → New Container 
       │ 
       └──▶ Sequence repeats from Phase 1

Key Timing Insight

Without startupProbe, a slow-starting application faces this dangerous window:

t=0   Container starts, JVM begins loading 
t=10  livenessProbe fires (first check) → app not ready → FAILURE 1 
t=20  livenessProbe fires → app still loading → FAILURE 2 
t=30  livenessProbe fires → app still loading → FAILURE 3 → RESTART 
t=0   Container restarts. Loop repeats forever. CrashLoopBackOff.

With startupProbe (failureThreshold: 30, periodSeconds: 10 = 300s budget):

t=0    Container starts, JVM begins loading 
t=10   startupProbe fires → not ready → failure 1 of 30 (no restart) 
... 
t=120  startupProbe fires → app ready → SUCCESS → startup complete 
t=130  livenessProbe and readinessProbe begin

7. Interaction with Kubernetes Networking

The readinessProbe has a direct and immediate effect on traffic routing through Kubernetes Services.

When a Service selects Pods by label, it maintains an Endpoints object (or EndpointSlice in modern clusters) listing the IP addresses of all Pods currently eligible to receive traffic. The endpoint controller watches Pod Ready conditions and updates this list continuously.

┌───────────────────────────────────────────────────────────────┐ 
│                    TRAFFIC ROUTING FLOW                       │ 
│                                                               │ 
│  Client Request                                               │ 
│       │                                                       │ 
│       ▼                                                       │ 
│  ┌──────────┐     selector: app=myapp                        │ 
│  │ Service  │ ──────────────────────────────────────────┐    │ 
│  └──────────┘                                           │    │ 
│                                                         ▼    │ 
│  ┌──────────────────────────────────────────────────────────┐│ 
│  │  EndpointSlice                                           ││ 
│  │  addresses: [10.0.0.1, 10.0.0.3]  ← only Ready Pods    ││ 
│  │  NOT included: 10.0.0.2  ← readinessProbe failing       ││ 
│  └──────────────────────────────────────────────────────────┘│ 
│                                                               │ 
│   Pod 10.0.0.1   Ready: true   ← receives traffic          │ 
│   Pod 10.0.0.2   Ready: false  ← excluded from endpoints   │ 
│   Pod 10.0.0.3   Ready: true   ← receives traffic          │ 
└───────────────────────────────────────────────────────────────┘

During Rolling Updates

This mechanism is what makes zero-downtime rolling updates possible:

  1. New Pod starts → readinessProbe not yet passing → Pod excluded from endpoints
  2. New Pod passes readiness → added to endpoints → traffic begins routing to it
  3. Old Pod receives SIGTERM → readinessProbe can immediately fail → removed from endpoints
  4. Old Pod drains in-flight requests during terminationGracePeriodSeconds
  5. Old Pod process exits

Because a new Pod is added to the endpoints (step 2) before an old Pod is removed from them (step 3), traffic is always served by at least one ready Pod.
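
That overlap can be made explicit in the Deployment strategy. A minimal sketch that never drops below the desired replica count:

strategy: 
  type: RollingUpdate 
  rollingUpdate: 
    maxSurge: 1           # start one extra Pod before removing an old one 
    maxUnavailable: 0     # never go below the desired replica count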

Readiness Gates

For advanced use cases, ReadinessGates allow external systems to contribute to a Pod's readiness condition. A service mesh or custom controller can set a condition that Kubernetes includes in the overall readiness evaluation — useful for ensuring a sidecar proxy is fully initialized before the Pod receives traffic.
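
A minimal sketch, assuming an external controller that sets a custom condition (the conditionType shown is hypothetical):

spec: 
  readinessGates: 
    - conditionType: "example.com/sidecar-ready"   # hypothetical; patched onto Pod status by a controller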


8. Common Production Problems

Problem 1: CrashLoopBackOff from Aggressive livenessProbe

Scenario: A Spring Boot application with a 45-second startup time. livenessProbe configured with initialDelaySeconds: 10, failureThreshold: 3, periodSeconds: 10.

t=10  liveness fires → app loading → FAIL 1 
t=20  liveness fires → app loading → FAIL 2 
t=30  liveness fires → app loading → FAIL 3 → RESTART 
t=0   Container restarts → repeat → CrashLoopBackOff

Fix: Add startupProbe with sufficient budget to cover worst-case startup time.

startupProbe: 
  httpGet: 
    path: /actuator/health 
    port: 8080 
  failureThreshold: 30   # 30 × 10s = 5 minutes budget 
  periodSeconds: 10

Problem 2: Traffic Sent to Unready Pods

Scenario: No readinessProbe configured. Rolling update deploys new Pods. New Pods are added to Service endpoints immediately on container start, before the application finishes initializing. Clients receive connection refused or 503 errors for 20–40 seconds.

Fix: Always configure readinessProbe. Separate it from livenessProbe — use a dedicated /health/ready endpoint that checks downstream dependencies.
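
A minimal sketch of that fix, assuming the app serves a dedicated readiness endpoint on port 8080:

readinessProbe: 
  httpGet: 
    path: /health/ready    # assumed dedicated endpoint, separate from liveness 
    port: 8080 
  periodSeconds: 10 
  failureThreshold: 3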


Problem 3: readinessProbe Takes Down Healthy Pods

Scenario: readinessProbe calls an endpoint that checks a third-party payment API. The payment API has a 5-minute outage. All Pods fail readiness and are removed from endpoints. The application is completely unavailable even though it could serve non-payment requests.

Fix: Readiness probes should check local application health, not external dependency health. External dependency checks belong in application-level circuit breakers, not Kubernetes probes.


Problem 4: Network Delays Causing Probe Flapping

Scenario: timeoutSeconds: 1 on a probe calling an endpoint that occasionally takes 1.2 seconds under load. Probes intermittently fail and succeed, causing Pods to flap in and out of Service endpoints. Clients experience intermittent errors.

Fix: Set timeoutSeconds to a realistic value based on observed p99 response times at peak load. Use successThreshold: 2 on readinessProbe to require consistent success before re-adding to endpoints.
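
A sketch of that tuning (values are illustrative; derive them from your own measurements):

readinessProbe: 
  httpGet: 
    path: /health/ready 
    port: 8080 
  timeoutSeconds: 3        # comfortably above the observed ~1.2s tail latency 
  periodSeconds: 10 
  failureThreshold: 3      # tolerate brief blips before removal from endpoints 
  successThreshold: 2      # require stable recovery before re-adding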


Problem 5: Misconfigured Port in Probe

Scenario: Application listens on port 8080 for application traffic, port 9090 for metrics and health. livenessProbe configured with port: 8080 pointing at the app port, but the /health path is only served on 9090.

Symptom: Probe always returns 404. Container restarts continuously.

Fix: Always verify probe port and path match exactly. Use named ports to avoid numeric port confusion:

ports: 
  - name: http 
    containerPort: 8080 
  - name: health 
    containerPort: 9090

livenessProbe: 
  httpGet: 
    path: /health/live 
    port: health   # uses named port, less error-prone


9. Debugging Health Checks

Identify Probe Failures

# Show probe configuration and recent events 
kubectl describe pod <pod-name> 

# Look for entries like these in the Events section:
# Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 503
# Warning  Unhealthy  Readiness probe failed: dial tcp: connection refused

Check Pod Conditions

kubectl get pod <pod-name> -o jsonpath='{.status.conditions}' | jq . 
# Look for: 
# type: Ready, status: "False", reason: ContainersNotReady 
# type: ContainersReady, status: "False"

Watch Events in Real Time

kubectl get events --sort-by='.metadata.creationTimestamp' -n <namespace> -w 
# Filter for probe-related: 
kubectl get events -n <namespace> | grep -i "unhealthy\|probe"

Check Container Logs Around Restart Time

# Logs from the previous (crashed) container instance 
kubectl logs <pod-name> --previous

# Logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name> --previous

Verify Endpoint Membership

# Check if pod is in service endpoints 
kubectl get endpoints <service-name> 
kubectl describe endpoints <service-name>

# Check EndpointSlices (modern clusters)
kubectl get endpointslices -l kubernetes.io/service-name=<service-name>

Monitor Rollout Health

kubectl rollout status deployment/<deployment-name> 
# "Waiting for deployment rollout to finish: 1 out of 3 new replicas have been updated" 
# If stuck: new pods are failing readiness

kubectl get rs -l app=<app-label>
# If new RS shows READY < DESIRED, readiness probe is blocking progression

Manual Probe Testing

Test your probe endpoints directly from inside the cluster to rule out networking issues:

# Exec into a pod and test the health endpoint manually 
kubectl exec -it <pod-name> -- curl -v http://localhost:8080/health/ready

# Or use a debug pod
kubectl run debug --image=curlimages/curl -it --rm -- \
  curl http://<pod-ip>:8080/health/ready


10. Best Practices for Production Systems

Use Three Separate Endpoints

Do not route all probes to the same URL. Each probe has a different semantic purpose and should query different aspects of application state:

GET /health/startup  → Is initialization complete? 
GET /health/live     → Is the process alive and not deadlocked? 
GET /health/ready    → Is the app ready to handle requests?

Always Use startupProbe for JVM, Python, and Large Runtimes

JVM warmup, Python import chains, and applications loading large ML models all have startup times incompatible with aggressive liveness timeouts. A 300-second startup budget via startupProbe with failureThreshold: 30, periodSeconds: 10 is safer and more explicit than inflating initialDelaySeconds on livenessProbe.

Set timeoutSeconds Based on Observed Latency

timeoutSeconds: 1 (the default) is dangerously low for any endpoint that touches a database or makes a downstream call. Measure your health endpoint's p99 latency under load and set timeoutSeconds to at least 2× that value.
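
One rough way to sample that latency from inside the Pod (pod name, port, and path are placeholders; assumes curl is available in the image):

# Time 100 requests against the health endpoint; the last values printed are the slowest 
kubectl exec -it <pod-name> -- /bin/sh -c \ 
  'for i in $(seq 1 100); do curl -s -o /dev/null -w "%{time_total}\n" http://localhost:8080/health/ready; done' | sort -n | tail -5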

Readiness ≠ Liveness

Never use the same logic for both probes. A Pod that is temporarily not ready (waiting for a cache warm, holding a maintenance mode flag) should not be restarted. A Pod that has not been ready for 10 minutes probably should be restarted. These are different conditions requiring different probes.

PodDisruptionBudget + readinessProbe Together

A PodDisruptionBudget protects against too many Pods being taken down at once during voluntary disruptions (node drains, upgrades). The readinessProbe dynamically removes unavailable Pods from traffic. Together, they keep capacity above a safe floor while ensuring the traffic that remains reaches only healthy Pods.
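
A minimal PodDisruptionBudget sketch (name, label, and threshold are illustrative):

apiVersion: policy/v1 
kind: PodDisruptionBudget 
metadata: 
  name: api-service-pdb 
spec: 
  minAvailable: 2          # keep at least 2 Ready Pods through voluntary disruptions 
  selector: 
    matchLabels: 
      app: api-service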

Monitor Probe Metrics

kube-state-metrics exposes kube_pod_container_status_ready and probe-related metrics. Set alerts for:

| Metric / Condition | Alert Threshold |
|--------------------|-----------------|
| Pod Ready: false for > 5 minutes | Immediate page |
| Container restart count increase | > 2 restarts in 10 minutes |
| Probe failure events in namespace | > 5 per minute |
| Endpoint count drops below minimum | Below minAvailable in PDB |
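
As a sketch, the first two alerts could be expressed as Prometheus rules (assuming kube-state-metrics is scraped; thresholds are illustrative):

groups: 
  - name: probe-health 
    rules: 
      - alert: PodNotReady 
        expr: kube_pod_container_status_ready == 0   # container not Ready 
        for: 5m 
      - alert: ContainerRestarting 
        expr: increase(kube_pod_container_status_restarts_total[10m]) > 2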


11. Real Configuration Examples

Web Service (REST API)

apiVersion: apps/v1 
kind: Deployment 
metadata: 
  name: api-service 
spec: 
  template: 
    spec: 
      containers: 
        - name: api 
          image: registry.example.com/api:v1.2.0 
          ports: 
            - name: http 
              containerPort: 8080 
            - name: management 
              containerPort: 9090 
          startupProbe: 
            httpGet: 
              path: /actuator/health/liveness 
              port: management 
            failureThreshold: 20 
            periodSeconds: 5         # 100 second startup budget 
          livenessProbe: 
            httpGet: 
              path: /actuator/health/liveness 
              port: management 
            initialDelaySeconds: 0   # startupProbe handles delay 
            periodSeconds: 20 
            timeoutSeconds: 5 
            failureThreshold: 3 
          readinessProbe: 
            httpGet: 
              path: /actuator/health/readiness 
              port: management 
            initialDelaySeconds: 0 
            periodSeconds: 10 
            timeoutSeconds: 3 
            failureThreshold: 3 
            successThreshold: 2

PostgreSQL Database

containers: 
  - name: postgres 
    image: postgres:15 
    ports: 
      - containerPort: 5432 
    startupProbe: 
      exec: 
        command: 
          - /bin/sh 
          - -c 
          - "pg_isready -U $POSTGRES_USER -d $POSTGRES_DB" 
      failureThreshold: 30 
      periodSeconds: 5          # 150 second startup budget 
    livenessProbe: 
      exec: 
        command: 
          - /bin/sh 
          - -c 
          - "pg_isready -U $POSTGRES_USER" 
      periodSeconds: 30 
      timeoutSeconds: 5 
      failureThreshold: 3 
    readinessProbe: 
      exec: 
        command: 
          - /bin/sh 
          - -c 
          - "pg_isready -U $POSTGRES_USER -d $POSTGRES_DB" 
      periodSeconds: 10 
      timeoutSeconds: 5 
      failureThreshold: 3 
      successThreshold: 1

gRPC Microservice

containers: 
  - name: grpc-service 
    image: registry.example.com/grpc-service:v2.0.0 
    ports: 
      - name: grpc 
        containerPort: 50051 
    startupProbe: 
      grpc: 
        port: 50051 
        service: "grpc.health.v1.Health" 
      failureThreshold: 15 
      periodSeconds: 5 
    livenessProbe: 
      grpc: 
        port: 50051 
        service: "grpc.health.v1.Health" 
      periodSeconds: 20 
      timeoutSeconds: 5 
      failureThreshold: 3 
    readinessProbe: 
      grpc: 
        port: 50051 
        service: "myapp.PaymentService"  # service-specific check 
      periodSeconds: 10 
      timeoutSeconds: 3 
      failureThreshold: 3 
      successThreshold: 1

Slow-Starting Application (ML Model Server)

containers: 
  - name: model-server 
    image: registry.example.com/model-server:v1.0.0 
    ports: 
      - containerPort: 8501 
    startupProbe: 
      httpGet: 
        path: /v1/models/mymodel     # model must be loaded 
        port: 8501 
      failureThreshold: 60 
      periodSeconds: 10              # 600 second (10 min) startup budget 
      timeoutSeconds: 10 
    livenessProbe: 
      tcpSocket: 
        port: 8501                   # just verify process is alive 
      periodSeconds: 30 
      timeoutSeconds: 5 
      failureThreshold: 3 
    readinessProbe: 
      httpGet: 
        path: /v1/models/mymodel:predict 
        port: 8501 
      periodSeconds: 15 
      timeoutSeconds: 10 
      failureThreshold: 2 
      successThreshold: 1

Redis Cache

containers: 
  - name: redis 
    image: redis:7-alpine 
    ports: 
      - containerPort: 6379 
    livenessProbe: 
      exec: 
        command: 
          - redis-cli 
          - ping 
      periodSeconds: 10 
      timeoutSeconds: 3 
      failureThreshold: 3 
    readinessProbe: 
      exec: 
        command: 
          - redis-cli 
          - ping 
      periodSeconds: 5 
      timeoutSeconds: 2 
      failureThreshold: 3 
      successThreshold: 1

12. Conclusion

Kubernetes health probes are not configuration boilerplate — they are the nervous system of your cluster’s self-healing capability. Every meaningful Kubernetes behavior that engineers rely on in production — zero-downtime rollouts, automatic restarts, traffic routing, safe node drains — depends on probes being configured correctly.

The three probes serve three fundamentally different purposes. startupProbe buys slow applications the time they need to initialize without triggering false positive restarts. livenessProbe detects unrecoverable application states and triggers container restarts. readinessProbe dynamically manages traffic routing based on real-time application availability. Conflating their roles or omitting any of them creates reliability gaps that manifest as CrashLoopBackOff spirals, traffic sent to unready Pods, and deployment rollouts that block or cascade incorrectly.

The engineers who get this right share a common habit: they treat probe configuration as application-specific, not generic. They measure actual startup times and set budgets accordingly. They build dedicated health endpoints with clear semantics. They test probes under realistic load before deploying to production. They alert on probe failure rates and Pod readiness conditions as first-class signals.

Configure probes thoughtfully, and Kubernetes delivers on its self-healing promise. Treat them as an afterthought, and you will configure them in production at 3 AM under pressure — which is the worst possible time to learn how they work.


Quick Reference: Probe Configuration Checklist

□ startupProbe configured for any app with startup time > 30s 
□ livenessProbe checks process health only (no external dependencies) 
□ readinessProbe checks local app readiness only (no external dependencies) 
□ Separate health endpoints for liveness vs readiness 
□ timeoutSeconds > p99 latency of health endpoint under load 
□ failureThreshold × periodSeconds > acceptable flap window 
□ successThreshold ≥ 2 on readinessProbe for stability 
□ Named ports used in probe configuration 
□ Probe endpoints tested manually before production deployment 
□ Probe failure alerts configured in monitoring

Article maintained at doc.thedevops.dev | Last updated: March 2026